ast_toolbox.algos package¶
Algorithms for solving AST formulated RL problems.
-
class
ast_toolbox.algos.
GA
(top_paths=None, n_itr=2, batch_size=500, step_size=0.01, step_size_anneal=1.0, pop_size=5, truncation_size=2, keep_best=1, f_F='mean', log_interval=4000, init_step=1.0, **kwargs)[source]¶ Bases:
garage.tf.algos.batch_polopt.BatchPolopt
Deep Genetic Algorithm from Such et al. [1]_.
Parameters: - top_paths (
ast_toolbox.mcts.BoundedPriorityQueues
, optional) – The bounded priority queue to store top-rewarded trajectories. - step_size (float, optional) – Standard deviation for each mutation.
- step_size_anneal (float, optional) – The linear annealing rate of step_size after each iteration.
- pop_size (int, optional) – The population size
- truncation_size (int, optional) – The number of top-performed individuals that are chosen as parents.
- keep_best (int, optional) – The number of top-performed individuals that remain unchanged for next generation.
- f_F (string, optional) – The method used to calculate fitness: ‘mean’ for the average return, ‘max’ for the max return.
- log_interval (int, optional) – The log interval in terms of environment calls.
- kwargs – Keyword arguments passed to garage.tf.algos.BatchPolopt.
References
[1] Such, Felipe Petroski, et al. “Deep neuroevolution: Genetic algorithms are a competitive - alternative for training deep neural networks for reinforcement learning.”
- arXiv:1712.06567 (2017).
-
extra_recording
(itr)[source]¶ Record extra training statistics per-iteration.
Parameters: itr (int) – The iteration number.
-
get_fitness
(itr, all_paths)[source]¶ Calculate the fitness of the collexted paths.
Parameters: - itr (int) – The iteration number.
- all_paths (list[dict]) – The collected paths from the sampler.
Returns: fitness (list[float]) – The list of fitness of each individual.
-
get_itr_snapshot
(itr, samples_data)[source]¶ Get the snapshot of the current population.
Parameters: - itr (int) – The iteration number.
- samples_data (dict) – The processed data samples.
Returns: snaposhot (dict) – The training snapshot.
-
mutation
(itr, new_seeds, new_magnitudes, all_paths)[source]¶ Generate new random seeds and magnitudes for the next generation.
The first self.keep_best seeds are set to no-mutation value (0).
Parameters: - itr (int) – The iteration number.
- new_seeds (
numpy.ndarry
) – The original seeds. - new_magnitudes (
numpy.ndarry
) – The original magnitudes. - all_paths (list[dict]) – The collected paths from the sampler.
Returns: - new_seeds (
numpy.ndarry
) – The new seeds. - new_magnitudes (
numpy.ndarry
) – The new magnitudes.
-
obtain_samples
(itr, runner)[source]¶ Collect rollout samples using the current policy paramter.
Parameters: - itr (int) – The iteration number.
- runner (
garage.experiment.LocalRunner
) –LocalRunner
is passed to give algorithm the access torunner.obtain_samples()
, which collects rollout paths from the sampler.
Returns: paths (list[dict]) – The collected paths from the sampler.
-
optimize_policy
(itr, all_paths)[source]¶ Update the population represented by self.seeds and self.parents.
Parameters: - itr (int) – The iteration number.
- all_paths (list[dict]) – The collected paths from the sampler.
-
process_samples
(itr, paths)[source]¶ Return processed sample data based on the collected paths.
Parameters: - itr (int) – The iteration number.
- paths (list[dict]) – The collected paths from the sampler.
Returns: samples_data (dict) – Processed sample data with same trajectory length (padded with 0)
-
record_tabular
(itr)[source]¶ Record training performace per-iteration.
Parameters: itr (int) – The iteration number.
-
select_parents
(fitness)[source]¶ Select the individuals to be the parents of the next generation.
Parameters: fitness (list[float]) – The list of fitness of each individual.
-
set_params
(itr, p)[source]¶ Set the current policy paramter to the specified iteration and individual.
Parameters: - itr (int) – The iteration number.
- p (int) – The individual index.
-
train
(runner)[source]¶ Start training.
Parameters: runner ( garage.experiment.LocalRunner
) –LocalRunner
is passed to give algorithm the access torunner.step_epochs()
, which provides services such as snapshotting and sampler control.
- top_paths (
-
class
ast_toolbox.algos.
GASM
(step_size=0.01, **kwargs)[source]¶ Bases:
ast_toolbox.algos.ga.GA
Deep Genetic Algorithm [1]_ with Safe Mutation [2]_.
Parameters: - step_size (float, optional) – The constraint on the KL divergence of each mutation.
- kwargs – Keyword arguments passed to ast_toolbox.algos.ga.GA.
References
[1] Such, Felipe Petroski, et al. “Deep neuroevolution: Genetic algorithms are a competitive alternative for - training deep neural networks for reinforcement learning.”
- arXiv preprint arXiv:1712.06567 (2017).
[2] Lehman, Joel, et al. “Safe mutations for deep and recurrent neural networks through output gradients.” Proceedings of the Genetic and Evolutionary Computation Conference. 2018. -
data2inputs
(samples_data)[source]¶ Transfer the processed data samples to training inputs
Parameters: samples_data (dict) – The processed data samples Returns: all_input_values (tuple) – The input used in training
-
extra_recording
(itr)[source]¶ Record extra training statistics per-iteration.
Parameters: itr (int) – The iteration number.
-
mutation
(itr, new_seeds, new_magnitudes, all_paths)[source]¶ Generate new random seeds and magnitudes for the next generation.
The first self.keep_best seeds are set to no-mutation value (0).
Parameters: - itr (int) – The iteration number.
- new_seeds (
numpy.ndarry
) – The original seeds. - new_magnitudes (
numpy.ndarry
) – The original magnitudes. - all_paths (list[dict]) – The collected paths from the sampler.
Returns: - new_seeds (
numpy.ndarry
) – The new seeds. - new_magnitudes (
numpy.ndarry
) – The new magnitudes.
-
class
ast_toolbox.algos.
MCTS
(env, max_path_length, ec, n_itr, k, alpha, clear_nodes, log_interval, top_paths, log_dir, gamma=1.0, stress_test_mode=2, log_tabular=True, plot_tree=False, plot_path=None, plot_format='png')[source]¶ Bases:
object
Monte Carlo Tress Search (MCTS) with double progressive widening (DPW) [1]_ using the env’s action space as its action space.
Parameters: - env (
ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv
.) – The environment. - max_path_length (int) – The maximum search depth.
- ec (float) – The exploration constant used in UCT equation.
- n_itr (int) – The iteration number, the total numeber of environment call is approximately n_itr*max_path_length*max_path_length.
- k (float) – The constraint parameter used in DPW: |N(s,a)|<=kN(s)^alpha.
- alpha (float) – The constraint parameter used in DPW: |N(s,a)|<=kN(s)^alpha.
- clear_nodes (bool) – Whether to clear redundant nodes in tree. Set it to True for saving memoray. Set it to False to better tree plotting.
- log_interval (int) – The log interval in terms of environment calls.
- top_paths (
ast_toolbox.mcts.BoundedPriorityQueues
, optional) – The bounded priority queue to store top-rewarded trajectories. - gamma (float, optional) – The discount factor.
- stress_test_mode (int, optional) – The mode of the tree search. 1 for single tree. 2 for multiple trees.
- log_tabular (bool, optional) – Whether to log the training statistics into a tabular file.
- plot_tree (bool, optional) – Whether to plot the resulting searching tree.
- plot_path (str, optional) – The storing path for the tree plot.
- plot_format (str, optional) – The storing format for the tree plot
References
[1] Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015. -
train
(runner)[source]¶ Start training.
Parameters: runner ( garage.experiment.LocalRunner
) –LocalRunner
is passed to give algorithm the access torunner.step_epochs()
, which provides services such as snapshotting and sampler control.
- env (
-
class
ast_toolbox.algos.
MCTSBV
(M=10, **kwargs)[source]¶ Bases:
ast_toolbox.algos.mcts.MCTS
Monte Carlo Tress Search (MCTS) with double progressive widening (DPW) [1]_ using Blind Value search from Couetoux et al. [2]_.
Parameters: - M (int, optional) – The number of randon decisions generated for the action pool.
- kwargs – Keyword arguments passed to ast_toolbox.algos.mcts.MCTS.
References
[1] Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015. [2] Couetoux, Adrien, Hassen Doghmen, and Olivier Teytaud. “Improving the exploration in upper confidence trees.” International Conference on Learning and Intelligent Optimization. Springer, Berlin, Heidelberg, 2012.
-
class
ast_toolbox.algos.
MCTSRS
(seed=0, rsg_length=1, **kwargs)[source]¶ Bases:
ast_toolbox.algos.mcts.MCTS
Monte Carlo Tress Search (MCTS) with double progressive widening (DPW) [1]_ using the random seeds as its action space.
Parameters: - seed (int, optional) – The seed used to generate the initial random seed generator.
- rsg_length (int, optional) – The length of the state of the random seed generator. Set it to higher values for extreme large problems.
References
[1] Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015.
-
class
ast_toolbox.algos.
GoExplore
(db_filename, max_db_size, env, env_spec, policy, baseline, save_paths_gap=0, save_paths_path=None, overwrite_db=True, use_score_weight=True, **kwargs)[source]¶ Bases:
garage.tf.algos.batch_polopt.BatchPolopt
Implementation of the Go-Explore[1]_ algorithm that is compatible with AST[2]_. :Parameters: * db_filename (str) – The base path and name for the database files. The CellPool saves a [filename]_pool.dat and a [filename]_meta.dat.
- max_db_size (int) – Maximum allowable size (in GB) of the CellPool database. Algorithm will immediately stop and exit if this size is exceeded.
- env (
ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv
) – The environment. - env_spec (
garage.envs.EnvSpec
) – Environment specification. - policy (
garage.tf.policies.Policy
) – The policy. - baseline (
garage.np.baselines.Baseline
) – The baseline. - save_paths_gap (int, optional) – How many epochs to skip between saving out full paths. Set to 1 to save every epoch. Set to 0 to disable saving.
- save_paths_path (str, optional) – Path to the directory where paths should be saved. Set to None to disable saving.
- overwrite_db (bool, optional) – Indicates if an existing database should be overwritten if found.
- use_score_weight (bool) – Whether or not to scale the cell’s fitness by a function of the cell’s score
- kwargs – Keyword arguments passed to garage.tf.algos.BatchPolopt
References
[1] Ecoffet, Adrien, et al. “Go-explore: a new approach for hard-exploration problems.” arXiv preprint arXiv:1901.10995 (2019). https://arxiv.org/abs/1901.10995 [2] Koren, Mark, and Mykel J. Kochenderfer. “Adaptive Stress Testing without Domain Heuristics using Go-Explore.” arXiv preprint arXiv:2004.04292 (2020). https://arxiv.org/abs/2004.04292 -
downsample
(obs, step=None)[source]¶ Create a downsampled approximation of the observed simulation state.
Parameters: - obs (array_like) – The observed simulation state.
- step (int, optional) – The current iteration number
Returns: array_like – The downsampled approximation of the observed simulation state.
-
get_itr_snapshot
(itr)[source]¶ Returns all the data that should be saved in the snapshot for this iteration.
Parameters: itr (int) – The current epoch number. Returns: dict – A dict containing the current iteration number, the current policy, and the current baseline.
-
init_opt
()[source]¶ Initialize the optimization procedure. If using tensorflow, this may include declaring all the variables and compiling functions
-
optimize_policy
(itr, samples_data)[source]¶ Optimize the policy using the samples.
Parameters: - itr (int) – The current epoch number.
- samples_data (dict) – The data from the sampled rollouts.
-
train
(runner)[source]¶ Obtain samplers and start actual training for each epoch.
Parameters: runner ( garage.experiment.LocalRunner
) –LocalRunner
is passed to give algorithm the access torunner.step_epochs()
, which provides services such as snapshotting and sampler control.Returns: last_return ( ast_toolbox.algos.go_explore.Cell
) – The highest scoring cell found so far
-
train_once
(itr, paths)[source]¶ Perform one step of policy optimization given one batch of samples.
Parameters: - itr (int) – Iteration number.
- paths (list[dict]) – A list of collected paths.
Returns: best_cell (
ast_toolbox.algos.go_explore.Cell
) – The highest scoring cell found so far
-
class
ast_toolbox.algos.
BackwardAlgorithm
(env, policy, expert_trajectory, epochs_per_step=10, max_epochs=None, skip_until_step=0, max_path_length=500, **kwargs)[source]¶ Bases:
garage.tf.algos.ppo.PPO
Backward Algorithm from Salimans and Chen [1]_.
Parameters: env (
ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv
) – The environment.policy (
garage.tf.policies.Policy
) – The policy.expert_trajectory (array_like[dict]) – The expert trajectory, an array_like where each member represents a timestep in a trajectory. The array_like should be 1-D and in chronological order. Each member of the array_like is a dictionary with the following keys:
- state: The simulator state at that timestep (pre-action).
- reward: The reward at that timestep (post-action).
- observation: The simulation observation at that timestep (post-action).
- action: The action taken at that timestep.
epochs_per_step (int, optional) – Maximum number of epochs to run per step of the trajectory.
max_epochs (int, optional) – Maximum number of total epochs to run. If not set, defaults to
epochs_per_step
times the number of steps in theexpert_trajectory
.skip_until_step (int, optional) – Skip training for a certain number of steps at the start, counted backwards from the end of the trajectory. For example, if this is set to 3 for an
expert_trajectory
of length 10, training will start from step 7.max_path_length (int, optional) – Maximum length of a single rollout.
kwargs – Keyword arguments passed to garage.tf.algos.PPO
References
[1] Salimans, Tim, and Richard Chen. “Learning Montezuma’s Revenge from a Single Demonstration.” arXiv preprint arXiv:1812.03381 (2018). https://arxiv.org/abs/1812.03381 -
get_next_epoch
(runner)[source]¶ Wrapper of garage’s
runner.step_epochs()
generator to handle initialization to correct trajectory stateParameters: runner (
garage.experiment.LocalRunner
) –LocalRunner
is passed to give algorithm the access torunner.step_epochs()
, which provides services such as snapshotting and sampler control.Yields: - runner.step_itr (int) – The current epoch number.
- runner.obtain_samples(runner.step_itr) (list[dict]) – A list of sampled rollouts for the current epoch
-
set_env_to_expert_trajectory_step
()[source]¶ Updates the algorithm to use the data from
expert_trajectory
up to the current step.
-
train
(runner)[source]¶ Obtain samplers and start actual training for each epoch.
Parameters: runner ( garage.experiment.LocalRunner
) –LocalRunner
is passed to give algorithm the access torunner.step_epochs()
, which provides services such as snapshotting and sampler control.Returns: full_paths (array_like) – A list of the path data from each epoch.