ast_toolbox.algos package

Algorithms for solving AST formulated RL problems.

class ast_toolbox.algos.GA(top_paths=None, n_itr=2, batch_size=500, step_size=0.01, step_size_anneal=1.0, pop_size=5, truncation_size=2, keep_best=1, f_F='mean', log_interval=4000, init_step=1.0, **kwargs)[source]

Bases: garage.tf.algos.batch_polopt.BatchPolopt

Deep Genetic Algorithm from Such et al. [1].

Parameters:
  • top_paths (ast_toolbox.mcts.BoundedPriorityQueues, optional) – The bounded priority queue to store top-rewarded trajectories.
  • step_size (float, optional) – Standard deviation for each mutation.
  • step_size_anneal (float, optional) – The linear annealing rate of step_size after each iteration.
  • pop_size (int, optional) – The population size.
  • truncation_size (int, optional) – The number of top-performing individuals chosen as parents.
  • keep_best (int, optional) – The number of top-performing individuals that carry over unchanged to the next generation.
  • f_F (string, optional) – The method used to calculate fitness: ‘mean’ for the average return, ‘max’ for the max return.
  • log_interval (int, optional) – The log interval in terms of environment calls.
  • kwargs – Keyword arguments passed to garage.tf.algos.BatchPolopt.

References

[1]Such, Felipe Petroski, et al. “Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning.” arXiv preprint arXiv:1712.06567 (2017).
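
As a quick orientation, the following is a toy sketch of the truncation selection, elitism, and Gaussian mutation that pop_size, truncation_size, keep_best, and step_size control; it is an illustrative stand-in, not the toolbox’s seed-based implementation.

    import numpy as np

    # Toy sketch of one GA generation; not the toolbox's seed-based implementation.
    rng = np.random.default_rng(0)
    pop_size, truncation_size, keep_best, step_size = 5, 2, 1, 0.01

    population = [rng.normal(size=8) for _ in range(pop_size)]  # toy parameter vectors
    fitness = [float(-np.sum(p ** 2)) for p in population]      # toy fitness values

    order = np.argsort(fitness)[::-1]                           # best individuals first
    elites = [population[i] for i in order[:keep_best]]         # carried over unchanged
    parents = [population[i] for i in order[:truncation_size]]  # mutation pool

    children = [
        parents[rng.integers(truncation_size)] + step_size * rng.normal(size=8)
        for _ in range(pop_size - keep_best)
    ]
    next_population = elites + children
    assert len(next_population) == pop_size
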
extra_recording(itr)[source]

Record extra training statistics per-iteration.

Parameters:itr (int) – The iteration number.
get_fitness(itr, all_paths)[source]

Calculate the fitness of the collected paths.

Parameters:
  • itr (int) – The iteration number.
  • all_paths (list[dict]) – The collected paths from the sampler.
Returns:

fitness (list[float]) – The list of fitness of each individual.
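
A minimal sketch of what the f_F option selects between is shown below; it assumes each path dict carries a per-step 'rewards' array (the usual garage sampler convention), which is an assumption here rather than part of the documented interface.

    import numpy as np

    # Hypothetical sketch of the 'mean' vs. 'max' fitness choice; the 'rewards'
    # key is an assumed sampler convention, not documented above.
    def fitness_of_paths(paths, f_F='mean'):
        returns = [np.sum(path['rewards']) for path in paths]  # undiscounted returns
        return np.mean(returns) if f_F == 'mean' else np.max(returns)

    paths = [{'rewards': np.array([0.0, 1.0])}, {'rewards': np.array([2.0, 3.0])}]
    print(fitness_of_paths(paths, 'mean'))  # 3.0
    print(fitness_of_paths(paths, 'max'))   # 5.0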

get_itr_snapshot(itr, samples_data)[source]

Get the snapshot of the current population.

Parameters:
  • itr (int) – The iteration number.
  • samples_data (dict) – The processed data samples.
Returns:

snapshot (dict) – The training snapshot.

init_opt()[source]

Initialize the trainer’s internal TensorFlow operations.

initial()[source]

Initialize the trainer’s internal parameters.

mutation(itr, new_seeds, new_magnitudes, all_paths)[source]

Generate new random seeds and magnitudes for the next generation.

The first self.keep_best seeds are set to the no-mutation value (0).

Parameters:
  • itr (int) – The iteration number.
  • new_seeds (numpy.ndarray) – The original seeds.
  • new_magnitudes (numpy.ndarray) – The original magnitudes.
  • all_paths (list[dict]) – The collected paths from the sampler.
Returns:

  • new_seeds (numpy.ndarray) – The new seeds.
  • new_magnitudes (numpy.ndarray) – The new magnitudes.
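
For context, Such et al. [1] represent an individual compactly as the chain of random seeds (and mutation magnitudes) applied to it; parameters are reconstructed by replaying the seeded Gaussian perturbations, and a seed of 0 marks a no-mutation step as noted above. The following is a conceptual sketch of that encoding, not the toolbox’s exact implementation.

    import numpy as np

    # Conceptual sketch of the Deep GA seed-chain encoding; not the toolbox's
    # exact implementation. A seed of 0 means "no mutation" for elite individuals.
    def reconstruct_params(init_params, seeds, magnitudes):
        params = init_params.copy()
        for seed, magnitude in zip(seeds, magnitudes):
            if seed == 0:
                continue  # elites carry a no-mutation entry
            rng = np.random.default_rng(seed)
            params = params + magnitude * rng.standard_normal(params.shape)
        return params

    init = np.zeros(8)
    parent = reconstruct_params(init, seeds=[3], magnitudes=[0.01])
    elite = reconstruct_params(init, seeds=[3, 0], magnitudes=[0.01, 0.0])
    assert np.allclose(parent, elite)  # the seed-0 entry leaves the elite unchanged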

obtain_samples(itr, runner)[source]

Collect rollout samples using the current policy parameters.

Parameters:
  • itr (int) – The iteration number.
  • runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.obtain_samples(), which collects rollout paths from the sampler.
Returns:

paths (list[dict]) – The collected paths from the sampler.

optimize_policy(itr, all_paths)[source]

Update the population represented by self.seeds and self.parents.

Parameters:
  • itr (int) – The iteration number.
  • all_paths (list[dict]) – The collected paths from the sampler.
process_samples(itr, paths)[source]

Return processed sample data based on the collected paths.

Parameters:
  • itr (int) – The iteration number.
  • paths (list[dict]) – The collected paths from the sampler.
Returns:

samples_data (dict) – Processed sample data, with all trajectories zero-padded to the same length.
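
A hedged sketch of the zero-padding described above; the 'rewards' key is an assumed sampler convention rather than part of the documented return value.

    import numpy as np

    # Hypothetical sketch of padding variable-length trajectories with zeros to a
    # common length; the 'rewards' key is an assumed sampler convention.
    def pad_rewards(paths, max_path_length):
        padded = np.zeros((len(paths), max_path_length))
        for i, path in enumerate(paths):
            rewards = np.asarray(path['rewards'], dtype=float)
            padded[i, :len(rewards)] = rewards
        return padded

    paths = [{'rewards': [1.0, 2.0]}, {'rewards': [3.0]}]
    print(pad_rewards(paths, max_path_length=4))
    # [[1. 2. 0. 0.]
    #  [3. 0. 0. 0.]]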

record_tabular(itr)[source]

Record training performance per iteration.

Parameters:itr (int) – The iteration number.
select_parents(fitness)[source]

Select the individuals to be the parents of the next generation.

Parameters:fitness (list[float]) – The list of fitness of each individual.
set_params(itr, p)[source]

Set the current policy parameters to those of the specified iteration and individual.

Parameters:
  • itr (int) – The iteration number.
  • p (int) – The individual index.
train(runner)[source]

Start training.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
class ast_toolbox.algos.GASM(step_size=0.01, **kwargs)[source]

Bases: ast_toolbox.algos.ga.GA

Deep Genetic Algorithm [1] with Safe Mutation [2].

Parameters:
  • step_size (float, optional) – The constraint on the KL divergence of each mutation.
  • kwargs – Keyword arguments passed to ast_toolbox.algos.ga.GA.

References

[1]Such, Felipe Petroski, et al. “Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning.” arXiv preprint arXiv:1712.06567 (2017).
[2]Lehman, Joel, et al. “Safe mutations for deep and recurrent neural networks through output gradients.” Proceedings of the Genetic and Evolutionary Computation Conference. 2018.
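
For intuition, the sketch below illustrates the output-gradient scaling idea from Lehman et al. [2] on a toy linear policy: each weight’s perturbation is rescaled by how strongly that weight affects the policy output on a batch of states. It is only a sketch of that intuition; the toolbox’s GASM instead enforces the KL-divergence constraint given by step_size.

    import numpy as np

    # Toy sketch of sensitivity-scaled ("safe") mutation on a linear policy y = W @ x;
    # illustrative only, not the toolbox's KL-constrained implementation.
    rng = np.random.default_rng(0)
    obs_dim, act_dim = 4, 2
    W = rng.normal(size=(act_dim, obs_dim))       # toy policy weights
    states = rng.normal(size=(16, obs_dim))       # batch of previously seen states

    # For y = W @ x, the gradient of output a w.r.t. W[a, o] is x[o], so the
    # sensitivity of column o is the RMS of that input component over the batch.
    sensitivity = np.sqrt(np.mean(states ** 2, axis=0))            # shape (obs_dim,)

    step_size = 0.01
    raw = rng.normal(size=W.shape)                                  # unscaled mutation
    W_child = W + step_size * raw / (sensitivity[None, :] + 1e-8)   # per-weight rescaling
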
data2inputs(samples_data)[source]

Transform the processed data samples into training inputs.

Parameters:samples_data (dict) – The processed data samples.
Returns:all_input_values (tuple) – The inputs used in training.
extra_recording(itr)[source]

Record extra training statistics per-iteration.

Parameters:itr (int) – The iteration number.
init_opt()[source]

Initialize the trainer’s internal TensorFlow operations.

mutation(itr, new_seeds, new_magnitudes, all_paths)[source]

Generate new random seeds and magnitudes for the next generation.

The first self.keep_best seeds are set to the no-mutation value (0).

Parameters:
  • itr (int) – The iteration number.
  • new_seeds (numpy.ndarray) – The original seeds.
  • new_magnitudes (numpy.ndarray) – The original magnitudes.
  • all_paths (list[dict]) – The collected paths from the sampler.
Returns:

  • new_seeds (numpy.ndarray) – The new seeds.
  • new_magnitudes (numpy.ndarray) – The new magnitudes.

class ast_toolbox.algos.MCTS(env, max_path_length, ec, n_itr, k, alpha, clear_nodes, log_interval, top_paths, log_dir, gamma=1.0, stress_test_mode=2, log_tabular=True, plot_tree=False, plot_path=None, plot_format='png')[source]

Bases: object

Monte Carlo Tree Search (MCTS) with double progressive widening (DPW) [1] using the env’s action space as its action space.

Parameters:
  • env (ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv) – The environment.
  • max_path_length (int) – The maximum search depth.
  • ec (float) – The exploration constant used in UCT equation.
  • n_itr (int) – The number of iterations; the total number of environment calls is approximately n_itr*max_path_length*max_path_length.
  • k (float) – The constraint parameter used in DPW: |N(s,a)|<=kN(s)^alpha.
  • alpha (float) – The constraint parameter used in DPW: |N(s,a)|<=kN(s)^alpha.
  • clear_nodes (bool) – Whether to clear redundant nodes in the tree. Set it to True to save memory. Set it to False for better tree plotting.
  • log_interval (int) – The log interval in terms of environment calls.
  • top_paths (ast_toolbox.mcts.BoundedPriorityQueues, optional) – The bounded priority queue to store top-rewarded trajectories.
  • gamma (float, optional) – The discount factor.
  • stress_test_mode (int, optional) – The mode of the tree search. 1 for single tree. 2 for multiple trees.
  • log_tabular (bool, optional) – Whether to log the training statistics into a tabular file.
  • plot_tree (bool, optional) – Whether to plot the resulting searching tree.
  • plot_path (str, optional) – The path where the tree plot is stored.
  • plot_format (str, optional) – The file format of the tree plot.

References

[1]Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015.
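
The k and alpha parameters control the double progressive widening rule; a minimal illustration of that rule (not the toolbox’s implementation) is:

    # Illustrative check of the DPW expansion rule |N(s, a)| <= k * N(s)^alpha.
    def can_expand_new_action(num_child_actions, num_state_visits, k, alpha):
        return num_child_actions <= k * num_state_visits ** alpha

    # With k=0.5 and alpha=0.85, a node visited 100 times admits at most ~25
    # child actions before further expansion is blocked.
    print(can_expand_new_action(10, 100, k=0.5, alpha=0.85))   # True
    print(can_expand_new_action(30, 100, k=0.5, alpha=0.85))   # False
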
init()[source]

Initialize the AST internal parameters.

train(runner)[source]

Start training.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
class ast_toolbox.algos.MCTSBV(M=10, **kwargs)[source]

Bases: ast_toolbox.algos.mcts.MCTS

Monte Carlo Tree Search (MCTS) with double progressive widening (DPW) [1] using Blind Value search from Couetoux et al. [2].

Parameters:
  • M (int, optional) – The number of random decisions generated for the action pool.
  • kwargs – Keyword arguments passed to ast_toolbox.algos.mcts.MCTS.

References

[1]Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015.
[2]Couetoux, Adrien, Hassen Doghmen, and Olivier Teytaud. “Improving the exploration in upper confidence trees.” International Conference on Learning and Intelligent Optimization. Springer, Berlin, Heidelberg, 2012.
init()[source]

Initialize the AST internal parameters.

class ast_toolbox.algos.MCTSRS(seed=0, rsg_length=1, **kwargs)[source]

Bases: ast_toolbox.algos.mcts.MCTS

Monte Carlo Tree Search (MCTS) with double progressive widening (DPW) [1] using random seeds as its action space.

Parameters:
  • seed (int, optional) – The seed used to generate the initial random seed generator.
  • rsg_length (int, optional) – The length of the state of the random seed generator. Set it to a higher value for extremely large problems.

References

[1]Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015.
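
A small sketch of the seed-as-action idea: each integer seed deterministically expands into a concrete environment disturbance, so the search tree only needs to branch over seeds. The disturbance shape and distribution below are illustrative assumptions.

    import numpy as np

    # Hypothetical sketch of expanding a seed action into a disturbance vector;
    # the dimension and distribution are assumptions for illustration.
    def seed_to_disturbance(seed, disturbance_dim=3):
        rng = np.random.default_rng(seed)
        return rng.standard_normal(disturbance_dim)

    # The same seed always reproduces the same disturbance, so storing the integer
    # seed in the tree is enough to replay the corresponding transition.
    assert np.allclose(seed_to_disturbance(42), seed_to_disturbance(42))
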
init()[source]

Initialize the AST internal parameters.

class ast_toolbox.algos.GoExplore(db_filename, max_db_size, env, env_spec, policy, baseline, save_paths_gap=0, save_paths_path=None, overwrite_db=True, use_score_weight=True, **kwargs)[source]

Bases: garage.tf.algos.batch_polopt.BatchPolopt

Implementation of the Go-Explore [1] algorithm that is compatible with AST [2].

Parameters:
  • db_filename (str) – The base path and name for the database files. The CellPool saves a [filename]_pool.dat and a [filename]_meta.dat.
  • max_db_size (int) – Maximum allowable size (in GB) of the CellPool database. Algorithm will immediately stop and exit if this size is exceeded.
  • env (ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv) – The environment.
  • env_spec (garage.envs.EnvSpec) – Environment specification.
  • policy (garage.tf.policies.Policy) – The policy.
  • baseline (garage.np.baselines.Baseline) – The baseline.
  • save_paths_gap (int, optional) – How many epochs to skip between saving out full paths. Set to 1 to save every epoch. Set to 0 to disable saving.
  • save_paths_path (str, optional) – Path to the directory where paths should be saved. Set to None to disable saving.
  • overwrite_db (bool, optional) – Indicates if an existing database should be overwritten if found.
  • use_score_weight (bool) – Whether or not to scale the cell’s fitness by a function of the cell’s score.
  • kwargs – Keyword arguments passed to garage.tf.algos.BatchPolopt.

References

[1]Ecoffet, Adrien, et al. “Go-explore: a new approach for hard-exploration problems.” arXiv preprint arXiv:1901.10995 (2019). https://arxiv.org/abs/1901.10995
[2]Koren, Mark, and Mykel J. Kochenderfer. “Adaptive Stress Testing without Domain Heuristics using Go-Explore.” arXiv preprint arXiv:2004.04292 (2020). https://arxiv.org/abs/2004.04292
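
For background, Go-Explore [1] maintains an archive that maps each downsampled state (a “cell”) to the best trajectory that reached it; a minimal sketch of that bookkeeping (not the toolbox’s CellPool database) is:

    # Minimal sketch of the Go-Explore cell archive idea; not the CellPool database.
    archive = {}  # cell_key -> {'score': float, 'trajectory': list}

    def update_archive(cell_key, score, trajectory):
        best = archive.get(cell_key)
        if best is None or score > best['score']:
            archive[cell_key] = {'score': score, 'trajectory': trajectory}

    update_archive((0, 1), score=1.5, trajectory=['a0'])
    update_archive((0, 1), score=0.7, trajectory=['a1'])  # worse entry is ignored
    assert archive[(0, 1)]['score'] == 1.5
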
downsample(obs, step=None)[source]

Create a downsampled approximation of the observed simulation state.

Parameters:
  • obs (array_like) – The observed simulation state.
  • step (int, optional) – The current iteration number.
Returns:

array_like – The downsampled approximation of the observed simulation state.
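
A hedged sketch of one way such a downsampling can work: coarsely discretize the observation into a hashable tuple so that nearby states fall into the same cell. The resolution and rounding scheme are assumptions, not the toolbox’s actual method.

    import numpy as np

    # Hypothetical downsampling by coarse rounding; the resolution is an assumption.
    def downsample_obs(obs, resolution=0.5):
        return tuple(int(v) for v in np.round(np.asarray(obs) / resolution))

    print(downsample_obs([0.26, -1.9, 3.1]))  # (1, -4, 6)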

get_itr_snapshot(itr)[source]

Returns all the data that should be saved in the snapshot for this iteration.

Parameters:itr (int) – The current epoch number.
Returns:dict – A dict containing the current iteration number, the current policy, and the current baseline.
init_opt()[source]

Initialize the optimization procedure. If using TensorFlow, this may include declaring all the variables and compiling functions.

optimize_policy(itr, samples_data)[source]

Optimize the policy using the samples.

Parameters:
  • itr (int) – The current epoch number.
  • samples_data (dict) – The data from the sampled rollouts.
train(runner)[source]

Obtain samplers and start actual training for each epoch.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:last_return (ast_toolbox.algos.go_explore.Cell) – The highest-scoring cell found so far.
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:

best_cell (ast_toolbox.algos.go_explore.Cell) – The highest-scoring cell found so far.

class ast_toolbox.algos.BackwardAlgorithm(env, policy, expert_trajectory, epochs_per_step=10, max_epochs=None, skip_until_step=0, max_path_length=500, **kwargs)[source]

Bases: garage.tf.algos.ppo.PPO

Backward Algorithm from Salimans and Chen [1].

Parameters:
  • env (ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv) – The environment.

  • policy (garage.tf.policies.Policy) – The policy.

  • expert_trajectory (array_like[dict]) – The expert trajectory, an array_like where each member represents a timestep in a trajectory. The array_like should be 1-D and in chronological order. Each member of the array_like is a dictionary with the following keys:

    • state: The simulator state at that timestep (pre-action).
    • reward: The reward at that timestep (post-action).
    • observation: The simulation observation at that timestep (post-action).
    • action: The action taken at that timestep.
  • epochs_per_step (int, optional) – Maximum number of epochs to run per step of the trajectory.

  • max_epochs (int, optional) – Maximum number of total epochs to run. If not set, defaults to epochs_per_step times the number of steps in the expert_trajectory.

  • skip_until_step (int, optional) – Skip training for a certain number of steps at the start, counted backwards from the end of the trajectory. For example, if this is set to 3 for an expert_trajectory of length 10, training will start from step 7.

  • max_path_length (int, optional) – Maximum length of a single rollout.

  • kwargs – Keyword arguments passed to garage.tf.algos.PPO

References

[1]Salimans, Tim, and Richard Chen. “Learning Montezuma’s Revenge from a Single Demonstration.” arXiv preprint arXiv:1812.03381 (2018). https://arxiv.org/abs/1812.03381
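
The expert_trajectory layout documented above can be illustrated as follows; the numeric contents are placeholders rather than data from a real simulator.

    import numpy as np

    # Placeholder expert_trajectory matching the documented per-timestep keys.
    horizon = 10
    expert_trajectory = [
        {
            'state': np.zeros(4),        # simulator state at timestep t (pre-action)
            'reward': 0.0,               # reward at timestep t (post-action)
            'observation': np.zeros(2),  # observation at timestep t (post-action)
            'action': np.zeros(2),       # action taken at timestep t
        }
        for t in range(horizon)
    ]

    # With skip_until_step=3 and this length-10 trajectory, training starts from
    # step 7, i.e. three steps back from the end of the demonstration.
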
get_next_epoch(runner)[source]

Wrapper around garage’s runner.step_epochs() generator that handles initialization to the correct trajectory state.

Parameters:

runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Yields:
  • runner.step_itr (int) – The current epoch number.
  • runner.obtain_samples(runner.step_itr) (list[dict]) – A list of sampled rollouts for the current epoch.
set_env_to_expert_trajectory_step()[source]

Updates the algorithm to use the data from expert_trajectory up to the current step.

train(runner)[source]

Obtain samplers and start actual training for each epoch.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:full_paths (array_like) – A list of the path data from each epoch.
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:

paths (list[dict]) – A list of processed paths.