ast_toolbox.algos package

Algorithms for solving AST formulated RL problems.

class ast_toolbox.algos.GA(top_paths=None, n_itr=2, batch_size=500, step_size=0.01, step_size_anneal=1.0, pop_size=5, truncation_size=2, keep_best=1, f_F='mean', log_interval=4000, init_step=1.0, **kwargs)[source]

Bases: garage.tf.algos.batch_polopt.BatchPolopt

Deep Genetic Algorithm from Such et al. [1].

Parameters:
  • top_paths (ast_toolbox.mcts.BoundedPriorityQueues, optional) – The bounded priority queue to store top-rewarded trajectories.
  • step_size (float, optional) – Standard deviation for each mutation.
  • step_size_anneal (float, optional) – The linear annealing rate of step_size after each iteration.
  • pop_size (int, optional) – The population size.
  • truncation_size (int, optional) – The number of top-performing individuals chosen as parents.
  • keep_best (int, optional) – The number of top-performing individuals that carry over unchanged to the next generation.
  • f_F (string, optional) – The method used to calculate fitness: ‘mean’ for the average return, ‘max’ for the max return.
  • log_interval (int, optional) – The log interval in terms of environment calls.
  • kwargs – Keyword arguments passed to garage.tf.algos.BatchPolopt.

References

[1]Such, Felipe Petroski, et al. “Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning.” arXiv preprint arXiv:1712.06567 (2017).
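
As a quick orientation, the following is a toy sketch of the truncation selection, elitism, and Gaussian mutation that pop_size, truncation_size, keep_best, and step_size control; it is an illustrative stand-in, not the toolbox’s seed-based implementation.

    import numpy as np

    # Toy sketch of one GA generation; not the toolbox's seed-based implementation.
    rng = np.random.default_rng(0)
    pop_size, truncation_size, keep_best, step_size = 5, 2, 1, 0.01

    population = [rng.normal(size=8) for _ in range(pop_size)]  # toy parameter vectors
    fitness = [float(-np.sum(p ** 2)) for p in population]      # toy fitness values

    order = np.argsort(fitness)[::-1]                           # best individuals first
    elites = [population[i] for i in order[:keep_best]]         # carried over unchanged
    parents = [population[i] for i in order[:truncation_size]]  # mutation pool

    children = [
        parents[rng.integers(truncation_size)] + step_size * rng.normal(size=8)
        for _ in range(pop_size - keep_best)
    ]
    next_population = elites + children
    assert len(next_population) == pop_size
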
extra_recording(itr)[source]

Record extra training statistics per-iteration.

Parameters:itr (int) – The iteration number.
get_fitness(itr, all_paths)[source]

Calculate the fitness of the collected paths.

Parameters:
  • itr (int) – The iteration number.
  • all_paths (list[dict]) – The collected paths from the sampler.
Returns:

fitness (list[float]) – The list of fitness of each individual.
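
A minimal sketch of what the f_F option selects between is shown below; it assumes each path dict carries a per-step 'rewards' array (the usual garage sampler convention), which is an assumption here rather than part of the documented interface.

    import numpy as np

    # Hypothetical sketch of the 'mean' vs. 'max' fitness choice; the 'rewards'
    # key is an assumed sampler convention, not documented above.
    def fitness_of_paths(paths, f_F='mean'):
        returns = [np.sum(path['rewards']) for path in paths]  # undiscounted returns
        return np.mean(returns) if f_F == 'mean' else np.max(returns)

    paths = [{'rewards': np.array([0.0, 1.0])}, {'rewards': np.array([2.0, 3.0])}]
    print(fitness_of_paths(paths, 'mean'))  # 3.0
    print(fitness_of_paths(paths, 'max'))   # 5.0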

get_itr_snapshot(itr, samples_data)[source]

Get the snapshot of the current population.

Parameters:
  • itr (int) – The iteration number.
  • samples_data (dict) – The processed data samples.
Returns:

snapshot (dict) – The training snapshot.

init_opt()[source]

Initialize the trainer’s internal TensorFlow operations.

initial()[source]

Initialize the trainer’s internal parameters.

mutation(itr, new_seeds, new_magnitudes, all_paths)[source]

Generate new random seeds and magnitudes for the next generation.

The first self.keep_best seeds are set to the no-mutation value (0).

Parameters:
  • itr (int) – The iteration number.
  • new_seeds (numpy.ndarray) – The original seeds.
  • new_magnitudes (numpy.ndarray) – The original magnitudes.
  • all_paths (list[dict]) – The collected paths from the sampler.
Returns:

  • new_seeds (numpy.ndarray) – The new seeds.
  • new_magnitudes (numpy.ndarray) – The new magnitudes.
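
For context, Such et al. [1] represent an individual compactly as the chain of random seeds (and mutation magnitudes) applied to it; parameters are reconstructed by replaying the seeded Gaussian perturbations, and a seed of 0 marks a no-mutation step as noted above. The following is a conceptual sketch of that encoding, not the toolbox’s exact implementation.

    import numpy as np

    # Conceptual sketch of the Deep GA seed-chain encoding; not the toolbox's
    # exact implementation. A seed of 0 means "no mutation" for elite individuals.
    def reconstruct_params(init_params, seeds, magnitudes):
        params = init_params.copy()
        for seed, magnitude in zip(seeds, magnitudes):
            if seed == 0:
                continue  # elites carry a no-mutation entry
            rng = np.random.default_rng(seed)
            params = params + magnitude * rng.standard_normal(params.shape)
        return params

    init = np.zeros(8)
    parent = reconstruct_params(init, seeds=[3], magnitudes=[0.01])
    elite = reconstruct_params(init, seeds=[3, 0], magnitudes=[0.01, 0.0])
    assert np.allclose(parent, elite)  # the seed-0 entry leaves the elite unchanged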

obtain_samples(itr, runner)[source]

Collect rollout samples using the current policy parameters.

Parameters:
  • itr (int) – The iteration number.
  • runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.obtain_samples(), which collects rollout paths from the sampler.
Returns:

paths (list[dict]) – The collected paths from the sampler.

optimize_policy(itr, all_paths)[source]

Update the population represented by self.seeds and self.parents.

Parameters:
  • itr (int) – The iteration number.
  • all_paths (list[dict]) – The collected paths from the sampler.
process_samples(itr, paths)[source]

Return processed sample data based on the collected paths.

Parameters:
  • itr (int) – The iteration number.
  • paths (list[dict]) – The collected paths from the sampler.
Returns:

samples_data (dict) – Processed sample data, with all trajectories zero-padded to the same length.
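
A hedged sketch of the zero-padding described above; the 'rewards' key is an assumed sampler convention rather than part of the documented return value.

    import numpy as np

    # Hypothetical sketch of padding variable-length trajectories with zeros to a
    # common length; the 'rewards' key is an assumed sampler convention.
    def pad_rewards(paths, max_path_length):
        padded = np.zeros((len(paths), max_path_length))
        for i, path in enumerate(paths):
            rewards = np.asarray(path['rewards'], dtype=float)
            padded[i, :len(rewards)] = rewards
        return padded

    paths = [{'rewards': [1.0, 2.0]}, {'rewards': [3.0]}]
    print(pad_rewards(paths, max_path_length=4))
    # [[1. 2. 0. 0.]
    #  [3. 0. 0. 0.]]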

record_tabular(itr)[source]

Record training performance per iteration.

Parameters:itr (int) – The iteration number.
select_parents(fitness)[source]

Select the individuals to be the parents of the next generation.

Parameters:fitness (list[float]) – The list of fitness of each individual.
set_params(itr, p)[source]

Set the current policy parameters to those of the specified iteration and individual.

Parameters:
  • itr (int) – The iteration number.
  • p (int) – The individual index.
train(runner)[source]

Start training.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
class ast_toolbox.algos.GASM(step_size=0.01, **kwargs)[source]

Bases: ast_toolbox.algos.ga.GA

Deep Genetic Algorithm [1] with Safe Mutation [2].

Parameters:
  • step_size (float, optional) – The constraint on the KL divergence of each mutation.
  • kwargs – Keyword arguments passed to ast_toolbox.algos.ga.GA.

References

[1]Such, Felipe Petroski, et al. “Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning.” arXiv preprint arXiv:1712.06567 (2017).
[2]Lehman, Joel, et al. “Safe mutations for deep and recurrent neural networks through output gradients.” Proceedings of the Genetic and Evolutionary Computation Conference. 2018.
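
For intuition, the sketch below illustrates the output-gradient scaling idea from Lehman et al. [2] on a toy linear policy: each weight’s perturbation is rescaled by how strongly that weight affects the policy output on a batch of states. It is only a sketch of that intuition; the toolbox’s GASM instead enforces the KL-divergence constraint given by step_size.

    import numpy as np

    # Toy sketch of sensitivity-scaled ("safe") mutation on a linear policy y = W @ x;
    # illustrative only, not the toolbox's KL-constrained implementation.
    rng = np.random.default_rng(0)
    obs_dim, act_dim = 4, 2
    W = rng.normal(size=(act_dim, obs_dim))       # toy policy weights
    states = rng.normal(size=(16, obs_dim))       # batch of previously seen states

    # For y = W @ x, the gradient of output a w.r.t. W[a, o] is x[o], so the
    # sensitivity of column o is the RMS of that input component over the batch.
    sensitivity = np.sqrt(np.mean(states ** 2, axis=0))            # shape (obs_dim,)

    step_size = 0.01
    raw = rng.normal(size=W.shape)                                  # unscaled mutation
    W_child = W + step_size * raw / (sensitivity[None, :] + 1e-8)   # per-weight rescaling
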
data2inputs(samples_data)[source]

Transform the processed data samples into training inputs.

Parameters:samples_data (dict) – The processed data samples.
Returns:all_input_values (tuple) – The inputs used in training.
extra_recording(itr)[source]

Record extra training statistics per-iteration.

Parameters:itr (int) – The iteration number.
init_opt()[source]

Initialize the trainer’s internal TensorFlow operations.

mutation(itr, new_seeds, new_magnitudes, all_paths)[source]

Generate new random seeds and magnitudes for the next generation.

The first self.keep_best seeds are set to the no-mutation value (0).

Parameters:
  • itr (int) – The iteration number.
  • new_seeds (numpy.ndarray) – The original seeds.
  • new_magnitudes (numpy.ndarray) – The original magnitudes.
  • all_paths (list[dict]) – The collected paths from the sampler.
Returns:

  • new_seeds (numpy.ndarray) – The new seeds.
  • new_magnitudes (numpy.ndarray) – The new magnitudes.

class ast_toolbox.algos.MCTS(env, max_path_length, ec, n_itr, k, alpha, clear_nodes, log_interval, top_paths, log_dir, gamma=1.0, stress_test_mode=2, log_tabular=True, plot_tree=False, plot_path=None, plot_format='png')[source]

Bases: object

Monte Carlo Tree Search (MCTS) with double progressive widening (DPW) [1] using the env’s action space as its action space.

Parameters:
  • env (ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv) – The environment.
  • max_path_length (int) – The maximum search depth.
  • ec (float) – The exploration constant used in UCT equation.
  • n_itr (int) – The number of iterations; the total number of environment calls is approximately n_itr*max_path_length*max_path_length.
  • k (float) – The constraint parameter used in DPW: |N(s,a)|<=kN(s)^alpha.
  • alpha (float) – The constraint parameter used in DPW: |N(s,a)|<=kN(s)^alpha.
  • clear_nodes (bool) – Whether to clear redundant nodes in the tree. Set it to True to save memory. Set it to False for better tree plotting.
  • log_interval (int) – The log interval in terms of environment calls.
  • top_paths (ast_toolbox.mcts.BoundedPriorityQueues, optional) – The bounded priority queue to store top-rewarded trajectories.
  • gamma (float, optional) – The discount factor.
  • stress_test_mode (int, optional) – The mode of the tree search. 1 for single tree. 2 for multiple trees.
  • log_tabular (bool, optional) – Whether to log the training statistics into a tabular file.
  • plot_tree (bool, optional) – Whether to plot the resulting searching tree.
  • plot_path (str, optional) – The path where the tree plot is stored.
  • plot_format (str, optional) – The file format of the tree plot.

References

[1]Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015.
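
The k and alpha parameters control the double progressive widening rule; a minimal illustration of that rule (not the toolbox’s implementation) is:

    # Illustrative check of the DPW expansion rule |N(s, a)| <= k * N(s)^alpha.
    def can_expand_new_action(num_child_actions, num_state_visits, k, alpha):
        return num_child_actions <= k * num_state_visits ** alpha

    # With k=0.5 and alpha=0.85, a node visited 100 times admits at most ~25
    # child actions before further expansion is blocked.
    print(can_expand_new_action(10, 100, k=0.5, alpha=0.85))   # True
    print(can_expand_new_action(30, 100, k=0.5, alpha=0.85))   # False
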
init()[source]

Initialize the AST internal parameters.

train(runner)[source]

Start training.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
class ast_toolbox.algos.MCTSBV(M=10, **kwargs)[source]

Bases: ast_toolbox.algos.mcts.MCTS

Monte Carlo Tree Search (MCTS) with double progressive widening (DPW) [1] using Blind Value search from Couetoux et al. [2].

Parameters:
  • M (int, optional) – The number of random decisions generated for the action pool.
  • kwargs – Keyword arguments passed to ast_toolbox.algos.mcts.MCTS.

References

[1]Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015.
[2]Couetoux, Adrien, Hassen Doghmen, and Olivier Teytaud. “Improving the exploration in upper confidence trees.” International Conference on Learning and Intelligent Optimization. Springer, Berlin, Heidelberg, 2012.
init()[source]

Initialize the AST internal parameters.

class ast_toolbox.algos.MCTSRS(seed=0, rsg_length=1, **kwargs)[source]

Bases: ast_toolbox.algos.mcts.MCTS

Monte Carlo Tree Search (MCTS) with double progressive widening (DPW) [1] using random seeds as its action space.

Parameters:
  • seed (int, optional) – The seed used to generate the initial random seed generator.
  • rsg_length (int, optional) – The length of the state of the random seed generator. Set it to a higher value for extremely large problems.

References

[1]Lee, Ritchie, et al. “Adaptive stress testing of airborne collision avoidance systems.” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC). IEEE, 2015.
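
A small sketch of the seed-as-action idea: each integer seed deterministically expands into a concrete environment disturbance, so the search tree only needs to branch over seeds. The disturbance shape and distribution below are illustrative assumptions.

    import numpy as np

    # Hypothetical sketch of expanding a seed action into a disturbance vector;
    # the dimension and distribution are assumptions for illustration.
    def seed_to_disturbance(seed, disturbance_dim=3):
        rng = np.random.default_rng(seed)
        return rng.standard_normal(disturbance_dim)

    # The same seed always reproduces the same disturbance, so storing the integer
    # seed in the tree is enough to replay the corresponding transition.
    assert np.allclose(seed_to_disturbance(42), seed_to_disturbance(42))
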
init()[source]

Initialize the AST internal parameters.

class ast_toolbox.algos.GoExplore(db_filename, max_db_size, env, env_spec, policy, baseline, save_paths_gap=0, save_paths_path=None, overwrite_db=True, use_score_weight=True, **kwargs)[source]

Bases: garage.tf.algos.batch_polopt.BatchPolopt

Implementation of the Go-Explore [1] algorithm that is compatible with AST [2].

Parameters:
  • db_filename (str) – The base path and name for the database files. The CellPool saves a [filename]_pool.dat and a [filename]_meta.dat.
  • max_db_size (int) – Maximum allowable size (in GB) of the CellPool database. Algorithm will immediately stop and exit if this size is exceeded.
  • env (ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv) – The environment.
  • env_spec (garage.envs.EnvSpec) – Environment specification.
  • policy (garage.tf.policies.Policy) – The policy.
  • baseline (garage.np.baselines.Baseline) – The baseline.
  • save_paths_gap (int, optional) – How many epochs to skip between saving out full paths. Set to 1 to save every epoch. Set to 0 to disable saving.
  • save_paths_path (str, optional) – Path to the directory where paths should be saved. Set to None to disable saving.
  • overwrite_db (bool, optional) – Indicates if an existing database should be overwritten if found.
  • use_score_weight (bool) – Whether or not to scale the cell’s fitness by a function of the cell’s score.
  • kwargs – Keyword arguments passed to garage.tf.algos.BatchPolopt.

References

[1]Ecoffet, Adrien, et al. “Go-explore: a new approach for hard-exploration problems.” arXiv preprint arXiv:1901.10995 (2019). https://arxiv.org/abs/1901.10995
[2]Koren, Mark, and Mykel J. Kochenderfer. “Adaptive Stress Testing without Domain Heuristics using Go-Explore.” arXiv preprint arXiv:2004.04292 (2020). https://arxiv.org/abs/2004.04292
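
For background, Go-Explore [1] maintains an archive that maps each downsampled state (a “cell”) to the best trajectory that reached it; a minimal sketch of that bookkeeping (not the toolbox’s CellPool database) is:

    # Minimal sketch of the Go-Explore cell archive idea; not the CellPool database.
    archive = {}  # cell_key -> {'score': float, 'trajectory': list}

    def update_archive(cell_key, score, trajectory):
        best = archive.get(cell_key)
        if best is None or score > best['score']:
            archive[cell_key] = {'score': score, 'trajectory': trajectory}

    update_archive((0, 1), score=1.5, trajectory=['a0'])
    update_archive((0, 1), score=0.7, trajectory=['a1'])  # worse entry is ignored
    assert archive[(0, 1)]['score'] == 1.5
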
downsample(obs, step=None)[source]

Create a downsampled approximation of the observed simulation state.

Parameters:
  • obs (array_like) – The observed simulation state.
  • step (int, optional) – The current iteration number.
Returns:

array_like – The downsampled approximation of the observed simulation state.
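
A hedged sketch of one way such a downsampling can work: coarsely discretize the observation into a hashable tuple so that nearby states fall into the same cell. The resolution and rounding scheme are assumptions, not the toolbox’s actual method.

    import numpy as np

    # Hypothetical downsampling by coarse rounding; the resolution is an assumption.
    def downsample_obs(obs, resolution=0.5):
        return tuple(int(v) for v in np.round(np.asarray(obs) / resolution))

    print(downsample_obs([0.26, -1.9, 3.1]))  # (1, -4, 6)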

get_itr_snapshot(itr)[source]

Returns all the data that should be saved in the snapshot for this iteration.

Parameters:itr (int) – The current epoch number.
Returns:dict – A dict containing the current iteration number, the current policy, and the current baseline.
init_opt()[source]

Initialize the optimization procedure. If using TensorFlow, this may include declaring all the variables and compiling functions.

optimize_policy(itr, samples_data)[source]

Optimize the policy using the samples.

Parameters:
  • itr (int) – The current epoch number.
  • samples_data (dict) – The data from the sampled rollouts.
train(runner)[source]

Obtain samplers and start actual training for each epoch.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:last_return (ast_toolbox.algos.go_explore.Cell) – The highest-scoring cell found so far.
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:

best_cell (ast_toolbox.algos.go_explore.Cell) – The highest-scoring cell found so far.

class ast_toolbox.algos.BackwardAlgorithm(env, policy, expert_trajectory, epochs_per_step=10, max_epochs=None, skip_until_step=0, max_path_length=500, **kwargs)[source]

Bases: garage.tf.algos.ppo.PPO

Backward Algorithm from Salimans and Chen [1].

Parameters:
  • env (ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv) – The environment.

  • policy (garage.tf.policies.Policy) – The policy.

  • expert_trajectory (array_like[dict]) – The expert trajectory, an array_like where each member represents a timestep in a trajectory. The array_like should be 1-D and in chronological order. Each member of the array_like is a dictionary with the following keys:

    • state: The simulator state at that timestep (pre-action).
    • reward: The reward at that timestep (post-action).
    • observation: The simulation observation at that timestep (post-action).
    • action: The action taken at that timestep.
  • epochs_per_step (int, optional) – Maximum number of epochs to run per step of the trajectory.

  • max_epochs (int, optional) – Maximum number of total epochs to run. If not set, defaults to epochs_per_step times the number of steps in the expert_trajectory.

  • skip_until_step (int, optional) – Skip training for a certain number of steps at the start, counted backwards from the end of the trajectory. For example, if this is set to 3 for an expert_trajectory of length 10, training will start from step 7.

  • max_path_length (int, optional) – Maximum length of a single rollout.

  • kwargs – Keyword arguments passed to garage.tf.algos.PPO

References

[1]Salimans, Tim, and Richard Chen. “Learning Montezuma’s Revenge from a Single Demonstration.” arXiv preprint arXiv:1812.03381 (2018). https://arxiv.org/abs/1812.03381
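
The expert_trajectory layout documented above can be illustrated as follows; the numeric contents are placeholders rather than data from a real simulator.

    import numpy as np

    # Placeholder expert_trajectory matching the documented per-timestep keys.
    horizon = 10
    expert_trajectory = [
        {
            'state': np.zeros(4),        # simulator state at timestep t (pre-action)
            'reward': 0.0,               # reward at timestep t (post-action)
            'observation': np.zeros(2),  # observation at timestep t (post-action)
            'action': np.zeros(2),       # action taken at timestep t
        }
        for t in range(horizon)
    ]

    # With skip_until_step=3 and this length-10 trajectory, training starts from
    # step 7, i.e. three steps back from the end of the demonstration.
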
get_next_epoch(runner)[source]

Wrapper around garage’s runner.step_epochs() generator that handles initialization to the correct trajectory state.

Parameters:

runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Yields:
  • runner.step_itr (int) – The current epoch number.
  • runner.obtain_samples(runner.step_itr) (list[dict]) – A list of sampled rollouts for the current epoch.
set_env_to_expert_trajectory_step()[source]

Updates the algorithm to use the data from expert_trajectory up to the current step.

train(runner)[source]

Obtain samplers and start actual training for each epoch.

Parameters:runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:full_paths (array_like) – A list of the path data from each epoch.
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:

paths (list[dict]) – A list of processed paths.