ast_toolbox.algos.backward_algorithm module

Backward Algorithm from Salimans and Chen.

class ast_toolbox.algos.backward_algorithm.BackwardAlgorithm(env, policy, expert_trajectory, epochs_per_step=10, max_epochs=None, skip_until_step=0, max_path_length=500, **kwargs)[source]

Bases: garage.tf.algos.ppo.PPO

Backward Algorithm from Salimans and Chen [1].

Parameters:
  • env (ast_toolbox.envs.go_explore_ast_env.GoExploreASTEnv) – The environment.

  • policy (garage.tf.policies.Policy) – The policy.

  • expert_trajectory (array_like[dict]) – The expert trajectory: a 1-D array_like in chronological order, where each element represents one timestep of the trajectory. Each element is a dictionary with the following keys (a sketch of this format appears after the references below):

    • state: The simulator state at that timestep (pre-action).
    • reward: The reward at that timestep (post-action).
    • observation: The simulation observation at that timestep (post-action).
    • action: The action taken at that timestep.
  • epochs_per_step (int, optional) – Maximum number of epochs to run per step of the trajectory.

  • max_epochs (int, optional) – Maximum number of total epochs to run. If not set, defaults to epochs_per_step times the number of steps in the expert_trajectory.

  • skip_until_step (int, optional) – Skip training for a certain number of steps at the start, counted backwards from the end of the trajectory. For example, if this is set to 3 for an expert_trajectory of length 10, training will start from step 7.

  • max_path_length (int, optional) – Maximum length of a single rollout.

  • kwargs – Keyword arguments passed to garage.tf.algos.PPO.

References

[1] Salimans, Tim, and Richard Chen. “Learning Montezuma’s Revenge from a Single Demonstration.” arXiv preprint arXiv:1812.03381 (2018). https://arxiv.org/abs/1812.03381
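
Below is a minimal sketch of the expert_trajectory format and a constructor call. Only the documented parameter names and dictionary keys come from this page; the array shapes, the 10-step length, and the pre-existing env and policy objects are placeholder assumptions:

    # Sketch only: array shapes and trajectory length are placeholders, and
    # `env` / `policy` are assumed to have been built with the usual garage setup.
    import numpy as np

    from ast_toolbox.algos.backward_algorithm import BackwardAlgorithm

    # One dict per timestep, in chronological order.
    expert_trajectory = [
        {
            'state': np.zeros(5),        # simulator state at this timestep (pre-action)
            'reward': -1.0,              # reward at this timestep (post-action)
            'observation': np.zeros(3),  # observation at this timestep (post-action)
            'action': np.zeros(2),       # action taken at this timestep
        }
        for _ in range(10)               # placeholder: a 10-step expert trajectory
    ]

    algo = BackwardAlgorithm(
        env=env,                         # an ast_toolbox GoExploreASTEnv
        policy=policy,                   # a garage.tf.policies.Policy
        expert_trajectory=expert_trajectory,
        epochs_per_step=10,
        skip_until_step=0,
        max_path_length=500,
        # ...plus any garage PPO keyword arguments (e.g. a baseline), omitted here.
    )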
get_next_epoch(runner)[source]

Wrapper around garage’s runner.step_epochs() generator that handles initialization to the correct expert-trajectory state.

Parameters:

runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Yields:
  • runner.step_itr (int) – The current epoch number.
  • runner.obtain_samples(runner.step_itr) (list[dict]) – A list of sampled rollouts for the current epoch.
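
As a rough illustration (not the actual implementation), the generator can be consumed as a pair of values per epoch; runner and algo are assumed to already be set up:

    # Sketch: `runner` is a set-up garage LocalRunner and `algo` is this BackwardAlgorithm.
    for itr, paths in algo.get_next_epoch(runner):
        # itr is the current epoch number (runner.step_itr); paths is the list of
        # rollouts sampled for that epoch, starting from the current expert-trajectory step.
        print(itr, len(paths))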
set_env_to_expert_trajectory_step()[source]

Updates the algorithm to use the data from expert_trajectory up to the current step.

train(runner)[source]

Obtain samples and run training for each epoch.

Parameters:

runner (garage.experiment.LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Returns:

full_paths (array_like) – A list of the path data from each epoch.
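
A hedged sketch of the usual garage workflow that ends up driving this method; the snapshot configuration, batch size, and epoch count are placeholders, the TF runner variant is an assumption, and algo and env are assumed to exist:

    # Sketch assuming the standard garage TF runner setup.
    from garage.experiment import LocalTFRunner

    with LocalTFRunner(snapshot_config=snapshot_config) as runner:
        runner.setup(algo=algo, env=env)
        # runner.train() calls algo.train(runner) internally, which walks backward
        # through the expert trajectory, one group of epochs per step.
        runner.train(n_epochs=100, batch_size=4000)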
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:

paths (list[dict]) – A list of processed paths.
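
For a single optimization step outside the full train() loop, the call pattern might look like the sketch below; runner and algo are assumed to be set up already:

    # Sketch: collect one batch of rollouts and run one PPO update on it.
    itr = runner.step_itr
    paths = runner.obtain_samples(itr)        # list[dict] of rollouts for this iteration
    processed_paths = algo.train_once(itr, paths)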