
hierarchical-actor-critc-hac-'s Introduction

Hierarchical Actor-Critic (HAC)

This repository contains the code to implement the Hierarchical Actor-Critic (HAC) algorithm. HAC helps agents learn tasks more quickly by enabling them to break problems down into short sequences of actions. For more information on the algorithm, please see our ICLR 2019 paper and blog post.

To run HAC, execute the command "python3 initialize_HAC.py --retrain". By default, this trains a UR5 agent with a 3-level hierarchy to achieve certain poses. The UR5 agent should reach a 90+% success rate in around 350 episodes. The following video shows how a 3-layered agent performed after 450 episodes of training. To watch your trained agent, execute the command "python3 initialize_HAC.py --test --show". To train agents in the inverted pendulum domain, swap the UR5 reacher "design_agent_and_env.py" file for an inverted pendulum "design_agent_and_env.py" file, both of which are located in the "example_designs" folder. To train agents in the ant reacher and ant four rooms environments, execute the command "python3 initialize_HAC.py --retrain" in the appropriate folder within the ant_environments directory. In the near future, the code for the ant domains will be integrated with the code for the other domains.
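For convenience, the same commands are collected below (the inverted pendulum setup is a manual file swap rather than a command-line option):

    # Train the default 3-level UR5 agent from scratch
    python3 initialize_HAC.py --retrain

    # Watch a trained agent
    python3 initialize_HAC.py --test --show

    # Ant Reacher / Ant Four Rooms: run the same training command from the
    # appropriate folder inside the ant_environments directory
    python3 initialize_HAC.py --retrain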

Please note that in order to run this repository, you must have (i) a MuJoCo license, (ii) the required MuJoCo software libraries, and (iii) the MuJoCo Python wrapper from OpenAI.

Happy to answer any questions you have. Please email me at [email protected].

UPDATE LOG

5/20/2020 - Key Changes

  1. Added 2-level ant environments
  2. Centralized exploration hyperparameters for ant environments in design_agent_and_env.py file

2/25/2020 - Key Changes

  1. TensorFlow 2.x Compatible

  2. Fine-tuned exploration parameters of the Ant Reacher environment

10/1/2019 - Key Changes

  1. Added Ant Reacher and Ant Four Rooms Environments

The code for the ant environments has been temporarily added to the ant_environments folder. In the near future, the code for the ant domains will be integrated with the code for the other domains. Only minimal changes to the code are needed to run the ant environments.

10/12/2018 - Key Changes

  1. Bounded Q-Values

The Q-values output by the critic network at each level are now bounded to the interval [-T, 0], where T is the maximum sequence length in which each policy specializes and -T is the subgoal penalty. We use an upper bound of 0 because our code uses a nonpositive reward function, so Q-values should never be positive. However, we noticed that the critic function approximator would sometimes make small mistakes and assign positive Q-values, which occasionally proved harmful to results. In addition, we observed improved results when we used the tighter lower bound of -T (i.e., the subgoal penalty). The improvement may come from the increased flexibility that the bounded Q-values give the critic: the critic can assign a value of -T to any (state, action, goal) tuple in which the action does not bring the agent close to the goal, instead of having to learn the exact value.
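A minimal sketch of one way to enforce such a bound, assuming a TensorFlow critic whose final-layer activation is squashed by a sigmoid; the function and variable names below are illustrative and are not taken from this repository:

    import tensorflow as tf

    # Illustrative only: bound the critic output to (-T, 0) by squashing the
    # final-layer activation with a sigmoid and scaling by -T, where T is the
    # maximum number of actions each level is given to reach its goal.
    def bounded_q_output(final_layer_activation, time_limit_T):
        return -time_limit_T * tf.sigmoid(final_layer_activation)

An equivalent option is to clip the Bellman targets themselves, e.g. tf.clip_by_value(targets, -time_limit_T, 0.0).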

  2. Removed Target Networks

We also noticed improved results when we used the regular Q-networks to determine the Bellman target updates (i.e., reward + Q(next state,pi(next state),goal)) instead of the separate target networks that are used in DDPG. The default setting of our code base thus no longer uses target networks. However, the target networks can be easily activated by making the changes specified in (i) the "learn" method in the "layer.py" file and (ii) the "update" method in the "critic.py" file.
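A hedged sketch of what such a target computation can look like, using the current critic and actor networks rather than target copies; the names below are hypothetical and do not mirror the repository's "layer.py" or "critic.py":

    import tensorflow as tf

    # Illustrative only: Bellman targets computed with the regular networks,
    # i.e., target = reward + Q(next_state, pi(next_state), goal), then
    # clipped to the [-T, 0] range described above. A discount factor or
    # terminal mask can be folded into `rewards` if desired.
    def bellman_targets(critic, actor, rewards, next_states, goals, time_limit_T):
        next_actions = actor(next_states, goals)
        next_q = critic(next_states, next_actions, goals)
        return tf.clip_by_value(rewards + next_q, -time_limit_T, 0.0)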

  3. Centralized Design Template

Users can now configure the agent and environment in a single file, "design_agent_and_env.py". This template file contains most of the significant hyperparameters in HAC. We have removed the command-line options that could change the architecture of the agent's hierarchy.
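For illustration, a centralized design file might look roughly like the sketch below; the parameter names are hypothetical and are not the actual contents of "design_agent_and_env.py":

    # Illustrative sketch only -- keys are hypothetical, not the shipped file.
    def design_agent_and_env(FLAGS):
        agent_params = {
            "num_layers": 3,           # depth of the hierarchy
            "time_scale": 10,          # max actions per level before timeout (T)
            "subgoal_test_perc": 0.3,  # fraction of subgoals tested for penalties
            "atomic_noise": 0.1,       # exploration noise on primitive actions
            "subgoal_noise": 0.1,      # exploration noise on proposed subgoals
        }
        env_params = {
            "env_name": "UR5 reacher",
            "goal_thresholds": None,   # tolerances for end goals and subgoals
        }
        return agent_params, env_params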

  4. Added UR5 Reacher Environment

We have added a new UR5 reacher environment, in which a UR5 agent can learn to achieve various poses. The "ur5.xml" MuJoCo file also contains commented code for a Robotiq gripper if you would like to augment the agent. Additional environments will hopefully be added shortly.


hierarchical-actor-critc-hac-'s Issues

Question regarding hyper parameter tuning for gym environments

Hi, I have been working on my HAC PyTorch implementation (https://github.com/nikhilbarhate99/Hierarchical-Actor-Critic-HAC-PyTorch) for the Mountain Car env, but I found the results to be inconsistent (it learns a good policy ~70% of the time).

Can this be fixed by changing some hyperparameters? I tried reducing the noise, but then it never learns anything useful.

Are there any specific things that I should look for when setting the thresholds? (I am using the same threshold for the goal and subgoals.)

Also, are the thresholds for the 3-level and 2-level policies the same?

Question regarding finalize_goal_replay

Follow-up question (thanks for taking the time to answer, by the way). trans_copy here (see link below) is a copy of all of temp_goal_replay_storage, and we set the new target goal to a goal-state taken from a random transition in that buffer. However, by then iterating over num_trans elements (i.e., all of the elements in the copy) and assigning that goal, which may come from the middle of the buffer, don't we effectively label some of the states that are reached after the new hindsight "goal" as transitions that "lead" to this hindsight goal?

Sorry, that was a bit hard to put into words; please let me know if I lost you with that question.

https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC-/blob/master/layer.py#L229-L247
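For readers following along, here is a heavily simplified sketch of the hindsight-goal relabeling the question refers to; it is not the repository's finalize_goal_replay, and the transition layout and names are invented for illustration:

    import numpy as np

    # Simplified illustration (NOT the actual finalize_goal_replay): choose one
    # transition from the temporary buffer, reuse the goal-state it achieved as
    # the new goal, then relabel every stored transition against that goal.
    # Transitions recorded *after* the chosen index are relabeled as well,
    # which is the behavior the question asks about.
    def relabel_with_hindsight_goal(temp_storage, goal_thresholds):
        # each transition: [state, action, reward, next_state, goal, done, achieved_goal_state]
        trans_copy = [list(t) for t in temp_storage]
        idx = np.random.randint(len(trans_copy))
        new_goal = trans_copy[idx][6]

        for trans in trans_copy:
            trans[4] = new_goal
            reached = np.all(np.abs(trans[6] - new_goal) < goal_thresholds)
            trans[2] = 0.0 if reached else -1.0   # nonpositive reward scheme
            trans[5] = bool(reached)
        return trans_copy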

Clarification question regarding project_state_to_end_goal/subgoal

Hi, I have a quick clarification question regarding project_state_to_end_goal/subgoal.

Currently they are defined to be functions of arguments sim and state.

In

    project_state_to_end_goal = lambda sim, state: np.array([bound_angle(sim.data.qpos[0]), 15 if state[2] > 15 else -15 if state[2] < -15 else state[2]])

the joint velocities are extracted from state, whereas in

    project_state_to_end_goal = lambda sim, state: np.array([bound_angle(sim.data.qpos[i]) for i in range(len(sim.data.qpos))])

the joint velocities are extracted from sim.

I wonder whether this is an intentional design choice and whether project_state_to_end_goal/subgoal can just be a function of state. Why do we need to pass sim into the function?

Thanks a lot! I appreciate your help.

Confusion about the intention

Hi! @andrew-j-levy

I'm new to reinforcement learning and interested in your work.

After reading your article thoroughly, I'm confused about the intention behind solving long-horizon tasks with a goal-conditioned reward scheme.

In my opinion, the goal-conditioned reward can be treated as a sparse reward, which performs poorly in long-horizon tasks.

So why not use a dense reward built from differentiable functions, which can lead the training process to convergence? Some tasks don't require a lot of goals.

I don't know if I'm on the right track, and this may seem like a naive question to you, but I'd like to get a response from you.

Thanks!

Training will continue even when the end goal is achieved?

Thank you so much for sharing your work!

I just have a quick question about the following comment:

# Return to previous level when any higher level goal achieved. NOTE: if not testing and agent achieves end goal, training will continue until out of time (i.e., out of time steps or highest level runs out of attempts). This will allow agent to experience being around the end goal.
if max_lay_achieved is not None and max_lay_achieved >= self.layer_number:

I am wondering why you would want the episode to continue even when the end goal is achieved. Is it because the agent will collect more transitions around the end goal and add them to the experience buffer, and for some reason this helps the training process?

My second question is which part of the code implements this mechanism. I haven't had the chance to run your code extensively, so please correct me if I am wrong, but my understanding is that max_lay_achieved will equal the highest layer (i.e., self.layer_number) if the end goal is achieved, so return_to_higher_level will still return true and terminate the current episode? Why will the highest-level policy still be able to set subgoals afterwards?

Thanks a lot!

Discrete Gridworld

Hi! Thank you for sharing your work! In the paper you mention that you tested HAC in a discrete gridworld environment. I assume that you do not need MuJoCo for this. Is the code you are sharing here compatible with discrete gridworlds as well?

Code for HierQ

Hey, maybe I'm dumb, but I can't find code for the HierQ part of the paper. If it's there, how do I use it? If not, is there another repo for it?

Thanks!

the Code for Grid World Environment

Hello my friend! I am interested in your HAC code for the grid-world four-rooms environment, but unfortunately I cannot find it in the repository. I am trying to figure out how the upper level outputs goals for grid-world-like environments, as described in your HAC paper.
Thanks for reading my issue.

NUM_BATCH and num_test_episodes

I'd like to know the role of the parameters NUM_BATCH and num_test_episodes in "run_HAC.py".

Is there any difference in learning outcome between procedures A and B?
A) NUM_BATCH = 1 and num_test_episodes = 1000
B) NUM_BATCH = 10 and num_test_episodes = 100
