rail-berkeley / softlearning

Softlearning is a reinforcement learning framework for training maximum entropy policies in continuous domains. Includes the official implementation of the Soft Actor-Critic algorithm.

Home Page: https://sites.google.com/view/sac-and-applications

License: Other

Shell 0.60% Python 99.40%
reinforcement-learning soft-actor-critic deep-learning deep-reinforcement-learning deep-neural-networks machine-learning

softlearning's Introduction

Softlearning

Softlearning is a deep reinforcement learning toolbox for training maximum entropy policies in continuous domains. The implementation is fairly thin and primarily optimized for our own development purposes. It utilizes the tf.keras modules for most of the model classes (e.g. policies and value functions), and we use Ray for experiment orchestration. Ray Tune and Autoscaler implement several neat features that let us take the same experiment scripts we use for local prototyping and seamlessly launch them as large-scale experiments on any chosen cloud service (e.g. GCP or AWS), and intelligently parallelize and distribute training for effective resource allocation.

This implementation uses TensorFlow. For a PyTorch implementation of soft actor-critic, take a look at rlkit.

Getting Started

Prerequisites

The environment can be run either locally using conda or inside a docker container. For the conda installation, you need to have Conda installed. For the Docker installation, you will need to have Docker and Docker Compose installed. Also, most of our environments currently require a MuJoCo license.

Conda Installation

  1. Download and install MuJoCo 1.50 and 2.00 from the MuJoCo website. We assume that the MuJoCo files are extracted to the default locations (~/.mujoco/mjpro150 and ~/.mujoco/mujoco200_{platform}). Unfortunately, gym and dm_control expect different paths for the MuJoCo 2.00 installation, which is why you will need to have it available both as ~/.mujoco/mujoco200_{platform} and ~/.mujoco/mujoco200. The easiest way is to create a symlink from ~/.mujoco/mujoco200_{platform} -> ~/.mujoco/mujoco200 with: ln -s ~/.mujoco/mujoco200_{platform} ~/.mujoco/mujoco200.

  2. Copy your MuJoCo license key (mjkey.txt) to ~/.mujoco/mjkey.txt.

  3. Clone softlearning

git clone https://github.com/rail-berkeley/softlearning.git ${SOFTLEARNING_PATH}
  4. Create and activate the conda environment, and install softlearning to enable the command line interface.
cd ${SOFTLEARNING_PATH}
conda env create -f environment.yml
conda activate softlearning
pip install -e ${SOFTLEARNING_PATH}

The environment should be ready to run. See the examples section for how to train and simulate the agents.

Finally, to deactivate and remove the conda environment:

conda deactivate
conda remove --name softlearning --all

Docker Installation

docker-compose

To build the image and run the container:

export MJKEY="$(cat ~/.mujoco/mjkey.txt)" \
    && docker-compose \
        -f ./docker/docker-compose.dev.cpu.yml \
        up \
        -d \
        --force-recreate

You can access the container with the typical Docker exec command, i.e.

docker exec -it softlearning bash

See the examples section for how to train and simulate the agents.

Finally, to clean up the docker setup:

docker-compose \
    -f ./docker/docker-compose.dev.cpu.yml \
    down \
    --rmi all \
    --volumes

Examples

Training and simulating an agent

  1. To train the agent
softlearning run_example_local examples.development \
    --algorithm SAC \
    --universe gym \
    --domain HalfCheetah \
    --task v3 \
    --exp-name my-sac-experiment-1 \
    --checkpoint-frequency 1000  # Save the checkpoint to resume training later
  2. To simulate the resulting policy: first, find the absolute path that the checkpoint is saved to. By default (i.e. without specifying the log-dir argument to the previous script), the data is saved under ~/ray_results/<universe>/<domain>/<task>/<datetimestamp>-<exp-name>/<trial-id>/<checkpoint-id>. For example: ~/ray_results/gym/HalfCheetah/v3/2018-12-12T16-48-37-my-sac-experiment-1-0/mujoco-runner_0_seed=7585_2018-12-12_16-48-37xuadh9vd/checkpoint_1000/. The next command assumes that this path is stored in the ${SAC_CHECKPOINT_DIR} environment variable.
python -m examples.development.simulate_policy \
    ${SAC_CHECKPOINT_DIR} \
    --max-path-length 1000 \
    --num-rollouts 1 \
    --render-kwargs '{"mode": "human"}'

examples.development.main contains several different environments, and there are more example scripts available in the /examples folder. For more information about the agents and configurations, run the scripts with the --help flag: python ./examples/development/main.py --help

optional arguments:
  -h, --help            show this help message and exit
  --universe {robosuite,dm_control,gym}
  --domain DOMAIN
  --task TASK
  --checkpoint-replay-pool CHECKPOINT_REPLAY_POOL
                        Whether a checkpoint should also save the replay
                        pool. If set, takes precedence over
                        variant['run_params']['checkpoint_replay_pool']. Note
                        that the replay pool is saved (and constructed) piece
                        by piece so that each experience is saved only once.
  --algorithm ALGORITHM
  --policy {gaussian}
  --exp-name EXP_NAME
  --mode MODE
  --run-eagerly RUN_EAGERLY
                        Whether to run tensorflow in eager mode.
  --local-dir LOCAL_DIR
                        Destination local folder to save training results.
  --confirm-remote [CONFIRM_REMOTE]
                        Whether or not to query yes/no on remote run.
  --video-save-frequency VIDEO_SAVE_FREQUENCY
                        Save frequency for videos.
  --cpus CPUS           Cpus to allocate to ray process. Passed to `ray.init`.
  --gpus GPUS           Gpus to allocate to ray process. Passed to `ray.init`.
  --resources RESOURCES
                        Resources to allocate to ray process. Passed to
                        `ray.init`.
  --include-webui INCLUDE_WEBUI
                        Boolean flag indicating whether to start the web UI,
                        which is a Jupyter notebook. Passed to `ray.init`.
  --temp-dir TEMP_DIR   If provided, it will specify the root temporary
                        directory for the Ray process. Passed to `ray.init`.
  --resources-per-trial RESOURCES_PER_TRIAL
                        Resources to allocate for each trial. Passed to
                        `tune.run`.
  --trial-cpus TRIAL_CPUS
                        CPUs to allocate for each trial. Note: this is only
                        used for Ray's internal scheduling bookkeeping, and is
                        not an actual hard limit for CPUs. Passed to
                        `tune.run`.
  --trial-gpus TRIAL_GPUS
                        GPUs to allocate for each trial. Note: this is only
                        used for Ray's internal scheduling bookkeeping, and is
                        not an actual hard limit for GPUs. Passed to
                        `tune.run`.
  --trial-extra-cpus TRIAL_EXTRA_CPUS
                        Extra CPUs to reserve in case the trials need to
                        launch additional Ray actors that use CPUs.
  --trial-extra-gpus TRIAL_EXTRA_GPUS
                        Extra GPUs to reserve in case the trials need to
                        launch additional Ray actors that use GPUs.
  --num-samples NUM_SAMPLES
                        Number of times to repeat each trial. Passed to
                        `tune.run`.
  --upload-dir UPLOAD_DIR
                        Optional URI to sync training results to (e.g.
                        s3://<bucket> or gs://<bucket>). Passed to `tune.run`.
  --trial-name-template TRIAL_NAME_TEMPLATE
                        Optional string template for trial name. For example:
                        '{trial.trial_id}-seed={trial.config[run_params][seed]}'
                        Passed to `tune.run`.
  --checkpoint-frequency CHECKPOINT_FREQUENCY
                        How many training iterations between checkpoints. A
                        value of 0 (default) disables checkpointing. If set,
                        takes precedence over
                        variant['run_params']['checkpoint_frequency']. Passed
                        to `tune.run`.
  --checkpoint-at-end CHECKPOINT_AT_END
                        Whether to checkpoint at the end of the experiment. If
                        set, takes precedence over
                        variant['run_params']['checkpoint_at_end']. Passed to
                        `tune.run`.
  --max-failures MAX_FAILURES
                        Try to recover a trial from its last checkpoint at
                        least this many times. Only applies if checkpointing
                        is enabled. Passed to `tune.run`.
  --restore RESTORE     Path to checkpoint. Only makes sense to set if running
                        1 trial. Defaults to None. Passed to `tune.run`.
  --server-port SERVER_PORT
                        Port number for launching TuneServer. Passed to
                        `tune.run`.

Resume training from a saved checkpoint

This feature is currently broken!

To resume training from a previous checkpoint, run the original example main script with an additional --restore flag. For example, the previous example can be resumed as follows:

softlearning run_example_local examples.development \
    --algorithm SAC \
    --universe gym \
    --domain HalfCheetah \
    --task v3 \
    --exp-name my-sac-experiment-1 \
    --checkpoint-frequency 1000 \
    --restore ${SAC_CHECKPOINT_PATH}

References

The algorithms are based on the following papers:

Soft Actor-Critic Algorithms and Applications.
Tuomas Haarnoja*, Aurick Zhou*, Kristian Hartikainen*, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. arXiv preprint, 2018.
paper | videos

Latent Space Policies for Hierarchical Reinforcement Learning.
Tuomas Haarnoja*, Kristian Hartikainen*, Pieter Abbeel, and Sergey Levine. International Conference on Machine Learning (ICML), 2018.
paper | videos

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. International Conference on Machine Learning (ICML), 2018.
paper | videos

Composable Deep Reinforcement Learning for Robotic Manipulation.
Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, Sergey Levine. International Conference on Robotics and Automation (ICRA), 2018.
paper | videos

Reinforcement Learning with Deep Energy-Based Policies.
Tuomas Haarnoja*, Haoran Tang*, Pieter Abbeel, Sergey Levine. International Conference on Machine Learning (ICML), 2017.
paper | videos

If Softlearning helps you in your academic research, you are encouraged to cite our paper. Here is an example BibTeX entry:

@techreport{haarnoja2018sacapps,
  title={Soft Actor-Critic Algorithms and Applications},
  author={Tuomas Haarnoja and Aurick Zhou and Kristian Hartikainen and George Tucker and Sehoon Ha and Jie Tan and Vikash Kumar and Henry Zhu and Abhishek Gupta and Pieter Abbeel and Sergey Levine},
  journal={arXiv preprint arXiv:1812.05905},
  year={2018}
}

softlearning's People

Contributors

alacarter, azhou42, ben-eysenbach, brandontrabucco, dependabot[bot], haarnoja, hartikainen, henry-zhang-bohan, hrtang, johannespitz, nflu, sjoerdvansteenkiste, vitchyr


softlearning's Issues

Code slower than before refactoring

For some reason, training is currently tens of percent slower (in terms of wall-clock time) than it was prior to the latest refactor. We need to figure out what slows it down and fix it.

Error on Docker/GPU installation

Hi, thank you for sharing your source code and interesting results!

I've run the following command for Docker/GPU installation:

export MJKEY="$(cat ~/.mujoco/mjkey.txt)" \
    && docker-compose \
        -f ./docker/docker-compose.dev.gpu.yml \
        up \
        -d \
        --force-recreate

After that, I got the following error message:

Step 19/23 : RUN echo "${MJKEY}" > /root/.mujoco/mjkey.txt     && sed -i -e 's/^tensorflow==/tensorflow-gpu==/g' /tmp/requirements.txt     && conda env update -f /tmp/environment.yml     && rm /root/.mujoco/mjkey.txt     && rm /tmp/requirements.txt     && rm /tmp/environment.yml
 ---> Running in 9d088bf80325
Solving environment: ...working... done
ruamel_yaml-0.15.46  | 245 KB    | ########## | 100% 
ncurses-6.1          | 958 KB    | ########## | 100% 
python-3.6.5         | 29.4 MB   | ########## | 100% 
pip-18.1             | 1.8 MB    | ########## | 100% 
chardet-3.0.4        | 189 KB    | ########## | 100% 
pycosat-0.6.3        | 104 KB    | ########## | 100% 
requests-2.21.0      | 85 KB     | ########## | 100% 
six-1.12.0           | 22 KB     | ########## | 100% 
wheel-0.32.3         | 35 KB     | ########## | 100% 
certifi-2018.11.29   | 146 KB    | ########## | 100% 
urllib3-1.24.1       | 149 KB    | ########## | 100% 
cryptography-2.3.1   | 585 KB    | ########## | 100% 
zlib-1.2.11          | 120 KB    | ########## | 100% 
setuptools-40.6.3    | 625 KB    | ########## | 100% 
cffi-1.11.5          | 212 KB    | ########## | 100% 
patchelf-0.9         | 71 KB     | ########## | 100% 
pycparser-2.19       | 174 KB    | ########## | 100% 
idna-2.8             | 133 KB    | ########## | 100% 
pysocks-1.6.8        | 22 KB     | ########## | 100% 
asn1crypto-0.24.0    | 155 KB    | ########## | 100% 
pyopenssl-18.0.0     | 82 KB     | ########## | 100% 
conda-4.5.12         | 1.0 MB    | ########## | 100% 
Downloading and Extracting Packages
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting git+https://github.com/openai/gym.git@49cd48020f6760630a7317cb3529a22de6f12f2e#[all] (from -r /tmp/./requirements.txt (line 36))
  Cloning https://github.com/openai/gym.git (to revision 49cd48020f6760630a7317cb3529a22de6f12f2e) to ./pip-req-build-8ky3z5dn
Collecting git+https://github.com/vitchyr/multiworld.git@d76b3dae2e8cbca02924f93d6cc0239c552f6408 (from -r /tmp/./requirements.txt (line 50))
  Cloning https://github.com/vitchyr/multiworld.git (to revision d76b3dae2e8cbca02924f93d6cc0239c552f6408) to ./pip-req-build-g3i3y_w5
Collecting git+https://github.com/hartikainen/serializable.git@76516385a3a716ed4a2a9ad877e2d5cbcf18d4e6 (from -r /tmp/./requirements.txt (line 83))
  Cloning https://github.com/hartikainen/serializable.git (to revision 76516385a3a716ed4a2a9ad877e2d5cbcf18d4e6) to ./pip-req-build-q72w5nqc
Collecting absl-py==0.6.1 (from -r /tmp/./requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/0c/63/f505d2d4c21db849cf80bad517f0065a30be6b006b0a5637f1b95584a305/absl-py-0.6.1.tar.gz (94kB)
Requirement already satisfied: asn1crypto==0.24.0 in /opt/conda/envs/softlearning/lib/python3.6/site-packages (from -r /tmp/./requirements.txt (line 2)) (0.24.0)
Collecting astor==0.7.1 (from -r /tmp/./requirements.txt (line 3))
  Downloading https://files.pythonhosted.org/packages/35/6b/11530768cac581a12952a2aad00e1526b89d242d0b9f59534ef6e6a1752f/astor-0.7.1-py2.py3-none-any.whl
Collecting atomicwrites==1.2.1 (from -r /tmp/./requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/3a/9a/9d878f8d885706e2530402de6417141129a943802c084238914fa6798d97/atomicwrites-1.2.1-py2.py3-none-any.whl
Collecting attrs==18.2.0 (from -r /tmp/./requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/3a/e1/5f9023cc983f1a628a8c2fd051ad19e76ff7b142a0faf329336f9a62a514/attrs-18.2.0-py2.py3-none-any.whl
Collecting awscli==1.16.67 (from -r /tmp/./requirements.txt (line 6))
  Downloading https://files.pythonhosted.org/packages/aa/e5/ebd5896ad5ae353d23bea05ebb8edd3d49f1471784f6afa12a9cf11710de/awscli-1.16.67-py2.py3-none-any.whl (1.4MB)
Collecting boto3==1.9.57 (from -r /tmp/./requirements.txt (line 7))
  Downloading https://files.pythonhosted.org/packages/bf/a1/2fedb80d3eefe024580aaff3e81106058b6f99698295edfca51199162bd5/boto3-1.9.57-py2.py3-none-any.whl (128kB)
Collecting botocore==1.12.57 (from -r /tmp/./requirements.txt (line 8))
  Downloading https://files.pythonhosted.org/packages/f1/37/eb8f5a76e1cb16ecabb7c92f7504c37030c8b727d550021b2bb34dc2a082/botocore-1.12.57-py2.py3-none-any.whl (5.1MB)
Collecting cachetools==3.0.0 (from -r /tmp/./requirements.txt (line 9))
  Downloading https://files.pythonhosted.org/packages/76/7e/08cd3846bebeabb6b1cfc4af8aae649d90249b4aeed080bddb5297f1d73b/cachetools-3.0.0-py2.py3-none-any.whl
Requirement already satisfied: certifi==2018.11.29 in /opt/conda/envs/softlearning/lib/python3.6/site-packages (from -r /tmp/./requirements.txt (line 10)) (2018.11.29)
Requirement already satisfied: cffi==1.11.5 in /opt/conda/envs/softlearning/lib/python3.6/site-packages (from -r /tmp/./requirements.txt (line 11)) (1.11.5)
Requirement already satisfied: chardet==3.0.4 in /opt/conda/envs/softlearning/lib/python3.6/site-packages (from -r /tmp/./requirements.txt (line 12)) (3.0.4)
Collecting Click==7.0 (from -r /tmp/./requirements.txt (line 13))
  Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
Collecting cloudpickle==0.6.1 (from -r /tmp/./requirements.txt (line 14))
  Downloading https://files.pythonhosted.org/packages/fc/87/7b7ef3038b4783911e3fdecb5c566e3a817ce3e890e164fc174c088edb1e/cloudpickle-0.6.1-py2.py3-none-any.whl
Collecting colorama==0.3.9 (from -r /tmp/./requirements.txt (line 15))
  Downloading https://files.pythonhosted.org/packages/db/c8/7dcf9dbcb22429512708fe3a547f8b6101c0d02137acbd892505aee57adf/colorama-0.3.9-py2.py3-none-any.whl
Collecting conda==4.5.11 (from -r /tmp/./requirements.txt (line 16))
  Could not find a version that satisfies the requirement conda==4.5.11 (from -r /tmp/./requirements.txt (line 16)) (from versions: 3.0.6, 3.5.0, 3.7.0, 3.17.0, 4.0.0, 4.0.1, 4.0.2, 4.0.3, 4.0.4, 4.0.5, 4.0.7, 4.0.8, 4.0.9, 4.1.2, 4.1.6, 4.2.6, 4.2.7, 4.3.13, 4.3.16)
No matching distribution found for conda==4.5.11 (from -r /tmp/./requirements.txt (line 16))


CondaValueError: pip returned an error

ERROR: Service 'softlearning-dev-gpu' failed to build: The command '/bin/sh -c echo "${MJKEY}" > /root/.mujoco/mjkey.txt     && sed -i -e 's/^tensorflow==/tensorflow-gpu==/g' /tmp/requirements.txt     && conda env update -f /tmp/environment.yml     && rm /root/.mujoco/mjkey.txt     && rm /tmp/requirements.txt     && rm /tmp/environment.yml' returned a non-zero code: 1

I solved this issue by removing conda==4.5.11 from requirements.txt (line 16).

Bound std of Gaussian policy via beta-sigmoid?

If I got it correctly, the logstd of the Gaussian policy is clipped via min/max range and std is retrieved by exponentiation.

I am curious whether using a beta-sigmoidal function to model the logstd would be a tiny bit more stable, because it allows smooth lower/upper bounds and a less sharp gradient at larger magnitudes.

e.g.

logvar = network_output
var = 1 / (1 + self.beta * torch.exp(-logvar))
var = min_var + (max_var - min_var) * var
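For concreteness, here is a small NumPy sketch (not softlearning's implementation; the bounds and beta are made-up values) contrasting the hard log-std clipping the question refers to with the sigmoidal bound proposed above:

import numpy as np

# Illustrative bounds only; the actual constants in softlearning may differ.
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0
MIN_VAR, MAX_VAR = np.exp(2 * LOG_STD_MIN), np.exp(2 * LOG_STD_MAX)
beta = 1.0

def clipped_std(raw_log_std):
    # Hard clip, then exponentiate: the gradient is exactly zero outside the range.
    return np.exp(np.clip(raw_log_std, LOG_STD_MIN, LOG_STD_MAX))

def sigmoid_bounded_var(raw_logvar):
    # Smoothly squash the raw output into (MIN_VAR, MAX_VAR): gradients decay
    # near the bounds but never switch off abruptly.
    squashed = 1.0 / (1.0 + beta * np.exp(-raw_logvar))
    return MIN_VAR + (MAX_VAR - MIN_VAR) * squashed

raw = np.linspace(-30.0, 30.0, 7)
print(clipped_std(raw))
print(np.sqrt(sigmoid_bounded_var(raw)))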

Environment seeding doesn't allow reproducibility

Hi Hartikainen,

Thank you for maintaining the repo. Sorry if this is done elsewhere in the code, but shouldn't we set the seed of the environments (both training and eval) after creating them here for reproducibility purposes?

Gym maintains its own internal copy of numpy and uses that internal copy to sample the initial state. Setting the seed of the global numpy module does not affect this internal copy.

Thanks!
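To illustrate the point, a minimal sketch assuming the classic gym API of that era (env.seed was still available); the environment id here is just an example:

import gym
import numpy as np

np.random.seed(0)              # seeds only the global NumPy RandomState
env = gym.make('Pendulum-v0')  # example environment id

# gym environments carry their own np_random; without this call the initial
# state still varies between runs even though global NumPy is seeded.
env.seed(0)

print(env.reset())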

Unable to reproduce result on HalfCheetah-v2

I am unable to obtain the result reported in the paper on the OpenAI Gym environment HalfCheetah-v2. The commit used to obtain this result is 1f6147c, which isn't too old. The result is averaged over 5 random initial seeds.

[learning curve plot: HalfCheetah-v2]

Do you know what might be causing this issue? Thank you!

I am able to obtain the result as reported (or close to it) in the paper on the remaining environments, posted here for reference.

[learning curve plots: Ant, Walker, Humanoid, Hopper]

unstable training curve for default SQL

Hi,

First of all, thanks for the brilliant papers and making the codes open-source.

I was running SQL on HalfCheetah with the default settings using the command:
--universe=gym --domain=HalfCheetah --task=v2 --algorithm=SQL --exp-name=my-sql-experiment-2 --checkpoint-frequency=1000.

It uses a Gaussian policy and a reward scale of 30, which I think implies very low entropy regularization.

However, I obtained very unstable training and evaluation return curves, shown below:

[training return curve]

[evaluation return curve]

I was wondering whether there is anything wrong with the default SQL settings, and how you test SQL. I tried lowering the reward scale, which leads to a lower but slightly more stable return curve.

Thanks!

No module named examples.development.simulate_policy

In the README.md, the following command is mentioned for simulating the resulting policy:
python -m examples.development.simulate_policy [...]
However, I could not find this module in the development directory.
When trying to run it, Python outputs the following error:
/home/user/miniconda3/envs/softlearning/bin/python: No module named examples.development.simulate_policy

GPU issues

I ran the command 'CUDA_VISIBLE_DEVICES=3 python -m examples.development.main --mode=local --universe=gym --domain=Hopper --task=v2 --exp-name=test --checkpoint-frequency=1000 --cpus=16 --gpus 1 --trial-cpus 16 --trial-gpus 1'.

But when I check nvidia-smi, no GPU is used.

My question is: how do I use the GPU to run the code?

Thanks!
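Before digging into Ray's --gpus/--trial-gpus flags, a quick sanity check (a generic TF 1.x-era sketch, not specific to this repo) is to confirm that TensorFlow itself can see a GPU inside the same environment:

import tensorflow as tf

# False here means the TensorFlow build or CUDA setup is the problem,
# not Ray's resource bookkeeping.
print(tf.test.is_gpu_available())
print(tf.test.gpu_device_name())  # e.g. '/device:GPU:0' when a GPU is visible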

InvalidGitRepository error

Invalid git repository (last line of the screenshot).

Please tell me which path (repository) it is looking for.
[terminal screenshot]

self._Serializable__initialize(locals()) missing. serializable package missing

Trying to install...

I seem to be stuck because self._Serializable__initialize does not exist. I believe this is because the serializable package does not exist, and much Google searching doesn't turn it up either.
git+https://github.com/hartikainen/serializable.git@76516385a3a716ed4a2a9ad877e2d5cbcf18d4e6 in the requirements.txt file does not install and doesn't seem to exist anywhere on the web.

Where can I get it?

Concatenating dm_control observations causes error due to uneven shapes

The code exits with an error when I try to run the humanoid run task because of this line:

flattened_observation = np.concatenate([

The reason is that the 'head_height' attribute of the observation has 0 dimensions, so calling np.concatenate complains that all items in the tuple being concatenated have to have the same number of dimensions.

Originally posted by @quanvuong in #69 (comment)
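A small NumPy sketch that reproduces this failure mode and one generic workaround (np.atleast_1d); this is illustrative only and not necessarily the fix used in softlearning:

import numpy as np

observation = {
    'head_height': np.array(1.3),   # 0-dimensional, as dm_control returns it
    'joint_angles': np.zeros(21),
    'velocity': np.zeros(27),
}

try:
    np.concatenate(list(observation.values()))
except ValueError as e:
    print('concatenate failed:', e)

# Promoting 0-d entries to 1-d before concatenating avoids the error.
flat = np.concatenate([np.atleast_1d(v) for v in observation.values()])
print(flat.shape)  # (49,)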

SAC checkpointing fails when using fixed entropy coefficient

@aviralkumar2907 ran into a problem where setting the target entropy to a fixed value makes the checkpointing crash:

Traceback (most recent call last):
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/trial_runner.py", line 399, in _process_events
    self._checkpoint_trial_if_needed(trial)
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/trial_runner.py", line 430, in _checkpoint_trial_if_needed
    self.trial_executor.save(trial, storage=Checkpoint.DISK)
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/ray_trial_executor.py", line 317, in save
    trial._checkpoint.value = ray.get(trial.runner.save.remote())
  File "/home/kristian/github/hartikainen/ray/python/ray/worker.py", line 2211, in get
    raise value
ray.worker.RayTaskError: ray_ExperimentRunner:save() (pid=4371, host=jensen2)
  File "/home/kristian/github/hartikainen/ray/python/ray/tune/trainable.py", line 226, in save
    checkpoint = self._save(checkpoint_dir)
  File "/home/kristian/github/hartikainen/softlearning/examples/development/main.py", line 120, in _save
    tf_checkpoint = self._get_tf_checkpoint()
  File "/home/kristian/github/hartikainen/softlearning/examples/development/main.py", line 87, in _get_tf_checkpoint
    tf_checkpoint = tf.train.Checkpoint(**self.algorithm.tf_saveables)
  File "/home/kristian/github/hartikainen/softlearning/softlearning/algorithms/sac.py", line 430, in tf_saveables
    '_alpha_optimizer': self._alpha_optimizer,
AttributeError: 'SAC' object has no attribute '_alpha_optimizer'
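One way this could be guarded, sketched hypothetically from the traceback above (the attribute names come from the traceback, not from the actual fix that landed): only include the alpha optimizer among the saveables when it exists, i.e. when alpha is actually being trained.

class SACLike:
    # Minimal stand-in for the algorithm object; only the checkpointing logic matters here.
    def __init__(self, train_alpha=True):
        self._policy_optimizer = object()
        self._Q_optimizers = [object(), object()]
        if train_alpha:
            self._alpha_optimizer = object()

    @property
    def tf_saveables(self):
        saveables = {
            '_policy_optimizer': self._policy_optimizer,
            **{'_Q_optimizer_{}'.format(i): optimizer
               for i, optimizer in enumerate(self._Q_optimizers)},
        }
        # With a fixed entropy coefficient there is no alpha optimizer to save.
        if hasattr(self, '_alpha_optimizer'):
            saveables['_alpha_optimizer'] = self._alpha_optimizer
        return saveables

print(sorted(SACLike(train_alpha=False).tf_saveables))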

Error occurs while running

When I run the command given in the README:

python -m examples.development.main \
    --mode=local \
    --universe=gym \
    --domain=HalfCheetah \
    --task=v2 \
    --exp-name=my-sac-experiment-1 \
    --checkpoint-frequency=1000  # Save the checkpoint to resume training later

to train the agent, the following error occurs:

pygame 1.9.4
Hello from the pygame community. https://www.pygame.org/contribute.html
Traceback (most recent call last):
  File "/home/deepglint/anaconda2/envs/softlearning/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/deepglint/anaconda2/envs/softlearning/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/deepglint/softlearning/examples/development/main.py", line 14, in <module>
    from softlearning.algorithms.utils import get_algorithm_from_variant
  File "/home/deepglint/softlearning/softlearning/algorithms/__init__.py", line 2, in <module>
    from .sac import SAC
  File "/home/deepglint/softlearning/softlearning/algorithms/sac.py", line 47
    **kwargs,
            ^
SyntaxError: invalid syntax

I would very much appreciate it if you could give me a solution.

Conda issues installing patchelf

I'll try to find a work-around. Here's the output. Thanks!

11:04:39 ~/git_repos/softlearning:master
$ conda env create -f environment.yml
Solving environment: failed

ResolvePackageNotFound:

  • patchelf=0.9

How to set random seed?

Thanks for the repo!

How can I set the random seed for a run? There doesn't seem to be an option to set the seed using command line arguments.

Run on real robot

Hi,
is there currently some implementation to run this outside of simulation?

[question] Action Smoothing

Hello,
In the blog and in the code, action smoothing is mentioned but never explained...
In the code, one apparently does something like:

# smoothing coeff (0 -> no smoothing, 1 -> max smoothing)
alpha = some_value
beta = sqrt(1 - alpha ** 2)
# raw latents is sampled from a MultivariateNormalDiag with zero mean and unit std
smoothing_latent = alpha * smoothing_latent + raw_latents
latent = beta * smoothing_latent

action = mean + std * latent
action = tanh(action)

My question is: where does this smoothing come from? And why not use, for instance, exponential smoothing?

Also, can raw_latent be seen as the noise?
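For what it's worth, the snippet above behaves like AR(1)-correlated noise rescaled to keep a unit marginal variance: the recursion drives Var(smoothing_latent) towards 1 / (1 - alpha**2), and beta = sqrt(1 - alpha**2) scales it back to 1, so the marginal action distribution is unchanged while consecutive latents become correlated (raw_latents plays the role of the i.i.d. noise). A plain-NumPy check of that claim, independent of the actual implementation:

import numpy as np

np.random.seed(0)
alpha = 0.9
beta = np.sqrt(1.0 - alpha ** 2)

smoothing_latent = 0.0
latents = []
for _ in range(200000):
    raw_latent = np.random.randn()                          # i.i.d. unit Gaussian noise
    smoothing_latent = alpha * smoothing_latent + raw_latent
    latents.append(beta * smoothing_latent)

latents = np.array(latents)
print(latents.var())                                        # ~1.0: marginal variance preserved
print(np.corrcoef(latents[:-1], latents[1:])[0, 1])         # ~alpha: temporal correlation

Plain exponential smoothing without the beta rescaling would shrink the marginal variance of the latent, which is presumably the motivation for this particular form.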

Progress on Deepmind control suite?

Hi!

I noticed that a recent commit was pushed to the repo to support running SAC on the DeepMind Control Suite.

I was wondering whether the current code base is ready to run on the DeepMind Control Suite, and if not, what else remains to be done? Maybe I could help. Thanks!

[Enhancement] Make replay buffer memory-efficient

Currently, the replay buffer stores each observation twice (since it stores a tuple of state, action, next_state, reward, done for every transition). The buffer consumes large amounts of memory for environments with high-dimensional observations (like images). For example, the memory consumption for an experiment with 48x48 RGB images and 1 million timesteps is about 56 GB, and this could be cut down to roughly 28 GB.
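A hypothetical sketch of the proposed saving (a generic ring-buffer idea, not softlearning's replay pool API): store each observation once and recover next_observations by an index offset.

import numpy as np

class DedupedPool:
    """Stores each observation once; next_observations come from an index offset.

    Episode boundaries (where the stored next observation is not the next
    episode's first observation) would need extra bookkeeping, omitted here,
    as would wraparound once the pool is full.
    """

    def __init__(self, capacity, observation_shape, observation_dtype=np.uint8):
        self.observations = np.zeros(
            (capacity + 1, *observation_shape), dtype=observation_dtype)
        self.actions, self.rewards = [], []
        self.size = 0

    def add(self, observation, action, reward, next_observation):
        self.observations[self.size] = observation
        self.observations[self.size + 1] = next_observation  # shared with the next step
        self.actions.append(action)
        self.rewards.append(reward)
        self.size += 1

    def random_batch(self, batch_size):
        indices = np.random.randint(0, self.size, batch_size)
        return {
            'observations': self.observations[indices],
            'next_observations': self.observations[indices + 1],
            'actions': np.asarray(self.actions)[indices],
            'rewards': np.asarray(self.rewards)[indices],
        }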

Gaussian mixture policies

Are GMM policies stable with the Q-only formulation (without V function)? I see that this repository doesn't contain GMM policies while the old one (haarnoja/sac) does.
I am trying to get them working in rlkit, but it seems like GMM policies are difficult to train without the V function.

Module 'gym' has no attribute 'register' on MacOS Mojave 10.14.4

Hi all, when I tried to run a reward learning task (https://github.com/avisingh599/reward-learning-rl) with the softlearning environment, the following error occurred: "AttributeError: module 'gym' has no attribute 'register'"

However, when I run import gym and gym.register() in a separate Python script in PyCharm, it works fine, i.e. it is able to find register in gym. I had a look at the previous issues posted for softlearning and think this is a gym adapter issue, but I am not sure how to manually add this environment/task to gym_adapter in the softlearning package. Many thanks for your help!

[error screenshot]
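For reference, a generic sketch of how an environment normally gets registered with gym (the id and module path below are hypothetical); softlearning's gym adapter ultimately creates environments through this same registry via gym.envs.make:

from gym.envs.registration import register

register(
    id='MyRewardLearningTask-v0',                 # hypothetical id
    entry_point='my_package.envs:MyTaskEnv',      # hypothetical module:class path
    max_episode_steps=1000,
)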

SAC Hyperparameters MountainCarContinuous-v0 - Env with deceptive reward

Hello,

I've tried in vain to find suitable hyperparameters for SAC in order to solve MountainCarContinuous-v0.

Even with hyperparameter tuning (see the "add-trpo" branch of rl baselines zoo), I was not able to solve it consistently (if it finds the goal during random exploration, then it will work; otherwise, it gets stuck in a local minimum).
I also encountered that issue when trying SAC on another environment with a deceptive reward (a bit-flipping env, trying to apply HER + SAC, see here).

Did you manage to solve that problem? If so, what hyperparameters did you use?

Note: I am using the SAC implementation from stable-baselines, which works pretty well on all other problems (though those have dense rewards).

Import error. Trying to rebuild mujoco_py.

$  softlearning run_example_local examples.development \
>     --universe=gym \
>     --domain=HalfCheetah \
>     --task=v3 \
>     --exp-name=my-sac-experiment-1 \
>     --checkpoint-frequency=1000

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

WARNING: Logging before flag parsing goes to stderr.
I0418 01:08:50.825603 140032189581056 acceleratesupport.py:13] OpenGL_accelerate module loaded
I0418 01:08:50.832047 140032189581056 arraydatatype.py:270] Using accelerated ArrayDatatype
I0418 01:08:51.017610 140032189581056 __init__.py:34] MuJoCo library version is: 200
2019-04-18 01:08:51,105 INFO node.py:439 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-18_01-08-51_27162/logs.
2019-04-18 01:08:51,211 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:20587 to respond...
2019-04-18 01:08:51,320 INFO services.py:364 -- Waiting for redis server at 127.0.0.1:52635 to respond...
2019-04-18 01:08:51,321 INFO services.py:761 -- Starting Redis shard with 10.0 GB max memory.
2019-04-18 01:08:51,337 WARNING services.py:1301 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-04-18 01:08:51,337 INFO services.py:1449 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-04-18 01:08:51,885 INFO tune.py:139 -- Did not find checkpoint file in /home/yrli/ray_results/gym/HalfCheetah/v3/2019-04-18T01-08-51-my-sac-experiment-1.
2019-04-18 01:08:51,885 INFO tune.py:145 -- Starting a new experiment.
2019-04-18 01:08:51,892 INFO web_server.py:241 -- Starting Tune Server...
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/56 CPUs, 0/8 GPUs
Memory usage on this node: 4.8/270.1 GB

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 56/56 CPUs, 0/8 GPUs
Memory usage on this node: 4.9/270.1 GB
Result logdir: /home/yrli/ray_results/gym/HalfCheetah/v3/2019-04-18T01-08-51-my-sac-experiment-1
Number of trials: 1 ({'RUNNING': 1})
RUNNING trials:
 - id=f24a78d2-seed=4956:       RUNNING

(pid=27322) 
(pid=27322) WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
(pid=27322) For more information, please see:
(pid=27322)   * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
(pid=27322)   * https://github.com/tensorflow/addons
(pid=27322) If you depend on functionality not listed there, please file an issue.
(pid=27322) 
(pid=27322) 2019-04-18 01:08:55.340353: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
(pid=27322) Using seed 4956
(pid=27322) 2019-04-18 01:08:55.424360: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
(pid=27322) 2019-04-18 01:08:55.424399: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: 64.site
(pid=27322) 2019-04-18 01:08:55.424408: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: 64.site
(pid=27322) 2019-04-18 01:08:55.424460: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported version is: 410.104.0
(pid=27322) 2019-04-18 01:08:55.424498: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported version is: 410.104.0
(pid=27322) 2019-04-18 01:08:55.424507: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version seems to match DSO: 410.104.0
(pid=27322) 2019-04-18 01:08:55.426316: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400060000 Hz
(pid=27322) 2019-04-18 01:08:55.429028: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5d18810 executing computations on platform Host. Devices:
(pid=27322) 2019-04-18 01:08:55.429054: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
(pid=27322) Import error. Trying to rebuild mujoco_py.
(pid=27322) Compiling /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/cymj.pyx because it depends on /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/pxd/mujoco.pxd.
(pid=27322) [1/1] Cythonizing /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/cymj.pyx
(pid=27322) running build_ext
(pid=27322) building 'mujoco_py.cymj' extension
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/gl
(pid=27322) gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py -I/home/yrli/.mujoco/mujoco200/include -I/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/numpy/core/include -I/home/yrli/anaconda3/envs/softlearning/include/python3.6m -c /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/cymj.c -o /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/cymj.o -fopenmp -w
(pid=27322) gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py -I/home/yrli/.mujoco/mujoco200/include -I/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/numpy/core/include -I/home/yrli/anaconda3/envs/softlearning/include/python3.6m -c /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/gl/osmesashim.c -o /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/gl/osmesashim.o -fopenmp -w
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/lib.linux-x86_64-3.6
(pid=27322) creating /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/lib.linux-x86_64-3.6/mujoco_py
(pid=27322) gcc -pthread -shared -L/home/yrli/anaconda3/envs/softlearning/lib -Wl,-rpath=/home/yrli/anaconda3/envs/softlearning/lib,--no-as-needed -L/home/yrli/anaconda3/envs/softlearning/lib -Wl,-rpath=/home/yrli/anaconda3/envs/softlearning/lib,--no-as-needed /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/cymj.o /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/temp.linux-x86_64-3.6/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/gl/osmesashim.o -L/home/yrli/.mujoco/mujoco200/bin -L/home/yrli/anaconda3/envs/softlearning/lib -Wl,--enable-new-dtags,-R/home/yrli/.mujoco/mujoco200/bin -lmujoco200 -lglewosmesa -lOSMesa -lGL -lpython3.6m -o /home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/generated/_pyxbld_2.0.2.0_36_linuxcpuextensionbuilder/lib.linux-x86_64-3.6/mujoco_py/cymj.cpython-36m-x86_64-linux-gnu.so -fopenmp
2019-04-18 01:09:53,990 ERROR trial_runner.py:426 -- Error processing event.
Traceback (most recent call last):
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 389, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 252, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/worker.py", line 2288, in get
    raise value
ray.exceptions.RayTaskError: ray_ExperimentRunner:train() (pid=27322, host=64.site)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 11, in <module>
    import mujoco_py
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/__init__.py", line 3, in <module>
    from mujoco_py.builder import cymj, ignore_mujoco_warnings, functions, MujocoException
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/builder.py", line 503, in <module>
    cymj = load_cython_ext(mujoco_path)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/builder.py", line 106, in load_cython_ext
    mod = load_dynamic_ext('cymj', cext_so_path)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/mujoco_py/builder.py", line 124, in load_dynamic_ext
    return loader.load_module()
ImportError: dlopen: cannot load any more object with static TLS

During handling of the above exception, another exception occurred:

ray_ExperimentRunner:train() (pid=27322, host=64.site)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/tune/trainable.py", line 150, in train
    result = self._train()
  File "/data1/yrli/softlearning/examples/development/main.py", line 77, in _train
    self._build()
  File "/data1/yrli/softlearning/examples/development/main.py", line 44, in _build
    get_environment_from_params(environment_params['training']))
  File "/data1/yrli/softlearning/softlearning/environments/utils.py", line 33, in get_environment_from_params
    return get_environment(universe, domain, task, environment_kwargs)
  File "/data1/yrli/softlearning/softlearning/environments/utils.py", line 24, in get_environment
    return ADAPTERS[universe](domain, task, **environment_params)
  File "/data1/yrli/softlearning/softlearning/environments/adapters/gym_adapter.py", line 66, in __init__
    env = gym.envs.make(env_id, **kwargs)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 183, in make
    return registry.make(id, **kwargs)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 125, in make
    env = spec.make(**kwargs)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 88, in make
    cls = load(self._entry_point)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/registration.py", line 17, in load
    mod = importlib.import_module(mod_name)
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/mujoco/__init__.py", line 1, in <module>
    from gym.envs.mujoco.mujoco_env import MujocoEnv
  File "/home/yrli/anaconda3/envs/softlearning/lib/python3.6/site-packages/gym/envs/mujoco/mujoco_env.py", line 13, in <module>
    raise error.DependencyNotInstalled("{}. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)".format(e))
gym.error.DependencyNotInstalled: dlopen: cannot load any more object with static TLS. (HINT: you need to install mujoco_py, and also perform the setup instructions here: https://github.com/openai/mujoco-py/.)

Resume training

Hi,
I am trying to resume a training run, and I think this works via the --restore parameter? But when I try this, I get an error message that a file ending in ...tune_metadata was not found, and indeed there is no file with this ending in my checkpoints. What is the best way to resume experiments?

Simulate policy not working anymore

In simulate_policy.py, on line 70, I had to replace

render_kwargs={'mode': args.render_mode}

and in base_policy.py, on line 83, I replaced the call with

super(LatentSpacePolicy, self).__init__(kwargs['observation_keys'])

Additional information for baseline algorithms

Hi,

Would you please share the information for the baselines, e.g. DDPG, TD3, and PPO, with their hyperparameter settings? (If you used open-source code, could you please share the link?)

While this repository only considers soft-learning algorithms, I think refactoring previous baseline algorithms in this framework might be quite useful.

Thanks.

Bug in HER replay pool (and multi goal setup not finished)

Hi,

I noticed there is a bug in the HER replay pool. I would send a PR, but it seems that the whole multi-goal setup is nowhere near ready at the moment. There is no support in the policies (no policy has a goal_keys attribute) and I couldn't really figure out what env.goal_key_map is supposed to represent. Therefore all the tests for the HER replay pool are failing and the bug wasn't caught by the tests.

The bug:

def REPLACE_FULL_OBSERVATION(original_batch,
                             resampled_batch,
                             where_resampled,
                             environment):
    batch_flat = flatten(original_batch)
    resampled_batch_flat = flatten(original_batch)  # wrong
    goal_keys = [
        key for key in batch_flat.keys()
        if key[0] == 'goals'
    ]
    for key in goal_keys:
        assert (batch_flat[key][where_resampled].shape
                == resampled_batch_flat[key].shape)
        batch_flat[key][where_resampled] = (
            resampled_batch_flat[key])

    return unflatten(batch_flat)

should be

def REPLACE_FULL_OBSERVATION(original_batch,
                             resampled_batch,
                             where_resampled,
                             environment):
    batch_flat = flatten(original_batch)
    resampled_batch_flat = flatten(resampled_batch)  # correct
    goal_keys = [
        key for key in batch_flat.keys()
        if key[0] == 'goals'
    ]
    for key in goal_keys:
        assert (batch_flat[key][where_resampled].shape
                == resampled_batch_flat[key].shape)
        batch_flat[key][where_resampled] = (
            resampled_batch_flat[key])

    return unflatten(batch_flat)

Since I am trying to experiment with hierarchical RL with multiple goals, I am more than happy to contribute to the multi-goal setup. From the existing code, I couldn't figure out the big picture.

All the best,

Lukas

No module named 'softlearning.utils'

I used virtualenv instead of conda on Ubuntu 16.04 and ran python setup.py install after installing all the requirements.
When I ran the first example to train the agent, I got the following error:
No module named 'softlearning.utils'
Any idea what caused this? Thanks!

(venv) jc@jc-Precision-5510:~/research/softlearning$ softlearning run_example_local examples.development     --universe=gym     --domain=HalfCheetah     --task=v3     --exp-name=my-sac-experiment-1     --checkpoint-frequency=1000

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Traceback (most recent call last):
  File "/home/jc/research/softlearning/venv/bin/softlearning", line 11, in <module>
    load_entry_point('softlearning==0.0.1', 'console_scripts', 'softlearning')()
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/scripts/console_scripts.py", line 202, in main
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/scripts/console_scripts.py", line 71, in run_example_local_cmd
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/examples/instrument.py", line 203, in run_example_local
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/examples/development/__init__.py", line 21, in get_parser
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/examples/utils.py", line 8, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/algorithms/__init__.py", line 1, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/algorithms/sql.py", line 9, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/algorithms/rl_algorithm.py", line 12, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/samplers/__init__.py", line 4, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/samplers/remote_sampler.py", line 10, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/samplers/utils.py", line 5, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/replay_pools/__init__.py", line 4, in <module>
  File "/home/jc/research/softlearning/venv/lib/python3.6/site-packages/softlearning-0.0.1-py3.6.egg/softlearning/replay_pools/trajectory_replay_pool.py", line 8, in <module>
ModuleNotFoundError: No module named 'softlearning.utils'

Checkpointing should not store cumulative replay pool

Right now our checkpointing code saves the full replay pool on every single checkpoint. This has become a problem with the image experiments, since the snapshot size grows to gigabytes. One solution could be to save only the experience added since the latest checkpoint and reconstruct the replay pool from the previous checkpoints when restoring.
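A hypothetical sketch of that idea (the file names and the pool methods slice/add_samples are assumed here, not softlearning's actual API): each checkpoint persists only the transitions added since the previous one, and restoring replays every chunk in order.

import os
import pickle

class IncrementalPoolCheckpointer:
    def __init__(self, pool):
        self.pool = pool
        self._last_saved_size = 0

    def save(self, checkpoint_dir, iteration):
        # Persist only the experience gathered since the previous checkpoint.
        chunk = self.pool.slice(self._last_saved_size, self.pool.size)  # assumed pool API
        path = os.path.join(checkpoint_dir, 'pool_chunk_{}.pkl'.format(iteration))
        with open(path, 'wb') as f:
            pickle.dump(chunk, f)
        self._last_saved_size = self.pool.size

    def restore(self, checkpoint_dirs):
        # Rebuild the pool by replaying every chunk, oldest checkpoint first.
        for directory in checkpoint_dirs:
            for name in sorted(os.listdir(directory)):
                if name.startswith('pool_chunk_'):
                    with open(os.path.join(directory, name), 'rb') as f:
                        self.pool.add_samples(pickle.load(f))  # assumed pool API
        self._last_saved_size = self.pool.size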

Parallelization

Hi,
as far as I understand it, SAC currently works for training with a single agent?

Are there plans to support distributed training like done in Surreal?

Pusher low level policy for ('any',-1) is not learning

Hi,

I am running your code. The Pusher low-level policy trained for (-1, 'any') works fine, but ('any', -1) doesn't do anything. Is there a fix for it, or do you have trained low-level models that I could use for my work?

Hierarchical training and reward set

Hi,
I found your paper "Latent Space Policies for Hierarchical Reinforcement Learning" very interesting and was glad you published the code. Motivated by your results, I'd like to implement the ant maze with hierarchical policies and compound skills / different rewards.
I couldn't find answers to the following questions; it would be great if you could help me out!

I assume that I have to pretrain a lower-level policy first. How do I freeze the low-level weights in the next step, and how can I add a high-level policy on top?

In the paper you mentioned a set of K reward functions. Where can I define the reward set?

Thank you!

MultivariateNormalDiag log_prob with target_entropy and alpha

Thank you for sharing the code.

For Gaussian policies, in this implementation as well as the original SAC repository, the log_prob method of the MultivariateNormalDiag class is used to compute the log probability of an action. This method returns the probability density and not the probability, so log probabilities can be greater than 0. The issue arises in the learning objective for alpha. I can set target_entropy = 0.0, in which case you'd expect alpha to go to 0 (an entropy of 0 would indicate a deterministic policy), but this is not the case, since log_pi can be greater than or less than 0.

Is there something simple I'm missing here?

Thank you again.
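A short plain-NumPy check of the premise (illustrative only): log_prob is a log density, so it can exceed zero, and the differential entropy of a Gaussian with std below 1/sqrt(2*pi*e) is already negative, which is why target_entropy = 0.0 does not correspond to a deterministic policy.

import numpy as np

def gaussian_log_density(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def gaussian_entropy(std):
    # Differential entropy of a univariate Gaussian.
    return 0.5 * np.log(2 * np.pi * np.e * std ** 2)

print(gaussian_log_density(0.0, 0.0, 0.1))  # ~1.38 > 0: a log *density* can be positive
print(gaussian_entropy(0.1))                # ~-0.88: negative entropy, yet still stochastic
print(gaussian_entropy(1.0))                # ~1.42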

Union pool

Hi,

Seems like the union pool implementation is unfinished. Is that going to be done soon?

Thanks!

Automating alpha

Alpha loss is defined as (code):
alpha_loss = -tf.reduce_mean(log_alpha * tf.stop_gradient(log_pis + self._target_entropy))

In other words:
alpha_loss = log_alpha * (- <Negative constant>)
alpha_loss = log_alpha * (<Positive constant>)

So minimizing alpha_loss means minimizing log_alpha, which means that alpha always goes to zero no matter what, and this is indeed what I'm seeing in my experiments.

I'm obviously forgetting something, but I'm not able to figure it out.
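A small check of the gradient sign (plain NumPy, using the loss quoted above) suggests the missing piece: log_pis is not a constant but tracks the current policy, so the stop-gradient term flips sign as the policy's entropy crosses the target, and alpha only shrinks while the entropy is above the target.

import numpy as np

target_entropy = -3.0  # e.g. -dim(action_space), the usual SAC heuristic

def dloss_dlog_alpha(log_pis):
    # alpha_loss = -mean(log_alpha * stop_gradient(log_pis + target_entropy))
    # => d(alpha_loss)/d(log_alpha) = -mean(log_pis + target_entropy)
    return -np.mean(log_pis + target_entropy)

entropy_above_target = np.full(256, -5.0)  # entropy ~ 5 > -3: positive gradient, alpha shrinks
entropy_below_target = np.full(256, 4.0)   # entropy ~ -4 < -3: negative gradient, alpha grows

print(dloss_dlog_alpha(entropy_above_target))  # 8.0
print(dloss_dlog_alpha(entropy_below_target))  # -1.0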
