
morl's Introduction

Deep Multi-Objective Reinforcement Learning

A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation, NeurIPS'19.

Abstract

We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After this initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.

Instructions

The experiments cover two synthetic domains, Deep Sea Treasure (DST) and Fruit Tree Navigation (FTN), as well as two complex real domains, Task-Oriented Dialog Policy Learning (Dialog) and the SuperMario game (SuperMario).

synthetic

The code in synthetic was written for torch 0.4.0 (apologies for the two-year-old code) with Python 3.5; the visdom version is 0.1.6.3.

  • Example - train envelope MOQ-learning on FTN domain:
    python train.py --env-name ft --method crl-envelope --model linear --gamma 0.99 --mem-size 4000 --batch-size 256 --lr 1e-3 --epsilon 0.5 --epsilon-decay --weight-num 32 --episode-num 5000 --optimizer Adam --save crl/envelope/saved/ --log crl/envelope/logs/ --update-freq 100 --beta 0.01 --name 0

The code for our envelope MOQ-learning algorithm is in synthetic/crl/envelope/meta.py, and the neural network architectures are configurable in synthetic/crl/envelope/models. The two synthetic environments are under synthetic/envs.
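
For orientation, here is a minimal, self-contained sketch of the envelope update idea. It is an illustration only, not the repository code; the tensor shapes, the function name envelope_target, and the omission of terminal-state handling are assumptions. For every sampled preference ω, the scalarized value ωᵀQ is maximized jointly over actions and over all sampled preferences, and the maximizing vector-valued Q is used in the Bellman backup:

    import torch

    def envelope_target(q_next, w, rewards, gamma=0.99):
        # q_next:  (B, W, A, R) Q-value vectors for B next states, W sampled
        #          preferences, A actions, and R reward dimensions
        # w:       (W, R) sampled linear preferences
        # rewards: (B, R) vector rewards of the sampled transitions
        B, W, A, R = q_next.shape
        # Scalarize every Q-vector with every preference: scal[b, i, j, a] = w[i] . q_next[b, j, a]
        scal = torch.einsum('ir,bjar->bija', w, q_next)
        # Envelope max: for each (state, preference i) pick the best (preference j, action a) pair
        idx = scal.reshape(B, W, W * A).argmax(dim=-1)        # (B, W)
        j_idx, a_idx = idx // A, idx % A
        b_idx = torch.arange(B)[:, None].expand(B, W)
        hq = q_next[b_idx, j_idx, a_idx]                      # (B, W, R) maximizing Q-vectors
        # Vector-valued Bellman backup (terminal-state masking omitted for brevity)
        return rewards[:, None, :] + gamma * hq

In the repository this logic lives inside learn() in meta.py, interleaved with target-network and loss bookkeeping, so the actual code reads differently (see the issues below).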

pydial

Code for Task-Oriented Dialog Policy Learning. The environment is modified from PyDial.

The code in pydial was written for torch 0.4.1 with Python 2.7 (since PyDial requires Python 2).

  • Example - train envelope MOQ-learning on Dialog domain:
    pydial train config/MORL/Train/envelope.cfg

The code for our envelope MOQ-learning algorithm is in pydial/policy/envelope.py.

multimario

The multi-objective version of the SuperMario game. The environment is modified from Kautenja/gym-super-mario-bros.

The code in multimario was written for torch 1.1.0 with Python 3.5.

  • Example - train envelope MOQ-learning on SuperMario domain:
    python run_e3c_double.py --env-id SuperMarioBros-v2 --use-cuda --use-gae --life-done --single-stage --training --standardization --num-worker 16 --sample-size 8 --beta 0.05 --name e3c_b05

The code for our envelope MOQ-learning algorithm is in multimario/agent.py. The two multi-objective versions of the environment are in multimario/env.py.

Citation

@incollection{yang2019morl,
  title = {A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation},
  author = {Yang, Runzhe and Sun, Xingyuan and Narasimhan, Karthik},
  booktitle = {Advances in Neural Information Processing Systems 32},
  editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
  pages = {14610--14621},
  year = {2019},
  publisher = {Curran Associates, Inc.},
  url = {http://papers.nips.cc/paper/9605-a-generalized-algorithm-for-multi-objective-reinforcement-learning-and-policy-adaptation.pdf}
}

morl's People

Contributors

kasimte, runzheyang

morl's Issues

The results for DST environment

Hello, I tried the synthetic code on the FTN env and the test results are close to the paper. However, for the DST env I get F1 scores of 0.77 and 0.66 for two runs, while the paper reports much higher values. Do the settings below (the setting in the README file) apply to the DST env too?
(I only changed --env-name to dst and --episode-num to 2000.)

python train.py --env-name dst --method crl-envelope --model linear --gamma 0.99 --mem-size 4000 --batch-size 256 --lr 1e-3 --epsilon 0.5 --epsilon-decay --weight-num 32 --episode-num 2000 --optimizer Adam --save crl/envelope/saved/ --log crl/envelope/logs/ --update-freq 100 --beta 0.01 --name 0

What is envemask doing inside the envelop_operator()?

Hi Runzhe, I couldn't understand what you are doing with this envelope mask. Why did you add that np.array here?

envemask = envemask.reshape(-1) * ofs + np.array(list(range(ofs))*args.sample_size)

Is the envemask related to this part of the pseudocode?

[image: pseudocode excerpt from the paper]

Why did you choose w' rather than w_i inside the value function?

I think this is the core of your algorithm but I am a bit confused TT
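
As a general note on the index arithmetic (a hedged illustration only; the actual shapes in run_e3c_double.py are not reproduced here): for a row-major array with ofs columns, row * ofs + column converts a per-column row choice into flat indices, so indexing the flattened array at those positions equals the fancy indexing arr[rows, cols]. A tiny numpy demonstration of that mechanic, with made-up sizes:

    import numpy as np

    ofs, sample_size = 4, 3                      # made-up sizes for illustration
    arr = np.arange(sample_size * ofs).reshape(sample_size, ofs)
    rows = np.array([2, 0, 1, 2])                # chosen row (e.g. an argmax) per column
    cols = np.arange(ofs)

    flat = rows * ofs + cols                     # flat indices into arr.reshape(-1)
    assert np.array_equal(arr.reshape(-1)[flat], arr[rows, cols])

The np.array(list(range(ofs)) * args.sample_size) term in the quoted line appears to play the role of the column indices, tiled once per sample block.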

Agents are stuck at the green pillar

I'm trying to run 'run_e2c_double.py', and I found that all of the agents get stuck at the same green pillar. Any idea about this problem? I have run the code for about 12 hours. Is it possible that I need to spend more time training the net, or is it caused by instability in the training process?
[screenshot: agents stuck at the green pillar]

Issue with synthetic\roijers_train.py

Hello,

In your file roijers_train.py, you have:

agent.memorize(state, action, next_state, reward, terminal, roi=True)
loss += agent.learn(corner_w)

But your code in the agent does not take the input roi in memorize or corner_w in learn. It can only be run like this:

loss += agent.learn()
agent.memorize(state, action, next_state, reward, terminal)

Thanks!

The operation of synthetic/crl/envelope/meta.py

Hello,

I am trying to read the MORL paper and your code.

I'm currently looking at synthetic/crl/envelope/meta.py,
but in the learn() function, it seems the H() operator is implemented differently from your paper.

# detach since we don't want gradients to propagate
# HQ, _ = self.model_(Variable(torch.cat(next_state_batch, dim=0), volatile=True),
#                     Variable(w_batch, volatile=True), w_num=self.weight_num)
_, DQ = self.model(Variable(torch.cat(next_state_batch, dim=0), requires_grad=False),
                   Variable(w_batch, requires_grad=False))
w_ext = w_batch.unsqueeze(2).repeat(1, action_size, 1)
w_ext = w_ext.view(-1, self.model.reward_size)
_, tmpQ = self.model_(Variable(torch.cat(next_state_batch, dim=0), requires_grad=False),
                      Variable(w_batch, requires_grad=False))
tmpQ = tmpQ.view(-1, reward_size)
# print(torch.bmm(w_ext.unsqueeze(1),
#                 tmpQ.data.unsqueeze(2)).view(-1, action_size))
act = torch.bmm(Variable(w_ext.unsqueeze(1), requires_grad=False),
                tmpQ.unsqueeze(2)).view(-1, action_size).max(1)[1]
HQ = DQ.gather(1, act.view(-1, 1, 1).expand(DQ.size(0), 1, DQ.size(2))).squeeze()

In the above code, it just takes an inner product of w_ext and tmpQ, then argmax over actions, like the equation below.

[equation image: argmax over actions of the scalarized inner product]

I think it behaves differently from the paper, which is described as

[equation image: the H operator as defined in the paper]

Actually, I see the commented lines 191~192 in your code, and they seem to behave consistently with your paper.
It seems that you intentionally commented them out and wrote a new version for some purpose.

Could you explain why you did this? Thanks!
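
For readers skimming this thread, the two forms being contrasted are roughly as follows (our paraphrase of the quoted code and of the paper, not the author's reply):

    paper (roughly):  HQ(s, ω) = arg_Q max over a and ω' of  ωᵀ Q(s, a, ω')
    code above:       a* = argmax over a of  ωᵀ tmpQ(s', a, ω),  then  HQ = DQ(s', a*, ω)

That is, the code shown selects the greedy action with one network under the same ω and reads the vector value from the other network, a double-Q-style construction that does not explicitly maximize over ω'; only the author can confirm the reason for the change.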

RuntimeWarning: overflow encountered in ubyte_scalars

I'm trying to run the sample command from the README for multimario and I get this warning repeatedly in the console:

(multimario) Kasim:multimario Kasim$ python run_e3c_double.py --env-id SuperMarioBros-v2 --use-gae --life-done --single-stage --training --standardization --num-worker 16 --sample-size 8 --beta 0.05 --name e3c_b05
/Users/Kasim/opt/anaconda3/envs/multimario/lib/python3.7/site-packages/gym_super_mario_bros/smb_env.py:148: RuntimeWarning: overflow encountered in ubyte_scalars
return (self.ram[0x86] - self.ram[0x071c]) % 256
/Users/Kasim/opt/anaconda3/envs/multimario/lib/python3.7/site-packages/gym_super_mario_bros/smb_env.py:148: RuntimeWarning: overflow encountered in ubyte_scalars
return (self.ram[0x86] - self.ram[0x071c]) % 256

Is this of concern?
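
This looks benign rather than a bug in the MORL code (an editorial guess, not a confirmed answer): the warning is raised inside gym_super_mario_bros when two uint8 RAM bytes are subtracted, and the explicit % 256 on the same line suggests the wraparound is intended. If the log noise is bothersome, it can be silenced along these lines:

    import warnings
    # silence only this specific RuntimeWarning coming from the Mario environment
    warnings.filterwarnings("ignore", message="overflow encountered in ubyte_scalars")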

Connection error

I got a connection error when running the examples in the synthetic directory. The training process seems to keep going, but the connection error appears in every episode.

[screenshot: connection error traceback]
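
The synthetic training scripts plot through visdom, which expects a server at localhost:8097 (visdom's default port); the errors in the screenshot are most likely the client failing to reach that server while training itself continues. A likely fix, assuming a standard visdom install, is to start the server in another terminal before training:

    python -m visdom.server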

TypeError: mul(): argument 'other' (position 1) must be Tensor, not numpy.ndarray

@RunzheYang: Any chance you saw this error when running the evaluation script for the FTN environment?

Traceback (most recent call last):
  File "test/eval_ft.py", line 387, in <module>
    qc = hq.data[0] * w_e
TypeError: mul(): argument 'other' (position 1) must be Tensor, not numpy.ndarray

It looks like it isn't affecting the Pareto frontier plots and could also be fixed by just converting the numpy array to a PyTorch tensor, but I'm wondering if it has to do with my environment setup.
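
For what it's worth, the workaround the reporter describes would look something like the following (a sketch against test/eval_ft.py, assuming w_e is the numpy array the error message points at):

    import torch
    # convert the numpy preference vector to a tensor before the element-wise product
    qc = hq.data[0] * torch.from_numpy(w_e).float()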

issue with probe and preference

  1. In train.py, there is:

         if args.env_name == "dst":
             probe = FloatTensor([0.8, 0.2])

     A probe variable is manually defined here. My understanding is that it is the preference omega (ω) mentioned in the paper. But in agent.act() in meta.py, self.w_kept is generated randomly and assigned via preference = self.w_kept, and this preference is then passed into self.model_() (an instance of EnvelopeLinearCQN) to compute the Q-value. Moreover, agent.memorize() in train.py also uses a random initialization, preference = torch.randn(self.model_.reward_size), which is different from the preference used in agent.act().

     It confuses me: if both probe and preference represent the weight vector ω, why not pass the probe variable directly into agent.act() and agent.memorize()? Why do agent.act() and agent.memorize() randomly generate a preference in every cycle?

  2. In meta.py, there is:

         _, Q = self.model(
             Variable(state.unsqueeze(0)),
             Variable(preference.unsqueeze(0)))

     In this call, self.model_(state, preference) gets two return values, hq and q, from the class EnvelopeLinearCQN in linear.py. I would argue that hq should be used as the result of the operator H, but agent.act() in meta.py only takes the q value. That is another confusion.

  3. In train.py, when the while loop ends, there is the command

         _, q = agent.predict(probe)

     Why is the preset probe used as the parameter to predict q, when the predict() function is

         def predict(self, probe):
             return self.model(Variable(FloatTensor([0, 0]).unsqueeze(0), requires_grad=False),
                               Variable(probe.unsqueeze(0), requires_grad=False))

     which uses the static value FloatTensor([0, 0]) as the state?

Why the transpose on total_state?

In the multimario double envelope implementation, the total_state data is transposed like so:

            total_state = np.stack(total_state).transpose(
                [1, 0, 2, 3, 4]).reshape([-1, 4, 84, 84])

What is the intention here?

As I understand it, total_state is 5 batches of states from the GAE loop prior to the training section, which is then unbatched into a single array of states for processing. But it isn't clear to me what the transpose is there for.

Thanks in advance.
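
A hedged reading of that line (not confirmed by the author): np.stack(total_state) has shape (num_step, num_worker, 4, 84, 84), so transposing to (num_worker, num_step, 4, 84, 84) before the reshape makes each worker's num_step frames contiguous in the flattened batch, instead of interleaving workers step by step. A small numpy check of that reordering, with made-up sizes:

    import numpy as np

    num_step, num_worker = 5, 3                       # made-up sizes for illustration
    total_state = [np.random.rand(num_worker, 4, 84, 84) for _ in range(num_step)]

    stacked = np.stack(total_state)                   # (num_step, num_worker, 4, 84, 84)
    batch = stacked.transpose([1, 0, 2, 3, 4]).reshape([-1, 4, 84, 84])

    # rows 0..num_step-1 of `batch` are worker 0's trajectory, in time order
    assert np.array_equal(batch[:num_step], stacked[:, 0])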

What's the difference between the synthetic/crl/energy and synthetic/crl/envelope

Sorry, I have a question: why not directly output HQ in learn() in synthetic/crl/envelope/meta.py?

__, Q = self.model_(Variable(torch.cat(state_batch, dim=0)),
                    Variable(w_batch), w_num=self.weight_num)
# detach since we don't want gradients to propagate
# HQ, _ = self.model_(Variable(torch.cat(next_state_batch, dim=0), volatile=True),
#                     Variable(w_batch, volatile=True), w_num=self.weight_num)
_, DQ = self.model(Variable(torch.cat(next_state_batch, dim=0), requires_grad=False),
                   Variable(w_batch, requires_grad=False))

but in learn() in synthetic/crl/energy/meta.py it is:

__, Q = self.model(Variable(torch.cat(state_batch, dim=0)),
                   Variable(preference_batch), w_num=self.weight_num)
# detach since we don't want gradients to propagate
HQ, _ = self.model(Variable(torch.cat(next_state_batch, dim=0)),
                   Variable(preference_batch), w_num=self.weight_num)

Why does getting HQ take two different approaches, and what is the difference between them?
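
A hedged observation from the two snippets alone (not the author's answer): in the energy version, HQ comes straight from the same forward pass as Q, i.e. the network's own envelope output for the next states. In the envelope version that line is commented out, and HQ is instead assembled by choosing the greedy action with one network (self.model_, the one whose Q appears in the loss above) and reading the vector values from the other (self.model), a double-Q-style decoupling of action selection from evaluation. Whether that is the intended difference, only the author can confirm.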

UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead

I repeatedly get this deprecation warning in the logs during training.

../aten/src/ATen/native/LegacyDefinitions.cpp:67: UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.

Here is my pip list:

(morl) Kasim:MORL Kasim$ pip list
Package          Version            
---------------- -------------------
certifi          2019.9.11          
chardet          3.0.4              
idna             2.8                
jsonpatch        1.24               
jsonpointer      2.0                
numpy            1.17.4             
Pillow           6.2.1              
pip              19.3.1             
pyzmq            18.1.1             
requests         2.22.0             
scipy            1.3.3              
setuptools       42.0.1.post20191125
six              1.13.0             
torch            1.3.1              
torchfile        0.1.0              
tornado          6.0.3              
urllib3          1.25.7             
visdom           0.1.8.9            
websocket-client 0.56.0             
wheel            0.33.6             
(morl) Kasim:MORL Kasim$ 

Should I be using a different version of torch?
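
This warning is expected rather than fatal (a hedged note, not the maintainer's answer): the synthetic code targets torch 0.4.x per the README, where uint8 (ByteTensor) masks were the norm, while torch 1.3.1 prefers bool masks. Training still runs; to silence it, one would cast the mask where the warning originates, along the lines of the following (values and mask are hypothetical names):

    # hypothetical one-line change at the masked_select call site
    selected = values.masked_select(mask.bool())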

Error when running synthetic example line: FileNotFoundError: [Errno 2] No such file or directory: 'crl/envelope/logs/m.linear_e.ft_n.0.log'

I get the following errors when running the synthetic example line from the README. Is this an issue with my environment?

Environment:

  • Python 3.7
  • Installed numpy, torch, visdom.

Error log:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x123635fd0>: Failed to establish a new connection: [Errno 61] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/visdom/__init__.py", line 711, in _send
    data=json.dumps(msg),
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/visdom/__init__.py", line 677, in _handle_post
    r = self.session.post(url, data=data)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x123635fd0>: Failed to establish a new connection: [Errno 61] Connection refused'))
Traceback (most recent call last):
  File "train.py", line 155, in <module>
    train(env, agent, args)
  File "train.py", line 60, in train
    monitor.init_log(args.log, "m.{}_e.{}_n.{}".format(args.model, args.env_name, args.name))
  File "/Users/Kasim/Projects/ml/MORL/synthetic/utils/monitor.py", line 54, in init_log
    self.log_file = open("{}{}.log".format(save_path, name), 'w')
FileNotFoundError: [Errno 2] No such file or directory: 'crl/envelope/logs/m.linear_e.ft_n.0.log'
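
Both tracebacks point at the environment setup rather than the algorithm (a hedged suggestion, not a confirmed fix): the connection errors mean no visdom server is listening on localhost:8097, and the FileNotFoundError means the directory passed via --log does not exist yet, since monitor.py opens the log file without creating its directory. A likely remedy is to start visdom first (python -m visdom.server) and create the output directories, e.g.:

    import os
    # create the --save and --log directories from the README command if missing
    for d in ("crl/envelope/saved", "crl/envelope/logs"):
        os.makedirs(d, exist_ok=True)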
