
morl's Introduction

Deep Multi-Objective Reinforcement Learning

A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation, NeurIPS'19.

Abstract

We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After this initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.

Instructions

The experiments cover two synthetic domains, Deep Sea Treasure (DST) and Fruit Tree Navigation (FTN), as well as two complex real domains, Task-Oriented Dialog Policy Learning (Dialog) and the SuperMario game (SuperMario).

synthetic

The code in synthetic was written for torch 0.4.0 (apologies for the two-year-old code) with Python 3.5; the visdom version is 0.1.6.3.

  • Example - train envelope MOQ-learning on FTN domain:
    python train.py --env-name ft --method crl-envelope --model linear --gamma 0.99 --mem-size 4000 --batch-size 256 --lr 1e-3 --epsilon 0.5 --epsilon-decay --weight-num 32 --episode-num 5000 --optimizer Adam --save crl/envelope/saved/ --log crl/envelope/logs/ --update-freq 100 --beta 0.01 --name 0

The code for our envelope MOQ-learning algorithm is in synthetic/crl/envelope/meta.py, and the neural network architectures are configurable in synthetic/crl/envelope/models. The two synthetic environments are under synthetic/envs.
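
For orientation, here is a minimal, self-contained sketch of the envelope update idea. It is an illustration only, not the repository code; the tensor shapes, the function name envelope_target, and the omission of terminal-state handling are assumptions. For every sampled preference ω, the scalarized value ωᵀQ is maximized jointly over actions and over all sampled preferences, and the maximizing vector-valued Q is used in the Bellman backup:

    import torch

    def envelope_target(q_next, w, rewards, gamma=0.99):
        # q_next:  (B, W, A, R) Q-value vectors for B next states, W sampled
        #          preferences, A actions, and R reward dimensions
        # w:       (W, R) sampled linear preferences
        # rewards: (B, R) vector rewards of the sampled transitions
        B, W, A, R = q_next.shape
        # Scalarize every Q-vector with every preference: scal[b, i, j, a] = w[i] . q_next[b, j, a]
        scal = torch.einsum('ir,bjar->bija', w, q_next)
        # Envelope max: for each (state, preference i) pick the best (preference j, action a) pair
        idx = scal.reshape(B, W, W * A).argmax(dim=-1)        # (B, W)
        j_idx, a_idx = idx // A, idx % A
        b_idx = torch.arange(B)[:, None].expand(B, W)
        hq = q_next[b_idx, j_idx, a_idx]                      # (B, W, R) maximizing Q-vectors
        # Vector-valued Bellman backup (terminal-state masking omitted for brevity)
        return rewards[:, None, :] + gamma * hq

In the repository this logic lives inside learn() in meta.py, interleaved with target-network and loss bookkeeping, so the actual code reads differently (see the issues below).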

pydial

Code for Task-Oriented Dialog Policy Learning. The environment is modified from PyDial.

The code in pydial was written for torch 0.4.1 with Python 2.7 (since PyDial requires Python 2).

  • Example - train envelope MOQ-learning on Dialog domain:
    pydial train config/MORL/Train/envelope.cfg

The code for our envelope MOQ-learning algorithm is in pydial/policy/envelope.py.

multimario

The multi-objective version of the SuperMario game. The environment is modified from Kautenja/gym-super-mario-bros.

The code in multimario was written for torch 1.1.0 with Python 3.5.

  • Example - train envelope MOQ-learning on SuperMario domain:
    python run_e3c_double.py --env-id SuperMarioBros-v2 --use-cuda --use-gae --life-done --single-stage --training --standardization --num-worker 16 --sample-size 8 --beta 0.05 --name e3c_b05

The code for our envelope MOQ-learning algorithm is in multimario/agent.py. The two multi-objective versions of the environment are in multimario/env.py.

Citation

@incollection{yang2019morl,
  title = {A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation},
  author = {Yang, Runzhe and Sun, Xingyuan and Narasimhan, Karthik},
  booktitle = {Advances in Neural Information Processing Systems 32},
  editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
  pages = {14610--14621},
  year = {2019},
  publisher = {Curran Associates, Inc.},
  url = {http://papers.nips.cc/paper/9605-a-generalized-algorithm-for-multi-objective-reinforcement-learning-and-policy-adaptation.pdf}
}

morl's People

Contributors

kasimte, runzheyang

morl's Issues

The results for DST environment

Hello, I tried the synthetic code on the FTN env and the test results are close to the paper. However, for the DST env I get F1 scores of 0.77 and 0.66 for two runs, while the paper reports much higher values. Do the settings below (the setting in the README file) apply to the DST env too?
(I only changed --env-name to dst and --episode-num to 2000.)

python train.py --env-name dst --method crl-envelope --model linear --gamma 0.99 --mem-size 4000 --batch-size 256 --lr 1e-3 --epsilon 0.5 --epsilon-decay --weight-num 32 --episode-num 2000 --optimizer Adam --save crl/envelope/saved/ --log crl/envelope/logs/ --update-freq 100 --beta 0.01 --name 0

What is envemask doing inside the envelop_operator()?

Hi Runzhe, I couldn't understand what you are doing with this envelope mask. Why did you add that np.array here?

envemask = envemask.reshape(-1) * ofs + np.array(list(range(ofs))*args.sample_size)

Is the envemask related to this part of the pseudocode?

[image: pseudocode excerpt from the paper]

Why did you choose w' rather than w_i inside the value function?

I think this is the core of your algorithm but I am a bit confused TT
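
As a general note on the index arithmetic (a hedged illustration only; the actual shapes in run_e3c_double.py are not reproduced here): for a row-major array with ofs columns, row * ofs + column converts a per-column row choice into flat indices, so indexing the flattened array at those positions equals the fancy indexing arr[rows, cols]. A tiny numpy demonstration of that mechanic, with made-up sizes:

    import numpy as np

    ofs, sample_size = 4, 3                      # made-up sizes for illustration
    arr = np.arange(sample_size * ofs).reshape(sample_size, ofs)
    rows = np.array([2, 0, 1, 2])                # chosen row (e.g. an argmax) per column
    cols = np.arange(ofs)

    flat = rows * ofs + cols                     # flat indices into arr.reshape(-1)
    assert np.array_equal(arr.reshape(-1)[flat], arr[rows, cols])

The np.array(list(range(ofs)) * args.sample_size) term in the quoted line appears to play the role of the column indices, tiled once per sample block.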

Agents are stuck at the green pillar

I'm trying to run 'run_e2c_double.py', and I found that all of the agents get stuck at the same green pillar. Any idea about this problem? I have run the code for about 12 hours. Is it possible that I need to spend more time training the net, or is it caused by instability in the training process?
[screenshot: agents stuck at the green pillar]

Issue with synthetic\roijers_train.py

Hello,

In your file roijers_train.py, you have:

agent.memorize(state, action, next_state, reward, terminal, roi=True)
loss += agent.learn(corner_w)

But your code in the agent does not take the input roi in memorize or corner_w in learn. It can only be run like this:

loss += agent.learn()
agent.memorize(state, action, next_state, reward, terminal)

Thanks!

The operation of synthetic/crl/envelope/meta.py

Hello,

I am trying to read the MORL paper and your code.

I'm currently looking at synthetic/crl/envelope/meta.py,
but in the learn() function, it seems the H() operator is implemented differently from your paper.

# detach since we don't want gradients to propagate
# HQ, _ = self.model_(Variable(torch.cat(next_state_batch, dim=0), volatile=True),
#                     Variable(w_batch, volatile=True), w_num=self.weight_num)
_, DQ = self.model(Variable(torch.cat(next_state_batch, dim=0), requires_grad=False),
                   Variable(w_batch, requires_grad=False))
w_ext = w_batch.unsqueeze(2).repeat(1, action_size, 1)
w_ext = w_ext.view(-1, self.model.reward_size)
_, tmpQ = self.model_(Variable(torch.cat(next_state_batch, dim=0), requires_grad=False),
                      Variable(w_batch, requires_grad=False))
tmpQ = tmpQ.view(-1, reward_size)
# print(torch.bmm(w_ext.unsqueeze(1),
#                 tmpQ.data.unsqueeze(2)).view(-1, action_size))
act = torch.bmm(Variable(w_ext.unsqueeze(1), requires_grad=False),
                tmpQ.unsqueeze(2)).view(-1, action_size).max(1)[1]
HQ = DQ.gather(1, act.view(-1, 1, 1).expand(DQ.size(0), 1, DQ.size(2))).squeeze()

In the above code, it just takes an inner product of w_ext and tmpQ, then argmax over actions, like the equation below.

[equation image: argmax over actions of the scalarized inner product]

I think it behaves differently from the paper, which is described as

[equation image: the H operator as defined in the paper]

Actually, I see the commented lines 191~192 in your code, and they seem to behave consistently with your paper.
It seems that you intentionally commented them out and wrote a new version for some purpose.

Could you explain why you did this? Thanks!
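
For readers skimming this thread, the two forms being contrasted are roughly as follows (our paraphrase of the quoted code and of the paper, not the author's reply):

    paper (roughly):  HQ(s, ω) = arg_Q max over a and ω' of  ωᵀ Q(s, a, ω')
    code above:       a* = argmax over a of  ωᵀ tmpQ(s', a, ω),  then  HQ = DQ(s', a*, ω)

That is, the code shown selects the greedy action with one network under the same ω and reads the vector value from the other network, a double-Q-style construction that does not explicitly maximize over ω'; only the author can confirm the reason for the change.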

RuntimeWarning: overflow encountered in ubyte_scalars

I'm trying to run the sample command from the README for multimario and I get this warning repeatedly in the console:

(multimario) Kasim:multimario Kasim$ python run_e3c_double.py --env-id SuperMarioBros-v2 --use-gae --life-done --single-stage --training --standardization --num-worker 16 --sample-size 8 --beta 0.05 --name e3c_b05
/Users/Kasim/opt/anaconda3/envs/multimario/lib/python3.7/site-packages/gym_super_mario_bros/smb_env.py:148: RuntimeWarning: overflow encountered in ubyte_scalars
return (self.ram[0x86] - self.ram[0x071c]) % 256
/Users/Kasim/opt/anaconda3/envs/multimario/lib/python3.7/site-packages/gym_super_mario_bros/smb_env.py:148: RuntimeWarning: overflow encountered in ubyte_scalars
return (self.ram[0x86] - self.ram[0x071c]) % 256

Is this of concern?
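
This looks benign rather than a bug in the MORL code (an editorial guess, not a confirmed answer): the warning is raised inside gym_super_mario_bros when two uint8 RAM bytes are subtracted, and the explicit % 256 on the same line suggests the wraparound is intended. If the log noise is bothersome, it can be silenced along these lines:

    import warnings
    # silence only this specific RuntimeWarning coming from the Mario environment
    warnings.filterwarnings("ignore", message="overflow encountered in ubyte_scalars")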

Connection error

I got a connection error when running the examples in the synthetic directory. The training process seems to keep going, but the connection error appears in every episode.

[screenshot: connection error traceback]
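
The synthetic training scripts plot through visdom, which expects a server at localhost:8097 (visdom's default port); the errors in the screenshot are most likely the client failing to reach that server while training itself continues. A likely fix, assuming a standard visdom install, is to start the server in another terminal before training:

    python -m visdom.server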

TypeError: mul(): argument 'other' (position 1) must be Tensor, not numpy.ndarray

@RunzheYang: Any chance you saw this error when running the evaluation script for the FTN environment?

Traceback (most recent call last):
  File "test/eval_ft.py", line 387, in <module>
    qc = hq.data[0] * w_e
TypeError: mul(): argument 'other' (position 1) must be Tensor, not numpy.ndarray

It looks like it isn't affecting the Pareto frontier plots and could also be fixed by just converting the numpy array to a PyTorch tensor, but I'm wondering if it has to do with my environment setup.
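
For what it's worth, the workaround the reporter describes would look something like the following (a sketch against test/eval_ft.py, assuming w_e is the numpy array the error message points at):

    import torch
    # convert the numpy preference vector to a tensor before the element-wise product
    qc = hq.data[0] * torch.from_numpy(w_e).float()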

issue with probe and preference

  1. In train.py, there is:

         if args.env_name == "dst":
             probe = FloatTensor([0.8, 0.2])

     A probe variable is manually defined here. My understanding is that it is the preference omega (ω) mentioned in the paper. But in agent.act() in meta.py, self.w_kept is generated randomly and assigned via preference = self.w_kept, and this preference is then passed into self.model_() (an instance of EnvelopeLinearCQN) to compute the Q-value. Moreover, agent.memorize() in train.py also uses a random initialization, preference = torch.randn(self.model_.reward_size), which is different from the preference used in agent.act().

     It confuses me: if both probe and preference represent the weight vector ω, why not pass the probe variable directly into agent.act() and agent.memorize()? Why do agent.act() and agent.memorize() randomly generate a preference in every cycle?

  2. In meta.py, there is:

         _, Q = self.model(
             Variable(state.unsqueeze(0)),
             Variable(preference.unsqueeze(0)))

     In this call, self.model_(state, preference) gets two return values, hq and q, from the class EnvelopeLinearCQN in linear.py. I would argue that hq should be used as the result of the operator H, but agent.act() in meta.py only takes the q value. That is another confusion.

  3. In train.py, when the while loop ends, there is the command

         _, q = agent.predict(probe)

     Why is the preset probe used as the parameter to predict q, when the predict() function is

         def predict(self, probe):
             return self.model(Variable(FloatTensor([0, 0]).unsqueeze(0), requires_grad=False),
                               Variable(probe.unsqueeze(0), requires_grad=False))

     which uses the static value FloatTensor([0, 0]) as the state?

Why the transpose on total_state?

In the multimario double envelope implementation, the total_state data is transposed like so:

            total_state = np.stack(total_state).transpose(
                [1, 0, 2, 3, 4]).reshape([-1, 4, 84, 84])

What is the intention here?

As I understand it, total_state is 5 batches of states from the GAE loop prior to the training section, which is then unbatched into a single array of states for processing. But it isn't clear to me what the transpose is there for.

Thanks in advance.
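
A hedged reading of that line (not confirmed by the author): np.stack(total_state) has shape (num_step, num_worker, 4, 84, 84), so transposing to (num_worker, num_step, 4, 84, 84) before the reshape makes each worker's num_step frames contiguous in the flattened batch, instead of interleaving workers step by step. A small numpy check of that reordering, with made-up sizes:

    import numpy as np

    num_step, num_worker = 5, 3                       # made-up sizes for illustration
    total_state = [np.random.rand(num_worker, 4, 84, 84) for _ in range(num_step)]

    stacked = np.stack(total_state)                   # (num_step, num_worker, 4, 84, 84)
    batch = stacked.transpose([1, 0, 2, 3, 4]).reshape([-1, 4, 84, 84])

    # rows 0..num_step-1 of `batch` are worker 0's trajectory, in time order
    assert np.array_equal(batch[:num_step], stacked[:, 0])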

What's the difference between the synthetic/crl/energy and synthetic/crl/envelope

Sorry, I have a question: why not directly output HQ in learn() in synthetic/crl/envelope/meta.py?

__, Q = self.model_(Variable(torch.cat(state_batch, dim=0)),
                    Variable(w_batch), w_num=self.weight_num)
# detach since we don't want gradients to propagate
# HQ, _ = self.model_(Variable(torch.cat(next_state_batch, dim=0), volatile=True),
#                     Variable(w_batch, volatile=True), w_num=self.weight_num)
_, DQ = self.model(Variable(torch.cat(next_state_batch, dim=0), requires_grad=False),
                   Variable(w_batch, requires_grad=False))

but in learn() in synthetic/crl/energy/meta.py it is:

__, Q = self.model(Variable(torch.cat(state_batch, dim=0)),
                   Variable(preference_batch), w_num=self.weight_num)
# detach since we don't want gradients to propagate
HQ, _ = self.model(Variable(torch.cat(next_state_batch, dim=0)),
                   Variable(preference_batch), w_num=self.weight_num)

Why does getting HQ take two different approaches, and what is the difference between them?
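
A hedged observation from the two snippets alone (not the author's answer): in the energy version, HQ comes straight from the same forward pass as Q, i.e. the network's own envelope output for the next states. In the envelope version that line is commented out, and HQ is instead assembled by choosing the greedy action with one network (self.model_, the one whose Q appears in the loss above) and reading the vector values from the other (self.model), a double-Q-style decoupling of action selection from evaluation. Whether that is the intended difference, only the author can confirm.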

UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead

I repeatedly get this deprecation warning in the logs during training.

../aten/src/ATen/native/LegacyDefinitions.cpp:67: UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead.

Here is my pip list:

(morl) Kasim:MORL Kasim$ pip list
Package          Version            
---------------- -------------------
certifi          2019.9.11          
chardet          3.0.4              
idna             2.8                
jsonpatch        1.24               
jsonpointer      2.0                
numpy            1.17.4             
Pillow           6.2.1              
pip              19.3.1             
pyzmq            18.1.1             
requests         2.22.0             
scipy            1.3.3              
setuptools       42.0.1.post20191125
six              1.13.0             
torch            1.3.1              
torchfile        0.1.0              
tornado          6.0.3              
urllib3          1.25.7             
visdom           0.1.8.9            
websocket-client 0.56.0             
wheel            0.33.6             
(morl) Kasim:MORL Kasim$ 

Should I be using a different version of torch?
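
This warning is expected rather than fatal (a hedged note, not the maintainer's answer): the synthetic code targets torch 0.4.x per the README, where uint8 (ByteTensor) masks were the norm, while torch 1.3.1 prefers bool masks. Training still runs; to silence it, one would cast the mask where the warning originates, along the lines of the following (values and mask are hypothetical names):

    # hypothetical one-line change at the masked_select call site
    selected = values.masked_select(mask.bool())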

Error when running synthetic example line: FileNotFoundError: [Errno 2] No such file or directory: 'crl/envelope/logs/m.linear_e.ft_n.0.log'

I get the following errors when running the synthetic example line from the README. Is this an issue with my environment?

Environment:

  • Python 3.7
  • Installed numpy, torch, visdom.

Error log:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x123635fd0>: Failed to establish a new connection: [Errno 61] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/visdom/__init__.py", line 711, in _send
    data=json.dumps(msg),
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/visdom/__init__.py", line 677, in _handle_post
    r = self.session.post(url, data=data)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/sessions.py", line 581, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Users/Kasim/opt/anaconda3/envs/morl/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8097): Max retries exceeded with url: /events (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x123635fd0>: Failed to establish a new connection: [Errno 61] Connection refused'))
Traceback (most recent call last):
  File "train.py", line 155, in <module>
    train(env, agent, args)
  File "train.py", line 60, in train
    monitor.init_log(args.log, "m.{}_e.{}_n.{}".format(args.model, args.env_name, args.name))
  File "/Users/Kasim/Projects/ml/MORL/synthetic/utils/monitor.py", line 54, in init_log
    self.log_file = open("{}{}.log".format(save_path, name), 'w')
FileNotFoundError: [Errno 2] No such file or directory: 'crl/envelope/logs/m.linear_e.ft_n.0.log'
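
Both tracebacks point at the environment setup rather than the algorithm (a hedged suggestion, not a confirmed fix): the connection errors mean no visdom server is listening on localhost:8097, and the FileNotFoundError means the directory passed via --log does not exist yet, since monitor.py opens the log file without creating its directory. A likely remedy is to start visdom first (python -m visdom.server) and create the output directories, e.g.:

    import os
    # create the --save and --log directories from the README command if missing
    for d in ("crl/envelope/saved", "crl/envelope/logs"):
        os.makedirs(d, exist_ok=True)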
