hongzimao / deeprm
Resource Management with Deep Reinforcement Learning (HotNets '16)
License: MIT License
Hi,
This is excellent work, and I would like to carry out some experiments (non-commercial, of course) using this repository. Could you please provide a license for it?
Thanks
Mr. Mao,
Could you explain in more detail what exactly the parameter below does in the parameters class? (I would appreciate an example.)
Line 31 in b42eff0
In the paper, there is a discussion of the synthetic workload. If I want to use a different workload that reflects my own, I think I need to create my own version of generate_sequence_work. Is this the correct place to make the change?
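For anyone attempting the same, a hypothetical sketch of a replacement for generate_sequence_work is shown below. The parameter names (pa.simu_len, pa.num_ex, pa.new_job_rate, pa.max_job_len, pa.max_job_size, pa.num_res) and the returned (length, size) arrays are assumptions based on this discussion, not the repository's exact interface; the lines to customize are where the duration and resource demands are drawn.

```python
import numpy as np

def generate_sequence_work(pa, seed=42):
    # Hypothetical sketch: substitute your own workload distribution here.
    # Assumes the function returns (job_length_seq, job_size_seq) arrays;
    # check pg_su.py / pg_re.py for the actual interface before adapting.
    np.random.seed(seed)
    simu_len = pa.simu_len * pa.num_ex

    nw_len_seq = np.zeros(simu_len, dtype=int)
    nw_size_seq = np.zeros((simu_len, pa.num_res), dtype=int)

    for i in range(simu_len):
        if np.random.rand() < pa.new_job_rate:  # a job arrives at this step
            # draw duration and per-resource demand from YOUR trace or
            # distribution instead of these uniform placeholders
            nw_len_seq[i] = np.random.randint(1, pa.max_job_len + 1)
            nw_size_seq[i, :] = np.random.randint(1, pa.max_job_size + 1,
                                                  size=pa.num_res)
    return nw_len_seq, nw_size_seq
```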
Dear Hongzi,
I am trying to reproduce all the results reported in the paper. From the source code, it is unclear how to plot the slowdown from 10% to 190% cluster load. When I run run_script.py, I can see the generated logs, but nothing corresponds to Figure 4.
Can you please give a detailed explanation of how you plot the slowdown for cluster loads from 10% to 190%? From the source code, it is clear that you rely on the job rate from 0.1 to 1.0 to vary the load from 10% to 190%, but when I tried varying just the job rate from 0.1 to 1.0 and swept the cluster load from 10% to 190%, the slowdown stayed constant from 100% all the way to 190%.
Thank You.
Following the recommendation to post an (adapted) e-mail conversation on the issues page so others can also learn from it and discuss. Regarding studies of how inaccuracies in jobs' runtimes could affect the RL agent and the overall scheduler performance:
I'm very interested in investigating how well the reinforcement learning agent performs under different job models; in particular, at first, jobs of uncertain length, i.e., experimenting with some of the partial-observability discussion in Section 5 of the paper. Because casting the problem as a POMDP would require a set of conditional observation probabilities, I thought of a preliminary methodology that would, at first, randomly choose the reward as either the original one, which uses the true job length, or a modified one, which uses the true job length + 1 in its calculations. With this I'm planning to test the RL agent's robustness to some uncertainty in the job length.
I saw you were very helpful in answering questions in the repository's issues section (reading your answers helped me a lot in understanding the code), so I decided to write this e-mail to ask whether, if you have the time, you could point out any immediate methodological flaws in my approach. I really appreciate any thoughts you can provide.
Sincerely,
Vinícius [...]
Hi Vinicius,
I see what you are trying to do. The high-level goal of training a robust agent makes pretty decent sense. I wonder, though, whether a consistent +1 in the reward will create enough disturbance. You might want to perturb the reward signal with noise sampled from some distribution (which can have some bias, as in your +1 case). You can vary the distribution and see how it affects the system.
It would be nice if you could post this on the GitHub issues page so that others can also learn from it.
Thanks,
Hongzi
Since then I've had some very interesting results creating disturbance in the reward using normal distributions, and my intention is to also check uniform and half-normal distributions, since it's known that users' runtime estimates are almost always overestimates. However, some very interesting concerns and issues are appearing:
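The perturbation scheme discussed in this thread could be sketched as follows. This is an illustrative sketch, not code from the repository: `bias` plays the role of the systematic +1 offset, `sigma` controls the spread, and `dist` selects among the noise families being compared.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_reward(reward, bias=0.0, sigma=0.5, dist="normal"):
    """Add noise to the reward signal to mimic uncertain job lengths.

    Hypothetical helper, not part of deeprm: `bias` models a systematic
    offset (as in the +1 case above) and `dist` picks the noise family.
    """
    if dist == "normal":
        noise = rng.normal(bias, sigma)
    elif dist == "uniform":
        noise = rng.uniform(bias - sigma, bias + sigma)
    elif dist == "halfnormal":
        # half-normal: one-sided noise, since runtime estimates
        # tend to be overestimates
        noise = bias + abs(rng.normal(0.0, sigma))
    else:
        raise ValueError("unknown distribution: %s" % dist)
    return reward + noise
```

Varying `dist` and `sigma` while keeping the training loop fixed is one way to compare the agent's robustness across noise families.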
Hi, I have read your paper and want to try it out, but when I run the demo command, the training process seems to be very slow.
The demo command:
# python launcher.py --exp_type=pg_su --simu_len=50 --num_ex=1000 --ofile=data/pg_su --out_freq=10
And here is the output log:
Epoch 1 of 10000 took 297.221s
training loss: 0.838964
training accuracy: 78.30 %
test loss: 0.802981
test accuracy: 79.96 %
...
So if each epoch takes about 5 minutes, the whole 10000 epochs will take about a month. Is this normal?
Also, when I ran the command, I found the process only used one physical core of my server machine; this seems to be the key cause of the slow training.
Any suggestion to make the training faster? Or is there anything wrong with my understanding of the training process?
Thanks in advance!
File "C:\Users\76047\Desktop\学习材料\cloud computing\p3code\deeprm\pg_su.py", line 157, in launch
net_file = open(pa.output_filename + 'net_file' + str(epoch) + '.pkl', 'wb')
FileNotFoundError: [Errno 2] No such file or directory: 'data/pg_su_net_file_0.pkl'
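A likely cause of the FileNotFoundError above is that the `data/` output directory does not exist yet, so `open(..., 'wb')` cannot create the pickle file inside it. A minimal sketch of a fix, assuming `pa.output_filename` begins with `data/` as the error message suggests:

```python
import os

# Create the output directory before opening the pickle file for writing.
# 'data/pg_su' is a hypothetical value matching the error message above;
# adjust it to your actual pa.output_filename.
output_filename = 'data/pg_su'
out_dir = os.path.dirname(output_filename)
if out_dir and not os.path.isdir(out_dir):
    os.makedirs(out_dir)

# now the open() call from pg_su.py succeeds
with open(output_filename + '_net_file_0.pkl', 'wb') as net_file:
    pass  # pg_su.py would dump the network parameters here
```

Alternatively, simply create the directory by hand (`mkdir data`) before launching the script.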
Dear Hongzi,
Following your suggestion, I have opened this issue to share the questions and answers we exchanged via email. Further to the information provided below, I just wanted to clarify whether running the supervised part of the experiment is optional. The way I see it, it is probably better to provide the agent with some kind of heuristic policy to "kick off" its learning process. Following this assumption, I can see that you used a specific pkl file generated during the supervised phase to feed into the reinforcement learning phase. How did you select it? Did you compare the accuracy and error of the training vs. test sets and use the one with the minimum difference? Similarly, did you choose a specific pkl file generated during the reinforcement learning phase, i.e., iteration 1600, because after 1000 iterations the algorithm has already converged? Last, I wanted to clarify whether using a larger working space is expected to increase the complexity of the algorithm, and whether, once the backlog is full, any further incoming jobs are simply rejected.
I apologize for all the questions ;) I do hope they help further too. Thank you so much!
Suggested basic RL reading: https://docs.google.com/document/d/1H8lDmHlj5_BHwaQeGSXfyjwf4ball9f1VutNBXCOsJE/edit?usp=sharing
Q1: What is the difference between the 1st and the 2nd type of training, i.e., --exp_type=pg_su vs. pg_re? As far as I understand, the first is used to create num_ex sample experiments, each consisting of a number of jobs that arrive within a given timeframe (episode_max_length), scheduled using the SJF algorithm. The results are then fed into the DeepRM algorithm to adjust the weights of the network/parameters. The second is used to train the RL algorithm starting from the DNN weights learned above and the penalties defined.
Answer: You are basically correct. The first type is supervised learning, where we generate the state-action pair from existing heuristics (e.g., SJF) and ask the agent to mimic the policy. The second type is RL training—the agent will explore and see which policy is better and automatically adjust its policy parameter to get larger rewards.
Q2: How did you decide what type of network to use? You have used a DNN with one dense, hidden layer of 20 neurons. Were there any particular reasons for these choices? Have you tried different variations of them?
Answer: We did some parameter search but not too much. As long as the model is rich enough to express strong scheduling policies (e.g., can learn existing heuristics with supervised learning), we will use the network model for RL.
Q3: Was there any problem with overfitting the data and if so, would you have any further suggestions on this issue?
Answer: If the system dynamics change dramatically, there will be overfitting. In our paper, we evaluate on different job combinations but those jobs were generated from the same distribution. You might need to adapt (from a meta-learned policy) or learn a family of robust policies if you need the policy to work well with distribution shift.
Q4: I would expect the number of input neurons to equal (res_slot + max_job_slot * num_nw) * num_res. However, you also take into account the backlog_width and a bias. Could you please explain why you made that decision and what its purpose is? Also, what does backlog_width represent? I understand that the backlog stores jobs that have arrived for service but cannot fit in the current working space, but I cannot understand whether this is just a number, why it is important to include it as an input to the DNN, and why the extra jobs are not stored in, e.g., a file for later use.
Answer: The backlogged jobs are represented just as a count. The DNN needs a rough number to know the current system load, so we only provide that count to the neural network. The full job information is kept in the environment (it's just that the agent doesn't see it).
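The accounting in Q4 can be sketched numerically. The parameter names follow the question above, but the values are made up and the exact layout is an assumption, not the repository's definitive formula:

```python
# Illustrative input-width accounting using the parameter names from the
# question above; the numeric values are made up for this example.
res_slot      = 10  # columns of the cluster image, per resource
max_job_slot  = 10  # columns of one visible job slot
num_nw        = 5   # number of visible job slots
num_res       = 2   # number of resource types
backlog_width = 3   # columns encoding the backlog count
bias          = 1   # one extra bias input

# cluster + job-slot columns for every resource, plus the backlog
# summary and the bias term
network_input_width = (res_slot + max_job_slot * num_nw) * num_res \
                      + backlog_width + bias
```

With these example values the width works out to (10 + 10 * 5) * 2 + 3 + 1 = 124 columns, of which only 3 encode the backlog: enough for the network to sense the load, without exposing the individual backlogged jobs.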
Dear Mr. Hongzi,
I am interested in your resource-scheduling method, but I'm now stuck on your network class. I can't understand why you used the function below:
loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N
Did you design a special loss function? If not, what is the name of this loss function?
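For reference, that Theano expression is the standard REINFORCE (policy-gradient) objective: the average over the batch of log pi(a_t | s_t) weighted by the return. A NumPy sketch with made-up values:

```python
import numpy as np

# NumPy equivalent of the Theano line above:
#   loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N
# i.e. mean over N samples of log pi(a_t | s_t) * R_t.
# The numbers below are made up for illustration.
prob_act = np.array([[0.7, 0.3],
                     [0.2, 0.8],
                     [0.5, 0.5]])        # policy output, one row per state
actions  = np.array([0, 1, 1])           # actions actually taken
values   = np.array([1.0, -0.5, 2.0])    # (discounted) returns R_t
N = len(actions)

# pick the probability of the chosen action in each row, weight its log
# by the return, and average over the batch
loss = np.log(prob_act[np.arange(N), actions]).dot(values) / N
```

Ascending the gradient of this scalar increases the log-probability of actions that led to high returns, which is exactly the REINFORCE update.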
I noticed that when "seq_idx" is initialized to 0, the first job in the job sequence seems to be skipped.
Should it be initialized to -1?
Hey there!
Can you please tell me how to create a graph of the accuracy and loss?
Thanks in advance.
Hey there,
I tried to regenerate the graphs depicted in the article, Tetris is showing significantly lower performance than what is expected based on these graphs.
Another issue is that Packer is absent from all generated graphs ( why? ).
DeepRM's (or PG's in this case) average job slowdown is asymptotically around 2 in the article graphs while in mine is around 4.
All trainings and tests were done using the default commands described in the README.md file.
Is anyone here who could reproduce the exact results described in the article? I'd be thankful if you could help me with this issue.
Regards
Hi,
I'm trying to rerun your experiments in order to recreate the graphs. However, I'm getting errors when running the code from GitHub.
I'm running the command presented there:
python launcher.py --exp_type=pg_re --pg_re=data/pg_su_net_file_20.pkl --simu_len=50 --num_ex=10 --ofile=data/pg_re
But I'm getting an error:
Traceback (most recent call last):
File "launcher.py", line 10, in
import pg_su
File "/home/arik/deeprm/pg_su.py", line 12, in
np.set_printoptions(threshold='nan')
File "/home/arik/.local/lib/python2.7/site-packages/numpy/core/arrayprint.py", line 246, in set_printoptions
floatmode, legacy)
File "/home/arik/.local/lib/python2.7/site-packages/numpy/core/arrayprint.py", line 93, in _make_options_dict
raise ValueError("threshold must be numeric and non-NAN, try "
ValueError: threshold must be numeric and non-NAN, try sys.maxsize for untruncated representation
The versions:
pip list
asn1crypto (0.24.0)
backports.functools-lru-cache (1.4)
cryptography (2.1.4)
cycler (0.10.0)
decorator (4.1.2)
enum34 (1.1.6)
idna (2.6)
ipaddress (1.0.17)
keyring (10.6.0)
keyrings.alt (3.0)
Lasagne (0.2.dev1)
matplotlib (2.1.1)
nose (1.3.7)
numpy (1.16.2)
olefile (0.45.1)
Pillow (5.1.0)
pip (9.0.1)
pycrypto (2.6.1)
pygobject (3.26.1)
pyparsing (2.2.0)
python-dateutil (2.6.1)
pytz (2018.3)
pyxdg (0.25)
scipy (1.2.1)
SecretStorage (2.3.1)
setuptools (39.0.1)
six (1.12.0)
subprocess32 (3.2.7)
Theano (1.0.4)
wheel (0.30.0)
Do you know how to fix it?
Thanks.
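The traceback above points at the fix itself: newer NumPy versions reject the string `'nan'` as a print threshold and ask for a numeric value. A likely one-line change in pg_su.py (line 12, per the traceback):

```python
import sys
import numpy as np

# Replaces np.set_printoptions(threshold='nan'), which newer NumPy
# rejects; sys.maxsize gives the same untruncated array printing.
np.set_printoptions(threshold=sys.maxsize)
```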
Hello sir,
I converted your code, but now I need to verify that my conversion is correct. Could you post some pictures of the results (Monte Carlo) that your code prints (iteration, numtrajs, numtimesteps, loss, ...)?
Hi,
I'm running the example for the very first time.
I'm running the command presented there:
python launcher.py --exp_type=pg_re --pg_re=data/pg_su_net_file_20.pkl --simu_len=50 --num_ex=10 --ofile=data/pg_re
But I'm getting an error:
Traceback (most recent call last):
File "launcher.py", line 163, in
main()
File "launcher.py", line 149, in main
pg_re.launch(pa, pg_resume, render, repre='image', end='all_done')
File "/home/k8s-master/Desktop/Deeprm/deeprm-master/pg_re.py", line 315, in launch
for r in manager_result:
File "", line 2, in getitem
File "/usr/lib/python2.7/multiprocessing/managers.py", line 755, in _callmethod
self._connect()
File "/usr/lib/python2.7/multiprocessing/managers.py", line 742, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused
Do you know where I went wrong, and how to fix it?
Thanks.
Hello, I am a college student. I have read your paper and also run your code. At present I have some questions; they may not be very professional, but I hope to get your answers.
Thanks for your answer. :)
Since I guessed that you might be **, I originally asked this question in Chinese as well; my English may not express it well.
Hello author,
What does time_horizon mean? Actually, I don't understand what the graph means.
Another question: how is the time defined? You use current_time in place of the actual time, so how does the time move (or increase)?
Thank you for open-sourcing this. This code is the first I could find for resource management using DRL; thank you very much.
I am a student at Beijing Jiaotong University; my email is [email protected]. We can talk over email; looking forward to your reply.
Hi,
I'd like to test some scheduling algorithms on multiple machines. How do I set the number of machines (from 1 to 5) in the code?
Thanks,