rstrivedi / melting-pot-contest-2023
License: Apache License 2.0
2023-09-11 10:21:49,337 ERROR tune_controller.py:911 -- Trial task failed for trial PPO_meltingpot_fcb07_00000
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2495, in get
raise value
File "python/ray/_raylet.pyx", line 1787, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 1684, in ray._raylet.execute_task_with_cancellation_handler
File "python/ray/_raylet.pyx", line 1366, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1367, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 1583, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 864, in ray._raylet.store_task_errors
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=13902, ip=172.28.0.12, actor_id=7e027fca141b6dc2cdd8f15501000000, repr=PPO)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 517, in __init__
super().__init__(
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 169, in __init__
self.setup(copy.deepcopy(self.config))
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/algorithm.py", line 639, in setup
self.workers = WorkerSet(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/worker_set.py", line 179, in __init__
raise e.args[0].args[2]
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 525, in __init__
self._update_policy_map(policy_dict=self.policy_dict)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1727, in _update_policy_map
self._build_policy_map(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1838, in _build_policy_map
new_policy = create_policy_for_framework(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/policy.py", line 142, in create_policy_for_framework
return policy_class(observation_space, action_space, merged_config)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 64, in __init__
self._initialize_loss_from_dummy_batch()
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/policy.py", line 1418, in _initialize_loss_from_dummy_batch
actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 571, in compute_actions_from_input_dict
return self._compute_action_helper(
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/policy/torch_policy_v2.py", line 1291, in _compute_action_helper
dist_inputs, state_out = self.model(input_dict, state_batches, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/modelv2.py", line 259, in __call__
res = self.forward(restored, state or [], seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 259, in forward
return super().forward(input_dict, state, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 98, in forward
output, new_state = self.forward_rnn(inputs, state, seq_lens)
File "/usr/local/lib/python3.10/dist-packages/ray/rllib/models/torch/recurrent_net.py", line 274, in forward_rnn
self._features, [h, c] = self.lstm(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 810, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
self.check_input(input, batch_sizes)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py", line 218, in check_input
raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 148, got 24
Can't find file:params.json
Traceback (most recent call last):
File "/home/zhcao/Melting-Pot-Contest-2023/baselines/evaluation/evaluate.py", line 137, in <module>
results, scenario = run_evaluation(args)
File "/home/zhcao/Melting-Pot-Contest-2023/baselines/evaluation/evaluate.py", line 47, in run_evaluation
f = open(config_file)
FileNotFoundError: [Errno 2] No such file or directory: 'None/params.json'
I am trying to run the baseline. When running the rendering routine, I get the following error:
Call:
python baselines/train/render_models.py --config_dir ./
Error:
Traceback (most recent call last):
File "/home/ildefons/aicrowd/Melting-Pot-Contest-2023/baselines/train/render_models.py", line 94, in <module>
render_model(args)
File "/home/ildefons/aicrowd/Melting-Pot-Contest-2023/baselines/train/render_models.py", line 17, in render_model
f = open(config_file)
I basically cannot find the params.json file. Where is it?
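For what it's worth, Ray Tune normally writes a params.json into each trial directory under the results folder, so --config_dir usually has to point at one of those trial directories rather than at the repository root (and must not be left unset, which gives the 'None/params.json' path above). A minimal sketch for locating candidates, assuming the default ./results layout; the helper name find_trial_dirs is hypothetical and not part of the baseline:

```python
from pathlib import Path


def find_trial_dirs(results_root):
    """Return every directory under results_root that contains a params.json."""
    root = Path(results_root)
    if not root.is_dir():
        return []
    # Hypothetical helper: Ray Tune typically stores params.json next to the
    # checkpoints of each trial, so its parent is a candidate --config_dir.
    return sorted(p.parent for p in root.rglob("params.json"))


if __name__ == "__main__":
    for trial_dir in find_trial_dirs("./results"):
        print(trial_dir)
```

Any directory it prints could then be passed as --config_dir. If nothing is printed, training has probably not written a trial directory yet.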
File "/home/tess/anaconda3/envs/marlEnv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 3623, in _check_if_correct_nn_framework_installed
raise ImportError(
ImportError: PyTorch was specified as the framework to use (via `config.framework('torch')`)! However, no installation was found. You can install PyTorch via `pip install torch`.
Ubuntu 20.04
Python 3.10.13 (from anaconda3)
PyTorch 2.0.1
Ray 2.6.1
I have followed the installation guidelines for setting up TensorFlow in a WSL2 environment, but I'm encountering an issue where TensorFlow is unable to detect my GPU. I would like assistance in resolving this issue.
Running python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" returns:
2023-09-04 22:53:51.646492: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-04 22:53:51.723849: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:51.723880: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-09-04 22:53:52.212623: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:52.212707: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:52.212726: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-09-04 22:53:53.303714: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:53.303816: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:53.303870: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:53.303919: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:53.400425: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:53.400533: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/wsl/lib::/lib:/home/tyren/miniconda3/lib/
2023-09-04 22:53:53.400558: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
Meanwhile, the following torch code reports that a GPU device is available:
import torch
if torch.cuda.is_available():
    num_gpus = torch.cuda.device_count()
    print(f"PyTorch can use {num_gpus} GPU(s).")
    for i in range(num_gpus):
        gpu_properties = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {gpu_properties.name}, Memory: {gpu_properties.total_memory / (1024**3):.2f}GB")
else:
    print("PyTorch cannot use GPU. Running on CPU.")
# PyTorch can use 1 GPU(s).
# GPU 0: NVIDIA RTX A1000 Laptop GPU, Memory: 4.00GB
I have already checked the following:
Any assistance in resolving this issue would be greatly appreciated.
Thank you for your help!
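The warnings above all say that TensorFlow's loader could not open specific CUDA 11 libraries. PyTorch wheels typically ship their own CUDA runtime, which would explain why only TensorFlow fails here. A rough sketch to check which of the library names from the log are present on LD_LIBRARY_PATH; the helper name missing_libs is made up for illustration, the library names are copied from the log, and the check ignores other loader search locations such as the ldconfig cache:

```python
import os
from pathlib import Path

# Library file names copied from the TensorFlow warnings above.
REQUIRED_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def missing_libs(names, search_path):
    """Return the names that are not found in any directory of search_path."""
    # Empty entries (such as the "::" in the logged path) are skipped.
    dirs = [Path(d) for d in search_path.split(os.pathsep) if d]
    return [n for n in names if not any((d / n).exists() for d in dirs)]


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for name in missing_libs(REQUIRED_LIBS, path):
        print("not found on LD_LIBRARY_PATH:", name)
```

If every name is reported as missing, the CUDA 11 toolkit and cuDNN 8 libraries are most likely not installed inside the WSL2 environment, or their directory is not on the path.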
Hello,
I have run the setup.py file and ray_patch.sh but still get the following error when running python baselines/train/run_ray_train.py --framework torch.
Traceback (most recent call last):
File "/ccn2/u/ziyxiang/Melting-Pot-Contest-2023/baselines/train/run_ray_train.py", line 173, in <module>
).fit()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/tuner.py", line 347, in fit
return self._local_tuner.fit()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
analysis = self._fit_internal(trainable, param_space)
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 703, in _fit_internal
analysis = run(
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/tune.py", line 1107, in run
runner.step()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py", line 280, in step
self._maybe_update_trial_queue()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py", line 411, in _maybe_update_trial_queue
if not self._update_trial_queue(blocking=not dont_wait_for_trial):
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 1112, in _update_trial_queue
self.add_trial(trial)
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py", line 383, in add_trial
super().add_trial(trial)
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 597, in add_trial
trial.create_placement_group_factory()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/experiment/trial.py", line 553, in create_placement_group_factory
default_resources = trainable_cls.default_resource_request(self.config)
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2193, in default_resource_request
cf.validate()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 315, in validate
super().validate()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/algorithms/pg/pg.py", line 100, in validate
super().validate()
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 773, in validate
self._check_if_correct_nn_framework_installed(_tf1, _tf, _torch)
File "/data/ziyxiang/anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 3623, in _check_if_correct_nn_framework_installed
raise ImportError(
ImportError: PyTorch was specified as the framework to use (via `config.framework('torch')`)! However, no installation was found. You can install PyTorch via `pip install torch`.
I double-checked that torch is indeed installed; here is the output from nvidia-smi:
-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44 Driver Version: 495.44 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
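One thing that may be worth ruling out is that the interpreter running the training script is not the one torch was installed into (for example, a different conda environment). A small sketch using only the standard library; describe_module is a hypothetical helper name:

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Run this with the same interpreter that launches run_ray_train.py.
    print("interpreter:", sys.executable)
    print("torch found at:", describe_module("torch"))
```

If this prints None, torch is not visible to that interpreter. If it prints a path but `import torch` still raises, the underlying import error is probably the real problem.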
Hi, @rstrivedi
Here are my arguments:
Running trails with the following arguments: Namespace(num_workers=2, num_gpus=0, local=False, no_tune=False, algo='ppo', framework='torch', exp='clean_up', seed=123, results_dir='./results', logging='INFO', wandb=False, downsample=True, as_test=False)
After training starts, the run ends by itself after about 400 steps (nearly 2 minutes), and no error appears to be raised. Do you have any suggestions on what to change?
(PPO pid=299168) 2023-10-19 14:23:40,547 INFO rollout_worker.py:786 -- Training on concatenated sample batches:
(PPO pid=299168)
(PPO pid=299168) { 'count': 32,
(PPO pid=299168) 'policy_batches': { 'agent_3': { 'action_dist_inputs': np.ndarray((32, 9), dtype=float32, min=-0.176, max=0.307, mean=0.013),...
(PPO pid=299168)
(PPO pid=299168) 2023-10-19 14:23:40,553 INFO rnn_sequencing.py:178 -- Padded input for RNN/Attn.Nets/MA:
....
(RolloutWorker pid=302375) /home/ldp/anaconda3/envs/mpc_main/lib/python3.10/site-packages/gymnasium/spaces/box.py:227: UserWarning: WARN: Casting input x to numpy array.
(RolloutWorker pid=302375) logger.warn("Casting input x to numpy array.")
...
(RolloutWorker pid=302375) 2023-10-19 14:23:33,831 INFO policy.py:1294 -- Policy (worker=2) running on CPU. [repeated 7x across cluster]
(PPO pid=299168) 2023-10-19 14:23:34,325 INFO torch_policy_v2.py:113 -- Found 0 visible cuda devices. [repeated 14x across cluster]
...
(PPO pid=299168) 2023-10-19 14:23:34,339 INFO util.py:118 -- Using connectors: [repeated 14x across cluster]
(PPO pid=299168) 2023-10-19 14:23:34,339 INFO util.py:119 -- AgentConnectorPipeline [repeated 14x across cluster]
(PPO pid=299168) StateBufferConnector [repeated 14x across cluster]
(PPO pid=299168) ViewRequirementAgentConnector [repeated 14x across cluster]
(PPO pid=299168) 2023-10-19 14:23:34,339 INFO util.py:120 -- ActionConnectorPipeline [repeated 14x across cluster]
(PPO pid=299168) ConvertToNumpyConnector [repeated 14x across cluster]
(PPO pid=299168) NormalizeActionsConnector [repeated 14x across cluster]
(PPO pid=299168) ImmutableActionsConnector [repeated 14x across cluster]
(RolloutWorker pid=302374) 2023-10-19 14:23:40,526 INFO rollout_worker.py:732 -- Completed sample batch:
(RolloutWorker pid=302374) 'agent_1': { 'action_dist_inputs': np.ndarray((200, 9), dtype=float32, min=-0.0, max=0.0, mean=-0.0),
(RolloutWorker pid=302374) 'agent_2': { 'action_dist_inputs': np.ndarray((200, 9), dtype=float32, min=-0.43, max=0.916, mean=0.128),
(RolloutWorker pid=302374) 'agent_3': { 'action_dist_inputs': np.ndarray((200, 9), dtype=float32, min=-0.314, max=0.355, mean=0.004),
(RolloutWorker pid=302374) 'agent_4': { 'action_dist_inputs': np.ndarray((200, 9), dtype=float32, min=-0.433, max=0.446, mean=-0.019),
(RolloutWorker pid=302374) 'agent_5': { 'action_dist_inputs': np.ndarray((200, 9), dtype=float32, min=-0.573, max=1.049, mean=0.05),
(RolloutWorker pid=302374) 'agent_6': { 'action_dist_inputs': np.ndarray((200, 9), dtype=float32, min=-0.464, max=0.5, mean=-0.004).Result(
metrics={'custom_metrics': {}, 'episode_media': {}, 'info': {'learner': {'agent_3': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.2938595721563488, 'cur_kl_coeff': 0.19999999999999998, 'cur_lr': 5.000000000000001e-05, 'total_loss': 0.01326687481046783, 'policy_loss': 0.011805192082956956, 'vf_loss': 0.0014429467907768848, 'vf_explained_var': -1.0, 'kl': 9.368463734633353e-05, 'entropy': 2.1968671936737865, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'num_grad_updates_lifetime': 285.5, 'diff_num_grad_updates_vs_sampler_policy': 284.5}, 'agent_6': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.44844337551151975, 'cur_kl_coeff': 0.19999999999999998, 'cur_lr': 5.000000000000001e-05, 'total_loss': 0.063614134408795, 'policy_loss': 0.062265382422820516, 'vf_loss': 0.0006033147813166013, 'vf_explained_var': -1.0, 'kl': 0.0037272102948426424, 'entropy': 2.197162473829169, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'num_grad_updates_lifetime': 285.5, 'diff_num_grad_updates_vs_sampler_policy': 284.5}, 'agent_2': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.33651486742391923, 'cur_kl_coeff': 0.19999999999999998, 'cur_lr': 5.000000000000001e-05, 'total_loss': 0.020297989991836643, 'policy_loss': 0.01227701953693963, 'vf_loss': 0.004015607813974049, 'vf_explained_var': -0.9549619204119633, 'kl': 0.020026806541649823, 'entropy': 2.196402873072708, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'num_grad_updates_lifetime': 285.5, 'diff_num_grad_updates_vs_sampler_policy': 284.5}, 'agent_0': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.08610846985768723, 'cur_kl_coeff': 0.19999999999999998, 'cur_lr': 5.000000000000001e-05, 'total_loss': 0.00414218608486025, 'policy_loss': 0.002945413388181151, 'vf_loss': 0.0011898500088354863, 'vf_explained_var': -1.0, 'kl': 3.462271995048046e-05, 
'entropy': 2.1969731778429265, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'num_grad_updates_lifetime': 285.5, 'diff_num_grad_updates_vs_sampler_policy': 284.5}, 'agent_4': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.7889667204074692, 'cur_kl_coeff': 0.19999999999999998, 'cur_lr': 5.000000000000001e-05, 'total_loss': 0.04514940154264893, 'policy_loss': -0.005434698990562506, 'vf_loss': 0.04920632253490846, 'vf_explained_var': -1.0, 'kl': 0.006888877354785194, 'entropy': 2.195473243897421, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'num_grad_updates_lifetime': 285.5, 'diff_num_grad_updates_vs_sampler_policy': 284.5}, 'agent_1': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 0.20199027515032836, 'cur_kl_coeff': 0.19999999999999998, 'cur_lr': 5.000000000000001e-05, 'total_loss': 0.021706062150106096, 'policy_loss': 0.01644499337202624, 'vf_loss': 0.005171904300403789, 'vf_explained_var': -1.0, 'kl': 0.00044581278011054644, 'entropy': 2.195561667074237, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'num_grad_updates_lifetime': 285.5, 'diff_num_grad_updates_vs_sampler_policy': 284.5}, 'agent_5': {'learner_stats': {'allreduce_latency': 0.0, 'grad_gnorm': 2.785927937036021, 'cur_kl_coeff': 0.19999999999999998, 'cur_lr': 5.000000000000001e-05, 'total_loss': 0.356755890442761, 'policy_loss': 0.024646596205339096, 'vf_loss': 0.3239888458541317, 'vf_explained_var': -0.735680664945067, 'kl': 0.040602242583161224, 'entropy': 2.195885472130357, 'entropy_coeff': 0.0}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': 32.0, 'num_grad_updates_lifetime': 285.5, 'diff_num_grad_updates_vs_sampler_policy': 284.5}}, 'num_env_steps_sampled': 400, 'num_env_steps_trained': 400, 'num_agent_steps_sampled': 2800, 'num_agent_steps_trained': 2800}, 'sampler_results': {'episode_reward_max': nan, 'episode_reward_min': 
nan, 'episode_reward_mean': nan, 'episode_len_mean': nan, 'episode_media': {}, 'episodes_this_iter': 0, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [], 'episode_lengths': []}, 'sampler_perf': {}, 'num_faulty_episodes': 0, 'connector_metrics': {}}, 'episode_reward_max': nan, 'episode_reward_min': nan, 'episode_reward_mean': nan, 'episode_len_mean': nan, 'episodes_this_iter': 0, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'hist_stats': {'episode_reward': [], 'episode_lengths': []}, 'sampler_perf': {}, 'num_faulty_episodes': 0, 'connector_metrics': {}, 'num_healthy_workers': 2, 'num_in_flight_async_reqs': 0, 'num_remote_worker_restarts': 0, 'num_agent_steps_sampled': 2800, 'num_agent_steps_trained': 2800, 'num_env_steps_sampled': 400, 'num_env_steps_trained': 400, 'num_env_steps_sampled_this_iter': 400, 'num_env_steps_trained_this_iter': 400, 'num_env_steps_sampled_throughput_per_sec': 7.399711776182035, 'num_env_steps_trained_throughput_per_sec': 7.399711776182035, 'num_steps_trained_this_iter': 400, 'agent_timesteps_total': 2800, 'timers': {'training_iteration_time_ms': 54056.106, 'sample_time_ms': 6118.289, 'learn_time_ms': 47902.557, 'learn_throughput': 8.35, 'synch_weights_time_ms': 32.938}, 'counters': {'num_env_steps_sampled': 400, 'num_env_steps_trained': 400, 'num_agent_steps_sampled': 2800, 'num_agent_steps_trained': 2800}, 'done': True, 'trial_id': '00b18_00000', 'perf': {'cpu_util_percent': 1.6532467532467536, 'ram_util_percent': 8.0}, 'experiment_tag': '0'},
path='/home/ldp/competitions/meltingpot/Melting-Pot-Contest-2023/results/torch/clean_up/PPO_meltingpot_00b18_00000_0_2023-10-19_14-23-22',
checkpoint=Checkpoint(local_path=/home/ldp/competitions/meltingpot/Melting-Pot-Contest-2023/results/torch/clean_up/PPO_meltingpot_00b18_00000_0_2023-10-19_14-23-22/checkpoint_000001)
)
(RolloutWorker pid=302374) [repeated 30x across cluster]
(RolloutWorker pid=302374) { 'count': 200,
(RolloutWorker pid=302374) 'policy_batches': { 'agent_0': { 'action_dist_inputs': np.ndarray((200, 9), dtype=float32, min=-0.0, max=0.001, mean=0.0),
(RolloutWorker pid=302374) 'action_logp': np.ndarray((200,), dtype=float32, min=-2.615, max=-1.737, mean=-2.185), [repeated 7x across cluster]
(RolloutWorker pid=302374) 'actions': np.ndarray((200,), dtype=int64, min=0.0, max=8.0, mean=3.71), [repeated 7x across cluster]
...
Hello!
In my case, generating a video with
python baselines/evaluation/evaluate.py --num_episodes 1 --eval_on_scenario 1 --scenario allelopathic_harvest__open_0 REST_OF_ARGUMENTS
takes ~10 minutes.
Before, it used the vp90 codec, which compressed really well but took 3+ seconds to encode each image. One episode took ~40 minutes to generate.
I have swapped vp90 for the mp4v codec, and now encoding takes only 0.3 seconds per frame, but the env takes ~1 second per step. It takes 20 minutes to finish the game and generate the final video.
Is there an existing way to record the environment faster? Maybe a better codec, or a way to make stepping faster? I see that the GPU is barely utilized. Another option might be to record each agent's step/game state, save it into a log, and then play it back in a game/environment runner (like replays in video games and competitions such as Lux-AI on Kaggle, Quake, Dota 2, Counter-Strike, etc.).
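Since stepping (~1 s) and encoding (~0.3 s) apparently run one after the other, one option might be to overlap them by handing frames to a background writer thread. The sketch below shows only the pattern, using the standard library. step_env and encode_frame are placeholders for the real environment step and video writer call, not functions from this repository. It does not make the environment itself faster, and whether encoding really overlaps depends on the encoder releasing the GIL.

```python
import queue
import threading


def record_episode(step_env, encode_frame, num_steps, max_pending=64):
    """Step the environment while a worker thread encodes frames in parallel.

    step_env(t) must return a frame that is not None (None is the stop signal).
    """
    frames = queue.Queue(maxsize=max_pending)
    errors = []

    def writer():
        while True:
            frame = frames.get()
            if frame is None:  # sentinel: no more frames
                break
            if errors:
                continue  # keep draining so the producer never blocks forever
            try:
                encode_frame(frame)
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=writer)
    worker.start()
    try:
        for t in range(num_steps):
            frames.put(step_env(t))  # blocks only if the encoder falls far behind
    finally:
        frames.put(None)
        worker.join()
    if errors:
        raise errors[0]
```

With this arrangement the total time would be bounded by the slower of the two stages rather than their sum, so at best it removes the encoding share of the runtime.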
Hi, I was wondering if there is any interest in help with additional tutorials/baselines? I posted on the Melting Pot repo itself (google-deepmind/meltingpot#113) but since users will likely refer here, I figure maybe it’d be useful to contribute towards this.
I am the project manager/maintainer of PettingZoo, and created a conversion wrapper adapting MeltingPot environments to work with PettingZoo using the Shimmy library (https://shimmy.farama.org/environments/meltingpot/). I have also been updating PettingZoo’s internal tutorials to give users a better starting point, with examples using Stable-Baselines3, CleanRL, RLlib, Tianshou, LangChain (proof of concept), and AgileRL. I would be happy to create some simple scripts demonstrating how to use the conversion wrappers and then basic examples with different libraries, if that is something you are interested in.
Would also be happy to help out with the baselines aspect of things, for example benchmarking the implementation of MADDPG from RLlib vs AgileRL, though I imagine the tutorials would be a more practical/simpler way to contribute.
Hello,
The tensorflow package cannot be installed on M1 Macs, so I propose modifying this line in the setup.py file from:
'tensorflow==2.11.1'
to:
'tensorflow==2.11.1' if sys.platform != 'darwin' or platform.processor() != 'arm' else 'tensorflow-macos==2.11.0',
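An alternative that avoids evaluating platform checks inside setup.py would be PEP 508 environment markers, which pip evaluates at install time. A sketch, assuming the same two package pins as above; the function name tensorflow_requirements is only for illustration:

```python
def tensorflow_requirements():
    """Requirement strings that select the TensorFlow build per platform."""
    apple_silicon = 'sys_platform == "darwin" and platform_machine == "arm64"'
    return [
        # Apple Silicon Macs get the tensorflow-macos build.
        f"tensorflow-macos==2.11.0; {apple_silicon}",
        # Every other platform keeps the original pin.
        'tensorflow==2.11.1; sys_platform != "darwin" or platform_machine != "arm64"',
    ]
```

The returned strings could be appended to install_requires in place of the single 'tensorflow==2.11.1' entry.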
I was trying to figure out how to resume training based on a restored checkpoint with run_ray_train.py. Specifically:
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.algorithm import Algorithm
from ray.tune import registry
from baselines.train import make_envs
ray.init(local_mode=True, ignore_reinit_error=True)
registry.register_env("meltingpot", make_envs.env_creator)
## train mode, two failed attempts
my_ppo_config = PPOConfig().environment("meltingpot")
my_ppo = my_ppo_config.build()
# method1: fail at .build stage
PPOConfig().environment("meltingpot").build().restore(checkpoint_dir)
# method2: failed at .train stage
Algorithm.from_checkpoint(checkpoint_dir).train()
I get a KeyError; the details are shown below:
ray::RolloutWorker.__init__() (pid=180001, ip=10.0.0.182, actor_id=17cb813ab79e0c981feebd6e01000000, repr=<ray.rllib.evaluation.rollout_worker._modify_class.<locals>.Class object at 0x7f6b2de58850>)
File "anaconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 397, in __init__
self.env = env_creator(copy.deepcopy(self.env_context))
File "/home/researchyw20/meltingpot/code/Melting-Pot-Contest-2023/baselines/train/make_envs.py", line 10, in env_creator
env = substrate.build(env_config['substrate'], roles=env_config['roles'])
File "anaconda3/envs/mpc_main/lib/python3.10/site-packages/ml_collections/config_dict/config_dict.py", line 909, in __getitem__
raise KeyError(self._generate_did_you_mean_message(key, str(e)))
KeyError: "'substrate'"
Any help on this is appreciated.
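The KeyError suggests that env_creator received an env_config without a 'substrate' entry. That is at least what happens in the first attempt, where the environment is selected by name but no env_config is passed. A sketch of a check that fails with a clearer message; the required keys are taken from the make_envs.py line in the traceback, and require_env_keys is a hypothetical helper:

```python
REQUIRED_ENV_KEYS = ("substrate", "roles")  # keys read by env_creator in the traceback


def require_env_keys(env_config):
    """Raise a descriptive error if env_config lacks keys env_creator reads."""
    missing = [k for k in REQUIRED_ENV_KEYS if k not in env_config]
    if missing:
        raise ValueError(
            f"env_config is missing {missing}; pass it explicitly, e.g. "
            "PPOConfig().environment('meltingpot', env_config={...})"
        )
    return env_config
```

The env_config used for the original training run should be stored in the trial's params.json, so reusing that dictionary when rebuilding the config may be the simplest route.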
Problem:
I am encountering an issue while running the MeltingPot baseline Ray training model. The episode rewards I am getting are consistently NaN (Not-a-Number).
Steps to Reproduce:
python baselines/train/run_ray_train.py --num_gpus 1 --wandb True
The training args are set as follows:
# training
"seed": args.seed,
"rollout_fragment_length": 5, # Divide episodes into fragments of this many steps each during rollouts.
"train_batch_size": 40, # Batch size (batch * rollout_fragment_length) Trajectories of this size are collected from rollout workers and combined into a larger batch of train_batch_size for learning.
"sgd_minibatch_size": 32, # PPO further divides the train batch into minibatches for multi-epoch SGD
"disable_observation_precprocessing": True,
"use_new_rl_modules": False,
"use_new_learner_api": False,
"framework": args.framework, # torch or tensorflow
# agent model
"fcnet_hidden": (4, 4), # fully connected network
"post_fcnet_hidden": (16,), # Layer sizes after the fully connected torso.
"cnn_activation": "relu",
"fcnet_activation": "relu",
"post_fcnet_activation": "relu",
# == LSTM ==
"use_lstm": True,
"lstm_use_prev_action": True,
"lstm_use_prev_reward": False,
"lstm_cell_size": 2, # A cell, is an LSTM unit
"shared_policy": False,
Please let me know if there's any additional information or logs needed to diagnose this issue. Thank you for your assistance in resolving this problem.
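One possible explanation, offered tentatively: RLlib reports episode_reward_mean as nan as long as no episode has finished, and with train_batch_size set to 40 each iteration only advances an episode by a few dozen steps. The arithmetic below estimates how many iterations pass before the first episode can complete. The episode length of 1000 steps is an assumption for illustration, as is the simplification that a single environment is being sampled.

```python
import math


def iterations_until_first_episode(episode_length, train_batch_size):
    """Number of training iterations needed to sample one full episode."""
    return math.ceil(episode_length / train_batch_size)


if __name__ == "__main__":
    # episode_length=1000 is an assumed value; train_batch_size=40 is from the config above.
    print(iterations_until_first_episode(1000, 40))
```

Under these assumptions the reward would stay nan for the first couple of dozen iterations and only become a number once an episode ends.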
Running an episode with 7 players: ['1', '2', '3', '4', '5', '6', '7'].
Traceback (most recent call last):
File "/home/tess/Desktop/MARL/contest/meltingpot/human_players/play_clean_up.py", line 91, in <module>
main()
File "/home/tess/Desktop/MARL/contest/meltingpot/human_players/play_clean_up.py", line 83, in main
level_playing_utils.run_episode(
File "/home/tess/Desktop/MARL/contest/meltingpot/human_players/level_playing_utils.py", line 344, in run_episode
verbose_fn(timestep, i, player_index)
File "/home/tess/Desktop/MARL/contest/meltingpot/human_players/play_clean_up.py", line 44, in verbose_fn
cleaned = env_timestep.observation[f'{lua_index}.PLAYER_CLEANED']
KeyError: '1.PLAYER_CLEANED'
The error only appears when verbose=True.
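Until the missing observation is sorted out, the verbose callback could tolerate its absence. A sketch assuming the observation behaves like a dict keyed by strings such as '1.PLAYER_CLEANED'; read_player_observation is a hypothetical helper, not part of human_players:

```python
def read_player_observation(observation, lua_index, name, default=None):
    """Look up '<lua_index>.<name>' and fall back to default if it is absent."""
    return observation.get(f"{lua_index}.{name}", default)
```

In verbose_fn the lookup would then become read_player_observation(env_timestep.observation, lua_index, 'PLAYER_CLEANED'), returning None instead of raising when the substrate does not expose that observation.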
We are starting to test attention mechanisms, and we get no evaluation result with either the evaluation.py file or the local_evaluation.py file. We start from a configs.py file that works without attention, deactivate use_lstm, set use_attention, and set the attention parameters. The evaluation.py results can be seen here: https://drive.google.com/drive/folders/1s2OIwmb_bFjUoJ9O_OnI2BG-GL_fk_gU?usp=sharing. There is also a sample here of the result when submitting as well as when doing local evaluation.
error_submission_oct_18.txt
output (1).txt
Hello, I am encountering a shape error when running the training script CUDA_VISIBLE_DEVICES=0 python baselines/train/run_ray_train.py --framework torch --exp al_harvest. Any help is greatly appreciated.
2023-09-04 21:31:18,216 ERROR tune.py:1144 -- Trials did not complete: [PPO_meltingpot_55a86_00000]
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/models/torch/recurrent_net.py", line 274, in forward_rnn
(PPO pid=6862) self._features, [h, c] = self.lstm(
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
(PPO pid=6862) return forward_call(*args, **kwargs)
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 810, in forward
(PPO pid=6862) self.check_forward_args(input, hx, batch_sizes)
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
(PPO pid=6862) self.check_input(input, batch_sizes)
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 218, in check_input
(PPO pid=6862) raise RuntimeError(
(PPO pid=6862) RuntimeError: input.size(-1) must be equal to input_size. Expected 147, got 27
(PPO pid=6862)
(PPO pid=6862) During handling of the above exception, another exception occurred:
(PPO pid=6862)
(PPO pid=6862) ray::PPO.__init__() (pid=6862, ip=10.64.34.33, actor_id=afbc4db286cfab682041540a01000000, repr=PPO)
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 517, in __init__
(PPO pid=6862) super().__init__(
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 169, in __init__
(PPO pid=6862) self.setup(copy.deepcopy(self.config))
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 639, in setup
(PPO pid=6862) self.workers = WorkerSet(
(PPO pid=6862) File "/home/paperspace/miniconda3/envs/mpc_main/lib/python3.10/site-packages/ray/rllib/evaluation/worker_set.py", line 179, in __init__
(PPO pid=6862) raise e.args[0].args[2]
(PPO pid=6862) RuntimeError: input.size(-1) must be equal to input_size. Expected 147, got 27
I installed Melting Pot manually and also patched the RLlib torch model file with the LSTM wrapper fix. I get an assertion error when I run training with these arguments:
Running trails with the following arguments: Namespace(num_workers=8, num_gpus=0, local=False, no_tune=False, algo='ppo', framework='torch', exp='clean_up', seed=123, results_dir='./results', logging='INFO', wandb=False, downsample=True, as_test=False)
Failure # 1 (occurred at 2023-09-02_22-43-34)
ray::PPO.train() (pid=86686, ip=######, actor_id=b2a47515069c17b28793673b01000000, repr=PPO)
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 375, in train
raise skipped from exception_cause(skipped)
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 372, in train
result = self.step()
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 851, in step
results, train_iter_ctx = self._run_one_training_iteration()
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2835, in _run_one_training_iteration
results = self.training_step()
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 455, in training_step
train_results = train_one_step(self, train_batch)
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/execution/train_ops.py", line 56, in train_one_step
info = do_minibatch_sgd(
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/utils/sgd.py", line 129, in do_minibatch_sgd
local_worker.learn_on_batch(
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 810, in learn_on_batch
info_out[pid] = policy.learn_on_batch(batch)
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 729, in learn_on_batch
grads, fetches = self.compute_gradients(postprocessed_batch)
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 929, in compute_gradients
pad_batch_to_sequences_of_same_size(
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/policy/rnn_sequencing.py", line 155, in pad_batch_to_sequences_of_same_size
feature_sequences, initial_states, seq_lens = chop_into_sequences(
File "/home/saidinesh/Desktop/Projects/Melting-Pot-Contest-2023/rllib-env/lib/python3.10/site-packages/ray/rllib/policy/rnn_sequencing.py", line 387, in chop_into_sequences
assert i == len(f), f
AssertionError: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
I tried to set up the Melting Pot environment on my lab's server, but the installation failed when I ran
SYSTEM_VERSION_COMPAT=0 pip install dmlab2d
with this error:
ERROR: Could not find a version that satisfies the requirement dmlab2d (from versions: none)
ERROR: No matching distribution found for dmlab2d
The server runs CentOS Linux release 7.9.2009. The same installation succeeded on my Mac (M1 chip), so I suspect the failure is caused by the server's OS version. I would prefer to run on the server because of its GPU resources. Is there a way to install dmlab2d on CentOS?
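One common cause of "from versions: none" is that pip finds no wheel whose platform tag matches the machine, for example when the published Linux wheels are built against a newer glibc than the system provides (CentOS 7 ships glibc 2.17). The sketch below is only a way to check that hypothesis, not a confirmed diagnosis: the helper names are made up for illustration, and the minimum glibc value is an assumption that should be replaced with the value in the manylinux tag of the wheel filenames listed on the package's PyPI page.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as "2.17" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def glibc_satisfies(system_glibc, wheel_glibc):
    """Return True when the system glibc is at least the wheel's minimum."""
    return parse_version(system_glibc) >= parse_version(wheel_glibc)


# Hypothetical minimum, for illustration only; read the real value from the
# manylinux tag in the wheel filenames on the package's PyPI page.
ASSUMED_WHEEL_GLIBC = "2.31"

libc_name, libc_version = platform.libc_ver()
if libc_name == "glibc" and libc_version:
    compatible = glibc_satisfies(libc_version, ASSUMED_WHEEL_GLIBC)
    print(
        f"system glibc {libc_version}, "
        f"assumed wheel minimum {ASSUMED_WHEEL_GLIBC}: compatible={compatible}"
    )
else:
    print("glibc version not detected (non-glibc platform such as macOS)")
```

Running `pip debug --verbose` on the server lists the platform tags pip will accept, which can be compared against the wheel filenames directly. If the system glibc turns out to be too old, the usual options are building the package from source or running inside a container with a newer base image.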