
mvfst-rl's Introduction

mvfst-rl

mvfst-rl is a framework for network congestion control in the QUIC transport protocol that leverages state-of-the-art asynchronous reinforcement learning training with off-policy correction. It is built upon the following components:

  1. mvfst, an implementation of the IETF QUIC transport protocol.
  2. torchbeast, a PyTorch implementation of asynchronous distributed deep RL.
  3. Pantheon, a set of calibrated network emulators.

MTEnv API

If your objective is to experiment with new RL algorithms on congestion control tasks, you are encouraged to switch to the mtenv branch.

That branch implements in particular an MTEnv-compatible API that makes it easy to define a multi-task environment and interact with it, independently of the more complex IMPALA-based learning framework this project is based on.
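
If you already have a clone of this repository, switching over is a single command (assuming the mtenv branch is available on your remote):

git checkout mtenv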

Asynchronous RL Agent

[Figure: asynchronous RL agent]

Training Architecture

[Figure: training architecture]

For more details, please refer to our paper.

Building mvfst-rl

Ubuntu 20+

Pantheon requires Python 2, while mvfst-rl training requires Python 3.8+. The recommended setup is to use the python2 and python3 commands explicitly.
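
A quick sanity check that both interpreters are visible on your PATH (this is only an illustrative check, not part of the official setup):

python2 --version   # used by Pantheon
python3 --version   # should report 3.8 or newer for training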

For building with training support, it is recommended to have a conda environment first:

conda create -n mvfst-rl python=3.8 -y && conda activate mvfst-rl
./setup.sh

If you have a previous installation and need to re-install from scratch after updating the code, run the following commands:

conda activate base && conda env remove -n mvfst-rl
conda create -n mvfst-rl python=3.8 -y && conda activate mvfst-rl
./setup.sh --clean

For building mvfst-rl in test-only or deployment mode, run the following script. This allows you to run a trained model exported via TorchScript purely in C++.

./setup.sh --inference
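
For reference, the main binary produced by the build is _build/deps/pantheon/third_party/mvfst-rl/_build/build/traffic_gen/traffic_gen (paths and flags below are taken from the logs quoted in the issues further down this page, so treat this as a hedged sketch rather than a guaranteed invocation):

_build/build/traffic_gen/traffic_gen --mode=server --cc_algo=rl --cc_env_mode=local --cc_env_model_file=models/traced_model.pt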

Training

Training can be run locally as follows:

python3 -m train.train \
mode=train \
total_steps=1_000_000 \
num_actors=40 \
hydra.run.dir=/tmp/logs

The above starts 40 Pantheon instances in parallel that communicate with the torchbeast actors via RPC. To see the full list of training parameters, run python3 -m train.train --help.
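
For a quick smoke test before launching a full run, the same command works with much smaller settings (this variant mirrors one used in the issues below):

python3 -m train.train \
mode=train \
total_steps=100 \
num_actors=2 \
hydra.run.dir=/tmp/logs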

Hyper-parameter sweeps with Hydra

mvfst-rl uses Hydra, which makes it particularly easy to run hyper-parameter sweeps. Here is an example showing how to run three experiments with different learning rates on a Slurm cluster:

python3 -m train.train \
mode=train \
test_after_train=false \
total_steps=1_000_000 \
num_actors=40 \
learning_rate=1e-5,1e-4,1e-3 \
hydra.sweep.dir='${oc.env:HOME}/tmp/logs_${now:%Y-%m-%d_%H-%M-%S}' \
hydra/launcher=_submitit_slurm -m

Note the following settings in the above example:

  • test_after_train=false skips running the test mode after training. This can be useful, for instance, when the machines on the cluster have not been set up with all the libraries required in test mode.
  • learning_rate=1e-5,1e-4,1e-3: this is the basic syntax to perform a parameter sweep.
  • hydra.sweep.dir='${oc.env:HOME}/tmp/logs_${now:%Y-%m-%d_%H-%M-%S}': the base location for all logs (look into the .submitit subfolder inside that directory to access the jobs' stdout/stderr).
  • hydra/launcher=_submitit_slurm: the launcher used to run on Slurm. Hydra supports more launchers, see its documentation for details (by default, the joblib launcher is also installed by setup.sh -- it allows running multiple jobs locally instead of on a cluster; a local example is sketched just after this list). Note that the launcher name must be prefixed with an underscore to match the config files under config/hydra/launcher (which you may edit to tweak launcher settings).
  • -m: to run Hydra in multi-run mode.
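
As mentioned in the launcher bullet above, the same sweep can be run locally with the joblib launcher instead of Slurm. A minimal sketch (identical overrides, different launcher name, which must also carry the underscore prefix):

python3 -m train.train \
mode=train \
test_after_train=false \
total_steps=1_000_000 \
num_actors=40 \
learning_rate=1e-5,1e-4,1e-3 \
hydra/launcher=_joblib -m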

Monitoring training behavior

The script scripts/plotting/plot_sweep.py can be used to plot training curves. Refer to comments in the script's header for instructions on how to execute it.

It is also possible to use TensorBoard: the data can be found in the train/tensorboard subfolder of an experiment's logs directory.
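
For example, assuming TensorBoard is installed in the conda environment and using the run directory from the training example above:

tensorboard --logdir /tmp/logs/train/tensorboard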

Evaluation

To test a trained model on all emulated Pantheon environments, run with mode=test as follows:

python3 -m train.train \
  mode=test \
  base_logdir=/tmp/logs

The above takes the checkpoint.tar file in /tmp/logs, traces the model via TorchScript, and runs inference in C++ (without RPC).
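
For reference, a hedged sketch of the artifacts you should find in the logs directory after training and tracing (file names taken from the training logs and questions quoted below):

ls /tmp/logs
# checkpoint.tar  traced_model.pt  traced_model.flags.pkl  train/  test/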

Pantheon logs cleanup

Pantheon generates temporary logs (in _build/deps/pantheon/tmp) that may take up a lot of space. It is advised to regularly run scripts/clean_pantheon_logs.sh to delete them (when no experiment is running). Note that when running jobs on a Slurm cluster, where a temporary local folder is made available to each job in /scratch/slurm_tmpdir/$SLURM_JOB_ID, this folder is used instead to store the logs (thus alleviating the need for manual cleanup).
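
For example, to check how much space the temporary logs take and then remove them (only when no experiment is running):

du -sh _build/deps/pantheon/tmp
./scripts/clean_pantheon_logs.sh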

Contributing

We would love to have you contribute to mvfst-rl or use it for your research. See the CONTRIBUTING file for how to help out.

License

mvfst-rl is licensed under the CC-BY-NC 4.0 license, as found in the LICENSE file.

BibTeX

@article{mvfstrl2019,
  title={MVFST-RL: An Asynchronous RL Framework for Congestion Control with Delayed Actions},
  author={Viswanath Sivakumar and Olivier Delalleau and Tim Rockt\"{a}schel and Alexander H. Miller and Heinrich K\"{u}ttler and Nantas Nardelli and Mike Rabbat and Joelle Pineau and Sebastian Riedel},
  year={2019},
  eprint={1910.04054},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/1910.04054},
  journal={NeurIPS Workshop on Machine Learning for Systems},
}

mvfst-rl's People

Contributors

bottler, facebook-github-bot, javanochka, odelalleau, shagunsodhani, swainsubrat, viswanathgs


mvfst-rl's Issues

'MvFstEnv' object has no attribute '_previous_obs_from_pantheon_process'

When I run the example command for running with a single gym env, it raises this error (screenshot attached).
My config.yaml:
defaults:
  - base_config
  - jobs@train_jobs: random_traces_0_5
  - jobs@eval_jobs: fixed_0_5
  - jobs@env_configs: fixed_0_5
  # - override hydra/launcher: _submitit_slurm
  - override hydra/launcher: _joblib

hydra:
  run:
    dir: checkpoint/${oc.env:USER}/mvfst-rl/run/${now:%Y-%m-%d_%H-%M-%S}
  sweep:
    dir: checkpoint/${oc.env:USER}/mvfst-rl/multirun/${now:%Y-%m-%d_%H-%M-%S}

questions about training

Dear author:
I'm sorry to bother you again. When I increased kDefaultMaxCwndInMss in the QuicConstants.h file of mvfst/quic to 10000, I only trained on a bandwidth of 12 Mbps at different delays. My training command is as follows:
python3 -m train.train --mode=train --base_logdir=/tmp/logs --total_steps=100000 --learning_rate=0.00001 --num_actors=2 --cc_env_history_size=20
I observed the following problem during training. First of all, I found that the value of cwnd can increase to a large value, the delay and throughput also become very large, and so does the reward; the training process does not seem to converge.
Then, the cwnd value is small, but the delay is large.
Finally, after training, the tests found that the selected cwnd values were almost always 10000.
This phenomenon is a bit strange. Some of the training process and test results are attached.
test_12.log
train_12.log

I want to know whether this phenomenon is caused by an incorrect setting of kDefaultMaxCwndInMss or by the algorithm itself. Looking forward to your answer. Thank you very much.

Encounter an error when running script setup.sh

Hello! Thank you for providing this highly integrated platform.
I tried to build the project in a conda environment with Python on Ubuntu 20.04 LTS.
However, when I run setup.sh, the following error appears:

git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Why does this happen, and how can I solve it?
Thanks a lot!

setup error with sprout.

Hello!
When I run ./setup.sh, the following error occurs when building sprout (gcc version = 7.5.0).
Thanks very much if you can give me some advice on how to solve this problem.

collect2: error: ld returned 1 exit status
Makefile:375: recipe for target 'ntester' failed
make[3]: *** [ntester] Error 1
make[3]: Leaving directory '/root/Congestion_RL/mvfst-rl/_build/deps/pantheon/third_party/sprout/src/examples'
Makefile:341: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/root/Congestion_RL/mvfst-rl/_build/deps/pantheon/third_party/sprout/src'
Makefile:386: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/root/Congestion_RL/mvfst-rl/_build/deps/pantheon/third_party/sprout'
Makefile:327: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
File "/root/Congestion_RL/mvfst-rl/_build/deps/pantheon/src/wrappers/sprout.py", line 54, in
main()
File "/root/Congestion_RL/mvfst-rl/_build/deps/pantheon/src/wrappers/sprout.py", line 35, in main
check_call(sh_cmd, shell=True, cwd=cc_repo)
File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command './autogen.sh && ./configure --enable-examples && make -j' returned non-zero exit status 2
Traceback (most recent call last):
File "./src/experiments/setup.py", line 59, in
main()
File "./src/experiments/setup.py", line 55, in main
setup(args)
File "./src/experiments/setup.py", line 45, in setup
check_call([cc_src, 'setup'])
File "/root/Congestion_RL/mvfst-rl/_build/deps/pantheon/src/helpers/subprocess_wrappers.py", line 24, in check_call
return subprocess.check_call(cmd, **kwargs)
File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/root/Congestion_RL/mvfst-rl/_build/deps/pantheon/src/wrappers/sprout.py', 'setup']' returned non-zero exit status 1

mvfst-rl/_build/build/traffic_gen/traffic_gen: undefined symbol: _ZNK6google8protobuf7Message25InitializationErrorStringB5cxx11Ev

Hi:
When I run “python3 -m train.train mode=train total_steps=100 num_actors=2 hydra.run.dir=/tmp/logs”, I ran into some trouble: training and the local test seem to complete successfully (see the following log file (1): /tmp/logs/train.log),
but I actually see a runtime error in the terminal output log (2) (it does not interrupt the run). I checked that protobuf is installed (version 3.8.0). Looking forward to your help!

$ which protoc
/home/wang/anaconda3/bin/protoc
$ protoc --version
libprotoc 3.8.0

1. /tmp/logs/train.log:
[2022-04-26 14:10:13,971][root][INFO] - Deleted dir (in 0.009s): /dev/shm/mvfst-rl.tmp/f9c4817dd8e9684cbdbfcc4e174d6ff7117a83631d0cd2d7ec86521a2fb748de
[2022-04-26 14:10:42,891][root][INFO] - Mode=train
[2022-04-26 14:10:43,600][root][INFO] - Deleted dir (in 0.000s): /dev/shm/mvfst-rl.tmp/f9c4817dd8e9684cbdbfcc4e174d6ff7117a83631d0cd2d7ec86521a2fb748de
[2022-04-26 14:10:43,601][root][INFO] - Starting agent 0. Mode=train, logdir=/tmp/logs/train
[2022-04-26 14:10:45,904][root][INFO] - Stop event #0 set, will kill corresponding env (pid=29404)
[2022-04-26 14:10:46,627][root][INFO] - Done training.
[2022-04-26 14:10:46,627][root][INFO] - Deleted dir (in 0.000s): /dev/shm/mvfst-rl.tmp/f9c4817dd8e9684cbdbfcc4e174d6ff7117a83631d0cd2d7ec86521a2fb748de
[2022-04-26 14:10:46,722][root][INFO] - Missing traced model, tracing first
[2022-04-26 14:10:46,735][root][INFO] - Deleted dir (in 0.000s): /dev/shm/mvfst-rl.tmp/f9c4817dd8e9684cbdbfcc4e174d6ff7117a83631d0cd2d7ec86521a2fb748de
[2022-04-26 14:10:46,735][root][INFO] - Tracing model from checkpoint /tmp/logs/checkpoint.tar
[2022-04-26 14:10:47,774][root][INFO] - Done tracing to /tmp/logs/traced_model.pt
[2022-04-26 14:10:47,775][root][INFO] - Deleted dir (in 0.000s): /dev/shm/mvfst-rl.tmp/f9c4817dd8e9684cbdbfcc4e174d6ff7117a83631d0cd2d7ec86521a2fb748de
[2022-04-26 14:10:47,775][root][INFO] - Starting local test, logdir=/tmp/logs/test
[2022-04-26 14:54:26,930][root][INFO] - Done local test
[2022-04-26 14:54:26,930][root][INFO] - All done! Checkpoint: /tmp/logs/checkpoint.tar, traced model: /tmp/logs/traced_model.pt

2. terminal log:
/home/wang/WorkSpace/RL/mvfst-rl/_build/deps/pantheon/third_party/mvfst-rl/_build/build/traffic_gen/traffic_gen: symbol lookup error: /home/wang/WorkSpace/RL/mvfst-rl/_build/deps/pantheon/third_party/mvfst-rl/_build/build/traffic_gen/traffic_gen: undefined symbol: _ZNK6google8protobuf7Message25InitializationErrorStringB5cxx11Ev
Traceback (most recent call last):
File "/home/wang/WorkSpace/RL/mvfst-rl/_build/deps/pantheon/src/wrappers/mvfst_rl.py", line 139, in
main()
File "/home/wang/WorkSpace/RL/mvfst-rl/_build/deps/pantheon/src/wrappers/mvfst_rl.py", line 118, in main
assert p.wait() == 0, "Command failed: {}".format(" ".join(cmd))
AssertionError: Command failed: /home/wang/WorkSpace/RL/mvfst-rl/_build/deps/pantheon/third_party/mvfst-rl/_build/build/traffic_gen/traffic_gen --mode=server --host=0.0.0.0 --port=35673 --cc_algo=rl --cc_env_mode=local --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_actor_id=0 --cc_env_job_count=-1 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_uplink_bandwidth=0.069 --cc_env_uplink_queue_size_bytes=21000 --cc_env_base_rtt=56 --cc_env_reward_delay_offset=0.0 --cc_env_reward_formula=log_ratio --cc_env_reward_throughput_factor=1.0 --cc_env_reward_throughput_log_offset=1e-05 --cc_env_reward_delay_factor=0.2 --cc_env_reward_delay_log_offset=1e-05 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_packet_loss_log_offset=1e-05 --cc_env_reward_min_throughput_ratio=0.9 --cc_env_reward_max_throughput_ratio=1.0 --cc_env_reward_n_packets_offset=1.0 --cc_env_reward_uplink_queue_max_fill_ratio=0.5 --cc_env_reward_max_delay=True --cc_env_fixed_cwnd=10 --cc_env_min_rtt_window_length_us=10000000000 --cc_env_rtt_noise_std=0.0 --cc_env_ack_delay_avg_coeff=0.1 --cc_env_bandwidth_min_window_duration_ms=100 --cc_env_obs_scaling=1.0 --cc_env_stats_file= -v=1
[INFO:29637 pantheon_env:609 2022-04-26 14:10:49,518] Using all 6 jobs.
[INFO:29637 pantheon_env:624 2022-04-26 14:10:49,523] Located python2 in /usr/bin
[DEBUG:29637 pantheon_env:341 2022-04-26 14:10:49,527] Sampled job with target cwnd: 39.07987220447284

setup.sh --clean error

Hello! Thank you for providing this highly integrated platform.
I tried to build the project in a conda environment with Python 3.8 on Ubuntu 20.04 LTS.
When I run ./setup.sh --clean, the following error occurs:

Collecting git+git://github.com/odelalleau/hydra.git@f07036a8f3895169e62d89ad653434291b994780
Cloning git://github.com/odelalleau/hydra.git (to revision f07036a8f3895169e62d89ad653434291b994780) to /tmp/pip-req-build-t0400uu1
Running command git clone -q git://github.com/odelalleau/hydra.git /tmp/pip-req-build-t0400uu1
fatal: remote error:
The unauthenticated git protocol on port 9418 is no longer supported.
Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.
WARNING: Discarding git+git://github.com/odelalleau/hydra.git@f07036a8f3895169e62d89ad653434291b994780. Command errored out with exit status 128: git clone -q git://github.com/odelalleau/hydra.git /tmp/pip-req-build-t0400uu1 Check the logs for full command output.
ERROR: Command errored out with exit status 128: git clone -q git://github.com/odelalleau/hydra.git /tmp/pip-req-build-t0400uu1 Check the logs for full command output.

May I know how to solve this problem?
Thank you again.

Build Error while building mvfst

Hello, I have been having problems while building this library on my Ubuntu 18.04.5 LTS machine. Most were build errors for mvfst, especially while building the folly and fizz libraries; some of them were because I did not have Google Test installed in my Anaconda environment. After that, the error below showed up and I cannot find a solution for it. I would appreciate any assistance.
I made sure a couple of times that the problem is not on my end. Could it be because the library is 2 years old, i.e. perhaps there is a dependency compatibility problem?


Here are the last lines of the output showing the error after running setup.sh:

CMake Warning at quic/tools/tperf/CMakeLists.txt:10 (add_executable):
  Cannot generate a safe runtime search path for target tperf because files
  in some directories may conflict with libraries in implicit directories:

    runtime library [libz.so.1] in /usr/lib/x86_64-linux-gnu may be hidden by files in:
      /home/mohamed/anaconda3/envs/mvfst-rl/lib
    runtime library [libssl.so.1.1] in /usr/lib/x86_64-linux-gnu may be hidden by files in:
      /home/mohamed/anaconda3/envs/mvfst-rl/lib
    runtime library [libcrypto.so.1.1] in /usr/lib/x86_64-linux-gnu may be hidden by files in:
      /home/mohamed/anaconda3/envs/mvfst-rl/lib
    runtime library [liblzma.so.5] in /usr/lib/x86_64-linux-gnu may be hidden by files in:
      /home/mohamed/anaconda3/envs/mvfst-rl/lib
    runtime library [liblz4.so.1] in /usr/lib/x86_64-linux-gnu may be hidden by files in:
      /home/mohamed/anaconda3/envs/mvfst-rl/lib

  Some of these libraries may not be found correctly.


-- Generating done
-- Build files have been written to: /home/mohamed/Desktop/Bachelor/mvfst-rl/third-party/mvfst/_build/build
+ make -j 4
[  0%] Built target mvfst_constants
[  1%] Built target mvfst_bufutil
[  1%] Performing update step for 'googletest'
[  2%] Built target mvfst_exception
[  3%] Built target mvfst_looper
[  6%] Built target mvfst_codec_types
[  7%] No patch step for 'googletest'
[  8%] Built target mvfst_codec_decode
[  8%] Performing configure step for 'googletest'
loading initial cache file /home/mohamed/Desktop/Bachelor/mvfst-rl/third-party/mvfst/_build/build/googletest/tmp/googletest-cache-RelWithDebInfo.cmake
[  8%] Built target mvfst_codec_packet_number_cipher
[  8%] Building CXX object quic/handshake/CMakeFiles/mvfst_handshake.dir/QuicFizzFactory.cpp.o
-- Configuring done
-- Generating done
-- Build files have been written to: /home/mohamed/Desktop/Bachelor/mvfst-rl/third-party/mvfst/_build/build/googletest/src/googletest-build
[  8%] Performing build step for 'googletest'
[ 36%] Built target gmock_main
[ 54%] Built target gtest
[ 81%] Built target gmock
[100%] Built target gtest_main
[  8%] No install step for 'googletest'
[  9%] Completed 'googletest'
[ 10%] Built target googletest
Scanning dependencies of target QuicConnectionIdTest
Scanning dependencies of target PacketNumberTest
Scanning dependencies of target QuicIntegerTest
[ 11%] Building CXX object quic/codec/test/CMakeFiles/QuicIntegerTest.dir/QuicIntegerTest.cpp.o
[ 11%] Building CXX object quic/codec/test/CMakeFiles/QuicConnectionIdTest.dir/QuicConnectionIdTest.cpp.o
[ 11%] Building CXX object quic/codec/test/CMakeFiles/PacketNumberTest.dir/PacketNumberTest.cpp.o
/home/mohamed/Desktop/Bachelor/mvfst-rl/third-party/mvfst/quic/handshake/QuicFizzFactory.cpp:17:37: error: ‘folly::Optional<fizz::TLSMessage> {anonymous}::QuicPlaintextReadRecordLayer::read(folly::IOBufQueue&)’ marked ‘override’, but does not override
   folly::Optional<fizz::TLSMessage> read(folly::IOBufQueue& buf) override {
                                     ^~~~
/home/mohamed/Desktop/Bachelor/mvfst-rl/third-party/mvfst/quic/handshake/QuicFizzFactory.cpp:17:37: warning:   by ‘folly::Optional<fizz::TLSMessage> {anonymous}::QuicPlaintextReadRecordLayer::read(folly::IOBufQueue&)’ [-Woverloaded-virtual]
/home/mohamed/Desktop/Bachelor/mvfst-rl/third-party/mvfst/quic/handshake/QuicFizzFactory.cpp:35:37: error: ‘folly::Optional<fizz::TLSMessage> {anonymous}::QuicEncryptedReadRecordLayer::read(folly::IOBufQueue&)’ marked ‘override’, but does not override
   folly::Optional<fizz::TLSMessage> read(folly::IOBufQueue& buf) override {
                                     ^~~~
/home/mohamed/Desktop/Bachelor/mvfst-rl/third-party/mvfst/quic/handshake/QuicFizzFactory.cpp:35:37: warning:   by ‘folly::Optional<fizz::TLSMessage> {anonymous}::QuicEncryptedReadRecordLayer::read(folly::IOBufQueue&)’ [-Woverloaded-virtual]
quic/handshake/CMakeFiles/mvfst_handshake.dir/build.make:146: recipe for target 'quic/handshake/CMakeFiles/mvfst_handshake.dir/QuicFizzFactory.cpp.o' failed
make[2]: *** [quic/handshake/CMakeFiles/mvfst_handshake.dir/QuicFizzFactory.cpp.o] Error 1
CMakeFiles/Makefile2:3149: recipe for target 'quic/handshake/CMakeFiles/mvfst_handshake.dir/all' failed
make[1]: *** [quic/handshake/CMakeFiles/mvfst_handshake.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 11%] Building CXX object quic/codec/test/CMakeFiles/QuicConnectionIdTest.dir/__/__/common/test/TestMain.cpp.o
[ 11%] Building CXX object quic/codec/test/CMakeFiles/QuicIntegerTest.dir/__/__/common/test/TestMain.cpp.o
[ 12%] Linking CXX executable QuicConnectionIdTest
[ 13%] Building CXX object quic/codec/test/CMakeFiles/PacketNumberTest.dir/__/__/common/test/TestMain.cpp.o
[ 13%] Linking CXX executable QuicIntegerTest
[ 13%] Linking CXX executable PacketNumberTest
[ 13%] Built target QuicConnectionIdTest
[ 13%] Built target QuicIntegerTest
[ 13%] Built target PacketNumberTest
Makefile:159: recipe for target 'all' failed
make: *** [all] Error 2

Questions with evaluation

Hi,

When I evaluate a trained model using the command "python3 -m train.train mode=test base_logdir=/tmp/logs", I get errors like this:

"
FileNotFoundError: [Errno 2] No such file or directory: '/checkpoint/jerry/mvfst-rl/run/2022-05-11_17-21-50'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jerry/.conda/envs/mvfst-rl/lib/python3.8/pathlib.py", line 1288, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/checkpoint/jerry/mvfst-rl/run'

During handling of the above exception, another exception occurred:
"

After training the model, the checkpoint.tar file is in /tmp/logs/, not /checkpoint/jerry/mvfst-rl/. Why does it go to /checkpoint/jerry/mvfst-rl/run/ to find the checkpoint.tar?

Also, what are the uses of checkpoint.tar, traced_model.pt, and traced_model.flags.pkl? Is checkpoint.tar a file with parameters, and traced_model.pt a model file? I'm confused.

Thank you!

AttributeError: module 'nest' has no attribute 'map'

Hello. I appreciate your previous suggestions and help in building the project.
The final lines of the output after I run ./setup.sh are shown below.
-- Generating done
-- Build files have been written to: /home/zongshen/Project/mvfst-rl/_build/build
+ make -j 24
[ 6%] Building CXX object third-party/CMakeFiles/rpcenv_pb.dir/torchbeast/torchbeast/rpc.grpc.pb.cc.o
[ 13%] Building CXX object third-party/CMakeFiles/rpcenv_pb.dir/torchbeast/torchbeast/rpc.pb.cc.o
[ 20%] Linking CXX static library librpcenv_pb.a
[ 20%] Built target rpcenv_pb
[ 26%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/NetworkState.cpp.o
[ 33%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/CongestionControlFixedCwndEnv.cpp.o
[ 40%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/CongestionControlLocalEnv.cpp.o
[ 46%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/CongestionControlEnv.cpp.o
[ 53%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/CongestionControlEnvConfig.cpp.o
[ 60%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/RLBandwidthSampler.cpp.o
[ 66%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/RLCongestionController.cpp.o
[ 73%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/CongestionControlRPCEnv.cpp.o
[ 80%] Building CXX object congestion_control/CMakeFiles/rl_congestion_control.dir/Utils.cpp.o
[ 86%] Linking CXX static library librl_congestion_control.a
[ 86%] Built target rl_congestion_control
[ 93%] Building CXX object traffic_gen/CMakeFiles/traffic_gen.dir/main.cpp.o
[100%] Linking CXX executable traffic_gen
[100%] Built target traffic_gen
+ echo -e 'Done building.'
Done building.
Based on this, I assume that the build with training support completed correctly.
However, when I run
python3 -m train.train \
mode=train \
total_steps=1_000_000 \
num_actors=40 \
hydra.run.dir=/tmp/logs
I got the following error saying that the module 'nest' has no attribute 'map'.
[INFO:366242 pantheon_env:233 2022-03-11 01:24:31,743] Using all 18 jobs.
[INFO:366242 pantheon_env:184 2022-03-11 01:24:31,743] Launching 18 jobs over 40 threads for train.
[INFO:366242 pantheon_env:248 2022-03-11 01:24:31,751] Located python2 in /usr/bin
[INFO:366242 pantheon_env:115 2022-03-11 01:24:31,753] Thread: 0, episode: 0, experiment: 6, cmd: /home/zongshen/Project/mvfst-rl/_build/deps/pantheon/src/experiments/test.py local --data-dir /tmp/logs/train/train_tid0_run0_expt6 --pkill-cleanup --uplink-trace /home/zongshen/Project/mvfst-rl/train/traces/12mbps.trace --downlink-trace /home/zongshen/Project/mvfst-rl/train/traces/12mbps.trace --prepend-mm-cmds mm-delay 10 --extra-mm-link-args --uplink-queue=droptail --uplink-queue-args=packets=1 --downlink-queue=droptail --downlink-queue-args=packets=1 --schemes=mvfst_rl --run-times=1 --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_actor_id=0 --cc_env_job_id=6 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_log_ratio=True --cc_env_reward_throughput_factor=1.0 --cc_env_reward_throughput_log_offset=1e-05 --cc_env_reward_delay_factor=0.2 --cc_env_reward_delay_log_offset=1e-05 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_packet_loss_log_offset=1e-05 --cc_env_reward_max_delay=True --cc_env_fixed_cwnd=10 --cc_env_min_rtt_window_length_us=10000000000 -v=1"
[DEBUG:366241 cmd:870 2022-03-11 01:24:31,810] Popen(['git', 'version'], cwd=/home/zongshen/Project/mvfst-rl, universal_newlines=False, shell=None, istream=None)
[DEBUG:366241 cmd:870 2022-03-11 01:24:31,817] Popen(['git', 'version'], cwd=/home/zongshen/Project/mvfst-rl, universal_newlines=False, shell=None, istream=None)
[DEBUG:366241 cmd:870 2022-03-11 01:24:31,825] Popen(['git', 'cat-file', '--batch-check'], cwd=/home/zongshen/Project/mvfst-rl, universal_newlines=False, shell=None, istream=<valid stream>)
[DEBUG:366241 cmd:870 2022-03-11 01:24:31,832] Popen(['git', 'diff', '--cached', '--abbrev=40', '--full-index', '--raw'], cwd=/home/zongshen/Project/mvfst-rl, universal_newlines=False, shell=None, istream=None)
[DEBUG:366241 cmd:870 2022-03-11 01:24:31,837] Popen(['git', 'diff', '--abbrev=40', '--full-index', '--raw'], cwd=/home/zongshen/Project/mvfst-rl, universal_newlines=False, shell=None, istream=None)
$ /home/zongshen/Project/mvfst-rl/_build/deps/pantheon/src/experiments/git_summary.sh Testing scheme mvfst_rl for experiment run 1/1...
$ /home/zongshen/Project/mvfst-rl/_build/deps/pantheon/src/wrappers/mvfst_rl.py run_first
[INFO:366241 learner:578 2022-03-11 01:24:32,011] Writing logs to /tmp/logs/train/logs.tsv
[INFO:366241 pantheon_env:233 2022-03-11 01:24:32,012] Using all 18 jobs.
[tunnel server manager (tsm)] $ python /home/zongshen/Project/mvfst-rl/_build/deps/pantheon/src/experiments/tunnel_manager.py
Process Process-2:
Traceback (most recent call last):
File "/home/zongshen/anaconda3/envs/mvfst-rl/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/zongshen/anaconda3/envs/mvfst-rl/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/zongshen/Project/mvfst-rl/train/learner.py", line 1108, in main
return _main(
File "/home/zongshen/Project/mvfst-rl/train/learner.py", line 1154, in _main
learner_loop(flags, rank, barrier, gossip_buffer, stop_event)
File "/home/zongshen/Project/mvfst-rl/train/learner.py", line 603, in learner_loop
dummy_env_output = nest.map(
AttributeError: module 'nest' has no attribute 'map'
[2022-03-11 01:24:32,065][root][INFO] - Stop event #0 set, will kill corresponding env (pid=366242)
[2022-03-11 01:24:32,201][root][INFO] - Done training.
Error executing job with overrides: ['mode=train', 'total_steps=1_000_000', 'num_actors=40']
Traceback (most recent call last):
File "/home/zongshen/Project/mvfst-rl/train/train.py", line 319, in main
test(flags)
File "/home/zongshen/Project/mvfst-rl/train/train.py", line 244, in test
init_logdirs(flags)
File "/home/zongshen/Project/mvfst-rl/train/train.py", line 95, in init_logdirs
assert os.path.exists(
AssertionError: Checkpoint /tmp/logs/checkpoint.tar missing in test mode

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Could you please help me with this?
Thank you.

questions about training and using the trained model

1- While training I have been facing this error (screenshot attached):
onObservation: Still waiting for an update from ActorPoolServer, skipping observation
It seems that sometimes the agent takes more time than usual and does not return the action for the previous observation; it occurs a lot in each episode. It was said that this rarely occurs, so is there a reason for this?


2- Is it possible to use the exported model in a real environment? Can I, for example, run an mvfst server that uses the trained model?

model mismatch

Excuse me, I'm running the traffic_gen server and client with cc_algo='rl', loading the traced_model.pt provided in mvfst-rl/models, but a size-mismatch problem emerged. Could you tell me if I missed anything, or whether I should set any params before testing? Thank you very much.

_build/build/traffic_gen/traffic_gen --mode=client
i'm really using rl
I1111 16:54:13.475431 13896 ExampleClient.h:151] ExampleClient connecting to [::1]:6666
I1111 16:54:13.475687 13896 RLCongestionControllerFactory.h:34] Creating RLCongestionController
I1111 16:54:13.475867 13896 CongestionControlLocalEnv.cpp:21] Loading traced model from models/traced_model.pt
I1111 16:54:13.856813 13896 ExampleClient.h:81] ExampleClient connected to [::1]:6666
I1111 16:54:13.856878 13896 ExampleClient.h:178] ExampleClient wrote "hello", len=5 on stream=0
W1111 16:54:14.123860 13896 CongestionControlLocalEnv.cpp:36] onObservation: Still waiting for an update from model, skipping observation
terminate called after throwing an instance of 'std::runtime_error'
what(): size mismatch, m1: [1 x 112], m2: [220 x 512] at ../../aten/src/TH/generic/THTensorMath.cpp:752
The above operation failed in interpreter, with the following stack trace:
at :2:25

  def addmm(self: Tensor, mat1: Tensor, mat2: Tensor, beta: number = 1.0, alpha: number = 1.0):
      return self + mat1.mm(mat2)
                    ~~~~~~~ <--- HERE

  def batch_norm(input : Tensor, running_mean : Optional[Tensor], running_var : Optional[Tensor], training : bool, momentum : float, eps : float) -> Tensor:
      if training:
          norm_mean, norm_var = torch.batch_norm_update_stats(input, running_mean, running_var, momentum)
      else:
          norm_mean = torch._unwrap_optional(running_mean)
          norm_var = torch._unwrap_optional(running_var)
      norm_mean = torch._ncf_unsqueeze(norm_mean, input.dim())
      norm_var = torch._ncf_unsqueeze(norm_var, input.dim())

The above operation failed in interpreter, with the following stack trace:

*** Aborted at 1573462454 (unix time) try "date -d @1573462454" if you are using GNU date ***
PC: @ 0x7fdd0a003e97 gsignal
*** SIGABRT (@0x3e800003647) received by PID 13895 (TID 0x7fdd0634a700) from PID 13895; stack trace: ***
@ 0x7fdd1855f890 (unknown)
@ 0x7fdd0a003e97 gsignal
@ 0x7fdd0a005801 abort
@ 0x7fdd0a9f8957 (unknown)
@ 0x7fdd0a9feab6 (unknown)
@ 0x7fdd0a9feaf1 std::terminate()
@ 0x7fdd0a9fed24 __cxa_throw
W1111 16:54:14.249354 13896 CongestionControlLocalEnv.cpp:36] onObservation: Still waiting for an update from model, skipping observation
@ 0x7fdd0dce76a0 torch::jit::InterpreterStateImpl::handleError()
@ 0x7fdd0dceeb4e torch::jit::InterpreterStateImpl::runImpl()
@ 0x7fdd0dce2cfc torch::jit::InterpreterState::run()
@ 0x7fdd0dcbf2f3 torch::jit::GraphExecutorImplBase::run()
@ 0x7fdd0dedc896 torch::jit::script::Method::operator()()
@ 0x560ecdc205c6 (unknown)
@ 0x7fdd0aa2966f (unknown)
@ 0x7fdd185546db start_thread
@ 0x7fdd0a0e688f clone

How to start up remote mode

Hello, thanks for your interesting work.
I find that when I run the training commands, mvfst-rl reads the content of the file experiment.yml. Then the entry common_param_set: >- local --data-dir {data_dir} --pkill-cleanup starts up the local mode of Pantheon and runs the whole training.
My question is: what should I do if I want to use the remote mode of Pantheon? I cannot find the port on which mvfst-rl would start up Pantheon's remote mode for training. Is there anything I missed, or should I implement this myself? Looking forward to your reply. Thanks so much.

Questions about input

Dear author:
I'm sorry to disturb you again. In the paper, the input is a 21-dimensional state space, and the value of the reward is not used as an input. But in polybeast.py in the train folder, "core_input = torch.cat([x, clipped_reward], dim=-1)" seems to use the reward as input. Is that right? What is the purpose of this? Looking forward to your answer. Thank you very much!

Questions about models

Dear author,
Sorry to bother you. I want to use your trained reinforcement learning model for congestion experiments in an actual environment, instead of testing in Pantheon. However, I found that, on the one hand, when the delay reaches 40 ms, the model cannot control it. On the other hand, when there is no delay, the value of cwnd keeps doubling until it reaches the maximum value we set (even if we set that value to be large). I feel this phenomenon is a bit strange. May I ask whether your model has been tested in an actual environment, and whether the two phenomena above are normal? Looking forward to your answer, thank you!

Renaming `master` branch to `main`

As a part of a broad effort to avoid insensitive terminology in our software, we are renaming our default branch from master to main. We recognize that this is only a small step, but it is an opportunity to make our project and community more welcoming to historically marginalized communities.

How does this impact my development process?

There should be very little impact. GitHub will surface the branch name change in your fork, if you have one. For new forks, you will automatically have main as the default branch.

We encourage the use of feature branches for local development. The only change in practice is changing which branch your feature branch is started from. When sending Pull Requests on GitHub, the target will default to our main branch, so there are no changes to make there.

I have a lot of tools that depend on master being the upstream branch name. How can I fix that?

master has always been only a default value and a number of projects have used other names for their primary development branch for years. We encourage updating your tooling to instead dynamically determine the branch to use. This article provides insight into how you can do that. Additionally, you can always set up a branch locally of any name to track our main branch.

I'd like to do this for my own projects, do you have any documentation on how this works?

GitHub has published a guide documenting their tooling. We recommend reading that and the accompanying documentation.

If you're a Facebook employee looking to do this for a project you maintain, please reach out to the Open Source Team.

run setup.sh error

Hello! Thank you for providing this highly integrated platform.
I tried to build the project in a conda environment with Python 3.8 on Ubuntu 20.04 LTS (gcc version: 7.5.0).
When I run ./setup.sh, the following error occurs:

g++ random.o memory.o memoryrange.o rat.o whisker.o whiskertree.o udp-socket.o traffic-generator.o remycc.o markoviancc.o estimators.o rtt-window.o sender.o protobufs-default/dna.pb.o -o sender -ljemalloc -lm -pthread -lprotobuf -lpthread -ljemalloc
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers11WhiskerTreeE[_ZTVN11RemyBuffers11WhiskerTreeE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers11WhiskerTreeE[_ZTVN11RemyBuffers11WhiskerTreeE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers11MemoryRangeE[_ZTVN11RemyBuffers11MemoryRangeE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers11MemoryRangeE[_ZTVN11RemyBuffers11MemoryRangeE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers6MemoryE[_ZTVN11RemyBuffers6MemoryE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers6MemoryE[_ZTVN11RemyBuffers6MemoryE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers7WhiskerE[_ZTVN11RemyBuffers7WhiskerE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers7WhiskerE[_ZTVN11RemyBuffers7WhiskerE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers19OptimizationSettingE[_ZTVN11RemyBuffers19OptimizationSettingE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers19OptimizationSettingE[_ZTVN11RemyBuffers19OptimizationSettingE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers20OptimizationSettingsE[_ZTVN11RemyBuffers20OptimizationSettingsE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers20OptimizationSettingsE[_ZTVN11RemyBuffers20OptimizationSettingsE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers5RangeE[_ZTVN11RemyBuffers5RangeE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers5RangeE[_ZTVN11RemyBuffers5RangeE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers11ConfigRangeE[_ZTVN11RemyBuffers11ConfigRangeE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers11ConfigRangeE[_ZTVN11RemyBuffers11ConfigRangeE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers9NetConfigE[_ZTVN11RemyBuffers9NetConfigE]+0x20): undefined reference to `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
/usr/bin/ld: protobufs-default/dna.pb.o:(.data.rel.ro._ZTVN11RemyBuffers9NetConfigE[_ZTVN11RemyBuffers9NetConfigE]+0x58): undefined reference to `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
collect2: error: ld returned 1 exit status
makepp: error: Failed to build target `/home/jerry/project/mvfst-rl/mvfst-rl/_build/deps/pantheon/third_party/genericCC/sender' [1]
makepp: 17 files updated and 1 target failed
Traceback (most recent call last):
File "/home/jerry/project/mvfst-rl/mvfst-rl/_build/deps/pantheon/src/wrappers/copa.py", line 53, in
main('do_ss:auto:0.5')
File "/home/jerry/project/mvfst-rl/mvfst-rl/_build/deps/pantheon/src/wrappers/copa.py", line 29, in main
check_call(['makepp', 'all', '--no-builtin-rules'], cwd=cc_repo)
File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['makepp', 'all', '--no-builtin-rules']' returned non-zero exit status 1
Traceback (most recent call last):
File "./src/experiments/setup.py", line 59, in
main()
File "./src/experiments/setup.py", line 55, in main
setup(args)
File "./src/experiments/setup.py", line 45, in setup
check_call([cc_src, 'setup'])
File "/home/jerry/project/mvfst-rl/mvfst-rl/_build/deps/pantheon/src/helpers/subprocess_wrappers.py", line 24, in check_call
return subprocess.check_call(cmd, **kwargs)
File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/jerry/project/mvfst-rl/mvfst-rl/_build/deps/pantheon/src/wrappers/copa.py', 'setup']' returned non-zero exit status 1

It seems some Google protobuf functions are not found. Can you help me solve the problem?
Thanks a lot!

Problem when running RL-server and Client

When running a server that uses the reinforcement learning congestion control algorithm, I ran into a small problem using 'models/traced_model.pt' and 'local' mode:

I1123 15:53:53.327754 1776259 RLCongestionControllerFactory.h:36] Creating RLCongestionController
I1123 15:53:53.328002 1776259 CongestionControlLocalEnv.cpp:21] Loading traced model from /home/ustc-4/mvfst-rl/models/traced_model.pt
I1123 15:53:53.362632 1776259 tperf.cpp:420] Starting sends to client.
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "code/traced_model.py", line 24, in forward
  _9, x, _10, = torch.values(argument_1)
  _11, _12, = argument_2
  T = ops.prim.NumToTensor(torch.size(x, 0))
                           ~~~~~~~~~~ <--- HERE
  _13 = int(T)
  _14 = int(T)
RuntimeError: dim() called on undefined Tensor

I tried to figure out the error, but there is not much information in the 'traced_model.pt' file. I wonder if it is due to a mismatch between the run-time input and GitHub's 'models/traced_model.pt'. I'd appreciate any pointers.

questions about the test result

Dear Author:
I think I have successfully built the mvfst-rl environment with your help, thank you very much. Then I used the model in the models folder to test it in Pantheon. But I found that under the environmental condition "mm-delay 30 mm-link 12mbps.trace 12mbps.trace --uplink-queue=droptail --uplink-queue-args=bytes=90000" (and other environmental conditions), mvfst-rl has higher throughput, but the packet loss rate is also high. The results are in the following table:

| scheme | # runs | mean avg tput (Mbit/s) | mean 95th-%ile delay (ms) | mean loss rate (%) |
| --- | --- | --- | --- | --- |
| TCP BBR | 3 | 11.69 | 67.85 | 0.65 |
| TCP Cubic | 3 | 11.92 | 88.64 | 0.35 |
| FillP | 3 | 7.56 | 89.23 | 14.63 |
| FillP-Sheep | 3 | 8.01 | 86.73 | 4.61 |
| Indigo | 3 | 11.36 | 40.14 | 0.13 |
| LEDBAT | 3 | 11.33 | 89.38 | 0.24 |
| mvfst-bbr | 3 | 11.91 | 91.19 | 48.06 |
| mvfst-copa | 3 | 2.86 | 38.67 | 0.11 |
| mvfst-rl | 3 | 11.87 | 91.39 | 86.5 |
| PCC-Allegro | 3 | 10.04 | 33.23 | 0.89 |
| PCC-Expr | 3 | 10.91 | 81.76 | 1.41 |
| PCC-Vivace | 3 | 10.9 | 37.12 | 0.11 |

I want to know whether this result means that my environment is not configured correctly, or that the model trained through reinforcement learning is just not very friendly to this network condition. Could you give me the Pantheon test report for reference? My email address is: [email protected]
I look forward to your answer, thank you again.

Question about build grpc

Dear Author:
Hello, I encountered grpc issues while rebuilding the mvfst-rl environment. Does the manual installation mean installing the protobuf referenced in install_grpc.sh yourself, or searching for how to install grpc? If it is installed manually, does the conda-installed protobuf from setup.sh conflict with the manual installation? Also, is grpc installed in the mvfst-rl environment? I have been struggling with this for a day and still haven't solved the problem. Hope to get your help, thank you very much.

questions about fairness

Dear Author:
I'm sorry to bother you. I want to further analyze mvfst-rl. I want to know if fairness is considered when designing algorithms and training models. If so, how did you do it? Looking forward to your answer. Best wishes!

Issue with starting the pantheon network

Hi @viswanathgs,
I recently set up the mvfst-rl training environment and tried the training command, and I am getting the error below when starting the Pantheon env: test.py: error: unrecognized arguments: --extra-sender-args.

python3 -m train.train --mode=train --base_logdir=/tmp/logs --total_steps=1 --learning_rate=0.00001 --num_actors=4 --cc_env_history_size=20 2>&1 | tee train_log.txt

I have pasted the output log below.

/home/jogin/Desktop/mvfst-rl/train/utils.py:39: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
cfg.read(), {"src_dir": SRC_DIR, "pantheon_root": PANTHEON_ROOT}
[INFO:31113 train:229 2019-11-24 17:24:47,653] Mode=train
[INFO:31113 train:167 2019-11-24 17:24:48,460] Starting agent 0 on device cpu. Mode=train, logdir=/tmp/logs/train
/home/jogin/Desktop/mvfst-rl/train/utils.py:39: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
cfg.read(), {"src_dir": SRC_DIR, "pantheon_root": PANTHEON_ROOT}
[DEBUG:31125 cmd:728 2019-11-24 17:24:49,860] Popen(['git', 'cat-file', '--batch-check'], cwd=/home/jogin/Desktop/mvfst-rl, universal_newlines=False, shell=None, istream=<valid stream>)
/home/jogin/Desktop/mvfst-rl/train/utils.py:39: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
cfg.read(), {"src_dir": SRC_DIR, "pantheon_root": PANTHEON_ROOT}
[INFO:31126 pantheon_env:359 2019-11-24 17:24:49,863] Using all 18 jobs.
[INFO:31126 pantheon_env:235 2019-11-24 17:24:49,866] Launching 18 jobs over 4 threads for train.
[DEBUG:31125 cmd:728 2019-11-24 17:24:49,867] Popen(['git', 'diff', '--cached', '--abbrev=40', '--full-index', '--raw'], cwd=/home/jogin/Desktop/mvfst-rl, universal_newlines=False, shell=None, istream=None)
[DEBUG:31125 cmd:728 2019-11-24 17:24:49,880] Popen(['git', 'diff', '--abbrev=40', '--full-index', '--raw'], cwd=/home/jogin/Desktop/mvfst-rl, universal_newlines=False, shell=None, istream=None)
[INFO:31126 pantheon_env:277 2019-11-24 17:24:49,902] Located python2 in /usr/bin
[INFO:31126 pantheon_env:177 2019-11-24 17:24:49,904] Thread: 0, episode: 0, experiment: 15, cmd: /home/jogin/Desktop/mvfst-rl/_build/deps/pantheon/src/experiments/test.py local --data-dir /tmp/logs/train/train_tid0_run0_expt15 --pkill-cleanup --uplink-trace /home/jogin/Desktop/mvfst-rl/train/traces/12mbps.trace --downlink-trace /home/jogin/Desktop/mvfst-rl/train/traces/12mbps.trace --prepend-mm-cmds mm-delay 30 --extra-mm-link-args --uplink-queue=droptail --uplink-queue-args=bytes=30000 --schemes=mvfst_rl --run-times=1 --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_throughput_factor=1.0 --cc_env_reward_delay_factor=0.2 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_max_delay=True -v=1"
usage: test.py [-h] [-c CONFIG] {local,remote} ...
test.py: error: unrecognized arguments: --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_throughput_factor=1.0 --cc_env_reward_delay_factor=0.2 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_max_delay=True -v=1"
[INFO:31126 pantheon_env:177 2019-11-24 17:24:50,415] Thread: 0, episode: 1, experiment: 5, cmd: /home/jogin/Desktop/mvfst-rl/_build/deps/pantheon/src/experiments/test.py local --data-dir /tmp/logs/train/train_tid0_run1_expt5 --pkill-cleanup --uplink-trace /home/jogin/Desktop/mvfst-rl/train/traces/114.68mbps.trace --downlink-trace /home/jogin/Desktop/mvfst-rl/train/traces/114.68mbps.trace --prepend-mm-cmds mm-delay 45 --extra-mm-link-args --uplink-queue=droptail --uplink-queue-args=packets=450 --schemes=mvfst_rl --run-times=1 --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_throughput_factor=1.0 --cc_env_reward_delay_factor=0.2 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_max_delay=True -v=1"
Creating log directory: /tmp/logs/train/torchbeast/torchbeast-20191124-172449
[INFO:31125 file_writer:87 2019-11-24 17:24:50,828] Creating log directory: /tmp/logs/train/torchbeast/torchbeast-20191124-172449
Symlinked log directory: /tmp/logs/train/torchbeast/latest
[INFO:31125 file_writer:99 2019-11-24 17:24:50,829] Symlinked log directory: /tmp/logs/train/torchbeast/latest
Saving arguments to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/meta.json
[INFO:31125 file_writer:111 2019-11-24 17:24:50,829] Saving arguments to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/meta.json
Saving messages to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/out.log
[INFO:31125 file_writer:119 2019-11-24 17:24:50,834] Saving messages to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/out.log
Saving logs data to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/logs.csv
[INFO:31125 file_writer:129 2019-11-24 17:24:50,834] Saving logs data to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/logs.csv
Saving logs' fields to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/fields.csv
[INFO:31125 file_writer:130 2019-11-24 17:24:50,834] Saving logs' fields to /tmp/logs/train/torchbeast/torchbeast-20191124-172449/fields.csv
[INFO:31125 polybeast:437 2019-11-24 17:24:50,835] Not using CUDA.
[INFO:31126 pantheon_env:277 2019-11-24 17:24:50,910] Located python2 in /usr/bin
[INFO:31126 pantheon_env:177 2019-11-24 17:24:50,912] Thread: 1, episode: 0, experiment: 8, cmd: /home/jogin/Desktop/mvfst-rl/_build/deps/pantheon/src/experiments/test.py local --data-dir /tmp/logs/train/train_tid1_run0_expt8 --pkill-cleanup --uplink-trace /home/jogin/Desktop/mvfst-rl/train/traces/108mbps.trace --downlink-trace /home/jogin/Desktop/mvfst-rl/train/traces/108mbps.trace --prepend-mm-cmds mm-delay 10 --extra-mm-link-args --uplink-queue=droptail --uplink-queue-args=packets=1 --downlink-queue=droptail --downlink-queue-args=packets=1 --schemes=mvfst_rl --run-times=1 --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_throughput_factor=1.0 --cc_env_reward_delay_factor=0.2 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_max_delay=True -v=1"
usage: test.py [-h] [-c CONFIG] {local,remote} ...
test.py: error: unrecognized arguments: --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_throughput_factor=1.0 --cc_env_reward_delay_factor=0.2 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_max_delay=True -v=1"
[INFO:31126 pantheon_env:177 2019-11-24 17:24:51,243] Thread: 0, episode: 2, experiment: 9, cmd: /home/jogin/Desktop/mvfst-rl/_build/deps/pantheon/src/experiments/test.py local --data-dir /tmp/logs/train/train_tid0_run2_expt9 --pkill-cleanup --uplink-trace /home/jogin/Desktop/mvfst-rl/train/traces/12mbps.trace --downlink-trace /home/jogin/Desktop/mvfst-rl/train/traces/12mbps.trace --prepend-mm-cmds mm-delay 50 --extra-mm-link-args --uplink-queue=droptail --uplink-queue-args=packets=1 --downlink-queue=droptail --downlink-queue-args=packets=1 --schemes=mvfst_rl --run-times=1 --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_throughput_factor=1.0 --cc_env_reward_delay_factor=0.2 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_max_delay=True -v=1"
Server listening on unix:/tmp/rl_server_path_0
usage: test.py [-h] [-c CONFIG] {local,remote} ...
test.py: error: unrecognized arguments: --extra-sender-args="--cc_env_mode=remote --cc_env_rpc_address=unix:/tmp/rl_server_path_0 --cc_env_model_file=/tmp/logs/traced_model.pt --cc_env_agg=time --cc_env_time_window_ms=100 --cc_env_fixed_window_size=10 --cc_env_use_state_summary=True --cc_env_history_size=20 --cc_env_norm_ms=100.0 --cc_env_norm_bytes=1000.0 --cc_env_actions=0,/2,-10,+10,*2 --cc_env_reward_throughput_factor=1.0 --cc_env_reward_delay_factor=0.2 --cc_env_reward_packet_loss_factor=0.0 --cc_env_reward_max_delay=True -v=1"

Some questions about this method

Hello, first of all, thank you very much for sharing your project. I would like to describe some questions that came up while reproducing it:

  1. When we train under a single trace, we find that it is difficult to converge and we cannot achieve the results of the paper. After training, the model always uses a single action, such as *2 or /2, instead of +10 or -10; or vice versa.
  2. When reading the code, we found that you truncate the input before the LSTM to [-1, 1]. For the 108 Mbps bandwidth environment, the reward will basically exceed this range, so will this hard truncation cause problems? In your paper, Figure 3 shows that the reward can exceed -25000, which we do not understand.
  3. We have made a lot of modification attempts (input-level modifications, network structure modifications, various reward modifications), but most of the time training does not converge, let alone training multiple traces together.
  4. Congestion control should not be a particularly complex task. We have made many attempts with your method, but none of them converge well (even in a single environment). Is reinforcement learning really this hard to get working (this is also our first contact with reinforcement learning), or are there limitations of your method that we don't know about?

Looking forward to your reply.

Pantheon sprout protobuf incompatible when running ./setup.sh

Hello! Thank you for providing this highly integrated platform.
I tried to build the project in a conda environment with Python 3.8 on Ubuntu 20.04 LTS.
However, I ran into protobuf incompatibility errors when building sprout in _build/deps/pantheon.
The error log says:
"userinput.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
12 | #error This file was generated by a newer version of protoc which is
... ...
... ..."

Pantheon requires Python 2.7, but pantheon/src/wrappers/sprout.py --setup says it needs to use the conda version of protobuf.
However, my current Python environment setup is:
base: python-2.7, protoc-3.6.1
mvfst-rl conda: python-3.8, protoc-3.8.0
So I think Pantheon currently runs python-2.7 from base while picking up protoc-3.8.0 from conda.
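
For reference, a quick diagnostic sketch (assuming protoc is on PATH and a protobuf runtime is installed for each interpreter): running it once with python2 and once with python3 shows which protoc binary and which protobuf runtime each environment actually picks up.

# Run with both python2 and python3 to compare compiler and runtime versions.
import subprocess
import google.protobuf

print("protoc binary:  ", subprocess.check_output(["which", "protoc"]).strip())
print("protoc version: ", subprocess.check_output(["protoc", "--version"]).strip())
print("runtime version:", google.protobuf.__version__)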
May I know how to solve this problem?

Thank you again.

run error

Hello, I read your article carefully and am very interested in this project. I want to run and test it, but I encountered some problems. When I run ./setup.sh --inference I get:
./setup.sh: line 39: POSITIONAL[@]: Unbound variable
I then commented out the line set -- "${POSITIONAL[@]}" # Restore positional parameters,
and then I get:
"
Inference-only build
Installing libtorch CPU-only build into /home/python/mvfst-rl-master/_build/deps/libtorch
2019-12-01 02:48:17 URL:https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.2.0.zip [81767660/81767660] -> "libtorch-cxx11-abi-shared-with-deps-1.2.0.zip.7" [1]
Archive: libtorch-cxx11-abi-shared-with-deps-1.2.0.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of libtorch-cxx11-abi-shared-with-deps-1.2.0.zip or
libtorch-cxx11-abi-shared-with-deps-1.2.0.zip.zip, and cannot find libtorch-cxx11-abi-shared-with-deps-1.2.0.zip.ZIP, period.
"
What should I do to run this project? Looking forward to hearing from you.
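
It looks like the libtorch archive was only partially downloaded. As a quick illustrative check (not part of setup.sh), the file can be verified before unzipping; if the check fails, delete the partial download and re-run the setup so the archive is fetched again.

# Verify the downloaded archive is a complete zip file before unzipping.
import zipfile

print(zipfile.is_zipfile("libtorch-cxx11-abi-shared-with-deps-1.2.0.zip"))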

Questions about train trace

Dear Author:
I have a question about the training trace. Opening 12mbps.trace in VS Code, I see that the data are all numbers. How are these numbers generated? Looking forward to your answer, thank you very much.
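
For reference, a hedged sketch of how such a constant-rate trace could be generated, assuming the mahimahi trace format in which each line is a millisecond timestamp granting one MTU-sized (1500-byte) packet delivery opportunity; at 12 Mbit/s this works out to exactly one packet per millisecond.

# Illustrative generator for a constant-rate mahimahi-style trace (assumed format:
# one millisecond timestamp per line, each allowing one 1500-byte packet).
PACKET_BITS = 1500 * 8
rate_mbps = 12
duration_ms = 60000

packets_per_ms = rate_mbps * 1000000 / float(PACKET_BITS * 1000)  # 1.0 at 12 Mbit/s

with open("12mbps_example.trace", "w") as f:
    credit = 0.0
    for ms in range(1, duration_ms + 1):
        credit += packets_per_ms
        while credit >= 1.0:
            f.write("%d\n" % ms)
            credit -= 1.0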

question about training

Dear author, I am very interested in your project. However, when I set up the mvfst-rl training environment and run the training command, I get the error "AssertionError: Checkpoint /tmp/logs/checkpoint.tar missing in trace mode" and training cannot proceed. I am confused by this error and would like to know whether there is a solution. Looking forward to your answer, thank you very much.

Error occurred while building mvfst

Hello. I encountered the following error while building the fizz library. How can I solve it?
Environment: Ubuntu 22.04, gcc-12, python3.8

In file included from /usr/include/c++/12/bits/shared_ptr_atomic.h:33,
                 from /usr/include/c++/12/memory:78,
                 from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/Traits.h:23,
                 from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/Optional.h:68,
                 from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/aead/Aead.h:11,
                 from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/aead/OpenSSLEVPCipher.h:12,
                 from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/tool/FizzServerCommand.cpp:10:In member function ‘void std::__atomic_base<_IntTp>::store(__int_type, std::memory_order) [with _ITp = long unsigned int]’,
    inlined from ‘static folly::fbstring_core<Char>::RefCounted* folly::fbstring_core<Char>::RefCounted::create(size_t*) [with Char = char]’ at /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/FBString.h:490:30,
    inlined from ‘static folly::fbstring_core<Char>::RefCounted* folly::fbstring_core<Char>::RefCounted::create(const Char*, size_t*) [with Char = char]’ at /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/FBString.h:497:27,
    inlined from ‘void folly::fbstring_core<Char>::initLarge(const Char*, size_t) [with Char = char]’ at /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/FBString.h:724:40:
/usr/include/c++/12/bits/atomic_base.h:464:25: warning: ‘void __atomic_store_8(volatile void*, long unsigned int, int)’ writing 8 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
  464 |         __atomic_store_n(&_M_i, __i, int(__m));
      |         ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
In file included from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/FBString.h:48,
                 from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/io/IOBuf.h:31,
                 from /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/aead/Aead.h:12:
In function ‘void* folly::checkedMalloc(size_t)’,
    inlined from ‘static folly::fbstring_core<Char>::RefCounted* folly::fbstring_core<Char>::RefCounted::create(size_t*) [with Char = char]’ at /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/FBString.h:489:59,
    inlined from ‘static folly::fbstring_core<Char>::RefCounted* folly::fbstring_core<Char>::RefCounted::create(const Char*, size_t*) [with Char = char]’ at /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/FBString.h:497:27,
    inlined from ‘void folly::fbstring_core<Char>::initLarge(const Char*, size_t) [with Char = char]’ at /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/FBString.h:724:40:
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/memory/Malloc.h:218:19: note: destination object of size 0 allocated by ‘malloc’
  218 |   void* p = malloc(size);
      |             ~~~~~~^~~~~~
/usr/bin/ld: CMakeFiles/BogoShim.dir/test/BogoShim.cpp.o: in function `readSelfCert()':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/test/BogoShim.cpp:294: undefined reference to `EVP_PKEY_id'
/usr/bin/ld: /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/test/BogoShim.cpp:296: undefined reference to `EVP_PKEY_id'
/usr/bin/ld: CMakeFiles/BogoShim.dir/test/BogoShim.cpp.o: in function `folly::ssl::OpenSSLHash::Hmac::hash_final(folly::Range<unsigned char*>)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/ssl/OpenSSLHash.h:120: undefined reference to `EVP_MD_size'
/usr/bin/ld: CMakeFiles/BogoShim.dir/test/BogoShim.cpp.o: in function `folly::ssl::OpenSSLHash::Digest::hash_final(folly::Range<unsigned char*>)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/ssl/OpenSSLHash.h:63: undefined reference to `EVP_MD_size'
/usr/bin/ld: /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/ssl/OpenSSLHash.h:63: undefined reference to `EVP_MD_size'
/usr/bin/ld: /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/ssl/OpenSSLHash.h:63: undefined reference to `EVP_MD_size'
/usr/bin/ld: /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/ssl/OpenSSLHash.h:63: undefined reference to `EVP_MD_size'
/usr/bin/ld: CMakeFiles/BogoShim.dir/test/BogoShim.cpp.o: in function `fizz::OpenSSLSignature<(fizz::KeyType)0>::setKey(std::unique_ptr<evp_pkey_st, folly::static_function_deleter<evp_pkey_st, &EVP_PKEY_free> >)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/signature/Signature-inl.h:187: undefined reference to `EVP_PKEY_id'
/usr/bin/ld: lib/libfizz.a(Signature.cpp.o): in function `fizz::detail::ecSign(folly::Range<unsigned char const*>, std::unique_ptr<evp_pkey_st, folly::static_function_deleter<evp_pkey_st, &EVP_PKEY_free> > const&, int)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/signature/Signature.cpp:49: undefined reference to `EVP_PKEY_size'
/usr/bin/ld: lib/libfizz.a(Signature.cpp.o): in function `fizz::detail::edSign(folly::Range<unsigned char const*>, std::unique_ptr<evp_pkey_st, folly::static_function_deleter<evp_pkey_st, &EVP_PKEY_free> > const&)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/signature/Signature.cpp:97: undefined reference to `EVP_PKEY_size'
/usr/bin/ld: lib/libfizz.a(Signature.cpp.o): in function `fizz::detail::rsaPssSign(folly::Range<unsigned char const*>, std::unique_ptr<evp_pkey_st, folly::static_function_deleter<evp_pkey_st, &EVP_PKEY_free> > const&, int)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/signature/Signature.cpp:172: undefined reference to `EVP_PKEY_size'
/usr/bin/ld: lib/libfizz.a(OpenSSLKeyUtils.cpp.o): in function `fizz::detail::validateEdKey(std::unique_ptr<evp_pkey_st, folly::static_function_deleter<evp_pkey_st, &EVP_PKEY_free> > const&, int)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/openssl/OpenSSLKeyUtils.cpp:42: undefined reference to `EVP_PKEY_base_id'
/usr/bin/ld: lib/libfizz.a(Certificate.cpp.o): in function `fizz::CertUtils::getKeyType(std::unique_ptr<evp_pkey_st, folly::static_function_deleter<evp_pkey_st, &EVP_PKEY_free> > const&)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/protocol/Certificate.cpp:213: undefined reference to `EVP_PKEY_id'
/usr/bin/ld: lib/libfizz.a(Certificate.cpp.o): in function `fizz::CertUtils::makePeerCert(std::unique_ptr<x509_st, folly::static_function_deleter<x509_st, &X509_free> >)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/protocol/Certificate.cpp:127: undefined reference to `EVP_PKEY_id'
/usr/bin/ld: lib/libfizz.a(Certificate.cpp.o): in function `fizz::OpenSSLSignature<(fizz::KeyType)0>::setKey(std::unique_ptr<evp_pkey_st, folly::static_function_deleter<evp_pkey_st, &EVP_PKEY_free> >)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/fizz/fizz/crypto/signature/Signature-inl.h:187: undefined reference to `EVP_PKEY_id'
/usr/bin/ld: lib/libfizz.a(Encryption.cpp.o): in function `folly::ssl::OpenSSLHash::Digest::hash_final(folly::Range<unsigned char*>)':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/include/folly/ssl/OpenSSLHash.h:63: undefined reference to `EVP_MD_size'
/usr/bin/ld: /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/lib/libfolly.a(AsyncSSLSocket.cpp.o): in function `folly::AsyncSSLSocket::getPeerCertificate() const':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/folly/folly/io/async/AsyncSSLSocket.cpp:1010: undefined reference to `SSL_get_peer_certificate'
/usr/bin/ld: /home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/lib/libfolly.a(AsyncSSLSocket.cpp.o): in function `folly::AsyncSSLSocket::getSSLCertSize() const':
/home/aolifuo/cpp/mvfst-rl/third-party/mvfst/_build/deps/folly/folly/io/async/AsyncSSLSocket.cpp:999: undefined reference to `EVP_PKEY_bits'
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/BogoShim.dir/build.make:124: bin/BogoShim] Error 1
make[1]: *** [CMakeFiles/Makefile2:141: CMakeFiles/BogoShim.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

Questions about train bandwidth

Dear Author:
Sorry to bother you again. Because the mvfst-rl model is used for congestion control in different environments, it has to be retrained for each of them. However, when we increase the training trace bandwidth beyond 500 Mbps, our RL model cannot be trained and the whole training process fails to run. Have you tried training with larger bandwidths? From what angles should we analyze why training fails? Looking forward to your answer, thank you very much.

the version of protobuf

Can you tell me the configuration of your environment? A syntax error is always reported during compilation. What version of protobuf are you using?

Test model error

Hello! Thank you for providing this highly integrated platform.
I tried to build the project in a conda environment with Python 3.8 on Ubuntu 20.04 LTS.
I wanted to test the model.
After running ./setup.sh --clean and python3 -m train.train mode=test base_logdir=/tmp/logs, the following error occurred:

AssertionError: Checkpoint /tmp/logs/checkpoint.tar missing in test mode

It seems checkpoint.tar is missing. How do I generate checkpoint.tar?
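
My guess (an assumption, not something the documentation confirms) is that test mode looks for the checkpoint.tar that a previous mode=train run writes into base_logdir, so a training run with the same log directory has to complete first. The assertion seems to amount to roughly this check:

# Hedged sketch of what the assertion appears to verify.
import os

base_logdir = "/tmp/logs"
checkpoint = os.path.join(base_logdir, "checkpoint.tar")
assert os.path.exists(checkpoint), "Checkpoint %s missing in test mode" % checkpoint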

Thanks a lot.
