Comments (37)

ChanganVR avatar ChanganVR commented on June 23, 2024

I realized later that the savi-pretraining model needs to be validated separately from training. And for that reason, I didn't automate this process. But as I mentioned earlier, both of these two steps only need to be trained once and should be the same for variants or ablations of the main model, and thus manually updating the weights shouldn't be too much of a cost.

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

Thanks. It has been fixed.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thanks. Now, I do not get that error, but I get this error:

INFO:root:AudioNavSMTNet ===> Freezing goal, visual, fusion encoders!
Traceback (most recent call last):
  File "ss_baselines/savi/run.py", line 144, in <module>
    main()
  File "ss_baselines/savi/run.py", line 107, in main
    trainer.train()
  File "/home/gyan/Documents/sound-spaces/ss_baselines/savi/ddppo/algo/ddppo_trainer.py", line 239, in train
    self._setup_actor_critic_agent(ppo_cfg)
  File "/home/gyan/Documents/sound-spaces/ss_baselines/savi/ddppo/algo/ddppo_trainer.py", line 147, in _setup_actor_critic_agent
    pretrained_state = torch.load(self.config.RL.DDPPO.pretrained_weights, map_location="cpu")
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/gyan/miniconda3/envs/avn/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data/models/savi/data/ckpt.50.pth'

This is the list of files in savi:
[image]

There is no data folder. I ran python ss_baselines/savi/pretraining/audiogoal_trainer.py --run-type train --model-dir data/models/savi --predict-label before, and these are its logs:
pre-train_audiogoal_savi_log.txt

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

Which config are you using? You need to update the weight path with the best savi model you pretrained.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

I am running the code mentioned here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/README.md.
I am not changing the default config file.
It is not mentioned in the README that I need to update the weight path with the best savi model. There is a file called best_val.pth; is that the best savi model? Shall I rename best_val.pth to ckpt.50 and copy it to data/models/savi/data/? Why is your code looking for data/models/savi/data/ckpt.50.pth?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

I should've been clearer. The model is first pretrained with memory size 1 (savi_pretraining.yaml) and then trained with the full memory size (savi.yaml). You'll need to update ckpt.50 to the best checkpoint from the pretraining when doing the second step.

best_val.pth is the best checkpoint for the second step, not for the first one. I'll update the description accordingly.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

I am not sure how to update ckpt.50 to the best checkpoint. The audiogoal_trainer.py script only generates these files:

best_val.pth  ckpt.1.pth  ckpt.3.pth  ckpt.8.pth
ckpt.0.pth    ckpt.2.pth  ckpt.4.pth  tb

There is no ckpt.50 file. Could you please explain how to update ckpt.50 to the best checkpoint from the pretraining when doing the second step?

Also, you said best_val.pth is the best checkpoint for the second step. The first step is giving me the error above, so I never ran the second step (savi.yaml). How come best_val.pth was generated?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

Somehow, the config is wrong. Just setting this value (

) to False should work for you.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

For your first question, I tried predicting labels alone and predicting both. The classification accuracies were similar. I did end up using the label predictor trained with joint training, but I think this does not make a big difference.

That is definitely okay. It's just that the model directory might change so you need to update the path.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

In the case where only predict_label is True, the model has 21 outputs, each belonging to a class. Why are you computing a loss for the last 2 outputs based on the ground-truth location here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/pretraining/audiogoal_trainer.py#L118? In other words, why in the predict_label case did you use regressor_loss = regressor_criterion(predicts[:, -2:], gts[:, -2:]), whereas in the predict_location case you used classifier_loss = torch.tensor([0], device=self.device)?

[image]

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

That looks like a bug. When predicting the label only, the loss should consist of the classifier loss alone. I'll replace that line with regressor_loss = torch.tensor([0], device=self.device)
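A minimal, framework-free sketch of the loss composition described in this fix. The helper name `combine_losses` and its flags are hypothetical, and plain floats stand in for the tensors used in audiogoal_trainer.py; it only illustrates that a disabled head should contribute zero to the total.

```python
def combine_losses(classifier_loss, regressor_loss, predict_label, predict_location):
    """Return (classifier, regressor, total), zeroing any head that is disabled."""
    if not predict_label:
        classifier_loss = 0.0  # no label head: the classifier term is dropped
    if not predict_location:
        regressor_loss = 0.0   # no location head: the regressor term is dropped
    return classifier_loss, regressor_loss, classifier_loss + regressor_loss


# Label-only pretraining: only the classifier term should remain in the total.
label_only = combine_losses(0.7, 1.9, predict_label=True, predict_location=False)
print(label_only)
```

With both flags set, both terms are kept and summed, which corresponds to the joint training mentioned earlier in the thread.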

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thanks, could you please push this change so that I can re-run the code?

I have a question regarding how to run savi code:

  • First, audiogoal_trainer.py trains a label predictor and generates data/models_temp/savi/best_val.pth
  • Then the savi model is pre-trained with savi_pretraining.yaml with memory size 1, and it will generate 400 .pth files in data/models_temp/savi/data. Does it automatically use best_val.pth, or do I need to modify something to train it?
  • Then the savi model is trained with savi.yaml with the whole memory, and I need to set the pretrained_weights path in savi.yaml to the best checkpoint of pre-training. How do I find the best checkpoint of pre-training? Can you automate saving the best model, like you are doing in audiogoal_trainer.py?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

Yes, I'll make this change and push soon.

Both the pretraining and full training steps take the best model of the previous step to continue training. Yeah, it would be good to automate this process and I'll try to add some code for that.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

So, will savi_pretraining.yaml use data/models/savi/best_val.pth automatically, or do I need to make some change for that?

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

I completed the first step of training the label predictor using audiogoal_trainer.py script. Now, I want to pre-train the savi model with savi_pretraining.yaml, and my question is do I need to make any changes manually to make the savi model use data/models/savi/best_val.pth generated by audiogoal_trainer.py? When I tried removing checkpoints generated by audiogoal_trainer.py, and ran the script to pre-train the savi model with savi_pretraining.yaml, it still seems to train and save checkpoints in data/models/savi/data/, so I am not sure if it is using data/models/savi/best_val.pth or not. Could you please specifically clarify this issue?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

I'm not sure if you're aware of this, but this line of code loads the data:

state_dict = torch.load('data/pretrained_weights/semantic_audionav/savi/label_predictor.pth')

Since I'm specifying the pretrained weights here, you do need to update the path with your own model weights.
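A small sketch of checking a weight path before loading it, so that a stale or default path fails with a clearer message than the bare FileNotFoundError seen earlier in the thread. The helper name `require_checkpoint` is hypothetical and not part of the repository.

```python
import os


def require_checkpoint(path):
    """Return path unchanged if it is an existing file, else raise a clear error."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Checkpoint not found: {path!r}. "
            "Update the weight path to point at your own trained model."
        )
    return path


# Intended use (assumption): wrap the path handed to torch.load, for example
# torch.load(require_checkpoint("data/pretrained_weights/semantic_audionav/savi/label_predictor.pth"))
```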

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thanks for answering. I do not have pretrained_weights in my data directory, but still I was able to pre-train savi model with savi_pretraining.yaml. After training, there were 400 checkpoints in data/savi/data directory. Do you know why your code did not use data/pretrained_weights/semantic_audionav/savi/label_predictor.pth?

I think, it is because in savi_pretraining.yaml, you have set use_belief_predictor to False: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L37. Is it supposed to be True?

Could you please help me train the savi model in the same way published in the paper?

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

@ChanganVR, Could you please answer the questions I asked above?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

@gtatiya yeah, I was just checking the configuration and sorry about the delay. I found it was because when I cleaned up my code, the pretraining configuration somehow got messed up. I just pushed the new config files. This should work now.

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

You can check this commit for the detailed changes I made: 721333f

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thank you for making changes to fix the issue, but I am still having trouble running the savi code. I am trying to run python ss_baselines/savi/run.py --exp-config ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml --model-dir data/models/savi, and before running that, I added pretrained_weights: "data/models/savi/best_val.pth" here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L63. The issue is that the code is stuck and nothing is happening. Please see the logs attached here: pre-train_model_savi_log.txt. I also tried removing pretrained_weights: "data/models/savi/best_val.pth", but the code is still stuck. Before your push, the code was training the model and the checkpoints were getting saved, but now the code is stuck. Could you please fix this issue?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

Hi @gtatiya you don't need to set pretrained_weights for pretraining, this is only needed for finetuning. The goal predictor by default loads this weight:

state_dict = torch.load('data/pretrained_weights/semantic_audionav/savi/label_predictor.pth')

I ran this code again locally and it worked just fine. I'm not sure what happened on your end. It could possibly freeze due to large GPU memory usage, in which case you can reduce the memory size. You could also add some print statements to see where the code is getting stuck.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thank you! Yes, this could be because of GPU memory usage. Could you please tell me how to reduce the memory size?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

@gtatiya there are many parameters you could tweak to reduce the GPU memory usage, including external memory size, hidden feature size, and mini-batch size, at the cost of a performance drop. I'd suggest starting by reducing the external memory size, which in my experience affects GPU memory usage a lot.

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thank you. Could you please specify how to reduce external memory size?

I changed NUM_PROCESSES to 4 here: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/config/semantic_audionav/savi_pretraining.yaml#L3, and the training started, but it again got stuck at the 388th checkpoint. Here are the logs: pre-train_model_savi_log.txt. Could you please help me figure out what the issue is?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

Oh right, I totally forgot about the NUM_PROCESSES parameter. To reduce the external memory size, you just need to change this number:

Based on the log, I can't really tell what was wrong. But since the model weights are saved, are you able to resume the training?

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

I am running savi_pretraining.yaml, and it has memory_size of 1:

You recently made this change, so I ran it again with NUM_PROCESSES = 4, but it still got stuck.

What changes do I need to make to resume training?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

You don't need to make any changes. The resuming function is already implemented in the code:

# Try to resume at previous checkpoint (independent of interrupted states)
count_steps_start, count_checkpoints, start_update = self.try_to_resume_checkpoint()
count_steps = count_steps_start

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thank you. I was able to complete the savi_pretraining.yaml step. But I am facing issues with the savi.yaml step:

  • The README says I need to use the best pretrained checkpoint, but does not mention how to do that. When I use --eval-best True, I get this error:
No max index is found in data/models/savi/tb
Evaluating the best checkpoint: data/models/savi/data/ckpt.-1.pth
Traceback (most recent call last):
  File "ss_baselines/savi/run.py", line 144, in <module>
    main()
  File "ss_baselines/savi/run.py", line 95, in main
    config = get_config(args.exp_config, args.opts, args.model_dir, args.run_type, args.overwrite)
  File "/home/i21_gtatiya/projects/sound-spaces/ss_baselines/savi/config/default.py", line 264, in get_config
    config.merge_from_list(opts)
  File "/home/i21_gtatiya/miniconda3/envs/avn/lib/python3.6/site-packages/yacs/config.py", line 226, in merge_from_list
    cfg_list
  File "/home/i21_gtatiya/miniconda3/envs/avn/lib/python3.6/site-packages/yacs/config.py", line 545, in _assert_with_logging
    assert cond, msg
AssertionError: Override list has odd length: ['True', 'EVAL_CKPT_PATH_DIR', 'data/models/savi/data/ckpt.-1.pth']; it must be a list of pairs
  • When I used pretrained_weights: "data/models/savi/data/ckpt.399.pth", training finishes very quickly. I am not sure if that is supposed to happen. Here are the logs:
    train_model_savi_log.txt

Could you please help?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

--eval-best is for evaluating the best checkpoint on the test set based on the validation curve.

You need to set the weights in here:

pretrained_weights: "data/models/savi/data/ckpt.XXX.pth"

For the second point, see this issue: #51 (comment)
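The "Override list has odd length" assertion in the traceback above is consistent with --eval-best being a boolean flag that takes no value, so the extra "True" falls through into the list of config overrides. A minimal sketch with argparse, assuming the flag is defined with action="store_true" and the overrides are collected with argparse.REMAINDER (both definitions are assumptions about run.py, not copied from it):

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed definitions: a value-less boolean flag plus trailing KEY VALUE overrides.
parser.add_argument("--eval-best", default=False, action="store_true")
parser.add_argument("opts", default=None, nargs=argparse.REMAINDER)

args = parser.parse_args(["--eval-best", "True"])

# The flag consumes no value, so "True" is left over and lands in opts.
# Appending one KEY VALUE pair then yields three items, which is not a list of pairs.
overrides = args.opts + ["EVAL_CKPT_PATH_DIR", "data/models/savi/data/ckpt.-1.pth"]
print(args.eval_best, overrides, len(overrides) % 2)
```

Under these assumptions, passing --eval-best on its own (without "True") avoids the odd-length list. The ckpt.-1.pth in the message is a separate symptom: no validation curve was found in the tb directory.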

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

I am setting pretrained_weights: "data/models/savi/data/ckpt.399.pth", but still I am getting that error. Do you know why?

Could you please specify how to find the best pre-trained checkpoint?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

The best validation checkpoint should be based on the validation curve, that is, you evaluate every checkpoint on the validation set and pick the best one to continue training for the next stage.

If you're talking about the --eval-best error, you'll need to get that curve first.
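Picking the best checkpoint from a validation curve, as described above, can be sketched as follows. The helper `best_checkpoint` is hypothetical, the scores are made-up placeholders rather than real results, and the choice of metric (for example success or SPL) is left to the reader.

```python
def best_checkpoint(val_scores):
    """Return the checkpoint index with the highest validation score."""
    if not val_scores:
        raise ValueError("no validation results; evaluate the checkpoints first")
    return max(val_scores, key=val_scores.get)


# Made-up validation scores keyed by checkpoint index (illustrative only).
scores = {0: 0.05, 50: 0.12, 100: 0.21, 150: 0.19}
best = best_checkpoint(scores)

# The resulting path is what would go into pretrained_weights for the next stage.
weights_path = f"data/models/savi/data/ckpt.{best}.pth"
print(weights_path)
```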

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024

Thank you. Do you have an automated way to evaluate every checkpoint on the validation set? There are 400 checkpoints, so it would be hard to evaluate them manually.

I used the last checkpoint (ckpt.399.pth) from savi_pretraining.yaml and trained savi.yaml, and evaluated on test set using this command: python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml EVAL_CKPT_PATH_DIR data/models/savi/data/ckpt.399.pth EVAL.SPLIT test USE_SYNC_VECENV True RL.DDPPO.pretrained False, and the results were:

2021-08-26 10:01:59,366 Average episode reward: 4.563702
2021-08-26 10:01:59,367 Average episode distance_to_goal: 13.225000
2021-08-26 10:01:59,367 Average episode normalized_distance_to_goal: 0.578507
2021-08-26 10:01:59,367 Average episode success: 0.113000
2021-08-26 10:01:59,367 Average episode spl: 0.081052
2021-08-26 10:01:59,367 Average episode softspl: 0.307712
2021-08-26 10:01:59,367 Average episode na: 113.229000
2021-08-26 10:01:59,367 Average episode sna: 0.043960
2021-08-26 10:01:59,367 Average episode sws: 0.089000

When I used the pre-trained weights you provide, the results were:

2021-08-26 10:53:22,413 Average episode reward: 8.952902
2021-08-26 10:53:22,414 Average episode distance_to_goal: 9.326000
2021-08-26 10:53:22,414 Average episode normalized_distance_to_goal: 0.392776
2021-08-26 10:53:22,414 Average episode success: 0.233000
2021-08-26 10:53:22,414 Average episode spl: 0.154922
2021-08-26 10:53:22,414 Average episode softspl: 0.348543
2021-08-26 10:53:22,414 Average episode na: 163.308000
2021-08-26 10:53:22,414 Average episode sna: 0.121521
2021-08-26 10:53:22,414 Average episode sws: 0.139000

Why do you think there is a huge difference? Is it just because I did not use the best checkpoint from savi_pretraining.yaml?

Why are the results I got from the pre-trained weights you provided not the same as the results in your semantic AVN paper?
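To compare two evaluation runs like the ones above metric by metric, the "Average episode" lines can be parsed from the logs. This is a small sketch; the helper name `parse_metrics` is hypothetical, and only two of the reported metrics are copied in to keep it short.

```python
import re

METRIC_RE = re.compile(r"Average episode (\w+): ([-\d.]+)")


def parse_metrics(log_text):
    """Map metric name to value for every 'Average episode <name>: <value>' line."""
    return {name: float(value) for name, value in METRIC_RE.findall(log_text)}


# Lines copied from the two runs reported above.
own = parse_metrics(
    "2021-08-26 10:01:59,367 Average episode success: 0.113000\n"
    "2021-08-26 10:01:59,367 Average episode spl: 0.081052\n"
)
provided = parse_metrics(
    "2021-08-26 10:53:22,414 Average episode success: 0.233000\n"
    "2021-08-26 10:53:22,414 Average episode spl: 0.154922\n"
)

# Positive values mean the provided weights scored higher on that metric.
gaps = {name: provided[name] - own[name] for name in own}
print(gaps)
```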

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

> Thank you. Do you have an automated way to evaluate every checkpoint on the validation set? There are 400 checkpoints, so it would be hard to evaluate them manually.

This function is made for this.

def eval(self, eval_interval=1, prev_ckpt_ind=-1, use_last_ckpt=False) -> None:
    r"""Main method of trainer evaluation. Calls _eval_checkpoint() that
    is specified in Trainer class that inherits from BaseRLTrainer

    Returns:
        None
    """
    self.device = (
        torch.device("cuda", self.config.TORCH_GPU_ID)
        if torch.cuda.is_available()
        else torch.device("cpu")
    )

    if "tensorboard" in self.config.VIDEO_OPTION:
        assert (
            len(self.config.TENSORBOARD_DIR) > 0
        ), "Must specify a tensorboard directory for video display"
    if "disk" in self.config.VIDEO_OPTION:
        assert (
            len(self.config.VIDEO_DIR) > 0
        ), "Must specify a directory for storing videos on disk"

    with TensorboardWriter(
        self.config.TENSORBOARD_DIR, flush_secs=self.flush_secs
    ) as writer:
        # eval last checkpoint in the folder
        if use_last_ckpt:
            models_paths = list(
                filter(os.path.isfile, glob.glob(self.config.EVAL_CKPT_PATH_DIR + "/*"))
            )
            models_paths.sort(key=os.path.getmtime)
            self.config.defrost()
            self.config.EVAL_CKPT_PATH_DIR = models_paths[-1]
            self.config.freeze()

        if os.path.isfile(self.config.EVAL_CKPT_PATH_DIR):
            # evaluate single checkpoint
            result = self._eval_checkpoint(self.config.EVAL_CKPT_PATH_DIR, writer)
            return result
        else:
            # evaluate multiple checkpoints in order
            while True:
                current_ckpt = None
                while current_ckpt is None:
                    current_ckpt = poll_checkpoint_folder(
                        self.config.EVAL_CKPT_PATH_DIR, prev_ckpt_ind, eval_interval
                    )
                    time.sleep(2)  # sleep for 2 secs before polling again
                logger.info(f"=======current_ckpt: {current_ckpt}=======")
                prev_ckpt_ind += eval_interval
                self._eval_checkpoint(
                    checkpoint_path=current_ckpt,
                    writer=writer,
                    checkpoint_index=prev_ckpt_ind
                )

It monitors all the checkpoints in a specified directory and evaluates them once a new one is available. I usually run a separate process for evaluation.
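The polling step in the loop above can be sketched as follows. `next_checkpoint` is a hypothetical, simplified stand-in and not the repository's poll_checkpoint_folder; it assumes checkpoints are named ckpt.N.pth and looks them up by their numeric index, so ckpt.10.pth is not confused with ckpt.1.pth.

```python
import glob
import os
import re

CKPT_RE = re.compile(r"ckpt\.(\d+)\.pth$")


def next_checkpoint(folder, prev_ind, interval=1):
    """Return the path of checkpoint number prev_ind + interval, or None if absent."""
    by_index = {}
    for path in glob.glob(os.path.join(folder, "ckpt.*.pth")):
        match = CKPT_RE.search(os.path.basename(path))
        if match:
            by_index[int(match.group(1))] = path
    return by_index.get(prev_ind + interval)
```

A caller would invoke this repeatedly, sleeping between attempts, until it returns a path, then evaluate that checkpoint and advance prev_ind by interval.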

> Why do you think there is a huge difference? Is it just because I did not use the best checkpoint from savi_pretraining.yaml?

There could be many reasons for this. How many GPUs are you using and how long have you trained the model? You'll get a better idea by plotting the validation curve as instructed above. Then you'll know if the model has converged.

> Why are the results I got from the pre-trained weights you provided not the same as the results in your semantic AVN paper?

Which result is not consistent?

from sound-spaces.

gtatiya avatar gtatiya commented on June 23, 2024
  • How should I run that eval function? I used this command: python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml EVAL_CKPT_PATH_DIR data/models/savi/data/ckpt.399.pth EVAL.SPLIT test USE_SYNC_VECENV True RL.DDPPO.pretrained False, but I believe it only evaluated ckpt.399.pth. What is the command to evaluate all the checkpoints?

  • I used 1 GPU, and I believe your code only uses 1 GPU at a time. Here, you load the model on only one GPU: https://github.com/facebookresearch/sound-spaces/blob/master/ss_baselines/savi/ddppo/algo/ddppo.py#L77. I have 4 GPUs; how can I use all of them? I used the default settings in your config file (NUM_UPDATES: 20000 and CHECKPOINT_INTERVAL: 50); I only changed NUM_PROCESSES to 4, and there are 400 checkpoints after training. Did you use different settings for training than the config you provided?

  • I think the results are similar to Table 1 (Unheard Sounds) of the paper. I asked because I got slightly lower performance with the weights you provided, so I might be doing something wrong. I used python ss_baselines/savi/run.py --run-type eval --exp-config ss_baselines/savi/config/semantic_audionav/savi.yaml EVAL_CKPT_PATH_DIR data/pretrained_weights/semantic_audionav/savi/best_val.pth EVAL.SPLIT test USE_SYNC_VECENV True RL.DDPPO.pretrained False. Do you think this is the correct command?

from sound-spaces.

ChanganVR avatar ChanganVR commented on June 23, 2024

If you don't provide EVAL_CKPT_PATH_DIR and just run the eval mode, by default it will always evaluate all checkpoints under that directory.

As I mentioned earlier in another post, if you change the number of GPUs, you might also want to change NUM_UPDATES, as the default number of GPUs is 32. If you can evaluate all the validation checkpoints, you will know whether the model has converged based on the validation performance curve.
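The numbers quoted in this thread can be checked with a little arithmetic. The checkpoint count follows directly from NUM_UPDATES and CHECKPOINT_INTERVAL. The scaling helper is only a rough sketch: it assumes every GPU worker contributes an equal share of experience to each update, which is an assumption and not something the thread confirms, and it ignores NUM_PROCESSES, which would enter as a further factor. Both function names are hypothetical.

```python
def num_checkpoints(num_updates, checkpoint_interval):
    """Number of checkpoints written when saving every checkpoint_interval updates."""
    return num_updates // checkpoint_interval


def scaled_num_updates(num_updates, default_gpus, gpus):
    """Updates needed to process the same total experience with a different GPU count.

    Assumes the experience gathered per update scales linearly with the GPU count.
    """
    return num_updates * default_gpus // gpus


print(num_checkpoints(20000, 50))        # 400, matching the 400 checkpoints reported
print(scaled_num_updates(20000, 32, 1))  # 640000 under the linear-scaling assumption
```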

The command is correct and yes, this command is for the unheard sounds setting. The performance is indeed slightly lower than when I first evaluated the model and uploaded the weights. Maybe some updates broke the consistency in some way. I'll look into that and keep you updated!

from sound-spaces.
