In my experiment, the result data frame contains multiple rows with trial id 1 with th

Experiment Results Contain Random Rows about syne-tune HOT 17 CLOSED

awslabs commented on June 6, 2024

Experiment Results Contain Random Rows

from syne-tune.

Comments (17)

mseeger commented on June 6, 2024

OK, I know you guys just hate it, but I'd still love to see some debug_log output. Then, we could see what that trial_id 1 is really doing over the course of the experiment.

mseeger commented on June 6, 2024

I don't get one thing. Your config space has height and width, but the table shows config_lr.

mseeger commented on June 6, 2024

I am also missing the step attribute.

mseeger commented on June 6, 2024

What I'd like to see is the results dataframe, which is just very different from that table. In the end, you create your plots and so on from the results, right?

wistuba commented on June 6, 2024

The table refers to my original experiment. I created the example later and confirmed that the same happens but didn't take another screenshot. My original experiment had only one hyperparameter: lr.

How is this table different from the results dataframe? This is load_experiment(tuner.name).results.

I have no problem with debug_log, I wasn't aware of it and I am not sure how to properly use it. As you've suggested, I activated it by passing it to the searcher and checked only the output on the console. Is this the intended use or does it write more logs somewhere else? Otherwise I couldn't spot anything suspicious. Trial id 1 is actually finished. Nevertheless, more results are reported. I am currently running a much longer experiment. Let's see how frequently trial id 1 pops up.

wistuba commented on June 6, 2024

Longer experiment: 287 rows in the table, 278 trials in total. Only trial id 1 occurs multiple times and it occurs within the first 24 rows (20 workers).

mseeger commented on June 6, 2024

I don't know what load_experiment(...).results is doing. I'd recommend loading the CSV directly and checking what is going on.
Thanks for raising this. We need to figure this out. Does this happen for local backend as well? I've not been using the SageMaker backend much at all, it may have quite some glitches.
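
A minimal sketch of such a direct check, using only the standard library. It assumes the results file has a trial_id column and that each trial is expected to contribute exactly one row (as in the reported experiment); the sample data and the loss column are made up for illustration.

```python
import csv
import io
from collections import Counter


def rows_per_trial(csv_file, id_column="trial_id"):
    """Count how many result rows each trial_id contributes."""
    reader = csv.DictReader(csv_file)
    return Counter(row[id_column] for row in reader)


def repeated_trials(counts, expected_rows=1):
    """Return the trial_ids that occur more often than expected."""
    return {tid: n for tid, n in counts.items() if n > expected_rows}


# Made-up sample standing in for a real results.csv (values are strings,
# because csv.DictReader does not convert types).
sample = io.StringIO(
    "trial_id,config_lr,loss\n"
    "0,0.1,0.9\n"
    "1,0.01,0.7\n"
    "1,0.01,0.4\n"
    "2,0.001,0.8\n"
)
counts = rows_per_trial(sample)
print(repeated_trials(counts))
```

For a real run, the StringIO object would be replaced by an open file handle on the unzipped results CSV.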

wistuba commented on June 6, 2024

That's basically what the function does, loading the results.csv.zip: https://github.com/awslabs/syne-tune/blob/main/syne_tune/experiments.py#L144
I've shared my results.csv (and I believe this one is for the example snippet above) internally.

I didn't face problems using only a single worker. I'll try many workers on local backend and let you know.

I have started to use the SM backend more frequently now. Fortunately, this is the first and only issue I've faced. Let's hope it is the last one as well.

mseeger commented on June 6, 2024

Aaron is also using it more now, so let us figure this out!

geoalgo commented on June 6, 2024

Hi Martin, I just tried the example you gave and I do have a results.csv with one row per trial. Did you observe the issue with mainline? Can you confirm that the issue also happened with the script you gave?

wistuba commented on June 6, 2024

I set up a new SM notebook instance, created a new conda environment, and ran the script above:

conda create -n test python=3.9.5 -y
conda activate test
pip install syne-tune
git clone https://github.com/awslabs/syne-tune.git
cd syne-tune
pip install -r requirements-ray.txt
python script.py

from syne_tune.experiments import load_experiment
load_experiment('train-height-2022-03-31-12-34-19-697').results['trial_id']

Again, multiple 1s showed up.

wistuba commented on June 6, 2024

I've run an experiment on a grid with 49 configurations and during the search all were evaluated. The results table has 52 rows, 4 of which have trial id 1. Exactly 49 jobs were executed on SageMaker.

It is not limited to trial id 1. In addition to 1, I also saw 2.

geoalgo commented on June 6, 2024

We are working on finding a good setup to reproduce this issue as it happens sporadically.

mseeger commented on June 6, 2024

I might have experienced this issue as well. Here is what happens for me. I am running an experiment with SM backend, ASHA, and lstm_wikitext2 benchmark. There are 10 seeds, 6 fail, 4 succeed.

In the 6 failed ones, this happens:

trial_id 1 is successful and moves beyond 9 or 27 epochs.
The scheduler receives a report from trial_id 1 with resource=1, which leads to an exception.
In all cases, the next trial_id to report anything at resource=1 is 10. "10" looks similar to "1" (?)

In the 4 successful ones, trial_id 1 is stopped at 1 or 3, at a time when trial_id 10 does not exist yet.

mseeger commented on June 6, 2024

I'll dig a bit into this. If trial_id's are mixed up, we can detect this easily by passing trial_id to training function and back in the reports.
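
A rough sketch of that detection idea, independent of any syne-tune API. It assumes each report can be represented as a dict holding both the trial_id the backend attributed it to and the trial_id the training function received and echoed back; the field names attributed_to and echoed_trial_id are hypothetical.

```python
def find_mixed_up_reports(reports):
    """Return reports whose echoed trial_id differs from the attributed one.

    Each report is a dict with (hypothetical) keys:
      'attributed_to'   - trial_id the backend assigned the report to
      'echoed_trial_id' - trial_id passed to the training function and sent back
    """
    return [r for r in reports if r["attributed_to"] != r["echoed_trial_id"]]


# Made-up reports: the second one was produced by trial 10 but
# attributed to trial 1.
reports = [
    {"attributed_to": 1, "echoed_trial_id": 1, "resource": 9},
    {"attributed_to": 1, "echoed_trial_id": 10, "resource": 1},
    {"attributed_to": 2, "echoed_trial_id": 2, "resource": 1},
]
mixed = find_mixed_up_reports(reports)
for r in mixed:
    print(f"report attributed to {r['attributed_to']} came from {r['echoed_trial_id']}")
```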

mseeger commented on June 6, 2024

Hunch: The trial with id 10 is mistaken for trial_id 1. Maybe a path is mismatched. In S3, the prefix "XYZ-1" matches "XYZ-1*" (and therefore also "XYZ-10"), unless you use "XYZ-1/". I'll have a look.
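
The prefix behaviour behind this hunch can be illustrated with plain string matching, without touching S3. This is only a sketch: the key names are made up, and the actual paths used by the backend are not shown in this thread.

```python
def keys_with_prefix(keys, prefix):
    """Mimic a prefix listing: keep every key that starts with the prefix."""
    return [k for k in keys if k.startswith(prefix)]


# Hypothetical object keys, one "directory" per trial.
keys = [
    "XYZ-1/metrics.json",
    "XYZ-10/metrics.json",
    "XYZ-2/metrics.json",
]

loose = keys_with_prefix(keys, "XYZ-1")    # also picks up trial 10
strict = keys_with_prefix(keys, "XYZ-1/")  # trailing slash isolates trial 1

print(loose)
print(strict)
```

With the trailing slash, results from trial 10 can no longer be attributed to trial 1.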

geoalgo commented on June 6, 2024

Closing as #374 seems to have addressed the issue.
