Comments (17)

mseeger commented on June 6, 2024

OK, I know you guys just hate it, but I'd still love to see some debug_log output. Then, we could see what that trial_id 1 is really doing over the course of the experiment.

from syne-tune.

mseeger commented on June 6, 2024

I don't get one thing. Your config space has height and width, but the table shows config_lr.

mseeger commented on June 6, 2024

I am also missing the step attribute.

mseeger commented on June 6, 2024

What I'd like to see is the results dataframe, which is just very different from that table. In the end, you create your plots and so on from the results, right?

wistuba commented on June 6, 2024

The table refers to my original experiment. I created the example later and confirmed that the same happens but didn't take another screenshot. My original experiment had only one hyperparameter: lr.

How is this table different from the results dataframe? This is load_experiment(tuner.name).results.

I have no problem with debug_log; I wasn't aware of it and I am not sure how to use it properly. As you suggested, I activated it by passing it to the searcher and checked only the console output. Is this the intended use, or does it write more logs somewhere else? Other than that, I couldn't spot anything suspicious. Trial id 1 has actually finished. Nevertheless, more results are reported for it. I am currently running a much longer experiment. Let's see how frequently trial id 1 pops up.

wistuba commented on June 6, 2024

Longer experiment: 287 rows in the table, 278 trials in total. Only trial id 1 occurs multiple times and it occurs within the first 24 rows (20 workers).

mseeger commented on June 6, 2024

I don't know what load_experiment(...).results is doing. I'd recommend loading the CSV directly and checking what is going on.
Thanks for raising this. We need to figure this out. Does this happen with the local backend as well? I've not been using the SageMaker backend much at all, so it may have quite a few glitches.
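
A minimal sketch of such a direct check, assuming the results file has a trial_id column. The toy CSV text, the metric column, and the helper name rows_per_trial are made up for illustration; for a real experiment the file object would come from the unpacked results.csv instead.

```python
import csv
import io
from collections import Counter


def rows_per_trial(csv_file):
    """Count how many result rows each trial_id contributes."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["trial_id"]) for row in reader)


# Toy stand-in for results.csv; the "metric" column is hypothetical.
toy_csv = io.StringIO(
    "trial_id,metric\n"
    "0,0.9\n"
    "1,0.8\n"
    "2,0.7\n"
    "1,0.6\n"
    "1,0.5\n"
)

counts = rows_per_trial(toy_csv)
# Keep only trials that appear in more than one row.
duplicated = {trial_id: n for trial_id, n in counts.items() if n > 1}
print(duplicated)
```

If every trial is expected to report exactly once, any entry in duplicated points at a trial worth inspecting row by row.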

wistuba commented on June 6, 2024

That's basically what the function does, loading the results.csv.zip: https://github.com/awslabs/syne-tune/blob/main/syne_tune/experiments.py#L144
I've shared my results.csv (and I believe this one is for the example snippet above) internally.

I didn't face problems using only a single worker. I'll try many workers on local backend and let you know.

I've started to use the SM backend more frequently now. Fortunately, this is the first and only problem I've faced. Let's hope it is the last one as well.

mseeger commented on June 6, 2024

Aaron is also using it more now, so let us figure this out!

geoalgo commented on June 6, 2024

Hi Martin, I just tried the example you gave and I do have a results.csv with one row per trial. Did you observe the issue with mainline? Can you confirm that the issue also happened with the script you gave?

wistuba commented on June 6, 2024

I set up a new SM notebook instance, created a new conda environment, and ran the script above:

conda create -n test python=3.9.5 -y
conda activate test
pip install syne-tune
git clone https://github.com/awslabs/syne-tune.git
cd syne-tune
pip install -r requirements-ray.txt
python script.py

# then, in a Python session:
from syne_tune.experiments import load_experiment
load_experiment('train-height-2022-03-31-12-34-19-697').results['trial_id']

Again, multiple 1s showed up.

wistuba commented on June 6, 2024

I've run an experiment on a grid with 49 configurations and during the search all were evaluated. The results table has 52 rows, 4 of which have trial id 1. Exactly 49 jobs were executed on SageMaker.

It is not limited to trial id 1. In addition to 1, I also saw 2.

geoalgo commented on June 6, 2024

We are working on finding a good setup to reproduce this issue as it happens sporadically.

mseeger commented on June 6, 2024

I might have experienced this issue as well. Here is what happens for me. I am running an experiment with the SM backend, ASHA, and the lstm_wikitext2 benchmark. Of 10 seeds, 6 fail and 4 succeed.

In the 6 failed ones, this happens:

  • trial_id 1 is successful and moves beyond 9 or 27 epochs
  • the scheduler receives a report from trial_id 1 with resource=1, which leads to an exception
  • in all cases, the next trial_id to report anything at resource=1 is 10. "10" looks similar to "1" (?)

In the 4 successful ones, trial_id 1 is stopped at 1 or 3, at a time when trial_id 10 does not exist yet.
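
The failure pattern described here (a trial that has already advanced reporting again at resource=1) can be flagged with a simple monotonicity check. A rough sketch, assuming reports arrive in order as (trial_id, resource) pairs; the function name find_regressions and the toy data are illustrative and not Syne Tune's internals.

```python
def find_regressions(reports):
    """Return (trial_id, resource, last_resource) for every report whose
    resource does not exceed the last accepted resource of that trial."""
    last_seen = {}
    regressions = []
    for trial_id, resource in reports:
        previous = last_seen.get(trial_id)
        if previous is not None and resource <= previous:
            # Suspicious: the trial appears to go backwards.
            regressions.append((trial_id, resource, previous))
        else:
            last_seen[trial_id] = resource
    return regressions


# Toy report stream: trial 1 reaches resource 3, then a report at
# resource 1 shows up under the same trial_id.
reports = [(0, 1), (1, 1), (1, 2), (1, 3), (1, 1), (1, 4)]
print(find_regressions(reports))
```

A regression does not say where the stray report came from, only that the attribution is inconsistent with what the trial reported before.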

mseeger commented on June 6, 2024

I'll dig a bit into this. If trial_ids are mixed up, we can detect this easily by passing trial_id to the training function and sending it back in the reports.
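
One way to sketch this check: the training function echoes the trial_id it was given in every report, and the receiving side compares it with the trial the report was attributed to. All names here (train_fn, check_report, the report fields) are hypothetical and not part of Syne Tune's API.

```python
def train_fn(trial_id, max_epochs):
    """Toy training loop that echoes its own trial_id in every report."""
    for epoch in range(1, max_epochs + 1):
        yield {"trial_id": trial_id, "epoch": epoch, "loss": 1.0 / epoch}


def check_report(attributed_trial_id, report):
    """Raise if a report is attributed to a trial that did not produce it."""
    if report["trial_id"] != attributed_trial_id:
        raise ValueError(
            f"report from trial {report['trial_id']} "
            f"attributed to trial {attributed_trial_id}"
        )


# Correctly attributed reports pass silently.
for report in train_fn(trial_id=1, max_epochs=3):
    check_report(1, report)

# A report produced by trial 10 but attributed to trial 1 is caught.
mixed_up = next(train_fn(trial_id=10, max_epochs=1))
try:
    check_report(1, mixed_up)
    detected = False
except ValueError:
    detected = True
print(detected)
```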

mseeger commented on June 6, 2024

Hunch: The trial with id 10 is mistaken for trial_id 1. Maybe a path is mismatched. In S3, the prefix "XYZ-1" matches everything of the form "XYZ-1*" (including "XYZ-10"), unless you use "XYZ-1/". I'll have a look.
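
The hunch can be illustrated without S3 at all, since prefix listing behaves like a plain string-prefix match on object keys. The keys and the list_keys helper below are made up for illustration; they only mimic prefix filtering and do not call any AWS API.

```python
def list_keys(keys, prefix):
    """Mimic prefix-based listing: keep keys that start with the prefix."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical object keys, one folder per trial.
keys = [
    "XYZ-1/results.json",
    "XYZ-10/results.json",
    "XYZ-2/results.json",
]

loose = list_keys(keys, "XYZ-1")    # also picks up trial 10
strict = list_keys(keys, "XYZ-1/")  # trailing slash pins the folder
print(loose)
print(strict)
```

With the trailing slash, results of trial 10 can no longer be picked up when trial 1 is queried.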

geoalgo commented on June 6, 2024

Closing as #374 seems to have addressed the issue.
