Comments (18)

mvsusp commented on May 12, 2024

Thanks @vassiasim. I will reproduce the issue and come back with further information.

from sagemaker-python-sdk.

hsakkout commented on May 12, 2024

Just checking in to see if there are any updates or any indication of when this will be fixed.

ChoiByungWook commented on May 12, 2024

@jbencook Thank you so much for your contribution!

Until the next SDK release, the TensorBoard fix can be used by building and installing from master.

It is also possible to view the fix within a SageMaker notebook instance by building and installing from source.

  1. Start a new conda_tensorflow_p27 notebook
  2. Clone from master and pip install within a cell:
! git clone https://github.com/aws/sagemaker-python-sdk.git python-sdk-tensorboard-fix && cd python-sdk-tensorboard-fix && pip install . --upgrade
  3. Run the cell

All TensorFlow jobs that run TensorBoard should now correctly display scalars!

Feel free to run the sample TensorBoard notebook, tensorflow_resnet_cifar10_with_tensorboard, which is in /sample-notebooks/sagemaker-python-sdk.

Thanks again!

mvsusp commented on May 12, 2024

Hi @vassiasim,

I am investigating your issue. Would you mind sharing the OS platform and a code block allowing me to reproduce the issue in the same way that you did?

Thanks for using SageMaker!

vassiasim commented on May 12, 2024

Hi @mvsusp ,

Thank you for having a look!!

The example provided here gives me the exact same behaviour: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb

I am using AWS, so Ubuntu 16.04.

mvsusp commented on May 12, 2024

Hello @vassiasim,

I created a SageMaker workspace and executed the resnet cifar 10 Jupyter notebook. The behavior when estimator.fit(inputs, run_tensorboard_locally=True) is called is almost exactly as you described. The only difference is that the scalars are displayed at the end of the training job, and shortly afterwards TensorBoard is shut down.

The second call of estimator.fit(inputs, run_tensorboard_locally=True) will create a second training job, but it will use the same checkpoints from the previous execution. That is a useful feature, and it can be avoided by creating a new TensorFlow estimator instead of reusing the previous one. When the second call of estimator.fit(inputs, run_tensorboard_locally=True) starts, TensorBoard will pick up the state from the previous run, which is why you see the scalars there.

TensorBoard scalars are created after each evaluation. The notebook example that you used only evaluates once, at the end of the training, which explains the current behavior.

I will change the example defined in tensorflow_resnet_cifar10_with_tensorboard.ipynb as follows:

import os

from sagemaker.tensorflow import TensorFlow

source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'min_eval_frequency': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=2, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')

estimator.fit(inputs, run_tensorboard_locally=True)

That change makes the training job evaluate more often, allowing you to see the scalars in the example. It will also make the training job slower, given that more checkpoints will be saved to the S3 bucket.

Please update your notebook with the code block above.

For more information on how TensorFlow training process works: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/get_started/estimator.md#fit-the-dnnclassifier-to-the-iris-training-data-fit-dnnclassifier

For more information on the available hyperparameters for the TensorFlow SageMaker Python SDK: https://github.com/aws/sagemaker-python-sdk#optional-hyperparameters

PR with the code changes: aws/amazon-sagemaker-examples#154

vassiasim commented on May 12, 2024

Hi @mvsusp,

Thank you for the reply and PR!

That makes sense. I tried it and the scalars and images are displayed, but only when the job is complete, even though evaluation runs a few times before that.

Any thoughts?

Thanks a lot.

mvsusp commented on May 12, 2024

Hi @vassiasim,

You can increase the number of training steps to get a longer training run, with more scalars and images displayed. Try training_steps=10000.

vassiasim commented on May 12, 2024

Hi @mvsusp ,

Thank you for your reply. Yes, I agree that doing that will result in longer training, but I still don't understand why I don't get more points on the graph.

Based on your previous answer, adding min_eval_frequency runs the evaluation more often, rather than every 1000 steps (the default), so I would expect a point on TensorBoard every time the evaluation runs. I am just trying to understand when TensorBoard updates the displayed values for training and for evaluation.

Using the notebook with training_steps=10000 and hyperparameters={'min_eval_frequency': 10}, I get points for training up to step 600 and one point for evaluation at 1 after passing step 1000. I would expect TensorBoard to update again after passing step 2000, but it doesn't. It only updated at some random point after step 3000, and again only for training (still just one point for evaluation at 1). It doesn't update again until the end of the training at 10000 steps.

I am just trying to understand how it works exactly and also to be able to see the progress of training.

Thank you!

jbencook commented on May 12, 2024

I'm digging into this issue right now, since we need TensorBoard to work in order to make use of SageMaker. It seems like TensorBoard only ever sees the first event written to the .tfevents file and then stops updating. Steps to reproduce the issue:

  1. Run the CIFAR10 example
  2. After the job starts training, refresh the TensorBoard page. At this point, TensorBoard shows me one point on the accuracy plot under the scalars tab.
  3. Keep an eye on the .tfevents file in the log directory by running ls -la every 30 seconds or so. When it updates (you know because the timestamp changed), start a new TensorBoard on a different port pointing to the same log directory. You can see that the new TensorBoard registers the new event, while the existing TensorBoard still ignores it. I've got a screenshot with side-by-side TensorBoards below. Both sessions are pointing to the same log directory, but running on different ports.

[Screenshot, 2018-01-19 12:36 PM: two TensorBoard sessions side by side, pointing at the same log directory on different ports]

I'm still not 100% sure what the cause is, but I have found a couple of things. First, someone else was having a very similar problem when syncing logs from Google Cloud Storage to a local directory; they filed an issue in the TensorBoard repo here. When I run the hack they suggest (rsync --inplace), the logs seem to mostly update correctly.

But I also found another problem. When I start TensorBoard separately, I can see what it logs, and at one point I got the following error:

E0122 08:56:36.886072 Reloader directory_watcher.py:241] File /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws updated even though the current file is /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws.8af0e520

As soon as TensorBoard sees a file that's lexicographically greater than the event file it's supposed to be watching, it moves on to the new file and never looks at the original again. The aws s3 sync command creates temporary files that are lexicographically greater, so as soon as TensorBoard sees those, it stops watching the correct .tfevents file.
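
This ordering can be illustrated in a few lines (a toy sketch, not TensorBoard's actual watcher code; the file names are taken from the error message above):

```python
# Toy illustration of the lexicographic-ordering problem. A watcher that
# always follows the "latest" file in sort order will abandon the real
# events file as soon as a sync temp file appears in the directory.
event_file = "events.out.tfevents.1516632057.aws"
temp_file = "events.out.tfevents.1516632057.aws.8af0e520"  # aws s3 sync temp file

visible = sorted([temp_file, event_file])
latest = visible[-1]  # event_file is a prefix of temp_file, so the temp file sorts last
print(latest)
```

Because the real events file is a strict prefix of the temp file's name, the temp file always sorts after it, and the watcher switches over.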

I hope this helps. Unless I can figure out a more elegant solution in the next couple of hours, my plan is to implement a hack where I keep two copies of the logs directory: one for aws s3 sync to update, and one for TensorBoard to watch. That will work fine for us, but a proper solution will probably be more involved.

jbencook commented on May 12, 2024

Update: it looks like files don't actually have to be updated in place for TensorBoard to pick them up. That narrows it down to the temporary files that aws s3 sync creates.

I need to do some testing with a non-toy example, but I have a quick fix that seems to be working for CIFAR 10 here.

If I'm right about the actual issue, then I think the best solution is probably to convince TensorBoard to expose their path filter so it can be used from the CLI. In that case, SageMaker could just tell TensorBoard to ignore files with anything after .aws.

Another option would be to sync files from S3 without creating temporary files in the same directory. That would be easy enough if you can download the whole .tfevents file in one call, but I'm guessing those can get big.

So far I've got the easiest solution, which is to keep two copies of the log directory locally. I'm happy to help out with a better fix if someone can comment on a preferred approach.
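
A minimal sketch of that two-directory workaround (hypothetical helper name; the assumption is that aws s3 sync targets the staging directory, so its temporary files never appear in the directory TensorBoard watches):

```python
import os
import shutil

def mirror_logs(staging_dir, watch_dir):
    """Copy synced event files from the staging directory (the aws s3 sync
    target) into the directory TensorBoard watches. The sync's temp files
    only ever exist in staging_dir, so TensorBoard never sees them."""
    for root, _dirs, files in os.walk(staging_dir):
        rel = os.path.relpath(root, staging_dir)
        dest = watch_dir if rel == "." else os.path.join(watch_dir, rel)
        os.makedirs(dest, exist_ok=True)
        for name in files:
            shutil.copy2(os.path.join(root, name), os.path.join(dest, name))
```

After each `aws s3 sync <s3-uri> <staging_dir>` completes, call `mirror_logs(staging_dir, watch_dir)` and point TensorBoard at `watch_dir`.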

I'll also follow up if I find out my current solution is not working.

jpbarto commented on May 12, 2024

All, any update on this issue?

lukmis commented on May 12, 2024

I've run the example mentioned in this thread. I made sure more data would be produced by setting 'save_summary_secs' and 'save_checkpoints_secs' to just a few seconds.
I noticed that the originally launched TensorBoard was indeed not refreshing, but starting a separate TensorBoard on the same local directory showed the data. I also ran 'tensorboard --inspect' to confirm the data was there.
I also realized that the example code doesn't do any special handling of the 'tf.summary.FileWriter', which made me try running with only one instance.

Running with one training instance works correctly, and TensorBoard refreshes as training goes on.

winstonaws commented on May 12, 2024

@hsakkout Right now we're prioritizing this against a host of TensorFlow serving-related bugs that have been discovered recently, e.g. #99.

Expect an update by EOD Wednesday about where we landed.

winstonaws commented on May 12, 2024

With further investigation, I confirmed there's a problem even for single-machine training. (The problem @lukmis mentioned may be a separate one we have to investigate.)

I believe the problem is the same one described here: tensorflow/tensorboard#349

We use aws s3 sync (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L100), which would cause the same problem as the gsutil sync.

I'll try out @jbencook's fix next.

winstonaws commented on May 12, 2024

@jbencook I tried out your fix and it's working. I think the overall approach is the right one; the main difference I'd suggest is to use a context manager to clean up the temporary directory after its contents are copied to the TensorFlow log dir.
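
That cleanup could look roughly like this (a sketch, not the SDK's actual code; `run_sync` and `copy_to_logdir` are hypothetical callables standing in for the sync and copy steps):

```python
import tempfile

def sync_and_clean(run_sync, copy_to_logdir, logdir):
    """Stage the S3 sync in a temporary directory that is removed
    automatically when the with-block exits, even if the copy raises.
    run_sync and copy_to_logdir are hypothetical stand-ins."""
    with tempfile.TemporaryDirectory() as staging:
        run_sync(staging)               # e.g. `aws s3 sync <s3-uri> staging`
        copy_to_logdir(staging, logdir)
    # staging and its contents are deleted here
```

`tempfile.TemporaryDirectory` guarantees the staging directory is removed on exit from the `with` block, which is the cleanup being suggested.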

Would you be interested in submitting a PR?

jbencook commented on May 12, 2024

Good point; I'm not currently cleaning up the intermediate directory. I can fix that up and submit a PR later today.

from sagemaker-python-sdk.

ChoiByungWook commented on May 12, 2024

Hello,

@jbencook Thank you so much for your contribution.

This fix has been released.
