Comments (18)
Thanks @vassiasim. I will reproduce the issue and come back with further information.
from sagemaker-python-sdk.
Just checking in to see if there are any updates or any indication of when this will be fixed.
@jbencook Thank you so much for your contribution!
Until the next SDK release, the TensorBoard fix can be used by building and installing from master. It is also possible to get the fix within a SageMaker notebook instance by building and installing from source:
- Start a new conda_tensorflow_p27 notebook
- Clone from master and pip install within the cell
! git clone https://github.com/aws/sagemaker-python-sdk.git python-sdk-tensorboard-fix && cd python-sdk-tensorboard-fix && pip install . --upgrade
- Run the cell
All TensorFlow jobs that run TensorBoard should now correctly display scalars!
Feel free to run the sample TensorBoard notebook, tensorflow_resnet_cifar10_with_tensorboard, which is in /sample-notebooks/sagemaker-python-sdk.
Thanks again!
Hi @vassiasim,
I am investigating your issue. Would you mind sharing the OS platform and a code block allowing me to reproduce the issue in the same way that you did?
Thanks for using SageMaker!
Hi @mvsusp ,
Thank you for having a look!!
The example provided here gives me the exact same behaviour: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb
I am using AWS, so Ubuntu 16.04.
Hello @vassiasim,
I created a SageMaker workspace and executed the ResNet CIFAR-10 Jupyter notebook. The behavior when estimator.fit(inputs, run_tensorboard_locally=True) is called is almost exactly as you described. The only difference is that the scalars are displayed at the end of the training job, and shortly afterwards TensorBoard is shut down.
A second call to estimator.fit(inputs, run_tensorboard_locally=True) will create a second training job, but it will reuse the checkpoints from the previous execution. That is a useful feature, and it can be avoided by creating a new TensorFlow estimator instead of reusing the previous one. When the second call to estimator.fit(inputs, run_tensorboard_locally=True) starts, TensorBoard picks up the state from the previous run, which is why you see the scalars there.
TensorBoard scalars are created after each evaluation. The notebook example that you used only evaluates once, at the end of training, which explains the current behavior.
I will change the example defined in tensorflow_resnet_cifar10_with_tensorboard.ipynb as follows:
import os
from sagemaker.tensorflow import TensorFlow

source_dir = os.path.join(os.getcwd(), 'source_dir')

estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'min_eval_frequency': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=2, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')

estimator.fit(inputs, run_tensorboard_locally=True)
That will make the training job evaluate more often, allowing you to see the scalars in the example. It will also make the training job slower, given that more checkpoints will be saved to the S3 bucket.
Please update your notebook with the code block above.
For more information on how the TensorFlow training process works: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/get_started/estimator.md#fit-the-dnnclassifier-to-the-iris-training-data-fit-dnnclassifier
For more information on the hyperparameters available in the TensorFlow SageMaker Python SDK: https://github.com/aws/sagemaker-python-sdk#optional-hyperparameters
PR with the code changes: aws/amazon-sagemaker-examples#154
Hi @mvsusp,
Thank you for the reply and PR!
That makes sense. I tried it and the scalars and images are displayed, but only when the job is complete, even though evaluation runs a few times before that.
Any thoughts?
Thanks a lot.
Hi @vassiasim,
You can increase the number of training steps to have a longer training run, and more scalars and images displayed as well. Try training_steps=10000.
Hi @mvsusp ,
Thank you for your reply. I agree that doing so will result in a longer training run, but I still don't understand why I don't get more points on the graph.
Based on your previous answer, adding min_eval_frequency runs the evaluation more often than the default of every 1000 steps, so I would expect a point on TensorBoard every time the evaluation runs. I'm just trying to understand when TensorBoard updates the displayed values for training and for evaluation.
Using the notebook with training_steps=10000 and hyperparameters={'min_eval_frequency': 10}, I get points for training up to step 600 and one point for evaluation at step 1 after passing step 1000. I would expect TensorBoard to get updated again after passing step 2000, but it doesn't. It only got updated at some point after step 3000, and again only for training (still just one point for evaluation at step 1). It doesn't update again until the end of training at 10000 steps.
I am just trying to understand how it works exactly, and also to be able to see the progress of training.
Thank you!
I'm digging into this issue right now, since we need TensorBoard to work in order to make use of SageMaker. It seems like TensorBoard only ever sees the first event written to the .tfevents file and then stops updating. Steps to reproduce the issue:
- Run the CIFAR10 example
- After the job starts training, refresh the TensorBoard page. At this point, TensorBoard shows me one point on the accuracy plot under the scalars tab.
- Keep an eye on the .tfevents file in the log directory by running ls -la every 30 seconds or so. When it updates (you know because the timestamp changed), start a new TensorBoard on a different port pointing to the same log directory. You can see that the new TensorBoard registers the new event, while the existing TensorBoard still ignores it. I've got a screenshot with side-by-side TensorBoards below; both sessions point to the same log directory but run on different ports.
I'm still not 100% sure what the cause is, but I have found a couple of things. First, someone else was having a very similar problem when syncing logs from Google Cloud Storage to a local directory; they filed an issue in the TensorBoard repo here. When I run the hack they suggest (rsync --inplace), the logs seem to mostly update correctly.
But I also found another problem. When I start TensorBoard separately, I can see what it logs, and at one point I got the following error:
E0122 08:56:36.886072 Reloader directory_watcher.py:241] File /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws updated even though the current file is /private/var/folders/_t/ywxyc4gs5gv10xx01cj12ps80000gn/T/tmp2jTUUg/events.out.tfevents.1516632057.aws.8af0e520
As soon as TensorBoard sees a file that's lexicographically higher than the event file it's supposed to be watching, it moves on to the new one and never looks at the original again. The aws s3 sync
command creates temporary files that are lexicographically higher, so as soon as TensorBoard sees those, it stops watching the correct tfevents file.
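To make the failure mode concrete, here is a minimal sketch (file names modeled on the error message above; the temp-file suffix is just an example) showing that the sync temp file sorts lexicographically after the real event file, which is what makes TensorBoard's directory watcher abandon the original:

```python
# TensorBoard's directory watcher follows the lexicographically greatest
# path it has seen. A sync temp file like "...aws.8af0e520" is a prefix
# extension of the real event file name, so it always sorts after it.
real = "events.out.tfevents.1516632057.aws"
temp = "events.out.tfevents.1516632057.aws.8af0e520"  # example temp suffix

print(temp > real)                # True: the temp file looks "newer"
print(sorted([real, temp])[-1])   # the path the watcher would switch to
```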
I hope this helps. Unless I can figure out a more elegant solution in the next couple of hours, my plan is to implement a hack where I keep two copies of the logs directory: one for aws s3 sync to update, and the other for TensorBoard to watch. That will work fine for us, but a proper solution will probably be more involved.
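A sketch of that two-directory hack might look like the following (the function and directory names are hypothetical, and the actual aws s3 sync step happens elsewhere):

```python
import os
import shutil

def refresh_watched_logs(sync_dir, watch_dir):
    """Copy synced event files into the directory TensorBoard watches.

    sync_dir is where `aws s3 sync` writes (temp files and all);
    watch_dir is what TensorBoard points at, so it never sees the
    sync tool's temporary files.
    """
    os.makedirs(watch_dir, exist_ok=True)
    for name in os.listdir(sync_dir):
        src = os.path.join(sync_dir, name)
        if os.path.isfile(src):
            # copy2 preserves timestamps, so unchanged files stay unchanged
            shutil.copy2(src, os.path.join(watch_dir, name))
```

Calling this periodically after each sync keeps the watched directory clean at the cost of a second copy of the logs on disk.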
Update: it looks like files don't actually have to be updated in place for TensorBoard to pick them up. That narrows the problem down to the temporary files that aws s3 sync creates.
I need to do some testing with a non-toy example, but I have a quick fix that seems to be working for CIFAR 10 here.
If I'm right about the actual issue, then I think the best solution is probably to convince TensorBoard to expose its path filter so it can be used from the CLI. In that case SageMaker could just tell TensorBoard to ignore files with anything after .aws.
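A rough sketch of what such a filter could look like (the pattern and the "aws" hostname are assumptions based on the error message above, not TensorBoard's actual API):

```python
import re

# Match only real event files, ".../events.out.tfevents.<timestamp>.<host>",
# and reject sync temp files that append an extra suffix (e.g. ".8af0e520").
EVENT_FILE = re.compile(r"events\.out\.tfevents\.\d+\.[^.]+$")

def is_real_event_file(path):
    return EVENT_FILE.search(path) is not None

print(is_real_event_file("events.out.tfevents.1516632057.aws"))           # True
print(is_real_event_file("events.out.tfevents.1516632057.aws.8af0e520"))  # False
```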
Another option would be to sync files from S3 without creating temporary files in the same directory. That would be easy enough if you can download the whole .tfevents file in one call, but I'm guessing those can get big.
So far, I've gone with the easiest solution, which is to keep two copies of the log directory locally. I'm happy to help out with a better fix if someone can comment on a preferred approach.
I'll also follow up if I find out my current solution is not working.
All, any update on this issue?
I've run the example mentioned in this thread. I made sure more data would be produced by setting 'save_summary_secs' and 'save_checkpoints_secs' to just a few seconds.
I noticed that indeed the originally launched TensorBoard was not refreshing, but starting a separate TensorBoard on the same local directory was showing data. I also ran 'tensorboard --inspect' to confirm the data was there.
I also realized that the example code doesn't do any special handling of 'tf.summary.FileWriter', which made me try running with only one instance.
Running with one training instance works correctly, and TensorBoard is refreshed as training goes on.
@hsakkout Right now we're prioritizing this against a host of TensorFlow Serving-related bugs that have been discovered recently, e.g. #99.
Expect an update by EOD Wednesday on where we landed.
With further investigation, I confirmed there's a problem even for single-machine training. (The problem @lukmis mentioned may be a separate one we have to investigate).
I believe the problem is the same one described here: tensorflow/tensorboard#349
We use aws s3 sync https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L100 , which would cause the same problem as the gsutil sync.
I'll try out @jbencook's fix next.
@jbencook I tried out your fix and it's working. I think the overall approach is the right one - the main difference I'd suggest is simply to use a context manager to clean up the temporary directory after its contents are copied to the tensorflow log dir.
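For illustration, the context-manager version of that cleanup could look like the sketch below (the names are hypothetical, and fetch_events stands in for the actual S3 download step; the real change belongs in the SDK's TensorBoard sync logic):

```python
import os
import shutil
import tempfile

def sync_into(log_dir, fetch_events):
    """Download events into a scratch dir, copy them to log_dir, clean up.

    fetch_events is a stand-in for the actual S3 download; it receives the
    scratch directory to write into. TemporaryDirectory removes the scratch
    dir, including any leftover sync temp files, when the block exits.
    """
    with tempfile.TemporaryDirectory() as scratch:
        fetch_events(scratch)
        for name in os.listdir(scratch):
            src = os.path.join(scratch, name)
            if os.path.isfile(src):
                shutil.copy2(src, os.path.join(log_dir, name))
```

The context manager guarantees the intermediate directory is removed even if the copy raises, which is the cleanup the original fix was missing.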
Would you be interested in submitting a PR?
Good point - I'm not currently cleaning up the intermediate directory. I can fix that up and submit a PR later today.
Hello,
@jbencook Thank you so much for your contribution.
This fix has been released.