Comments (5)

ivan-khvostishkov commented on June 4, 2024

Looking at the exception stack trace, I see that this is again something related to SageMaker itself rather than to SSH Helper. It is downloading the code from S3, most likely from the default bucket, which looks like s3://sagemaker-eu-west-1-555555555555/ . Could you check that this bucket exists, that you can access it from your notebook instance (e.g. by running the aws s3 cp command from the Terminal), and that it is located in the same region as your notebook?
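For reference, the three checks above (bucket exists, is accessible, is in the expected region) could be scripted roughly as follows. This is only a sketch: check_bucket is a hypothetical helper name, and it assumes the client passed in behaves like boto3.client("s3"), whose head_bucket and get_bucket_location calls it relies on. The error codes it looks for are the ones S3 usually returns for a HEAD request, so other failures are simply re-raised.

```python
def check_bucket(s3_client, bucket, expected_region):
    """Classify a bucket as 'ok', 'missing', 'forbidden' or 'wrong-region:<r>'.

    s3_client is assumed to behave like boto3.client("s3").
    """
    try:
        s3_client.head_bucket(Bucket=bucket)
    except Exception as exc:
        # botocore's ClientError carries the parsed response in .response;
        # reading it via getattr keeps this sketch importable without boto3.
        code = str(getattr(exc, "response", {}).get("Error", {}).get("Code", ""))
        if code in ("404", "NoSuchBucket"):
            return "missing"
        if code in ("403", "AccessDenied"):
            return "forbidden"
        raise  # anything else (throttling, redirects, ...) is left to the caller

    location = s3_client.get_bucket_location(Bucket=bucket).get("LocationConstraint")
    # S3 reports us-east-1 as None and, for some older buckets, eu-west-1 as "EU".
    region = {None: "us-east-1", "EU": "eu-west-1"}.get(location, location)
    return "ok" if region == expected_region else f"wrong-region:{region}"


# Possible usage from the notebook instance (requires boto3 and AWS credentials):
#   import boto3
#   print(check_bucket(boto3.client("s3"),
#                      "sagemaker-eu-west-1-555555555555", "eu-west-1"))
```

A result of "forbidden" would point at the IAM role or bucket policy, while "wrong-region:..." would point at a mismatch between the bucket and the notebook region.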

If the above steps don't help, please raise a support case:
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html

from sagemaker-ssh-helper.

ivan-khvostishkov commented on June 4, 2024

Hi @djmarti, thanks for bringing up this important observation. The issue is probably rooted in recent changes to docker-compose: docker/compose#10797 .

Please downgrade the version as a workaround:

sudo curl -L "https://github.com/docker/compose/releases/download/v2.18.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

There's nothing we can do on the SageMaker SSH Helper side. I'll keep this issue open until SageMaker Notebooks get an update.

djmarti commented on June 4, 2024

Thanks Ivan for your prompt response and for the workaround. I think I gave a misleading hint. I am still unable to run the notebook after downgrading docker-compose to version 2.18.1. I checked that the version of docker-compose is the expected one:

$ whereis docker-compose
docker-compose: /usr/local/bin/docker-compose
$ docker-compose -v
Docker Compose version v2.18.1
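The same version check could also be done from a notebook cell. The sketch below assumes the output format shown above ("Docker Compose version v2.18.1"); the function names and the WORKAROUND_VERSION constant are hypothetical, introduced only for illustration.

```python
import re
import subprocess

WORKAROUND_VERSION = (2, 18, 1)  # version suggested as a workaround in this thread


def parse_compose_version(output):
    """Extract (major, minor, patch) from `docker-compose -v` output, or None."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_compose_version(binary="docker-compose"):
    """Run `<binary> -v` and parse its output (requires the binary on PATH)."""
    result = subprocess.run([binary, "-v"], capture_output=True, text=True, check=True)
    return parse_compose_version(result.stdout)


# Possible usage on the notebook instance:
#   if installed_compose_version() != WORKAROUND_VERSION:
#       print("docker-compose is not at the workaround version")
```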

But now I get an error that smells like a permission error:

e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:50,818 sagemaker_pytorch_container.training INFO     Invoking user training script.
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:50,875 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:51,010 sagemaker-training-toolkit ERROR    Reporting training FAILURE
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:51,010 sagemaker-training-toolkit ERROR    Framework Error: 
e13eeylbz4-algo-1-c64ep  | Traceback (most recent call last):
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/trainer.py", line 88, in train
e13eeylbz4-algo-1-c64ep  |     entrypoint()
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_container/training.py", line 153, in main
e13eeylbz4-algo-1-c64ep  |     train(environment.Environment())
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_container/training.py", line 100, in train
e13eeylbz4-algo-1-c64ep  |     entry_point.run(uri=training_environment.module_dir,
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/entry_point.py", line 92, in run
e13eeylbz4-algo-1-c64ep  |     files.download_and_extract(uri=uri, path=environment.code_dir)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/files.py", line 138, in download_and_extract
e13eeylbz4-algo-1-c64ep  |     s3_download(uri, dst)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/files.py", line 174, in s3_download
e13eeylbz4-algo-1-c64ep  |     s3.Bucket(bucket).download_file(key, dst)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/boto3/s3/inject.py", line 277, in bucket_download_file
e13eeylbz4-algo-1-c64ep  |     return self.meta.client.download_file(
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/boto3/s3/inject.py", line 190, in download_file
e13eeylbz4-algo-1-c64ep  |     return transfer.download_file(
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/boto3/s3/transfer.py", line 326, in download_file
e13eeylbz4-algo-1-c64ep  |     future.result()
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/futures.py", line 103, in result
e13eeylbz4-algo-1-c64ep  |     return self._coordinator.result()
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/futures.py", line 266, in result
e13eeylbz4-algo-1-c64ep  |     raise self._exception
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/tasks.py", line 269, in _main
e13eeylbz4-algo-1-c64ep  |     self._submit(transfer_future=transfer_future, **kwargs)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/download.py", line 354, in _submit
e13eeylbz4-algo-1-c64ep  |     response = client.head_object(
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
e13eeylbz4-algo-1-c64ep  |     return self._make_api_call(operation_name, kwargs)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
e13eeylbz4-algo-1-c64ep  |     raise error_class(parsed_response, operation_name)
e13eeylbz4-algo-1-c64ep  | botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
e13eeylbz4-algo-1-c64ep  | 
e13eeylbz4-algo-1-c64ep  | An error occurred (403) when calling the HeadObject operation: Forbidden
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:51,011 sagemaker-training-toolkit ERROR    Encountered exit_code 1
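When a log like the one above is long, the failing AWS call can be pulled out programmatically. The following is a minimal sketch that assumes the message format visible in this log ("An error occurred (CODE) when calling the OPERATION operation: MESSAGE"); ClientErrorSummary and find_client_error are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ClientErrorSummary(NamedTuple):
    code: str
    operation: str
    message: str


# Matches the message format seen in the log above.
_CLIENT_ERROR = re.compile(
    r"An error occurred \((?P<code>[^)]+)\) when calling the "
    r"(?P<operation>\w+) operation: (?P<message>.*)"
)


def find_client_error(log_text: str) -> Optional[ClientErrorSummary]:
    """Return the first client error found in the log text, or None."""
    for line in log_text.splitlines():
        match = _CLIENT_ERROR.search(line)
        if match:
            return ClientErrorSummary(
                match["code"], match["operation"], match["message"].strip()
            )
    return None
```

For the log above this yields code 403, operation HeadObject and message Forbidden. Note that S3 can also answer HeadObject with 403 when the object does not exist and the caller lacks permission to list the bucket, so the code alone does not distinguish a missing object from a denied one.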

A permission error is surprising because I didn't have issues before and because there haven't been any changes in my setup.

djmarti commented on June 4, 2024

Apologies for the long delay. I retried with the exact same code and the problem is gone, which is consistent with your suggestion that this was something related to SageMaker. Everything works as expected, so I'm closing the ticket.

ivan-khvostishkov commented on June 4, 2024

I've seen a similar message with HeadObject; in my case the notebook instance had been running for a very long time. I stopped and started the instance again and the issue was gone.
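The stop and start cycle can be done from the console, or scripted from a machine other than the notebook itself. A rough sketch, assuming the client behaves like boto3.client("sagemaker") and that the caller is allowed to stop and start notebook instances; restart_notebook_instance and the instance name in the usage comment are hypothetical:

```python
def restart_notebook_instance(sm_client, name):
    """Stop a notebook instance, wait until it is stopped, then start it again.

    sm_client is assumed to behave like boto3.client("sagemaker").
    """
    sm_client.stop_notebook_instance(NotebookInstanceName=name)
    # Block until the instance reports the Stopped status.
    sm_client.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)

    sm_client.start_notebook_instance(NotebookInstanceName=name)
    # Block until the instance is back in service.
    sm_client.get_waiter("notebook_instance_in_service").wait(NotebookInstanceName=name)


# Possible usage (run it outside the instance being restarted):
#   import boto3
#   restart_notebook_instance(boto3.client("sagemaker"), "my-notebook-instance")
```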
