
SageMaker SSH Helper


SageMaker SSH Helper is the "army-knife" library that helps you to securely connect to Amazon SageMaker training jobs, processing jobs, batch inference jobs and realtime inference endpoints as well as SageMaker Studio Notebooks and SageMaker Notebook Instances for fast interactive experimentation, remote debugging, and advanced troubleshooting.

The three most common tasks that motivated the creation of the library, sometimes referred to as "SSH into SageMaker", are:

  1. A terminal session into a container running in SageMaker to diagnose a stuck training job, use CLI commands like nvidia-smi, or iteratively fix and re-execute your training script within seconds.
  2. Remote debugging of code running in SageMaker from your favorite local IDE, such as PyCharm Professional Edition or Visual Studio Code.
  3. Port forwarding to access auxiliary tools running inside SageMaker, e.g., Dask dashboard, Streamlit apps, TensorBoard or Spark Web UI.


Other tasks include, but are not limited to, connecting to a remote Jupyter Notebook in SageMaker Studio from your IDE, or starting a VNC session to SageMaker Studio to run GUI apps.

How it works

SageMaker SSH Helper uses AWS Systems Manager (SSM) Session Manager to register a SageMaker container in SSM, then creates a session between your client machine and the Session Manager service. From there, you can "SSH into SageMaker" by creating an SSH (Secure Shell) connection on top of the SSM session, which allows you to open a Linux shell and configure bidirectional SSH port forwarding to run applications such as remote development, debugging, and desktop GUIs.


If you want to understand in more depth how both the SageMaker service and the SageMaker SSH Helper library work, check the Flow Diagrams of the common use cases and carefully read all sections of the documentation.

Make sure you have also looked at our Frequently Asked Questions, especially the troubleshooting section, as well as the existing open and resolved issues.

Getting started

To get started, your AWS system administrator must configure IAM and SSM in your AWS account as shown in Setting up your AWS account with IAM and SSM configuration.

Note: This repository is sample AWS content. You should not use the sample content in your production accounts, in a production environment, or on production or other critical data. If you plan to use the content in production, please review it carefully with your security team. You are responsible for testing, securing, and optimizing the sample content as appropriate for production-grade use based on your specific business requirements, including any quality control practices and standards.

Use Cases

SageMaker SSH Helper supports a variety of use cases:

Pro Tip: While multiple use cases allow debugging and remote code execution, the typical development journey looks as follows: (1) you start developing and running code in the IDE on your local machine, then (2) you connect the IDE to SageMaker Studio with SageMaker SSH Helper to test and troubleshoot it on the remote instance, then (3) you integrate your code with SageMaker and run it as a training, processing or inference job, using SageMaker SSH Helper if needed, and finally (4) assemble jobs into MLOps pipelines with SageMaker Projects, to be deployed into multi-account structure on top of (5) the secure enterprise ML platform.

If you want to add a new use case or a feature, see CONTRIBUTING.

Connecting to SageMaker training jobs with SSM

Download Demo (.mov)

Note: This demo was recorded with a previous version of SSH Helper and may not reflect the most recent features. Check the documentation below for the most up-to-date steps.

Step 1: Install the library

Before starting, check that both the pip and python commands point to Python 3.7 or higher with the python --version command.
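If you prefer to check the interpreter version programmatically, here is a minimal sketch (the helper name is illustrative, not part of the library):

```python
import sys

def is_supported_python(minimum=(3, 7)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum
```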

Install the latest stable version of the library from the PyPI repository:

pip install sagemaker-ssh-helper

Caution: It's always recommended to install the library into a Python venv, not into the system env.

Step 2: Modify your start training job code

  1. Add an import for SSHEstimatorWrapper.
  2. Add a dependencies parameter to the Estimator object constructor. Alternatively, instead of the dependencies parameter, put sagemaker_ssh_helper into source_dir/training/requirements.txt.
  3. Add an SSHEstimatorWrapper.create(estimator,...) call before calling fit().
  4. Add a call to ssh_wrapper.print_ssh_info() or ssh_wrapper.get_instance_ids() to get the SSM instance ID(s). You'll use this information to connect to the instance later.

In a nutshell:

import logging
from sagemaker.pytorch import PyTorch
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper  # <--NEW--

role = ...

estimator = PyTorch(
    entry_point='train.py',
    source_dir='source_dir/training/',
    dependencies=[SSHEstimatorWrapper.dependency_dir()],  # <--NEW--
    role=role,
    framework_version='1.9.1',
    py_version='py38',
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

ssh_wrapper = SSHEstimatorWrapper.create(estimator, connection_wait_time_seconds=600)  # <--NEW--

estimator.fit(wait=False)

ssh_wrapper.print_ssh_info()  # <--NEW-- 

The connection_wait_time_seconds parameter is the amount of time SSH Helper will wait inside SageMaker before continuing the job execution. It's useful for training jobs when you want to connect before training starts. If you don't want the job to wait, set it to 0 and training will start as soon as the job starts.
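Conceptually, the server-side waiting behavior can be sketched as follows (a simplified illustration under assumptions, not the library's actual code; in the real library the loop is terminated with the sm-wait stop command):

```python
import time

def wait_for_connection(connection_wait_time_seconds: int,
                        stop_requested=lambda: False,
                        clock=time.monotonic,
                        sleep=time.sleep) -> None:
    """Block until the wait time elapses or the loop is stopped manually.

    `stop_requested` stands in for the external stop signal; `clock` and
    `sleep` are injectable so the sketch is easy to test.
    """
    deadline = clock() + connection_wait_time_seconds
    while clock() < deadline and not stop_requested():
        sleep(1)
```

With connection_wait_time_seconds=0 the deadline is already reached, so the function returns immediately and the job proceeds, matching the behavior described above.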

As an example, here's the full working code from a unit test: test_end_to_end.py#L31-L56. The method start_ssm_connection_and_continue(port_number) will connect to the instance through the API, terminate the waiting loop, and start training (useful for automation).

If you configured distributed training (i.e., instance_count is more than one), by default SSH Helper will start only on the first two nodes (i.e., on algo-1 and algo-2). If you want to connect with SSH to other nodes, you can log in to either of these nodes, e.g., algo-2, and then SSH from there to any other node of the training cluster, e.g., algo-4, without running SSH Helper on those nodes. For example, inside the pre-built SageMaker framework containers such as the PyTorch training container, simply run ssh algo-4 from the shell.

Alternatively, for distributed training, pass the additional parameter ssh_instance_count with the desired instance count to SSHEstimatorWrapper.create(), e.g., SSHEstimatorWrapper.create(..., ssh_instance_count=3)
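To illustrate the node selection described above, here is a hypothetical helper (for illustration only, not part of the library):

```python
def ssh_helper_nodes(instance_count: int, ssh_instance_count: int = 2) -> list:
    """Nodes that run SSH Helper: the first min(ssh_instance_count,
    instance_count) hosts, named algo-1, algo-2, and so on."""
    n = min(ssh_instance_count, instance_count)
    return [f"algo-{i}" for i in range(1, n + 1)]
```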

Note: if you (a) don't use script mode, (b) use the basic Estimator class, and (c) have all code already stored in your Docker container, check the code sample in the corresponding section of the FAQ.

Don't run the modified code yet, see the next step.

Step 3: Modify your training script

Add the following lines at the top of your training script:

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

The setup_and_start_ssh() call starts an SSM Agent that connects the training instance to AWS Systems Manager.

See train.py from the corresponding unit test for a full working code sample.

Step 4: Connect over SSM

Once you've launched the job, you'll need to wait a few minutes for the SageMaker container and the SSM Agent to start. Then you'll need the ID of the managed instance. The instance ID is prefixed by mi- and appears in the job's CloudWatch log like this:

Successfully registered the instance with AWS SSM using Managed instance-id: mi-1234567890abcdef0
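If you ever need to scrape the ID out of a log line yourself, a regular expression does the job; a small illustrative sketch (the library's own get_instance_ids(), described below, does this for you):

```python
import re

def extract_instance_id(log_line: str):
    """Return the SSM managed instance ID (mi- plus 17 hex chars)
    from a log line, or None if the line contains no ID."""
    match = re.search(r"\bmi-[0-9a-f]{17}\b", log_line)
    return match.group(0) if match else None
```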

To fetch the instance IDs in an automated way, without looking into the logs, you can call the Python methods ssh_wrapper.get_instance_ids() or ssh_wrapper.print_ssh_info(), as mentioned in Step 2:

estimator = ...
ssh_wrapper = ...
estimator.fit(wait=False)
instance_ids = ssh_wrapper.get_instance_ids(timeout_in_sec=900)

The get_instance_ids() method accepts the optional parameter timeout_in_sec (default is 900, i.e., 15 minutes). If the timeout is not 0, it will retry fetching the instance IDs every 10 seconds.
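The described retry semantics can be modeled roughly like this (a simplified sketch of the behavior, not the library's implementation):

```python
import time

def poll_instance_ids(fetch, timeout_in_sec=900, interval=10,
                      clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it returns a non-empty list of instance IDs,
    retrying every `interval` seconds until `timeout_in_sec` elapses.
    With timeout_in_sec=0, fetch() is tried exactly once."""
    deadline = clock() + timeout_in_sec
    ids = fetch()
    while not ids and timeout_in_sec > 0 and clock() < deadline:
        sleep(interval)
        ids = fetch()
    return ids
```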

With the instance ID at hand, you will be able to connect to the training container using the command line or the AWS web console.

Method A. Connecting using command line:

  1. On the local machine, make sure that you have installed the latest AWS CLI v2 and the AWS Session Manager CLI plugin. Run the following command to perform the installation:
sm-local-configure
  2. Run this command, replacing the target value with the instance ID of your SageMaker job. Example:
aws ssm start-session --target mi-1234567890abcdef0

Note: SageMaker has recently introduced a native way to connect to training jobs with SSM.

Method B. Connecting using the AWS Web Console:

  1. In AWS Web Console, navigate to Systems Manager > Fleet Manager.
  2. Select the node, then Node actions > Start terminal session.

Once connected to the container, you might want to switch to the root user with the sudo su - command.

Method C. Connecting with SSH and port forwarding:

This method uses the sm-ssh connect command and is described in more detail in the sm-ssh section.

Method D. Connecting from SageMaker Studio

See the corresponding step Connecting from SageMaker Studio in the IAM / SSM configuration section. Follow the same steps as in Method A for the local machine, but run them inside SageMaker Studio. If you run the commands from a SageMaker Studio image terminal, make sure that your Python environment is activated, e.g., with conda activate base.

Tip: Useful CLI commands

Here are some useful commands to run in a terminal session:

  • ps xfaww - Show running tree of processes
  • ps xfawwe - Show running tree of processes with environment variables
  • ls -l /opt/ml/input/data - Show input channels
  • ls -l /opt/ml/code - Show your training code
  • pip freeze | less - Show all Python packages installed
  • dpkg -l | less - Show all system packages installed

Tip: Generating a thread dump for stuck training jobs

In case your training job is stuck, it can be useful to observe where its threads are waiting or busy. This can be done without connecting a Python debugger beforehand.

  1. Having connected to the container as root, find the process id (pid) of the training process (assuming it's named train.py): pgrep --newest -f train.py
  2. Install GNU debugger:
    apt-get -y install gdb python3.9-dbg
  3. Start the GNU debugger with python support:
    gdb python
    source /usr/share/gdb/auto-load/usr/bin/python3.9-dbg-gdb.py
  4. Connect to the process (replace 361 with your pid):
    attach 361
  5. Show C low-level thread dump:
    info threads
  6. Show Python high-level thread dump:
    py-bt
  7. It might also be useful to observe what system calls the process is making: apt-get install strace
  8. Trace the process (replace 361 with your pid):
    sudo strace -p 361

Tip: Pipeline automation

If you're looking for the full automation of the pipeline with SSM and SSH, and not only with get_instance_ids() method, take a look at the automation question in the FAQ.

Connecting to SageMaker inference endpoints with SSM

Note: SageMaker has recently introduced a native way to connect to endpoints with SSM, but it requires allow-listing of an AWS account (as of this writing).

Adding SageMaker SSH Helper to an inference endpoint is similar to training, with the following differences.

  1. Wrap your model into SSHModelWrapper before calling deploy() and add SSH Helper to dependencies:
from sagemaker import Predictor
from sagemaker_ssh_helper.wrapper import SSHModelWrapper  # <--NEW--

estimator = ...
...
endpoint_name = ... 

model = estimator.create_model(
    entry_point='inference_ssh.py',
    source_dir='source_dir/inference/',
    dependencies=[SSHModelWrapper.dependency_dir()]  # <--NEW 
    # (alternatively, add sagemaker_ssh_helper into requirements.txt 
    # inside source dir) --
)

ssh_wrapper = SSHModelWrapper.create(model, connection_wait_time_seconds=0)  # <--NEW--

predictor: Predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    endpoint_name=endpoint_name,
    wait=True
)

predicted_value = predictor.predict(data=...)

Note: For an inference endpoint, which is always up and running, there's not much value in setting connection_wait_time_seconds, so it's usually set to 0.

Similar to training jobs, you can fetch the instance ids for connecting to the endpoint with SSM with ssh_wrapper.get_instance_ids() or ssh_wrapper.print_ssh_info().

  2. Add the following lines at the top of your inference_ssh.py script:
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), "lib"))

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

Note: adding the lib dir to the Python path is required because SageMaker inference puts dependencies into the code/lib directory, while SageMaker training puts them directly into code.

Multi-model endpoints

For multi-model endpoints, the setup procedure is slightly different from regular endpoints:

from sagemaker.multidatamodel import MultiDataModel
from sagemaker_ssh_helper.wrapper import SSHModelWrapper, SSHMultiModelWrapper  # <--NEW--

model_data_prefix = "s3://DOC-EXAMPLE-BUCKET/mms/"
model_name = ...
endpoint_name = ...
estimator = ...
...

model = estimator.create_model(entry_point='inference_ssh.py',
                               source_dir='source_dir/inference/',
                               dependencies=[SSHModelWrapper.dependency_dir()])  # <--NEW--

mdm = MultiDataModel(
    name=model.name,
    model_data_prefix=model_data_prefix,
    model=model
)

ssh_wrapper = SSHMultiModelWrapper.create(mdm, connection_wait_time_seconds=0)  # <--NEW--

predictor = mdm.deploy(initial_instance_count=1,
                       instance_type='ml.m5.xlarge',
                       endpoint_name=endpoint_name)


mdm.add_model(model_data_source=model.repacked_model_data, model_data_path=model_name)

predicted_value = predictor.predict(data=..., target_model=model_name)

Important: Make sure that you pass to add_model() the model that is ready for deployment with dependencies, located at model.repacked_model_data, not estimator.model_data, which points to the trained model artifact. To obtain a model suitable for inference, you might want to first deploy your model to a temporary single-node endpoint, so that the SageMaker Python SDK takes care of repacking the model, or call the prepare_container_def() method, as in the MMS test code.

Also note that SageMaker SSH Helper will be lazy-loaded together with your model upon the first prediction request, so you should try to connect to the multi-model endpoint only after calling predict().

The inference.py script is the same as for regular endpoints.

Note: If you are using PyTorch containers, make sure you select the latest versions, e.g. 1.12, 1.11, 1.10 (1.10.2), 1.9 (1.9.1). This code might not work if you use PyTorch 1.8, 1.7 or 1.6.

Note: If you're packing your models manually and don't pass the model object to the MultiDataModel constructor, i.e., pass only the image_uri, see corresponding sample code in the FAQ.md.

Connecting to SageMaker batch transform jobs

For batch transform jobs, you need to use both SSHModelWrapper and SSHTransformerWrapper, as in the following example:

from sagemaker_ssh_helper.wrapper import SSHModelWrapper, SSHTransformerWrapper  # <--NEW--

sagemaker_session = ...
bucket = ...
estimator = ...
...

model = estimator.create_model(entry_point='inference_ssh.py',
                               source_dir='source_dir/inference/',
                               dependencies=[SSHModelWrapper.dependency_dir()])  # <--NEW--

transformer_input = sagemaker_session.upload_data(path='data/batch_transform/input',
                                                  bucket=bucket,
                                                  key_prefix='batch-transform/input')

transformer_output = f"s3://{bucket}/batch-transform/output"

ssh_model_wrapper = SSHModelWrapper.create(model, connection_wait_time_seconds=600)  # <--NEW--

transformer = model.transformer(instance_count=1,
                                instance_type="ml.m5.xlarge",
                                accept='text/csv',
                                strategy='SingleRecord',
                                assemble_with='Line',
                                output_path=transformer_output)

ssh_transformer_wrapper = SSHTransformerWrapper.create(transformer, ssh_model_wrapper)  # <--NEW--

transformer.transform(data=transformer_input,
                      content_type='text/csv',
                      split_type='Line',
                      join_source="Input",
                      wait=False)

The inference.py script is the same as for regular endpoints.

Connecting to SageMaker processing jobs

SageMaker SSH Helper supports both Script Processors and Framework Processors; the setup procedure is similar to training jobs and inference endpoints.

A. Framework processors

The code to set up a framework processor (e.g. PyTorch) is the following:

from sagemaker.pytorch import PyTorchProcessor
from sagemaker_ssh_helper.wrapper import SSHProcessorWrapper  # <--NEW--

role = ...

torch_processor = PyTorchProcessor(
    base_job_name='ssh-pytorch-processing',
    framework_version='1.9.1',
    py_version='py38',
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge"
)

ssh_wrapper = SSHProcessorWrapper.create(torch_processor, connection_wait_time_seconds=600)  # <--NEW--

torch_processor.run(
    source_dir="source_dir/processing/",
    dependencies=[SSHProcessorWrapper.dependency_dir()],  # <--NEW--
    code="process_framework.py"
)

Also add the following lines at the top of process_framework.py:

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

B. Script Processors

The code to set up a script processor (e.g. PySpark) is the following:

from sagemaker.spark import PySparkProcessor
from sagemaker_ssh_helper.wrapper import SSHProcessorWrapper  # <--NEW--

role = ...

spark_processor = PySparkProcessor(
    base_job_name='ssh-spark-processing',
    framework_version="3.0",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge"
)

ssh_wrapper = SSHProcessorWrapper.create(spark_processor, connection_wait_time_seconds=600)  # <--NEW--

spark_processor.run(
    submit_app="source_dir/processing/process.py",
    inputs=[ssh_wrapper.augmented_input()]  # <--NEW--
)

Also add the following lines at the top of process.py:

import sys
sys.path.append("/opt/ml/processing/input/")

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

Forwarding TCP ports over SSH tunnel

The previous sections focused on connecting to non-interactive SageMaker containers with SSM.

The next sections rely on the Session Manager capability to create an SSH tunnel over an SSM connection. SageMaker SSH Helper in turn runs an SSH session over this tunnel and forwards the ports, including the SSH server port 22 itself.

The helper script behind this logic is sm-local-start-ssh:

sm-local-start-ssh "$INSTANCE_ID" \
  -R localhost:12345:localhost:12345 \
  -L localhost:8787:localhost:8787 \
  -L localhost:11022:localhost:22

You can pass -L parameters to forward a remote container port to the local machine (e.g., 8787 for the Dask dashboard or 8501 for Streamlit apps), or -R to forward a local port to the remote container. Read more about these options in the SSH manual.

This low-level script takes the managed instance ID as a parameter. The next sections use the high-level command sm-ssh, which takes a SageMaker resource name as a parameter and resolves it into the instance ID automatically.

sm-ssh

The syntax for the SSH Helper CLI command sm-ssh is the following:

sm-ssh [-h] [-v] {list,start-proxy,connect} [fqdn]

where fqdn is the resource name with a .sagemaker suffix:

  • for model training, .training.sagemaker
  • for inference endpoints, i.e., real-time inference, .inference.sagemaker
  • for transform jobs, i.e., batch inference, .transform.sagemaker
  • for processing jobs, i.e. transforms without a trained model, .processing.sagemaker
  • for SageMaker Studio Classic, .studio.sagemaker - see Local IDE integration with SageMaker Studio over SSH for more details on FQDN format
  • for SageMaker Notebook instances, .notebook.sagemaker
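To illustrate the naming convention, here is a hypothetical parser (for illustration only, not part of the library):

```python
def parse_sagemaker_fqdn(fqdn: str):
    """Split '<resource_name>.<resource_type>.sagemaker' into
    (resource_name, resource_type); the name itself may contain dots,
    as in the SageMaker Studio FQDN format."""
    parts = fqdn.split(".")
    if len(parts) < 3 or parts[-1] != "sagemaker":
        raise ValueError(f"not a .sagemaker FQDN: {fqdn}")
    return ".".join(parts[:-2]), parts[-2]
```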

For the list command, the resource name and the leading dot of the suffix can be omitted, i.e.:

sm-ssh list studio.sagemaker

– will list all running Jupyter servers and kernel gateways and their SSH status, and

sm-ssh list 

or

sm-ssh list sagemaker 

– will list all resources of all types.

The instances with SSH Helper will be marked Online while other instances will be marked with -.

The connect command starts an interactive SSH session into the container, e.g.:

sm-ssh connect ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker

~/.ssh/config

Alternatively, instead of using the sm-ssh connect command, you can use the native ssh command, but this requires updating your SSH config, typically ~/.ssh/config, with the sm-ssh start-proxy command as follows:

Host *.*.sagemaker
  IdentityFile ~/.ssh/%h
  PasswordAuthentication no
  ConnectTimeout 90
  ServerAliveInterval 15
  ServerAliveCountMax 4
  ProxyCommand sm-ssh start-proxy %h
  User root

You can copy the same fragment from the ssh_config_template.txt file.

The sm-ssh start-proxy command sets up a non-interactive SSH session that serves as a proxy tunnel for the ssh command.

As a benefit, you will be able to add additional SSH options, like forwarding the SSH agent connection with the -A option to securely pass your local SSH keys to the remote machine, or forwarding ports with the -R and -L options, akin to passing these options to the sm-local-start-ssh command.

An example with SSH Agent and forwarding the web server port 8080:

ssh-add
ssh -A -L localhost:8080:localhost:8080 \
  ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker

As a drawback, you won't get comprehensive logging, since the output of sm-ssh will be suppressed by ssh. If you have connection issues with native ssh, try the sm-ssh command instead and check its output.

Follow the steps in the next section on IDE configuration to prepare sm-ssh for use on the local machine.

Remote code execution with PyCharm / VSCode over SSH

  1. On the local machine, make sure that you have installed the latest AWS CLI v2 and the AWS Session Manager CLI plugin. To do so, perform the automated installation with the sm-local-configure script:
sm-local-configure

Caution: If you plan to use sm-ssh tool from the IDE, which you run inside your system Python env, you should install SSH Helper into your system Python env, too.

  2. Submit your code to SageMaker with SSH Helper as described in the previous sections, e.g., as a training job.

Make sure you allow enough time to manually set up the connection (do not set connection_wait_time_seconds to 0; the recommended minimum is 600, i.e., 10 minutes). Don't hesitate to set it to a higher value, e.g., 30 minutes, because you will be able to terminate the waiting loop once you have connected.

Instead of using SSM to connect to the container from the command line, proceed to the next step to configure the IDE.

  3. Configure the remote interpreter in your IDE

Make sure you've configured your ssh config as mentioned in the ~/.ssh/config section and your IDE can access sm-ssh command from the system env.

A. Follow the instructions in the PyCharm docs, to configure the remote interpreter in PyCharm.

In the field for host name, put the same value as for fqdn in the sm-ssh command, e.g., ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker, and use root as the username.

Tip: When you configure the Python interpreter in PyCharm, it's recommended to set the deployment path mapping for your project to point to /root/project_name instead of the default /tmp/pycharm_project_123. This way you will be able to see your project in SageMaker Studio, and PyCharm will automatically sync your local dir to the remote dir.

Tip: Also, instead of creating a new venv, point the Python interpreter to an existing location. You can find this location by running a cell with import sys; sys.executable in a SageMaker Studio notebook. You will get something like /opt/conda/bin/python.

Tip: Now you also can upload and download files from remote and synchronize files with remote.

B. Follow the instructions for VSCode to configure the local Visual Studio Code app.

Put root@fqdn as the hostname to connect to, e.g., root@ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker.

NOTE: The Remote SSH extension described in the above instructions is only for the Visual Studio Code native app. Code Editor in SageMaker Studio and web apps based on Code Server that use extensions from the Open VSX Registry might look and work differently. SageMaker SSH Helper DOES NOT support browser-based implementations and hasn't been tested with any of the Open VSX extensions. If you prefer to use the browser for development, take a look at the Web VNC option.

  4. Connect to the instance and stop the waiting loop

When you set connection_wait_time_seconds to a non-zero value, SSH Helper will run a waiting loop inside your training script until the waiting time has passed or you manually terminate the loop.

To manually terminate the loop, run the sm-wait stop command from the container (under root):

ssh root@ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker \
  sm-wait stop

Note that if you stop the waiting loop, SageMaker will run your training script only once, and you will be able to execute additional code from your local machine in PyCharm only while your script is running. Once the script finishes, you will need to submit another training job and repeat the procedure.

Here's a useful trick: submit a dummy script train_placeholder.py with an infinite loop; while this loop is running, you can rerun your real training script again and again with the remote interpreter inside the same job.

The workflow in this case is roughly the following:

a. You submit a first job with your training script train.py, and it fails for some reason that you want to troubleshoot.

b. You submit a second job with the placeholder script train_placeholder.py. You run your training script inside this job and change it a few times until you find the cause of the problem and fix it. Setting the max_run parameter of the estimator is highly recommended for the placeholder job, to avoid unnecessary charges.

c. You submit a third job with your fixed training script train.py to make sure it works now.

The dummy script may look like this:

import time
from datetime import timedelta

from sagemaker_ssh_helper import setup_and_start_ssh, is_last_session_timeout

setup_and_start_ssh()

while not is_last_session_timeout(timedelta(minutes=30)):
    time.sleep(10)

The is_last_session_timeout() method helps prevent idle resources: the job will end if there have been no SSM or SSH sessions for the specified period of time. It counts active SSM sessions and times out when there are no sessions left.
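The timeout semantics can be modeled with a small sketch (an illustration of the behavior described above, not the actual implementation):

```python
from datetime import datetime, timedelta

class IdleSessionTracker:
    """Times out once no active sessions have been seen for `timeout`.

    `now` is injectable so the logic can be tested with a fake clock.
    """
    def __init__(self, timeout: timedelta, now=datetime.utcnow):
        self.timeout = timeout
        self.now = now
        self.last_seen = now()

    def is_last_session_timeout(self, active_session_count: int) -> bool:
        if active_session_count > 0:
            self.last_seen = self.now()  # reset the idle timer
        return self.now() - self.last_seen > self.timeout
```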

Caution: Keep in mind that SSM sessions terminate automatically after user inactivity, but SSH sessions keep running until either the user terminates them manually or a network timeout occurs, e.g., the user closes the laptop lid or disconnects from Wi-Fi. If the user leaves the local machine unattended and connected to the Internet, SSM sessions started by the aws ssm start-session command will time out, but SSH-over-SSM sessions started with sm-ssh connect will stay open. Consider sending e-mail notifications to users of long-running jobs, so they don't forget to shut down unused resources. See the related question in the FAQ for more details and train_placeholder.py, which implements similar logic.

Pro Tip: Make sure that you're aware of SageMaker Managed Warm Pools feature, which is also helpful in the scenario when you need to rerun your code remotely multiple times.

Pro Tip: You can debug your code line by line in this scenario. See the tutorial in PyCharm documentation.

  5. Run and debug your code

Now that you have your training script or a placeholder script running, you can run additional code on the remote host, debug it line by line, and set breakpoints.

If you want to change the control flow and let your training script call back the IDE for debugging, follow the next section on configuring the Debug Server.

Pro Tip: The curious reader should also read the AWS blog post Run your TensorFlow job on Amazon SageMaker with a PyCharm IDE. In contrast to the scenario with SageMaker SSH Helper, the blog instructions show how to use SageMaker local mode. As with Managed Warm Pools, SageMaker local mode helps to test your code faster, but it consumes local resources and doesn't provide the line by line debugging capability (as of writing).

Remote debugging with PyCharm Debug Server over SSH

There's another way to debug your code that is specific to PyCharm Professional: Remote debugging with the Python remote debug server configuration. The procedure assumes that you're running a training job, but the same steps apply to inference or data processing, too.

  1. In PyCharm, go to Run/Debug Configurations (Run -> Edit Configurations...) and add a new Python Debug Server. Choose a fixed port, e.g., 12345.

  2. Take the correct version of the pydevd-pycharm package from the configuration window and install it either through requirements.txt or by calling pip from your source code.

  3. Add commands that connect to the Debug Server to your code after the setup_and_start_ssh() call, e.g., in a training script that you submit as an entry point for a training job:

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

...

import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=12345, stdoutToServer=True, stderrToServer=True, suspend=True)

Tip: Check the settrace() argument's description in the library source code.

  4. Set breakpoints in your code with PyCharm, as needed

  5. Start the Debug Server in PyCharm

  6. Submit your code to SageMaker with SSH Helper as described in previous sections.

  7. On your local machine, once SSH Helper connects to SSM and starts waiting inside the training job, connect to the host with SSH and start the port forwarding for the Debug Server:

ssh -R localhost:12345:localhost:12345 \
  root@ssh-training-example-2023-07-25-03-18-04-490.training.sagemaker

It will reverse-forward the remote debugger port 12345 to your local machine's Debug Server port.

  8. Stop the waiting loop

As already mentioned, make sure you've configured connection_wait_time_seconds to give yourself time to start the port forwarding before execution of the training script continues, and before it tries to connect to the debug server at port 12345.

Inside the SSH session, run:

sm-wait stop
  9. After you stop the waiting loop, your code will continue running and will connect to your PyCharm Debug Server.

If everything is set up correctly, PyCharm will stop at your breakpoint, highlight the line and wait for your input. Debug Server window will say “connected”. You can now press, for example, F8 to "Step Over" the code line or F7 to "Step Into" the code line.

Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode

Download Demo (.mov)

Note: This demo was recorded with a previous version of SSH Helper and may not reflect the most recent features. Check the documentation for the most up-to-date steps.

For your local IDE integration with SageMaker Studio, follow the same steps as for configuring the IDE for Remote code execution, but instead of submitting the training / processing / inference code to SageMaker with Python SDK, execute the Jupyter notebook, as described in the next steps.

  1. Copy SageMaker_SSH_IDE.ipynb into SageMaker Studio and run it.

Alternatively, attach to a domain the KernelGateway lifecycle config script kernel-lc-config.sh (you may need to ask your administrator to do this). Once configured, from the Launcher choose the environment, pick the lifecycle script and choose 'Open image terminal' (so you don't even need to create a notebook).

Note that the main branch of this repo can contain changes that are not compatible with the version of sagemaker-ssh-helper that you installed from pip. To ensure stable behavior, check the installed version with pip freeze | grep sagemaker-ssh-helper and take the notebook and the lifecycle script from the corresponding tag.

  2. Configure remote interpreter in PyCharm / VS Code to connect to SageMaker Studio

Use app_name.user_profile_name.domain_id.studio.sagemaker or app_name.studio.sagemaker as the FQDN to connect.
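As an illustration, the host name can be composed like this (a hypothetical helper for your own scripts, not part of the library's API):

```python
from typing import Optional

def studio_ssh_fqdn(app_name: str,
                    user_profile_name: Optional[str] = None,
                    domain_id: Optional[str] = None) -> str:
    """Compose the sm-ssh host name for a SageMaker Studio app."""
    parts = [app_name]
    if user_profile_name and domain_id:
        # fully qualified form: app_name.user_profile_name.domain_id.studio.sagemaker
        parts += [user_profile_name, domain_id]
    return ".".join(parts + ["studio", "sagemaker"])

print(studio_ssh_fqdn("datascience-app", "my-profile", "d-egm0dexample"))
# → datascience-app.my-profile.d-egm0dexample.studio.sagemaker
```

The short form (app name only) works when the app name alone is unambiguous in your account.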

To see available apps to connect to, you may run the list command:

sm-ssh list studio.sagemaker
  3. Using the remote Jupyter Notebook

To forward the remote Jupyter Server port 8889 to the local machine, use SSH:

ssh -L localhost:8889:localhost:8889 \
  root@app_name.user_profile_name.domain_id.studio.sagemaker

Now you can also connect to a remote Jupyter Server started by SSH Helper inside SageMaker Studio at http://127.0.0.1:8889/?token=<<your_token>>.

You will find the full URL with remote token in the SageMaker_SSH_IDE.ipynb notebook in the output after running the cell with sm-ssh-ide start command. If you use lifecycle configuration, run tail /tmp/jupyter-notebook.log from the image terminal to find the Jupyter Server URL.
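If you prefer to grab the URL programmatically, a small sketch like the following can pull it out of the log output. The log format shown is an assumption based on standard Jupyter Server startup messages:

```python
import re

def extract_jupyter_url(log_text):
    """Return the first http(s) URL with a ?token=... query from Jupyter log output."""
    match = re.search(r"https?://\S+\?token=\w+", log_text)
    return match.group(0) if match else None

sample_log = (
    "[I 2023-07-25 10:00:00 ServerApp] Jupyter Server is running at:\n"
    "[I 2023-07-25 10:00:00 ServerApp]     http://127.0.0.1:8889/?token=abc123def456\n"
)
print(extract_jupyter_url(sample_log))  # → http://127.0.0.1:8889/?token=abc123def456
```

You could pipe `tail /tmp/jupyter-notebook.log` into such a helper instead of scanning the log by eye.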

  4. Connecting to VNC

To forward the remote VNC port 5901 to the local machine, use SSH:

ssh -L localhost:5901:localhost:5901 \
  -R localhost:443:jetbrains-license-server.example.com:443 \
  root@app_name.user_profile_name.domain_id.studio.sagemaker

Note (PyCharm): The optional -R option will connect the remote port 443 to your local PyCharm license server address. Replace jetbrains-license-server.example.com with your server name and edit your /etc/hosts inside VNC to make this host point to 127.0.0.1 (this is done automatically if you didn't skip the sm-ssh-ide set-jb-license-server step in the notebook).

Now you can start the VNC session to vnc://localhost:5901 (e.g. on macOS with Screen Sharing app) and run IDE or any other GUI app on the remote desktop instead of your local machine.

For example, inside VNC you can run the jupyter qtconsole command to start the Jupyter QT app as an alternative to the Jupyter web UI:

Jupyter QT in VNC

  5. If you want to switch to another kernel or instance, feel free to do so from the SageMaker Studio UI and re-run SageMaker_SSH_IDE.ipynb.

Keep in mind that in this case the previous kernel will stop and SSM Agent will stop, too. To allow multiple kernels and instances to be up and running with SageMaker SSH Helper and SSM Agent, duplicate the notebook and give it a different name, e.g. SageMaker_SSH_IDE-PyTorch.ipynb. In this case you'll be able to keep two environments in parallel.

If you're using lifecycle configuration script, just start another image terminal with different environment settings from Launcher.

  6. Don't forget to shut down SageMaker Studio resources, if you don't need them anymore, e.g., launched notebooks, terminals, apps and instances.

Web VNC

At times, you cannot install all the required software on your local machine, e.g., because it is this software that processes the data, and you cannot copy massive amounts of the data to your local machine.

You might have thought about AWS Jupyter Proxy, but some web apps like Dask may not fully work through the proxy, so VNC is the recommended alternative.

By combining the noVNC tool with the AWS Jupyter Proxy extension, you can run virtually any IDE like PyCharm, VSCode, or PyDev, or any tool like Blender (to work with 3D data), OpenShot (to work with audio-video data), etc., as well as other web apps, from the SageMaker Studio web UI, without installing any of them on your local machine.

It's also helpful in situations when you cannot run an SSH client on your local machine to forward ports for web tools, like the Dask dashboard. In this case, you run a tool in the remote browser running through the web VNC (browser-in-a-browser), as in the screenshot below. You might notice that PyCharm and VSCode are also running in the background: WebVNC Screenshot

To achieve this result, your Administrator should configure your SageMaker IAM role with both SSHSageMakerServerPolicy and SSHSageMakerClientPolicy. Configuration of IAM credentials for the local machine is not required in this case. See Step 4 in IAM_SSM_Setup.md for more details.

Then follow these steps:

  1. On the SageMaker Studio System terminal run the commands from server-lc-config.sh.

Alternatively, ask the Administrator to attach the lifecycle config to the SageMaker Studio domain or to your profile as the default JupyterServer config, e.g., with the name sagemaker-ssh-helper-webvnc.

  2. Follow step 1 of the IDE configuration procedure, i.e., run the IDE notebook or lifecycle config inside the kernel gateway of your choice.

Instead of your local user ID put the SageMaker Studio user ID (you can get it by running aws sts get-caller-identity from a SageMaker Studio terminal).

  3. On the System (!) terminal (not the image terminal), run:
sm-ssh connect app_name.user_profile_name.domain_id.studio.sagemaker

Alternatively, use the SSH command to forward the VNC port and add more ports to the command, e.g., -L localhost:8787:localhost:8787 to forward the Dask dashboard that is running inside the kernel gateway:

ssh -L localhost:5901:localhost:5901 \
  -L localhost:8787:localhost:8787 \
  app_name.user_profile_name.domain_id.studio.sagemaker
  4. Navigate to https://d-egm0dexample.studio.eu-west-1.sagemaker.aws/jupyter/default/proxy/6080/vnc.html?host=d-egm0dexample.studio.eu-west-1.sagemaker.aws&port=443&path=jupyter/default/proxy/6080/websockify

Replace both occurrences of d-egm0dexample with your SageMaker Studio domain ID, and eu-west-1 with your AWS Region.
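To avoid hand-editing the URL, you can generate it from your domain ID and Region with a small helper. This is a hypothetical convenience sketch; the path layout simply mirrors the example URL above:

```python
def novnc_url(domain_id: str, region: str,
              app: str = "default", novnc_port: int = 6080) -> str:
    """Build the noVNC URL served through the AWS Jupyter Proxy."""
    host = f"{domain_id}.studio.{region}.sagemaker.aws"
    path = f"jupyter/{app}/proxy/{novnc_port}/websockify"
    return (f"https://{host}/jupyter/{app}/proxy/{novnc_port}/vnc.html"
            f"?host={host}&port=443&path={path}")

print(novnc_url("d-egm0dexample", "eu-west-1"))
```

Substitute your own domain ID and Region when calling it.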

You will see the noVNC welcome screen.

  5. Press "Connect" and enter your password (default is 123456).

Congratulations! You now have successfully logged into the remote desktop environment running inside a SageMaker Studio kernel gateway.

Some data handling applications to try inside the VNC desktop, which you cannot otherwise run as web apps, are:

  • 3D Slicer - an image computing platform for medical, biomedical, and other 3D images and meshes
  • OpenShot - to work with video data
  • LibreOffice - to work with documents and spreadsheets

Tip: If you have issues with copy-pasting through the system clipboard, use a temp file, e.g. clip.txt, and open it in the VNC session and the SageMaker Studio file browser at the same time.

Pro Tip: To set the resolution that matches your browser window size, make a page screenshot (in Firefox - right-click on an empty area -> Take Screenshot -> Save visible), then inspect the resolution of the image, e.g. 1920x970. Then add and switch resolution inside the VNC session:

$ cvt 1920 970 60
# 1920x970 59.93 Hz (CVT) hsync: 60.35 kHz; pclk: 154.50 MHz
Modeline "1920x970_60.00"  154.50  1920 2040 2240 2560  970 973 983 1007 -hsync +vsync
$ xrandr --newmode "1920x970_60.00"  154.50  1920 2040 2240 2560  970 973 983 1007 -hsync +vsync
$ xrandr --addmode VNC-0 1920x970_60.00
$ xrandr -s 1920x970_60.00

Troubleshooting

If something doesn't work as expected, make sure you have looked at our FAQ, especially the troubleshooting section, as well as the existing open and resolved issues.

sagemaker-ssh-helper's People

Contributors

amazon-auto, b-gran, gilinachum, ivan-khvostishkov, kirill-fedyanin, olivigne, tejaswi-chillakuru


sagemaker-ssh-helper's Issues

VSCode disconnects after credentials refresh.

Thank you for the great library! It works fine when I SSH in directly, no issues there. However, if I connect with VSCode, it'll work fine for the most part - until I see a log:

[sagemaker-ssh-helper][sm-setup-ssh][start-ssh] 2024-04-21 07:28:33 INFO [CredentialRefresher] Next credential rotation will be in 29.997197441433332 minutes

Then, the machine will fail within a minute or two of this message, and give the error InternalServerError: We encountered an internal error. Please try again. in Sagemaker.

Weirdly enough, this is regardless of the instance type, amount of memory, etc. Also - it doesn't always happen the first time that message is sent, so I'm not sure if it's exactly that issue, or something else. Regardless, the machine fails with an internal server error only when I connect with VS Code after some amount of time connected.

[Question] How to connect to sagemaker notebooks

You've mentioned on reddit it's possible to connect to notebook instance in sagemaker with the library. (https://www.reddit.com/r/aws/comments/gibbtg/guide_sagemaker_ssh_to_notebook_instances/)
I couldn't see how to do it in the readme; I've tried to just do sagemaker_ssh_helper.setup_and_start_ssh() in a script, but it asks for one of the wrappers and I don't see one that would match.

WARNING: SageMaker SSH Helper is not correctly initialized. Did you forget to call wrapper.create() _before_ fit() / run() / transform() / deploy()?

Could you please point the direction on how to approach this?

Are scripts supposed to work on SageMaker notebook instances?

Hey there, thanks a lot for creating these examples. I've noticed that some scripts have parts that mention sagemaker notebook instances, but they fail both on server and client side because they expect DOMAIN_ID, USER_ID etc.

Are these supposed to work on SageMaker notebook instances?

Enable advanced-instances tier?

Hi - I have high hopes for the sagemaker-ssh-helper, for which thanks!

After setup, upon running

ssh -i ~/.ssh/sagemaker-ssh-gw -p 10022 root@localhost

at the end I get:

An error occurred (BadRequest) when calling the StartSession operation: Enable advanced-instances tier to use Session Manager with your on-premises instances
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

The error aside, does this mean the library will only work with the "advanced-instances tier"?? How does one know if one has this? I am trying to SSH into a plain old sagemaker instance... Help!

SSH port forwarding to KernelGateway app container

Hello,

Is it possible to use one of the sagemaker-ssh-helper scripts to enable port forwarding between SageMaker Studio's JupyterServer and a running KernelGateway app container? I am developing a Streamlit application which runs on localhost on a given port, e.g. 8053. If I start the application from the System Terminal I can access the application UI in the browser under domain.studio.region.sagemaker.aws/jupyter/default/proxy/8053. However the app is computationally intensive, so I would like to run it from a container Image Terminal instead, in order to take advantage of the container's resources, while still being able to access the application UI in the browser as before. I tried to forward the port on which the application is running inside the container to the Jupyter Server port using sm-local-ssh-ide, but got the following error: SSMManager:No instance IDs found

Perhaps this is not the intended use of the script? Your help would be greatly appreciated as I am new to SageMaker.

[Feature] An option to start only ssh server inside SageMaker Studio

Now SageMaker SSH Helper starts VNC server and Jupyter server along with sshd. The new option will allow to start only the minimal necessary service sshd. The user will need to comment the second line in the IDE notebook cell and uncomment the third one:

sm-ssh-ide stop
sm-ssh-ide start
#sm-ssh-ide start --ssh-only

Error occurred when starting amazon-ssm-agent: failed to get identity: failed to find agent identity

I'm trying to implement local IDE access to Sagemaker Studio by following the instructions found here

Specifically, I've gone for implementing the Lifecycle config script and not the iPython notebook.

It seems to all go well until the very last step in which the amazon-ssm-agent is invoked. It appears to try and call out to IMDS, but AWS themselves say that IMDS access is blocked in Sagemaker Studio.

What should I do in this case? Error logs attached below:

2023-09-12 11:57:23 INFO Checking if agent identity type OnPrem can be assumed
2023-09-12 11:57:23 INFO Checking if agent identity type EC2 can be assumed
2023-09-12 11:57:23 ERROR [EC2Identity] Failed to get instance info from IMDS. Err: failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: : 
	status code: 0, request id: 
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-09-12 11:57:23 INFO Checking if agent identity type CustomIdentity can be assumed
2023-09-12 11:57:23 ERROR Agent failed to assume any identity
2023-09-12 11:57:23 ERROR failed to get identity: failed to find agent identity
2023-09-12 11:57:23 ERROR Error occurred when starting amazon-ssm-agent: failed to get identity: failed to find agent identity

sagemaker-ssh-helper version 2.1.0 used.

does ssh helper support byoc sagemaker endpoint?

hello team
we have a byoc sagemaker endpoint which want to debug, but I found the ssh helper only has byos mode sagemaker endpoint to setup:

model = estimator.create_model(
    entry_point='inference_ssh.py',
    source_dir='source_dir/inference/',
    dependencies=[SSHModelWrapper.dependency_dir()]  # <--NEW
    # (alternatively, add sagemaker_ssh_helper into requirements.txt
    # inside source dir)
)

is there anyway we can debug BYOC sm endpoint? do you have any guide for that?

Thanks

[Feature] Copying file to an instance from local

Sometimes I'd like to copy a file from my local host to a running SageMaker instance after the instance is already started. Is it possible to do this with SSM, or through some other solution?

Currently, I use an S3 bucket as a proxy (local host -> s3 bucket -> SageMaker instance). This works, but requires a few steps. I would prefer to directly transfer the file.

PyCharm debugging question

Hi, thanks for your wonderful work in this project. I am trying to do this part Remote debugging with PyCharm Debug Server over SSH and I have a question:

When trying to debug a training session, the command sm-local-ssh-training failed even after using root. To clarify, I am running this inside the ssh connection.

$ ./sm-local-ssh-training connect pytorch-mnist-2022-11-15-09-46-35-587
sh: 11: ./sm-local-ssh-training: Permission denied
$ sudo ./sm-local-ssh-training connect pytorch-mnist-2022-11-15-09-46-35-587
./sm-local-ssh-training: line 17: python: command not found

If i run the command on local terminal (using Mac) this error happens

$ sm-local-ssh-training connect pytorch-mnist-2022-11-15-13-19-58-576
/opt/homebrew/bin/sm-local-ssh-training: line 17: python: command not found
INSTANCE_ID not provided

Am I missing something for debug setup?

[Issue] When use the MMS host model. e.g. HuggingFace Model, no any info in cloudwatch log and can not use ssh

model_data = 's3://kraft-source-bucket/huggingface_model/model.tar.gz'

from sagemaker.huggingface import HuggingFaceModel
from sagemaker_ssh_helper.wrapper import SSHModelWrapper
import sagemaker

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    dependencies=[SSHModelWrapper.dependency_dir()],
    model_data=model_data,
    role=role
)
ssh_wrapper = SSHModelWrapper.create(huggingface_model, connection_wait_time_seconds=0)
huggingface_model.deploy(initial_instance_count=1,instance_type="ml.g4dn.xlarge",wait=False)

model.tar.gz

  • pytorch.xx.bin
  • code/
    • inference.py

cat inference.py
import argparse
import io
import json
import logging
import os
import sys
import subprocess
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
from PIL import Image
from torchvision import datasets, transforms
from torchvision.transforms import ToTensor
from model import Net

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.info(os.system("nvidia-smi"))
sys.path.append(os.path.join(os.path.dirname(__file__), "lib"))

import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

def model_fn(model_dir):
    print(model_dir)
    logger.info(model_dir)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.DataParallel(Net())
    with open(os.path.join(model_dir, "model.pth"), "rb") as f:
        model.load_state_dict(torch.load(f))
    return model.to(device)

def load_from_bytearray(request_body):
    image_as_bytes = io.BytesIO(request_body)
    image = Image.open(image_as_bytes)
    image_tensor = ToTensor()(image).unsqueeze(0)
    return image_tensor

def input_fn(request_body, request_content_type):
    # if set content_type as 'image/jpg' or 'application/x-npy',
    # the input is also a python bytearray
    if request_content_type == "application/x-image":
        image_tensor = load_from_bytearray(request_body)
    else:
        print("not support this type yet")
        raise ValueError("not support this type yet")
    return image_tensor

# Perform prediction on the deserialized object, with the loaded model
def predict_fn(input_object, model):
    output = model.forward(input_object)
    pred = output.max(1, keepdim=True)[1]

    return {"predictions": pred.item()}

# Serialize the prediction result into the desired response content type
def output_fn(predictions, response_content_type):
    return json.dumps(predictions)

I run this code:
instance_ids = ssh_wrapper.get_instance_ids() # <--NEW--
print(f'To connect over SSM run: aws ssm start-session --target {instance_ids[0]}')

There is no output, and the CloudWatch log contains no info related to sagemaker-ssh-helper.

[Question] Relax bucket and role

Hi, I am using customized SageMaker IAM role and bucket, would it be possible to relax both of these values?

For example:
python3 -m sagemaker_ssh_helper.deregister_old_instances_from_ssm --iam-role "ml-experiment-*.*"

[Feature] Alternative MultiDataModel instantiation with image_uri instead of model

Hi, thanks for developing this repo. Is it currently possible, or would it be possible to add capability, to connect to an inference endpoint when MultiDataModel is instantiated using the alternative method of providing an image_uri instead of a model? For example:

endpoint_name = "openmmlab-mms-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

multi_model_s3uri = f"s3://{bucket}/openmmlab-mms/"

mme = MultiDataModel(
    name=endpoint_name,
    image_uri=image_uri,
    model_data_prefix=multi_model_s3uri,
    sagemaker_session=session,
    role=role,
)

predictor = Predictor(endpoint_name, sagemaker_session=session)
predictor.serializer = sagemaker.serializers.IdentitySerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

mme.add_model(model_data_source=estimator.model_data, model_data_path='modelA')

Additional context

The reason I am using MME in this way instead of providing an initial model, is because I'm using the MMDetection library's mmdet2torchserve.py tool. This results in my training job producing a model.tar.gz with the following four files:

MAR-INF/MANIFEST.json
mmdet_handler.py
config.py
bbox-mAP_epoch_12.pth

Where mmdet_handler.py is defined here and does not implement SageMaker's standard input_fn, predict_fn and output_fn. Ideally I would like to be able to use this export as-is without needing to download, extract and refactor the handler script in order to provide an entry_point (and inference does indeed appear to work correctly when I do this).

[bug] - `SageMaker_SSH_IDE.ipynb` does not work

Hi Team and @ivan-khvostishkov ,

I followed the instructions of Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode.

How to reproduce

I did the following steps:

  1. Created new SageMaker domain
  2. Updated the execution roles according the Setting up your AWS account with IAM and SSM configuration guide
  3. Launched a Studio personal application
  4. Created a new Space
  5. Launched JupyterLab environment
  6. Downloaded the latest version of this repo, uploaded to the notebook and unzipped
  7. Opened SageMaker_SSH_IDE.ipynb and selected the Python 3 (ipykernel) as kernel
  8. Started the execution of the cells in the notebook
  9. The following cell execution failed
%%sh
sm-ssh-ide configure
-> mkdir: cannot create directory ‘/opt/sagemaker-ssh-helper/’: Permission denied

Steps tried to solve the issue

I tried to following steps to solve the issue:

%%sh
whoami

-> sagemaker-user
%%sh
groups sagemaker-user

-> sagemaker-user : users
%%sh
ls -s /

total 0
lrwxrwxrwx   1 root   root      7 Oct  4 02:08 bin -> usr/bin
drwxr-xr-x   2 root   root      6 Apr 18  2022 boot
drwxr-xr-x   5 root   root    340 Jan 22 07:34 dev
drwxrwxr-x   1 root   root     66 Jan 22 07:34 etc
drwxrwxrwx   1 root   root     28 Nov  9 14:12 home
lrwxrwxrwx   1 root   root      7 Oct  4 02:08 lib -> usr/lib
lrwxrwxrwx   1 root   root      9 Oct  4 02:08 lib32 -> usr/lib32
lrwxrwxrwx   1 root   root      9 Oct  4 02:08 lib64 -> usr/lib64
lrwxrwxrwx   1 root   root     10 Oct  4 02:08 libx32 -> usr/libx32
drwxr-xr-x   2 root   root      6 Oct  4 02:08 media
drwxr-xr-x   2 root   root      6 Oct  4 02:08 mnt
drwxr-xr-x   1 root   root     55 Jan 22 07:34 opt
dr-xr-xr-x 133 nobody nogroup   0 Jan 22 07:34 proc
drwx------   1 root   root     20 Nov  9 14:16 root
drwxr-xr-x   1 root   root     25 Nov  9 14:17 run
lrwxrwxrwx   1 root   root      8 Oct  4 02:08 sbin -> usr/sbin
drwxr-xr-x   2 root   root      6 Oct  4 02:08 srv
dr-xr-xr-x  13 nobody nogroup   0 Jan 22 07:34 sys
drwxrwxrwt   1 root   root     32 Jan 22 07:43 tmp
drwxrwxr-x   1 root   root     19 Nov  9 00:31 usr
drwxr-xr-x   1 root   root     17 Oct  4 02:12 var

It looks like the sagemaker-user does not have permission to write to /opt.

Let's add sagemaker-user to root group and restart the sessions and the instance as well to be sure.

I still get the same issue.

Possible solutions

  1. The sm-ssh-ide scripts utilise an alternative directory instead of /opt
  2. The permission of the sagemaker-user is fixed or ACL is added to the /opt directory.

Thanks,
Andor

Notebook `SageMaker_SSH_Notebook.ipynb` fails due to docker-compose

The notebook SageMaker_SSH_Notebook.ipynb throws an error related to docker compose:


INFO:sagemaker.local.image:docker command: docker-compose -f /tmp/tmpxkrcbq9c/docker-compose.yaml up --build --abort-on-container-exit

time="2023-11-14T20:11:59Z" level=warning msg="a network with name sagemaker-local exists but was not created by compose.\nSet `external: true` to use an existing network"
network sagemaker-local was found but has incorrect label com.docker.compose.network set to ""

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/local/image.py:296, in _SageMakerContainer.train(self, input_data_config, output_data_config, hyperparameters, environment, job_name)
    295 try:
--> 296     _stream_output(process)
    297 except RuntimeError as e:
    298     # _stream_output() doesn't have the command line. We will handle the exception
    299     # which contains the exit code and append the command line to it.

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/local/image.py:984, in _stream_output(process)
    983 if exit_code != 0:
--> 984     raise RuntimeError("Process exited with code: %s" % exit_code)
    986 return exit_code

RuntimeError: Process exited with code: 1

Not sure what is causing the error. I was able to run the same notebook content just three months ago. Any hint or suggestion will be greatly appreciated.

does ssh helper support sagemaker's remote debug's ssm connection?

ssh helper can't get ssm instance id for sagemaker remote debug job

using sagamaker remote debug , we can use ssm client to connect to training job container via :
aws ssm start-session --target sagemaker-training-job:${job_name}_algo-1

but when using ssh helper to create the ssh tunnel, it can't find the ssm instance id:
@6c7e67c16c37 ~ % sm-local-ssh-training connect ${job_name}
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:sagemaker-ssh-helper:Resolving training instance IDs through SSM tags
INFO:sagemaker-ssh-helper:Remote training logs are at https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FTrainingJobs$3FlogStreamNameFilter$3Dsd-finetuning-test-2024-03-15-08-32-30-966$252F
INFO:sagemaker-ssh-helper:Estimator metadata is at https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/sd-finetuning-test-2024-03-15-08-32-30-966
INFO:sagemaker-ssh-helper:SSMManager:Querying SSM instance IDs for training job sd-finetuning-test-2024-03-15-08-32-30-966, expected instance count = 0
INFO:sagemaker-ssh-helper:SSMManager:Using AWS Region: us-west-2
INFO:sagemaker-ssh-helper:SSMManager:No instance IDs found. Retrying. Is SSM Agent running on the remote? Check the remote logs. Seconds left before time out: 300

How to enable cloudwatch logs for SSM

As the SSM tunnel allows for downloading from the sagemaker instances we want to be able to log the activity.

What is required to set up logging on the instances?

Error on `dpkg` when running `sm-local-configure`

I'm following Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode

  1. Copy SageMaker_SSH_IDE.ipynb into SageMaker Studio and run it ✔
  2. On the local machine, install the library: pip install sagemaker-ssh-helper
  3. Make sure that you installed the latest AWS CLI v2 and the AWS Session Manager CLI plugin. Run the following command to perform the installation: sm-local-configure
    Note: I installed AWS CLI v2 using curl and unzip
    Note: I installed AWS Session Manager using curl and dpkg -i

The error:

$ sm-local-configure
Linux DK023900WSL 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 20.04.6 LTS \n \l

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Python 3.8.10
sagemaker-ssh-helper: Installing AWS CLI v2
/usr/local/bin/aws
WARNING: Skipping awscli as it is not installed.
Found same AWS CLI version: /usr/local/aws-cli/v2/2.11.19. Skipping install.
AWS default region -
AWS region -
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************W535 shared-credentials-file
secret_key     ****************WBOW shared-credentials-file
    region                eu-west-1      config-file    ~/.aws/config
dpkg: error: requested operation requires superuser privilege

Local Environment

Running locally on WSL

WSL version: 1.1.3.0
Kernel version: 5.15.90.1
WSLg version: 1.0.49
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.2846

Please advise!

[Feature Request] AWS SageMaker China region ssh helper support

When I use sagemaker-ssh-helper in China region, I get the following error:

ValueError Traceback (most recent call last)
/tmp/ipykernel_9247/960454657.py in <cell line: 1>()
----> 1 ssh_wrapper = SSHModelWrapper.create(ssh_model, connection_wait_time_seconds=0)
2
3 ssh_predictor = ssh_model.deploy(
4 initial_instance_count=1,
5 instance_type='ml.g4dn.xlarge',

~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sagemaker_ssh_helper/wrapper.py in create(cls, model, connection_wait_time_seconds)
194 @classmethod
195 def create(cls, model: sagemaker.model.Model, connection_wait_time_seconds: int = 600):
--> 196 result = SSHModelWrapper(model, connection_wait_time_seconds=connection_wait_time_seconds)
197 result._augment()
198 return result

~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sagemaker_ssh_helper/wrapper.py in __init__(self, model, ssm_iam_role, bootstrap_on_start, connection_wait_time_seconds)
174 bootstrap_on_start, connection_wait_time_seconds, model.sagemaker_session)
175 if self.ssm_iam_role == '':
--> 176 self.ssm_iam_role = SSHEnvironmentWrapper.ssm_role_from_iam_arn(model.role)
177 self.model = model
178

~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sagemaker_ssh_helper/wrapper.py in ssm_role_from_iam_arn(cls, iam_arn)
77 def ssm_role_from_iam_arn(cls, iam_arn: str):
78 if not iam_arn.startswith('arn:aws:iam::'):
---> 79 raise ValueError("iam_arn should start with 'arn:aws:iam::'")
80 role_position = iam_arn.find(":role/")
81 if role_position == -1:

ValueError: iam_arn should start with 'arn:aws:iam::'

Since resource ARNs in China Regions start with "arn:aws-cn:", please adjust the code to support them.
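A partition-aware check along these lines would accept China and GovCloud ARNs as well. This is a sketch of a possible fix, not the library's actual code:

```python
import re

def ssm_role_from_iam_arn(iam_arn: str) -> str:
    """Extract the role name from an IAM role ARN in any AWS partition."""
    match = re.match(r"arn:(aws|aws-cn|aws-us-gov):iam::\d*:role/(.+)", iam_arn)
    if match is None:
        raise ValueError("iam_arn should look like arn:<partition>:iam::<account>:role/<name>")
    return match.group(2)

print(ssm_role_from_iam_arn("arn:aws-cn:iam::555555555555:role/service-role/MyRole"))
# → service-role/MyRole
```

Matching on the partition component instead of the fixed prefix 'arn:aws:iam::' is the key change.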

EC2 instance not needed for SSM setup?

The SSM setup guide (section 2) currently guides users to set up SSM Advanced Tier by first creating a minimal EC2 instance...

But from my tests, I don't think this is mandatory, at least in all cases?

I was able to enable Advanced Tier simply by opening Systems Manager > Fleet Manager in the Console and opening Account Management > Instance Tier settings. The below screenshot was taken after I'd already created an SSH Helper training job, but I could access the same screen beforehand too by clicking the orange "Get started" button if you navigate to Fleet Manager before any instances are set up.

image

I've been able to connect to the training job instance no problem, so pretty sure that at least in some cases users can just skip straight to step 2h? It would be nice to streamline the setup instructions if possible and maybe just give the EC2 option for troubleshooting problems?

The specific account I tested all the way through on is a management account in its AWS Organization (but the org only contains that one account).

Installation with `pip==23.1` gives PyYAML error

The newest version of pip produces unstable results, e.g., sagemaker-ssh-helper may fail to install in some SageMaker Studio kernels with the following error:

ERROR: Cannot uninstall 'PyYAML'

The current workaround is to downgrade pip to the previous version:

pip install pip==23.0.1

How to install VSCode, other apps in WebVNC view?

Hi,

I've been following the instructions for getting WebVNC going and I've been successful in logging into the WebVNC environment. In the README.md file, you show that VSCode and PyCharm are running in this "browser-in-a-browser" environment.

My question is: how do you install these things in the WebVNC environment?

Furthermore, if I have a Dagster webserver serving a Dagit UI running on the KernelGateway app, if I were to configure the port forwarding properly, is it possible to see this UI in the WebVNC environment as well?

Thanks!

sm-local-configure only works with bash like installations - no Powershell/CMD support / Windows support at all

Hi there.
When trying to install your solution, I can't run the configuration via sm-local-configure. As the script is written in bash, PowerShell will always try to open the file in Notepad instead of executing it.

Suggestions:

  • Please provide the script in other shell languages as well
  • Alternatively describe - in depth - what can be done instead of using this command

As of now, Windows users are excluded from using the plugin this way.


Update: Even if I install mingw64 and execute the command, it fails:

  1. python3 isn't found (on Windows installations, the interpreter is just called python)
  2. The execution fails at sudo, which is not a command available in a mingw64 installation

Please update the documentation and highlight that Windows is currently unsupported by the solution.

`sm-local-ssh-ide` stopped working and ssh asks for root password

Due to recent changes in SageMaker Studio related to file permissions, copying SSH keys over SSM into '/root/.ssh/authorized_keys' no longer produces the desired effect. The following message appears when you run sm-local-ssh-ide from the local machine:

Connecting to mi-1234567890abcdef0 as proxy and starting port forwarding with the args: -L localhost:10022:localhost:22 -L localhost:5901:localhost:5901 -L localhost:8889:localhost:8889 -R 127.0.0.1:443:jetbrains-license-server.corp.amazon.com:443
Warning: Permanently added 'mi-1234567890abcdef0' (ED25519) to the list of known hosts.
root@mi-1234567890abcdef0's password: 

SageMaker SSH Helper will change the location of the keys to '/etc/ssh/authorized_keys' and will release this change ASAP in version v1.10.1.

Cannot connect to instance

Hi team,

Followed the steps in the notebook and everything worked well without errors except in the final local command sm-local-ssh-ide <<kernel_gateway_app_name>>. Here I get the following error:

ssh -o User=root -o IdentityFile="${SSH_KEY}" -o IdentitiesOnly=yes \
  -o ProxyCommand="aws ssm start-session --region '${CURRENT_REGION}' --target '${INSTANCE_ID}' --document-name AWS-StartSSHSession --parameters portNumber=%p" \
  -o ServerAliveInterval=15 -o ServerAliveCountMax=3 \
  -o StrictHostKeyChecking=no -N $PORT_FWD_ARGS "$INSTANCE_ID"

An error occurred (TargetNotConnected) when calling the StartSession operation: mi-01064afae0734b12b is not connected.
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

Do you know what could be going wrong?

Thanks,
João Pereira

JupyterServer URL suffix when tunnelling into KernelGateway app

Hi, I am using sagemaker-ssh-helper to create an SSH connection between the JupyterServer app and the KernelGateway app.

In my example, I am running code-server (https://coder.com/) in the KernelGateway environment as our data scientists want to control the instance size and type of machine it runs on.

Running the code-server command in the KernelGateway env starts an HTTP listener at 127.0.0.1:3000. I then use the sagemaker-ssh-helper command sm-local-ssh-ide connect with an additional argument -L localhost:3000:localhost:3000 in order to also forward the connection on port 3000 from the KernelGateway env.

When I navigate to https://domain.studio.eu-west-2.sagemaker.aws/jupyter/default/proxy/3000, it initially loads the page (I see "Coder" in the browser tab) but then fails to load any UI elements. Looking at the developer console, this is happening because the UI resources are being fetched from https://domain.studio.eu-west-2.sagemaker.aws/some_UI_element.js instead of https://domain.studio.eu-west-2.sagemaker.aws/jupyter/default/proxy/3000/some_UI_element.js

If I manually navigate to https://domain.studio.eu-west-2.sagemaker.aws/jupyter/default/proxy/3000/some_UI_element.js, I can confirm that I am able to access the Javascript code.

Is there any configuration that I'm missing when running sm-local-ssh-ide connect to pass the entire URL to the downstream server, including the URL suffix that is added?

[Issue] `sm-local-configure` breaks on MacOS

I have tried following the directions to set up my Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode and have run into an issue on my MacOS device.

When I run sm-local-configure I get the following output message:

> sm-local-configure

Darwin <MY COMPUTER> 22.5.0 Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:19 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T6020 arm64
cat: /etc/issue: No such file or directory
cat: /etc/os-release: No such file or directory
Python 3.10.12
Password:
Sorry, try again.
Password:
sudo: apt-get: command not found

I believe the issue is this _install_unzip() function:

function _install_unzip() {
  if _is_centos; then
    sudo yum install -y unzip
  else
    sudo apt-get install -y --no-install-recommends unzip
  fi
}

which is called by sm-local-configure

I think this has partially been handled in the _install_aws_cli function with a separate branch for MacOS. If it would be helpful, I can submit a PR adding a check to these functions to see whether unzip and curl are already installed (which on MacOS I believe they are by default).

The function works as expected (I think) when I comment out those two lines and install the package locally.

System:

  • Operating System: MacOS Ventura 13.4
  • Processor: Apple M2 Pro

[Issue] STS client is not using regional endpoints

Our use case is to start a SageMaker training job from a SageMaker Studio notebook, where Studio is attached to a private VPC.
When I try to create an SSHEstimatorWrapper using the line below from my Studio notebook:
ssh_wrapper = SSHEstimatorWrapper.create(pytorch_estimator, connection_wait_time_seconds=0)
I'm getting ConnectTimeoutError: Connect timeout on endpoint URL: "https://sts.amazonaws.com/" exception.
This is because we have a regional VPC endpoint and can access only regional endpoints like https://sts.us-east-1.amazonaws.com/. From here I see that the code calls only the global endpoint.

We would need to pass the region parameter during initialization of the STS boto3 client so that it uses the regional STS endpoint for that region.
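As a hedged workaround sketch (untested here), the AWS SDK also honors the AWS_STS_REGIONAL_ENDPOINTS setting, so forcing regional endpoints before any boto3 client is created may avoid the global https://sts.amazonaws.com/ call; the same can be configured in ~/.aws/config via `sts_regional_endpoints = regional`:

```python
import os

# Hedged workaround sketch: botocore reads this environment variable and
# resolves STS calls to the regional endpoint (e.g. sts.us-east-1.amazonaws.com)
# instead of the global sts.amazonaws.com. It must be set before any boto3
# client is created, e.g. at the top of the Studio notebook.
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"
# A region must be resolvable for the regional endpoint to be constructed;
# "us-east-1" here is just an example value.
os.environ.setdefault("AWS_DEFAULT_REGION", "us-east-1")
```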

Enable advanced-instances tier to use Session Manager with your on-premises instances

Hi when I configure my local connection to Sagemaker Studio with:

sm-local-ssh-ide connect <my kernel gateway>

The process fails in the last step with the following error:

An error occurred (BadRequest) when calling the StartSession operation: Enable advanced-instances tier to use Session Manager with your on-premises instances
Connection closed by UNKNOWN port 65535

What is happening here?

vscode connect fails

Trying to connect using VSCode has been failing since about a week ago.
I can connect over plain SSH fine, but VSCode itself fails.

Looks like a permissions error?

Getting a tar error which not sure how to fix:

[11:38:35.267] stderr> tar: code: Cannot change ownership to uid 1000, gid 1000: Operation not permitted
[11:38:35.267] stderr> tar: Exiting with failure status due to previous errors
[11:38:35.268] > ERROR: tar exited with non-0 exit code: 0

[Question] Shell environment different from web terminal

Hi,

I manage to connect to my notebook instance with sm-local-ssh-notebook connect <my_notebook>, as described in the instructions. I log in as root, inside what seems to be a docker container (checked with this command), with no users under /home. In contrast, when launching a web-based terminal from the notebook, the environment is very different: the default user is ec2-user, user files are under /home/ec2-user, and it seems I am not in a docker container.

What extra steps are needed to get the same type of environment as in the web terminal, but using sagemaker-ssh-helper? Basically, all I want is to be able to do exactly the same things I can do in the web-based terminal from my sagemaker-ssh-helper session.

Maybe this has been addressed somewhere, but I didn't find any reference.

Thanks for your help.

[Feature] Support HF accelerate and DeepSpeed for inference

[Issue]failed to find agent identity

Trying to set this up to connect our local VSCode instances to our sagemaker studio instances for better developer experience.

When running:
%%sh
sm-ssh-ide ssm-agent

We receive the following error:

Error occurred fetching the seelog config file path:  open /etc/amazon/ssm/seelog.xml: no such file or directory
Initializing new seelog logger
New Seelog Logger Creation Complete
2023-06-16 10:55:53 INFO Proxy environment variables:
2023-06-16 10:55:53 INFO https_proxy: 
2023-06-16 10:55:53 INFO http_proxy: 
2023-06-16 10:55:53 INFO no_proxy: 
2023-06-16 10:55:53 INFO Checking if agent identity type OnPrem can be assumed
2023-06-16 10:55:53 INFO Checking if agent identity type EC2 can be assumed
2023-06-16 10:55:53 ERROR [EC2Identity] failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: : 
	status code: 0, request id: 
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-06-16 10:55:53 INFO Checking if agent identity type CustomIdentity can be assumed
2023-06-16 10:55:53 ERROR Agent failed to assume any identity
2023-06-16 10:55:53 ERROR failed to find identity, retrying: failed to find agent identity
2023-06-16 10:55:53 INFO Checking if agent identity type OnPrem can be assumed
2023-06-16 10:55:53 INFO Checking if agent identity type EC2 can be assumed
2023-06-16 10:55:54 ERROR [EC2Identity] failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: : 
	status code: 0, request id: 
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-06-16 10:55:54 INFO Checking if agent identity type CustomIdentity can be assumed
2023-06-16 10:55:54 ERROR Agent failed to assume any identity
2023-06-16 10:55:54 ERROR failed to find identity, retrying: failed to find agent identity
2023-06-16 10:55:54 INFO Checking if agent identity type OnPrem can be assumed
2023-06-16 10:55:54 INFO Checking if agent identity type EC2 can be assumed
2023-06-16 10:55:54 ERROR [EC2Identity] failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: : 
	status code: 0, request id: 
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-06-16 10:55:54 INFO Checking if agent identity type CustomIdentity can be assumed
2023-06-16 10:55:54 ERROR Agent failed to assume any identity
2023-06-16 10:55:54 ERROR failed to get identity: failed to find agent identity
2023-06-16 10:55:54 ERROR Error occurred when starting amazon-ssm-agent: failed to get identity: failed to find agent identity

Thoughts on using a configuration management framework?

It's pretty hard to get this up and running in an account that has restricted internet access.

I had to fork and refactor almost all of the bash scripts. This was quite a challenge as they are a little unwieldy (I mean, it is bash after all). So I had a thought based on how I handle setting up dev environments on Linux boxes.

Moving the install/run functionality to a declarative configuration management system would make maintaining, extending and using the project easier.

What would your thoughts be on managing the installs and configurations via something like Ansible? I recommend Ansible since it's lightweight and easy to work with: it's a Python package, so it only needs Python, which we already have. But it could be any config system.

The user experience could remain the same, with the bash scripts acting as shims around the config manager. It could likely even be simplified: fewer steps to get up and running, where you just run one command that brings the system into the desired state, instead of having to nohup a bunch of bash scripts.

It'd be easier to:

  • Allow options like install urls for the dependencies
  • Not rely on the working directory to source bash files
  • Avoid multiple re-installs to make it easier to run in a lifecycle config
  • Extend it by modifying or including additional config

I'd be willing to contribute work towards this, since maintaining a copy of the bash scripts is quite painful. I'm already exploring a playbook for starting the SSH helper.

[Feature] SageMaker job as Studio kernel

Lately I work mainly in SageMaker Studio, and I'd really like to be able to debug / interact with a running job using the same UI.

Solution idea

Create a custom Studio kernel image using an IPython extension and/or custom magic through which users can connect to a running SSH Helper job and run notebook cells on that instead of the Studio app.

The user experience would be something like using EMR clusters in Studio:

  • One-time up-front job to build/register the custom "SageMakerSSH" image (maybe?)
  • User launches their SSH-helper-enabled job from "normal" notebook A and fetches the managed instance ID mi-1234567890abcdef0
  • User opens / switches to a notebook with SageMakerSSH kernel and runs something like
    • %load_ext sagemaker_ssh_helper.notebook to initialize the IPython extension
    • %sagemaker_ssh connect mi-1234567890abcdef0 to connect to the instance
    • From here on out, cells should run on the connected instance rather than the local Studio app unless a %%local cell magic is used: Same as how SageMaker Studio SparkMagic kernel works
    • Probably some kind of %sagemaker_ssh disconnect command would also be useful

Since the sagemaker_ssh_helper library is pip-installable, it might even be possible to get this working with default (e.g. Data Science 3.0) kernels? I'm not sure; I assume it depends on how much hacking is possible during IPython extension load versus what needs setting up in advance.
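To make the idea concrete, here is a minimal, hypothetical sketch of the state handling behind such a %sagemaker_ssh magic (the class name and messages are made up for illustration; a real IPython extension would register this in a Magics subclass via load_ipython_extension and actually open/close the SSM-based SSH tunnel, rather than just tracking the instance ID):

```python
# Hypothetical sketch: command handling behind a %sagemaker_ssh line magic.
# Only tracks which managed instance cells should be routed to; the real
# extension would also establish the SSM/SSH connection and route execution.
class SageMakerSSHSession:
    def __init__(self):
        self.instance_id = None  # managed instance ID, e.g. 'mi-1234567890abcdef0'

    def handle(self, line: str) -> str:
        command, *args = line.split()
        if command == "connect":
            self.instance_id = args[0]
            return f"cells will now run on {self.instance_id}"
        if command == "disconnect":
            self.instance_id = None
            return "cells will now run on the local Studio app"
        raise ValueError(f"unknown command: {command!r}")
```

Cell routing itself would presumably mirror how the SparkMagic kernel handles %%local, falling back to the Studio app only when that cell magic is present.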

Why this route

To my knowledge, JupyterLab is a bit more fragmented in support for remote kernels than IDEs like VSCode/PyCharm/etc. It seems like there are ways to set up SSH kernels, but it's also a tricky topic to navigate because so many pages online are talking about "accessing your remotely-running Jupyter server" instead. Investigating the Jupyter standard kernel spec paths, I see /opt/conda/envs/studio/share/jupyter/kernels exists but contains only a single python3 kernel which doesn't appear in Studio UI. It looks like there's a custom sagemaker_nb2kg Python library that manages kernels, but no obvious integration points there for alternative kernel sources besides the studio "Apps" system - and sufficiently internal/complex that patching it seems like a bad idea.

...So it looks like directly registering the remote instance as a kernel in JupyterLab would be a non-starter.

If the magic-based approach works, it might also be possible to use with other existing kernel images (as mentioned above) and even inline in the same notebook after a training job is kicked off. Hopefully it would also enable toggling over to a new job/instance without having to run CLI commands to change the installed Jupyter kernels.

Issue trying with a SageMaker Notebook: "sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying."

Hello, this looks like a great project but I have been struggling for a day and half to get it working.
I am trying to do the Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode but with SageMaker Notebook instead of Studio, and from a Windows environment.

First I have to say that the fact that the instructions are spread out across many places (the main README, the account setup page, the FAQ, etc.) does not help with clarity. Separate step-by-step, end-to-end instructions for each use case would help.

I first tried to do this in a corporate environment, where SageMaker notebooks are in a private VPC with no direct internet breakout in the account, traffic goes through a corporate firewall and proxy, and there are no IAM users, only temporary credentials. As I failed miserably, I fell back to first testing it in my personal environment with the SageMaker notebook deployed in the default AWS-managed environment, but I am still failing.

In my local environment, I think I got all the requirements: installing sagemaker-ssh-helper, running the ./sm-local-install-force script from an admin bash, configuring the AWS environment (SSM and IAM policies as mentioned here), and copying and running the SageMaker_SSH_Notebook.ipynb notebook.
I also looked at the video for the SageMaker Studio case, which is quite different, but according to the instructions the only difference is the notebook we are executing. So I think I followed the instructions correctly, but as it is failing I guess I am missing something...

I can see in SSM Fleet Manager the mi-****** node ID of the SageMaker notebook instance, but the notebook script does not display it (not sure whether that is normal or not). The last logs I have in the Jupyter notebook output are:

j7h3m4c9pf-algo-1-jq0oi | # Running forever as daemon
j7h3m4c9pf-algo-1-jq0oi | amazon-ssm-agent
j7h3m4c9pf-algo-1-jq0oi | Initializing new seelog logger
j7h3m4c9pf-algo-1-jq0oi | New Seelog Logger Creation Complete
j7h3m4c9pf-algo-1-jq0oi | Applying config override from /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 processing appconfig overrides
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 processing appconfig overrides
j7h3m4c9pf-algo-1-jq0oi | Applying config override from /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO Proxy environment variables:
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO http_proxy: 
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO no_proxy: 
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO https_proxy: 
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO Checking if agent identity type OnPrem can be assumed
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO Agent will take identity from OnPrem
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] using named pipe channel for IPC
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] using named pipe channel for IPC
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] using named pipe channel for IPC
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] amazon-ssm-agent - v3.2.815.0
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] OS: linux, Arch: amd64
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] Starting Core Agent
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [CredentialRefresher] credentialRefresher has started
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [CredentialRefresher] Starting credentials refresher loop
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [CredentialRefresher] Next credential rotation will be in 29.997203369883334 minutes
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:32 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:32 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:611) started
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:32 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:50:31 INFO [CredentialRefresher] Next credential rotation will be in 29.997979336833332 minutes

Then if I run AWS_PROFILE=default sm-local-ssh-notebook connect <<notebook-instance-name>>, I get the following:

INFO:sagemaker-ssh-helper:SSMManager:Querying SSM instance IDs for SageMaker notebook instance test
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 300
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 290
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 280
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 270

But if I am understanding the instructions correctly this is not the SSH Helper on the local machine, correct? This is the SSH Helper on the SageMaker notebook, right?
Did I miss something obvious?
