aws-samples / sagemaker-ssh-helper
A helper library to connect into Amazon SageMaker with AWS Systems Manager and SSH (Secure Shell)
License: MIT No Attribution
Hey there, thanks a lot for creating these examples. I've noticed that some scripts have parts that mention sagemaker notebook instances, but they fail both on server and client side because they expect DOMAIN_ID, USER_ID etc.
Are these supposed to work on SageMaker notebook instances?
Hi there.
When trying to install your solution, I can't execute the configuration via sm-local-configure. As the script is written in Bash, PowerShell always tries to open the file in Notepad instead of executing it.
Suggestions:
As of now, Windows users are excluded from using the plugin this way.
Update: Even if I install mingw64 and execute the command, it fails:
python
sudo, as this is not a command in a mingw64 installation
Please update the documentation and highlight that Windows is currently unsupported by the solution.
This error message appears when a user executes 'sm-local' scripts and uses SageMaker defaults to set values from a config file.
The cause is a recent backward-incompatible change related to SageMaker defaults:
https://github.com/aws/sagemaker-python-sdk/pull/3872/files
Hello, this looks like a great project but I have been struggling for a day and half to get it working.
I am trying to do the Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode but with SageMaker Notebook instead of Studio, and from a Windows environment.
First I have to say that having the instructions spread out over many places (the main README, the account setup page, the FAQ, etc.) does not help with clarity. Separate step-by-step, end-to-end instructions for each use case might help.
I first tried to do this in a corporate environment, where SageMaker notebook are in a private VPC with no direct internet breakout in the account but going through a corporate firewall and proxy and with no IAM user but temporary credentials. As I miserably failed, I reverted to first test it in my personal environment with the SageMaker notebook deployed in the default AWS managed environment but I am still failing.
In my local environment, I think I met all the requirements: installing sagemaker-ssh-helper, running the ./sm-local-install-force script from an admin bash, configuring the AWS environment (SSM and IAM policies as mentioned here), and copying and running the SageMaker_SSH_Notebook.ipynb notebook.
I also looked at the video for the SageMaker Studio case, which is quite different, but according to the instructions the only difference is the notebook we are executing. So I think I followed the instructions correctly, but as it is failing I guess I am missing something...
I can see the mi-****** node ID of the SageMaker notebook instance in SSM Fleet Manager, but the notebook script does not display it (not sure if that is normal or not). The last logs I have in the Jupyter notebook output are:
j7h3m4c9pf-algo-1-jq0oi | # Running forever as daemon
j7h3m4c9pf-algo-1-jq0oi | amazon-ssm-agent
j7h3m4c9pf-algo-1-jq0oi | Initializing new seelog logger
j7h3m4c9pf-algo-1-jq0oi | New Seelog Logger Creation Complete
j7h3m4c9pf-algo-1-jq0oi | Applying config override from /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 processing appconfig overrides
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023/04/21 12:20:31 processing appconfig overrides
j7h3m4c9pf-algo-1-jq0oi | Applying config override from /etc/amazon/ssm/amazon-ssm-agent.json.
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO Proxy environment variables:
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO http_proxy:
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO no_proxy:
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO https_proxy:
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO Checking if agent identity type OnPrem can be assumed
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO Agent will take identity from OnPrem
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] using named pipe channel for IPC
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] using named pipe channel for IPC
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] using named pipe channel for IPC
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] amazon-ssm-agent - v3.2.815.0
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] OS: linux, Arch: amd64
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [amazon-ssm-agent] Starting Core Agent
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [CredentialRefresher] credentialRefresher has started
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [CredentialRefresher] Starting credentials refresher loop
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:31 INFO [CredentialRefresher] Next credential rotation will be in 29.997203369883334 minutes
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:32 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:32 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:611) started
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:20:32 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
j7h3m4c9pf-algo-1-jq0oi | 2023-04-21 12:50:31 INFO [CredentialRefresher] Next credential rotation will be in 29.997979336833332 minutes
Then if I try to run AWS_PROFILE=default sm-local-ssh-notebook connect <<notebook-instance-name>>, I get the following:
INFO:sagemaker-ssh-helper:SSMManager:Querying SSM instance IDs for SageMaker notebook instance test
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 300
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 290
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 280
INFO:sagemaker-ssh-helper:SSMManager:SSH Helper not yet started? Retrying. Seconds left: 270
But if I understand the instructions correctly, this is not the SSH Helper on the local machine, correct? This is the SSH Helper on the SageMaker notebook, right?
Did I miss something obvious?
It's pretty hard to get this up and running in an account that has restricted internet access.
I had to fork and refactor almost all of the bash scripts. This was quite a challenge as they are a little unwieldy (I mean, it is bash after all). So, I had a thought based on how I handle setting up dev environments on Linux boxes.
Moving the install/run functionality to a declarative configuration management system would make maintaining, extending and using the project easier.
What would your thoughts be on managing the installs and configurations via something like Ansible? I recommend Ansible since it's lightweight and easy to work with. It's a Python package, so it only needs Python, which we already have. But it could be any config system.
The user experience could remain the same: the bash scripts would be shims around the config manager. Likely, it could even be simplified, with fewer steps to get up and running: you just run a command and it gets the system into the desired state, instead of having to nohup a bunch of bash scripts.
It'd be easier to:
I'd be willing to contribute work towards this since maintaining a copy of the bash scripts is quite painful. Already in the process of exploring a playbook for starting the ssh helper.
Currently shared spaces are not supported:
https://docs.aws.amazon.com/sagemaker/latest/dg/domain-space.html
Hi when I configure my local connection to Sagemaker Studio with:
sm-local-ssh-ide connect <my kernel gateway>
The process fails in the last step with the following error:
An error occurred (BadRequest) when calling the StartSession operation: Enable advanced-instances tier to use Session Manager with your on-premises instances
Connection closed by UNKNOWN port 65535
What is happening here?
The notebook SageMaker_SSH_Notebook.ipynb throws an error related to docker compose:
INFO:sagemaker.local.image:docker command: docker-compose -f /tmp/tmpxkrcbq9c/docker-compose.yaml up --build --abort-on-container-exit
time="2023-11-14T20:11:59Z" level=warning msg="a network with name sagemaker-local exists but was not created by compose.\nSet `external: true` to use an existing network"
network sagemaker-local was found but has incorrect label com.docker.compose.network set to ""
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/local/image.py:296, in _SageMakerContainer.train(self, input_data_config, output_data_config, hyperparameters, environment, job_name)
295 try:
--> 296 _stream_output(process)
297 except RuntimeError as e:
298 # _stream_output() doesn't have the command line. We will handle the exception
299 # which contains the exit code and append the command line to it.
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/local/image.py:984, in _stream_output(process)
983 if exit_code != 0:
--> 984 raise RuntimeError("Process exited with code: %s" % exit_code)
986 return exit_code
RuntimeError: Process exited with code: 1
Not sure what is causing the error. I was able to run the same notebook content just three months ago. Any hint or suggestion will be greatly appreciated.
Hi, I am using customized SageMaker IAM role and bucket, would it be possible to relax both of these values?
For example:
python3 -m sagemaker_ssh_helper.deregister_old_instances_from_ssm --iam-role "ml-experiment-*.*"
model_data = 's3://kraft-source-bucket/huggingface_model/model.tar.gz'
from sagemaker.huggingface import HuggingFaceModel
from sagemaker_ssh_helper.wrapper import SSHModelWrapper
import sagemaker
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    dependencies=[SSHModelWrapper.dependency_dir()],
    model_data=model_data,
    role=role
)
ssh_wrapper = SSHModelWrapper.create(huggingface_model, connection_wait_time_seconds=0)
huggingface_model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge", wait=False)
model.tar.gz
cat inference.py
import argparse
import io
import json
import logging
import os
import sys
import subprocess
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
from PIL import Image
from torchvision import datasets, transforms
from torchvision.transforms import ToTensor
from model import Net
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.info(os.system("nvidia-smi"))
# Make the bundled "lib" directory importable before loading the SSH helper
sys.path.append(os.path.join(os.path.dirname(__file__), "lib"))
import sagemaker_ssh_helper
sagemaker_ssh_helper.setup_and_start_ssh()

def model_fn(model_dir):
    print(model_dir)
    logger.info(model_dir)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.DataParallel(Net())
    with open(os.path.join(model_dir, "model.pth"), "rb") as f:
        model.load_state_dict(torch.load(f))
    return model.to(device)

def load_from_bytearray(request_body):
    image_as_bytes = io.BytesIO(request_body)
    image = Image.open(image_as_bytes)
    image_tensor = ToTensor()(image).unsqueeze(0)
    return image_tensor

def input_fn(request_body, request_content_type):
    # if set content_type as 'image/jpg' or 'application/x-npy',
    # the input is also a python bytearray
    if request_content_type == "application/x-image":
        image_tensor = load_from_bytearray(request_body)
    else:
        print("not support this type yet")
        raise ValueError("not support this type yet")
    return image_tensor

def predict_fn(input_object, model):
    output = model.forward(input_object)
    pred = output.max(1, keepdim=True)[1]
    return {"predictions": pred.item()}

def output_fn(predictions, response_content_type):
    return json.dumps(predictions)
I run this code:
instance_ids = ssh_wrapper.get_instance_ids() # <--NEW--
print(f'To connect over SSM run: aws ssm start-session --target {instance_ids[0]}')
There is no output, and the CloudWatch logs contain no information related to sagemaker-ssh-helper.
Hi team,
Followed the steps in the notebook and everything worked well without errors except in the final local command sm-local-ssh-ide <<kernel_gateway_app_name>>. Here I get the following error:
ssh -o User=root -o IdentityFile="${SSH_KEY}" -o IdentitiesOnly=yes \
-o ProxyCommand="aws ssm start-session --region '${CURRENT_REGION}' --target '${INSTANCE_ID}' --document-name AWS-StartSSHSession --parameters portNumber=%p" \
-o ServerAliveInterval=15 -o ServerAliveCountMax=3 \
-o StrictHostKeyChecking=no -N $PORT_FWD_ARGS "$INSTANCE_ID"
An error occurred (TargetNotConnected) when calling the StartSession operation: mi-01064afae0734b12b is not connected.
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Do you know what could be going wrong?
Thanks,
João Pereira
Due to recent changes in SageMaker Studio related to file permissions, copying SSH keys over SSM into '/root/.ssh/authorized_keys' is not producing a desired effect. The following message appears when you run sm-local-ssh-ide from the local machine:
Connecting to mi-1234567890abcdef0 as proxy and starting port forwarding with the args: -L localhost:10022:localhost:22 -L localhost:5901:localhost:5901 -L localhost:8889:localhost:8889 -R 127.0.0.1:443:jetbrains-license-server.corp.amazon.com:443
Warning: Permanently added 'mi-1234567890abcdef0' (ED25519) to the list of known hosts.
root@mi-1234567890abcdef0's password:
SageMaker SSH Helper will change the location of keys to '/etc/ssh/authorized_keys' and will release this change ASAP in the version v1.10.1.
Trying to set this up to connect our local VSCode instances to our sagemaker studio instances for better developer experience.
When running:
%%sh
sm-ssh-ide ssm-agent
We receive the following error:
Error occurred fetching the seelog config file path: open /etc/amazon/ssm/seelog.xml: no such file or directory
Initializing new seelog logger
New Seelog Logger Creation Complete
2023-06-16 10:55:53 INFO Proxy environment variables:
2023-06-16 10:55:53 INFO https_proxy:
2023-06-16 10:55:53 INFO http_proxy:
2023-06-16 10:55:53 INFO no_proxy:
2023-06-16 10:55:53 INFO Checking if agent identity type OnPrem can be assumed
2023-06-16 10:55:53 INFO Checking if agent identity type EC2 can be assumed
2023-06-16 10:55:53 ERROR [EC2Identity] failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: :
status code: 0, request id:
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-06-16 10:55:53 INFO Checking if agent identity type CustomIdentity can be assumed
2023-06-16 10:55:53 ERROR Agent failed to assume any identity
2023-06-16 10:55:53 ERROR failed to find identity, retrying: failed to find agent identity
2023-06-16 10:55:53 INFO Checking if agent identity type OnPrem can be assumed
2023-06-16 10:55:53 INFO Checking if agent identity type EC2 can be assumed
2023-06-16 10:55:54 ERROR [EC2Identity] failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: :
status code: 0, request id:
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-06-16 10:55:54 INFO Checking if agent identity type CustomIdentity can be assumed
2023-06-16 10:55:54 ERROR Agent failed to assume any identity
2023-06-16 10:55:54 ERROR failed to find identity, retrying: failed to find agent identity
2023-06-16 10:55:54 INFO Checking if agent identity type OnPrem can be assumed
2023-06-16 10:55:54 INFO Checking if agent identity type EC2 can be assumed
2023-06-16 10:55:54 ERROR [EC2Identity] failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: :
status code: 0, request id:
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-06-16 10:55:54 INFO Checking if agent identity type CustomIdentity can be assumed
2023-06-16 10:55:54 ERROR Agent failed to assume any identity
2023-06-16 10:55:54 ERROR failed to get identity: failed to find agent identity
2023-06-16 10:55:54 ERROR Error occurred when starting amazon-ssm-agent: failed to get identity: failed to find agent identity
Sometimes I'd like to copy a file from my local host to a running SageMaker instance after the instance is already started. Is it possible to do this with SSM, or through some other solution?
Currently, I use an S3 bucket as a proxy (local host -> s3 bucket -> SageMaker instance). This works, but requires a few steps. I would prefer to directly transfer the file.
SageMaker SSH Helper works only in SageMaker Studio Classic:
As of November 30, 2023, the previous Amazon SageMaker Studio experience is now named Amazon SageMaker Studio Classic.
It needs testing and development for the new SageMaker Studio experience
UPD: Preliminary idea is to use custom images based on SageMaker distribution.
Hi Team and @ivan-khvostishkov ,
I followed the instructions of Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode.
I did the following steps:
Studio
personal application
Python 3 (ipykernel) as kernel
%%sh
sm-ssh-ide configure
-> mkdir: cannot create directory ‘/opt/sagemaker-ssh-helper/’: Permission denied
I tried the following steps to solve the issue:
%%sh
whoami
-> sagemaker-user
%%sh
groups sagemaker-user
-> sagemaker-user : users
%%sh
ls -s /
total 0
lrwxrwxrwx 1 root root 7 Oct 4 02:08 bin -> usr/bin
drwxr-xr-x 2 root root 6 Apr 18 2022 boot
drwxr-xr-x 5 root root 340 Jan 22 07:34 dev
drwxrwxr-x 1 root root 66 Jan 22 07:34 etc
drwxrwxrwx 1 root root 28 Nov 9 14:12 home
lrwxrwxrwx 1 root root 7 Oct 4 02:08 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Oct 4 02:08 lib32 -> usr/lib32
lrwxrwxrwx 1 root root 9 Oct 4 02:08 lib64 -> usr/lib64
lrwxrwxrwx 1 root root 10 Oct 4 02:08 libx32 -> usr/libx32
drwxr-xr-x 2 root root 6 Oct 4 02:08 media
drwxr-xr-x 2 root root 6 Oct 4 02:08 mnt
drwxr-xr-x 1 root root 55 Jan 22 07:34 opt
dr-xr-xr-x 133 nobody nogroup 0 Jan 22 07:34 proc
drwx------ 1 root root 20 Nov 9 14:16 root
drwxr-xr-x 1 root root 25 Nov 9 14:17 run
lrwxrwxrwx 1 root root 8 Oct 4 02:08 sbin -> usr/sbin
drwxr-xr-x 2 root root 6 Oct 4 02:08 srv
dr-xr-xr-x 13 nobody nogroup 0 Jan 22 07:34 sys
drwxrwxrwt 1 root root 32 Jan 22 07:43 tmp
drwxrwxr-x 1 root root 19 Nov 9 00:31 usr
drwxr-xr-x 1 root root 17 Oct 4 02:12 var
It looks like the sagemaker-user does not have permission to write to /opt.
Let's add sagemaker-user to the root group and restart the sessions and the instance as well to be sure.
I still get the same issue.
sm-ssh-ide scripts utilises alternative directory and not /opt
sagemaker-user is fixed or ACL is added to the /opt directory.
Thanks,
Andor
I'm following Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode
pip install sagemaker-ssh-helper ✔
sm-local-configure ❌
curl and unzip
curl and dpkg -i
The error:
$ sm-local-configure
Linux DK023900WSL 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 20.04.6 LTS \n \l
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Python 3.8.10
sagemaker-ssh-helper: Installing AWS CLI v2
/usr/local/bin/aws
WARNING: Skipping awscli as it is not installed.
Found same AWS CLI version: /usr/local/aws-cli/v2/2.11.19. Skipping install.
AWS default region -
AWS region -
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************W535 shared-credentials-file
secret_key ****************WBOW shared-credentials-file
region eu-west-1 config-file ~/.aws/config
dpkg: error: requested operation requires superuser privilege
Running locally on WSL
WSL version: 1.1.3.0
Kernel version: 5.15.90.1
WSLg version: 1.0.49
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.2846
Please advise!
Is there any plan to support it in BJS/ZHY?
Hi, I am using sagemaker-ssh-helper to create an SSH connection in between the JupyterServer app and the KernelGateway app.
In my example, I am running code-server (https://coder.com/) in the KernelGateway environment as our data scientists want to control the instance size and type of machine it runs on.
Running the coder server command in the KernelGateway env starts an HTTP listener at 127.0.0.1:3000. I then use the sagemaker-ssh-helper command sm-local-ssh-ide connect with an additional argument -L localhost:3000:localhost:3000 in order to also forward the connection on port 3000 from the KernelGateway env.
When I navigate to https://domain.studio.eu-west-2.sagemaker.aws/jupyter/default/proxy/3000, it initially loads the page (I see "Coder" in the browser tab) but then fails to load any UI elements. Looking at the developer console, this is happening because the UI resources are being fetched from https://domain.studio.eu-west-2.sagemaker.aws/some_UI_element.js instead of https://domain.studio.eu-west-2.sagemaker.aws/jupyter/default/proxy/3000/some_UI_element.js
If I manually navigate to https://domain.studio.eu-west-2.sagemaker.aws/jupyter/default/proxy/3000/some_UI_element.js, I can confirm that I am able to access the JavaScript code.
Is there any configuration that I'm missing when running sm-local-ssh-ide connect to pass the entire URL to the downstream server, including the URL suffix that is added?
You've mentioned on reddit it's possible to connect to notebook instance in sagemaker with the library. (https://www.reddit.com/r/aws/comments/gibbtg/guide_sagemaker_ssh_to_notebook_instances/)
I couldn't see how to do it in the readme; I've tried to just call sagemaker_ssh_helper.setup_and_start_ssh() in the script, but it asks for one of the wrappers and I don't see one that would match.
WARNING: SageMaker SSH Helper is not correctly initialized. Did you forget to call wrapper.create() _before_ fit() / run() / transform() / deploy()?
Could you please point the direction on how to approach this?
hello team
We have a BYOC SageMaker endpoint that we want to debug, but I found that SSH Helper only describes the setup for BYOS-mode SageMaker endpoints:
model = estimator.create_model(
    entry_point='inference_ssh.py',
    source_dir='source_dir/inference/',
    dependencies=[SSHModelWrapper.dependency_dir()]  # <--NEW
    # (alternatively, add sagemaker_ssh_helper into requirements.txt
    # inside source dir) --
)
Is there any way we can debug a BYOC SageMaker endpoint? Do you have any guide for that?
Thanks
SSH Helper can't get the SSM instance ID for a SageMaker remote debug job
Using SageMaker remote debugging, we can use the SSM client to connect to the training job container via:
aws ssm start-session --target sagemaker-training-job:${job_name}_algo-1
But when using SSH Helper to set up the SSH tunnel, it can't find the SSM instance ID:
@6c7e67c16c37 ~ % sm-local-ssh-training connect ${job_name}
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
INFO:sagemaker-ssh-helper:Resolving training instance IDs through SSM tags
INFO:sagemaker-ssh-helper:Remote training logs are at https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FTrainingJobs$3FlogStreamNameFilter$3Dsd-finetuning-test-2024-03-15-08-32-30-966$252F
INFO:sagemaker-ssh-helper:Estimator metadata is at https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs/sd-finetuning-test-2024-03-15-08-32-30-966
INFO:sagemaker-ssh-helper:SSMManager:Querying SSM instance IDs for training job sd-finetuning-test-2024-03-15-08-32-30-966, expected instance count = 0
INFO:sagemaker-ssh-helper:SSMManager:Using AWS Region: us-west-2
INFO:sagemaker-ssh-helper:SSMManager:No instance IDs found. Retrying. Is SSM Agent running on the remote? Check the remote logs. Seconds left before time out: 300
ValueError Traceback (most recent call last)
/tmp/ipykernel_9247/960454657.py in <cell line: 1>()
----> 1 ssh_wrapper = SSHModelWrapper.create(ssh_model, connection_wait_time_seconds=0)
2
3 ssh_predictor = ssh_model.deploy(
4 initial_instance_count=1,
5 instance_type='ml.g4dn.xlarge',
~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sagemaker_ssh_helper/wrapper.py in create(cls, model, connection_wait_time_seconds)
194 @classmethod
195 def create(cls, model: sagemaker.model.Model, connection_wait_time_seconds: int = 600):
--> 196 result = SSHModelWrapper(model, connection_wait_time_seconds=connection_wait_time_seconds)
197 result._augment()
198 return result
~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sagemaker_ssh_helper/wrapper.py in __init__(self, model, ssm_iam_role, bootstrap_on_start, connection_wait_time_seconds)
174 bootstrap_on_start, connection_wait_time_seconds, model.sagemaker_session)
175 if self.ssm_iam_role == '':
--> 176 self.ssm_iam_role = SSHEnvironmentWrapper.ssm_role_from_iam_arn(model.role)
177 self.model = model
178
~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sagemaker_ssh_helper/wrapper.py in ssm_role_from_iam_arn(cls, iam_arn)
77 def ssm_role_from_iam_arn(cls, iam_arn: str):
78 if not iam_arn.startswith('arn:aws:iam::'):
---> 79 raise ValueError("iam_arn should start with 'arn:aws:iam::'")
80 role_position = iam_arn.find(":role/")
81 if role_position == -1:
ValueError: iam_arn should start with 'arn:aws:iam::'
Resource ARNs in the China Regions start with “arn:aws-cn:” rather than “arn:aws:”, so this check fails there. Please adjust the code to support the China Regions.
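A minimal sketch of a partition-aware check, for illustration only. The function name role_from_iam_arn and the regular expression are hypothetical and are not the library's actual implementation; the sketch assumes the caller wants everything that follows ":role/" and that the partitions of interest are aws, aws-cn and aws-us-gov.

```python
import re

# Hypothetical helper, not the library's actual code.
# Accepts the standard, China and GovCloud partitions.
_ROLE_ARN_RE = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/(.+)$")


def role_from_iam_arn(iam_arn: str) -> str:
    """Return the role path and name that follow ':role/' in an IAM role ARN."""
    match = _ROLE_ARN_RE.match(iam_arn)
    if match is None:
        raise ValueError(
            "iam_arn should look like 'arn:<partition>:iam::<account-id>:role/<name>'"
        )
    return match.group(3)
```

With this shape, an ARN such as arn:aws-cn:iam::123456789012:role/MyRole is accepted, while an ARN with an unknown partition still raises ValueError.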
Hello,
Is it possible to use one of the sagemaker-ssh-helper scripts to enable port forwarding between SageMaker Studio's JupyterServer and a running KernelGateway app container? I am developing a Streamlit application which runs on localhost on a given port, e.g. 8053. If I start the application from the System Terminal I can access the application UI in the browser under domain.studio.region.sagemaker.aws/jupyter/default/proxy/8053. However the app is computationally intensive, so I would like to run it from a container Image Terminal instead, in order to take advantage of the container's resources, while still being able to access the application UI in the browser as before. I tried to forward the port on which the application is running inside the container to the Jupyter Server port using sm-local-ssh-ide, but got the following error: SSMManager:No instance IDs found
Perhaps this is not the intended use of the script? Your help would be greatly appreciated as I am new to SageMaker.
The newest version of pip produces unstable results, e.g., sagemaker-ssh-helper may fail to install in some SageMaker Studio kernels with the following error:
ERROR: Cannot uninstall ‘PyYAML’
The current workaround is to downgrade pip to the previous version:
pip install pip==23.0.1
Is there an ETA or roadmap ballpark to support:
def get_endpoint_instance_ids(self, endpoint_name, timeout_in_sec=0):
    raise AssertionError("Not supported yet.")
It would be nice to have a way to list all instances available to connect to.
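Until such a feature exists, one possible approach is to query AWS Systems Manager directly. The sketch below is an assumption about how that could look, not part of sagemaker-ssh-helper: both function names are hypothetical, and it relies only on the boto3 SSM describe_instance_information API, filtered to managed (mi-*) instances. It needs boto3 and valid AWS credentials to actually run the query.

```python
from typing import Dict, Iterable, List


def online_instance_ids(pages: Iterable[Dict]) -> List[str]:
    """Collect the IDs of instances whose SSM agent reports PingStatus Online."""
    ids = []
    for page in pages:
        for info in page.get("InstanceInformationList", []):
            if info.get("PingStatus") == "Online":
                ids.append(info["InstanceId"])
    return ids


def list_online_managed_instances(region_name: str) -> List[str]:
    """Query SSM for managed instances in one region (hypothetical helper)."""
    import boto3  # imported lazily so the pure helper above works without boto3

    ssm = boto3.client("ssm", region_name=region_name)
    paginator = ssm.get_paginator("describe_instance_information")
    pages = paginator.paginate(
        Filters=[{"Key": "ResourceType", "Values": ["ManagedInstance"]}]
    )
    return online_instance_ids(pages)
```

Note that this lists every online managed instance in the account and region, not only those started by SSH Helper; narrowing the result to a specific job or endpoint would need additional filtering.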
Hi, thanks for your wonderful work in this project. I am trying to do this part Remote debugging with PyCharm Debug Server over SSH and I have a question:
When trying to debug a training session, the command sm-local-ssh-training failed even after using root. To clarify, I am running this inside the SSH connection.
$ ./sm-local-ssh-training connect pytorch-mnist-2022-11-15-09-46-35-587
sh: 11: ./sm-local-ssh-training: Permission denied
$ sudo ./sm-local-ssh-training connect pytorch-mnist-2022-11-15-09-46-35-587
./sm-local-ssh-training: line 17: python: command not found
If I run the command in a local terminal (on a Mac), this error happens:
$ sm-local-ssh-training connect pytorch-mnist-2022-11-15-13-19-58-576
/opt/homebrew/bin/sm-local-ssh-training: line 17: python: command not found
INSTANCE_ID not provided
Am I missing something for debug setup?
Currently, SageMaker SSH Helper starts a VNC server and a Jupyter server along with sshd. The new option will allow starting only the minimal necessary service, sshd. The user will need to comment out the second line in the IDE notebook cell and uncomment the third one:
sm-ssh-ide stop
sm-ssh-ide start
#sm-ssh-ide start --ssh-only
Lately I work mainly in SageMaker Studio, and I'd really like to be able to debug / interact with a running job using the same UI.
Create a custom Studio kernel image using an IPython extension and/or custom magic through which users can connect to a running SSH Helper job and run notebook cells on that instead of the Studio app.
The user experience would be something like using EMR clusters in Studio:
mi-1234567890abcdef0
%load_ext sagemaker_ssh_helper.notebook to initialize the IPython extension
%sagemaker_ssh connect mi-1234567890abcdef0 to connect to the instance
%%local cell magic is used: Same as how SageMaker Studio SparkMagic kernel works
%sagemaker_ssh disconnect command would also be useful
Since the sagemaker_ssh_helper library is pip-installable, it might even be possible to get this working with default (e.g. Data Science 3.0) kernels? I'm not sure - assume it depends how much hacking is possible during IPython extension load vs what needs setting up in advance.
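To make the proposed commands concrete, here is a rough sketch of how the argument string of such a line magic could be parsed. Everything in it is hypothetical: the %sagemaker_ssh magic does not exist in the library today, the function name is invented for illustration, and the instance ID pattern is only inferred from the mi-1234567890abcdef0 example above.

```python
import re
from typing import Optional, Tuple

# Assumption: managed instance IDs look like "mi-" plus 17 hex characters,
# as in the example ID used in this proposal.
_INSTANCE_ID_RE = re.compile(r"^mi-[0-9a-f]{17}$")


def parse_sagemaker_ssh_line(line: str) -> Tuple[str, Optional[str]]:
    """Parse the arguments of a hypothetical %sagemaker_ssh line magic."""
    parts = line.split()
    if not parts:
        raise ValueError("expected 'connect <instance-id>' or 'disconnect'")
    command = parts[0]
    if command == "connect":
        if len(parts) != 2 or not _INSTANCE_ID_RE.match(parts[1]):
            raise ValueError("usage: connect mi-<17 hex characters>")
        return command, parts[1]
    if command == "disconnect" and len(parts) == 1:
        return command, None
    raise ValueError(f"unsupported arguments: {line!r}")
```

An IPython extension could pass the text after %sagemaker_ssh to a function like this and then open or close the SSH session accordingly; that wiring is left out because it depends on how the extension would manage the remote kernel.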
To my knowledge, JupyterLab is a bit more fragmented in support for remote kernels than IDEs like VSCode/PyCharm/etc. It seems like there are ways to set up SSH kernels, but it's also a tricky topic to navigate because so many pages online are talking about "accessing your remotely-running Jupyter server" instead. Investigating the Jupyter standard kernel spec paths, I see /opt/conda/envs/studio/share/jupyter/kernels exists but contains only a single python3 kernel which doesn't appear in Studio UI. It looks like there's a custom sagemaker_nb2kg Python library that manages kernels, but no obvious integration points there for alternative kernel sources besides the studio "Apps" system - and sufficiently internal/complex that patching it seems like a bad idea.
...So it looks like directly registering the remote instance as a kernel in JupyterLab would be a non-starter.
If the magic-based approach works, it might also be possible to use with other existing kernel images (as mentioned above) and even inline in the same notebook after a training job is kicked off. Hopefully it would also enable toggling over to a new job/instance without having to run CLI commands to change the installed Jupyter kernels.
The error message appears in the logs when starting SSM agent.
It's safe to ignore it, because it doesn't affect the operation of SageMaker SSH Helper.
We should get rid of this error, to avoid confusion.
Hi, thanks for developing this repo. Is it currently possible, or would it be possible to add capability, to connect to an inference endpoint when MultiDataModel is instantiated using the alternative method of providing an image_uri instead of a model? For example:
from datetime import datetime

import sagemaker
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor

# bucket, image_uri, session, role and estimator are assumed to be defined elsewhere
endpoint_name = "openmmlab-mms-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
multi_model_s3uri = f"s3://{bucket}/openmmlab-mms/"
mme = MultiDataModel(
    name=endpoint_name,
    image_uri=image_uri,
    model_data_prefix=multi_model_s3uri,
    sagemaker_session=session,
    role=role,
)
predictor = Predictor(endpoint_name, sagemaker_session=session)
predictor.serializer = sagemaker.serializers.IdentitySerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()
mme.add_model(model_data_source=estimator.model_data, model_data_path='modelA')
The reason I am using MME in this way, instead of providing an initial model, is that I'm using the MMDetection library's mmdet2torchserve.py tool. This results in my training job producing a model.tar.gz with the following four files:
MAR-INF/MANIFEST.json
mmdet_handler.py
config.py
bbox-mAP_epoch_12.pth
Where mmdet_handler.py is defined here and does not implement SageMaker's standard input_fn, predict_fn and output_fn. Ideally I would like to be able to use this export as-is without needing to download, extract and refactor the handler script in order to provide an entry_point (and inference does indeed appear to work correctly when I do this).
Hi,
I manage to connect to my notebook instance with sm-local-ssh-notebook connect <my_notebook>, as described in the instructions. I log in as root, inside what seems to be a Docker container (checked with this command), with no users under /home. In contrast, when launching a web-based terminal from the notebook, the environment is very different: the default user is ec2-user, user files are under /home/ec2-user, and it seems I am not in a Docker container.
What extra steps are needed to have the same type of environment as in the web terminal, but using sagemaker-ssh-helper? Basically, all I want is to be able to do from my sagemaker-ssh-helper session exactly what I can do in the web-based terminal.
Maybe this has been addressed somewhere, but I didn't find any reference.
Thanks for your help.
In addition to real-time inference, support for batch inference is needed.
This is the request to add SSH Helper support for Accelerate with SageMaker:
https://huggingface.co/docs/accelerate/usage_guides/sagemaker
The training jobs are launched by accelerate launch ./examples/sagemaker_example.py, which is a limiting factor for SSH Helper and other capabilities.
Using HF accelerate or DeepSpeed engine for inference:
Also compare with this DeepSpeed example:
https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
See:
Thank you for the great library! It works fine when I SSH in directly, no issues there. However, if I connect with VSCode, it'll work fine for the most part - until I see a log:
[sagemaker-ssh-helper][sm-setup-ssh][start-ssh] 2024-04-21 07:28:33 INFO [CredentialRefresher] Next credential rotation will be in 29.997197441433332 minutes
Then, the machine will fail within a minute or two of this message, and give the error InternalServerError: We encountered an internal error. Please try again. in SageMaker.
Weirdly enough, this is regardless of the instance type, amount of memory, etc. Also - it doesn't always happen the first time that message is sent, so I'm not sure if it's exactly that issue, or something else. Regardless, the machine fails with an internal server error only when I connect with VS Code after some amount of time connected.
Trying to connect using VS Code has been failing since about a week ago.
I can connect with SSH fine, but VS Code itself fails.
Looks like a permissions error?
I'm getting a tar error which I'm not sure how to fix:
[11:38:35.267] stderr> tar: code: Cannot change ownership to uid 1000, gid 1000: Operation not permitted
[11:38:35.267] stderr> tar: Exiting with failure status due to previous errors
[11:38:35.268] > ERROR: tar exited with non-0 exit code: 0
Hi, will this package work to connect VS Code to a standard SM notebook? I'm currently using an EC2 bastion to ssh through, so would be really nice if this simplified the process.
Only PyTorch and TensorFlow are supported.
Hi - I have high hopes for the sagemaker-ssh-helper, for which thanks!
After setup, upon running
ssh -i ~/.ssh/sagemaker-ssh-gw -p 10022 root@localhost
at the end I get:
An error occurred (BadRequest) when calling the StartSession operation: Enable advanced-instances tier to use Session Manager with your on-premises instances
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
The error aside, does this mean the library will only work with the "advanced-instances tier"?? How does one know if one has this? I am trying to SSH into a plain old sagemaker instance... Help!
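On the "how does one know" part of the question, one way to check the current tier programmatically is the Systems Manager GetServiceSetting API. The sketch below is an illustration, not something taken from the SSH Helper docs: it assumes a boto3 SSM client is passed in and that the setting ID `/ssm/managed-instance/activation-tier` is the one that holds the tier, whose value should be `standard` or `advanced`.

```python
def get_activation_tier(ssm_client):
    """Return the account's SSM managed-instance tier for the client's region.

    `ssm_client` is expected to be a boto3 SSM client (assumption: credentials
    and region are already configured by the caller).
    """
    response = ssm_client.get_service_setting(
        SettingId="/ssm/managed-instance/activation-tier"
    )
    return response["ServiceSetting"]["SettingValue"]
```

With boto3 installed and credentials configured, `get_activation_tier(boto3.client("ssm"))` would be called once per region, since the tier is a per-region setting.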
As the SSM tunnel allows for downloading from the sagemaker instances we want to be able to log the activity.
What is required to set up logging on the instances?
I'm trying to implement local IDE access to SageMaker Studio by following the instructions found here.
Specifically, I've gone for implementing the Lifecycle config script and not the IPython notebook.
It all seems to go well until the very last step, in which the amazon-ssm-agent is invoked. It appears to try to call out to IMDS, but AWS themselves say that IMDS access is blocked in SageMaker Studio.
What should I do in this case? Error logs attached below:
2023-09-12 11:57:23 INFO Checking if agent identity type OnPrem can be assumed
2023-09-12 11:57:23 INFO Checking if agent identity type EC2 can be assumed
2023-09-12 11:57:23 ERROR [EC2Identity] Failed to get instance info from IMDS. Err: failed to get identity instance id. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
caused by: :
status code: 0, request id:
caused by: RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: invalid argument
2023-09-12 11:57:23 INFO Checking if agent identity type CustomIdentity can be assumed
2023-09-12 11:57:23 ERROR Agent failed to assume any identity
2023-09-12 11:57:23 ERROR failed to get identity: failed to find agent identity
2023-09-12 11:57:23 ERROR Error occurred when starting amazon-ssm-agent: failed to get identity: failed to find agent identity
sagemaker-ssh-helper version 2.1.0 used.
Hi, is there a way to connect to a SageMaker inference endpoint running my custom Docker container? If it matters, I'm using https://github.com/aws/sagemaker-inference-toolkit
Our use case is to start the SageMaker training job from a SageMaker Studio notebook, where Studio is attached to a private VPC.
When I try to create an SSHEstimatorWrapper from my Studio notebook with the following call:
ssh_wrapper = SSHEstimatorWrapper.create(pytorch_estimator, connection_wait_time_seconds=0)
I get a ConnectTimeoutError: Connect timeout on endpoint URL: "https://sts.amazonaws.com/" exception.
This is because we have a regional VPC endpoint and can access only regional endpoints like https://sts.us-east-1.amazonaws.com/ . From here I see that this calls only the global endpoint.
We would need to pass the region parameter during the initialization of the STS boto3 client so that it uses the STS regional endpoint for the region.
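A minimal sketch of what such a change could look like, assuming a region in the standard `aws` partition (other partitions use different domain suffixes). The helper names are hypothetical and this is not the library's actual code; it only shows that `boto3.client` accepts both `region_name` and `endpoint_url`, which pins the client to the regional STS endpoint.

```python
def regional_sts_endpoint(region):
    """Build the regional STS endpoint URL (standard `aws` partition only)."""
    return f"https://sts.{region}.amazonaws.com"


def make_regional_sts_client(region):
    """Create an STS client pinned to the regional endpoint."""
    # Imported lazily so the URL helper can be used without boto3 installed.
    import boto3

    return boto3.client(
        "sts",
        region_name=region,
        endpoint_url=regional_sts_endpoint(region),
    )
```

For example, `make_regional_sts_client("us-east-1").get_caller_identity()` would go to https://sts.us-east-1.amazonaws.com instead of the global endpoint. Setting the environment variable `AWS_STS_REGIONAL_ENDPOINTS=regional` may also be worth trying as a workaround that needs no code change.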
This link leads me to believe no ( https://aws.amazon.com/systems-manager/pricing/#Session_Manager ), but SSM's pricing is a bit confusing to me and I don't wanna be surprised.
And does this feature still need to be requested for allowlisting? https://docs.aws.amazon.com/sagemaker/latest/dg/ssm-access.html#Allowlist
I don't see anything in the readme.
The SSM setup guide (section 2) currently guides users to set up SSM Advanced Tier by first creating a minimal EC2 instance...
But from my tests, I don't think this is mandatory, at least in all cases?
I was able to enable Advanced Tier simply by opening Systems Manager > Fleet Manager in the Console and opening Account Management > Instance Tier settings. The below screenshot was taken after I'd already created an SSH Helper training job, but I could access the same screen beforehand too by clicking the orange "Get started" button if you navigate to Fleet Manager before any instances are set up.
I've been able to connect to the training job instance no problem, so pretty sure that at least in some cases users can just skip straight to step 2h? It would be nice to streamline the setup instructions if possible and maybe just give the EC2 option for troubleshooting problems?
The specific account I tested all the way through on is a management account in its AWS Organization (but the org only contains that one account).
Hi,
I've been following the instructions for getting WebVNC going and I've been successful in logging into the WebVNC environment. In the README.md file, you show that VSCode and PyCharm are running in this "browser-in-a-browser" environment.
My question is: how do you install these things in the WebVNC environment?
Furthermore, if I have a Dagster webserver serving a Dagit UI running on the KernelGateway app, if I were to configure the port forwarding properly, is it possible to see this UI in the WebVNC environment as well?
Thanks!
When initiating a local SSH session using sm-local-ssh-ide, there is no support for passing a named profile on the command line:
sm-local-ssh-ide <<kernel_gateway_app_name>> --profile dev
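Until such a flag exists, a possible workaround is to select the profile through the `AWS_PROFILE` environment variable, which both the AWS CLI and boto3 honor. Whether every step inside `sm-local-ssh-ide` picks it up is an assumption that has not been verified here. The helper below is hypothetical and only prepares the command and environment; it does not run anything.

```python
import os


def command_with_profile(command, profile, base_env=None):
    """Return (command, env) where env selects the given named AWS profile.

    `base_env` defaults to a copy of the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_PROFILE"] = profile
    return list(command), env
```

The result can be passed to `subprocess.run(cmd, env=env, check=True)`, for example with `command_with_profile(["sm-local-ssh-ide", "<kernel_gateway_app_name>"], "dev")`.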
I have tried following the directions to set up my Local IDE integration with SageMaker Studio over SSH for PyCharm / VSCode and have run into an issue on my MacOS device.
When I run sm-local-configure I get the following output message:
> sm-local-configure
Darwin <MY COMPUTER> 22.5.0 Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:19 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T6020 arm64
cat: /etc/issue: No such file or directory
cat: /etc/os-release: No such file or directory
Python 3.10.12
Password:
Sorry, try again.
Password:
sudo: apt-get: command not found
I believe the issue is this _install_unzip() function:
sagemaker-ssh-helper/sagemaker_ssh_helper/sm-helper-functions
Lines 40 to 46 in 049f97b
which is called by sm-local-configure.
I think this has partially been handled in the _install_aws_cli function, with a separate function for MacOS. If it would be helpful, I can submit a PR to add a check to those methods to see if unzip and curl are already installed, which for MacOS I think they are by default.
The function works as expected (I think) when I comment out those 2 lines and install the package locally.
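The proposed check could look roughly like this. The real helper functions are written in bash, so this Python version is only a sketch of the idea (skip the package-manager call when the tools are already on PATH), and `missing_tools` is a hypothetical name.

```python
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An installer would then call its package manager only when `missing_tools(["unzip", "curl"])` is non-empty, which avoids the `sudo apt-get` call on systems that already ship both tools.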
System: