
amazon-ray's People

Contributors

ameerhajali, amogkam, architkulkarni, chaokunyang, clarkzinzow, dmitrigekhtman, edoakes, ericl, ffbin, fishbone, guoyuhong, ijrsvt, jovany-wang, kfstorm, krfricke, mehrdadn, mfitton, pcmoritz, raulchen, richardliaw, rkooo567, robertnishihara, scv119, simon-mo, stephanie-wang, suquark, sven1977, wangtaothetonic, wuisawesome, yard1


amazon-ray's Issues

[autoscaler] while running `ray up`, client cannot connect to head node when client node/head node are in the same private subnet.

What is the problem?

In my setting, the client machine is in the same VPC as the requested instances in the cluster.

When a client machine runs `ray up`, the head node apparently must be in a VPC subnet with "Auto-assign public IPv4 address" enabled. When the subnet does not enable this (i.e., it is a private subnet), the client machine cannot connect to the head node; below is the error message.

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      Not yet available, retrying in 10 seconds
      ... (the line above repeats every 10 seconds)
(ends with never getting the IP address).

Ray version and other system information (Python version, TensorFlow version, OS):
Ray: 1.1.0, Python: 3.7.7
Output of `cat /etc/os-release` on the client machine:

NAME="Amazon Linux AMI"
VERSION="2018.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
PRETTY_NAME="Amazon Linux AMI 2018.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2018.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"

Reproduction (REQUIRED)

Note that the client is in the SAME VPC as the 4 subnets requested for head_node and worker_nodes.

Below is my cluster_config.yaml, though the only important part is the 4 SubnetIds.

I confirmed that when I enable "Auto-assign public IPv4 address" on head_node's specified SubnetIds (via the AWS VPC Console), the client is able to find the head node and the cluster launch succeeds.
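
For what it's worth, Ray's AWS provider also has a `use_internal_ips` option that makes the launcher connect to nodes over their private IPs instead of waiting for a public one. A minimal sketch of the provider section, assuming the launcher machine has a network route to the subnet (true here, since it is in the same VPC):

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d
    # Connect via private IPs; no public IPv4 assignment required.
    use_internal_ips: True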


cluster_name: jkkwon_ray_test

min_workers: 5

max_workers: 10

upscaling_speed: 1.0

docker:
    image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []


idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d


auth:
    ssh_user: ubuntu

head_node:
    InstanceType: r5.12xlarge
    ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100

worker_nodes:
    InstanceType: r5.12xlarge
    ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    InstanceMarketOptions:
        MarketType: spot


file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Unfortunately I am working in a private AWS account, so to reproduce, substitute SubnetIds that match your own AWS account.

  • [ no ] I have verified my script runs in a clean environment and reproduces the issue.
  • [ yes ] I have verified the issue also occurs with the latest wheels.

Provide APIs for writing to user logs for publication to CloudWatch

Describe your feature request

Currently, the Amazon-Ray AMI automatically publishes logs written under /tmp/ray/user/ to CloudWatch.
This means that users of the Amazon-Ray AMI can end up with duplicate code for configuring Python loggers to write to this location.
The goal of this feature request is to provide Python APIs so that users do not have to write the logger-configuration logic themselves and can simply call logger = configure_application_logger(logging.getLogger()).
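
A minimal sketch of what such an API could look like (the function name comes from this request; the path and handler details are assumptions, not an existing Amazon-Ray API):

import logging
import os

def configure_application_logger(logger, name="application"):
    """Hypothetical helper: attach a file handler that writes under
    /tmp/ray/user/ so the Amazon-Ray AMI publishes the log to CloudWatch."""
    log_dir = "/tmp/ray/user"
    os.makedirs(log_dir, exist_ok=True)
    handler = logging.FileHandler(os.path.join(log_dir, name + ".log"))
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger

logger = configure_application_logger(logging.getLogger())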

[autoscaler] Using `ami-0f92e9d2b63bc61a2` fails with error "ERROR: ray-1.2.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl is not a supported wheel on this platform."

Problem

I am using ami-0f92e9d2b63bc61a2, which is supposed to be the AMI for Linux - Python 3.7 - Ray 1.2.0.

I am using the yaml file below, where my docker image 048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr is a custom image based on 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.3.1-cpu-py37-ubuntu18.04.

cluster_name: jkkwon_ray_test

min_workers: 10
max_workers: 100
upscaling_speed: 1.0

docker: "
    image: "048211272910.dkr.ecr.us-west-2.amazonaws.com/jkkwon-batscli:zarr"
    container_name: "miamiml_container"
    pull_before_run: True

idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d
    cache_stopped_nodes: False

auth:
    ssh_user: ubuntu
    ssh_private_key: miami_dev_dask_emr_key_pair.pem

head_node:
    InstanceType: r5n.24xlarge
    ImageId: ami-0f92e9d2b63bc61a2 # https://github.com/amzn/amazon-ray
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-02876545b671b57b0"
    ]
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100
    KeyName: "miami_dev_dask_emr_key_pair"

worker_nodes:
    InstanceType: r5n.24xlarge
    ImageId: ami-0f92e9d2b63bc61a2 # https://github.com/amzn/amazon-ray
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    InstanceMarketOptions:
        MarketType: spot
    KeyName: "miami_dev_dask_emr_key_pair"

    
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"

initialization_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

The problem is that running `ray up` fails with the message:



  [6/7] Running setup commands
    (0/2) echo 'export PATH="$HOME/anaco...
Shared connection to 10.0.0.34 closed.
    (1/2) pip install -U https://s3-us-w...
ERROR: ray-1.2.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl is not a supported wheel on this platform.
WARNING: You are using pip version 20.3.3; however, version 21.0.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.
Shared connection to 10.0.0.34 closed.
  New status: update-failed
  !!!
  SSH command failed.
  !!!
  
  Failed to setup head node.

When NOT using the docker image, I can actually get the Ray cluster up and running. But when I log onto it with `ray attach` and open a Python console, I see the following:

Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
[1]+  Stopped                 python
ubuntu@ip-10-0-0-108:~$ python3
Python 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

I am wondering if the AMI actually ships Python 3.6 (and a cp36 Ray wheel) rather than Python 3.7?
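
One quick way to confirm the mismatch from inside the node (or container) is to compare the interpreter version against the wheel tags pip accepts; a sketch:

python3 --version                            # Python the AMI/image actually ships
python3 -m pip debug --verbose | grep cp3    # supported wheel tags (cp36 vs cp37)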

Thanks!

[autoscaler] Minimizing EC2 Cluster Launcher's Permissions

Ray comes with a built-in cluster launcher that deploys a Ray cluster. It launches a single head node and uses that node to launch the cluster’s worker nodes. As a result, an IAM role with appropriate permissions should be granted to the launcher. In most use cases (especially for research purposes) this is not an issue, as the user has an admin role. However, when the launch task is delegated to another service (e.g., a Fargate Task), we'd like to clearly define the permissions required.

Note that this role is for the launcher to launch a cluster; it is different from the role of the head node itself. As detailed in
[autoscaler] Ray Cluster Launcher on AWS | Minimizing Permissions #9327, this issue deals with ray-ec2-launcher's IAM role only. Trimming down the permissions granted to ray-head-v1 and ray-worker-v1 should be a separate issue.
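
As a starting point, here is a sketch of an IAM policy covering the EC2 actions the launcher visibly performs in the autoscaler logs (run/terminate/describe/tag), plus iam:PassRole so it can attach the head node's instance profile; this action list is an assumption, and the exact minimal set is what this issue should pin down:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ec2:RunInstances",
            "ec2:TerminateInstances",
            "ec2:DescribeInstances",
            "ec2:DescribeSubnets",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeKeyPairs",
            "ec2:CreateTags",
            "iam:PassRole"
        ],
        "Resource": "*"
    }]
}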

[autoscaler] Improve experience when EC2 does not have capacity for worker nodes

Hello -

After I spun up the cluster with `ray up my_cluster.yaml`, my workload wasn't being handled well by the Ray cluster. I tried `ray monitor my_cluster.yaml` and found the logs flooded with the messages below:

==> /tmp/ray/session_latest/logs/monitor.err <==
ssh: connect to host 10.0.80.64 port 22: Connection timed out

==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,068 INFO node_launcher.py:78 -- NodeLauncher1: Got 5 nodes to launch.
2021-02-10 05:24:28,186 INFO node_launcher.py:78 -- NodeLauncher1: Launching 5 nodes, type ray-legacy-worker-node-type.

==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:27,659 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c., retrying.
2021-02-10 05:24:27,683 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.
2021-02-10 05:24:27,715 INFO node_provider.py:378 -- Launched 5 nodes [subnet_id=subnet-0180e9267b994bf97]
2021-02-10 05:24:27,715 INFO node_provider.py:397 -- Launched instance i-03fb4297fc5f3f1cd [state=pending, info=pending]
2021-02-10 05:24:27,789 INFO updater.py:273 -- SSH still not available (SSH command failed.), retrying in 5 seconds.

==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,538 ERROR node_launcher.py:72 -- Launch failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 370, in _create_node
    created = self.ec2_fail_fast.create_instances(**conf)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (r5n.24xlarge) is not supported in your requested Availability Zone (us-west-2d). Please retry your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b, us-west-2c.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 70, in run
    self._launch_node(config, count, node_type)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 60, in _launch_node
    self.provider.create_node(node_config, node_tags, count)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 311, in create_node
    created_nodes_dict = self._create_node(node_config, tags, count)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 403, in _create_node
    "Failed to launch instances. Max attempts exceeded.")
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 585, in abort
    raise exc_cls("Exiting due to cli_logger.abort()")
click.exceptions.ClickException: Exiting due to cli_logger.abort()
2021-02-10 05:24:28,538 INFO node_launcher.py:78 -- NodeLauncher0: Got 2 nodes to launch.
2021-02-10 05:24:28,774 INFO node_launcher.py:78 -- NodeLauncher0: Launching 2 nodes, type ray-legacy-worker-node-type.

==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:28,538 PANIC node_provider.py:403 -- Failed to launch instances. Max attempts exceeded.
2021-02-10 05:24:29,000 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c., retrying.
2021-02-10 05:24:29,023 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.

So it looks like EC2 doesn't actually have capacity for the requested instance type, and Ray isn't able to spin up the desired worker nodes. This means I need to run `ray down`, modify my_cluster.yaml, and retry with a different instance type.

I was wondering if we can improve this experience. Perhaps check EC2 instance capacity for at least the minimum number of workers before telling the user that the cluster is launched? Or perhaps let the user specify a list of instance types that they're OK with?
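
On the second suggestion: newer autoscaler configs support multiple worker node types via `available_node_types`, which would let the autoscaler fall back to another instance type when one is unavailable. A sketch of the idea (type names and limits here are illustrative, not a drop-in config):

available_node_types:
    worker.r5n:
        node_config:
            InstanceType: r5n.24xlarge
        min_workers: 0
        max_workers: 50
    worker.r5:
        node_config:
            InstanceType: r5.24xlarge
        min_workers: 0
        max_workers: 50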

Ray AMIs for European Regions

Please also provide the "Amazon Ray Images" AMIs in European regions such as eu-west-1 and eu-central-1.

Thanks a lot for your amazing work!

Publish AMIs with a `latest` tag

I was wondering what it would take to have an AMI alias like amzn-ray-latest. This would reduce our maintenance cost: we would no longer have to update the amzn-ray AMI IDs we specify every time a new version is published.
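
In the meantime, one workaround is to resolve the newest AMI at deploy time by name; a sketch with the AWS CLI, where the owner ID and name pattern are placeholders rather than confirmed values:

aws ec2 describe-images \
    --owners <amzn-ray-owner-id> \
    --filters "Name=name,Values=amzn-ray-*" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text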

[Feature] Multi-Node Batch Support

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Looking at AWS Batch and its support for multi-node jobs, I was wondering if there may be a way for the Ray cluster configuration to support multi-node Batch jobs. What I find appealing is that if I terminate the head node Batch job, all the worker nodes terminate as well. The other appealing aspect is that AWS Batch has a Step Functions integration. It would be nice to invoke a Batch job that spins up a multi-node cluster for Ray to use. In the limited tests I have done, I liked how aborting a Step Functions execution terminates the multi-node Batch job and all the nodes (head and worker) with it.

I am happy to work on the code and submit a Pull Request, but I would need some help from the multi-node Batch team at AWS, as there are a few aspects of the multi-node Batch service that I am not yet familiar with.

Use case

Our basic use case is imagery processing. We use Ray in a producer/consumer model to spin up consumers across a cluster of compute instances to process large blocks of imagery in parallel. The current issue is cleanup when things go wrong. Things currently go wrong in two ways: the Ray cluster has worker nodes that die and leave the cluster in a bad state, or something breaks in Ray's logging and the jobs never end even though the processing is complete. When these things happen, we have defensive services that try to clean up. These services need to do a lot because the Ray cluster isn't a managed service. Running in Batch, which is a managed service, makes cleanup easier: we can terminate a Batch job, and the Step Functions retry logic will spin up a new Batch job and retry. In a multi-node scenario, it would spin up a new cluster and retry.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[autoscaler] Documentation instructions for mounting EFS does not work when `docker` is specified

In this documentation page: https://docs.ray.io/en/latest/cluster/aws-tips.html#

The instructions only work when `docker` is not specified.
When `docker` is specified, the EFS-related commands inside the setup_commands array run inside the Docker container and fail because sudo is not installed there.
I suggest updating the documentation page to state that the instructions only work when `docker` is not specified. In addition, it would be great to include a working example for when a Docker container is used; see the sketch below.
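
One working pattern, assuming the standard docker setup: perform the EFS mount on the host via initialization_commands (these run outside the container) and bind-mount the host directory into the container via run_options. A sketch, with a placeholder filesystem ID:

docker:
    image: "rayproject/ray-ml:latest-gpu"
    container_name: "ray_container"
    # Bind-mount the host's EFS mount point into the container.
    run_options: ["-v /home/ubuntu/efs:/home/ubuntu/efs"]

# Run on the host, outside Docker, where sudo and mount are available.
initialization_commands:
    - sudo mkdir -p /home/ubuntu/efs
    - sudo mount -t nfs4 -o nfsvers=4.1 <efs-id>.efs.us-west-2.amazonaws.com:/ /home/ubuntu/efs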

Sample yaml (fails when `docker` is specified):

cluster_name: jkkwon_ray_test

min_workers: 5

max_workers: 10

upscaling_speed: 1.0

docker: 
    image: "048211272910.dkr.ecr.us-west-2.amazonaws.com/barsecrrepo-1cda8d0d3d9ee1867bae37291b6adc586a3f650c:308796b3-5c89-4a7c-83d0-5ce0abad3094_MiamiMLImage_main"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []

idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a,us-west-2b,us-west-2c,us-west-2d
    cache_stopped_nodes: False

auth:
    ssh_user: ubuntu
    ssh_private_key: miami_dev_dask_emr_key_pair.pem

head_node:
    InstanceType: r5.12xlarge
    ImageId: latest_dlami
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-02876545b671b57b0"
    ]
    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100
    KeyName: "miami_dev_dask_emr_key_pair"

worker_nodes:
    InstanceType: r5.12xlarge
    ImageId: latest_dlami
    SecurityGroupIds:
        - "sg-08ed97f6d08d451f6"
    SubnetIds: [
        "subnet-0180e9267b994bf97",  # us-west-2a, 8187 IP addresses. 10.0.32.0/19
        "subnet-073e6e0338bf209cb",  # us-west-2b, 8187 IP addresses. 10.0.64.0/19
        "subnet-03caa10b59288efae",  # us-west-2c, 8187 IP addresses. 10.0.96.0/19
        "subnet-06dd6dbb8caf5c310",  # us-west-2d, 8187 IP addresses. 10.0.128.0/19
    ]
    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
    KeyName: "miami_dev_dask_emr_key_pair"
    
file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands:
    - aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 048211272910.dkr.ecr.us-west-2.amazonaws.com;

# List of shell commands to run to set up nodes.
setup_commands:
      - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
      - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
        sudo pkill -9 apt-get;
        sudo pkill -9 dpkg;
        sudo dpkg --configure -a;
        sudo apt-get -y install binutils;
        cd $HOME;
        git clone https://github.com/aws/efs-utils;
        cd $HOME/efs-utils;
        ./build-deb.sh;
        sudo apt-get -y install ./build/amazon-efs-utils*deb;
        cd $HOME;
        mkdir efs;
        sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-098309a3.efs.us-west-2.amazonaws.com:/ efs;
        sudo chmod 777 efs;    
        

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

Sample output:



WARNING: You are using pip version 20.3.3; however, version 21.0 is available.
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.
Shared connection to 34.220.27.124 closed.
    (1/2) sudo kill -9 `sudo lsof /var/l...
bash: sudo: command not found
bash: sudo: command not found
bash: sudo: command not found
bash: sudo: command not found
bash: sudo: command not found
bash: sudo: command not found
Cloning into 'efs-utils'...
remote: Enumerating objects: 142, done.
remote: Counting objects: 100% (142/142), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 792 (delta 79), reused 100 (delta 51), pack-reused 650
Receiving objects: 100% (792/792), 234.63 KiB | 6.02 MiB/s, done.
Resolving deltas: 100% (462/462), done.
+ pwd
+ BASE_DIR=/root/efs-utils
+ BUILD_ROOT=/root/efs-utils/build/debbuild
+ VERSION=1.28.2
+ RELEASE=1
+ DEB_SYSTEM_RELEASE_PATH=/etc/os-release
+ UBUNTU18_REGEX=Ubuntu 18
+ UBUNTU20_REGEX=Ubuntu 20
+ DEBIAN11_REGEX=Debian GNU/Linux bullseye
+ echo Cleaning deb build workspace
Cleaning deb build workspace
+ rm -rf /root/efs-utils/build/debbuild
+ mkdir -p /root/efs-utils/build/debbuild
+ echo Creating application directories
Creating application directories
+ mkdir -p /root/efs-utils/build/debbuild/etc/amazon/efs
+ mkdir -p /root/efs-utils/build/debbuild/etc/init/
+ mkdir -p /root/efs-utils/build/debbuild/etc/systemd/system
+ mkdir -p /root/efs-utils/build/debbuild/sbin
+ mkdir -p /root/efs-utils/build/debbuild/usr/bin
+ mkdir -p /root/efs-utils/build/debbuild/var/log/amazon/efs
+ mkdir -p /root/efs-utils/build/debbuild/usr/share/man/man8
+ [ -f /etc/os-release ]
+ grep -e Ubuntu 18 -e Debian GNU/Linux bullseye+  -e Ubuntu 20
grep PRETTY_NAME /etc/os-release
+ echo PRETTY_NAME="Ubuntu 18.04.5 LTS"
PRETTY_NAME="Ubuntu 18.04.5 LTS"
+ echo Correcting python executable
Correcting python executable
+ sed -i -e s/python|python2/python3/ dist/amazon-efs-utils.control
+ sed -i -e 1 s/^.*$/\#!\/usr\/bin\/env python3/ src/watchdog/__init__.py
+ sed -i -e 1 s/^.*$/\#!\/usr\/bin\/env python3/ src/mount_efs/__init__.py
+ echo Copying application files
Copying application files
+ install -p -m 644 dist/amazon-efs-mount-watchdog.conf /root/efs-utils/build/debbuild/etc/init
+ install -p -m 644 dist/amazon-efs-mount-watchdog.service /root/efs-utils/build/debbuild/etc/systemd/system
+ install -p -m 444 dist/efs-utils.crt /root/efs-utils/build/debbuild/etc/amazon/efs
+ install -p -m 644 dist/efs-utils.conf /root/efs-utils/build/debbuild/etc/amazon/efs
+ install -p -m 755 src/mount_efs/__init__.py /root/efs-utils/build/debbuild/sbin/mount.efs
+ install -p -m 755 src/watchdog/__init__.py /root/efs-utils/build/debbuild/usr/bin/amazon-efs-mount-watchdog
+ echo Copying install scripts
Copying install scripts
+ install -p -m 755 dist/scriptlets/after-install-upgrade /root/efs-utils/build/debbuild/postinst
+ install -p -m 755 dist/scriptlets/before-remove /root/efs-utils/build/debbuild/prerm
+ install -p -m 755 dist/scriptlets/after-remove /root/efs-utils/build/debbuild/postrm
+ echo Copying control file
Copying control file
+ install -p -m 644 dist/amazon-efs-utils.control /root/efs-utils/build/debbuild/control
+ echo Copying conffiles
Copying conffiles
+ install -p -m 644 dist/amazon-efs-utils.conffiles /root/efs-utils/build/debbuild/conffiles
+ echo Copying manpages
Copying manpages
+ install -p -m 644 man/mount.efs.8 /root/efs-utils/build/debbuild/usr/share/man/man8/mount.efs.8
+ echo Creating deb binary file
Creating deb binary file
+ echo 2.0
+ echo Setting permissions
Setting permissions
+ find /root/efs-utils/build/debbuild -type d
+ xargs chmod 755
+ echo Creating tar
Creating tar
+ cd /root/efs-utils/build/debbuild
+ tar czf control.tar.gz control conffiles postinst prerm postrm --owner=0 --group=0
+ tar czf data.tar.gz etc sbin usr var --owner=0 --group=0
+ cd /root/efs-utils
+ echo Building deb
Building deb
+ DEB=/root/efs-utils/build/debbuild/amazon-efs-utils-1.28.2-1_all.deb
+ ar r /root/efs-utils/build/debbuild/amazon-efs-utils-1.28.2-1_all.deb /root/efs-utils/build/debbuild/debian-binary
ar: creating /root/efs-utils/build/debbuild/amazon-efs-utils-1.28.2-1_all.deb
+ ar r /root/efs-utils/build/debbuild/amazon-efs-utils-1.28.2-1_all.deb /root/efs-utils/build/debbuild/control.tar.gz
+ ar r /root/efs-utils/build/debbuild/amazon-efs-utils-1.28.2-1_all.deb /root/efs-utils/build/debbuild/data.tar.gz
+ echo Copying deb to output directory
Copying deb to output directory
+ cp /root/efs-utils/build/debbuild/amazon-efs-utils-1.28.2-1_all.deb build/
bash: sudo: command not found
bash: sudo: command not found
bash: sudo: command not found
Shared connection to 34.220.27.124 closed.

[autoscaler] Improve documentation for spinning up a Ray cluster with a non-public Docker image

Hello-

Just a suggestion for including a documentation for spinning up a Ray cluster with a non-public Docker image.

You have to add the line below to avoid a "no basic auth credentials" error from the docker pull step of `ray up`; this wasn't particularly clear from any existing docs under https://docs.ray.io/en/latest/cluster/cloud.html.

initialization_commands:
    - aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 048211272910.dkr.ecr.us-west-2.amazonaws.com;
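
Note that initialization_commands run on the host of every node (head and workers) before the Docker image is pulled, which is why the login above takes effect for the pull. The node's instance profile (or local credentials) must also be allowed to call ecr:GetAuthorizationToken and the ECR image-pull actions.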
