Comments (7)

Michaelvll commented on June 16, 2024
  1. We do not explicitly mount ~/.sky to VM/container
  2. SSH key changes on new VM/container.

The error does not happen when I run the same container from my local (launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.

Thanks for sharing more details @subhamde8247! One hypothesis is that both the Linux username and the hostname (the output of python -c "import socket; print(socket.gethostname())") are the same across the containers. We generate the user hash that identifies a machine from these two values, so identical values produce an identical hash, which leads to using the same spot controller.
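To illustrate the hypothesis, here is a minimal sketch of a hash derived only from the username and hostname. It is not SkyPilot's actual code: the function name derive_user_hash, the use of SHA-256, the "+" separator, and the 8-character length are all assumptions made for the example.

```python
import hashlib


def derive_user_hash(username: str, hostname: str, length: int = 8) -> str:
    """Hypothetical derivation: hash the two values and keep a short prefix.

    This is an illustration only, not SkyPilot's implementation.
    """
    digest = hashlib.sha256(f"{username}+{hostname}".encode("utf-8")).hexdigest()
    return digest[:length]


# Two containers started from the same image on different VMs can report
# the same username and hostname, so they end up with the same hash.
container_on_vm_a = derive_user_hash("root", "job-runner")
container_on_vm_b = derive_user_hash("root", "job-runner")
same_identity = container_on_vm_a == container_on_vm_b  # True

# A different hostname changes the input to the hash function.
container_other_host = derive_user_hash("root", "job-runner-2")
```

If the hash only depends on these two inputs, any two machines that agree on both are indistinguishable, regardless of which VM they run on.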

To confirm the hypothesis, could you check whether cat ~/.sky/user_hash prints the same value across the VMs/containers?

There are several workarounds:

  1. Share the SSH key across the VMs/containers by explicitly uploading or mounting the keys to them.
  2. Or generate a random user hash for each VM/container when it is first provisioned, by writing a random value to ~/.sky/user_hash: python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash
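The second workaround could be wrapped in a small start-up step, sketched below. The helper name ensure_user_hash is hypothetical, and keeping an already existing non-empty file is a design assumption of this sketch, not something SkyPilot requires.

```python
import uuid
from pathlib import Path


def ensure_user_hash(path: Path) -> str:
    """Return the hash stored at `path`, creating a random one if missing.

    Hypothetical helper: an existing non-empty file is left untouched.
    """
    if path.exists():
        existing = path.read_text().strip()
        if existing:
            return existing
    # Create ~/.sky (or whatever the parent directory is) if needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same value as the one-liner: first 8 hex characters of a random UUID.
    new_hash = uuid.uuid4().hex[:8]
    path.write_text(new_hash + "\n")
    return new_hash
```

It would be called once at container start-up, before the first SkyPilot command, for example as ensure_user_hash(Path.home() / ".sky" / "user_hash"). Note that if the container image already ships a user_hash file, this sketch keeps it, so every container from that image would still share one hash.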

We will also look into the issue and see whether the username and the python -c "import socket; print(socket.gethostname())" output are insufficient for identifying a user : )

from skypilot.

subhamde8247 commented on June 16, 2024

One hack we have been using for now: at the end of each run, use gcloud compute instances list to list all VMs whose names match the pattern sky-spot-controller- and delete them.
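A rough sketch of that cleanup step, assuming the instance list is fetched with gcloud compute instances list --format=json and that each entry has a name field and a zone field holding a URL ending in the zone name. The function name build_delete_commands is hypothetical, and the sketch only builds the delete commands rather than running them.

```python
import json
from typing import List

CONTROLLER_PREFIX = "sky-spot-controller-"

# Command whose stdout is expected as input to build_delete_commands().
LIST_COMMAND = ["gcloud", "compute", "instances", "list", "--format=json"]


def build_delete_commands(
    instances_json: str, prefix: str = CONTROLLER_PREFIX
) -> List[List[str]]:
    """Build one `gcloud compute instances delete` command per matching VM."""
    commands = []
    for instance in json.loads(instances_json):
        name = instance["name"]
        if not name.startswith(prefix):
            continue
        # The zone is assumed to be a URL; keep only its last path segment.
        zone = instance["zone"].rsplit("/", 1)[-1]
        commands.append(
            ["gcloud", "compute", "instances", "delete", name,
             f"--zone={zone}", "--quiet"]
        )
    return commands
```

Each returned command can then be passed to subprocess.run(cmd, check=True). Building the commands separately makes it easy to print and review them before anything is deleted, which matters because the prefix match would also catch a controller that is still in use.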

Wondering if I am missing something or if there is a better solution.

concretevitamin commented on June 16, 2024

from a docker container inside GCP VM that has a service account attached to it

Thanks for the report @subhamde8247! Could you share some details of this client VM? For example,

  • what does “gcloud auth list” show inside this container? Does it only have the service account, or also some static credential files?
  • On each day, is the spot launch triggered from a different container on this client VM, or the same container?

subhamde8247 commented on June 16, 2024
  1. “gcloud auth list” only shows the service account
  2. The client VM is deleted each day after running. A new VM is created the next day, and a new container is started within that VM. So the spot launch is triggered each day from a different container, running on a new VM instance.

concretevitamin commented on June 16, 2024

@subhamde8247 Got it. Some followups:

  • Do you mount the same ~/.sky to the new VM/container every day? This would explain why the new VM/container reuses the same spot controller.
  • Does the SSH key, ~/.ssh/sky-key{.pub}, change on the new VM/container? A newly generated key would explain why the connection to the same spot controller VM is unsuccessful.
    • If this is the case, a fix would be to mount the same ~/.ssh/sky-key{.pub} to the new VM/container every day. This way the same spot controller can be reused, and during idle periods it would be autostopped to save costs.

subhamde8247 commented on June 16, 2024
  1. We do not explicitly mount ~/.sky to VM/container
  2. SSH key changes on new VM/container.

The error does not happen when I run the same container from my local (launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.

subhamde8247 commented on June 16, 2024

Confirmed that cat ~/.sky/user_hash prints the same value for multiple runs of the docker container when launched from multiple VMs. However, the hashes are different for multiple runs when the same container is run from my local machine.

explicitly upload/mounting those keys

Yeah, this would add the complexity of storing these keys in GCP Secret Manager and properly loading them during container start-up each day.

randomly generate a user hash for each VM/container

We don't mind having a new spot controller for each daily job. The only issue is that old spot controllers are not auto-downed, so we are left with a bunch of stopped spot controller instances in our VM list, which we have to delete manually. If available, a --down option for the spot controller would work for us.
