
Comments (28)

btrude commented on August 24, 2024

I made a jukebox docker image after proving that my local 2080 Ti wasn't going to cut it for training. I have only had the opportunity to test it on vast.ai, with a 1070 and then 2x V100s, but both sampling and training seem to be working. You can spin up a vast instance for less than a dollar an hour and start messing around with it using btrude/jukebox-docker:latest as your image. IMO this is the easiest way to get going with this project from a hobbyist perspective. There are some minor tweaks still to be made to the image, but overall it works straight out of the box on vast, so if anyone tries it please let me know whether it works for you (especially outside of vast).

btrude commented on August 24, 2024

> I should say, I installed the libraries on my Mac. I haven't installed anything on the Vast.ai server yet since I don't know how or even if I'm supposed to. Just clarifying. Thank you! Best wishes

When using vast.ai, or just docker on its own, you are using virtual machines that have little or no connection to your local machine. So in this case you don't need to install drivers or any software other than ssh in order to connect to a vast instance and run the code (meaning you can safely remove nvidia-related software from your Mac, and you already have ssh because it is built into macOS). Also, the entire point of docker images is that you should not have to install anything and can begin using them immediately after they are loaded (unless you have some specific need like the one I outline below). Picking up from my instructions above, you should do the following:

Follow this guide https://help.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent through to this page https://help.github.com/en/github/authenticating-to-github/adding-a-new-ssh-key-to-your-github-account, but instead of putting the ssh key into your github account, put it into the ssh key box on this page: https://vast.ai/console/account/
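In practice, the key setup from those guides boils down to a few terminal commands (a minimal sketch; the email is just a label for the key):

    ssh-keygen -t ed25519 -C "you@example.com"  # accept the default file path when prompted
    eval "$(ssh-agent -s)"                      # start the ssh agent
    ssh-add ~/.ssh/id_ed25519                   # register the new key with the agent
    cat ~/.ssh/id_ed25519.pub                   # paste this output into vast's ssh key box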

Create a vast instance (you need 16GB of VRAM for anything more than 1b_lyrics with n_samples ~9, so go for a P100 or V100 if that's what you care about; otherwise a 2080 Ti/1080 Ti will be cheapest. Remember to allocate at least 30GB of disk space or you will get errors and nothing will work), then go to https://vast.ai/console/instances/ and wait for it to spin up. Sometimes you have to click the start button quite a few times before it will actually begin (maybe this isn't necessary and it will just do it on its own?). My images should be cached, so if you don't see the blue button transition to "Connect" within a few minutes then it's most likely broken, and you should destroy the instance and start over until you are given the "Connect" option after it says the image has successfully loaded (I bring this up because sometimes instances fail to load and it's not obvious through the UI; this will probably save someone time talking to customer support/wasting credits).

Click "Connect" and a modal will pop up with an ssh command, copy that command into your terminal and type "yes" when prompted and then cd /opt/jukebox/ as vast does not take you to the docker image workdir for whatever reason. You should now be connected to the vast instances inside an instance of tmux. tmux allows your processes to stay running even after you have disconnected from the instance which is potentially important depending on how long you intend to use it for. See https://tmuxcheatsheet.com/ for important tmux commands, or just do ctrl+b, d to detach from the session, then type exit to exit ssh when you are done. Generally when I detach from an instance I just use nvidia-smi in a separate terminal window (you can connect to the instance in multiple windows using the original ssh command as many times as you need) to determine if the process has finished or not (when the gpu utilization has gone to zero), but if you need to reattach, follow the instructions in the cheat sheet.

In order to pass your own dataset, prompt, or original code, or to recover any samples you made, you will have to use scp (which is also built into macOS). Take the ssh command provided to you by vast, e.g. ssh -p 16090 [email protected] -L 8080:localhost:8080, and pass the relevant info to scp like:

    scp -P 16090 [email protected]:/opt/jukebox/path/to/file.wav ~/path/on/my/local/mac

So if you wanted to transfer a file from the default example in this repo's readme to your desktop it would look like this:

    scp -P 16090 [email protected]:/opt/jukebox/sample_5b/level_0/item_0.wav ~/Desktop

depending on which specific file, or just:

    scp -r -P 16090 [email protected]:/opt/jukebox/sample_5b/ ~/Desktop

if you want to transfer an entire directory. You can also go in the opposite direction if you need to send things to the instance, like:

    scp -r -P 16090 ~/Desktop/my_audio_dataset/ [email protected]:/opt/jukebox/

Anyone just messing with sampling should note that the metadata in sample.py is hard-coded, so you may want to install nano (apt-get install nano; nano is a command-line text editor), run nano jukebox/sample.py, arrow down to around line 188, and change the defaults to whatever you want (see here for the default options: https://github.com/openai/jukebox/tree/master/jukebox/data/ids; v3=1b, v2=5b). Ctrl+x, then y, saves and exits nano.
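For orientation, the hard-coded block you are editing looks roughly like this (a paraphrased sketch, not the exact source; the artist and genre strings must match entries in the ids files linked above):

    # jukebox/sample.py, around line 188 (sketch):
    metas = [dict(artist="Alan Jackson",      # must appear in the artist ids file
                  genre="Country",            # must appear in the genre ids file
                  lyrics="...",               # only used by the _lyrics models
                  total_length=total_length,
                  offset=offset,
                  )] * hps.n_samples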

carchrae commented on August 24, 2024

@diffractometer - sounds right.

if you do have a cuda-capable nvidia card (but not the cuda libs installed) you could probably still run the docker image with the gpu extensions locally.
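a sketch of what that might look like, assuming the host has the nvidia driver and nvidia-container-toolkit installed (docker 19.03+ provides the --gpus flag):

    docker run --gpus all -it btrude/jukebox-docker:latest bash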

otherwise, if you deploy one of these amis it sounds like you get a gpu-enabled docker pre-installed. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html

carchrae commented on August 24, 2024

i suspect you'd hit the same error in jupyter, as it seems the code requires cuda/nccl:

https://github.com/openai/jukebox/blob/master/jukebox/utils/dist_utils.py#L42

i guess this is really a bug in documentation (a common one): the code requires an nvidia gpu. looking at the other issues getting reported, i think you also need a gpu with a lot of ram. i have yet to try it on my card (it has only 6gb of ram)

diffractometer commented on August 24, 2024

@Jimmiexjames I'm having good luck using the colab notebook, at least as far as getting it running. I ended up using the paid plan to stop timeouts.

btrude commented on August 24, 2024

> @btrude thanks for the update. I'm not sure where to actually find that docker by your description, can you share please? Cheers.

https://hub.docker.com/r/btrude/jukebox-docker. I also added btrude/jukebox-docker:apex for faster training.
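If you are running docker yourself rather than through vast, pulling either tag is just:

    docker pull btrude/jukebox-docker:latest
    docker pull btrude/jukebox-docker:apex   # the apex build, for faster training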

btrude commented on August 24, 2024

> Hi, noob programmer here -- can I run this on a vast.ai server? How?

Buy credits -> Create -> Edit Image & Config -> scroll to the last option and click the right-hand Select, then type the name of either of my images into the prompt -> allocate ~30GB of disk space (at least 15GB of that is for the models, which get downloaded each time) and click the bottom-most Select button

After that you're on your own šŸ„¼

LeapGamer commented on August 24, 2024

Yes, it was a problem with too little memory. I was able to get it all working by finding an instance with enough memory. Cheers!

carchrae commented on August 24, 2024

given the pain of getting the correct version of nvidia drivers all lined up on a system, i wonder if a docker image of this repo would help (or alternatively, running this project inside the tf docker image).

https://www.tensorflow.org/install/docker

setting docker up for gpu access requires some extra steps (see step 2 in the link above), but was pretty straightforward.
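the quick sanity check from those docs, once the toolkit is installed, is to run nvidia-smi from inside a cuda container (the image tag here is illustrative):

    docker run --gpus all --rm nvidia/cuda:10.2-base nvidia-smi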

diffractometer commented on August 24, 2024

I'm having the same issue. I agree a docker image would be great. Does anyone know if NCCL is a dep?

diffractometer commented on August 24, 2024

@carchrae I think running a docker image on an EC2 instance, if you don't have CUDA on a Mac, is the way to go, right?

diffractometer commented on August 24, 2024

@carchrae awesome, I'll check that out. My other friend said I should just concentrate on getting it running locally in a jupyter notebook

diffractometer commented on August 24, 2024

Ah, dang. That makes sense, especially given the results. I'll keep poking...

carchrae commented on August 24, 2024

hmm - or maybe not? there is a cpu-only flag (i bet it'll be damn slow tho!):

    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")

diffractometer commented on August 24, 2024

ah, yeah I saw that line and tried to change it, but every time I ran it the error still bubbled up: Distributed package doesn't have NCCL built in

carchrae commented on August 24, 2024

did you get any more useful error from this output?

            print(f"Caught error during NCCL init (attempt {attempt_idx} of {n_attempts}): {e}")

also, love the comment on the next line

            sleep(1 + (0.01 * mpi_rank))  # Sleep to avoid thundering herd

carchrae commented on August 24, 2024

so, i went through the install steps, and the sample seems to work for me (it is still running/downloading stuff)

my system: ubuntu 18.04, cuda libs 10.2.89 installed, gtx 1060 w/ 6gb.

    tom@saturn:~/projects/learning/jukebox$ python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
    Using cuda True
    {'name': 'sample_5b', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
    Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
    Downloading from gce
    Restored from /home/tom/.cache/jukebox-assets/models/5b/vqvae.pth.tar
    0: Loading vqvae in eval mode
    Conditioning on 1 above level(s)
    Checkpointing convs
    Checkpointing convs
    Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_artist_ids.txt
    Loading artist IDs from /home/tom/projects/learning/jukebox/jukebox/data/ids/v2_genre_ids.txt
    Level:0, Cond downsample:4, Raw to tokens:8, Sample length:65536
    Downloading from gce

prafullasd commented on August 24, 2024

The project does require a GPU to run; it could work on CPU but hasn't been tested and will almost surely be very slow.

@maraoz The NCCL error you see is in initializing torch.distributed, which technically isn't needed for sampling but is unfortunately still present in the code. Maybe initialize it with a different backend, e.g. setup_dist_from_mpi(backend="gloo"), or remove distributed/mpi altogether as done here #36 (comment)
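For anyone wondering where that change goes: sample.py gets its distributed setup from jukebox/utils/dist_utils.py, so the edit is roughly (a sketch; the exact call site may vary between versions):

    # in jukebox/sample.py, where torch.distributed gets initialized:
    from jukebox.utils.dist_utils import setup_dist_from_mpi

    # "gloo" is a cpu-capable backend, so this sidesteps the NCCL requirement:
    rank, local_rank, device = setup_dist_from_mpi(backend="gloo")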

diffractometer commented on August 24, 2024

@prafullasd @carchrae thank you for your input; it looks like I need to spend a couple of days working on my tooling before I can attempt a build, so I'm going to familiarize myself with notebooks. If there's any way I can help with a docker build in the meantime, testing at least haha ;) lmk

Jimmiexjames commented on August 24, 2024

Wowzers... this seems a little more complicated than I thought. I have the hardware and ram and even the choice of iOS vs Mac... but I don't code and I usually don't pirate, so I'm "overencumbered" by this foggy paranoia about this entire thing.

stevebanik commented on August 24, 2024

@prafullasd how do you initialize with the gloo backend? Is that an option passed to sample.py?

diffractometer commented on August 24, 2024

@btrude thanks for the update. I'm not sure where to actually find that docker by your description, can you share please? Cheers.

perlman-izzy commented on August 24, 2024

Hi, noob programmer here -- can I run this on a vast.ai server? How?


LeapGamer commented on August 24, 2024

btrude, first of all, you are amazing. I got everything working through vast.ai and your docker image for the 1B model, following your instructions above. When I get a P100 instance going and try to use the 5B model, it says Killed while trying to load the priors. I read that the notebook handles this differently than sample.py, and that you can solve the problem either by increasing swap or by editing the code in sample.py to match how the notebook loads the priors. Did you run into this problem with the 5B model on vast.ai? I tried changing the size of my swap, but it seems the recommended way of doing that is through the docker config; vast.ai only has 1G of swap by default, and it doesn't seem like you can change it once connected.

btrude commented on August 24, 2024

> btrude, first of all, you are amazing. I got everything working through vast.ai and your docker image for the 1B model, following your instructions above. When I get a P100 instance going and try to use the 5B model, it says Killed while trying to load the priors. I read that the notebook handles this differently than sample.py, and that you can solve the problem either by increasing swap or by editing the code in sample.py to match how the notebook loads the priors. Did you run into this problem with the 5B model on vast.ai? I tried changing the size of my swap, but it seems the recommended way of doing that is through the docker config; vast.ai only has 1G of swap by default, and it doesn't seem like you can change it once connected.

After it says "Killed", what does echo $? say? If it's 137 then yeah, you're out of memory and need to pick an instance with more memory. I don't think I've ever had OOM problems, though; the only time I ever saw "Killed" was when I didn't allocate enough disk space.
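For reference, the check and the arithmetic behind it:

    echo $?   # prints the last command's exit code; 137 = 128 + 9 (SIGKILL), usually the kernel OOM killer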

cicinwad commented on August 24, 2024

> Wowzers... this seems a little more complicated than I thought. I have the hardware and ram and even the choice of iOS vs Mac... but I don't code and I usually don't pirate, so I'm "overencumbered" by this foggy paranoia about this entire thing.

I use an iOS, I know.
