runpod / containers
🐳 | Dockerfiles for the RunPod container images used for our official templates.
Home Page: https://hub.docker.com/u/runpod
License: MIT License
Is it possible to use the volume mount path from an env variable, e.g. --ServerApp.preferred_dir={from env} or --notebook-dir={from env}, so the user can start with their own volume mount path? In my case that's /content instead of /workspace.
Line 19 in 564385a
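As a rough sketch of the idea (an assumption about how the template could be wired, not how it currently works), the start script could point Jupyter at a config file that reads the directory from an environment variable, e.g. a hypothetical JUPYTER_DIR:
# jupyter_server_config.py (hypothetical) - pick the notebook directory from an env variable
import os
c = get_config()  # noqa - injected by Jupyter's config loader
workdir = os.environ.get("JUPYTER_DIR", "/workspace")  # fall back to the current default
c.ServerApp.root_dir = workdir
c.ServerApp.preferred_dir = workdir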
I just built (using docker buildx bake) and deployed an image of the stable diffusion webui. Everything started fine, but the webui doesn't react to any of my clicks. I skipped the creation of the runpod.yaml file, only because I don't understand its purpose and how to fill it out. I am quite new to this. Sorry if my problem is really silly. Would be happy for any help ^)
#16 317.6 RuntimeError: Couldn't install torch.
#16 317.6 Command: "/workspace/venv/bin/python3" -m pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
The 12.1.0 container is EOL (https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md) and will be deleted soon.
This is printed on boot:
2024-04-17T06:32:00.716216802Z *************************
2024-04-17T06:32:00.716263593Z ** DEPRECATION NOTICE! **
2024-04-17T06:32:00.716538501Z *************************
2024-04-17T06:32:00.716629960Z THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
2024-04-17T06:32:00.716661327Z https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
I just followed the tutorial on the RunPod Automatic WebUI, and the custom safetensors model specified is not loaded when running inference. Instead, what is loaded is the default SD model.
To get the model to load properly, we probably need to specify it on this endpoint, sdapi/v1/options:
"sd_model_checkpoint": "Anything-V3.0-pruned.ckpt [2700c435]",
response = requests.post(url=f'{url}/sdapi/v1/options', json=option_payload)
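For reference, a minimal end-to-end sketch of that options call (the base URL and port are assumptions; the payload key and checkpoint name are taken from the snippet above):
import requests

url = "http://127.0.0.1:3000"  # assumed port; adjust to wherever the web UI listens

# Select the checkpoint before running inference
option_payload = {
    "sd_model_checkpoint": "Anything-V3.0-pruned.ckpt [2700c435]",
}
response = requests.post(url=f"{url}/sdapi/v1/options", json=option_payload)
response.raise_for_status()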
If I follow the link
https://github.com/runpod/containers/tree/main/gpt4all
from
https://hub.docker.com/r/runpod/gpt4all#!
I get a 404 - page not found error.
This allows people to fine-tune LLMs and test them without any coding experience. It has become fairly popular and receives regular updates:
I noticed that image generation was significantly slower in new versions of the runpod official A1111 image. Looking into it, it seems like it's due to xformers not being installed, or not loading correctly for whatever reason.
To reproduce (giving my specific steps, but I think it'd occur on secure cloud and non-3090 machines too): deploy a pod with runpod/stable-diffusion:web-ui-10.0.0 (the older runpod/stable-diffusion:web-automatic-6.0.1 doesn't have this problem), and also observe that in the A1111 UI, at the bottom of the page, it says xformers: N/A instead of xformers: <version number>.
Here's a snippet from the startup logs:
2023-07-29T06:28:19.671508453Z
2023-07-29T06:28:19.671510697Z ---
2023-07-29T06:28:20.581231736Z Python 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
2023-07-29T06:28:20.581256173Z Version: v1.5.1
2023-07-29T06:28:20.581258698Z Commit hash: 68f336bd994bed5442ad95bad6b6ad5564a5409a
2023-07-29T06:28:20.581260371Z
2023-07-29T06:28:20.581261924Z
2023-07-29T06:28:20.581263447Z Launching Web UI with arguments: -f --port 3000 --xformers --skip-install --listen --enable-insecure-extension-access
2023-07-29T06:28:20.581282242Z no module 'xformers'. Processing without...
2023-07-29T06:28:20.581286390Z no module 'xformers'. Processing without...
2023-07-29T06:28:20.581287803Z No module 'xformers'. Proceeding without it.
Hi RunPod, it would be great if you could either upgrade the current PyTorch and CUDA template to a newer version or create a new template with the newer versions of PyTorch and CUDA, since some libraries have a dependency on this.
The workers won't actually be able to start up. I fixed this in my own build and it worked. https://github.com/runpod/containers/blob/main/serverless-automatic/start.sh#L7
I'm trying to build the image locally by running:
docker build -t runpod/stable-diffusion-comfyui-custom -f official-templates/stable-diffusion-comfyui/Dockerfile .
from the root of the repository (all files have full permissions as well).
Regardless, I'm getting the following error:
[+] Building 1.1s (12/12) FINISHED docker:default
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2.93kB 0.0s
=> ERROR [internal] load metadata for docker.io/library/scripts:latest 1.0s
=> CANCELED [internal] load metadata for docker.io/nvidia/cuda:11.8.0-base-ubuntu22.04 1.0s
=> ERROR [internal] load metadata for docker.io/library/proxy:latest 1.0s
=> CANCELED [internal] load metadata for docker.io/runpod/stable-diffusion:models-1.0.0 1.0s
=> CANCELED [internal] load metadata for docker.io/runpod/stable-diffusion-models:2.1 1.0s
=> [auth] library/proxy:pull token for registry-1.docker.io 0.0s
=> [auth] nvidia/cuda:pull token for registry-1.docker.io 0.0s
=> [auth] library/scripts:pull token for registry-1.docker.io 0.0s
=> [auth] runpod/stable-diffusion-models:pull token for registry-1.docker.io 0.0s
=> [auth] runpod/stable-diffusion:pull token for registry-1.docker.io 0.0s
------
> [internal] load metadata for docker.io/library/scripts:latest:
------
------
> [internal] load metadata for docker.io/library/proxy:latest:
------
Dockerfile:70
--------------------
68 | # Start Scripts
69 | COPY pre_start.sh /pre_start.sh
70 | >>> COPY --from=scripts start.sh /
71 | RUN chmod +x /start.sh
72 |
--------------------
ERROR: failed to solve: scripts: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
Can you please help me figure out what I'm doing wrong? Keep in mind that I have not modified anything in the Dockerfile yet :( (I also successfully did 'docker login').
Image appears to work with PCIe but not SXM5.
I am trying to understand where this container comes from https://hub.docker.com/r/runpod/tensorflow
It links to this git repo but the folder has been deleted.
I would like to run a newer version of TensorFlow but don't understand how I could update the container that currently exists on RunPod for TensorFlow.
If we can start a notebook from a URL, this feature would become very helpful for both users and template creators.
In my case, I should instruct the user to enter the following URL: https://github.com/camenduru/stable-diffusion-webui-runpod, and then copy and paste the code into a new notebook. However, this manual process can be avoided by using JupyterLab's 'start notebook' feature.
This error appears when relaunching the webui process after installing ControlNet v1.1.142
Running: runpod/stable-diffusion:web-automatic-5.0.0
2023-05-06T18:42:40.063374181Z Building wheel for pycairo (pyproject.toml): finished with status 'error'
2023-05-06T18:42:40.063378201Z Failed to build pycairo
2023-05-06T18:42:40.063381781Z
2023-05-06T18:42:40.063385181Z stderr: error: subprocess-exited-with-error
2023-05-06T18:42:40.063388871Z
2023-05-06T18:42:40.063392351Z × Building wheel for pycairo (pyproject.toml) did not run successfully.
2023-05-06T18:42:40.063398041Z │ exit code: 1
2023-05-06T18:42:40.063401791Z ╰─> [12 lines of output]
2023-05-06T18:42:40.063405601Z running bdist_wheel
2023-05-06T18:42:40.063409130Z running build
2023-05-06T18:42:40.063412730Z running build_py
2023-05-06T18:42:40.063416320Z creating build
2023-05-06T18:42:40.063419860Z creating build/lib.linux-x86_64-cpython-310
2023-05-06T18:42:40.063423460Z creating build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063427140Z copying cairo/__init__.py -> build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063430940Z copying cairo/__init__.pyi -> build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063434690Z copying cairo/py.typed -> build/lib.linux-x86_64-cpython-310/cairo
2023-05-06T18:42:40.063438410Z running build_ext
2023-05-06T18:42:40.063441890Z 'pkg-config' not found.
2023-05-06T18:42:40.063445430Z Command ['pkg-config', '--print-errors', '--exists', 'cairo >= 1.15.10']
2023-05-06T18:42:40.063449340Z [end of output]
2023-05-06T18:42:40.063452820Z
2023-05-06T18:42:40.063456250Z note: This error originates from a subprocess, and is likely not a problem with pip.
2023-05-06T18:42:40.063460050Z ERROR: Failed building wheel for pycairo
2023-05-06T18:42:40.063463640Z ERROR: Could not build wheels for pycairo, which is required to install pyproject.toml-based projects
It can be solved by installing the following before attempting to run the ControlNet installer:
apt-get install libcairo2 libcairo2-dev
FROM runpod/pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
Using this Dockerfile and running
import inference.models.yolo_world.yolo_world
YOLO = inference.models.yolo_world.yolo_world.YOLOWorld(model_id="yolo_world/l")
causes the following error:
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
Creating inference sessions
UserWarning: Specified provider 'OpenVINOExecutionProvider' is not in available provider names.Available providers: 'TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider'
EP Error /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 804: forward compatibility was attempted on non supported HW ; GPU=-593199125 ; hostname=0a84033fcf95 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=238 ; expr=cudaSetDevice(info_.device_id);
when using ['CUDAExecutionProvider', 'OpenVINOExecutionProvider', 'CPUExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 383, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 435, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 804: forward compatibility was attempted on non supported HW ; GPU=-593199125 ; hostname=0a84033fcf95 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=238 ; expr=cudaSetDevice(info_.device_id);
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/scripts/temp.py", line 4, in <module>
YOLO = inference.models.yolo_world.yolo_world.YOLOWorld(model_id="yolo_world/l")
File "/usr/local/lib/python3.10/dist-packages/inference/models/yolo_world/yolo_world.py", line 54, in __init__
clip_model = Clip(model_id="clip/ViT-B-32")
File "/usr/local/lib/python3.10/dist-packages/inference/models/clip/clip_model.py", line 65, in __init__
self.visual_onnx_session = onnxruntime.InferenceSession(
File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 394, in __init__
raise fallback_error from e
File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 389, in __init__
self._create_inference_session(self._fallback_providers, None)
File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 435, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 804: forward compatibility was attempted on non supported HW ; GPU=-593199125 ; hostname=0a84033fcf95 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=238 ; expr=cudaSetDevice(info_.device_id);
The same python script using
FROM pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime
works as expected.
nvidia-smi
Mon Jun 3 22:59:43 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 63C P0 25W / 80W | 1538MiB / 8192MiB | 94% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
docker-compose.yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [ gpu ]
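For what it's worth, a minimal diagnostic sketch (assuming torch and onnxruntime are installed in the image, as the traceback above suggests) that prints what each runtime sees inside the container:
# Quick check of the CUDA stack from inside the container
import torch
import onnxruntime as ort

print("torch built against CUDA:", torch.version.cuda)
print("torch sees a GPU:", torch.cuda.is_available())
print("onnxruntime providers:", ort.get_available_providers())

# Given the Error 804 above, torch.cuda.is_available() is likely to return False
# in the runpod/pytorch image, while the pytorch/pytorch image reports True.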
I am creating a pod that uses HF's text-generation-inference (TGI) Docker container (see image_name below). I can create a pod successfully as long as I do not pass in the --quantize parameter within the docker_args. For example, if I pass in docker_args="--model-id "tiiuae/falcon-7b-instruct" --num-shard 1 --quantize bitsandbytes"
The error in the container log has...2023-08-10T11:30:29.101220272-06:00 /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
and ends with the message: 2023-08-10T11:30:29.101315592-06:00 ValueError: quantization is not available on CPU
HF support's comment when I asked on GitHub: it seems more to me that the GPU is not detected in the Docker image, and that error message is bogus, stemming from that. (I can run fine with 1.0.0 with bnb on a simple docker + gpu environment.)
Another comment just made on the HF GitHub: something about shm not being properly set.
... If I try the other quantization option, gptq, the container throws a signal 4. Is the container seeing the GPU? What is going on with bitsandbytes? Why signal 4? I am hoping to minimize the amount of memory and inference time. Help very much appreciated.
Here is my call to create_pod:
pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type=cloud_type,
    docker_args=f"--model-id {model_id} --num-shard {num_shard} -quantize {quantize}",
    gpu_count=gpu_count,
    volume_in_gb=volume_in_gb,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
    # min_vcpu_count=2,
    # min_memory_in_gb=15,
)
The specs on my community pod are 1 x RTX 3090, 9 vCPU, 37 GB RAM.
Thank you.
Then, trying to train on some models like Lykon/Dreamshape, it fails.
There is an OS error that it cannot find config.json.
Please check it.
It would be very nice to have an environment variable to launch oobabooga (https://github.com/runpod/containers/blob/main/oobabooga/start.sh) with the API. See https://github.com/oobabooga/text-generation-webui#api for more information on this.
Otherwise, the only way to launch text-generation-web-ui with the API is to build my own docker image from scratch.
An alternative would be an ARGS environment variable to let us pass whatever we need to the python app.
Thank you for your consideration :)
Hi RunPod!
I am experiencing issues with the performance of TensorFlow on your A100 80GB machines. The problems seem to originate from an apparent version mismatch between CUDA, cuDNN, and cuBLAS, which is not aligning properly with the version of TensorFlow currently utilized on your systems.
Additionally, I have noticed significantly slow training times on my setups that are beyond what is normally expected. This sluggish performance is particularly noticeable when compared with a 40GB Colab A100 machine which often even outperforms your 1 A100 80GB setup.
Here are the error messages I am receiving:
When initiating training on my single A100 80GB machine:
2023-07-15 00:38:25.585795: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /usr/local/cuda/lib64/libcublas.so.11: undefined symbol: cublasGetSmCountTarget
2023-07-15 00:38:25.781595: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
And also, when I prepared my data, model, and everything else on my 4xA100 80GB machine a while back:
2023-06-22 20:07:13.541513: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
Could not load symbol cublasGetSmCountTarget from libcublas.so.11. Error: /usr/local/cuda/lib64/libcublas.so.11: undefined symbol: cublasGetSmCountTarget
2023-06-22 20:07:13.881071: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-06-22 20:07:14.015923: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2023-06-22 20:07:14.563243: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
2023-06-22 20:07:15.052808: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8401
I picked RunPod as my go-to choice when I decided to move on from Colab, thanks to the potential I saw in your platform. Despite the current, let's call them firmware, challenges, I'm hopeful that you will get your systems up to date and fixed.
All the best!
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
We will use his Docker image as a base (via FROM) to build our own, adding our files on top.
Run a container instance from the runpod/base:0.5.1-cpu image:
docker run --name base -it -d runpod/base:0.5.1-cpu
Then exec into the container:
docker exec -it base /bin/bash
I want to install docker-ce in the base container, following the Docker docs https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository:
apt-get update; \
apt-get install -y sudo \
ca-certificates \
vim \
curl; \
sudo install -m 0755 -d /etc/apt/keyrings; \
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc; \
sudo chmod a+r /etc/apt/keyrings/docker.asc; \
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null; \
sudo apt-get update; \
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
My question: docker is not running; please tell me what to do. Thanks.
Hello,
Template: https://github.com/runpod/containers/tree/main/official-templates/stable-diffusion-comfyui
User comment:
I'm trying to install ComfyUI Manager the standard way with git clone into the custom_nodes folder and it doesn't appear in the UI. I don't know of any other way. Am I missing something?
OK, never mind. I figured it out. I had to install torchvision. The extension is 1.5MB and it's the basic one that lets you download other extensions, so it would be convenient to include it.
Thanks!
JM
The Automatic WebUI is currently just hitting the text-to-image endpoint:
check_api_availability("http://127.0.0.1:3000/sdapi/v1/txt2img")
How can we also hit the image-to-image endpoint?
thanks
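As a rough sketch (an illustration, not the template's built-in check), the image-to-image endpoint can be exercised by POSTing a payload with a base64-encoded init image; the port below assumes the same 3000 used for the txt2img check above:
import base64
import requests

url = "http://127.0.0.1:3000"  # assumed: same port as the txt2img availability check

# img2img expects the source image(s) base64-encoded in init_images
with open("input.png", "rb") as f:
    init_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": "a photo of a cat",
    "init_images": [init_image],
    "denoising_strength": 0.6,
    "steps": 20,
}
response = requests.post(url=f"{url}/sdapi/v1/img2img", json=payload)
response.raise_for_status()
print(list(response.json().keys()))  # the response includes the generated images as base64 strings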
This is the second time I've tried to use a base image to host on RunPod, and it's the second time it hasn't worked. It's frustrating. Please fix this.