Comments (10)

nikito commented on July 28, 2024

Are you running the latest pull of WIS? It looks like this is loading every model, which is likely eating up your VRAM. You can help this by only loading the model you primarily want to use. If you go to the WIS folder and copy settings.py to custom_settings.py, you can then modify the settings in that file to define which models you want to load. For instance, in mine I only load the large model to save on VRAM:
(screenshot: custom_settings.py with only the large model enabled)
Additionally, if you need to save further VRAM, you can lower concurrent_gpu_chunks to 1 (the default is 2), which will make inference use less VRAM at the cost of slightly slower response time (probably unnoticeable for most tasks).
Also note, be sure your whisper_model_default is set to the model you are loading!
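
For reference, a rough sketch of what custom_settings.py could end up looking like. Only whisper_model_default and concurrent_gpu_chunks are named in this thread; the model-list variable below is a hypothetical placeholder, so check settings.py for the exact names in your WIS version:

# custom_settings.py -- illustrative sketch only; copy settings.py and edit
# the matching variables rather than pasting this verbatim.

# Hypothetical list of Whisper models to load; loading only "large" keeps
# VRAM usage down on a 6 GB card.
whisper_models = ["large"]

# Must match a model that is actually loaded.
whisper_model_default = "large"

# Lower this from the default of 2 to trade a little latency for less VRAM.
concurrent_gpu_chunks = 1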

Once you save your changes, rebuild/deploy the container.

Hope this helps!

ceinstaller commented on July 28, 2024

YES!!!! Nick, you're the man of the hour, the tower of power!

nikito commented on July 28, 2024

Actual speed of inference is generally down to how many CUDA cores you have and the architecture of the GPU. You may also want to use some of that VRAM for TTS 🙂

nikito commented on July 28, 2024

Are any other processes sharing the GPU? What does the beginning of the log look like at startup?

ceinstaller commented on July 28, 2024

No, the GPU is dedicated to WIS. Here's the full log:

$ ./utils.sh run
Using configuration overrides from .env file
[+] Running 2/0
✔ Container willow-inference-server-wis-1 Created 0.0s
✔ Container willow-inference-server-nginx-1 Created 0.0s
Attaching to nginx-1, wis-1
wis-1 |
wis-1 | =====================
wis-1 | == NVIDIA TensorRT ==
wis-1 | =====================
wis-1 |
wis-1 | NVIDIA Release 23.08 (build 66128967)
wis-1 | NVIDIA TensorRT Version 8.6.1
wis-1 | Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
wis-1 |
wis-1 | Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
wis-1 |
wis-1 | https://developer.nvidia.com/tensorrt
wis-1 |
wis-1 | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
wis-1 |
wis-1 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
wis-1 | By pulling and using the container, you accept the terms and conditions of this license:
wis-1 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
wis-1 |
wis-1 | To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh
wis-1 |
wis-1 | To install the open-source samples corresponding to this TensorRT release version
wis-1 | run /opt/tensorrt/install_opensource.sh. To build the open source parsers,
wis-1 | plugins, and samples for current top-of-tree on master or a different branch,
wis-1 | run /opt/tensorrt/install_opensource.sh -b
wis-1 | See https://github.com/NVIDIA/TensorRT for more information.
nginx-1 | /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
nginx-1 | /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
nginx-1 | /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
nginx-1 | 10-listen-on-ipv6-by-default.sh: info: IPv6 listen already enabled
nginx-1 | /docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
nginx-1 | /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
nginx-1 | /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
nginx-1 | /docker-entrypoint.sh: Configuration complete; ready for start up
wis-1 |
wis-1 | [2024-01-26 04:40:25 +0000] [92] [DEBUG] Current configuration:
wis-1 | config: ./gunicorn.conf.py
wis-1 | wsgi_app: None
wis-1 | bind: ['0.0.0.0:19000']
wis-1 | backlog: 2048
wis-1 | workers: 1
wis-1 | worker_class: uvicorn.workers.UvicornWorker
wis-1 | threads: 1
wis-1 | worker_connections: 1000
wis-1 | max_requests: 0
wis-1 | max_requests_jitter: 0
wis-1 | timeout: 0
wis-1 | graceful_timeout: 10
wis-1 | keepalive: 3600
wis-1 | limit_request_line: 4094
wis-1 | limit_request_fields: 100
wis-1 | limit_request_field_size: 8190
wis-1 | reload: False
wis-1 | reload_engine: auto
wis-1 | reload_extra_files: []
wis-1 | spew: False
wis-1 | check_config: False
wis-1 | print_config: False
wis-1 | preload_app: False
wis-1 | sendfile: None
wis-1 | reuse_port: False
wis-1 | chdir: /app
wis-1 | daemon: False
wis-1 | raw_env: []
wis-1 | pidfile: None
wis-1 | worker_tmp_dir: None
wis-1 | user: 0
wis-1 | group: 0
wis-1 | umask: 0
wis-1 | initgroups: False
wis-1 | tmp_upload_dir: None
wis-1 | secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
wis-1 | forwarded_allow_ips: ['127.0.0.1']
wis-1 | accesslog: None
wis-1 | disable_redirect_access_to_syslog: False
wis-1 | access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
wis-1 | errorlog: -
wis-1 | loglevel: debug
wis-1 | capture_output: False
wis-1 | logger_class: gunicorn.glogging.Logger
wis-1 | logconfig: None
wis-1 | logconfig_dict: {}
wis-1 | logconfig_json: None
wis-1 | syslog_addr: udp://localhost:514
wis-1 | syslog: False
wis-1 | syslog_prefix: None
wis-1 | syslog_facility: user
wis-1 | enable_stdio_inheritance: False
wis-1 | statsd_host: None
wis-1 | dogstatsd_tags:
wis-1 | statsd_prefix:
wis-1 | proc_name: None
wis-1 | default_proc_name: main:app
wis-1 | pythonpath: None
wis-1 | paste: None
wis-1 | on_starting: <function OnStarting.on_starting at 0x7ff89227fac0>
wis-1 | on_reload: <function OnReload.on_reload at 0x7ff89227fbe0>
wis-1 | when_ready: <function WhenReady.when_ready at 0x7ff89227fd00>
wis-1 | pre_fork: <function Prefork.pre_fork at 0x7ff89227fe20>
wis-1 | post_fork: <function Postfork.post_fork at 0x7ff89227ff40>
wis-1 | post_worker_init: <function PostWorkerInit.post_worker_init at 0x7ff8922a00d0>
wis-1 | worker_int: <function WorkerInt.worker_int at 0x7ff8922a01f0>
wis-1 | worker_abort: <function WorkerAbort.worker_abort at 0x7ff8922a0310>
wis-1 | pre_exec: <function PreExec.pre_exec at 0x7ff8922a0430>
wis-1 | pre_request: <function PreRequest.pre_request at 0x7ff8922a0550>
wis-1 | post_request: <function PostRequest.post_request at 0x7ff8922a05e0>
wis-1 | child_exit: <function ChildExit.child_exit at 0x7ff8922a0700>
wis-1 | worker_exit: <function WorkerExit.worker_exit at 0x7ff8922a0820>
wis-1 | nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7ff8922a0940>
wis-1 | on_exit: <function OnExit.on_exit at 0x7ff8922a0a60>
wis-1 | ssl_context: <function NewSSLContext.ssl_context at 0x7ff8922a0b80>
wis-1 | proxy_protocol: False
wis-1 | proxy_allow_ips: ['127.0.0.1']
wis-1 | keyfile: None
wis-1 | certfile: None
wis-1 | ssl_version: 2
wis-1 | cert_reqs: 0
wis-1 | ca_certs: None
wis-1 | suppress_ragged_eofs: True
wis-1 | do_handshake_on_connect: False
wis-1 | ciphers: None
wis-1 | raw_paste_global_conf: []
wis-1 | strip_header_spaces: False
wis-1 | [2024-01-26 04:40:25 +0000] [92] [INFO] Starting gunicorn 21.2.0
wis-1 | [2024-01-26 04:40:25 +0000] [92] [DEBUG] Arbiter booted
wis-1 | [2024-01-26 04:40:25 +0000] [92] [INFO] Listening at: http://0.0.0.0:19000 (92)
wis-1 | [2024-01-26 04:40:25 +0000] [92] [INFO] Using worker: uvicorn.workers.UvicornWorker
wis-1 | [2024-01-26 04:40:25 +0000] [93] [INFO] Booting worker with pid: 93
wis-1 | [2024-01-26 04:40:25 +0000] [92] [DEBUG] 1 workers
wis-1 | [2024-01-26 04:40:30 +0000] [93] [INFO] Willow Inference Server is starting... Please wait.
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] CUDA: Detected 1 device(s)
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] CUDA: Device 0 name: NVIDIA GeForce GTX 1660 SUPER
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] CUDA: Device 0 capability: 75
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] CUDA: Device 0 total memory: 6225002496 bytes
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] CUDA: Device 0 free memory: 6152650752 bytes
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] Started server process [93]
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] Waiting for application startup.
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] CTRANSLATE: Supported compute types for device cuda are {'float16', 'int8_float16', 'float32', 'int8_float32', 'int8'}- using configured int8_float16
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] Loading whisper model: tiny
wis-1 | [2024-01-26 04:40:31 +0000] [93] [INFO] Loading whisper model: base
wis-1 | [2024-01-26 04:40:32 +0000] [93] [INFO] Loading whisper model: small
wis-1 | [2024-01-26 04:40:33 +0000] [93] [INFO] Loading whisper model: medium
wis-1 | [2024-01-26 04:40:37 +0000] [93] [INFO] Loading whisper model: large
wis-1 | [2024-01-26 04:40:49 +0000] [93] [INFO] Warming models...
wis-1 | [2024-01-26 04:40:50 +0000] [93] [DEBUG] WHISPER: Loading audio took 1266.142 ms
wis-1 | [2024-01-26 04:40:50 +0000] [93] [DEBUG] WHISPER: Feature extraction took 13.823 ms
wis-1 | [2024-01-26 04:40:50 +0000] [93] [DEBUG] WHISPER: Forcing language en
wis-1 | [2024-01-26 04:40:50 +0000] [93] [DEBUG] WHISPER: Using model tiny with beam size 1
wis-1 | [2024-01-26 04:40:50 +0000] [93] [DEBUG] Processing GPU batch 1 of expected 1
wis-1 | terminate called after throwing an instance of 'std::runtime_error'
wis-1 | what(): CUDA failed with error an illegal memory access was encountered
^CGracefully stopping... (press Ctrl+C again to force)

Thanks for the reply; let me know if there's anything else I can do to help!

~Mark

ceinstaller commented on July 28, 2024

I pulled early this week. Here's my process:

$ git clone https://github.com/toverainc/willow-inference-server.git && cd willow-inference-server

$ ./deps/ubuntu.sh (This will install Docker)

$ ./utils.sh install (Hurry up and wait)

$ ./utils.sh gen-cert [IP Address]

$ ./utils.sh run (For testing, once everything is working add '-d')

I'll make the changes after lunch and rebuild.

I'm here because of openHAB, so my AI kung fu is pretty weak. Is there a reason to run more than one model? If this works, should I slowly add models back in until it breaks, then run the most I can?

Thanks again,

~Mark

nikito commented on July 28, 2024

There's no real reason to run more than one model unless you want to use different models for different things. The server supports setting the desired model on each request, and if none is supplied it will use the default. Some people feed things to large while feeding other things to medium or small, for instance. The larger the model, the more compute needed per response, so the response time will increase. As a result, a user with a slightly slow GPU may want to send normal speech commands to medium to keep that sub-1-second response time, while using the large model for things like generating subtitles for audio, where they don't care if it takes a while because they don't need an immediate response. (That's a more advanced use case and probably not relevant for you :) )
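
If you ever want to try the per-request override, a rough sketch of what it could look like from a client is below. The /api/asr path comes up elsewhere in this thread, but the host/port, the "model" parameter name, and the content type are assumptions, so check the WIS docs/source for the real request format:

# Hypothetical per-request model selection against WIS (names are assumptions).
import requests

with open("speech.wav", "rb") as f:
    audio = f.read()

resp = requests.post(
    "https://wis.local:19000/api/asr",      # assumed host/port
    params={"model": "medium"},             # assumed parameter name
    data=audio,
    headers={"Content-Type": "audio/wav"},  # assumed content type
    verify=False,                           # self-signed cert from ./utils.sh gen-cert
)
print(resp.text)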

ceinstaller commented on July 28, 2024

Ah, excellent! Thank you for the speedy response, I'm hoping this works and gets me going.

~Mark

ceinstaller commented on July 28, 2024

OK, things appear to be working all around! I can use my ESP box to turn on a light using openHAB. (I am seeing an error in the WAS console, but I'll open another issue for that.)

When I run nvidia-smi now, I see that 2.5GB of VRAM is being used. Seeing as how I'm only using the large model, are there other settings I can tweak to increase performance? To use your example from above, can I increase concurrent_gpu_chunks to get more performance by using more unused VRAM?

Thanks,

~Mark

kristiankielhofner commented on July 28, 2024

To add to everything @nikito has said, the Whisper models WIS uses fundamentally operate on 30-second chunks, so batching (chunks) is really only relevant for speech segments longer than that.

While our defaults may seem "out of whack" for Willow use cases, we have a surprising number of users who utilize WIS with the WebRTC, POST to /api/asr, etc. functionality, and they often get tripped up on chunk-length batching. In VRAM-constrained environments it's an issue, but generally speaking it's a reasonable default.
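
To make the chunk arithmetic concrete, here's some back-of-the-envelope math, assuming only the 30-second chunk size above and that concurrent_gpu_chunks caps how many chunks go into a GPU batch (illustrative only, not WIS's actual scheduling code):

import math

CHUNK_SECONDS = 30  # Whisper's fixed window, per the comment above

def gpu_batches(duration_seconds: float, concurrent_gpu_chunks: int = 2) -> int:
    """Rough count of GPU batches a clip would need under these assumptions."""
    chunks = math.ceil(duration_seconds / CHUNK_SECONDS)
    return math.ceil(chunks / concurrent_gpu_chunks)

print(gpu_batches(4))      # typical Willow command: 1 chunk, 1 batch either way
print(gpu_batches(90, 2))  # 90 s of audio: 3 chunks -> 2 batches at the default
print(gpu_batches(90, 1))  # same clip with concurrent_gpu_chunks=1 -> 3 batches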
