
Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS

License: Apache License 2.0

cuda deep-learning llama llm privacy speech-recognition speech-to-text text-to-speech vicuna webrtc

willow-inference-server's Introduction

Willow Inference Server

Watch the WIS WebRTC Demo

Willow Inference Server (WIS) is a focused and highly optimized language inference server implementation. Our goal is to "automagically" enable performant, cost-effective self-hosting of released state-of-the-art/best-of-breed models for speech and language tasks:

  • Primarily targeting CUDA with support for low-end (cheap) devices such as the Tesla P4, GTX 1060, and up. Don't worry - it screams on an RTX 4090 too! (See benchmarks). Can also run CPU-only.
  • Memory optimized - all three default Whisper (base, medium, large-v2) models loaded simultaneously with TTS support inside of 6GB VRAM. LLM support defaults to int4 quantization (conversion scripts included). ASR/STT + TTS + Vicuna 13B require roughly 18GB VRAM. Less for 7B, of course!
  • ASR. Heavy emphasis - Whisper optimized for very high quality as-close-to-real-time-as-possible speech recognition via a variety of means (Willow, WebRTC, POST a file, integration with devices and client applications, etc). Results in hundreds of milliseconds or less for most intended speech tasks.
  • TTS. Primarily provided for assistant tasks (like Willow!) and visually impaired users.
  • LLM. Optionally pass input through a provided/configured LLM for question answering, chatbot, and assistant tasks. Currently supports LLaMA derivatives with a strong preference for Vicuna (the author likes 13B). Built-in support for quantization to int4 to conserve GPU memory.
  • Support for a variety of transports: REST, WebRTC, and WebSockets (primarily for LLM).
  • Performance and memory optimized. Leverages CTranslate2 for Whisper support and AutoGPTQ for LLMs.
  • Willow support. WIS powers the Tovera-hosted, best-effort example server that Willow users enjoy.
  • Support for WebRTC - stream audio in real-time from browsers or WebRTC applications to optimize quality and response time. Heavily optimized for long-running sessions using WebRTC audio track management. Leave your session open for days at a time and have self-hosted ASR transcription within hundreds of milliseconds while conserving network bandwidth and CPU!
  • Support for custom TTS voices. With relatively small audio recordings WIS can create and manage custom TTS voices. See API documentation for more information.

With the goal of enabling democratization of this functionality, WIS will detect available CUDA VRAM, compute platform support, etc. and optimize and/or disable functionality automatically (currently in order: ASR, TTS, LLM). With all supported Whisper models (large-v2, medium, and base) loaded simultaneously, the current minimum supported hardware is a GTX 1060 3GB (6GB for ASR and TTS). User applications across all supported transports can programmatically select and configure Whisper models and parameters (model size, beam, language detection/translation, etc.) and TTS voices on a per-request basis, depending on the needs of the application, to balance speed and quality.

Note that we are primarily targeting CUDA - the performance, cost, and power usage of cheap GPUs like the Tesla P4 and GTX 1060 is too good to ignore. We'll make our best effort to support CPU wherever possible for current and future functionality but our emphasis is on performant latency-sensitive tasks even with low-end GPUs like the GTX 1070/Tesla P4 (as of this writing roughly $100 USD on the used market - and plenty of stock!).

Getting started

Dependencies (run once for initial install)

For CUDA support you will need the NVIDIA drivers for your hardware. We recommend NVIDIA driver version 530.

# Clone this repo:
git clone https://github.com/toverainc/willow-inference-server.git && cd willow-inference-server

# Ensure you have nvidia-container-toolkit and not nvidia-docker
# On Arch Linux:
yay -S libnvidia-container-tools libnvidia-container nvidia-container-toolkit docker-buildx

# Ubuntu:
./deps/ubuntu.sh

Install, configure, and start WIS

# Install
./utils.sh install

# Generate self-signed TLS cert (or place a "real" one at nginx/key.pem and nginx/cert.pem)
./utils.sh gen-cert [your hostname]

# Start WIS
./utils.sh run

Note that (like Willow) Willow Inference Server is very early and advancing rapidly! Users are encouraged to contribute (hence the build requirement). For the 1.0 release of WIS we will provide ready-to-deploy Docker containers.

Links and Resources

Willow: Configure Willow to use https://[your host]:19000/api/willow then build and flash.

WebRTC demo client: https://[your host]:19000/rtc

API documentation for REST interface: https://[your host]:19000/api/docs
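
For a quick end-to-end check of the REST interface, something like the Python sketch below can POST an audio file for transcription. The route, field, and parameter names here are assumptions for illustration only - confirm the real schema in the API documentation above.

# Hypothetical REST example - verify the actual route and parameters at /api/docs
import requests

url = "https://your-host:19000/api/asr"             # assumed route name
with open("speech.wav", "rb") as f:
    resp = requests.post(
        url,
        files={"audio_file": f},                    # assumed multipart field name
        params={"model": "medium", "beam_size": 1}, # assumed parameter names
        verify=False,                               # only if using the self-signed dev cert
    )
resp.raise_for_status()
print(resp.json())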

Configuration

System runtime can be configured by placing a .env file in the WIS root to override any variables set by utils.sh. You can also change more WIS-specific parameters by copying settings.py to custom_settings.py and editing the copy.
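
For example, a minimal custom_settings.py might only override a couple of values. This is a sketch: support_chatbot is referenced later in this README, while any other name shown is a placeholder - the authoritative names and defaults live in settings.py.

# custom_settings.py - start from a copy of settings.py and change only what you need
support_chatbot = False               # documented flag; see the LLM section below
# whisper_model_default = "medium"    # placeholder name; check settings.py for the real variable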

Windows Support

WIS has been successfully tested on Windows with WSL (Windows Subsystem for Linux). With ASR/STT only requiring a total of 4GB VRAM, WIS can run concurrently with standard Windows desktop tasks on GPUs with 8GB of VRAM.

Benchmarks

Device Model Beam Size Speech Duration (ms) Inference Time (ms) Realtime Multiple
RTX 4090 large-v2 5 3840 140 27x
RTX 3090 large-v2 5 3840 219 17x
H100 large-v2 5 3840 294 12x
H100 large-v2 5 10688 519 20x
H100 large-v2 5 29248 1223 23x
GTX 1060 large-v2 5 3840 1114 3x
Tesla P4 large-v2 5 3840 1099 3x
GTX 1070 large-v2 5 3840 742 5x
RTX 4090 medium 1 3840 84 45x
RTX 3090 medium 1 3840 140 27x
GTX 1070 medium 1 3840 424 9x
GTX 1070 medium 1 10688 564 18x
GTX 1070 medium 1 29248 1118 26x
GTX 1060 medium 1 3840 588 6x
Tesla P4 medium 1 3840 586 6x
RTX 4090 medium 1 29248 377 77x
RTX 3090 medium 1 29248 520 56x
GTX 1060 medium 1 29248 1612 18x
Tesla P4 medium 1 29248 1730 16x
GTX 1070 base 1 3840 70 54x
GTX 1070 base 1 10688 92 115x
GTX 1070 base 1 29248 195 149x
RTX 4090 base 1 180000 277 648x (not a typo)
RTX 3090 base 1 180000 435 414x (not a typo)
RTX 3090 tiny 1 180000 366 491x (not a typo)
GTX 1070 tiny 1 3840 46 82x
GTX 1070 tiny 1 10688 64 168x
GTX 1070 tiny 1 29248 135 216x
Threadripper PRO 5955WX tiny 1 3840 140 27x
Threadripper PRO 5955WX base 1 3840 245 15x
Threadripper PRO 5955WX small 1 3840 641 5x
Threadripper PRO 5955WX medium 1 3840 1614 2x
Threadripper PRO 5955WX large 1 3840 3344 1x

As you can see, the realtime multiple increases dramatically with longer speech segments. Note that these numbers will also vary slightly depending on broader system configuration (CPU, RAM, etc.).

When using WebRTC or Willow, end-to-end latency in the browser/Willow and supported applications is the inference time above plus network latency for the response - with the advantage that you can skip the "upload" portion entirely because audio is streamed in realtime!

We are very interested in working with the community to optimize WIS for CPU. We haven't focused on it because we consider medium beam 1 to be the minimum configuration for intended tasks and CPUs cannot currently meet our latency targets.

Comparison Benchmarks

Raspberry Pi benchmarks were run on a Raspberry Pi 4 4GB (Debian 11.7 aarch64) with faster-whisper version 0.5.1, a CanaKit 3 amp USB-C power adapter, and a fan. All models int8 with OMP_NUM_THREADS=4 and language set to en. Same methodology as the timings above with model load time excluded (WIS keeps models loaded). All inference time numbers rounded down. The maximum temperature reported by vcgencmd measure_temp was 57.9 °C.

Device Model Beam Size Speech Duration (ms) Inference Time (ms) Realtime Multiple
Pi tiny 1 3840 3333 1.15x
Pi base 1 3840 6207 0.62x
Pi medium 1 3840 50807 0.08x
Pi large-v2 1 3840 91036 0.04x

More coming soon!

CUDA

We understand the focus and emphasis on CUDA may be troubling or limiting for some users. We will provide additional CPU vs GPU benchmarks but spoiler alert: a $100 used GPU from eBay will beat the fastest CPUs on the market while consuming less power at SIGNIFICANTLY lower cost. GPUs are fundamentally different architecturally, and while there is admirable work being done with CPU-optimized projects such as whisper.cpp and CTranslate2, we believe that GPUs will maintain drastic speed, cost, and power advantages for the foreseeable future. That said, we are interested in getting feedback (and PRs!) from WIS users to make full use of CTranslate2 to optimize for CPU.

GPU Sweet Spot - May 2023

Perusing eBay and other used marketplaces, the GTX 1070 seems to offer the best performance/price ratio for ASR/STT and TTS while leaving VRAM room for the future. The author ordered an EVGA GTX 1070 FTW ACX3.0 for $120 USD with shipping and tax on 5/19/2023.

To support LLM/Vicuna an RTX 3090/4090 is suggested, with the RTX 3090 selling for approximately $800 as of this writing (5/23/2023).

LLM

WIS supports LLM on compatible CUDA devices with sufficient memory (varies depending on model selected).

From WIS root:

cp settings.py custom_settings.py

Edit custom_settings.py and set chatbot_model_path to an AutoGPTQForCausalLM compatible model path from Hugging Face (example provided). The model will be automatically downloaded, cached, and loaded from Hugging Face. Depending on the GPTQ format and configuration for your chosen model you may need to also change chatbot_model_basename. The various other parameters (temperature, top_p, etc) can also be set in custom_settings.py (defaults provided).

Make sure to set support_chatbot to True.
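
Putting those pieces together, a minimal chatbot configuration in custom_settings.py might look like the sketch below. The setting names come from this section; the model path, basename, and sampling values are placeholders to replace with your chosen GPTQ model.

# custom_settings.py (sketch) - placeholder values, not a tested configuration
support_chatbot = True
chatbot_model_path = "some-user/vicuna-13b-GPTQ"   # any AutoGPTQForCausalLM-compatible Hugging Face repo
chatbot_model_basename = "model"                   # depends on how the GPTQ files in that repo are named
temperature = 0.7                                  # optional generation parameters (defaults provided)
top_p = 0.95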

Then start/restart WIS.

Once loaded you can view the chatbot API documentation at https://[your host]:19000/api/docs.

WebRTC Tricks

The author has a long background with VoIP, WebRTC, etc. We deploy some fairly unique "tricks" to support long-running WebRTC sessions while conserving bandwidth and CPU. In between start/stop of audio record we pause (and then resume) the WebRTC audio track to bring bandwidth down to 5 kbps at 5 packets per second at idle while keeping response times low. This is done to keep ICE active and any NAT/firewall pinholes open while minimizing bandwidth and CPU usage. Did I mention it's optimized?

Start/stop of sessions and return of results uses WebRTC data channels.
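
The control flow is conceptually simple. The aiortc-flavored sketch below is not WIS's actual implementation - the recorder and transcribe stand-ins are placeholders - but it illustrates driving start/stop and returning results over a data channel:

from aiortc import RTCPeerConnection

# Hypothetical placeholders standing in for WIS's real recorder and ASR objects
class Recorder:
    def __init__(self):
        self.frames = []
    def start(self):
        self.frames = []                  # also the point where the paused audio track is resumed
    def stop(self):
        return b"".join(self.frames)      # ...and where the track is paused again to idle at ~5 kbps

def transcribe(audio: bytes) -> str:
    return "transcript goes here"         # stand-in for the Whisper call

recorder = Recorder()
pc = RTCPeerConnection()

@pc.on("datachannel")
def on_datachannel(channel):
    @channel.on("message")
    def on_message(message):
        # The client drives the session with simple control messages...
        if message == "start":
            recorder.start()
        elif message == "stop":
            # ...and the ASR result comes back in-band over the same data channel
            channel.send(transcribe(recorder.stop()))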

WebRTC Client Library

See the Willow TypeScript Client repo to integrate WIS WebRTC support into your own frontend.

Fun Ideas

  • Integrate WebRTC with Home Assistant dashboard to support streaming audio directly from the HA dashboard on desktop or mobile.
  • Desktop/mobile transcription apps (look out for a future announcement on this!).
  • Desktop/mobile assistant apps - Willow everywhere!

The Future (in no particular order)

Better TTS

We're looking for feedback from the community on preferred implementations, voices, etc. See the open issue.

TTS Caching

Why do it again when you're saying the same thing? Support on-disk caching of TTS output for lightning fast TTS response times.
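
As a rough illustration of the idea (not a committed design), the cache could be keyed on a hash of the voice and the input text:

import hashlib
from pathlib import Path

CACHE_DIR = Path("cache/tts")                      # hypothetical location

def cached_tts(text: str, voice: str, synthesize) -> bytes:
    # synthesize(text, voice) is whatever function actually produces audio bytes
    key = hashlib.sha256(f"{voice}:{text}".encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.wav"
    if path.exists():
        return path.read_bytes()                   # cache hit: skip inference entirely
    audio = synthesize(text, voice)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio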

Support for more languages

Meta released MMS on 5/22/2023, supporting over 1,000 languages across speech to text and text to speech!

Code refactoring and modularization

WIS is very early and we will refactor, modularize, and improve documentation well before the 1.0 release.

Chaining of functions (apps)

We may support user-defined modules to chain together any number of supported tasks within one request, enabling such things as:

Speech in -> LLM -> Speech out

Speech in -> Arbitrary API -> Speech out

...and more, directly in WIS!

willow-inference-server's People

Contributors

freethenation, kristiankielhofner, lachesis, nikito, richardklafter, stintel


willow-inference-server's Issues

Support GPU compute that doesn't require closed-source graphics drivers

From the README, you are actually giving a rationale for GPU compute generally when explaining the choice of CUDA:

Note that we are primarily targeting CUDA - the performance, cost, and power usage of cheap GPUs like the Tesla P4 and GTX 1060 is too good to ignore. We'll make our best effort to support CPU wherever possible for current and future functionality but our emphasis is on performant latency-sensitive tasks even with low-end GPUs like the GTX 1060/Tesla P4 (as of this writing roughly $100 USD on the used market - and plenty of stock!).

I understand that nVidia dominates the GPU market, and that CUDA is ubiquitous, but using CUDA puts your project in the ironic position of effectively requiring closed-source software on users' machines for usable performance, which would seem to be antithetical to Willow's stated ethos (and additionally asks users to give money to a company that's notoriously hostile to OSS).

I'm not really hugely familiar with the GPU compute landscape, but maybe a first step could be for the project to use HIPIFY to (partially?) automate the generation of HIP code, which would at least provide a performant alternative without requiring closed-source drivers and on hardware from a far more open-source friendly company - and AIUI HIP is supported on the open-source amdgpu driver.

I don't know what the picture is like for Intel GPU hardware, but finding a solution there would be great too, since their integrated graphics are ubiquitous and would presumably still be better than CPU-only?

There's also OpenCL of course but I don't know what the features/porting situation is like there.

Evaluate TTS Engines

SpeechT5 is included because it's in Transformers and it's an easy first pick for TTS.

There are several others (in no particular order):

Tortoise

Coqui

Toucan

MMS

Now that WIS has been released I'm very interested in feedback from the community to evaluate different engines, voices, etc so we can select the best default for future versions of WIS.

Add licence file/headers

There's no licence information I can find about this project overall, apart from it being described as "open source" in the about section. What is the overall licence, and can you add a licence file and licence headers to the source files please? Thanks!

Upon system reboot after using nohup ./utils.sh run, the nginx docker container does not come back, and the WIS container is stuck in restarting state

As the title says, I ran the utils.sh run command in the background, which successfully created the docker containers and had the app running. Given that the docker compose file had the container set to restart unless stopped, I figured if I rebooted the system as a test the containers would come back online on their own, but as it turns out only the WIS container came back, and the nginx one was no longer present in the container list. Seems as a result the WIS container was hung trying to restart.

I'll admit I'm not the most experienced docker user 😆 so I may be misunderstanding what is in the docker compose file. I did change the restart policy on both containers to "always" instead of "unless-stopped" and that made it so when I reboot they both come back online on their own without me having to log in and manually run the utils.sh. Just figured I'd call it out in case that is unintended; if it is intentional then please disregard 😄

EDIT: Figure this is understood, but this is on the wisng branch not the main branch.

Improve handling of recording data

We currently use libav python bindings for the handling of Opus frames -> WAV for Whisper. This is currently done via a file on disk, which is gross for a variety of reasons.

As a first pass we should at least try BytesIO or something.
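
A first pass along those lines might decode directly from memory with PyAV. Assumptions: PyAV >= 10 (where resample() returns a list of frames) and that the incoming audio arrives in a container PyAV can probe; the helper name is illustrative.

import io
import av
import numpy as np

def decode_in_memory(audio_bytes: bytes, rate: int = 16000) -> np.ndarray:
    # Decode straight from a BytesIO instead of writing a temp file to disk
    container = av.open(io.BytesIO(audio_bytes))
    resampler = av.AudioResampler(format="s16", layout="mono", rate=rate)
    chunks = []
    for frame in container.decode(audio=0):
        for out in resampler.resample(frame):
            chunks.append(out.to_ndarray())
    pcm = np.concatenate(chunks, axis=1).flatten()
    return pcm.astype(np.float32) / 32768.0        # float32 PCM in [-1, 1] for Whisper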

FR: force a specific language

As the title says. I was able to use detect_language true, but I would like to force a specific language - as usage now is a bit of a hit and miss. Of course it might be planned, i just would like it to happen wit a bit of prio hence the ticket

Better Recording

Currently for WebRTC ASR we do the following:

  • Client opens page
  • Media stream is started and recorder runs on server side
  • User clicks stop and the recorded file is sent to Whisper
  • Whisper sends ASR results back to client over data channel

Ideally what we would do is:

  • Client opens page
  • Media stream is started (for now)
  • User clicks start to start recording
  • User clicks stop and the recorded file is sent to Whisper
  • Whisper sends ASR results back to client over data channel
  • User can repeat start/stop as often as they would like

In this approach the user has better control of when the recording actually starts. This will work well when integrated with the speech microphone because we can trigger start/stop over the datachannel with record button press and release (push to talk, essentially). Not only that, they can record multiple statements in the same WebRTC session.

Publish Docker image

Is it possible to publish a Docker image so we can just docker run --gpus=all toverainc/willow-inference-server and have it work? That would make it trivial to launch our own server.

Scalability/performance

I think aiortc is the best approach from a latency standpoint - the media/audio is in memory until the moment Whisper is triggered and the audio is passed to it.

However, I'm skeptical that anything implemented in Python can handle large amounts of traffic. We're currently using a packetization interval of 20ms, which translates to at least 50pps (packets per second) per direction per media stream. I've seen this present a challenge with the most optimized C/C++ implementations for media handling. We also potentially have the option of adjusting the ptime parameters of Opus because two-way latency isn't an issue for us, and Opus implements forward error correction and packet loss concealment, so losing a packet representing a larger chunk of audio with a higher ptime isn't as much of a concern as it is with other codecs. We will also need to test whether aiortc supports FEC and PLC...

Larger ptimes also drastically reduce the amount of bandwidth consumed.
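
For reference, packets per second per direction is roughly 1000 / ptime_ms: 20 ms packetization gives the 50 pps above, while a 60 ms ptime (the largest single Opus frame duration) would drop that to roughly 17 pps - about a 3x reduction in per-packet overhead before touching the audio bitrate itself.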

Nothing that can't be done by throwing hardware at it, and we'll probably have GPU and other processing issues first anyway... Maybe?

Needed sudo to build

When running any docker command like in ./build.sh I needed to add sudo to make it work, e.g.:

DOCKER_BUILDKIT=1 sudo docker build -t willow-inference-server:"$TAG" .

Forgo the need for TLS certificates

TLS certificates kind of complicate the setup a bit. It would be good if the WIS either exposed both HTTPS and HTTP ports (so I could ignore the TLS port for an internal setup), or it just didn't use TLS at all, so I can run my own ingress in front.

I think that running TLS termination inside the container is too much coupling, as most users would either be running their own TLS, if they want their stuff accessible from the outside, be running a VPN, making TLS redundant, or just not be exposing the server to the internet.

Prune requirements

Now that we're actually using requirements.txt we should prune it to make sure we really need all of that (currently at 118 deps).

Allow users to configure functionality

Some users want TTS. Some want ASR. Some want both. Some want different models (large, medium, all, etc). Don't load functionality, models, etc the user doesn't want/need.

Tweak API responses for expected routes accordingly.

Add ability to use TTS with a provided voice

I've heard rumors of someone creating an implementation that can create embeddings from user provided voice input. If we could have an API endpoint (preferably in conjunction with nice UI) to allow users to do this I think that would be a very cool feature :).

Support docker-less installations

Hi there,

I guess my first comment is that GitHub discussions aren't enabled on this repo, so I'm putting this here as an issue 🙃.

I'm going to run this on a Proxmox server, and because I want to share the GPU across multiple services, a VM won't work well. Instead I would like to use an LXC.

Problem is, docker doesn't play nice with LXCs. So if there is a way to do a "bare metal" install (or one specifically designed for an LXC if you have a lot of Proxmox users) instead of docker, it should help a lot for those running other services such as Frigate NVR, Plex/Emby/Jellyfin, Stable Diffusion, Compreface, etc since LXC can share a GPU.

I hope this is a possibility, but I'll confess I'm not advanced enough to know how to make it possible.

Use traefik

Need to make a docker-compose for this project with Traefik to get a legit certificate from LE

Document issues with default docker userspace proxy and large numbers of ports

For the WebRTC dynamic media port range the default docker daemon settings will launch $NUMPORTS of userspace docker UDP proxies - which takes a long time and is a waste of resources. We should dramatically warn when they don't have /etc/docker/daemon.json:

{
    "userland-proxy": false,
    "iptables": true,
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

We may be able to get this via the Docker API but needs more research.

We should also update the default port range to ~10 ports or something so even if they miss the warning, don't know how to do it, or don't need more concurrent clients even default userspace UDP proxies aren't a big deal.

Handle "no speech" for ASR

When whisper doesn't get speech it tends to return "you" as the ASR infer output. We should check for this condition and other variants and return something more meaningful to the user.
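
A first pass could be as simple as comparing the normalized transcript against a small list of known no-speech artifacts. The exact strings below are illustrative and would need to be collected empirically.

from typing import Optional

NO_SPEECH_ARTIFACTS = {"you", "you.", "thank you.", ""}      # illustrative, not exhaustive

def clean_asr_output(text: str) -> Optional[str]:
    if text.strip().lower() in NO_SPEECH_ARTIFACTS:
        return None                                          # signal "no speech" to the caller
    return text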

Return progress from model?

Response times are already fantastic but I'm wondering if there's a way to provide output progress during model execution earlier via the datachannel.

Looking through the ctranslate2 docs it doesn't look like it's possible but keeping this issue here as a placeholder.

Publish as Home Assistant addon

The title says it all: having WIS as a Home Assistant addon will lower the barrier to entry for many people.

This project is amazing, keep up the amazing work!!

TTS Does not handle numbers in text

I tried testing the TTS using a generated text response from my HA instance as follows:
Currently, the weather is 55 degrees with partly cloudy skies. Under present weather conditions, the temperature feels like 55 degrees. In the next few hours you can expect more of the same, with a temperature of 55 degrees.

What I noticed is the TTS generated silence for the number 55, but spoke all the other text. Seems it does not know how to handle numeric values?

I also noticed it did similar when trying to report time, such as 8:55AM. I also didn't yet try it, but I imagine it may have similar trouble handling date strings as well. Maybe there's a way to have it handle these specific numeric formats?

EDIT: just tried the string "Today is Thursday, June 01 2023." and it was silent on all the numbers. Also tried "Today is Thursday, June 1st 2023." and it says "st" on the 1st part, silence on all other numbers.
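
One possible pre-processing step (an illustration, not a committed fix) is to expand digit runs into words before synthesis, for example with the num2words package; date and time formats would still need their own handling:

import re
from num2words import num2words     # third-party: pip install num2words

def spell_out_numbers(text: str) -> str:
    # Replace each run of digits with its spoken form so the TTS engine can pronounce it
    return re.sub(r"\d+", lambda m: num2words(int(m.group(0))), text)

# spell_out_numbers("the weather is 55 degrees") -> "the weather is fifty-five degrees"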

Download models now fails

+ '[' openai/whisper-large-v2 ']'
+ MODEL=openai/whisper-large-v2
++ echo openai/whisper-large-v2
++ sed -e s,/,-,g
+ MODEL_OUT=openai-whisper-large-v2
+ export CT2_VERBOSE=1
+ CT2_VERBOSE=1
+ export QUANT=float16
+ QUANT=float16
+ ct2-transformers-converter --force --model openai/whisper-large-v2 --quantization float16 --output_dir models/openai-whisper-large-v2
Downloading (…)lve/main/config.json: 100%|██████████| 1.99k/1.99k [00:00<00:00, 110kB/s]
Downloading pytorch_model.bin: 100%|██████████| 6.17G/6.17G [09:26<00:00, 10.9MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 800/800 [00:00<00:00, 171kB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 836k/836k [00:00<00:00, 1.14MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 2.20M/2.20M [00:00<00:00, 5.84MB/s]
Downloading (…)main/normalizer.json: 100%|██████████| 52.7k/52.7k [00:00<00:00, 1.70MB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 2.08k/2.08k [00:00<00:00, 480kB/s]
Traceback (most recent call last):
  /usr/local/bin/ct2-transformers-converter:8 in <module>
    sys.exit(main())
  /usr/local/lib/python3.8/dist-packages/ctranslate2/converters/transformers.py:942 in main
    converter.convert_from_args(args)
  /usr/local/lib/python3.8/dist-packages/ctranslate2/converters/converter.py:50 in convert_from_args
    return self.convert(
  /usr/local/lib/python3.8/dist-packages/ctranslate2/converters/converter.py:89 in convert
    model_spec = self._load()
  /usr/local/lib/python3.8/dist-packages/ctranslate2/converters/transformers.py:103 in _load
    tokenizer = self.load_tokenizer(
  /usr/local/lib/python3.8/dist-packages/ctranslate2/converters/transformers.py:127 in load_tokenizer
    return tokenizer_class.from_pretrained(model_name_or_path, **kwargs)
  /usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py:723 in from_pretrained
    return tokenizer_class_py.from_pretrained(...)
  /usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1811 in from_pretrained
    return cls._from_pretrained(
  /usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py:1965 in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  /usr/local/lib/python3.8/dist-packages/transformers/models/whisper/tokenization_whisper.py:293 in __init__
    with open(merges_file, encoding="utf-8") as merges_handle:

Building requires docker's buildx

Hello,

As I'm trying to build on an arch linux based installation with docker up and running, I was faced with the following error message:

ERROR: BuildKit is enabled but the buildx component is missing or broken.
       Install the buildx component to build images with BuildKit:
       https://docs.docker.com/go/buildx/

Installing the docker-buildx package fixed it, but it might be worth mentioning it in the README.

Use dictation devices for audio input when available

On the client side (Electron/browser) our WebHID lib supports various dictation microphones.

When one of these devices is available as an audio source we should enforce/select usage of it for audio input.

Because we initialize the lib early and know if we have a supported device we can probably use that supported device type to select the audio source device (or even just regex match SpeechMike or PowerMic).

CPU Only Optimization

Later this week I will be receiving my test configuration for CPU only mode. I will be addressing:

  • More conditionals for run.sh and others for systems that don't have GPUs
  • Performance improvements on CPU only platforms
  • Probably a few other random things!

Unable to connect to WIS RTC demo Locally

Hello,
Managed to spin up an instance of WIS, and everything appears to be starting up correctly. However when I try to access the RTC page, it seems to try to make a connection but does not get any further. Here is the log data I can see:
iceConnectionLog disconnected signalingLog complete iceConnectionLog checking localDescription offer { "type": "offer", "sdp": "v=0\r\no=- 2102021303406714890 2 IN IP4 127.0.0.1\r\ns=-\r\nt=0 0\r\na=group:BUNDLE 0 1\r\na=extmap-allow-mixed\r\na=msid-semantic: WMS\r\nm=audio 64444 UDP/TLS/RTP/SAVPF 111 63 9 0 8 13 110 126\r\nc=IN IP4 redacted\r\na=rtcp:9 IN IP4 0.0.0.0\r\na=candidate:2850786284 1 udp 2122260223 192.168.17.1 64442 typ host generation 0 network-id 1\r\na=candidate:775075476 1 udp 2122194687 192.168.84.1 64443 typ host generation 0 network-id 3\r\na=candidate:3114883702 1 udp 2122129151 192.168.1.73 64444 typ host generation 0 network-id 2 network-cost 10\r\na=candidate:1655339575 1 udp 1685921535 redacted 64444 typ srflx raddr 192.168.1.73 rport 64444 generation 0 network-id 2 network-cost 10\r\na=candidate:3609487732 1 tcp 1518280447 192.168.17.1 9 typ host tcptype active generation 0 network-id 1\r\na=candidate:1358779404 1 tcp 1518214911 192.168.84.1 9 typ host tcptype active generation 0 network-id 3\r\na=candidate:3345397998 1 tcp 1518149375 192.168.1.73 9 typ host tcptype active generation 0 network-id 2 network-cost 10\r\na=ice-ufrag:3Tc7\r\na=ice-pwd:wcTlKOSBA2XDTkqkNxd/7hJv\r\na=ice-options:trickle\r\na=fingerprint:sha-256 45:DB:A6:1A:5D:72:A9:76:1A:F4:66:84:67:F2:39:A8:F4:D8:0A:C9:98:E1:A2:6F:C5:6B:FC:D2:B2:F8:16:1D\r\na=setup:actpass\r\na=mid:0\r\na=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level\r\na=extmap:2 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time\r\na=extmap:3 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01\r\na=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid\r\na=sendrecv\r\na=msid:- 568268e5-89b2-4660-a7b5-2b7de2060848\r\na=rtcp-mux\r\na=rtpmap:111 opus/48000/2\r\na=rtcp-fb:111 transport-cc\r\na=fmtp:111 minptime=10;useinbandfec=1\r\na=rtpmap:63 red/48000/2\r\na=fmtp:63 111/111\r\na=rtpmap:9 G722/8000\r\na=rtpmap:0 PCMU/8000\r\na=rtpmap:8 PCMA/8000\r\na=rtpmap:13 CN/8000\r\na=rtpmap:110 telephone-event/48000\r\na=rtpmap:126 telephone-event/8000\r\na=ssrc:3526239653 cname:Kwlyo7DLJDeRJm5R\r\na=ssrc:3526239653 msid:- 568268e5-89b2-4660-a7b5-2b7de2060848\r\nm=application 64447 UDP/DTLS/SCTP webrtc-datachannel\r\nc=IN IP4 redacted\r\na=candidate:2850786284 1 udp 2122260223 192.168.17.1 64445 typ host generation 0 network-id 1\r\na=candidate:775075476 1 udp 2122194687 192.168.84.1 64446 typ host generation 0 network-id 3\r\na=candidate:3114883702 1 udp 2122129151 192.168.1.73 64447 typ host generation 0 network-id 2 network-cost 10\r\na=candidate:1655339575 1 udp 1685921535 redacted 64447 typ srflx raddr 192.168.1.73 rport 64447 generation 0 network-id 2 network-cost 10\r\na=candidate:3609487732 1 tcp 1518280447 192.168.17.1 9 typ host tcptype active generation 0 network-id 1\r\na=candidate:1358779404 1 tcp 1518214911 192.168.84.1 9 typ host tcptype active generation 0 network-id 3\r\na=candidate:3345397998 1 tcp 1518149375 192.168.1.73 9 typ host tcptype active generation 0 network-id 2 network-cost 10\r\na=ice-ufrag:3Tc7\r\na=ice-pwd:wcTlKOSBA2XDTkqkNxd/7hJv\r\na=ice-options:trickle\r\na=fingerprint:sha-256 45:DB:A6:1A:5D:72:A9:76:1A:F4:66:84:67:F2:39:A8:F4:D8:0A:C9:98:E1:A2:6F:C5:6B:FC:D2:B2:F8:16:1D\r\na=setup:actpass\r\na=mid:1\r\na=sctp-port:5000\r\na=max-message-size:262144\r\n" } iceGatheringLog complete iceGatheringLog gathering signalingLog new added track to peer connection

When I open the debug console, I see this response from asr:
{ "sdp": "v=0\r\no=- 3893853050 3893853050 IN IP4 0.0.0.0\r\ns=-\r\nt=0 0\r\na=group:BUNDLE 0 1\r\na=msid-semantic:WMS *\r\nm=audio 10035 UDP/TLS/RTP/SAVPF 111\r\nc=IN IP4 172.17.0.2\r\na=recvonly\r\na=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level\r\na=extmap:4 urn:ietf:params:rtp-hdrext:sdes:mid\r\na=mid:0\r\na=msid:36c53f4f-8f1f-458c-a189-9c1747570917 33ffaf28-0aeb-4bec-9886-fdb0d7e452c3\r\na=rtcp:9 IN IP4 0.0.0.0\r\na=rtcp-mux\r\na=ssrc:2402540680 cname:5246c902-d951-4bbd-b3ab-2f4ade706e73\r\na=rtpmap:111 opus/48000/2\r\na=candidate:9333c84bcc1b0bf56713df9036e6b4d9 1 udp 2130706431 172.17.0.2 10035 typ host\r\na=candidate:c58f5770074e5a6227e87732712d9300 1 udp 1694498815 redacted 10035 typ srflx raddr 172.17.0.2 rport 10035\r\na=end-of-candidates\r\na=ice-ufrag:qcEp\r\na=ice-pwd:IQ2ymsMj77ZqsnuWS6As6h\r\na=fingerprint:sha-256 6F:B3:6F:97:62:8A:7B:01:8D:A9:4E:0D:D5:B4:D9:E0:B4:99:97:DF:85:53:BA:B5:AE:07:54:5C:1F:BB:4C:32\r\na=setup:active\r\nm=application 10035 UDP/DTLS/SCTP webrtc-datachannel\r\nc=IN IP4 172.17.0.2\r\na=mid:1\r\na=sctp-port:5000\r\na=max-message-size:65536\r\na=candidate:9333c84bcc1b0bf56713df9036e6b4d9 1 udp 2130706431 172.17.0.2 10035 typ host\r\na=candidate:c58f5770074e5a6227e87732712d9300 1 udp 1694498815 redacted 10035 typ srflx raddr 172.17.0.2 rport 10035\r\na=end-of-candidates\r\na=ice-ufrag:qcEp\r\na=ice-pwd:IQ2ymsMj77ZqsnuWS6As6h\r\na=fingerprint:sha-256 6F:B3:6F:97:62:8A:7B:01:8D:A9:4E:0D:D5:B4:D9:E0:B4:99:97:DF:85:53:BA:B5:AE:07:54:5C:1F:BB:4C:32\r\na=setup:active\r\n", "type": "answer" }

Let me know if any other info would be helpful here, thanks! :)

Make sure we free dynamic media ports on client disconnect/disappear

Given the nature of webrtc and the support for those clients the ability for AIA to support $NUM clients is dependent on having free dynamic media ports available. We need to make sure when a datachannel connection closes that we free it and the port it has allocated for the multiplexed UDP session.

"Just do it" install script

When we have the infrastructure in place (images pushed to Docker Hub, etc) an inexperienced user should be able to run a single command and get up and running - possibly without even cloning the repo.

We hate it but possibly even the classic curl | bash one liners you see all over the internet...
