$${\huge \color{pink}🤫hush}$$

Silent Whisper inference for privacy and performance.

Current speech-to-text wrappers tend to require audio input, even though models such as Whisper operate on mel spectrograms internally, not on raw audio.

This has drawbacks: audio must be sent from the user's device to the server, and where that is not possible the implementation is restricted to running locally.

hush uses quantized 8-bit grayscale images, not audio.

As well as helping to prevent leakage of identifiable information, this approach simplifies voice activity detection, caching, storage / retrieval and bandwidth considerations by removing audio signal processing and audio payloads from the pipeline.
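As a rough illustration of what "quantized 8-bit grayscale" can mean, the sketch below maps a 2D array of mel energies to pixel values in 0..255 and serialises them as a PGM image. The min-max scaling and the PGM container are assumptions made for this example; the source does not state which scaling or image format hush actually uses, and `quantize_to_u8` and `to_pgm` are hypothetical names.

```python
def quantize_to_u8(mel):
    """Map a 2D list of floats (e.g. log-mel energies) to integers in 0..255.

    Assumption: simple min-max scaling over the whole spectrogram.
    """
    lo = min(min(row) for row in mel)
    hi = max(max(row) for row in mel)
    span = hi - lo
    if span == 0:
        # A constant spectrogram carries no contrast; map everything to 0.
        return [[0 for _ in row] for row in mel]
    return [[round((v - lo) / span * 255) for v in row] for row in mel]


def to_pgm(pixels):
    """Serialise 8-bit pixels as a binary PGM (P5) grayscale image."""
    height, width = len(pixels), len(pixels[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(v for row in pixels for v in row)
```

Each spectrogram value then costs one byte, and the payload no longer contains a waveform that can be played back directly.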

For more background on how mel spectrograms are generated and used, see wavey-ai/mel-spec

To run inference, hush uses a fork of the brilliant whisper-burn, which uses Rust's burn-rs Deep Learning framework and tch-rs (Rust bindings for the C++ API of PyTorch). The fork provides a mel API, exposes whisper-burn as a service, and configures a CUDA backend.

demo

Chrome is required as the demo currently uses SIMD instructions.

https://hush.wavey.ai

The demo UI has the following components:

  • non-blocking WASM workers and Audio Worklets that convert audio (from file or microphone) into mel spectrograms on a stream with ultra-low latency
  • ultra-low latency voice activity detection that works by applying Sobel edge detection to spectrograms. This is used to determine where to segment streaming audio for transcription (ideally always cutting between words, not in the middle of a word).
  • real-time visualisations on canvas
  • a client that sends audio segments as images to the AWS service running Whisper on GPU and receives the transcribed text back
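The edge-detection idea behind the voice activity detector can be sketched as follows. This is not the demo's implementation (that runs in WASM workers); it only shows a Sobel operator applied to a grayscale spectrogram, with rows as mel bins and columns as time frames. Averaging edge strength per frame and comparing it to a threshold is an assumption for illustration, and `sobel_magnitude`, `active_frames` and `threshold` are hypothetical names.

```python
def sobel_magnitude(img):
    """Return the Sobel gradient magnitude for each interior pixel of a 2D list."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            # Vertical gradient: row below minus row above.
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def active_frames(img, threshold):
    """Flag each time frame (column) whose mean edge strength exceeds threshold."""
    edges = sobel_magnitude(img)
    h = len(edges)
    return [sum(edges[y][x] for y in range(h)) / h > threshold
            for x in range(len(edges[0]))]
```

Runs of inactive frames are then candidate points at which to cut a segment, since speech tends to produce strong edges in the spectrogram while silence produces few.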

Note that the demo runs significantly faster with the Dev Tools console closed.

running

The server starts accepting connections immediately and loads models in the background. To keep cold starts quick, the tiny_en model is always loaded first and serves requests until a larger model is ready; after that, requests are routed to the largest model available. TODO: Make all this configurable, and allow the model to be specified in the request.

 INFO  hush > hush server listening on 0.0.0.0:1337
 INFO  hush > loading model "tiny_en"
 INFO  hush > loading model "medium_en"
 INFO  cached_path::cache > Cached version of https://huggingface.co/gpt2/resolve/main/tokenizer.json is up-to-date
 INFO  cached_path::cache > Cached version of https://huggingface.co/gpt2/resolve/main/tokenizer.json is up-to-date
 INFO  hush               > "tiny_en" loaded in 9 secs
 INFO  hush               > "medium_en" loaded in 109 secs
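The routing rule can be expressed as a small selection function. This is a sketch of the behaviour described here, not code from the server (which is written in Rust); `MODEL_ORDER` and `pick_model` are hypothetical names, and the ranking contains only the two models that appear in the log.

```python
# Hypothetical size ranking, smallest to largest.
MODEL_ORDER = ["tiny_en", "medium_en"]


def pick_model(loaded):
    """Return the largest model that has finished loading, or None if none have."""
    for name in reversed(MODEL_ORDER):
        if name in loaded:
            return name
    return None
```

With the timings in the log, requests arriving between 9 and 109 seconds after start-up would be served by tiny_en, and later ones by medium_en.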

Any GET request will return a simple status:

{"done":3,"models":1,"queue":0}

done: number of completed requests
queue: number of pending requests
models:
    0 = none loaded
    1 = tiny_en loaded
    2 = medium_en loaded
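A client could decode this status as in the sketch below. The field names and the meaning of `models` come from the listing above; `MODELS_STATE` and `describe_status` are hypothetical names introduced for the example.

```python
import json

# Meaning of the "models" field, as listed above.
MODELS_STATE = {0: "none loaded", 1: "tiny_en loaded", 2: "medium_en loaded"}


def describe_status(body):
    """Turn the JSON status body into a one-line human-readable summary."""
    status = json.loads(body)
    state = MODELS_STATE.get(status["models"], "unknown")
    return f'{status["done"]} done, {status["queue"]} queued, {state}'
```

The body itself can be fetched with any HTTP client, for example `urllib.request.urlopen`, assuming the server is reachable on the port shown in the log (1337).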

deployment

The included ami.sh creates an image with GPU support for running on NVIDIA T4 Tensor Core instances. A public AMI will be provided soon.

The CloudFormation template creates an Auto Scaling Group that requests a g4dn.xlarge spot instance and exposes the demo API on https://hush.wavey.ai.

The same template creates the demo UI, GitHub auth and a self-updating AWS CodePipeline project that applies infrastructure changes via the template alongside any code changes in the repo.

The initial deployment should be done from a local machine via the make app task; this creates the CodePipeline pipeline.

Full instructions: TODO.

TODO

This is very much a POC and a WIP.

  • Fix WASM content type with a cloud function
  • Traffic light status on UI for GPU spot instance: up/down/provisioning
  • Add real-time metrics to API and visibility in UI
  • Support for Safari, non-SIMD version.
  • Support Web GPU (AWS G4ad instance w/AMD Radeon Pro V520 GPU)
  • Admin UI
  • Add auth to EC2 service
  • WebRTC Data Channel API
  • Load medium_en model by default
  • Allow any audio format to be uploaded, resampling as required
  • Clients for mobile

Contributors

jbrough
