dataperf-speech-example

Example workflow for our data-centric speech benchmark

Terminology

  • Keyword spotting model (KWS model): Also called a wakeword, hotword, or voice trigger detection model, this is a small ML speech model designed to recognize a small vocabulary of spoken words or phrases (e.g., the wake words for Siri, Google Voice Assistant, or Alexa).
  • Target sample: A 1-second audio clip of a keyword, used to train or evaluate a keyword-spotting model.
  • Nontarget sample: A 1-second audio clip of a word outside the KWS model's vocabulary, used to train the model or to measure its ability to minimize false positive detections on non-keywords.
  • MSWC dataset: the Multilingual Spoken Words Corpus, a dataset of spoken words covering more than 340,000 keywords in 50 languages.
  • Embedding vector representation: An n-dimensional vector that provides a feature representation of a spoken word's audio. We have trained a large classifier on keywords in MSWC, and we provide a 1024-element feature vector taken from the penultimate layer of the classifier. Other embeddings, such as wav2vec2, are also available. [TODO: we may provide a flag for users to select which embedding they wish to use for training and evaluation, or we may restrict to only one embedding - TBD]

Files

  • Input to selection algorithm: samples.pb - a protocol buffer encoded file of target and nontarget keyword samples. For each sample we provide an embedding representation and the corresponding sample ID (i.e., the audio file name) from the MSWC dataset. Your training set selection algorithm will choose a subset of these embedding vectors that maximizes a simple classifier's performance.
  • Output from selection algorithm: train.npz - a NumPy .npz file containing the selected subset of embedding vectors used to train the classifier.
  • Input to eval.py:
    • train.npz: the selected embedding vectors used to train a classifier
    • eval.pb: a protocol buffer encoded file of test samples which are distinct from the training samples in samples.pb - this is the dataset we use to compute the classifier's score

On the evaluation server, we will have distinct, hidden samples.pb and eval.pb files using different keywords in different languages, in order to calculate the official score for our leaderboard. [TODO: provide link to scoring function]
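The scoring function and the classifier used by eval.py are not described in this document. Purely as an illustration of what "a simple classifier trained on selected embeddings and scored on held-out embeddings" can look like, here is a minimal sketch that assumes a nearest-centroid classifier and accuracy as the metric. The function name, the labels, and the synthetic data are all hypothetical and are not taken from eval.py.

```python
import numpy as np


def nearest_centroid_accuracy(train_x, train_y, eval_x, eval_y):
    """Fit one centroid per class on the training embeddings, then report
    the fraction of evaluation embeddings assigned to the correct class.

    Assumption: this is a stand-in classifier, not the one in eval.py.
    """
    classes = np.unique(train_y)
    # One mean embedding (centroid) per class label.
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every eval sample to every centroid.
    dists = ((eval_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == eval_y).mean())


# Synthetic stand-in data: label 1 = target keyword, label 0 = nontarget.
rng = np.random.default_rng(0)
dim = 1024  # matches the 1024-element embedding described above
target_mean = np.full(dim, 0.5)
train_x = np.concatenate([rng.normal(target_mean, 1.0, (20, dim)),
                          rng.normal(0.0, 1.0, (20, dim))])
train_y = np.array([1] * 20 + [0] * 20)
eval_x = np.concatenate([rng.normal(target_mean, 1.0, (50, dim)),
                         rng.normal(0.0, 1.0, (50, dim))])
eval_y = np.array([1] * 50 + [0] * 50)

accuracy = nearest_centroid_accuracy(train_x, train_y, eval_x, eval_y)
print(f"accuracy on synthetic eval set: {accuracy:.2f}")
```

A loop like this can be useful for comparing selection strategies locally, but only the real eval.py reflects how the server scores a submission.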

Developing a custom training set selection algorithm

Edit the function select() in selection/selection.py to include your custom training set selection algorithm.
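The actual signature of select() is defined in selection/selection.py and is not reproduced here. As one possible starting point, the sketch below shows a simple heuristic: keep the embeddings closest to the mean embedding. The function name select_closest_to_centroid, its parameters, and the budget argument are hypothetical; adapt the idea to whatever inputs and return type select() really uses.

```python
import numpy as np


def select_closest_to_centroid(embeddings, budget):
    """Return the indices of the `budget` embeddings nearest to the mean
    embedding, ordered from nearest to farthest.

    Assumption: `embeddings` is a 2-D array-like with one row per sample.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    if budget < 0:
        raise ValueError("budget must be non-negative")
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    # A stable sort keeps tied samples in their original order.
    order = np.argsort(distances, kind="stable")
    return order[:budget]


# Toy 2-D example: three points cluster near the origin, one is an outlier.
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
chosen = select_closest_to_centroid(toy, budget=3)
print(sorted(chosen.tolist()))  # the outlier at index 3 is left out
```

Centroid distance is only a baseline; it tends to discard unusual but informative samples, so it is worth comparing against other strategies with eval.py.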

If your code has additional dependencies, make sure to edit requirements.txt and/or the Dockerfile to include these. Please make sure not to change the behavior of selection/main.py or the docker entrypoint (this is how we automate evaluation on the server).

You can run your selection algorithm locally (outside of docker) with the following command:

python -m selection.main --input_samples path/to/samples.pb --outdir=.

This will write out train.npz in your current directory (you can change this by specifying a different --outdir).
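To check what the selection step wrote, the archive can be opened with NumPy. The sketch below lists each stored array with its shape and dtype. It makes no assumption about the array names inside the real train.npz; the names used in the demonstration archive (example_vectors, example_ids) are placeholders invented for this example.

```python
import os
import tempfile

import numpy as np


def describe_npz(path):
    """Map each array name stored in an .npz archive to (shape, dtype)."""
    # np.load returns a lazy archive object; the context manager closes it.
    with np.load(path) as archive:
        return {name: (archive[name].shape, str(archive[name].dtype))
                for name in archive.files}


# Demonstration on a throwaway archive with placeholder array names.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "train.npz")
    np.savez(demo_path,
             example_vectors=np.zeros((4, 1024), dtype=np.float32),
             example_ids=np.array(["a", "b", "c", "d"]))
    summary = describe_npz(demo_path)

for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape}, dtype={dtype}")
```

To inspect your own output, call describe_npz("train.npz") from the directory passed as --outdir. If the archive stores object arrays, np.load will refuse to read them unless allow_pickle=True is passed.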

Creating a submission

Once you have implemented your selection algorithm, build a new version of your submission container:

docker build -t dataperf-speech-submission:latest .

Test your submission container before submitting to the evaluation server. To do so, first create a working directory for loading samples.pb. This will also be the destination for the docker container to write out the train.npz array containing your selected embedding vectors (used to train the classifier in eval.py in the evaluation step).

mkdir workdir
cp ~/path/to/samples.pb workdir/

Then run your selection algorithm within the docker container:

docker run --rm -u $(id -u):$(id -g) --network none -v $(pwd)/workdir:/workdir -it dataperf-speech-submission:latest --input_samples /workdir/samples.pb --outdir=/workdir

There are several flags to note:

  • -u $(id -u):$(id -g): This flag runs the container with your user and group IDs, so that the selection output (train.npz) is written to disk as your user instead of as root.
  • -v $(pwd)/workdir:/workdir: This mounts a volume, specifying the working directory used to read in samples.pb and write out train.npz. You can point this to another location, but if you change the mapped name (/workdir), be sure to reflect this in the entrypoint arguments as well (--input_samples /workdir/samples.pb --outdir=/workdir).
  • --network none: Your submission docker container will not have network access during evaluation on the server. This prevents exposure of our hidden evaluation keywords.
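If you prefer to launch the container from a script, the flags above can be assembled in Python. This is a sketch under a few assumptions: the helper name build_docker_command and its parameters are hypothetical, it relies on os.getuid and os.getgid (available on Unix-like systems only), and it omits -it because a scripted run does not need an interactive terminal.

```python
import os
import shlex


def build_docker_command(workdir, image="dataperf-speech-submission:latest",
                         container_dir="/workdir"):
    """Assemble the `docker run` argument list described above."""
    host_dir = os.path.abspath(workdir)
    return [
        "docker", "run", "--rm",
        # Run as the calling user so train.npz is not owned by root.
        "-u", f"{os.getuid()}:{os.getgid()}",
        # No network access, mirroring the evaluation server.
        "--network", "none",
        # Mount the host working directory inside the container.
        "-v", f"{host_dir}:{container_dir}",
        image,
        # Entrypoint arguments must use the container-side path.
        "--input_samples", f"{container_dir}/samples.pb",
        f"--outdir={container_dir}",
    ]


command = build_docker_command("workdir")
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Deriving both entrypoint arguments from container_dir keeps the mount point and the paths in sync if you rename the mapped directory.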

Finally, test the evaluation script on your selection algorithm's output (we will use the same eval.py script on the server, but with different, hidden samples.pb and eval.pb datasets):

python eval.py --eval_file=path/to/eval.pb --train_file=workdir/train.npz

Submitting to the evaluation server

[TODO]
