Code Monkey home page Code Monkey logo

keyword-spotting-research-datasets's Introduction

keyword-spotting-research-datasets

Overview

Reference

This repository contains the license and instructions relative to the open Datasets mentioned in this publication:

Coucke A. et al., "Efficient keyword spotting using dilated convolutions and gating",

accepted for publication to ICASSP 2019.

Any publication must include a full citation to this paper.

A different version was used in Leroy D. et al., "Federated learning for keyword spotting", also accepted for publication to ICASSP 2019. Please mention which version you want access to in the contact form (see below).

Datasets

The wake word is "Hey Snips" pronounced with no pause between the two words. Both Data Sets contains a large variety of English accents and recording environments. Note that negative samples have been recorded in the same conditions than wake-word utterances, therefore arising from the same domain (speaker, hardware, environment, etc.).

The full datasets and their metadata are available for research purposes as mentioned in the LICENSE file. Although some keyword spotting datasets are freely available, such as the Speech Commands dataset for voice commands classification, there is no equivalent in the specific wake-word detection field. By establishing an open reference for wake-word detection, we hope to contribute to promote transparency and reproducibility in a highly concurrent field where datasets are often kept private.

The datasets are available upon requests as described in the Dataset access section below.

Please note that the statistics displayed below might not remain consistent with the datasets provided. Indeed, under the GDPR and since voice recordings constitute personal data, dataset contributors have the right to opt out, see the full License Terms for more details.

Dataset 1: Coucke et al., 2019, "Efficient keyword spotting using dilated convolutions and gating"

Positive data has been cleaned by automatically removing samples of extreme duration (1st and 99th percentiles), or samples with repeated occurrences of the wake word. Positive dev and test sets have been manually cleaned to discard any mispronunciations of the wake word (e.g. "Hi Snips" or "Hey Snaips"), leaving the training set untouched. Around 11K wake word utterances and 86.5K negative examples have been recorded.

Train Dev Test
Positive Utterances 5,876 2,504 25,88
Speakers 1,179 516 520
max / speaker 10 10 10
Negative Utterances 45,344 20,321 20,821
Speakers 3,330 1,474 1,469
max / speaker 30 30 30

Dataset 2: Leroy et al., 2019, "Federated learning for keyword spotting"

This crowdsourcing-induced data distribution mimicks a real-world non-i.i.d, unbalanced and highly distributed setting, and a parallel is drawn in the following work between a crowdsourcing contributor and a voice assistant user. The train, dev and test splits are built purposely using distinct users, 77% of users being used solely for training while the 23% remaining are used for parameter tuning and final evaluation. The dataset statistics are provided below

Train Dev Test Total
Utterances 53,991 8,337 7,854 69,582
Speakers 1,374 200 200 1,774

Format

The dataset archive contains the following files:

* train.json
* dev.json
* test.json
* audio_files
  * uuid-id-1.wav
  * uuid-id-2.wav
  * uuid-id-3.wav
  * uuid-id-4.wav
  * ...

The train, dev and test files contain the list of audios for each part of the split, along with metadata. Each entry in those lists has the following attributes:

  • id: a unique identifier for the entry
  • is_hotword: 1 if the audio is a "Hey Snips" utterance, 0 otherwise
  • worker_id: the unique identifier of the contributor - note that worker ids are not consistent across datasets 1 and 2 as they are re-generated for each one of them
  • duration: the duration of the audio file in seconds
  • audio_file_path: the relative path of the audio from the root of the directory eg audio_files/<audio-uuid>.

An example of such an entry is provided below:

  {
    "id": "40084ea8-c576-4dba-a20b-fbda61f1de7d"
    "is_hotword": 1, 
    "worker_id": 12, 
    "duration": 1.86, 
    "audio_file_path": "audio_files/40084ea8-c576-4dba-a20b-fbda61f1de7d.wav", 
  }

License summary

Use only for academic and/or research purposes. No commercial use. Publication permitted only if the Datasets are unmodified and subject to the same license terms. Any publication must include a full citation to the paper in which the datasets were initially published by Snips1:

Coucke A. et al., 2019, "Efficient keyword spotting using dilated convolutions and gating,

accepted for publication to ICASSP 2019.

Please read the full License Terms before accessing the Data Sets.

Dataset access

To access the data, please fill the following form:

https://forms.gle/JtmFYM7xK1SaMfZYA

You will be granted access shortly and will be provided with a temporary url to download it.

1 The Snips team has joined Sonos in November 2019. These open datasets remain available and their access is now managed by the Sonos Voice Experience Team. Please email [email protected] with any question.

keyword-spotting-research-datasets's People

Contributors

alicecouckesonos avatar choufractal avatar davidleroy avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

keyword-spotting-research-datasets's Issues

source code in the paper

Is there any chance that you could release the source code, especially for the data preprocessing in "FEDERATED LEARNING FOR KEYWORD SPOTTING". I am new to this area and I think I can learn a lot from your source code.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.