
Free Spoken Digit Dataset (FSDD)

A simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginnings and ends.

FSDD is an open dataset, which means it will grow over time as data is contributed. In order to enable reproducibility and accurate citation, the dataset is versioned using Zenodo DOIs as well as git tags.

Current status

  • 6 speakers
  • 3,000 recordings (50 of each digit per speaker)
  • English pronunciations

Organization

Files are named in the following format: {digitLabel}_{speakerName}_{index}.wav
Example: 7_jackson_32.wav
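
The naming scheme makes the label, speaker, and take recoverable directly from the filename. A minimal sketch (the helper name is ours, not part of the repository):

import os

def parse_fsdd_filename(path):
    # "7_jackson_32.wav" -> (7, "jackson", 32)
    name = os.path.splitext(os.path.basename(path))[0]
    digit, speaker, index = name.split("_")
    return int(digit), speaker, int(index)

print(parse_fsdd_filename("7_jackson_32.wav"))  # (7, 'jackson', 32)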

How to use with Hub

A simple way of using this dataset is with Activeloop's Python package Hub!

First, run pip install hub (or pip3 install hub).

import hub
ds = hub.load("hub://activeloop/spoken_mnist")

# check out the first spectrogram, its label, and who spoke it!
import matplotlib.pyplot as plt
plt.imshow(ds.spectrograms[0].numpy())
plt.title(f"{ds.speakers[0].data()} spoke {ds.labels[0].numpy()}")
plt.show()

# train a model in PyTorch
for sample in ds.pytorch():
    pass  # ... model code here ...

# train a model in TensorFlow
for sample in ds.tensorflow():
    pass  # ... model code here ...

Available tensors can be shown by printing the dataset:

print(ds)
# prints: Dataset(path='hub://activeloop/spoken_mnist', tensors=['spectrograms', 'labels', 'audio', 'speakers'])

For more information, check out the Hub documentation.

Contributions

Please contribute your homemade recordings. All recordings should be mono 8 kHz WAV files, trimmed to have minimal silence. Don't forget to update metadata.py with the speaker metadata.

To add your data, follow the recording instructions in acquire_data/say_numbers_prompt.py and then run split_and_label_numbers.py to make your files.
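
If your recording setup produces stereo audio or a different sample rate, one way to get a take into the expected shape before running the scripts above is with librosa and soundfile (a sketch, not part of the repository; the filenames are placeholders and the trim threshold is a judgment call):

import librosa
import soundfile as sf

# load the raw take as mono, resampled to 8 kHz
y, sr = librosa.load("raw_take.wav", sr=8000, mono=True)
# trim leading/trailing silence; top_db controls how aggressive the trim is
y, _ = librosa.effects.trim(y, top_db=30)
# write 16-bit PCM using the {digitLabel}_{speakerName}_{index}.wav scheme
sf.write("7_myname_0.wav", y, sr, subtype="PCM_16")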

Metadata

metadata.py contains metadata regarding the speakers' gender and accent.

Included utilities

trimmer.py Trims silence at the beginning and end of an audio file, and splits an audio file into multiple files by periods of silence.

fsdd.py A simple class that provides an easy-to-use API to access the data.

spectogramer.py Used for creating spectrograms of the audio data. Spectrograms are often a useful pre-processing step.
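
As an illustration of the idea (this is not spectogramer.py itself, just a sketch using matplotlib's built-in spectrogram):

import matplotlib.pyplot as plt
from scipy.io import wavfile

# read one recording and render its spectrogram to an image
sr, samples = wavfile.read("recordings/7_jackson_32.wav")
plt.specgram(samples, Fs=sr, NFFT=256, noverlap=128)
plt.xlabel("time (s)")
plt.ylabel("frequency (Hz)")
plt.savefig("7_jackson_32.png")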

Usage

The test set officially consists of the first 10% of the recordings. Recordings numbered 0-4 (inclusive) are in the test set, and recordings numbered 5-49 are in the training set.
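
In code, that split can be reproduced directly from the filenames (a sketch, assuming the recordings live in the standard recordings/ directory):

import glob

train, test = [], []
for path in glob.glob("recordings/*.wav"):
    # the trailing number in {digit}_{speaker}_{index}.wav is the take index
    index = int(path.rsplit("_", 1)[1].split(".")[0])
    (test if index <= 4 else train).append(path)
print(len(train), len(test))  # should come out to a 90/10 split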

Made with FSDD

Did you use FSDD in a paper, project or app? Add it here!

External tools

License

Creative Commons Attribution-ShareAlike 4.0 International

Contributors

adhishthite, antgeorge, cesarsouza, david-gerard, dependabot[bot], eonu, epochdv, experimenti, farizrahman4u, felixdollack, jakobovski, madtracki, mikayelh, pssf23, speechwrecko, verbose-void, yuxinpan, yweweler

Issues

Add other languages

Hi! Would you consider adding the Serbian language to the dataset? I am interested in contributing my voice and as many others as I can gather. I suppose this would also be simpler to accomplish if we could gather audio online using an automated website.

DOI

It would be nice to have a DOI for the dataset instead of relying on just git tags.
I would suggest using Zenodo, as they are integrated with GitHub and support multiple versions of the same dataset.

DatasetCorruptError: The HEAD node of the branch main of this dataset is in a corrupted state and is likely not recoverable.

Hi,

I have been playing around with the dataset through the hub API.

I just started to get the following error message:

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/deeplake/core/storage/s3.py in get_bytes(self, path, start_byte, end_byte)
    274         try:
--> 275             return self._get_bytes(path, start_byte, end_byte)
    276         except botocore.exceptions.ClientError as err:

20 frames
ClientError: An error occurred (InternalError) when calling the GetObject operation (reached max retries: 4): We encountered an internal error.  Please retry the operation again later.

During handling of the above exception, another exception occurred:

ClientError                               Traceback (most recent call last)
ClientError: An error occurred (InternalError) when calling the GetObject operation (reached max retries: 4): We encountered an internal error.  Please retry the operation again later.

During handling of the above exception, another exception occurred:

S3GetError                                Traceback (most recent call last)
S3GetError: An error occurred (InternalError) when calling the GetObject operation (reached max retries: 4): We encountered an internal error.  Please retry the operation again later.

The above exception was the direct cause of the following exception:

DatasetCorruptError                       Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/deeplake/api/dataset.py in load(path, read_only, memory_cache_size, local_cache_size, creds, token, org_id, verbose, access_method, unlink, reset, indra, check_integrity, lock_timeout, lock_enabled, index_params)
    714                 if not reset:
    715                     if isinstance(e, DatasetCorruptError):
--> 716                         raise DatasetCorruptError(
    717                             message=e.message,
    718                             action="Try using `reset=True` to reset HEAD changes and load the previous commit.",

DatasetCorruptError: The HEAD node of the branch main of this dataset is in a corrupted state and is likely not recoverable. Try using `reset=True` to reset HEAD changes and load the previous commit.

This is the code I use:

import hub
ds = hub.load("hub://activeloop/spoken_mnist")

I tried both on my machine and on Colab to check that the issue wasn't on my side, but I'm getting the same error on both.

Please help!

Best,
David

Normalize the recordings to have the same number of channels

It seems that the recordings by Jackson have only 1 audio channel (mono), but the recordings by Nicolas have 2 audio channels (stereo). I've noticed that there are no guidelines in the main README.md regarding how many channels the recordings should have. As such, I would like to know whether the samples in this dataset can have different numbers of audio channels, or whether there are plans to normalize them so that they all have just one channel. In either case, I suppose the contribution guidelines could be extended with this information.

Regards,
Cesar
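
Until the recordings are normalized, one workaround is to downmix stereo files on load. A sketch using soundfile (the filenames are placeholders):

import soundfile as sf

data, sr = sf.read("recordings/some_stereo_file.wav")
if data.ndim == 2:            # stereo files come back as (frames, channels)
    data = data.mean(axis=1)  # average the channels down to mono
sf.write("some_mono_file.wav", data, sr, subtype="PCM_16")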

Adding more numbers

I think it would be better if you added numbers like 20, 30, 40, 50, 60, 70, 80, 90, and 100, as well as 11 through 20, so that all number combinations are covered. For example, if we use this dataset in an application to recognize numbers like 3.45, 578, or 54, the improved dataset would help a lot.

How can we record 8 kHz mono WAV files for digit classification?

I used this code to train a model to classify the free-spoken-digit-dataset (https://github.com/mikesmales/Udacity-ML-Capstone). The accuracy of the trained model is 96%.

But prediction with the saved model fails when I test it on my own recorded voice. I recorded some digits using the Windows 10 Voice Recorder and converted the files to 8 kHz mono WAV format.

Can you provide any help? The model predicts accurately on the recordings provided with the dataset.

My Recorded Digit 3:

Original sample rate: 22050
Librosa sample rate: 22050
Original audio file minmax range: 20 to 239
Librosa audio file minmax range: -0.84375 to 0.8671875
(40, 18)

Dataset : Jackson 3:

Original sample rate: 8000
Librosa sample rate: 22050
Original audio file minmax range: -10989 to 9277
Librosa audio file minmax range: -0.35349792 to 0.28417692
(40, 21)
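
One thing worth checking in cases like this is that the new recording goes through exactly the same loading and feature pipeline as the training data; the (40, n) shapes above look like 40 MFCCs, so a matched pipeline might look like the sketch below (the parameters are assumptions, not taken from the linked tutorial):

import librosa

# load the new recording the same way as the dataset files: mono, fixed rate
y, sr = librosa.load("my_digit_3.wav", sr=8000, mono=True)
# trim leading/trailing silence so the clip resembles the trimmed dataset files
y, _ = librosa.effects.trim(y, top_db=30)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mfcc.shape)  # (40, n_frames), comparable to the shapes above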

About reference

Thank you for your work. If I want to use this dataset in my paper, what citation format should I use?

Problem of 8-bit and 16-bit files

Some files are 8-bit, like 1_nicolas_36.wav, and some files are 16-bit, like 9_theo_22.wav. It is hard to deal with two different sample widths. How can I unify them?
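
One way to unify the sample widths is to rewrite everything as 16-bit PCM; soundfile decodes both 8-bit and 16-bit files to floats, so a sketch could be:

import glob
import soundfile as sf

# rewrite every recording in place as 16-bit PCM
for path in glob.glob("recordings/*.wav"):
    data, sr = sf.read(path)  # decodes to float64 regardless of bit depth
    sf.write(path, data, sr, subtype="PCM_16")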

RFI

Hello,
Can I ask a simple (read: stupid) question: can I use this dataset for digit recognition, speaker recognition/identification, or both?
Thanks in advance,
Mirko

License

Have you considered choosing a license (Creative Commons for instance) to make explicit the conditions for copying and reuse?

Consider spreading the data into multiple directories

Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale once the dataset becomes larger. Depending on the file system, even listing the directory contents with ls can become burdensome after around 10,000 files.

But another reason to do so is that the current layout may prevent the files from being queried using GitHub's developer API in the future. I am building an interface to the dataset that can automatically download, query and organize the dataset into training and testing sets without having to first clone the dataset using git. However, there is a limit on the number of files that can be retrieved using this API, and after this limit, the only method would be to clone the repository and retrieve the files manually.

Regards,
Cesar
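
For reference, sharding the flat directory by digit would only take a few lines (a sketch; the per-digit layout is just one possible scheme):

import glob
import os
import shutil

# move recordings/7_jackson_32.wav -> recordings/7/7_jackson_32.wav
for path in glob.glob("recordings/*.wav"):
    digit = os.path.basename(path).split("_")[0]
    os.makedirs(os.path.join("recordings", digit), exist_ok=True)
    shutil.move(path, os.path.join("recordings", digit))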
