
Free Spoken Digit Dataset (FSDD)

A simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginnings and ends.

FSDD is an open dataset, which means it will grow over time as data is contributed. In order to enable reproducibility and accurate citation, the dataset is versioned using Zenodo DOIs as well as git tags.

Current status

  • 6 speakers
  • 3,000 recordings (50 of each digit per speaker)
  • English pronunciations

Organization

Files are named in the following format: {digitLabel}_{speakerName}_{index}.wav
Example: 7_jackson_32.wav
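
The naming scheme makes the label, speaker, and take recoverable directly from the filename. A minimal sketch (the helper name is ours, not part of the repository):

import os

def parse_fsdd_filename(path):
    # "7_jackson_32.wav" -> (7, "jackson", 32)
    name = os.path.splitext(os.path.basename(path))[0]
    digit, speaker, index = name.split("_")
    return int(digit), speaker, int(index)

print(parse_fsdd_filename("7_jackson_32.wav"))  # (7, 'jackson', 32)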

How to use with Hub

A simple way of using this dataset is with Activeloop's Python package Hub!

First, run pip install hub (or pip3 install hub).

import hub
ds = hub.load("hub://activeloop/spoken_mnist")

# check out the first spectrogram, its label, and who spoke it!
import matplotlib.pyplot as plt
plt.imshow(ds.spectrograms[0].numpy())
plt.title(f"{ds.speakers[0].data()} spoke {ds.labels[0].numpy()}")
plt.show()

# train a model in PyTorch
for sample in ds.pytorch():
    pass  # ... model code here ...

# train a model in TensorFlow
for sample in ds.tensorflow():
    pass  # ... model code here ...

Available tensors can be shown by printing the dataset:

print(ds)
# prints: Dataset(path='hub://activeloop/spoken_mnist', tensors=['spectrograms', 'labels', 'audio', 'speakers'])

For more information, check out the Hub documentation.

Contributions

Please contribute your homemade recordings. All recordings should be mono 8 kHz WAV files, trimmed to have minimal silence. Don't forget to update metadata.py with the speaker metadata.

To add your data, follow the recording instructions in acquire_data/say_numbers_prompt.py and then run split_and_label_numbers.py to make your files.
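
If your recording setup produces stereo audio or a different sample rate, one way to get a take into the expected shape before running the scripts above is with librosa and soundfile (a sketch, not part of the repository; the filenames are placeholders and the trim threshold is a judgment call):

import librosa
import soundfile as sf

# load the raw take as mono, resampled to 8 kHz
y, sr = librosa.load("raw_take.wav", sr=8000, mono=True)
# trim leading/trailing silence; top_db controls how aggressive the trim is
y, _ = librosa.effects.trim(y, top_db=30)
# write 16-bit PCM using the {digitLabel}_{speakerName}_{index}.wav scheme
sf.write("7_myname_0.wav", y, sr, subtype="PCM_16")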

Metadata

metadata.py contains metadata regarding the speakers' gender and accent.

Included utilities

trimmer.py Trims silence at the beginning and end of an audio file, and splits an audio file into multiple files by periods of silence.

fsdd.py A simple class that provides an easy-to-use API to access the data.

spectogramer.py Used for creating spectrograms of the audio data. Spectrograms are often a useful pre-processing step.
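
As an illustration of the idea (this is not spectogramer.py itself, just a sketch using matplotlib's built-in spectrogram):

import matplotlib.pyplot as plt
from scipy.io import wavfile

# read one recording and render its spectrogram to an image
sr, samples = wavfile.read("recordings/7_jackson_32.wav")
plt.specgram(samples, Fs=sr, NFFT=256, noverlap=128)
plt.xlabel("time (s)")
plt.ylabel("frequency (Hz)")
plt.savefig("7_jackson_32.png")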

Usage

The test set officially consists of the first 10% of the recordings. Recordings numbered 0-4 (inclusive) are in the test set, and recordings numbered 5-49 are in the training set.
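
In code, that split can be reproduced directly from the filenames (a sketch, assuming the recordings live in the standard recordings/ directory):

import glob

train, test = [], []
for path in glob.glob("recordings/*.wav"):
    # the trailing number in {digit}_{speaker}_{index}.wav is the take index
    index = int(path.rsplit("_", 1)[1].split(".")[0])
    (test if index <= 4 else train).append(path)
print(len(train), len(test))  # should come out to a 90/10 split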

Made with FSDD

Did you use FSDD in a paper, project or app? Add it here!

External tools

License

Creative Commons Attribution-ShareAlike 4.0 International

Contributors

adhishthite, antgeorge, cesarsouza, david-gerard, dependabot[bot], eonu, epochdv, experimenti, farizrahman4u, felixdollack, jakobovski, madtracki, mikayelh, pssf23, speechwrecko, verbose-void, yuxinpan, yweweler

Issues

Add other languages

Hi! Would you consider adding the Serbian language to the dataset? I am interested in contributing my voice and as many others as I can gather. I suppose this would also be simpler to accomplish if we could gather audio online using an automated website.

DOI

It would be nice to have a DOI for the dataset instead of relying on just git tags.
I would suggest using Zenodo, as they are integrated with GitHub and support multiple versions of the same dataset.

DatasetCorruptError: The HEAD node of the branch main of this dataset is in a corrupted state and is likely not recoverable.

Hi,

I have been playing around with the dataset through the hub API.

I just started to get the following error message:

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/deeplake/core/storage/s3.py in get_bytes(self, path, start_byte, end_byte)
    274         try:
--> 275             return self._get_bytes(path, start_byte, end_byte)
    276         except botocore.exceptions.ClientError as err:

20 frames
ClientError: An error occurred (InternalError) when calling the GetObject operation (reached max retries: 4): We encountered an internal error.  Please retry the operation again later.

During handling of the above exception, another exception occurred:

ClientError                               Traceback (most recent call last)
ClientError: An error occurred (InternalError) when calling the GetObject operation (reached max retries: 4): We encountered an internal error.  Please retry the operation again later.

During handling of the above exception, another exception occurred:

S3GetError                                Traceback (most recent call last)
S3GetError: An error occurred (InternalError) when calling the GetObject operation (reached max retries: 4): We encountered an internal error.  Please retry the operation again later.

The above exception was the direct cause of the following exception:

DatasetCorruptError                       Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/deeplake/api/dataset.py in load(path, read_only, memory_cache_size, local_cache_size, creds, token, org_id, verbose, access_method, unlink, reset, indra, check_integrity, lock_timeout, lock_enabled, index_params)
    714                 if not reset:
    715                     if isinstance(e, DatasetCorruptError):
--> 716                         raise DatasetCorruptError(
    717                             message=e.message,
    718                             action="Try using `reset=True` to reset HEAD changes and load the previous commit.",

DatasetCorruptError: The HEAD node of the branch main of this dataset is in a corrupted state and is likely not recoverable. Try using `reset=True` to reset HEAD changes and load the previous commit.

This is the code I use:

import hub
ds = hub.load("hub://activeloop/spoken_mnist")

I tried both on my machine and on Colab to check that the issue wasn't on my side, but I'm getting the same error on both.

Please help!

Best,
David

Normalize the recordings to have the same number of channels

It seems that the recordings by Jackson have only 1 audio channel (mono), but the recordings by Nicolas have 2 audio channels (stereo). I've noticed that there are no guidelines in the main README.md regarding how many channels the recordings should have. As such, I would like to know whether the samples in this dataset can have different numbers of audio channels, or whether there are plans to normalize them so that they all have just one channel. In either case, I suppose the contribution guidelines could be extended with this information.

Regards,
Cesar
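
Until the recordings are normalized, one workaround is to downmix stereo files on load. A sketch using soundfile (the filenames are placeholders):

import soundfile as sf

data, sr = sf.read("recordings/some_stereo_file.wav")
if data.ndim == 2:            # stereo files come back as (frames, channels)
    data = data.mean(axis=1)  # average the channels down to mono
sf.write("some_mono_file.wav", data, sr, subtype="PCM_16")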

Adding more numbers

I think it would be better if you added numbers like 20, 30, 40, 50, 60, 70, 80, 90, and 100, as well as 11 through 20, so that all number combinations are covered. For example, if we use this dataset in an application to recognize numbers like 3.45, 578, or 54, the improved dataset would help a lot.

How can we record 8 kHz mono WAV files for digit classification?

I used this code to train a model to classify the free-spoken-digit-dataset (https://github.com/mikesmales/Udacity-ML-Capstone). The accuracy of the trained model is 96%.

But prediction with the saved model fails when I test it on my own recorded voice. I recorded some digits using the Windows 10 Voice Recorder and converted the files to 8 kHz mono WAV format.

Can you provide any help? The model predicts accurately on the recordings provided with the dataset.

My Recorded Digit 3:

Original sample rate: 22050
Librosa sample rate: 22050
Original audio file minmax range: 20 to 239
Librosa audio file minmax range: -0.84375 to 0.8671875
(40, 18)

Dataset : Jackson 3:

Original sample rate: 8000
Librosa sample rate: 22050
Original audio file minmax range: -10989 to 9277
Librosa audio file minmax range: -0.35349792 to 0.28417692
(40, 21)
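
One thing worth checking in cases like this is that the new recording goes through exactly the same loading and feature pipeline as the training data; the (40, n) shapes above look like 40 MFCCs, so a matched pipeline might look like the sketch below (the parameters are assumptions, not taken from the linked tutorial):

import librosa

# load the new recording the same way as the dataset files: mono, fixed rate
y, sr = librosa.load("my_digit_3.wav", sr=8000, mono=True)
# trim leading/trailing silence so the clip resembles the trimmed dataset files
y, _ = librosa.effects.trim(y, top_db=30)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mfcc.shape)  # (40, n_frames), comparable to the shapes above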

About reference

Thank you for your work. If I want to use this dataset in my paper, what citation format should I use?

Problem of 8-bit and 16-bit files

Some files are 8-bit, like 1_nicolas_36.wav, and some files are 16-bit, like 9_theo_22.wav. It is hard to deal with two different sample widths. How can I unify them?
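
One way to unify the sample widths is to rewrite everything as 16-bit PCM; soundfile decodes both 8-bit and 16-bit files to floats, so a sketch could be:

import glob
import soundfile as sf

# rewrite every recording in place as 16-bit PCM
for path in glob.glob("recordings/*.wav"):
    data, sr = sf.read(path)  # decodes to float64 regardless of bit depth
    sf.write(path, data, sr, subtype="PCM_16")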

RFI

Hello,
Can I ask a simple (read: stupid) question: can I use this dataset for digit recognition, speaker recognition/identification, or both?
Thanks in advance,
Mirko

License

Have you considered choosing a license (Creative Commons for instance) to make explicit the conditions for copying and reuse?

Consider spreading the data into multiple directories

Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale once the dataset becomes larger. Depending on the file system, even listing the directory contents with ls can become burdensome after around 10,000 files.

But another reason to do so is that the current layout may prevent the files from being queried using GitHub's developer API in the future. I am building an interface to the dataset that can automatically download, query and organize the dataset into training and testing sets without having to first clone the dataset using git. However, there is a limit on the number of files that can be retrieved using this API, and after this limit, the only method would be to clone the repository and retrieve the files manually.

Regards,
Cesar
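
For reference, sharding the flat directory by digit would only take a few lines (a sketch; the per-digit layout is just one possible scheme):

import glob
import os
import shutil

# move recordings/7_jackson_32.wav -> recordings/7/7_jackson_32.wav
for path in glob.glob("recordings/*.wav"):
    digit = os.path.basename(path).split("_")[0]
    os.makedirs(os.path.join("recordings", digit), exist_ok=True)
    shutil.move(path, os.path.join("recordings", digit))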
