
ms-snsd's Introduction

Microsoft Scalable Noisy Speech Dataset (MS-SNSD)

  • This dataset contains a large collection of clean speech files and a variety of environmental noise files in .wav format, sampled at 16 kHz.
  • The main application of this dataset is to train Deep Neural Network (DNN) models to suppress background noise, but it can be used for other audio and speech applications.
  • We provide a recipe to mix clean speech and noise at various signal-to-noise ratio (SNR) conditions to generate a large noisy speech dataset.
  • The SNR conditions and the number of hours of data required can be configured depending on the application requirements.
  • This dataset will continue to grow in size, as we encourage researchers and practitioners to contribute more clean speech and noise clips.
  • This dataset will immensely help researchers and practitioners in academia and industry to develop better models.
  • We also provide a test set, distinct from the training set, for evaluating the developed models.
  • We provide HTML code for building two Human Intelligence Task (HIT) crowdsourcing applications that allow users to rate the noisy audio clips. We implemented an absolute category rating (ACR) application according to ITU-T P.800, and a subjective testing method according to ITU-T P.835 that allows raters to score the speech signal, the background noise, and the overall quality separately.

Further details of this dataset can be found in our Interspeech 2019 paper: Chandan K. A. Reddy, Ebrahim Beyrami, Jamie Pool, Ross Cutler, Sriram Srinivasan, Johannes Gehrke. "A scalable noisy speech dataset and online subjective test framework," in Interspeech, 2019. Link

Prerequisites

  • Python 3.0 and above
  • pysoundfile (pip install pysoundfile)
  • ($ pip install -r requirements.txt)

Please cite us if you use this dataset

@article{reddy2019scalable,
  title={A Scalable Noisy Speech Dataset and Online Subjective Test Framework},
  author={Reddy, Chandan KA and Beyrami, Ebrahim and Pool, Jamie and Cutler, Ross and Srinivasan, Sriram and Gehrke, Johannes},
  journal={Proc. Interspeech 2019},
  pages={1816--1820},
  year={2019}
}

MS-SNSD Dataset

Training and test sets

  1. Clean speech data for training is in the directory 'CleanSpeech'
  2. Noise data for training is in the directory 'Noise'
  3. Noisy speech for testing is in the directory 'noisy_test'
  4. Clean speech corresponding to the noisy test data is in the directory 'clean_test'

Download the data onto your local machine.

Usage

  1. Clone the repo to your local directory
  2. Download the clean speech and noise datasets into the same directory as the scripts
  3. The repo contains the following files:
    • 'audiolib.py'
    • 'noisyspeech_synthesizer.cfg'
    • 'noisyspeech_synthesizer.py'
    • 'requirements.txt'
  4. Specify your requirements in the config file (noisyspeech_synthesizer.cfg)
    • Specify the sampling rate, audio format, audio length, silence length, total number of hours of noisy speech required, and the Speech to Noise Ratio (SNR) levels required.
    • Specify noise files to be excluded. Example: noise_types_excluded: Babble, Traffic. Use 'None' if no files are to be excluded.
    • Specify the paths to the noise and speech directories if they are not in the same directory as the scripts.
  5. Noisy speech and the corresponding clean speech and noise files will be in the directories 'NoisySpeech_training', 'CleanSpeech_training' and 'Noise_training' respectively.
  6. Make sure that the config file is in the same directory as (noisyspeech_synthesizer.py) for ease of use.
  7. Now run (python noisyspeech_synthesizer.py) to generate noisy speech clips.
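As a quick orientation for step 4, a config of the shape described above might look like the sketch below. These key names are illustrative assumptions (except snr_lower, snr_upper, and total_snrlevels, which noisyspeech_synthesizer.py reads, as seen in the traceback quoted in the issues); consult the shipped noisyspeech_synthesizer.cfg for the authoritative keys:

```ini
[noisy_speech]
; sampling rate of output files, in Hz
sampling_rate: 16000
; seconds per synthesized clip, and silence inserted between utterances
audio_length: 10
silence_length: 0.2
; total hours of noisy speech to generate
total_hours: 1
; SNR range in dB and the number of levels across it
snr_lower: 0
snr_upper: 40
total_snrlevels: 5
; noise types to skip, or None
noise_types_excluded: None
```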

Dataset licenses

MICROSOFT PROVIDES THE DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

The datasets are provided under the original terms that Microsoft received such datasets. See below for more information about each dataset.

The datasets used in this project are licensed as follows:

  1. Clean speech:
  2. Noise:

Code license

MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

ms-snsd's People

Contributors

chandanka90, ebbeyram, hdubey, microsoftopensource, msftgits, rosscutler, sfilipi


ms-snsd's Issues

Different audio lengths cause a broadcasting error

If the passed arguments (clean and noise) have different lengths, it leads to the following error:
ValueError: operands could not be broadcast together with shapes

It can be fixed by tiling the shorter signal until its length matches the longer one, like this:

import numpy as np

def snr_mixer(clean, noise, snr):
    # Tile the shorter signal so both operands have equal length
    clean_len = len(clean)
    noise_len = len(noise)
    if clean_len < noise_len:
        rep_time = int(np.floor(noise_len / clean_len))  # was `audio_len` (undefined)
        left_len = noise_len - clean_len * rep_time
        clean = np.hstack((np.tile(clean, rep_time), clean[:left_len]))
        noise = np.array(noise)
    else:
        rep_time = int(np.floor(clean_len / noise_len))
        left_len = clean_len - noise_len * rep_time
        noise = np.hstack((np.tile(noise, rep_time), noise[:left_len]))
        clean = np.array(clean)

    # Normalize both signals to -25 dB FS
    rmsclean = (clean ** 2).mean() ** 0.5
    scalarclean = 10 ** (-25 / 20) / rmsclean
    clean = clean * scalarclean
    rmsclean = (clean ** 2).mean() ** 0.5

    rmsnoise = (noise ** 2).mean() ** 0.5
    scalarnoise = 10 ** (-25 / 20) / rmsnoise
    noise = noise * scalarnoise
    rmsnoise = (noise ** 2).mean() ** 0.5

    # Scale the noise to reach the requested SNR
    noisescalar = np.sqrt(rmsclean / (10 ** (snr / 20)) / rmsnoise)
    noisenewlevel = noise * noisescalar
    noisyspeech = clean + noisenewlevel
    return clean, noisenewlevel, noisyspeech

clean, noisenewlevel, noisyspeech = snr_mixer(audio_org, noise_org, 2)

'noisescalar' derivation in clean speech and noise mix

Hi,

Thanks for sharing this open-source dataset. I am trying to use this code to generate synthetic noisy datasets for speech processing. In practice, I observed that the generated data has only half the nominal SNR, which I verified in Audacity. After further checking 'audiolib.py', I think the 'noisescalar' derivation (line 68) is incorrect.

In 'audiolib.py', the original code is:
noisescalar = np.sqrt(rmsclean / (10 ** (snr / 20)) / rmsnoise)

I think the square root should not be used for the noise scalar, since the SNR is derived from RMS values; the noise scaling should be corrected as below.
noisescalar = rmsclean / (10 ** (snr / 20)) / rmsnoise

In my test, I got synthetic noisy data with the correct SNR level after this correction. Could you please correct it in the code?
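The halving the reporter observes follows from the normalization step: once both signals are scaled to -25 dB FS, rmsclean equals rmsnoise, so the square root turns 10^(snr/20) into 10^(snr/40), i.e. half the SNR in dB. A quick numeric check (a sketch using white noise as a stand-in for real audio; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
rms = lambda x: np.sqrt(np.mean(x ** 2))

clean = rng.standard_normal(160000)
noise = rng.standard_normal(160000)

# Normalize both signals to -25 dB FS, as audiolib.py does,
# which makes rms(clean) == rms(noise).
clean = clean * (10 ** (-25 / 20) / rms(clean))
noise = noise * (10 ** (-25 / 20) / rms(noise))

target_snr = 10  # requested SNR in dB

s_sqrt = np.sqrt(rms(clean) / (10 ** (target_snr / 20)) / rms(noise))  # current code
s_lin = rms(clean) / (10 ** (target_snr / 20)) / rms(noise)            # proposed fix

snr_sqrt = 20 * np.log10(rms(clean) / rms(noise * s_sqrt))
snr_lin = 20 * np.log10(rms(clean) / rms(noise * s_lin))
print(snr_sqrt, snr_lin)  # ~5.0 dB vs ~10.0 dB
```

With the square root, the measured SNR lands at half the requested value; without it, the measured and requested SNRs match.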

Maybe something wrong in audiolib.py

For the function def snr_mixer(clean, noise, snr) in the audiolib.py file, I think there are two problems.
First, in the line-66 code, the np.sqrt() call may be unnecessary.
Second, since clean and noise have already been normalized to -25 dB FS, noisescalar may not need to be calculated using rmsclean and rmsnoise.

Real time noise suppression

Excellent article on VentureBeat today:
https://venturebeat.com/2020/04/09/microsoft-teams-ai-machine-learning-real-time-noise-suppression-typing/

Funny enough, I've used this dataset (which I'm assuming you are referring to in the article) to also train noise suppression. I didn't have a requirement for real-time/streaming, so I used a bidirectional LSTM recurrent layer. I also trained against Librispeech (technically LibriTTS, as I wanted 24 kHz audio).

Examples

Sourced from national news broadcasts to show performance against data it was NOT trained on. Audio files are compressed because GitHub doesn't allow raw waveform uploads. I've provided the source files from the broadcast with the _noisy.wav suffix and the predicted output from the network with the _clean.wav suffix.

Example 1

sequence 1585584_clean
sequence.1585584_.zip

Example 2

sequence 1597540_clean
sequence.1597540_.zip

Example 3

sequence 1046182_clean
sequence.1046182_.zip

Example 4

sequence 1597377_clean
sequence.1597377_.zip

Example 5

sequence 231_clean
sequence.231_.zip

Example 6

Not the best but still did a decent job suppressing a noise sample it was never trained against.
00049 unknown and_despite_that_and_despite_40_million_18_trump_haters_including_people_that_worked_for_hillary_clinton_and_some_of_the_worst_human_beings_on_earth_they_got_nothing_clean
trump_helicopter.zip

Same configuration as the reference paper

Hello,
We are looking for the configuration that was used in the 'A scalable noisy speech dataset and online subjective test framework' paper. We could not find all the values, such as SNR levels, noise types, etc.
Maybe add it to the repo so that it's really easy to reproduce the same setup.
Thank you in advance!

noisyspeech_synthesizer.py always slices from the start of the noise array

In noisyspeech_synthesizer.py, an array of audio samples is read from a noise file (line 78). On line 81, a slice of the noise array is taken from index 0 to len(clean) as:

noise = noise[0:len(clean)]

By always starting at index 0, in the case where the clean speech arrays are roughly the same length (~16000 samples) as in the speech commands case, it means that the number of unique noise arrays we see is equal to the number of noise files.

Even if we have one noise file with 10 hours of audio, we may only ever make use of the first 1 second of this data.

It would be better to pick a random starting index within the noise array from which to take the slice. For example:
start_idx = np.random.randint(low=0, high=len(noise) - len(clean) + 1)
noise = noise[start_idx : start_idx + len(clean)]
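A self-contained sketch of that suggestion (the helper name is hypothetical; note that NumPy's integer-sampling functions exclude the upper bound, so adding 1 makes the last valid start index reachable and handles the equal-length case):

```python
import numpy as np

def random_noise_slice(noise, target_len, rng=None):
    """Return a random contiguous slice of `noise` of length `target_len`.

    Hypothetical helper illustrating the fix above; assumes
    len(noise) >= target_len.
    """
    rng = np.random.default_rng() if rng is None else rng
    # high is exclusive, so +1 allows the last valid start index
    start = int(rng.integers(0, len(noise) - target_len + 1))
    return noise[start:start + target_len]

noise = np.arange(100.0)
segment = random_noise_slice(noise, 30)
```

This way, a long noise file contributes many distinct segments instead of only its first len(clean) samples.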

some bugs in audiolib

I think in audiolib file, line 66 code: noisescalar = np.sqrt (rmsclean / (10 ** (snr / 20)) / rmsnoise),
snr should be divided by 10, not 20.
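For context on the /10 vs. /20 question in this thread: dividing the dB value by 10 converts it to a power ratio, while dividing by 20 converts it to an amplitude (RMS) ratio, and the square root of the former equals the latter, so the two forms are not independent. A one-line check:

```python
import numpy as np

snr = 7.0  # any SNR in dB
power_ratio = 10 ** (snr / 10)      # dB -> power ratio
amplitude_ratio = 10 ** (snr / 20)  # dB -> amplitude (RMS) ratio
# sqrt of the power ratio equals the amplitude ratio
assert np.isclose(np.sqrt(power_ratio), amplitude_ratio)
```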

Masking-based methods

Hello, if I want to use a mask-based approach for speech enhancement, e.g. IBM, IRM, etc.

how should I use this dataset?

Silence Removal Idea

Maybe a silence removal option could be added, to enable developing robust voice activity detection models. pyAudioAnalysis could be integrated for this purpose.
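As a rough illustration of the idea (a plain energy-threshold sketch, not the pyAudioAnalysis API; the frame length and threshold are arbitrary assumptions):

```python
import numpy as np

def remove_silence(x, frame_len=400, threshold_db=-40.0):
    """Drop frames whose RMS falls below `threshold_db` re. full scale (sketch only)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0)
    return frames[level_db > threshold_db].reshape(-1)

# Silence - tone - silence: only the middle 800 samples should survive
x = np.concatenate([np.zeros(800), 0.5 * np.ones(800), np.zeros(800)])
voiced = remove_silence(x)
```

A real VAD would add hysteresis and smoothing across frames; this only shows where such an option could hook into the synthesis pipeline.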

noisyspeech_synthesizer.py fails to run with numpy 1.18.5

There seems to be a breaking issue when running noisyspeech_synthesizer.py on Google Colab.
Colab had numpy 1.18.5 at the time this issue was posted; this version is installed by default when connecting to a runtime.
Following the standard procedure and prerequisites for running this script gives the following error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
    num = operator.index(num)
TypeError: 'float' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./MS-SNSD/noisyspeech_synthesizer.py", line 124, in <module>
    main(cfg._sections[args.cfg_str])
  File "./MS-SNSD/noisyspeech_synthesizer.py", line 47, in main
    SNR = np.linspace(snr_lower, snr_upper, total_snrlevels)
  File "<__array_function__ internals>", line 6, in linspace
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
    .format(type(num)))
TypeError: object of type <class 'float'> cannot be safely interpreted as an integer.

However, when the numpy version is downgraded from 1.18.5 to 1.16.4, the script works perfectly fine as it is supposed to.
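A minimal workaround that avoids the downgrade, assuming the cfg values arrive as floats from the config parser, is to cast the level count to int before calling np.linspace (newer NumPy rejects a float `num`):

```python
import numpy as np

# Values as they would arrive parsed from the cfg (illustrative)
snr_lower, snr_upper, total_snrlevels = 0.0, 40.0, 5.0

# NumPy >= 1.18 requires the `num` argument of linspace to be an integer
SNR = np.linspace(snr_lower, snr_upper, int(total_snrlevels))
print(SNR)  # [ 0. 10. 20. 30. 40.]
```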

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.