
ms-snsd's Introduction

Microsoft Scalable Noisy Speech Dataset (MS-SNSD)

  • This dataset contains a large collection of clean speech files and a variety of environmental noise files in .wav format, sampled at 16 kHz.
  • The main application of this dataset is to train Deep Neural Network (DNN) models to suppress background noise, but it can be used for other audio and speech applications.
  • We provide a recipe to mix clean speech and noise at various signal-to-noise ratio (SNR) conditions to generate a large noisy speech dataset.
  • The SNR conditions and the number of hours of data required can be configured depending on the application requirements.
  • This dataset will continue to grow in size, as we encourage researchers and practitioners to contribute more clean speech and noise clips.
  • This dataset will immensely help researchers and practitioners in academia and industry to develop better models.
  • We also provide a test set, distinct from the training set, for evaluating the developed models.
  • We provide HTML code for building two Human Intelligence Task (HIT) crowdsourcing applications that allow users to rate the noisy audio clips. We implemented an absolute category rating (ACR) application according to ITU-T P.800, and a subjective testing method according to ITU-T P.835 that allows raters to score the speech signal, the background noise, and the overall quality separately.

Further details of this dataset can be found in our Interspeech 2019 paper: Chandan K. A. Reddy, Ebrahim Beyrami, Jamie Pool, Ross Cutler, Sriram Srinivasan, Johannes Gehrke. "A scalable noisy speech dataset and online subjective test framework," in Interspeech, 2019. Link

Prerequisites

  • Python 3.0 and above
  • pysoundfile (pip install pysoundfile)
  • ($ pip install -r requirements.txt)

Please cite us if you use this dataset

@article{reddy2019scalable,
  title={A Scalable Noisy Speech Dataset and Online Subjective Test Framework},
  author={Reddy, Chandan KA and Beyrami, Ebrahim and Pool, Jamie and Cutler, Ross and Srinivasan, Sriram and Gehrke, Johannes},
  journal={Proc. Interspeech 2019},
  pages={1816--1820},
  year={2019}
}

MS-SNSD Dataset

Training and test sets

  1. Clean speech data for training is in the directory 'CleanSpeech'
  2. Noise data for training is in the directory 'Noise'
  3. Noisy speech for testing is in the directory 'noisy_test'
  4. Clean speech corresponding to the noisy test data is in the directory 'clean_test'

Download the data onto your local machine.

Usage

  1. Clone the repo to your local directory
  2. Download the clean speech and noise datasets into the same directory as the scripts
  3. The repo contains the following files:
    • 'audiolib.py'
    • 'noisyspeech_synthesizer.cfg'
    • 'noisyspeech_synthesizer.py'
    • 'requirements.txt'
  4. Specify your requirements in the config file (noisyspeech_synthesizer.cfg)
    • Specify the sampling rate, audio format, audio length, silence length, total number of hours of noisy speech required, and the Speech to Noise Ratio (SNR) levels required.
    • Specify noise files to be excluded. Example: noise_types_excluded: Babble, Traffic. Use 'None' if no files are to be excluded.
    • Specify the paths to the noise and speech directories if they are not in the same directory as the scripts.
  5. Noisy speech and the corresponding clean speech and noise files will be in the directories 'NoisySpeech_training', 'CleanSpeech_training' and 'Noise_training' respectively.
  6. Make sure that the config file is in the same directory as (noisyspeech_synthesizer.py) for ease of use.
  7. Now run (python noisyspeech_synthesizer.py) to generate noisy speech clips.
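As a quick orientation for step 4, a config of the shape described above might look like the sketch below. These key names are illustrative assumptions (except snr_lower, snr_upper, and total_snrlevels, which noisyspeech_synthesizer.py reads, as seen in the traceback quoted in the issues); consult the shipped noisyspeech_synthesizer.cfg for the authoritative keys:

```ini
[noisy_speech]
; sampling rate of output files, in Hz
sampling_rate: 16000
; seconds per synthesized clip, and silence inserted between utterances
audio_length: 10
silence_length: 0.2
; total hours of noisy speech to generate
total_hours: 1
; SNR range in dB and the number of levels across it
snr_lower: 0
snr_upper: 40
total_snrlevels: 5
; noise types to skip, or None
noise_types_excluded: None
```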

Dataset licenses

MICROSOFT PROVIDES THE DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

The datasets are provided under the original terms that Microsoft received such datasets. See below for more information about each dataset.

The datasets used in this project are licensed as follows:

  1. Clean speech:
  2. Noise:

Code license

MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

ms-snsd's People

Contributors

chandanka90, ebbeyram, hdubey, microsoftopensource, msftgits, rosscutler, sfilipi


ms-snsd's Issues

Different audio lengths cause a broadcasting error

If the passed arguments (clean and noise) have different lengths, it leads to the following error:
ValueError: operands could not be broadcast together with shapes

It can be fixed by tiling the shorter signal until its length matches the longer one, like this:

import numpy as np

def snr_mixer(clean, noise, snr):
    # Tile the shorter signal so both operands have equal length
    clean_len = len(clean)
    noise_len = len(noise)
    if clean_len < noise_len:
        rep_time = int(np.floor(noise_len / clean_len))  # was `audio_len` (undefined)
        left_len = noise_len - clean_len * rep_time
        clean = np.hstack((np.tile(clean, rep_time), clean[:left_len]))
        noise = np.array(noise)
    else:
        rep_time = int(np.floor(clean_len / noise_len))
        left_len = clean_len - noise_len * rep_time
        noise = np.hstack((np.tile(noise, rep_time), noise[:left_len]))
        clean = np.array(clean)

    # Normalize both signals to -25 dB FS
    rmsclean = (clean ** 2).mean() ** 0.5
    scalarclean = 10 ** (-25 / 20) / rmsclean
    clean = clean * scalarclean
    rmsclean = (clean ** 2).mean() ** 0.5

    rmsnoise = (noise ** 2).mean() ** 0.5
    scalarnoise = 10 ** (-25 / 20) / rmsnoise
    noise = noise * scalarnoise
    rmsnoise = (noise ** 2).mean() ** 0.5

    # Scale the noise to reach the requested SNR
    noisescalar = np.sqrt(rmsclean / (10 ** (snr / 20)) / rmsnoise)
    noisenewlevel = noise * noisescalar
    noisyspeech = clean + noisenewlevel
    return clean, noisenewlevel, noisyspeech

clean, noisenewlevel, noisyspeech = snr_mixer(audio_org, noise_org, 2)

'noisescalar' derivation in clean speech and noise mix

Hi,

Thanks for sharing this open-source dataset. I am trying to use this code to generate synthetic noisy datasets for speech processing. In practice, I observed that the generated data has only half the nominal SNR, which I verified in Audacity. After further checking 'audiolib.py', I think the 'noisescalar' derivation (line 68) is incorrect.

In 'audiolib.py', the original code is:
noisescalar = np.sqrt(rmsclean / (10 ** (snr / 20)) / rmsnoise)

I think the square root should not be used for the noise scalar, since the SNR is derived from RMS values; the noise scaling should be corrected as below.
noisescalar = rmsclean / (10 ** (snr / 20)) / rmsnoise

In my test, I got synthetic noisy data with the correct SNR level after this correction. Could you please correct it in the code?
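The halving the reporter observes follows from the normalization step: once both signals are scaled to -25 dB FS, rmsclean equals rmsnoise, so the square root turns 10^(snr/20) into 10^(snr/40), i.e. half the SNR in dB. A quick numeric check (a sketch using white noise as a stand-in for real audio; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
rms = lambda x: np.sqrt(np.mean(x ** 2))

clean = rng.standard_normal(160000)
noise = rng.standard_normal(160000)

# Normalize both signals to -25 dB FS, as audiolib.py does,
# which makes rms(clean) == rms(noise).
clean = clean * (10 ** (-25 / 20) / rms(clean))
noise = noise * (10 ** (-25 / 20) / rms(noise))

target_snr = 10  # requested SNR in dB

s_sqrt = np.sqrt(rms(clean) / (10 ** (target_snr / 20)) / rms(noise))  # current code
s_lin = rms(clean) / (10 ** (target_snr / 20)) / rms(noise)            # proposed fix

snr_sqrt = 20 * np.log10(rms(clean) / rms(noise * s_sqrt))
snr_lin = 20 * np.log10(rms(clean) / rms(noise * s_lin))
print(snr_sqrt, snr_lin)  # ~5.0 dB vs ~10.0 dB
```

With the square root, the measured SNR lands at half the requested value; without it, the measured and requested SNRs match.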

Maybe something wrong in audiolib.py

For the function def snr_mixer(clean, noise, snr) in the audiolib.py file, I think there are two problems.
First, in the line-66 code, the np.sqrt() call may be unnecessary.
Second, since clean and noise have already been normalized to -25 dB FS, noisescalar may not need to be calculated using rmsclean and rmsnoise.

Real time noise suppression

Excellent article on VentureBeat today:
https://venturebeat.com/2020/04/09/microsoft-teams-ai-machine-learning-real-time-noise-suppression-typing/

Funny enough, I've used this dataset (which I'm assuming you are referring to in the article) to also train noise suppression. I didn't have a requirement for real-time/streaming, so I used a bidirectional LSTM recurrent layer. I also trained against Librispeech (technically LibriTTS, as I wanted 24 kHz audio).

Examples

Sourced from national news broadcasts to show performance against data it was NOT trained on. Audio files are compressed because GitHub doesn't allow raw waveform uploads. I've provided the source files from the broadcast with the _noisy.wav suffix and the predicted output from the network with the _clean.wav suffix.

Example 1

sequence 1585584_clean
sequence.1585584_.zip

Example 2

sequence 1597540_clean
sequence.1597540_.zip

Example 3

sequence 1046182_clean
sequence.1046182_.zip

Example 4

sequence 1597377_clean
sequence.1597377_.zip

Example 5

sequence 231_clean
sequence.231_.zip

Example 6

Not the best but still did a decent job suppressing a noise sample it was never trained against.
00049 unknown and_despite_that_and_despite_40_million_18_trump_haters_including_people_that_worked_for_hillary_clinton_and_some_of_the_worst_human_beings_on_earth_they_got_nothing_clean
trump_helicopter.zip

Same configuration as the reference paper

Hello,
We are looking for the configuration that was used in the 'A scalable noisy speech dataset and online subjective test framework' paper. We could not find all the values, such as SNR levels, noise types, etc.
Maybe add it to the repo so that it's really easy to reproduce the same setup.
Thank you in advance!

noisyspeech_synthesizer.py always slices from the start of the noise array

In noisyspeech_synthesizer.py, an array of audio samples is read from a noise file (line 78). On line 81, a slice of the noise array is taken from index 0 to len(clean) as:

noise = noise[0:len(clean)]

By always starting at index 0, in the case where the clean speech arrays are roughly the same length (~16000 samples) as in the speech commands case, it means that the number of unique noise arrays we see is equal to the number of noise files.

Even if we have one noise file with 10 hours of audio, we may only ever make use of the first 1 second of this data.

It would be better to pick a random starting index within the noise array from which to take the slice. For example:
start_idx = np.random.randint(low=0, high=len(noise) - len(clean) + 1)
noise = noise[start_idx : start_idx + len(clean)]
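A self-contained sketch of that suggestion (the helper name is hypothetical; note that NumPy's integer-sampling functions exclude the upper bound, so adding 1 makes the last valid start index reachable and handles the equal-length case):

```python
import numpy as np

def random_noise_slice(noise, target_len, rng=None):
    """Return a random contiguous slice of `noise` of length `target_len`.

    Hypothetical helper illustrating the fix above; assumes
    len(noise) >= target_len.
    """
    rng = np.random.default_rng() if rng is None else rng
    # high is exclusive, so +1 allows the last valid start index
    start = int(rng.integers(0, len(noise) - target_len + 1))
    return noise[start:start + target_len]

noise = np.arange(100.0)
segment = random_noise_slice(noise, 30)
```

This way, a long noise file contributes many distinct segments instead of only its first len(clean) samples.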

some bugs in audiolib

I think in audiolib file, line 66 code: noisescalar = np.sqrt (rmsclean / (10 ** (snr / 20)) / rmsnoise),
snr should be divided by 10, not 20.
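For context on the /10 vs. /20 question in this thread: dividing the dB value by 10 converts it to a power ratio, while dividing by 20 converts it to an amplitude (RMS) ratio, and the square root of the former equals the latter, so the two forms are not independent. A one-line check:

```python
import numpy as np

snr = 7.0  # any SNR in dB
power_ratio = 10 ** (snr / 10)      # dB -> power ratio
amplitude_ratio = 10 ** (snr / 20)  # dB -> amplitude (RMS) ratio
# sqrt of the power ratio equals the amplitude ratio
assert np.isclose(np.sqrt(power_ratio), amplitude_ratio)
```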

Masking-based methods

Hello, if I want to use a mask-based approach for speech enhancement, e.g. IBM, IRM, etc.

how should I use this dataset?

Silence Removal Idea

Maybe a silence removal option could be added, to enable developing robust voice activity detection models. pyAudioAnalysis could be integrated for this purpose.
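As a rough illustration of the idea (a plain energy-threshold sketch, not the pyAudioAnalysis API; the frame length and threshold are arbitrary assumptions):

```python
import numpy as np

def remove_silence(x, frame_len=400, threshold_db=-40.0):
    """Drop frames whose RMS falls below `threshold_db` re. full scale (sketch only)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0)
    return frames[level_db > threshold_db].reshape(-1)

# Silence - tone - silence: only the middle 800 samples should survive
x = np.concatenate([np.zeros(800), 0.5 * np.ones(800), np.zeros(800)])
voiced = remove_silence(x)
```

A real VAD would add hysteresis and smoothing across frames; this only shows where such an option could hook into the synthesis pipeline.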

noisyspeech_synthesizer.py fails to run with numpy 1.18.5

There seems to be a breaking issue when running noisyspeech_synthesizer.py on Google Colab.
Colab had numpy 1.18.5 at the time this issue was posted; this version is installed by default when connecting to a runtime.
Following the standard procedure and prerequisites for running this script gives the following error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
    num = operator.index(num)
TypeError: 'float' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./MS-SNSD/noisyspeech_synthesizer.py", line 124, in <module>
    main(cfg._sections[args.cfg_str])
  File "./MS-SNSD/noisyspeech_synthesizer.py", line 47, in main
    SNR = np.linspace(snr_lower, snr_upper, total_snrlevels)
  File "<__array_function__ internals>", line 6, in linspace
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
    .format(type(num)))
TypeError: object of type <class 'float'> cannot be safely interpreted as an integer.

However, when the numpy version is downgraded from 1.18.5 to 1.16.4, the script works perfectly fine as it is supposed to.
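A minimal workaround that avoids the downgrade, assuming the cfg values arrive as floats from the config parser, is to cast the level count to int before calling np.linspace (newer NumPy rejects a float `num`):

```python
import numpy as np

# Values as they would arrive parsed from the cfg (illustrative)
snr_lower, snr_upper, total_snrlevels = 0.0, 40.0, 5.0

# NumPy >= 1.18 requires the `num` argument of linspace to be an integer
SNR = np.linspace(snr_lower, snr_upper, int(total_snrlevels))
print(SNR)  # [ 0. 10. 20. 30. 40.]
```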

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.