
svoice's Introduction

SVoice: Speaker Voice Separation using Neural Nets

We provide a PyTorch implementation of our speaker voice separation research work. In Voice Separation with an Unknown Number of Multiple Speakers, we present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers. Please note that this implementation does not contain the "IDloss" as described in the paper.

Audio samples can be found here: Samples

The architecture of our network. The audio is convolved with a stack of 1D convolutions and reordered by cutting
overlapping segments of length K in time, to obtain a 3D tensor. In our method, the RNN blocks are of the multiply-and-add type.
After each pair of blocks, we apply a convolution D to a copy of the activations, and obtain output channels by reordering the chunks
and then applying the overlap-and-add operator.

Installation

First, install Python 3.7 (recommended with Anaconda).

Clone this repository and install the dependencies. We recommend using a fresh virtualenv or Conda environment.

git clone git@github.com:facebookresearch/svoice.git
cd svoice
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt  
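
If you want to verify the installation before training, the optional check below confirms that the pinned PyTorch build can see your GPU. These are plain PyTorch calls, nothing svoice-specific:

# Optional sanity check: not part of svoice, just standard PyTorch calls.
import torch

print(torch.__version__)          # expect 1.6.0+cu101 with the pip command above
print(torch.cuda.is_available())  # should print True on a working CUDA 10.1 setup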

Setup

Configuration

We use Hydra to control all the training configurations. If you are not familiar with Hydra, we recommend visiting the Hydra website. In short, Hydra is an open-source framework that simplifies the development of research applications by providing the ability to create a hierarchical configuration dynamically.

The config file with all relevant arguments for training our model can be found under the conf folder. Note that under the conf folder, the dset folder contains the configuration files for the different datasets. You should see a file named config.yaml with the relevant configuration for the debug sample set.

You can pass options through the command line, for instance python train.py lr=1e-4. Please refer to conf/config.yaml for a reference of the possible options. You can also directly edit the config.yaml file, although this is not recommended due to the way experiments are automatically named, as explained hereafter.
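
If you have not used Hydra before, the minimal sketch below shows the general shape of a Hydra entry point and how command-line overrides end up in the config object. It is illustrative only: the decorator signature assumes Hydra >= 1.0, and svoice's own train.py may differ in detail.

# Illustrative Hydra entry point (assumes Hydra >= 1.0; not svoice's actual train.py).
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # cfg merges conf/config.yaml, the selected dset file, and any key=value
    # overrides passed on the command line (e.g. `python train.py lr=1e-4`).
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()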

Checkpointing

Each experiment will get a unique name based on the command line options you passed. Restarting the same command will reuse the existing folder and automatically start from a previous checkpoint if possible. In order to ignore previous checkpoints, you must pass the restart=1 option. Note that options like device, num_workers, etc. have no influence on the experiment name.

Setting up a new dataset

If you want to train using a new dataset, you can:

  1. Create a separate config file for it.
  2. Place the new config files under the dset folder. Check conf/dset/debug.yaml for more details on configuring your dataset.
  3. Point to it either in the general config file or via the command line, e.g. ./train.py dset=name_of_dset.

You also need to generate the relevant .json files in the egs/ folder. For that purpose you can use the python -m svoice.data.audio command, which will scan the given folders and output the required metadata as json. For instance, if your mixture files are located in $mix and the separated files are in $spk1 and $spk2, you can do:

out=egs/mydataset/tr
mkdir -p $out
python -m svoice.data.audio $mix > $out/mix.json
python -m svoice.data.audio $spk1 > $out/s1.json
python -m svoice.data.audio $spk2 > $out/s2.json
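
If you prefer to build (or inspect) these metadata files from Python, the sketch below produces the layout described in the Data Structure section further down: a JSON list of (wav_path, length_in_frames) pairs. It uses the soundfile package, which is an assumption on my part; compare its output with python -m svoice.data.audio before relying on it.

# Hedged helper (not part of svoice): write wav metadata as a JSON list of
# (path, length_in_frames) pairs, one file per folder scanned.
import json
import sys
from pathlib import Path

import soundfile as sf  # assumed available; `pip install soundfile` otherwise


def find_audio_files(folder):
    meta = []
    for wav in sorted(Path(folder).rglob("*.wav")):
        meta.append((str(wav.resolve()), sf.info(str(wav)).frames))
    return meta


if __name__ == "__main__":
    # Usage: python make_meta.py /path/to/wav/folder > egs/mydataset/tr/mix.json
    json.dump(find_audio_files(sys.argv[1]), sys.stdout, indent=4)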

Creating your own dataset

We provide a dataset generation script with which users can create their own noisy and reverberant datasets. The script follows the same recipe as described in our recent ICASSP-2021 paper: Single Channel Voice Separation for Unknown Number of Speakers Under Reverberant and Noisy Settings. It can be found under scripts/make_dataset.py. The script takes as input the clean recordings together with a set of noise recordings and uses them to generate a noisy, reverberant dataset. We synthesize room impulse responses using the RIR-Generator package, which uses the image method proposed by Allen and Berkley in 1979. This method is one of the most frequently used in the acoustic signal processing community to create synthetic room impulse responses.

To generate reverberant data, you first need to install the RIR-Generator package.

For more details regarding possible arguments, please see:

usage: Mode [-h] [--in_path IN_PATH] [--out_path OUT_PATH]
            [--noise_path NOISE_PATH] [--num_of_speakers NUM_OF_SPEAKERS]
            [--num_of_scenes NUM_OF_SCENES] [--sec SEC] [--sr SR]

optional arguments:
  -h, --help            show this help message and exit
  --in_path IN_PATH
  --out_path OUT_PATH
  --noise_path NOISE_PATH
  --num_of_speakers NUM_OF_SPEAKERS
                        no of speakers.
  --num_of_scenes NUM_OF_SCENES
                        no of examples.
  --sec SEC
  --sr SR

Usage

Quick Start with Toy Example

  1. Run ./make_debug.sh to generate json files for the toy dataset.
  2. Run python train.py

Note that we already provide the yaml file for the toy dataset; it can be found under conf/dset/debug.yaml.

Data Structure

The data loader reads both mixture and separated json files named mix.json and s<id>.json, where <id> is a running identifier. These files should contain all the paths to the wav files to be used to optimize and test the model, along with their size (in frames). You can use python -m svoice.data.audio FOLDER_WITH_WAV1 [FOLDER_WITH_WAV2 ...] > OUTPUT.json to generate those files. You should generate the above files for both the training and test sets (and the validation set if provided). Once this is done, you should create a yaml file (similar to conf/dset/debug.yaml) with the updated paths to the dataset folders. Please check conf/dset/debug.yaml for more details.
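
As an illustration of this layout, the hypothetical helper below checks that every s<id>.json in a split has the same number of entries as mix.json; the paths are examples only and the helper is not part of svoice.

# Sanity-check an egs/<name>/<split> folder against the structure described above.
import json
from pathlib import Path


def check_split(split_dir):
    split = Path(split_dir)
    mix = json.load(open(split / "mix.json"))
    spk_files = sorted(split.glob("s[0-9]*.json"))
    assert spk_files, f"no s<id>.json files found in {split}"
    for spk in spk_files:
        entries = json.load(open(spk))
        assert len(entries) == len(mix), f"{spk.name}: {len(entries)} entries vs {len(mix)} in mix.json"
    print(f"{split}: {len(mix)} mixtures, {len(spk_files)} speaker channels")


if __name__ == "__main__":
    for split in ("tr", "cv", "tt"):  # example layout, as in egs/debug
        check_split(f"egs/mydataset/{split}")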

WSJ Mixture Generation

In case you have access to the original wsj0 data (sphere format), you can generate the mixtures using the tools provided in the following repository (see the usage section in its readme). You can access the csv files containing all the metadata for generating the mixtures from the following samples page.

Training

Training is simply done by launching the train.py script:

python train.py

This will automatically read all the configurations from the conf/config.yaml file. You can override configuration arguments from the command line; this will automatically generate a new experiment folder named after the overridden params. For example:

python train.py lr=0.001
python train.py dset=librimix lr=0.001 swave.R=8

Distributed Training

To launch distributed training you should turn on the distributed training flag. This can be done as follows:

python train.py ddp=1

Logs

Logs are stored by default in the outputs folder. Look for the matching experiment name. In the experiment folder you will find the training checkpoint checkpoint.th (containing the last state as well as the best state) as well as the log with the metrics, trainer.log. All metrics are also extracted to the history.json file for easier parsing. Separated samples are stored in the samples folder (if mix_dir or mix_json is set in the dataset config yaml file).
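
Since history.json is plain JSON, it can be inspected with a few lines of Python. The sketch below assumes it is a list of per-epoch dictionaries containing at least "train" and "valid" keys (those key names appear in the history snippet quoted in the issues further down); adjust the path and keys to your setup.

# Hedged sketch: print per-epoch losses from an experiment's history.json.
import json

with open("outputs/exp_/history.json") as fh:  # example experiment folder
    history = json.load(fh)

for epoch, metrics in enumerate(history, start=1):
    print(f"epoch {epoch:3d} | train {metrics['train']:.3f} | valid {metrics['valid']:.3f}")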

Evaluating

Evaluating the models can be done by launching the following:

python -m svoice.evaluate <path to the model> <path to folder containing mix.json and all target separated channels json files s<ID>.json>

For more details regarding possible arguments, please see:

usage: Evaluate separation performance using MulCat blocks [-h]
                                                           [--device DEVICE]
                                                           [--sdr SDR]
                                                           [--sample_rate SAMPLE_RATE]
                                                           [--num_workers NUM_WORKERS]
                                                           [-v]
                                                           model_path data_dir

positional arguments:
  model_path            Path to model file created by training
  data_dir              directory including mix.json, s1.json, s2.json, ...
                        files

optional arguments:
  -h, --help            show this help message and exit
  --device DEVICE
  --sdr SDR
  --sample_rate SAMPLE_RATE
                        Sample rate
  --num_workers NUM_WORKERS
  -v, --verbose         More loggging

Separation

Separating files can be done by launching the following:

python -m svoice.separate <path to the model> <path to store the separated files> --mix_dir=<path to the dir with the mixture files>

Note that you can provide either mix_dir or mix_json for the test data. For more details regarding possible arguments, please see:

usage: Speech separation using MulCat blocks [-h] [--mix_dir MIX_DIR]
                                             [--mix_json MIX_JSON]
                                             [--device DEVICE]
                                             [--sample_rate SAMPLE_RATE]
                                             [--batch_size BATCH_SIZE] [-v]
                                             model_path out_dir

positional arguments:
  model_path            Model name
  out_dir               Directory putting enhanced wav files

optional arguments:
  -h, --help            show this help message and exit
  --mix_dir MIX_DIR     Directory including mix wav files
  --mix_json MIX_JSON   Json file including mix wav files
  --device DEVICE
  --sample_rate SAMPLE_RATE
                        Sample rate
  --batch_size BATCH_SIZE
                        Batch size
  -v, --verbose         More loggging

Results

Using the default configuration (the same one presented in our paper), results should be similar to the following. All reported numbers are the Scale-Invariant Signal-to-Noise-Ratio improvement (SI-SNRi) over the input mixture; a reference SI-SNR sketch is given after the table.

Model       #params  2spk  3spk  4spk  5spk
ADANet      9.1M     10.5   9.1     -     -
DPCL++      13.6M    10.8   7.1     -     -
CBLDNN-GAT  39.5M    11.0     -     -     -
TasNet      32.0M    11.2     -     -     -
IBM         -        13.0  12.8  10.6  10.3
IRM         -        12.7  12.5   9.8   9.6
ConvTasNet  5.1M     15.3  12.7   8.5   6.8
FurcaNeXt   51.4M    18.4     -     -     -
DPRNN       3.6M     18.8  14.7  10.4   8.7
Ours        7.5M     20.1  16.9  12.9  10.6
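
For reference, the sketch below is a textbook implementation of SI-SNR, the quantity behind the SI-SNRi numbers above; SI-SNRi is the SI-SNR of the estimate minus the SI-SNR of the raw mixture, both measured against the clean source. It is written from the standard definition, not copied from svoice/models/sisnr_loss.py, and assumes sources are already aligned with their estimates.

# Reference SI-SNR (scale-invariant signal-to-noise ratio), standard definition.
import torch


def si_snr(estimate, source, eps=1e-8):
    # estimate, source: tensors of shape (..., time)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    source = source - source.mean(dim=-1, keepdim=True)
    # Project the estimate onto the source to get the target component.
    dot = (estimate * source).sum(dim=-1, keepdim=True)
    s_target = dot / (source.pow(2).sum(dim=-1, keepdim=True) + eps) * source
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)


# SI-SNRi example: si_snr(estimate, source) - si_snr(mixture, source)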

Learning Curves

The following learning curves were obtained using L=8 (the encoder kernel size):

Training curves of our model. SI-SNRi curves of our model.

Citation

If you find our code or models useful for your research, please cite it as:

@inproceedings{nachmani2020voice,
  title={Voice Separation with an Unknown Number of Multiple Speakers},
  author={Nachmani, Eliya and Adi, Yossi and Wolf, Lior},
  booktitle={Proceedings of the 37th International Conference on Machine Learning},
  year={2020}
}

If you find our dataset generation pipeline useful, please cite it as:

@inproceedings{chazan2021single,
  title={Single channel voice separation for unknown number of speakers under reverberant and noisy settings},
  author={Chazan, Shlomo E and Wolf, Lior and Nachmani, Eliya and Adi, Yossi},
  booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={3730--3734},
  year={2021},
  organization={IEEE}
}

License

This repository is released under the CC-BY-NC-SA 4.0 license, as found in the LICENSE file.

The files svoice/models/sisnr_loss.py and svoice/data/preprocess.py were adapted from the kaituoxu/Conv-TasNet repository, an unofficial implementation of the Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation paper, released under the MIT License. Additionally, several input manipulation functions were borrowed and modified from the yluo42/TAC repository, released under the CC BY-NC-SA 3.0 License.

svoice's People

Contributors

adiyoss, enk100, kaka2makaka, vxfla


svoice's Issues

What exactly does segment do?

Hi,

I'm curious what exactly does the segment variable control? I'm assuming it sets the length of the segment read while training, but how? If it's set to 4 does it only read the first 4 seconds of the file? or does it take a random 4 second portion? If it takes the first 4 seconds, does it then process the next 4 seconds or does it only ever read the first 4 seconds of the file?

Thanks :)

raise TypeError(msg) TypeError: 'required' is an invalid argument for positionals

I'm facing a problem while running the train.py file

[2020-12-28 16:29:58,757][__main__][INFO] - For logs, checkpoints and samples check /content/svoice/outputs/exp_
[2020-12-28 16:30:01,919][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 118, in main
    _main(args)
  File "train.py", line 112, in _main
    run(args)
  File "train.py", line 30, in run
    from svoice.solver import Solver
  File "/content/svoice/svoice/solver.py", line 21, in <module>
    from .separate import separate
  File "/content/svoice/svoice/separate.py", line 27, in <module>
    parser.add_argument("model_path", type=str, required=True, help="Model name")
  File "/usr/lib/python3.6/argparse.py", line 1329, in add_argument
    kwargs = self._get_positional_kwargs(*args, **kwargs)
  File "/usr/lib/python3.6/argparse.py", line 1441, in _get_positional_kwargs
    raise TypeError(msg)
TypeError: 'required' is an invalid argument for positionals

How can I solve it?

Can't separate using CPU while net trained on GPU

I have trained net with GPU,
I'm trying to separate some files from the mix folder:
!python -m svoice.separate /path/to/checkpoint.th /path/to/separated_output --mix_dir=/path/to/mix --device="cpu"

And I have error message:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

P.S. I have also tried different ways to specify cpu: --device=cpu, --device='cpu'.
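
The error message quoted above describes the standard PyTorch pattern for this case; a minimal, generic sketch of it (plain torch.load usage, not an svoice-specific fix) is:

# Generic PyTorch pattern from the error message: map CUDA tensors to the CPU.
import torch

pkg = torch.load("checkpoint.th", map_location=torch.device("cpu"))  # path is an example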

Assertion Error

Hi,
I'm trying to train on the AVSpeech dataset but have encountered the following error:

(virtualenv) (base) ed716@ed716:~/Documents/NewSSD/gated_DPRNN$ sudo python3 train.py                                   
[2021-06-14 20:23:48,612][__main__][INFO] - For logs, checkpoints and samples check /media/ed716/NewSSD/gated_DPRNN/outputs/exp_
[2021-06-14 20:23:48,612][__main__][DEBUG] - {'sample_rate': 8000, 'segment': 2, 'stride': 1, 'pad': True, 'cv_maxlen': 8, 'validfull': 1, 'num_prints': 5, 'device': 'cuda', 'num_workers': 5, 'verbose': 1, 'show': 0, 'checkpoint': True, 'continue_from': '', 'continue_best': False, 'restart': False, 'checkpoint_file': 'checkpoint.th', 'history_file': 'history.json', 'samples_dir': 'samples', 'seed': 2036, 'dummy': None, 'pesq': False, 'eval_every': 10, 'keep_last': 0, 'optim': 'adam', 'lr': 0.0005, 'beta2': 0.999, 'stft_loss': False, 'stft_sc_factor': 0.5, 'stft_mag_factor': 0.5, 'epochs': 100, 'batch_size': 1, 'max_norm': 5, 'lr_sched': 'step', 'step': {'step_size': 2, 'gamma': 0.98}, 'plateau': {'factor': 0.5, 'patience': 5}, 'model': 'swave', 'swave': {'N': 128, 'L': 8, 'H': 128, 'R': 6, 'C': 2, 'input_normalize': False}, 'ddp': False, 'ddp_backend': 'nccl', 'rendezvous_file': './rendezvous', 'rank': None, 'world_size': None, 'dset': {'train': '/media/ed716/NewSSD/gated_DPRNN/egs/debug/tr', 'valid': '/media/ed716/NewSSD/gated_DPRNN/egs/debug/cv', 'test': '/media/ed716/NewSSD/gated_DPRNN/egs/debug/tt', 'mix_json': '/media/ed716/NewSSD/gated_DPRNN/egs/debug/tr/mix.json', 'mix_dir': None}}
/home/ed716/virtualenv/lib/python3.6/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
  '"sox" backend is being deprecated. '
[2021-06-14 20:23:49,509][__main__][INFO] - Running on host ed716                                                       
[2021-06-14 20:23:51,179][svoice.solver][INFO] - ---------------------------------------------------------------------- 
[2021-06-14 20:23:51,179][svoice.solver][INFO] - Training...                                                            
[2021-06-14 20:24:05,661][svoice.solver][INFO] - Train | Epoch 1 | 40/200 | 2.8 it/sec | Loss 25.22286
[2021-06-14 20:24:20,099][svoice.solver][INFO] - Train | Epoch 1 | 80/200 | 2.8 it/sec | Loss 23.01586
[2021-06-14 20:24:34,336][svoice.solver][INFO] - Train | Epoch 1 | 120/200 | 2.8 it/sec | Loss 22.67385
[2021-06-14 20:24:48,426][svoice.solver][INFO] - Train | Epoch 1 | 160/200 | 2.8 it/sec | Loss 22.08590
[2021-06-14 20:25:02,552][svoice.solver][INFO] - Train | Epoch 1 | 200/200 | 2.8 it/sec | Loss 21.87745
[2021-06-14 20:25:02,553][svoice.solver][INFO] - Train Summary | End of Epoch 1 | Time 71.37s | Train Loss 21.87745
[2021-06-14 20:25:02,553][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-06-14 20:25:02,553][svoice.solver][INFO] - Cross validation...
[2021-06-14 20:25:02,775][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 119, in main
    _main(args)
  File "train.py", line 113, in _main
    run(args)
  File "train.py", line 94, in run
    solver.train()
  File "/media/ed716/NewSSD/gated_DPRNN/svoice/solver.py", line 131, in train
    valid_loss = self._run_one_epoch(epoch, cross_valid=True)
  File "/media/ed716/NewSSD/gated_DPRNN/svoice/solver.py", line 210, in _run_one_epoch
    sources, est_src, lengths)
  File "/media/ed716/NewSSD/gated_DPRNN/svoice/models/sisnr_loss.py", line 23, in cal_loss
    source_lengths)
  File "/media/ed716/NewSSD/gated_DPRNN/svoice/models/sisnr_loss.py", line 39, in cal_si_snr_with_pit
    assert source.size() == estimate_source.size()
AssertionError

The only variables I've changed were segment=2 and batch_size=1.
I've also tried segment=1, 3, and 4, and batch_size=2 and 4, but still get the same error.

The number of training sets printed in the log does not match

Hi, the actual size of my training set is 4000 and the validation and test sets are 500 each; why does the log print 1157 for the training set?


[2021-09-24 09:56:01,729][svoice.solver][INFO] - Training...
[2021-09-24 09:59:01,476][svoice.solver][INFO] - Train | Epoch 60 | 231/1157 | 1.3 it/sec | Loss -8.01471
[2021-09-24 10:02:00,937][svoice.solver][INFO] - Train | Epoch 60 | 462/1157 | 1.3 it/sec | Loss -7.95500
[2021-09-24 10:05:00,119][svoice.solver][INFO] - Train | Epoch 60 | 693/1157 | 1.3 it/sec | Loss -7.93800
[2021-09-24 10:07:59,123][svoice.solver][INFO] - Train | Epoch 60 | 924/1157 | 1.3 it/sec | Loss -7.92135
[2021-09-24 10:11:00,928][svoice.solver][INFO] - Train | Epoch 60 | 1155/1157 | 1.3 it/sec | Loss -7.90971
[2021-09-24 10:11:02,425][svoice.solver][INFO] - Train Summary | End of Epoch 60 | Time 900.70s | Train Loss -7.90990
[2021-09-24 10:11:02,426][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-09-24 10:11:02,426][svoice.solver][INFO] - Cross validation...
[2021-09-24 10:11:13,715][svoice.solver][INFO] - Valid | Epoch 60 | 100/500 | 9.1 it/sec | Loss -11.36663
[2021-09-24 10:11:24,161][svoice.solver][INFO] - Valid | Epoch 60 | 200/500 | 9.3 it/sec | Loss -11.31748
[2021-09-24 10:11:35,182][svoice.solver][INFO] - Valid | Epoch 60 | 300/500 | 9.2 it/sec | Loss -11.50216
[2021-09-24 10:11:46,996][svoice.solver][INFO] - Valid | Epoch 60 | 400/500 | 9.0 it/sec | Loss -11.34704
[2021-09-24 10:11:58,362][svoice.solver][INFO] - Valid | Epoch 60 | 500/500 | 9.0 it/sec | Loss -11.42437
[2021-09-24 10:11:58,363][svoice.solver][INFO] - Valid Summary | End of Epoch 60 | Time 956.63s | Valid Loss -11.42437
[2021-09-24 10:11:58,364][svoice.solver][INFO] - Learning rate adjusted: 0.00027
[2021-09-24 10:11:58,364][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-09-24 10:11:58,365][svoice.solver][INFO] - Evaluating on the test set...
[2021-09-24 10:12:11,507][svoice.evaluate][INFO] - Eval estimates | 100/500 | 7.8 it/sec
[2021-09-24 10:12:24,636][svoice.evaluate][INFO] - Eval estimates | 200/500 | 7.7 it/sec
[2021-09-24 10:12:37,091][svoice.evaluate][INFO] - Eval estimates | 300/500 | 7.8 it/sec
[2021-09-24 10:12:46,929][svoice.evaluate][INFO] - Eval estimates | 400/500 | 8.3 it/sec
[2021-09-24 10:12:56,445][svoice.evaluate][INFO] - Eval estimates | 500/500 | 8.7 it/sec
[2021-09-24 10:12:56,446][svoice.evaluate][INFO] - Eval metrics | 100/500 | 210968.5 it/sec
[2021-09-24 10:12:56,447][svoice.evaluate][INFO] - Eval metrics | 200/500 | 148216.4 it/sec
[2021-09-24 10:12:56,448][svoice.evaluate][INFO] - Eval metrics | 300/500 | 151577.1 it/sec
[2021-09-24 10:12:56,448][svoice.evaluate][INFO] - Eval metrics | 400/500 | 157615.6 it/sec
[2021-09-24 10:12:56,449][svoice.evaluate][INFO] - Eval metrics | 500/500 | 160678.0 it/sec
[2021-09-24 10:12:56,494][svoice.evaluate][INFO] - Test set performance: SISNRi=11.10 PESQ=0.0, STOI=0.0.
[2021-09-24 10:12:56,525][svoice.solver][INFO] - Separate and save samples...
100%|█████████████████████████████████████████████████████████████████████████████████| 125/125 [00:28<00:00, 4.41it/s]
[2021-09-24 10:13:24,919][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-09-24 10:13:24,920][svoice.solver][INFO] - Overall Summary | Epoch 60 | Train -7.90990 | Valid -11.42437 | Best -11.55503 | Sisnr 11.10285 | Pesq 0.00000 | Stoi 0.00000

About training process status

Hi, I have preprocessed the data in the required format and started training, but there is no live update on the number of iterations / loss. How do we check it in Hydra?

CUDA out of memory

Hello, I was training my toy dataset for 5-speaker separation.
My data was 200 audio files, so the total with spk1, spk2, ..., spk5, and the mix data is 1200 files.
I was around the 10th epoch of training when this happened.

[2021-01-28 16:59:10,810][svoice.evaluate][INFO] - Eval estimates | 40/200 | 1.1 it/sec                                                                                                                      
[2021-01-28 16:59:37,874][svoice.evaluate][INFO] - Eval estimates | 80/200 | 1.3 it/sec                                                                                                                      
[2021-01-28 17:00:01,654][svoice.evaluate][INFO] - Eval estimates | 120/200 | 1.4 it/sec                                                                                                                     
[2021-01-28 17:00:22,557][svoice.evaluate][INFO] - Eval estimates | 160/200 | 1.5 it/sec                                                                                                                     
[2021-01-28 17:00:40,419][svoice.evaluate][INFO] - Eval estimates | 200/200 | 1.6 it/sec                                                                                                                     
[2021-01-28 17:00:40,420][svoice.evaluate][INFO] - Eval metrics | 40/200 | 164561.2 it/sec
[2021-01-28 17:00:40,421][svoice.evaluate][INFO] - Eval metrics | 80/200 | 106802.5 it/sec
[2021-01-28 17:00:40,421][svoice.evaluate][INFO] - Eval metrics | 120/200 | 98584.1 it/sec
[2021-01-28 17:00:40,422][svoice.evaluate][INFO] - Eval metrics | 160/200 | 96469.0 it/sec
[2021-01-28 17:00:40,422][svoice.evaluate][INFO] - Eval metrics | 200/200 | 88817.4 it/sec
[2021-01-28 17:00:40,497][svoice.evaluate][INFO] - Test set performance: SISNRi=4.94 PESQ=0.0, STOI=0.0.
[2021-01-28 17:00:40,522][svoice.solver][INFO] - Separate and save samples...
  0%|                                                                                            | 0/50 [00:00<?, ?it/s]
[2021-01-28 17:00:41,458][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 118, in main
    _main(args)
  File "train.py", line 112, in _main
    run(args)
  File "train.py", line 93, in run
    solver.train()
  File "/home/donny.adhitama/dom_tools/svoice/svoice/solver.py", line 166, in train
    separate(self.args, self.model, self.samples_dir)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/separate.py", line 123, in separate
    estimate_sources = model(mixture)[-1]
File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 246, in forward
    output_all = self.separator(mixture_w)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 207, in forward
    output_all = self.rnn_model(enc_segments)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 108, in forward
    row_output = self.rows_grnn[i](row_input)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/dom_tools/svoice/svoice/models/swave.py", line 47, in forward
    gate_rnn_output, _ = self.gate_rnn(output)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/donny.adhitama/miniconda3/envs/svoice/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 577, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 4.17 GiB (GPU 0; 31.72 GiB total capacity; 10.77 GiB already allocated; 503.00 MiB free; 30.06 GiB reserved in total by PyTorch)

Any idea what was going on? Everything was fine until the 10th epoch.
Looking forward to your help. Thank you...

Where is the trained model saved after training? How do I pass a different path to save the model through the CLI?

Hi,

I followed the readme steps and I had no issues with training the model.
I want to evaluate the trained model but I'm not able to locate the path_to_model_dir to pass the saved model for evaluation.

List of commands I ran after clean setup:

./make_debug.sh
python train.py
python train.py lr=0.001
python train.py ddp=1

I had some issue with running the command below:

python train.py dset=librimix lr=0.001 swave.R=8
Error:
$ python train.py dset=librimix lr=0.001 swave.R=8
Traceback (most recent call last):
  File "train.py", line 126, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/hydra/main.py", line 32, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 346, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 347, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 100, in run
    cfg = self.compose_config(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 507, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/config_loader_impl.py", line 151, in load_configuration
    return self._load_configuration(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/config_loader_impl.py", line 256, in _load_configuration
    cfg = self._merge_defaults_into_config(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/config_loader_impl.py", line 805, in _merge_defaults_into_config
    hydra_cfg = merge_defaults_list_into_config(hydra_cfg, user_list)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/config_loader_impl.py", line 777, in merge_defaults_list_into_config
    merged_cfg = self._merge_config(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/config_loader_impl.py", line 715, in _merge_config
    raise MissingConfigException(msg, new_cfg, options)
hydra.errors.MissingConfigException: Could not load dset/librimix.
Available options:
	debug

I want to evaluate the trained models and I tried to look in the PWD and I did not find the model saved there.

python -m svoice.evaluate <**path to the model**> <path to folder containing mix.json and all target separated channels json files s<ID>.json>

What is the default path to the model?
How do I explicitly pass a path to model dir of my chosen dir?

ERROR:root:mic not in the room

Hi, I got the following error during the sample generation process. How can I solve it?
Noise files: https://www.openslr.org/28/ (RIRS_NOISES/pointsource_noises)

1,RuntimeWarning: invalid value encountered in sqrt
2,root:mic not in the room


0%| | 5/4000 [01:14<16:40:31, 15.03s/it]ERROR:root:mic not in the room
0%| | 6/4000 [01:29<16:39:22, 15.01s/it]/data/home/test/svoice/nprirgen/nprirgen.py:57: RuntimeWarning: invalid value encountered in sqrt
betaCoeffs = np.ones(6) * np.sqrt(1 - alpha)
1%| | 25/4000 [07:21<15:04:08, 13.65s/it]ERROR:root:mic not in the room
1%| | 40/4000 [08:21<4:02:05, 3.67s/it]ERROR:root:mic not in the room
1%| | 46/4000 [08:44<4:09:13, 3.78s/it]

There is a big difference between the training loss and valid loss

After I trained the model with the provided samples, I found that the valid loss is far better than the training loss,
such as below:

"train": -6.68186240196228,
"valid": -15.98158609867096,
"best": -15.98158609867096

Then I used my own audio dataset, and I also saw this situation after the 8th epoch.
Is this situation reasonable?

Non-English situation

Hi! Thank you for sharing your code.
We plan to use it to separate our mixed-speech in Japanese.
Is it necessary to fine-tune your model in Japanese dataset? and if yes, approximately how many hours of data is needed?

Thank you!

hydra.errors.MissingConfigException: Could not load hydra/hydra_logging/colorlog

Hi!

I am facing this error when running train.py on debug data.

Full error
(venv_s) noev@pd02-dgx-002:~/svoice$ HYDRA_FULL_ERROR=1 python train.py
Traceback (most recent call last):
  File "train.py", line 126, in <module>
    main()
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/main.py", line 37, in decorated_main
    strict=strict,
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/utils.py", line 347, in _run_hydra
    lambda: hydra.run(
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/utils.py", line 350, in <lambda>
    overrides=args.overrides,
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 104, in run
    run_mode=RunMode.RUN,
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 512, in compose_config
    from_shell=from_shell,
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 156, in load_configuration
    from_shell=from_shell,
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 262, in _load_configuration
    run_mode=run_mode,
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 804, in _merge_defaults_into_config
    hydra_cfg = merge_defaults_list_into_config(hydra_cfg, system_list)
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 783, in merge_defaults_list_into_config
    package_override=default1.package,
  File "/home/noev/venv_s/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 715, in _merge_config
    raise MissingConfigException(msg, new_cfg, options)
hydra.errors.MissingConfigException: Could not load hydra/hydra_logging/colorlog

How can I solve it?

Thanks!

Validation loss starts increasing / goes to NaN

Hello,
When trying to train your model on data from the LibriSpeech corpus (custom created by me, and working well with other models), the validation loss decreases well for a few epochs, and then starts increasing fast until, after 10-20 epochs, the loss goes to NaN and an error occurs. Any idea what I am doing wrong? The speech data includes reverberation and noise, if it matters.

I haven't changed the config you provided much; this is the config.yaml I use:

defaults:
  - dset: libri
  - hydra/job_logging: colorlog
  - hydra/hydra_logging: colorlog

# Dataset related
sample_rate: 16000
segment: 4
stride: 1 # in seconds, how much to stride between training examples
pad: true # if training sample is too short, pad it
cv_maxlen: 8
validfull: 1 # use entire samples at valid

# Logging and printing, and does not impact training
num_prints: 5
device: cuda
num_workers: 4
verbose: 0
show: 0 # just show the model and its size and exit

# Checkpointing, by default automatically load last checkpoint
checkpoint: True
continue_from: '' # Only pass the name of the exp, like exp_dset=wham
                  # this arg is ignored for the naming of the exp!
continue_best: True
restart: False # Ignore existing checkpoints
checkpoint_file: checkpoint.th
history_file: history.json
samples_dir: samples

# Other stuff
seed: 2036
dummy: # use this if you want twice the same exp, with a name

# Evaluation stuff
pesq: false # compute pesq?
eval_every: 100
keep_last: 0

# Optimization related
optim: adam
lr: 5e-4
beta2: 0.999
stft_loss: False
stft_sc_factor: .5
stft_mag_factor: .5
epochs: 100
batch_size: 2
max_norm: 5

# learning rate scheduling
lr_sched: step # can be either step or plateau
step:
  step_size: 2
  gamma: 0.98
plateau:
  factor: 0.5
  patience: 4

# Models
model: swave # either demucs or dwave
swave:
  N: 128
  L: 16
  H: 128
  R: 6
  C: 2
  input_normalize: False

# Experiment launching, distributed
ddp: false
ddp_backend: nccl
rendezvous_file: ./rendezvous

# Internal config, don't set manually
rank:
world_size:

# Hydra config
hydra:
  run:
    dir: ./outputs/exp_${hydra.job.override_dirname}
  job:
    config:
      # configuration for the ${hydra.job.override_dirname} runtime variable
      override_dirname:
        kv_sep: '='
        item_sep: ','
        # Remove all paths, as the / in them would mess up things
        # Remove params that would not impact the training itself
        # Remove all slurm and submit params.
        # This is ugly I know...
        exclude_keys: [
          'hydra.job_logging.handles.file.filename',
          'dset.train', 'dset.valid', 'dset.test', 'dset.mix_json', 'dset.mix_dir',
          'num_prints', 'continue_from',
          'device', 'num_workers', 'print_freq', 'restart', 'verbose',
          'log', 'ddp', 'ddp_backend', 'rendezvous_file', 'rank', 'world_size']
  job_logging:
    handlers:
      file:
        class: logging.FileHandler
        mode: w
        formatter: colorlog
        filename: trainer.log
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr

  hydra_logging:
    handlers:
      console:
        class: logging.StreamHandler
        formatter: colorlog
        stream: ext://sys.stderr

GPU memory requirements

Hi,

I'm trying to train on the included "toy" example and I'm running out of GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate 316.00 MiB (GPU 0; 7.80 GiB total capacity; 6.31 GiB already allocated; 257.62 MiB free; 6.41 GiB reserved in total by PyTorch)

I'm not sure why PyTorch is reserving so much memory. I even tried lowering the batch size as was recommended in #14 but in that case PyTorch reserved even more memory. Is it possible to tell PyTorch to not reserve so much memory? or is 8GB simply not enough? I want to train a 5 source model based on LibriMix. I'm not sure I can afford any GPUs that have 32GB, but what if I added an additional 16GB card? Could the memory requirements be divided by more than one card, or would they all need large reservations by PyTorch? My system itself is using very little memory:
nvidia-smi

If the user from #14 is having OOM issues with 32GB, I'm afraid using anything less won't be feasible. Any suggestions?

Cuda OOM when using ddp

I am training with a custom dataset for 5 speakers. I have 4 GPUs for training. The current batch size is 4 and after the 1st epoch, I get CUDA OOM. Do I also have to modify the swave parameters to lower the network load? My current swave params are as follows:

swave:
  N: 128
  L: 8
  H: 128
  R: 6
  C: 5
  input_normalize: False

google colab problem appear ModuleNotFoundError: No module named 'svoice'

google colab

Input:

!python /content/drive/MyDrive/test_2/svoice-master/svoice/separate.py /content/drive/MyDrive/test_2/svoice-master/outputs/exp_/checkpoint.th /content/drive/MyDrive/test_2/svoice-master/out_data_1 --mix_dir=/content/drive/MyDrive/test_2/svoice-master/op11

Output:

Traceback (most recent call last):
  File "/content/drive/MyDrive/test_2/svoice-master/svoice/separate.py", line 133, in <module>
    separate(args, local_out_dir=args.out_dir)
  File "/content/drive/MyDrive/test_2/svoice-master/svoice/separate.py", line 88, in separate
    pkg = torch.load(args.model_path)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 584, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 842, in _load
    result = unpickler.load()
ModuleNotFoundError: No module named 'svoice'

Why would it return ModuleNotFoundError: No module named 'svoice'?

When executing "svoice.separate", "CUDA out of memory" occurs

Hi, I have a large file to be separated (22 MB). When I execute svoice.separate, "CUDA out of memory" occurs. How can I solve it?

When the separated files are relatively small, there is no error.


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 90.37it/s]
0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/python3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/home/test/svoice/svoice/separate.py", line 133, in <module>
    separate(args, local_out_dir=args.out_dir)
  File "/data/home/test/svoice/svoice/separate.py", line 123, in separate
    estimate_sources = model(mixture)[-1]
  File "/data/home/test/svoice/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/home/test/svoice/svoice/models/swave.py", line 253, in forward
    output_all = self.separator(mixture_w)
  File "/data/home/test/svoice/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/home/test/svoice/svoice/models/swave.py", line 214, in forward
    output_all = self.rnn_model(enc_segments)
  File "/data/home/test/svoice/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/home/test/svoice/svoice/models/swave.py", line 108, in forward
    row_output = self.rows_grnn[i](row_input)
  File "/data/home/test/svoice/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/home/test/svoice/svoice/models/swave.py", line 43, in forward
    rnn_output, _ = self.rnn(output)
  File "/data/home/test/svoice/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/home/test/svoice/venv/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 576, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: CUDA out of memory. Tried to allocate 5.42 GiB (GPU 0; 15.78 GiB total capacity; 9.66 GiB already allocated; 3.62 GiB free; 11.00 GiB reserved in total by PyTorch)

IndexError: list index out of range

python -m svoice.data.audio $spk1 > $out/s1.json
Traceback (most recent call last):
  File "/home/sachin/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sachin/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sachin/svoice/svoice/data/audio.py", line 87, in <module>
    json.dump(find_audio_files(sys.argv[1]), sys.stdout, indent=4)
IndexError: list index out of range

Separation of mix wav files not working

Hi, I tried to use the CLI interface to run the model and separate the speakers in the audio input file.
I believe there are some errors in passing arguments that I do not understand.

below is the recommended code to achieve separation:

python -m svoice.separate --model_path=<path to the model> --mix_dir=<path to the dir with the mixture files> --out_dir=<path to store the separated files>

I tried it and I got following error:

Speech separation using MulCat blocks: error: the following arguments are required: model_path, out_dir

I removed the -- before the argument specification and I got another error saying the model dir does not exist.

Please check the below code execution to reproduce the issue.

$ python -m svoice.separate --model_path=outputs/exp_/checkpoint.th --mix_dir=/home/sachin/Desktop/mindset/mix --out_dir=/home/sachin/Desktop/mindset/s1
usage: Speech separation using MulCat blocks [-h] [--mix_dir MIX_DIR] [--mix_json MIX_JSON] [--device DEVICE] [--sample_rate SAMPLE_RATE] [--batch_size BATCH_SIZE] [-v] model_path out_dir
Speech separation using MulCat blocks: error: the following arguments are required: model_path, out_dir

$ python -m svoice.separate model_path=outputs/exp_/checkpoint.th --mix_dir=/home/sachin/Desktop/mindset/mix --out_dir=/home/sachin/Desktop/mindset/s1
usage: Speech separation using MulCat blocks [-h] [--mix_dir MIX_DIR] [--mix_json MIX_JSON] [--device DEVICE] [--sample_rate SAMPLE_RATE] [--batch_size BATCH_SIZE] [-v] model_path out_dir
Speech separation using MulCat blocks: error: the following arguments are required: out_dir

$ python -m svoice.separate model_path=outputs/exp_/checkpoint.th --mix_dir=/home/sachin/Desktop/mindset/mix out_dir=/home/sachin/Desktop/mindset/s1
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sachin/svoice/svoice/separate.py", line 133, in <module>
    separate(args, local_out_dir=args.out_dir)
  File "/home/sachin/svoice/svoice/separate.py", line 88, in separate
    pkg = torch.load(args.model_path)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 571, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 229, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 210, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'model_path=outputs/exp_/checkpoint.th'
$

create data for more than 2 mix

Hello,
can you please guide me on how to add data in the mix directory for more than 2 speakers, e.g. 3 or 4 speakers of overlapped audio with different overlap ratios (like 50, 80, or 100)?
I can see a pattern in the given data: a file_name.wav in mix, and s1 and s2 both have the same file_name. But what if I want to create more than one mix file, like 50% overlapped or 100% overlapped; how can I do that?

cannot create mix.json (0it [00:00, ?it/s])

Creating a mix.json doesn't work for me, instead outputting 0it [00:00, ?it/s].

Can you give an example of how the json should look? Then I'll be able to populate it manually.

thanks

Which one should be specified in mix_json in debug.yaml?

Hi, I have generated the following data set; the speakers in train, valid, and test are all independent. Which one should be specified in mix_json in debug.yaml?

egs/test/
├── cv
│   ├── mix.json
│   ├── s1.json
│   └── s2.json
├── tr
│   ├── mix.json
│   ├── s1.json
│   └── s2.json
└── tt
    ├── mix.json
    ├── s1.json
    └── s2.json

Run separate on cpu only device

Hey, I'm looking to run the separate script on a CPU-only device. I have set map_location to cpu in site-packages/torch/serialization.py but still get the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
I would appreciate any advice thanks :) also the program is amazing.

question about evaluation result

Hi everyone.

I trained the model with the code in this repository, but I obtained poor results.

I used only the WSJ-2mix dataset you linked.

In your paper, the metric SI-SNRi is 20.1 on the 2-speaker task, but when I trained this model, it was only 7.24 at epoch 40.
At the same epoch, the train and valid losses are -11.209 and -19.860, which I think is right, and the separated samples sounded good to me.
I just wonder why the score is low.

I didn't change the configuration file except for batch-size, just I ran the training source code.
Is there anything I'm missing?

Thank you.

Do non-English training samples, synthetic samples and model configuration need special adjustments?

For non-English training samples, such as Mandarin Chinese, do you need to make any special adjustments when synthesizing samples and configuring the model?

Using the samples generated by scripts/make_dataset.py, the speakers in tr, cv, and tt are mutually exclusive.
  1. 2 speakers
  2. the samples contain noise
  3. Mandarin Chinese
  4. 4 cards per machine
  5. tr: 4000, cv: 500, tt: 500

after 10 rounds of training, sisnr is only 3.6.


2021/09/24 16:08:25 [2021-09-24 16:08:24,686][svoice.solver][INFO] - Overall Summary | Epoch 10 | Train 5.20729 | Valid 9.05549 | Best 9.05549 | Sisnr 3.60169 | Pesq 0.00000 | Stoi 0.00000

2021/09/24 17:02:53 [2021-09-24 17:02:52,354][svoice.solver][INFO] - Overall Summary | Epoch 20 | Train 4.25031 | Valid 9.04332 | Best 8.89767 | Sisnr 3.75566 | Pesq 0.00000 | Stoi 0.00000

make_dataset:IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Hi, when I used https://www.openslr.org/28/ (RIRS_NOISES/pointsource_noises) and my own audio files to generate the data set, the following error occurred. But when I change the noise to "wham_noise/tr", it succeeds. How can I solve it?


0%| | 0/10 [00:45<?, ?it/s]
Traceback (most recent call last):
  File "scripts/make_dataset.py", line 221, in <module>
    main(args)
  File "scripts/make_dataset.py", line 206, in main
    Data.gen_scene(args.num_of_speakers, i, args.out_path)
  File "scripts/make_dataset.py", line 171, in gen_scene
    noise = self.fetch_noise()
  File "scripts/make_dataset.py", line 151, in fetch_noise
    return s[0:self.size_of_signals, 0]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

How to consider noise added mixtures in training?

I am trying to train on the mix_both (utterances + noise) subset of the Libri2mix dataset. The vocal utterances are from the Librispeech dataset and the noise wavs are from the WHAM dataset.
How should I handle the added noise wavs? Should I treat them as a third speaker s3 and train a 3mix model, or just train a 2mix model as with mix_clean?

AssertionError when training for 3 speakers

Hello,

I am training the model on the Librimix 3-speaker dataset. I've also modified the make_debug.sh file to generate the .json files in the svoice/egs/debug/tr/ directory. However, I am facing the error shown below (the model trains perfectly for a dataset of 2 speakers).

I would be grateful if someone could point me in a direction towards addressing the issue.

[2021-11-15 13:55:47,256][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 118, in main
    _main(args)
  File "train.py", line 112, in _main
    run(args)
  File "train.py", line 93, in run
    solver.train()
  File "/workspace/svoice/svoice/solver.py", line 122, in train
    train_loss = self._run_one_epoch(epoch)
  File "/workspace/svoice/svoice/solver.py", line 210, in _run_one_epoch
    sources, est_src, lengths)
  File "/workspace/svoice/svoice/models/sisnr_loss.py", line 23, in cal_loss
    source_lengths)
  File "/workspace/svoice/svoice/models/sisnr_loss.py", line 39, in cal_si_snr_with_pit
    assert source.size() == estimate_source.size()
AssertionError

error while trying to train

Hi,

I overcame my OOM problem (from #24) while trying to train the included debug set. This was accomplished by setting R=2 and segment=2. I'm now trying to train using the librimix dataset but have encountered the following error:

(svoice) user@system:/media/user/svoice/svoice/svoice$ python train.py sample_rate=16000 dset=libri4mix segment=2 verbose=1
[2021-04-10 17:18:56,680][__main__][INFO] - For logs, checkpoints and samples check /media/user/svoice/svoice/svoice/outputs/exp_dset=libri4mix,sample_rate=16000,segment=2
[2021-04-10 17:18:56,680][__main__][DEBUG] - {'sample_rate': 16000, 'segment': 2, 'stride': 1, 'pad': True, 'cv_maxlen': 8, 'validfull': 1, 'num_prints': 5, 'device': 'cuda', 'num_workers': 5, 'verbose': 1, 'show': 0, 'checkpoint': True, 'continue_from': '', 'continue_best': False, 'restart': False, 'checkpoint_file': 'checkpoint.th', 'history_file': 'history.json', 'samples_dir': 'samples', 'seed': 2036, 'dummy': None, 'pesq': False, 'eval_every': 10, 'keep_last': 0, 'optim': 'adam', 'lr': 0.0005, 'beta2': 0.999, 'stft_loss': False, 'stft_sc_factor': 0.5, 'stft_mag_factor': 0.5, 'epochs': 100, 'batch_size': 4, 'max_norm': 5, 'lr_sched': 'step', 'step': {'step_size': 2, 'gamma': 0.98}, 'plateau': {'factor': 0.5, 'patience': 5}, 'model': 'swave', 'swave': {'N': 128, 'L': 8, 'H': 128, 'R': 2, 'C': 2, 'input_normalize': False}, 'ddp': False, 'ddp_backend': 'nccl', 'rendezvous_file': './rendezvous', 'rank': None, 'world_size': None, 'dset': {'train': '/media/user/svoice/svoice/svoice/egs/libri4mix/tr', 'valid': '/media/user/svoice/svoice/svoice/egs/libri4mix/tr', 'test': '/media/user/svoice/svoice/svoice/egs/libri4mix/tr', 'mix_json': '/media/user/svoice/svoice/svoice/egs/libri4mix/tr/mix.json', 'mix_dir': None}}
[2021-04-10 17:18:57,312][__main__][INFO] - Running on host system
[2021-04-10 17:18:59,374][svoice.solver][INFO] - ----------------------------------------------------------------------
[2021-04-10 17:18:59,374][svoice.solver][INFO] - Training...
[2021-04-10 17:18:59,727][__main__][ERROR] - Some error happened
Traceback (most recent call last):
  File "train.py", line 118, in main
    _main(args)
  File "train.py", line 112, in _main
    run(args)
  File "train.py", line 93, in run
    solver.train()
  File "/media/user/svoice/svoice/svoice/svoice/solver.py", line 122, in train
    train_loss = self._run_one_epoch(epoch)
  File "/media/user/svoice/svoice/svoice/svoice/solver.py", line 210, in _run_one_epoch
    sources, est_src, lengths)
  File "/media/user/svoice/svoice/svoice/svoice/models/sisnr_loss.py", line 23, in cal_loss
    source_lengths)
  File "/media/user/svoice/svoice/svoice/svoice/models/sisnr_loss.py", line 39, in cal_si_snr_with_pit
    assert source.size() == estimate_source.size()
AssertionError
(svoice) user@system:/media/user/svoice/svoice/svoice$

I've generated the relevant json files for the wavs and created the corresponding config file in the dset/ directory. The only variables I've changed were R=2, sample_rate=16000, dset=libri4mix, and segment=2. I'm considering renting a cloud instance with a GPU that has enough memory to train the model with the proper R and segment values, but I'd like to know beforehand that there aren't going to be any errors like this.

Run train.py script with torch 1.9.0

Hey there,

I got this error while running the train.py script with torch 1.9.0

Traceback (most recent call last):
  File "train.py", line 118, in main
    _main(args)
  File "train.py", line 112, in _main
    run(args)
  File "train.py", line 93, in run
    solver.train()
  File "/workspace/svoice/svoice/solver.py", line 122, in train
    train_loss = self._run_one_epoch(epoch)
  File "/workspace/svoice/svoice/solver.py", line 210, in _run_one_epoch
    sources, est_src, lengths)
  File "/workspace/svoice/svoice/models/sisnr_loss.py", line 23, in cal_loss
    source_lengths)
  File "/workspace/svoice/svoice/models/sisnr_loss.py", line 46, in cal_si_snr_with_pit
    estimate_source *= m
RuntimeError: Output 0 of UnbindBackward is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

Any help ?

How to compare results of a new model with yours?

In your results you show a table of the SI-SNRi over the input mixture. In the table it says that for 2spk you obtained 20.1. I have used the code you provide in the repo and the exact commands you provide in the Readme, and when I evaluated the model it gives me as result:
INFO:main:Test set performance: SISNRi=2.22 PESQ=0.0, STOI=0.0.
{"sisnr": 2.2177312467247248, "pesq": 0.0, "stoi": 0.0}.
Which obviously is not the same as the result from the table for 2spk...
I have built a little dataset, similar in number of voices to the one provided with the code, for each number of speakers: 2, 3, 4, and 5. These are from people speaking in Spanish, and I want to compare, after training, the performance with the results you got from a dataset of English speakers. How could I do this?

Why is the model setup as it is, especially the final decoding process?

Hi everyone,

I've been trying to get a grip on machine learning and speech separation. So I thought it'd be a good exercise to manage to understand how this particular setup works. Initially, I did so by going over the paper about this model, (https://arxiv.org/pdf/2003.01531.pdf) but its descriptions proved somewhat ambiguous from time-to-time. Once I found the code on github, analyzing that answered most of my questions on the exact setup that was used. So I feel I have mostly a grip on the 'how'. What I can't seem to wrap my head around is the 'why', and with the answer eluding me when looking it up for myself, I thought I would just ask.

As the title mentioned, it is mostly the final decoding that mystifies me. Here's a longwinded description of how it goes, as far as I can follow along.


-In the separator subclass, the forward function first calls the DPMulCat rnn which outputs the expanded blocks after they've been Prelu'd and through filter D. They are size N(filters) x CR (chunks for all voices) x K (time-series segments).

-These are then fed into MergeChunks function which transposes them (N x K x CR) and reshapes them. (N x 0.5K x 2CR) Through this reshaping process the first chunk of new_K x new_CR will have the top rows filled with the information from the first old_K x old_CR chunk, with the second one filling the rows at the bottom. So all the information is already somewhat scrambled.

-Then it takes the first half of columns in new_K x new_CR, flattens this matrix and removes the first part, and sums it with the final half of columns in this matrix, also flattened but with the final part removed. It does so for all N filters, getting the desired form of information N x CT', but having it be even more scrambled.

-These tensors are passed on to the an SWave class object, which first reshapes them to have an extra dimension for the speakers, which then all have an N x T' matrix attributed to them. This would attribute the top rows in the N x CT' matrix to the first speaker, rows below that to the second and so on. This information is scrambled again.

-After this step, the information is passed to the decoder object, which transposes it again (T' x N), then averages the values along the rows N by steps of the original kernel size L used to 1D-convolve the audio signal (T' x N/L), and only after this is the actual overlap_and_add function used along the column dimension of T' with a step size of 1/2L to get back to a signal of approximately length T.

-This signal is then padded back in the SWave object to the original length, although I saw no means to deal with a signal that is longer than the original signal which could occur, I think, if N/L > L


So I can follow along fairly well with what it does, mostly. But WHY it would be set up like this is a complete mystery to me.

Wouldn't doing it like this have the learning model not only have to learn how to actually separate the signals, but also how to recode them to fit the decoding process, giving it double work and therefore making it learn slower? Or was this done on purpose to prevent the system from being 'lazy' during the learning process which somehow results in it being more accurate?

The same goes for the MULCAT blocks. Is there a justification for their structure? Is the structure of a mixed audio signal such that, {adding two LSTM blocks, putting a fully connected layer afterwards to get back your original size, then multiplying the results of the two element-wise, then concatenating it with the original input, then putting another fully connected layer afterwards to get to the original size, and then adding the original block element-wise, then transposing and doing this over-and-over,} was expected to improve the model over Luo, Y., Chen, Z., and Yoshioka, T. 2019c(arXiv:1910.06379)? Or was it just empirically determined by trying lots of setups?

Pre-trained models

Thank you for making this repository public. This model looks much better than the other models it was compared against.

When are you making pre-trained models available on the repo?
