
open-unmix-pytorch's Introduction

Open-Unmix for PyTorch



This repository contains the PyTorch (1.8+) implementation of Open-Unmix, a deep neural network reference implementation for music source separation, applicable for researchers, audio engineers and artists. Open-Unmix provides ready-to-use models that allow users to separate pop music into four stems: vocals, drums, bass and the remaining other instruments. The models were pre-trained on the freely available MUSDB18 dataset. See details at apply pre-trained model.

⭐️ News

  • 16/04/2024: We brought the repo to torch 2.0 level. Everything seems to work fine again, but we needed to relax the regression tests. With the most recent version, results are slightly different, so be warned when running the unit tests.

  • 03/07/2021: We added umxl, a model that was trained on extra data which significantly improves the performance, especially generalization.

  • 14/02/2021: We released the new version of open-unmix as a Python package. This comes with: a fully differentiable version of norbert, an improved audio loading pipeline and a large number of bug fixes. See the release notes for further info.

  • 06/05/2020: We added a pre-trained speech enhancement model umxse provided by Sony.

  • 13/03/2020: Open-unmix was awarded 2nd place in the PyTorch Global Summer Hackathon 2020.

Related Projects: open-unmix-pytorch | open-unmix-nnabla | musdb | museval | norbert

🧠 The Model (for one source)

To perform separation into multiple sources, Open-Unmix comprises multiple models that are trained for each particular target. While this makes training less convenient, it allows great flexibility in customizing the training data for each target source.

Each Open-Unmix source model is based on a three-layer bidirectional deep LSTM. The model learns to predict the magnitude spectrogram of a target source, like vocals, from the magnitude spectrogram of a mixture input. Internally, the prediction is obtained by applying a mask on the input. The model is optimized in the magnitude domain using mean squared error.

Input Stage

Open-Unmix operates in the time-frequency domain to perform its prediction. The input of the model is either:

  • models.Separator: A time domain signal tensor of shape (nb_samples, nb_channels, nb_timesteps), where nb_samples are the samples in a batch, nb_channels is 1 or 2 for mono or stereo audio, respectively, and nb_timesteps is the number of audio samples in the recording. In this case, the model computes STFTs on the fly with either torch or asteroid_filterbanks.

  • models.OpenUnmix: The core open-unmix model takes magnitude spectrograms directly (e.g. when pre-computed and loaded from disk). In that case, the input is of shape (nb_samples, nb_channels, nb_bins, nb_frames), where nb_bins and nb_frames are the frequency and time dimensions of a Short-Time Fourier Transform (see the sketch below).
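Below is a minimal sketch of the two entry points, assuming the packaged loader functions described later in this README (openunmix.umxl and openunmix.umxl_spec); the keyword arguments are illustrative rather than a verified signature.

import torch
import openunmix

# waveform entry point (models.Separator): time-domain signal in, separated waveforms out
separator = openunmix.umxl(device="cpu")
mixture = torch.rand(1, 2, 44100 * 5)   # (nb_samples, nb_channels, nb_timesteps)
estimates = separator(mixture)          # stacked waveform estimates for all targets

# spectrogram entry point (models.OpenUnmix): magnitude spectrograms in/out,
# shaped (nb_samples, nb_channels, nb_bins, nb_frames) as described above;
# umxl_spec is assumed to return one core model per target
spec_models = openunmix.umxl_spec(device="cpu")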

The input spectrogram is standardized using the global mean and standard deviation for every frequency bin across all frames. Furthermore, we apply batch normalization in multiple stages of the model to make the training more robust against gain variation.

Dimensionality reduction

The LSTM does not operate at the original input spectrogram resolution. Instead, in the first step after normalization, the network learns to compress the frequency and channel axes of the input to reduce redundancy and make the model converge faster.

Bidirectional-LSTM

The core of open-unmix is a three-layer bidirectional LSTM network. Due to its recurrent nature, the model can be trained and evaluated on audio signals of arbitrary length. Since the model takes information from the past and the future simultaneously, it cannot be used in an online/real-time manner. A uni-directional model can easily be trained as described here.

Output Stage

After applying the LSTM, the signal is decoded back to its original input dimensionality. In the last step, the output is multiplied with the input magnitude spectrogram, so that the model effectively learns a mask.

🤹‍♀️ Putting source models together: the Separator

models.Separator puts together an Open-Unmix spectrogram model for each desired target and combines their outputs through a multichannel generalized Wiener filter, before applying inverse STFTs using torchaudio. The filtering is a differentiable (but parameter-free) version of norbert. The separator is currently only used during inference.
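As a hedged example of how the separator wraps multiple target models (the targets and residual keyword arguments are assumptions about the loader signature, not a verified API), a separator restricted to two targets plus a residual source could be built like this:

import openunmix

# build a Separator for a subset of targets; with residual=True the remaining
# energy of the mixture is returned as an additional source
separator = openunmix.umxl(targets=["vocals", "drums"], residual=True, device="cpu")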

🏁 Getting started

Installation

openunmix can be installed from pypi using:

pip install openunmix

Note that the pypi version of openunmix uses torchaudio to load and save audio files. To increase the number of supported input and output file formats (such as STEMS export), please additionally install stempeg.

Training is not part of the open-unmix package, please follow docs/training.md for more information.

Using Docker

We also provide a docker container. Separating a local track in ~/Music/track1.wav can be done in a single line:

docker run -v ~/Music/:/data -it faroit/open-unmix-pytorch "/data/track1.wav" --outdir /data/track1

Pre-trained models

We provide three core pre-trained music separation models. All three models are end-to-end models that take waveform inputs and output the separated waveforms.

  • umxl (default) trained on a private dataset of compressed stems. Note that the weights are only licensed for non-commercial use (CC BY-NC-SA 4.0).

    DOI

  • umxhq trained on MUSDB18-HQ, which comprises the same tracks as MUSDB18 but uncompressed, yielding a full bandwidth of 22050 Hz.

    DOI

  • umx is trained on the regular MUSDB18, which is bandwidth-limited to 16 kHz due to AAC compression. This model should be used for comparison with other (older) methods evaluated in SiSEC18.

    DOI

Furthermore, we provide a model for speech enhancement trained by Sony Corporation.

All four models are also available as spectrogram (core) models, which take magnitude spectrogram inputs and output separated spectrograms. These models can be loaded using umxl_spec, umxhq_spec, umx_spec and umxse_spec.

To separate audio files (wav, flac, ogg - but not mp3), just run:

umx input_file.wav

A more detailed list of the parameters used for the separation is given in the inference.md document.

We provide a Jupyter notebook on Google Colab to experiment with open-unmix and to separate files online without any local installation.

Using pre-trained models from within python

We implemented several ways to load pre-trained models and use them from within your Python projects:

When the package is installed

Loading a pre-trained model is as simple as:

separator = openunmix.umxl(...)

torch.hub

We also provide torch.hub compatible modules that can be loaded. Note that this does not even require installing the open-unmix package and should generally work as long as the PyTorch versions match.

separator = torch.hub.load('sigsep/open-unmix-pytorch', 'umxl', device=device)

Here, umxl specifies the pre-trained model.

Performing separation

With a created separator object, one can perform separation of some audio (a torch.Tensor of shape (channels, length), provided at the sampling rate separator.sample_rate) through:

estimates = separator(audio, ...)
# returns estimates as tensor

Note that this requires the audio to be in the right shape and at the right sampling rate. For convenience we provide a pre-processing function, openunmix.utils.preprocess(...), that takes numpy audio and converts it for use with open-unmix.
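A rough sketch, reusing the separator object from above (the preprocess arguments are inferred from the description here and may not match the exact signature):

import torchaudio
from openunmix import utils

audio, rate = torchaudio.load("input_file.wav")               # (channels, length)
# bring the audio to (nb_samples, nb_channels, nb_timesteps) at the model rate;
# if preprocess expects numpy input, pass audio.numpy() instead
audio = utils.preprocess(audio, rate, separator.sample_rate)
estimates = separator(audio)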

One-liner

To perform model loading, preprocessing and separation in one step, just use:

from openunmix.predict import separate
estimates = separate(audio, ...)
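For example, assuming separate() accepts the sampling rate as a keyword and returns a dictionary mapping target names to batched waveform tensors (both are assumptions based on the description above), the estimates could be written to disk like this:

import torch
import torchaudio
from openunmix import predict

audio, rate = torchaudio.load("input_file.wav")
estimates = predict.separate(audio, rate=rate)

for target, estimate in estimates.items():
    # drop the batch dimension before writing; the music models operate at 44100 Hz
    torchaudio.save(f"{target}.wav", torch.squeeze(estimate, 0).cpu(), sample_rate=44100)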

Load user-trained models

When a path instead of a model name is provided to --model, a pre-trained Separator will be loaded from disk. E.g. the following files are assumed to be present when loading --model mymodel --targets vocals:

  • mymodel/separator.json
  • mymodel/vocals.pth
  • mymodel/vocals.json

Note that the separator usually joins multiple models, one for each target, and performs separation using all of them. E.g. if the separator contains vocals and drums models, two output files are generated, unless the --residual option is selected, in which case an additional source will be produced, containing an estimate of everything in the mixture that is not one of the targets.
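From Python, a user-trained separator can presumably be loaded with the same helper the CLI uses; the function name and arguments below are assumptions, so check openunmix/utils.py if they differ:

from openunmix import utils

# load a user-trained separator from disk (a path instead of a model name)
separator = utils.load_separator(
    model_str_or_path="mymodel",
    targets=["vocals"],
    residual=True,   # also output an estimate of everything that is not a target
    device="cpu",
)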

Evaluation using museval

To perform evaluation in comparison to other SiSEC systems, you would need to install the museval package using

pip install museval

and then run the evaluation using

python -m openunmix.evaluate --outdir /path/to/musdb/estimates --evaldir /path/to/museval/results

Results compared to SiSEC 2018 (SDR/Vocals)

Open-Unmix yields state-of-the-art results compared to participants from SiSEC 2018. The performance of UMXHQ and UMX is almost identical since both were evaluated on the compressed STEMS.

[Figure: boxplot comparing vocals SDR of Open-Unmix (UMX/UMXHQ) with SiSEC 2018 submissions]

Note that

  1. [STL1, TAK2, TAK3, TAU1, UHL3, UMXHQ] were omitted as they were not trained on only MUSDB18.
  2. [HEL1, TAK1, UHL1, UHL2] are not open-source.

Scores (Median of frames, Median of tracks)

target   UMX (SDR)   UMXHQ (SDR)   UMXL (SDR)
vocals   6.32        6.25          7.21
bass     5.23        5.07          6.02
drums    5.73        6.04          7.15
other    4.02        4.28          4.89

Training

Details on the training are provided in a separate document here.

Extensions

Details on how open-unmix can be extended or improved for future research on music separation are described in a separate document here.

Design Choices

We favored simplicity over performance to promote clarity of the code. The rationale is to have open-unmix serve as a baseline for future research while its performance still meets the current state of the art (see Evaluation). The results are comparable to or better than those of UHL1/UHL2, which obtained the best performance over all systems trained on MUSDB18 in the SiSEC 2018 evaluation campaign. We designed the code to allow researchers to reproduce existing results, quickly develop new architectures and add their own data for training and testing. We favored framework-specific implementations instead of a monolithic repository with common code for all frameworks.

How to contribute

open-unmix is a community-focused project; we therefore encourage the community to submit bug fixes and requests for technical support through GitHub issues. For more details on how to contribute, please follow our CONTRIBUTING.md. For help and support, please use the gitter chat or the Google Groups forum.

Authors

Fabian-Robert Stöter, Antoine Liutkus, Inria and LIRMM, Montpellier, France

References

If you use open-unmix for your research, cite Open-Unmix:
@article{stoter19,
  author  = {F.-R. St\"oter and S. Uhlich and A. Liutkus and Y. Mitsufuji},
  title   = {Open-Unmix - A Reference Implementation for Music Source Separation},
  journal = {Journal of Open Source Software},
  year    = {2019},
  doi     = {10.21105/joss.01667},
  url     = {https://doi.org/10.21105/joss.01667}
}

If you use the MUSDB18 dataset for your research, cite the MUSDB18 dataset:

@misc{MUSDB18,
  author       = {Rafii, Zafar and
                  Liutkus, Antoine and
                  Fabian-Robert St{\"o}ter and
                  Mimilakis, Stylianos Ioannis and
                  Bittner, Rachel},
  title        = {The {MUSDB18} corpus for music separation},
  month        = dec,
  year         = 2017,
  doi          = {10.5281/zenodo.1117372},
  url          = {https://doi.org/10.5281/zenodo.1117372}
}

If you compare your results with SiSEC 2018 participants, cite the SiSEC 2018 LVA/ICA paper:

@inproceedings{SiSEC18,
  author="St{\"o}ter, Fabian-Robert and Liutkus, Antoine and Ito, Nobutaka",
  title="The 2018 Signal Separation Evaluation Campaign",
  booktitle="Latent Variable Analysis and Signal Separation:
  14th International Conference, LVA/ICA 2018, Surrey, UK",
  year="2018",
  pages="293--305"
}

⚠️ Please note that the official acronym for open-unmix is UMX.

License

MIT

Acknowledgements

anr

open-unmix-pytorch's People

Contributors

aliutkus, dependabot[bot], diggerdu, faroit, keunwoochoi, kno3a87, satvik-venkatesh, tobiasb22


open-unmix-pytorch's Issues

Iterative usage leads to memory failure

🐛 Bug

A possible memory leak produces an OOM error on a minimal amount of memory allocation when calling the separation function iteratively.

To Reproduce

Get a set of 100 normal-sized mp3 files and do this:

Steps to reproduce the behavior:

for filename in files:
    result = separate_music_file(
        filename,
        'cpu',
        ['vocals'],
        # etc
    )
    print(result)

Expected behavior

If the mp3 file and the model can fit in memory I hope it finishes without error.
If the mp3 and the model can't fit in memory I expect to fail in a consistent way.

Environment

Please add some information about your environment

  • PyTorch Version (e.g., 1.2): 1.3
  • OS (e.g., Linux): Windows 10
  • torchaudio loader (y/n): n
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • Available memory: 8 GB (really about 6 GB after discounting OS usage)

Additional context

The error message says it fails to allocate 6,000,000 bytes ~ 6 MB, so it doesn't look like the mp3 file is too big. Also, I tried a "split & retry" mechanism, and the file size doesn't really matter: the program fails after some iterations with any input size.

I think there could be a memory leak.
I'm still testing some changes, for example adding torch.no_grad and clearing caches between iterations, but no luck so far.
I'll keep you updated.
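A hedged mitigation sketch along those lines (separate_music_file and files are the reporter's own names from above; torch.cuda.empty_cache() is a harmless no-op on CPU-only setups):

import torch

for filename in files:
    with torch.no_grad():      # avoid building autograd graphs during inference
        result = separate_music_file(filename, 'cpu', ['vocals'])
    print(result)
    del result                 # drop references so memory can be reclaimed
    torch.cuda.empty_cache()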

Save pre-trained model if loaded from torch hub

🐛 Bug

If the model hasn't been manually downloaded then the default umxhq model from torch hub gets downloaded every time.

To Reproduce

  1. Run umx on two files one by one
  2. Model downloaded twice

Expected behavior

If downloading, then it should be saved so as to not download it every time.

torch.hub.set_dir could perhaps be set to model_path to save the model there (haven't tried yet).
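A small sketch of that workaround (the cache path is only an example):

import torch

# point torch.hub's cache at a persistent directory so checkpoints are
# downloaded once and reused across runs
torch.hub.set_dir("/path/to/persistent/hub_cache")
separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxl", device="cpu")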

A few questions regarding audio separation in general and going by this model

I find the concept here very interesting, because it illustrates quite well the possibilities of today. But the last few days I've been thinking about something that I'd just like to ask because I don't have the background knowledge. I hope that you as experts can give me some information. Here the thoughts are summarized:

Let's assume the following case: I have two signals that I want to separate from each other. In this case it is normal speech and music (no singing, just speech). I could now take a standard model (like this one) and train it on it. So far so good. But unlike the normal "music" separation I have some other problems and challenges. Let's assume that the music is a music bed which serves as a base. This can be talked over by many different people and can be used in many ways. In radio/broadcasting, for example, this is part of everyday life. There is no correlation between voice and music. For these reasons I have more than one version available from the music source (more than 100 or even thousand times), which means that the music bed may have been talked over by many people. But the background music is always exactly the same in all cases.

Now to my question or assumption: Is it possible to teach a neural network to extract only the "similar" or "same" signals that are present in each file? My idea would be that you could simply extract the music bed, because it is present in all recordings. Only the volume is not always the same but the content itself is.

Is this a purely theoretical scenario, or could you build something like this? If so, how much effort do you think experienced people would have to spend on this? How would you teach a network to do that?

Sorry it's a little off topic. But I would simply be interested in the opinion here. Is this just a fantasy, or can something like this actually be implemented with the available resources?

Typo in README - input tensor shape of OpenUnmix

Hello,
I believe the true input shape of OpenUnmix (the spectrogram model, not the on-the-fly waveform one) is this, taken from the code:

(nb_samples, nb_channels, nb_bins, nb_frames)

This corresponds to the (I, F, T) that I've seen in the oracle code (I = channels, F = frequency bins, T = time frames).

The README describes the shape in a different order:

models.OpenUnmix: The core open-unmix takes magnitude spectrograms directly (e.g. when pre-computed and loaded from disk). In that case, the input is of shape (nb_frames, nb_samples, nb_channels, nb_bins)

I get an error. What should I do?

🐛 Bug

h$ python3 train.py --dataset musdb
Using GPU: True
Using Torchaudio: False
Traceback (most recent call last):
File "train.py", line 294, in
main()
File "train.py", line 158, in main
train_dataset, valid_dataset, args = data.load_datasets(parser, args)
File "/home/scss/DeepLearning/VocalEX/Separation/open-unmix-pytorch/open-unmix-pytorch/data.py", line 226, in load_datasets
**dataset_kwargs
File "/home/scss/DeepLearning/VocalEX/Separation/open-unmix-pytorch/open-unmix-pytorch/data.py", line 751, in init
*args, **kwargs
TypeError: init() got an unexpected keyword argument 'root'

To Reproduce

Steps to reproduce the behavior:

1. Python3 train.py --root (Datasets)

Expected behavior

Environment

Please add some information about your environment

  • PyTorch Version (e.g., 1.0.0):
  • OS (e.g., Linux):
  • torchaudio loader (y/n): N
  • Python version:
  • CUDA/cuDNN version: 440
  • Any other relevant information:

If unsure you can paste the output from the pytorch environment collection script
(or fill out the checklist below manually).

You can get that script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Additional context

Cannot use double precision with wiener filtering

🐛 Bug

It seems that because of this line: https://github.com/sigsep/open-unmix-pytorch/blob/master/openunmix/filtering.py#L301, where the dtype is not provided explicitly, trying to use wiener filtering with double precision will fail.

  File "/Users/defossez/projs/demucs/env/lib/python3.8/site-packages/openunmix/filtering.py", line 472, in wiener
    y = expectation_maximization(y, mix_stft, iterations, eps=eps)[0]
  File "/Users/defossez/projs/demucs/env/lib/python3.8/site-packages/openunmix/filtering.py", line 301, in expectation_maximization
    y[t, ...] = torch.tensor(0.0, device=x.device)
RuntimeError: Index put requires the source and destination dtypes match, got Double for the destination and Float for the source.

To Reproduce

Call wiener filtering function with a magnitude that is float64, and complex spectrogram of the mixture that is complex128.

Expected behavior

Expected call to succeed with high precision inputs.

Environment

PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 11.4 (x86_64)
GCC version: Could not collect
Clang version: 11.0.0
CMake version: version 3.19.1
Libc version: N/A

Python version: 3.8.8 (default, Feb 24 2021, 13:46:16)  [Clang 10.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0
[pip3] torchvision==0.9.1
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      233
[conda] mkl-service               2.3.0            py38h9ed2024_0
[conda] mkl_fft                   1.2.0            py38hc64f4ea_0
[conda] mkl_random                1.1.1            py38h959d312_0
[conda] numpy                     1.20.2                   pypi_0    pypi
[conda] torch                     1.8.1                    pypi_0    pypi
[conda] torchaudio                0.8.1                    pypi_0    pypi
[conda] torchvision               0.9.1                    pypi_0    pypi

sourcefolder training

Hello, again.
@sigsep:
Sorry to bother you, but I seem to have made another novice mistake when training on a "sourcefolder" dataset.
Specifically, I am using DCASE2013_subtask2/singlesounds_stereo, which has 320 wav files containing 16 classes of environmental noises (alert, clearthroat, cough, etc., 20 files each). I separated them into different folders according to the noise labels (./DCASE2013 (as root)/train/alert/alert01.wav, alert02.wav, etc.).

When I tried the following command, this error occurred.
Command: python train.py --dataset sourcefolder --root ./DCASE2013 --target-dir alert --interferer-dirs clearthroat cough

Error message:
Using GPU: True
100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 36.16it/s]
100%|██████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 10745.44it/s]
0%| | 0/1000 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 291, in
main()
File "train.py", line 174, in main
scaler_mean, scaler_std = get_statistics(args, train_dataset)
File "train.py", line 66, in get_statistics
x, y = dataset_scaler[ind]
File "/xxx/open-unmix-pytorch/data.py", line 367, in getitem
source_path = random.choice(self.source_tracks[source])
File "/xxx/anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/random.py", line 261, in choice
raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence


Am I missing something? Looks like it does not find the training files.

Refactor open-unmix as a package

It seems that there is an interest to use just the pre-trained weights from open-unmix. To improve usability we will make open-unmix a pypi (and possibly conda-forge) package.

Cuda Out of Memory Error on Longer Files

🐛 Bug

Hello,

I am trying to test out the torchfilters branch of this project. It works fine on shorter audio clips, but when the audio file is around 4 to 5 minutes in length, the program crashes with a CudaOutOfMemoryError.

To Reproduce

Steps to reproduce the behavior:

  1. Run test.py on a music file about 4 or 5 minutes in length.
Traceback (most recent call last):
  File "/home/user/unmix/test.py", line 74, in separate
    estimates, model_rate = separator(audio_torch, rate)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/unmix/unmix/filtering.py", line 833, in forward
    for sample in range(nb_samples)], dim=0)
  File "/home/user/unmix/filtering.py", line 833, in <listcomp>
    for sample in range(nb_samples)], dim=0)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torchaudio/functional.py", line 130, in istft
    onesided, signal_sizes=(n_fft,))  # size (channel, n_frames, n_fft)
RuntimeError: CUDA out of memory. Tried to allocate 454.00 MiB (GPU 0; 7.43 GiB total capacity; 6.02 GiB already allocated; 218.94 MiB free; 690.49 MiB cached)

Expected behavior

The program should finish execution on longer files as well. Is there a way to split the audio every one or two minutes, or to use an audio loader such that the entire song isn't loaded into CUDA memory at once, so that it doesn't crash?

Thank you!
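One possible chunking sketch (a rough workaround that ignores boundary artifacts at the seams, and assumes a separator that maps a batched waveform to batched waveform estimates):

import torch

def separate_in_chunks(separator, audio, rate=44100, chunk_seconds=60):
    # audio: (nb_samples, nb_channels, nb_timesteps)
    chunk = chunk_seconds * rate
    parts = [
        separator(audio[..., start:start + chunk])
        for start in range(0, audio.shape[-1], chunk)
    ]
    return torch.cat(parts, dim=-1)  # concatenate the estimates along time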

Environment

Please add some information about your environment

  • PyTorch Version (e.g., 1.2): 1.2
  • OS (e.g., Linux): Linux
  • torchaudio loader (y/n): Y
  • Python version: 3.7
  • CUDA/cuDNN version: 10.0/7.6
  • Any other relevant information:

Additional context

A little confused while using istft in test.py

Hi sirs,

Sorry to bother.
This is not a bug, but I don't know whom I can ask.

I have a question about using istft() in test.py.
def istft(X, rate=44100, n_fft=4096, n_hopsize=1024):
    t, audio = scipy.signal.istft(
        X / (n_fft / 2),
        rate,
        nperseg=n_fft,
        noverlap=n_fft - n_hopsize,
        boundary=True
    )
    return audio

Why does the input data "X" need to be divided by "(n_fft / 2)" ?
What is the purpose of it?

Thanks for your help.
mstfc

Can't evaluate just one target

When I run "python openunmix/evaluate.py --root /local/musdb18 --targets vocals --model my_model --residual acc",
the following appears:
Traceback (most recent call last):
File "openunmix/evaluate.py", line 260, in
results.add_track(scores)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 183, in add_track
self.df = self.df.append(track.df, ignore_index=True)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 113, in df
return json2df(simplejson.loads(self.json), self.track_name)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 413, in json2df
df = pd.melt(
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/pandas/core/reshape/melt.py", line 64, in melt
raise KeyError(
KeyError: "The following 'id_vars' are not present in the DataFrame: ['name', 'time']"

training speech of MUSDB18 is very slow

My system: Ubuntu 16.04, one GTX1080Ti, CUDA9, 24 core CPU

When training MUSDB18 using the default unmix model, there are 544 batches; the iteration time of one batch is about 21 seconds, so one full pass over the 544 batches takes 21 × 544 s = 11424 s ≈ 3.2 hours, which is very slow.

PS: my training script is:
python train.py --root path/to/musdb18 --target vocals

I suspect that there is something wrong in my training process. What is your training time for MUSDB18? Thanks

Confusion about 'vocals SDR'

Dear Sir or Madam,
Hello. First of all, thank you for sharing.
I ran your code only for separating 'vocals' according to your .md file. I get the following aggregated scores:
vocals ==> SDR: 5.415 SIR: 10.950 ISR: 14.831 SAR: 5.533, which is quite different from the ideal result.
Could you please tell me what is wrong? What should I do to reproduce your results? By the way, I use the dataset musdb18 with --is-wav.
Thank you very much.

STL2 did not use additional training data

The main page says STL2 isn't included in the comparison because it used additional training data.
According to this page, STL2 (multi-instrument Wave-U-Net) didn't use additional training data. Which one is right? I think the confusion arose because STL1 does use additional data (CCMixter).

cc: @f90

UMX-L 1.2 installation?

Are there any installation instructions yet for the brand new version trained on the larger training set?

Thanks, Rog

Random Seeds

What are the random seeds you used for the different targets?

A little bug in notation of definition of OpenUnmix

class Spectrogram(nn.Module):
    def __init__(
        self,
        power=1,
        mono=True
    ):
        super(Spectrogram, self).__init__()
        self.power = power
        self.mono = mono

    def forward(self, stft_f):
        """
        Input: complex STFT
            (nb_samples, nb_bins, nb_frames, 2)
        Output: Power/Mag Spectrogram
            (nb_frames, nb_samples, nb_channels, nb_bins)
        """
        stft_f = stft_f.transpose(2, 3)
        # take the magnitude
        stft_f = stft_f.pow(2).sum(-1).pow(self.power / 2.0)

        # downmix in the mag domain
        if self.mono:
            stft_f = torch.mean(stft_f, 1, keepdim=True)

        # permute output for LSTM convenience
        return stft_f.permute(2, 0, 1, 3)

The input shape should be (nb_samples, nb_channels, nb_bins, nb_frames, 2).
It is confusing to understand otherwise.

Audio tracks availability

Hello,
is the dataset available as audio tracks for other projects? If so, under what terms?

Best regards

[Question] Ideal/oracle performance of source estimate + mix phase

Hello,
I've been interested in running various oracle benchmark methods to check if different types of spectrogram (CQT, etc.) can be useful for source separation.
Initially, I was working with the IRM1/2 and IBM1/2 from https://github.com/sigsep/sigsep-mus-oracle

However I noticed that Open-Unmix uses the strategy of "estimate of source magnitude + phase of original mix" (but it has an option to use soft masking instead). Is it valuable to create an "oracle phase-inversion" method?

So, soft mask/IRM1 "ceiling" of performance (the known IRM1 oracle mask calculation) is like (using vocals stem as an example):

mix = <load mix>                          # mixed track
vocals_gt = <load vocals stem>   # ground truth

vocals_irm1 = abs(stft(vocals_gt)) / abs(stft(mix))

vocals_est = istft(vocals_irm1 * stft(mix)) # estimate after "round trip" through soft mask

Now, for the phase inversion method, we could do the following:

mix = <load mix>                          # mixed track
vocals_gt = <load vocals stem>   # ground truth

mix_phase = phase(stft(mix))
vocals_gt_magnitude = abs(stft(vocals_gt))

vocals_stft = pol2cart(vocals_gt_magnitude, mix_phase)

vocals_est = istft(vocals_stft)  # estimate after "round trip" through phase inversion

Does this make sense to do? Has anybody done this before? What could this method be called?
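For reference, here is a runnable sketch of that mix-phase oracle using scipy (mono numpy signals at the same sampling rate and illustrative STFT parameters are assumed):

import numpy as np
from scipy.signal import stft, istft

def mix_phase_oracle(mix, vocals_gt, fs=44100, n_fft=4096, hop=1024):
    # STFTs of the mixture and of the ground-truth stem
    _, _, mix_stft = stft(mix, fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, voc_stft = stft(vocals_gt, fs, nperseg=n_fft, noverlap=n_fft - hop)
    # ground-truth magnitude combined with the mixture phase (pol2cart)
    oracle_stft = np.abs(voc_stft) * np.exp(1j * np.angle(mix_stft))
    _, vocals_est = istft(oracle_stft, fs, nperseg=n_fft, noverlap=n_fft - hop)
    return vocals_est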

Input-stage standardization

🐛 Bug

Hi,
Sorry for the bother. I have a basic question about the input-stage standardization.

After the STFT transform the model does:
x += self.input_mean
x *= self.input_scale

and the same, with the opposite order in the output-stage
x *= self.output_scale
x += self.output_mean

I wonder about the input-stage part: if we want to normalize the spectrogram to be zero-mean with a standard deviation of one, don't we need to subtract the mean from the samples and divide by the std? Like this:
x -= self.input_mean
x /= self.input_scale

Any help will be very appreciated.
Thank you!
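If the stored parameters are the negated mean and the reciprocal standard deviation (an assumption about the checkpoint contents, not verified here), then the += / *= formulation is equivalent to the usual standardization:

import torch

x = torch.rand(16, 2049)             # dummy spectrogram frames
mean, std = x.mean(0), x.std(0)

a = (x - mean) / std                 # textbook standardization
input_mean, input_scale = -mean, 1.0 / std
b = (x + input_mean) * input_scale   # the same thing written as x += ...; x *= ...
assert torch.allclose(a, b)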

GPU Utilization too low

Hi,

I use Open-Unmix training on my data set (includes MUSDB stem version + other) and it took place on Nvidia RTX2080 cards without SSD and without nb workers.
My GPU utilization as i see it with the command "nvidia-smi" is 2%-11% (Cuda is enabled and got print GPU usage True, though Torchaudio usage is False).
However in you're description about the training process, you mentioned that your GPU utilization got to 90%.

What is the reason for my low GPU Utilization? Is it related to the fact that torchaudio is not used?
Can you please give an approximation upon the expected range of GPU utilization?
Thank you very much.

Not really open, MUSDB18-HQ is not available

I've tried numerous times to "Request access" from zenodo.org but they ignore my requests. My guess is you have to be an RIAA member to get access. Is there someplace to actually get this data? I have a good amount of separate track music I would like to augment MUSDB18-HQ with.

About PyTorch Mobile

🚀 Model Improvement

Facebook has just announced PyTorch Mobile for both iOS and Android devices in PyTorch 1.3. It includes new quantization algorithms (the state-of-the-art FBGEMM and QNNPACK quantized kernel back ends) for this mobile version.

Motivation

Having a quantized model running on the device would be an interesting challenge.
It would be interesting to try the quantization of the model to make it ready to run on the device.
For more info about PyTorch 1.3 here

Objective Evaluation

Hardware requirements for test

🐛 Bug

It seems to be relatively easy to run out of memory on the GPU with the first example provided in the README. Maybe it would be nice to add some hardware requirements or an estimate of how much memory is needed per second of input signal.

To Reproduce

Steps to reproduce the behavior:

>>> python test.py ~/data/musdb18-wav/test/Al\ James\ -\ Schoolboy\ Facination/mixture.wav --model umxhq
Traceback (most recent call last):
  File "test.py", line 301, in <module>
    device=device
  File "test.py", line 166, in separate
    use_softmask=softmask)
  File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 260, in wiener
    y = expectation_maximization(y/max_abs, x_scaled, iterations, eps=eps)[0]
  File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 141, in expectation_maximization
    eps)
  File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 511, in get_local_gaussian_model
    C_j = _covariance(y_j)
  File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 468, in _covariance
    y_j.dtype)
MemoryError

Environment

Please add some information about your environment

  • Any other relevant information: NVIDIA GP107M [GeForce GTX 1050 Mobile]

If unsure you can paste the output from the pytorch environment collection script
(or fill out the checklist below manually).

You can get that script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 430.40
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.13.3
[conda] mkl                       2019.4                      243  
[conda] pytorch                   1.2.0           py3.7_cuda10.0.130_cudnn7.6.2_0    pytorch

git Repo issue

Hi, I have been waiting for this for a long time!
I would like to report my experiences with training on MUSDB18 (Ubuntu 18.04, GTX1080Ti, CUDA10).

"python train.py --root ./musdb18 --target vocals" yielded the following errors.
1: raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError:
My Response: "git init" created bunch of files under .git

2: Then two different error messages: raise ValueError("Reference at %r does not exist" % ref_path)
ValueError: Reference at 'refs/heads/master' does not exist
My Response: As suggested, created .git/refs/heads/master and wrote "ref: refs/heads/master" in the text there
Result: this stopped the error.

3: python train.py --root ./musdb18 finally runs without an error message, but nvidia-smi shows no GPU usage and returns none.

Any suggestions? Thanks!!

Improve vocal-accompaniment separation without wiener filter

🚀 Model Improvement

In the vocal/accompaniment scenario, separating with --niter 0 --residual gets only to 3.9 dB SDR for vocals, whereas with the --niter 1 the scores get up to 6.0.

Motivation

The scores without wiener filtering should only be slightly worse than with it.

Does bandwidth extension even exist as a feature?

Your training docs mention that an aligned dataset can be used for Bandwidth Extension (Low Bandwidth -> High Bandwidth) as mentioned here.
Previously I have trained models for Source Separation (Mixture -> Target) and Denoising (Noisy -> Clean) and they're working as intended.
But training for Bandwidth Extension doesn't provide any noticeable enhancements at all, this is the output spectrogram and this is how it's supposed to be.
At first I thought I did a mistake to convert the source from 22050Hz to 44100Hz, so I tried training with 22050Hz and 48000Hz files directly but it would throw this error:

Using GPU: True
Using Torchaudio:  True
16748it [00:21, 795.79it/s]
15it [00:00, 657.46it/s]
Compute dataset statistics: 100%|███████| 16416/16416 [1:46:10<00:00,  2.58it/s]
Training Epoch:   0%|                                  | 0/1000 [00:00<?, ?it/strain.py:31: UserWarning: Using a target size (torch.Size([278, 16, 1, 2049])) that is different to the input size (torch.Size([126, 16, 1, 2049])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = torch.nn.functional.mse_loss(Y_hat, Y)
Training batch:   0%|                                  | 0/1026 [00:01<?, ?it/s]
Training Epoch:   0%|                                  | 0/1000 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 295, in <module>
    main()
  File "train.py", line 244, in main
    train_loss = train(args, unmix, device, train_sampler, optimizer)
  File "train.py", line 31, in train
    loss = torch.nn.functional.mse_loss(Y_hat, Y)
  File "/home/mgt/anaconda3/envs/basepy37/lib/python3.7/site-packages/torch/nn/functional.py", line 2203, in mse_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "/home/mgt/anaconda3/envs/basepy37/lib/python3.7/site-packages/torch/functional.py", line 52, in broadcast_tensors
    return torch._C._VariableFunctions.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (126) must match the size of tensor b (278) at non-singleton dimension 0

As I suspected, it's expecting files of the same size, but then what am I supposed to do?
Could you please illustrate how to train this for Bandwidth Extension? I had zero trouble with Source Separation and Denoising tasks, so I expected this to work the same way.
This is genuinely driving me crazy, I REALLY need this, any help is much appreciated, thanks!

Sources with different numbers of channels for sourcefolder dataset

🐛 Bug

First of all, I have to acknowledge the authors of open-unmix for this obviously awesome work ;)

My issue is about the sourcefolder dataset, which cannot handle sources with different numbers of channels. Let's assume that we have two folders of sources, the first one contains mono signals, the second one stereo signals. For training, we also set nb_channels to 1. In __getitem__ of SourceFolderDataset, an error is raised when trying to stack the sources, before summing them to create the mixture (line 358 of data.py).

To Reproduce

Steps to reproduce the behavior:

  1. Create two folders of sources, one with stereo signals and the other one with mono signals.
  2. Launch training with nb-channels set to 1; below is the command I used:
python train.py --root ./data-sourcefolder --dataset sourcefolder --interferer-dirs noise --target-dir speech --nb-train-samples 20000 --nb-valid-samples 2000 --seq-dur 2.0 --source-augmentations gain --hidden-size 256 --nb-channels 1 --nfft 1024 --nhop 256 --nb-workers 4
  3. We get the following error:
Traceback (most recent call last):
  File "train.py", line 294, in <module>
    main()
  File "train.py", line 177, in main
    scaler_mean, scaler_std = get_statistics(args, train_dataset)
  File "train.py", line 68, in get_statistics
    x, y = dataset_scaler[ind]
  File "/data/recherche/python/speech_enhancement/open-unmix-pytorch/data.py", line 385, in __getitem__
    stems = torch.stack(audio_sources)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 2 and 1 in dimension 1 at /tmp/pip-req-build-58y_cjjl/aten/src/TH/generic/THTensor.cpp:689

Expected behavior

We could expect that, because nb_channels is set to 1, stereo source signals would be downmixed so that they can be mixed with the monophonic sources (see the sketch below).
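A minimal sketch of that expected behavior (to_mono is a hypothetical helper, not part of the codebase):

import torch

def to_mono(audio: torch.Tensor) -> torch.Tensor:
    # audio: (nb_channels, nb_timesteps); average channels when training with --nb-channels 1
    return audio.mean(dim=0, keepdim=True)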

Environment

  • PyTorch Version: 1.2.0
  • OS: Ubuntu
  • torchaudio loader: no
  • Python version: 3.7.3
  • CUDA/cuDNN version: CUDA 10.1 - cuDNN 7.6.0

Training

Hello,

I want to transfer-learn the default model on a dataset of my own (or just train on my own dataset from scratch if that is easier). I have mixture.wav files and the wav files for the individual instruments as well. I want to be able to separate everything in the song. I have some questions, though:

  1. What dataset type should I use for this application?

  2. Do each individual song's wav files need to be the same length, or does every song in the dataset and their wav files need to be the same length? Basically, can different songs have different lengths?

I'm wondering this because I was messing around with train.py and got an error of NotImplementedError: Non-relative patterns are unsupported.

  3. Would I get better single-instrument performance if I used the aligned ("denoising") dataset, since it would just be focusing on the target sound and the noise? For example, if I just wanted to separate the bass from a song.

  4. Also, is there a Colab notebook that is set up for training?

Sorry if this doesn't make sense; I tried to make it as clear as possible.

Improve dataset statistics for sourcefolder dataset

🐛 Dataset Statistics do not work for sourcefolder dataset

The get_statistics function was designed to iterate over the complete audio data in a deterministic manner, therefore loading the full audio samples. This doesn't work with the sourcefolder dataset, as it allows files of different lengths and only takes short chunks of fixed length from each item.

Expected behavior

sourcefolder dataset should work with get_statistics

Proposed solutions

Solution 1

replace dataset_scaler.seq_duration = None with dataset_scaler.seq_duration = args.seq_dur. That would solve the issue, but then the dataset statistics would only be trained on the first n seconds of each sample.

Solution 2

use stochastic sampling and use a dataloader instead of a dataset: e.g.:

def get_statistics(args, dataloader):
    scaler = sklearn.preprocessing.StandardScaler()

    spec = torch.nn.Sequential(
        model.STFT(n_fft=args.nfft, n_hop=args.nhop),
        model.Spectrogram(mono=True)
    )

    pbar = tqdm.tqdm(dataloader, disable=args.quiet)
    for x, y in pbar:
        pbar.set_description("Compute dataset statistics")
        X = spec(x)
        scaler.partial_fit(np.squeeze(X))

    std = np.maximum(
        scaler.scale_,
        1e-4*np.max(scaler.scale_)
    )
    return scaler.mean_, std

stats_sampler = torch.utils.data.DataLoader(
    train_dataset, batch_size=1,
    sampler=sampler, **dataloader_kwargs
)

The second option would yield better-distributed samples, and users could perhaps specify an argument that selects the number of randomly drawn samples used to train the dataset statistics.

Add simple way to fine-tune pretrained models

The current training code does provide a way to fine-tune models given a checkpoint file. However:

  • we do not provide the checkpoints on zenodo (that include the optimizer states)
  • there is no command-line interface option to load umx or umxhq pretrained models for training

RuntimeError: Backend "sox_io" is not one of available backends: ['soundfile'].

🐛 Bug

I am trying to run umx in Windows 10 64 + Anaconda 3.
The installation ("pip install openunmix") seemed to pass without any problem but "umx anyfile.wav" failed:

(base) C:\Users\Vita\audio-separation\open-unmix-2021>umx nakonci.wav
Traceback (most recent call last):
  File "c:\users\vita\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\vita\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Vita\Anaconda3\Scripts\umx.exe\__main__.py", line 9, in <module>
  File "c:\users\vita\anaconda3\lib\site-packages\openunmix\cli.py", line 118, in separate
    torchaudio.set_audio_backend(args.audio_backend)
  File "c:\users\vita\anaconda3\lib\site-packages\torchaudio\backend\utils.py", line 44, in set_audio_backend
    f'Backend "{backend}" is not one of '
RuntimeError: Backend "sox_io" is not one of available backends: ['soundfile'].

The strange thing was that I got the same error message with full path to the file and even with a non-existing file so it seemed umx could not even load the input file.

Then I noticed this:
Note that we support all files that can be read by torchaudio, depending on the set backend (either soundfile (libsndfile) or sox).
Adding "--audio-backend sox_io" resulted in the same error message, but "--audio-backend soundfile" finally made it work.
Maybe the default setting should change...?

Obtaining weights for streaming implementation

I'm interested in implementing a real-time, streaming version of the separation method.

Do you have any advice on how to extract the model weights for this?

Would it be best to retrain, and save the weights during training?

The detail procedure to reproduce the evaluation results of UMX pre-trained model

Hi Sirs,

I'm new to UMX. I tried to reproduce the fantastic results that you made on the website.
I use only the umx vocals-c8df74a5.pth weights to do evaluation (eval.py) with the MUSDB18 test set (50 songs).
Here is my result (labelled UMX1):

target          SDR        SIR        ISR        SAR
accompaniment   11.881972  20.425005  18.950225  12.290675
vocals          5.567850   12.480217  14.368638  5.715235
The vocals SDR is 5.567850, which is much worse than your result of 6.32.
May I know how to reproduce your result?
What are the musdb/museval version you use?
I use musdb 0.3.1, museval 0.3.0.

I also plotted the boxplot and it is just a little bit better than the Wave-U-Net 44kHz pre-trained model.
I'm wondering what I did wrong.
Hope to receive your response.
Thanks in advance.

mstfc

Availability on Android

Hi, I just wanted to know if it's possible to use Open-Unmix on android via Pytorch, I know there is usage of Pytorch on Android for image processing but I haven't found any examples to help me use Open-Unmix on android.

Docker command example not working

🐛 Bug

The docker command listed on the github homepage fails with RuntimeError: Error loading audio file: failed to open file umx

To Reproduce

Steps to reproduce the behavior:

T>docker run -v ~/Music/:/data -it faroit/open-unmix-pytorch umx "/data/track1.wav" --outdir /data/track1
Using cpu
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/vocals-b62c91ce.pth" to /root/.cache/torch/hub/checkpoints/vocals-b62c91ce.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34.0M/34.0M [00:04<00:00, 8.69MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/drums-9619578f.pth" to /root/.cache/torch/hub/checkpoints/drums-9619578f.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34.0M/34.0M [00:04<00:00, 8.30MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/bass-8d85a5bd.pth" to /root/.cache/torch/hub/checkpoints/bass-8d85a5bd.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34.0M/34.0M [00:05<00:00, 6.65MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/other-b52fbbf7.pth" to /root/.cache/torch/hub/checkpoints/other-b52fbbf7.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34.0M/34.0M [00:04<00:00, 8.02MB/s]
formats: can't open input file `umx': No such file or directory
Traceback (most recent call last):
File "/opt/conda/bin/umx", line 8, in
sys.exit(separate())
File "/opt/conda/lib/python3.8/site-packages/openunmix/cli.py", line 160, in separate
audio, rate = data.load_audio(input_file, start=args.start, dur=args.duration)
File "/opt/conda/lib/python3.8/site-packages/openunmix/data.py", line 58, in load_audio
sig, rate = torchaudio.load(path)
File "/opt/conda/lib/python3.8/site-packages/torchaudio/backend/sox_io_backend.py", line 152, in load
return torch.ops.torchaudio.sox_io_load_audio_file(
RuntimeError: Error loading audio file: failed to open file umx

Expected behavior

I can see it complaining about not being able to load an audio file (at least while I figure out how the syntax applies to Windows and docker), but it seems to be trying to load umx itself as a file, so perhaps there is a typo in the docker command?

Environment

Docker on Windows 10.

Please add some information about your environment

This stuff shouldn't be relevant for docker, yeah?

  • PyTorch Version (e.g., 1.2):
  • OS (e.g., Linux):
  • torchaudio loader (y/n):
  • Python version:
  • CUDA/cuDNN version:
  • Any other relevant information:

If unsure you can paste the output from the pytorch environment collection script
(or fill out the checklist below manually).

You can get that script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Additional context

Train single channel model using left or right channel

Currently using train.py --nb-channels 1 will apply a downmix in the spectral domain inside the model to feed in only single channel audio.

However, I can think of applications where we do not have access to a Wiener filter and therefore apply the model to each channel individually. In that case the performance might be better if the model were trained on just the left or the right channel. This can be fixed easily since we use channel-swap augmentation.

[Question] About mp3 input files

I have MP3 files at 128 kb/s like

  Metadata:
    encoder         : Lavf58.20.100
  Duration: 00:02:22.11, start: 0.025057, bitrate: 128 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 128 kb/s
    Metadata:
      encoder         : Lavc58.35

and I therefore convert to wav so that I get a 22050 Hz file as per the dataset specification:

ffmpeg -i file.mp3 -acodec pcm_s16le -ar 22050 file.wav
Metadata:
    encoder         : Lavf58.12.100
  Duration: 00:02:22.06, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 22050 Hz, 2 channels, s16, 705 kb/s

The separation works; my question is whether this is the best approach given the mp3 sample rate and bit rate.
Thank you.

num_samples should be a positive integer value, but got num_samples=0

I'm trying to use my own data for training with the FixedSourcesTrackFolderDataset. Unfortunately, the data doesn't seem to be recognized or found by the dataloader. I am using normal wav files (not stems), organized in the following folder structure:

   dataset
         valid
              0
                guitar.wav
                piano.wav
                ...
              1
              2
              ...
         train
              3
                guitar.wav
                piano.wav
                ...
              4
              5
              ...

Next, I issue:
python train.py --root /path/to/dataset --dataset trackfolder_fix --target-file piano.wav --interferer-files cello.wav guitar.wav hi-hat.wav

The following error is thrown:
ValueError: num_samples should be a positive integer value, but got num_samples=0

Is this a problem with the format (folder structure) in which the data is provided? From reading the documentation I can't figure out if a Pytorch dataclass has to be created beforehand or not. If so, how does that fit into the folder structure?

Using umx programmatically instead of via cli.

I'm actually going to use it in another script but there's some pre-processing before the separate function gets called in test.py (the part after if __name__ == '__main__').

I wrapped it up in a whole function and was wondering if that's a good approach?

Like,

def main(input_files, samplerate, niter, alpha, softmask, residual_model, model,
         targets=('vocals', 'drums', 'bass', 'other'), outdir=None, no_cuda=False):

and then at the end call it by

main(args.input, args.samplerate, args.niter, args.alpha, args.softmask, args.residual_model, args.model, args.targets, args.outdir, args.no_cuda)

This doesn't change the cli functionality but allows me to import the main function for external use.

README News links 404

Hey OpenUnmixers!

I'm excited about all of the great work you've been doing! Congrats on the latest releases! :D

I just wanted to point out that two links under the News Section of your README are 404'ing:

Thanks!
Ethan

Set default augmentations

I'm trying to train open-unmix from scratch. The validation losses after early stopping patience are not as good as what's shown in training.md: https://github.com/sigsep/open-unmix-pytorch/blob/master/docs/training.md

I'm using the exact open-unmix-pytorch codebase with no modifications. My training script is:

for target in drums vocals other bass;
do
        python scripts/train.py \
                --root=~/MUSDB18-HQ/ --is-wav --nb-workers=4 --batch-size=16 --epochs=1000 \
                --target="$target" \
                --output="umx-baseline"
done

So far, drums and vocals have trained to the following lowest validation loss:
Drums: 0.93 (compared to 0.7 of the claimed training.md)
Vocals: 1.1 (compared to 0.992 of the claimed training.md)

These aren't huge differences, but I'm wondering if there's any explanation. Is it the random seed that allowed your drums model to go as far down as 0.7?

Cant load dataset

🐛 Bug

When I run train.py with a custom dataset, the dataset doesn't load and I get the error: "IndexError: Cannot choose from an empty sequence". When I print the length of the dataset I get a non-zero value.

This is the command I use to run the train.py script:
"! python train.py --dataset sourcefolder --root /content/data --target-dir gt--interferer-dirs interfer --ext .wav --nb-train-samples 1000 --nb-valid-samples 100"

I am running the code in Google Colab.
