data_driven_pitch_corrector's Introduction

Description

This program computes automatic pitch correction for vocal performances. For each note it outputs a constant pitch-shift value of up to 100 cents (one semitone), and it can also apply the shifts to the audio.
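The cents-to-frequency relation behind those shift values is standard equal temperament: a shift of c cents scales frequency by 2^(c/1200). A minimal sketch (not code from this repository):

```python
def cents_to_ratio(cents):
    """A pitch shift of `cents` scales frequency by 2**(cents/1200);
    100 cents corresponds to one equal-tempered semitone."""
    return 2.0 ** (cents / 1200.0)

# A full +100-cent correction raises A4 (440 Hz) to roughly 466.16 Hz (A#4).
shifted_hz = 440.0 * cents_to_ratio(100)
```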

The program is trained on examples of in-tune singing and applies corrections along a continuous frequency scale.

A pre-trained model is available.

Note: this repository used to be named "autotuner", but has been renamed to "data-driven pitch corrector" to avoid confusion with the proprietary term "Antares Auto-Tune".

Usage

Requirements

Requirements are listed in requirements.txt and can be installed using pip install -r requirements.txt. Data pre-processing requires a program that computes probabilistic YIN (pYIN) pitch analysis. One option is Sonic Annotator.

Running the program

To run the program on an example included in the repo using a pre-trained model, run:

python rnn.py --extension "3" --resume True --run_training False --run_autotune True

This will write results to the directory results_root_3/realworld_audio_output, which will contain the original performance (test_mix.wav) and the output of the program (corrected_mix.wav). Note that the backing track is not publicly available, so the backing track wav file is silent (a 10 Hz sine wave). For this particular example, the program instead loads a pre-computed constant-Q transform (CQT). Make sure to first download and uncompress the CQT zip file available at http://homes.sice.indiana.edu/scwager/images/survive_4_back_cqt.npy.zip and place it in ./Intonation/realworld_data/raw_audio/backing_tracks_wav.

More generally, the program can be run for training, testing, or automatic pitch correction by setting the corresponding boolean arguments (e.g. --run_training and --run_autotune; see the argument parser in rnn.py for the full list). For training and testing, the dataloader takes in-tune singing as input, detunes the notes of the vocal track, and learns to predict the de-tuning amount. For automatic pitch correction, the program takes real-world performances as input, predicts corrections for them, and synthesizes the output.
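The detune-and-predict training setup can be sketched as follows. This is an illustration of the idea, not the repository's API; the function name, pitch values, and uniform shift distribution are assumptions.

```python
import numpy as np

SEMITONE_CENTS = 100  # the corrector's shifts are bounded by one semitone

def detune_notes(note_f0_hz, rng):
    """Randomly detune each note by up to +/-100 cents.
    The training target is the correction that undoes the shift."""
    shifts_cents = rng.uniform(-SEMITONE_CENTS, SEMITONE_CENTS,
                               size=len(note_f0_hz))
    detuned_hz = note_f0_hz * 2.0 ** (shifts_cents / 1200.0)
    return detuned_hz, -shifts_cents  # (network input pitches, regression targets)

rng = np.random.default_rng(0)
in_tune = np.array([220.0, 246.94, 261.63])  # hypothetical in-tune note pitches
detuned, targets = detune_notes(in_tune, rng)
```

Applying the predicted corrections to the detuned pitches should recover the in-tune performance, which is why in-tune singing alone suffices as training data.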

Multiple other settings and parameters are available in the argument parser in rnn.py.

Define input and output directories, along with data format settings, in globals.py.

A pre-trained model is available in the pytorch_checkpoints_and_models directory. The README in the directory provides more details about it.

The program can be run on CPU but runs faster on GPU.

Data pre-processing

The program requires frame- and note-wise pYIN pitch analyses. Please check the directory ./Intonation/realworld_data/pyin for examples of these. The outputs of Sonic Visualiser are converted from seconds to frame indices.
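The seconds-to-frame-index conversion can be sketched as below. The sample rate and hop length here are assumptions for illustration; use whatever values the repository's globals.py actually configures.

```python
import numpy as np

def seconds_to_frames(times_s, sr=22050, hop_length=256):
    """Convert pYIN event times (seconds) to analysis-frame indices.
    sr and hop_length are assumed defaults, not the repo's settings."""
    return np.round(np.asarray(times_s) * sr / hop_length).astype(int)
```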

Dataset

The Intonation directory needs to contain wav files of the vocals and backing tracks. The data format should be defined in globals.py. intonation.csv should contain a list of the vocal files and corresponding backing track file names (removing the .wav extension). Set variables boundary_id_val and boundary_id_test in rnn.py to determine the split between training, testing, and validation data.
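A hypothetical sketch of how such a csv listing and boundary split might be read; the function name and the interpretation of the boundaries as start indices are assumptions, not the repo's implementation:

```python
import csv

def load_split(csv_path, boundary_id_val, boundary_id_test):
    """Read performance ids from a csv like intonation.csv and split
    them into training/validation/testing by index boundaries."""
    with open(csv_path, newline="") as f:
        ids = [row[0] for row in csv.reader(f) if row]
    return (ids[:boundary_id_val],                  # training
            ids[boundary_id_val:boundary_id_test],  # validation
            ids[boundary_id_test:])                 # testing
```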

The program was trained using the Intonation dataset. More information on the full dataset used for the paper and how to access it via the Stanford DAMP can be found here. Note that the dimension of the backing track CQT in the dataset is different from the one the current program is set to.

Any other data can be used instead.

References

If you use this code, please cite the following:

S. Wager, G. Tzanetakis, C. Wang, and M. Kim, "Deep Autotuner: A pitch correcting network for singing performances," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Submitted for publication.

The reference to the dataset creation paper is:

S. Wager, G. Tzanetakis, C. Wang, S. Sullivan, J. Shimmin, M. Kim, and P. Cook, "Intonation: A dataset of quality vocal performances refined by spectral clustering on pitch congruence," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Submitted for publication.

More information on pitch tracking can be found at:

M. Mauch and S. Dixon, "pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), 2014.

data_driven_pitch_corrector's People

Contributors

sannawag

data_driven_pitch_corrector's Issues

How to preprocess the wav files?

Great work!! Would you please offer the code about how to generate such .npy files from my own .wav files?

        # Note: as written in the repo, the filter `if "npy"` is always true;
        # `if "npy" in f` is presumably what was intended.
        performance_list = sorted(list(
            set([f[:-4] for f in os.listdir(pyin_directory) if "npy" in f]) &
            set([f[:-4] for f in os.listdir(midi_directory) if "npy" in f]) &
            set([f[:-4] for f in os.listdir(back_chroma_directory) if "npy" in f])))
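Since the loader above globs for .npy files, one hedged sketch of producing them from a Sonic Annotator CSV export follows. The two-column time,value layout is an assumption about the export format; check your own output before relying on it.

```python
import numpy as np

def pyin_csv_to_npy(csv_path, npy_path):
    """Convert a pYIN CSV export (rows of time,value) into a .npy
    array. Column layout is assumed, not taken from the repo."""
    data = np.genfromtxt(csv_path, delimiter=",")
    np.save(npy_path, data)
    return data
```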

Question about pYin

Hi. I extracted the pYIN of survive_4_vocals.wav using Sonic Visualiser (Transform -> pYIN -> Notes). The resulting csv file looks like this:
[screenshot of the resulting CSV omitted]

However, I only got 136 rows, while in your survive_4_vocals.csv, there are 138 rows.

And when I tried to auto-tune with the model via python rnn.py --use_combination_channel True --extension "3" --resume True --run_training False --run_autotune True, the following error occurred:

exception in dataset index 1943 is out of bounds for axis 0 with size 1813
Traceback (most recent call last):
  File "rnn.py", line 336, in __getitem__
    data_dict = self.getitem_realworld_cqt(idx)
  File "rnn.py", line 182, in getitem_realworld_cqt
    original_boundaries = np.array([original_boundaries[notes[:, 0]], original_boundaries[notes[:, 1]]]).T
IndexError: index 1943 is out of bounds for axis 0 with size 1813
skipping song survive_4_vocals

Could you tell me how you extract pYIN from the wav files (e.g., the parameters used in Sonic Annotator), and how to avoid the above error?
Thank you very much.

Please provide more detailed instructions

Hello! Could you please provide more detailed instructions in order to test your solution?

I have questions about how to prepare material for autotune, such as how to export an audio file from Sonic Visualiser and how to make a .npy file.

The community of young developers would be very grateful to you for that!

About pYIN

Hi @sannawag ,
Could you share the parameters you use with Sonic Annotator? I cannot generate the same survive_4_vocals.csv from survive_4_vocals.wav. Thanks in advance. The following are my parameters:

sonic-annotator -s vamp:pyin:pyin:smoothedpitchtrack > test.n3
sonic-annotator -t test.n3 survive_4_vocals.wav -w csv --csv-one-file survive_4_vocals.csv

numpy and type errors

I believe the code should be updated due to changes in NumPy and its type handling. Running the first example, I got this set of errors:

loading example backing track CQT
exception in save_outputs hasattr(): attribute name must be string: Traceback (most recent call last):
  File "/home/keto/autotuner/utils.py", line 364, in save_outputs
    bplt.save(bplt.gridplot([s1], [s2], [s3], [s4]))
  File "/home/keto/.local/lib/python3.8/site-packages/bokeh/layouts.py", line 261, in gridplot
    if not hasattr(Location, toolbar_location):
TypeError: hasattr(): attribute name must be string
 skipping song survive_4_vocals
using silent backing track
Traceback (most recent call last):
  File "/home/keto/.local/lib/python3.8/site-packages/numpy/core/function_base.py", line 117, in linspace
    num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "rnn.py", line 804, in <module>
    program.autotune_iters(dataloader=program.realworld_dataset)
  File "rnn.py", line 628, in autotune_iters
    utils.synthesize_result(self.realworld_audio_output_directory, data_dict['perf_id'], data_dict['arr_id'],
  File "/home/keto/autotuner/utils.py", line 447, in synthesize_result
    temp_shifted = psola_shift_pitch(
  File "/home/keto/autotuner/psola.py", line 40, in psola_shift_pitch
    new_signal_list.append(psola(signal, peaks, f_ratio))
  File "/home/keto/autotuner/psola.py", line 109, in psola
    new_peaks_ref = np.linspace(0, len(peaks) - 1, len(peaks) * f_ratio)
  File "<__array_function__ internals>", line 5, in linspace
  File "/home/keto/.local/lib/python3.8/site-packages/numpy/core/function_base.py", line 119, in linspace
    raise TypeError(
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
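The final TypeError arises because np.linspace receives a float for its num argument (len(peaks) * f_ratio); newer NumPy versions require an integer there. One plausible local fix, sketched here rather than quoted from any official patch, is an explicit cast in psola.py:

```python
import numpy as np

def interp_peak_positions(peaks, f_ratio):
    """Sketch of the failing line with an explicit integer cast:
    len(peaks) * f_ratio is a float, which newer NumPy rejects."""
    num = int(round(len(peaks) * f_ratio))
    return np.linspace(0, len(peaks) - 1, num)
```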

How to tune my own sound track

I am trying to tune my own sound track using this model but cannot get it to work.

After pre-processing the two tracks (vocals and background music), I got two .npy files: the vocal pYIN f0 candidate file, placed in the vocals_pitch_pyin folder, and the background-music file, placed in the back_chroma folder.

I then modified the code in rnn.py and tried to tune the sound track with the following:

performance_list = sorted(list(set([f[:-4] for f in os.listdir(pyin_directory) if "npy" in f]) & set([f[:-4] for f in os.listdir(back_chroma_directory) if "npy" in f])))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = get_dataset(performance_list, args.num_shifts, 1, 'testing', device, False)
outputs, loss = program.eval(dataset, save_song_outputs=True, plot=False)

However, in get_dataset I got a TypeError:
TypeError: list indices must be integers or slices, not str

I would like to know how to get the program running properly to tune my own sound track. Thank you!

Collaborate to make this project more available?

Hi! I have been looking for a project like this for a long time. However, I struggle to make it work: the descriptions are hard to follow, and the code is not yet general enough to handle arbitrary sound input.

I am a frontend UX developer and would like to help build a user-friendly web page where users could input their own recordings and this Python code would process and tune the audio. But I would need you or someone else to work on the Python code to make that possible.

Got time?

No License

There is no license for this source code.
I recommend the MIT or Apache license; they are the most popular licenses for AI, including research. You can just add a new file named "LICENSE" and choose a license from a template on GitHub.

If you would like to ask about licensing issues or conditions, I will be happy to answer, with the reservation that I am not a lawyer and do not provide legal advice.

How to use model correctly?

I input different combinations of vocal, back_track, and pYIN files. They are all successfully divided into notes and shifted. However, after evaluation and output generation, when I use synthesis.py to listen to the shifted and corrected version of the song, it does not seem to do much tuning. In the ground-truth-vs-prediction graph, the predicted shift (blue line) is mostly a horizontal line with little amplitude, quite different from the ground truth (red line). It seems I am not using the model correctly. Are any steps missing from my procedure below?

  • generate pYIN and change it to npy and input
  • input vocal wav
  • input backtrack wav
  • change intoncation.csv
  • program args, testing instead of training be true

Thanks for your attention.

Question about shift_to_autotuned.py

This file takes in two arguments: pyin and pyin_notes. Can you describe their schema? What exactly are these variables, and how can I generate these files for a custom WAV file?

New audio sample

This is interesting. I tried to build it and got it working with some hassle (using import soundfile and some type conversions), then injected a new file into training_data\raw_audio\vocal_tracks as survive_4_vocals.wav, then ran

python rnn.py --extension "3" --run_training True --run_autotune False

then

python rnn.py --extension "3" --run_training False --run_autotune True

then

exception in dataset index 61 is out of bounds for axis 0 with size 18 Traceback (most recent call last):
  File "G:\wp1\PitchCorrection\rnn.py", line 332, in __getitem__
    data_dict = self.getitem_realworld_cqt(idx)
  File "G:\wp1\PitchCorrection\rnn.py", line 178, in getitem_realworld_cqt
    original_boundaries = np.array([original_boundaries[notes[:, 0]], original_boundaries[notes[:, 1]]]).T
IndexError: index 61 is out of bounds for axis 0 with size 18


I'm a C++ developer, so my Python knowledge is limited; perhaps this has to do with plotting.
I'd appreciate more specific directions.
