
tfg-voice-conversion's Introduction

Voice Conversion using Deep Learning


This project will be carried out at the Signal Theory and Communications Department (TSC) of the Polytechnic University of Catalonia (UPC). Specifically, it will be developed at the Speech Processing research group (VEU) as a contribution to its research project DeepVoice: Deep Learning Technologies for Speech and Audio Processing.

The purpose of this project is to develop a deep learning-based system able to convert a speaker's voice signal so that it sounds as if it had been uttered by a different speaker. The resulting signal must preserve the linguistic and prosodic content of the original signal.

Deep Learning techniques have shown remarkable results in other areas of speech processing, such as speech recognition and speech synthesis. They are often combined with more classical voice processing and modeling techniques, such as vocoder feature extraction, which serve for pre- and post-processing.

Before this system can be developed, some preliminary tasks must be accomplished. Mainly, these comprise acquiring a thorough knowledge of Neural Networks and how to apply them in Deep Learning, as well as becoming familiar with the tools that will be used in the project. These tools include several Python libraries, such as NumPy, TensorFlow, Theano and Keras.

Regarding the programming tools and libraries, some preparation work was already done during summer 2016, working with Python, NumPy and TensorFlow.

The project’s main goals are:

  1. Develop a Deep Learning-based system able to convert recorded speech from a speaker into that of another speaker
    1. Profound understanding of Deep Learning architectures
    2. Solid knowledge in the use of the Keras Deep Learning Python library
    3. Propose an innovative architecture following the state of the art in Deep Learning for Voice Conversion
    4. Evaluate the developed system's conversion quality, aiming to outperform the systems submitted to the Interspeech 2016 Voice Conversion Challenge

tfg-voice-conversion's People

Contributors

albertaparicio


tfg-voice-conversion's Issues

Missing interpolate.py

Hi, I'm trying to extract features from my own dataset but I find that you use interpolate.py during alignment, which isn't present in the repository or elsewhere. Could you please add the file and/or describe what it does?

A question about "Zaska" and "dtw -b": how can I get more features by running "compute_dtw.sh"?

I tried the solution based on lf0_lstm.py and related scripts. When I tried to modify the training parameters, the script /data/training/compute_dtw.sh confused me.

ZASKA="Zaska -P $PRM_NAME $PRM_OPT"

# Compute mfcc: $DIR_REF/${FILENAME}.wav $DIR_TST/${FILENAME}.wav => mfcc/$DIR_REF/${FILENAME}.prm mfcc/$DIR_TST/${FILENAME}.prm
${ZASKA} -t RAW -x wav=msw -n . -p mfcc -F ${DIR_REF}/${FILENAME}_sil ${DIR_TST}/${FILENAME}_sil

# Align: mfcc/${DIR_REF}/${FILENAME}.prm, mfcc/${DIR_TST}/${FILENAME}.prm => dtw/${DIR_REF}-${DIR_TST}/${FILENAME}.dtw
b=2
dtw -b -${b} -t mfcc/${DIR_TST} -r mfcc/${DIR_REF} -a ${DIR_DTW}/beam${b} -w -B -f -F ${FILENAME}_sil

Running the script is difficult: the "Zaska" command does not exist in any package I could find, and the "dtw" command I have does not accept a "-b" parameter. How can I solve this?

By the way, I want to run this script because I wanted to add more features for training; I modified "tfglib" and tried to rebuild /data/train_datatable.h5.

The result contains very few harmonic elements; it may be necessary to use higher-order feature extraction and adjust the network to fit the additional higher-order features. (The default training also seems to overfit.)
In addition, the converted result sounds dull and low and lacks clarity, possibly due to the missing higher-order harmonic features.
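For readers who, like the reporter, do not have the "Zaska" and beam-constrained "dtw" binaries (which appear to be UPC-internal tools), the alignment they perform can be approximated with a plain dynamic-time-warping pass over the MFCC frames. This is a minimal NumPy sketch under that assumption, not the original tool; the function name `dtw_path` is made up for illustration:

```python
import numpy as np

def dtw_path(ref, tst):
    """Align two MFCC sequences (frames x dims) with a basic DTW pass."""
    n, m = len(ref), len(tst)
    # Frame-to-frame Euclidean distances
    dist = np.linalg.norm(ref[:, None, :] - tst[None, :, :], axis=-1)
    # Accumulated cost with the classic three-step recursion
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack the optimal warping path from the end corner
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The returned list of (reference frame, test frame) index pairs plays the role of the alignment the script writes out as a .dtw file, minus the beam constraint that the "-b" option would impose.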

May I contribute a user guide to this repo?

Mr. Albert, I love your work very much,
but it does not have a friendly user guide,
and there may be some bugs in the code.
These days I have been trying to run the code, and I spent quite some time figuring out how to use it. May I share my notes, and could you please spend a little time checking them, so that we can make the project more user-friendly?
Here are my notes. Something seems to go wrong (on step 6, when I use seq2seq), but I don't know where the problem is.

This project contains 3 solutions for voice conversion:
A MCEP-GMM solution based on SPTK tools (the baseline solution of VCC2016: http://vc-challenge.org/summary.html)
A DNN-LSTM-GRU solution converting MVF, log-f0 and Mel-cepstrum features
A seq2seq solution converting MVF, log-f0 and Mel-cepstrum features

To run the MCEP-GMM solution based on SPTK tools:
edit TRAIN_FILENAME in sptk_vc.sh to point to the files you need to convert
run sptk_vc.sh
the output wav files are in data/training/gmm_vc

To run the DNN-LSTM-GRU solution:
run lf0_lstm.py, mvf_dnn.py and mcp_gru.py to train the models that convert each feature
[optional] run lf0_post_training.py, mcp_post_training.py, mvf_post_training.py and mvf_plot_curves.py to verify the models
run decode_aho.sh to merge the features back into a wav file

To run the seq2seq solution:
0. [optional] get all the training files and put them in data/training/
1. apt-get install sox, pip install tensorflow, and so on
2. cp do_columns.pl to /usr/local/bin
3. get tfglib (https://github.com/albertaparicio/tfglib), edit seq2seq_datatable.py (there may be a bug in the nb_classes parameter), and install it
4. install ahocoder from source and add it to your PATH
5. edit data/test/speakers.list (add more speakers if step 0 was performed?)
6. run /data/train/seq2seq_align_training.sh and /data/test/seq2seq_align_test.sh
7. run seq2seq.py
(There is a problem: for some files (like file 200007) no .lf0.dat file is extracted, which may throw an error)
8. run seq2seq_decode_prediction.py

x2x error in decode_aho.sh

When I try to run decode_aho.sh, I receive the error below. Is it a data format error?

Convert parameters to float data
usage: x2x [-f base] [-t base] [-l | -u] value
x2x: error: unrecognized arguments: data/test/predicted/SF1-TF1/200005.lf0.dat

Thank you
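One observation on this error: the usage string `x2x [-f base] [-t base] [-l | -u] value` does not match SPTK's x2x, which takes a format pair such as `+af` (ASCII to float). Most likely a different program also named x2x is shadowing SPTK's on the PATH; running `which x2x` should confirm this. If replacing the binary is not an option, the ASCII-to-binary-float conversion that `x2x +af` performs can be reproduced with NumPy. This is a sketch, and `ascii_to_float32` is a hypothetical helper name, not part of the repository:

```python
import numpy as np

def ascii_to_float32(in_path, out_path):
    """Replicate SPTK's `x2x +af < in > out`: ASCII numbers to raw float32."""
    # Read whitespace-separated ASCII numbers, flatten to a 1-D stream
    data = np.loadtxt(in_path, dtype=np.float64, ndmin=1).ravel()
    # Write them as raw little-endian float32 with no header
    data.astype(np.float32).tofile(out_path)
```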

What is, or where is, "interpolate.py"?

Thank you for your great work.
I ran data/training/seq2seq_align_training.sh and got an error showing that it needs an interpolate.py, which is not on my system. Could you show me how to get interpolate.py?

Here is the code in data/training/seq2seq_align_training.sh:
# Interpolate lf0 and vf data
python $(which interpolate.py) \
  --f0_file ${DIR_VOC}/${DIR_SPK}/${FILENAME}.lf0.dat \
  --vf_file ${DIR_VOC}/${DIR_SPK}/${FILENAME}.vf.dat \
  --no-uv

Running it fails with:
Processing SM2/200040
Unknown option: --
usage: python [option] ... [-c cmd | -m mod | file | -] [arg] ...
Try `python -h' for more information.

I searched for interpolate.py and found that scipy has a file with that name, but it does not take the f0_file, vf_file or no-uv parameters.
How can I get this code to work? Thanks.
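The missing interpolate.py is indeed not scipy's. Judging from its command-line flags (--f0_file, --vf_file, --no-uv), it presumably fills the unvoiced frames of the log-F0 and voiced-frequency tracks by interpolation before alignment. A hypothetical reconstruction of that core step, assuming unvoiced frames are stored as zeros (the function name and the zero convention are guesses, not the original file):

```python
import numpy as np

def interpolate_unvoiced(track, unvoiced_value=0.0):
    """Linearly interpolate a 1-D parameter track over its unvoiced frames."""
    track = np.asarray(track, dtype=float)
    voiced = track != unvoiced_value
    if not voiced.any():
        # Nothing to anchor the interpolation on; return unchanged
        return track.copy()
    idx = np.arange(len(track))
    # Interpolate at every frame index, using only the voiced frames as anchors
    return np.interp(idx, idx[voiced], track[voiced])
```

Edge frames before the first (or after the last) voiced frame are held at the nearest voiced value, which is np.interp's default behavior.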

About preparing data for mcp_gru.py

Hi,
According to mcp_gru.py, the input batch has the size (batch_size, tsteps, data_dim), with batch_size = 1 and tsteps = 50. So I guess that after DTW, we have to chop the source and target files into portions of size (tsteps, data_dim). Is that right?
Do those portions (within the same file) overlap?
Thanks.
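One plausible way to produce those (tsteps, data_dim) portions is a sliding window over the aligned frames. This sketch uses a hypothetical `chop_sequence` helper; the repository does not document whether its windows overlap, so both cases are supported, and leftover frames shorter than tsteps are simply dropped:

```python
import numpy as np

def chop_sequence(features, tsteps=50, overlap=0):
    """Split a (frames, data_dim) matrix into (tsteps, data_dim) windows."""
    step = tsteps - overlap  # stride between consecutive window starts
    starts = range(0, features.shape[0] - tsteps + 1, step)
    # Stack the windows into a (n_windows, tsteps, data_dim) batch array
    return np.stack([features[s:s + tsteps] for s in starts])
```

With batch_size = 1, each of these windows would then be fed to the network one at a time.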

Can't download successfully

Dear Mr. Albert, I am a student from China.
I am very interested in your work, but I can't download the file you shared; it fails with a message saying the link is broken. Could you share a new link? I also hope you can offer a friendly user guide.
Thank you very much!
It would also be great if anyone who has downloaded the files successfully could share them with me.

The configuration of the MCEP-GMM model

Hello,

I'm sorry to interrupt you. I'm interested in your MCEP-GMM voice conversion method, and I have run your program on my computer. However, when I trained a model with 10 Gaussian components and another with 50 Gaussian components, the results of converting the wav file 20007.wav were always the same. So I want to ask: how many Gaussian components and how many iterations did you use?
Also, should the number of Gaussian components in the vc command and in the gmm command be the same?

Thank you for your attention
J.SHI
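As background on the question: the number of components k is fixed when the model is trained, and the same k must be supplied when the model is used for conversion, since the model file stores exactly k sets of weights, means and covariances. The EM training behind such a mixture can be sketched in one dimension; this is an illustrative toy under those assumptions, not SPTK's implementation:

```python
import numpy as np

def fit_gmm_1d(x, k, iters=200):
    """Minimal EM for a 1-D Gaussian mixture with k components."""
    x = np.asarray(x, dtype=float)
    # Initialize means at spread-out quantiles of the data
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from responsibilities
        n = resp.sum(axis=0)
        w = n / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return w, mu, var
```

Because the fitted parameter arrays all have length k, a converter reading the model back must be told the same k, which is why the gmm and vc commands need matching component counts.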
