This is an end-to-end speech recognition system based on CTC, implemented in PyTorch.
At present, the system only supports phoneme recognition.
You can also train it at the word level, but the error rate is likely to be high.
An alternative is to decode with a lexicon and a word-level language model using a WFST, which is not included in this system.
English Corpus: TIMIT
- Training set: 3696 sentences (excluding SA utterances)
- Dev set: 400 sentences
- Test set: 192 sentences
Chinese Corpus: 863 Corpus
- Training set:
Speaker | UtterId | Utterances |
---|---|---|
M50, F50 | A1-A521, AW1-AW129 | 650 sentences |
M54, F54 | B522-B1040, BW130-BW259 | 649 sentences |
M60, F60 | C1041-C1560, CW260-CW388 | 649 sentences |
M64, F64 | D1-D625 | 625 sentences |
All | | 5146 sentences |
- Test set:
Speaker | UtterId | Utterances |
---|---|---|
M51, F51 | A1-A100 | 100 sentences |
M55, F55 | B522-B521 | 100 sentences |
M61, F61 | C1041-C1140 | 100 sentences |
M63, F63 | D1-D100 | 100 sentences |
All | | 800 sentences |
- Install PyTorch
- Install warp-ctc and bind it to PyTorch.
Note: if you use Python 2, reinstall PyTorch from source instead of via pip.
- Install pytorch audio:
sudo apt-get install sox libsox-dev libsox-fmt-all
git clone https://github.com/pytorch/audio.git
cd audio
pip install cffi
python setup.py install
- Install Kaldi. We use Kaldi to extract MFCC and fbank features.
- Install KenLM, which is used to train an n-gram language model if needed.
- Install the other Python packages
pip install -r requirements.txt
- Start visdom
python -m visdom.server
- Install everything listed in the Install section.
- Open the top-level script run.sh and change the data directory and the config file path.
- Change $feats if you want to use fbank or mfcc, and revise the corresponding conf file under the conf directory.
- Open the config file to revise the hyperparameters.
- Run the top-level script in one of four ways
bash run.sh      # data_prepare + AM training + LM training + testing
bash run.sh 1    # AM training + LM training + testing
bash run.sh 2    # LM training + testing
bash run.sh 3    # testing
LM training is not implemented yet; it has been added to the to-do list.
So run.sh will only work once you have prepared the data.
- Extract 39-dim MFCC and 40-dim fbank features with Kaldi.
- Use compute-cmvn-stats and apply-cmvn on the training data to get the global mean and variance, then normalize the features.
- Subclass Dataset and DataLoader from torch.utils.data to prepare data for training. You can find them in steps/dataloader.py.
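The global normalization step can be illustrated without Kaldi. The following is a minimal, dependency-free sketch of the idea behind global mean and variance normalization; the function names `global_cmvn_stats` and `apply_cmvn` are hypothetical and are not part of this repository or of Kaldi, which does the same job on its own feature archives.

```python
import math


def global_cmvn_stats(utterances):
    """Accumulate per-dimension mean and standard deviation over all frames.

    utterances: list of utterances, each a list of frames (lists of floats).
    """
    dim = len(utterances[0][0])
    count = 0
    total = [0.0] * dim
    total_sq = [0.0] * dim
    for utt in utterances:
        for frame in utt:
            count += 1
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
    mean = [s / count for s in total]
    # Var[x] = E[x^2] - E[x]^2, floored so that the division below is safe.
    std = [
        math.sqrt(max(sq / count - m * m, 1e-10))
        for sq, m in zip(total_sq, mean)
    ]
    return mean, std


def apply_cmvn(utt, mean, std):
    """Normalize every frame of one utterance with the global statistics."""
    return [[(v - m) / s for v, m, s in zip(frame, mean, std)] for frame in utt]


# Toy example with one 2-frame utterance of 2-dim features.
train = [[[1.0, 2.0], [3.0, 6.0]]]
mean, std = global_cmvn_stats(train)
normalized = apply_cmvn(train[0], mean, std)
```

The statistics are computed on the training set only and then reused unchanged for the dev and test sets, so that all three are normalized consistently.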
- RNN + DNN + CTC
RNN here can be replaced by nn.LSTM or nn.GRU.
- CNN + RNN + DNN + CTC
CNN is used to reduce the spectral variation caused by speaker and environment differences.
- How to choose
Use add_cnn to choose between the two models. If add_cnn is True, CNN+RNN+DNN+CTC is chosen.
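The two model variants can be sketched as follows. This is an illustrative sketch only, not the model defined in this repository: the class name `CTCModel`, the layer sizes, and the class count of 48 are placeholders, and it uses PyTorch's built-in `nn.CTCLoss` in place of the warp-ctc binding so that the example is self-contained.

```python
import torch
import torch.nn as nn


class CTCModel(nn.Module):
    """RNN + DNN (+ optional CNN front end) producing per-frame log-probabilities."""

    def __init__(self, feat_dim=40, hidden=64, num_classes=48,
                 add_cnn=False, rnn_type=nn.LSTM):
        super().__init__()
        self.add_cnn = add_cnn
        if add_cnn:
            # One small 2-D convolution over the (time, frequency) plane.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            rnn_in = 8 * feat_dim
        else:
            rnn_in = feat_dim
        # rnn_type may be nn.LSTM or nn.GRU; both share this signature.
        self.rnn = rnn_type(rnn_in, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim)
        if self.add_cnn:
            x = self.conv(x.unsqueeze(1))          # (batch, 8, time, feat_dim)
            x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 8 * feat_dim)
        out, _ = self.rnn(x)
        return self.fc(out).log_softmax(dim=-1)    # (batch, time, num_classes)


torch.manual_seed(0)
feats = torch.randn(2, 50, 40)                     # random stand-in for fbank features
model = CTCModel(add_cnn=True, rnn_type=nn.GRU)
log_probs = model(feats)

# nn.CTCLoss expects (time, batch, classes); label 0 is reserved for blank here.
targets = torch.randint(1, 48, (2, 10))
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs.transpose(0, 1), targets,
                           input_lengths, target_lengths)
```

Setting `add_cnn=False` skips the convolution and feeds the features straight into the recurrent layer.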
- initial-lr = 0.001
- decay = 0.5
- weight-decay = 0.005
Adjust the learning rate when the dev loss has stayed around the same value ten times.
The learning rate is adjusted at most 8 times; this can be altered in steps/ctc_train.py (line 367).
The optimizer is torch.optim.Adam with weight decay 0.005.
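One way to read this schedule is sketched below. It is an interpretation, not the logic in steps/ctc_train.py: the class name `PlateauDecay`, the tolerance `tol`, and the choice to stop training once all adjustments are used up are assumptions made for illustration.

```python
class PlateauDecay:
    """Multiply the learning rate by `decay` when the dev loss stops improving."""

    def __init__(self, lr=0.001, decay=0.5, patience=10, max_adjust=8, tol=1e-3):
        self.lr = lr
        self.decay = decay
        self.patience = patience      # stalled evaluations before an adjustment
        self.max_adjust = max_adjust  # maximum number of adjustments
        self.tol = tol                # assumed threshold for "no real improvement"
        self.best = float("inf")
        self.stall = 0
        self.adjusted = 0

    def step(self, dev_loss):
        """Record one dev evaluation; return False when training should stop."""
        if dev_loss < self.best - self.tol:
            self.best = dev_loss
            self.stall = 0
            return True
        self.stall += 1
        if self.stall < self.patience:
            return True
        if self.adjusted >= self.max_adjust:
            return False
        self.lr *= self.decay
        self.adjusted += 1
        self.stall = 0
        return True


sched = PlateauDecay()
sched.step(1.0)            # first evaluation sets the best loss
for _ in range(10):        # ten evaluations without improvement
    sched.step(1.0)
# sched.lr has now been halved once
```

With a real optimizer, the new value would be written back into each parameter group after every adjustment.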
Take the most probable label at each frame as the result and recover the path.
Calculate the WER and CER using the functions of the decoder class.
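Greedy decoding and the error-rate computation can be sketched in a few lines. This is a standalone illustration, not the decoder class from this repository; the function names are hypothetical and label 0 is assumed to be the CTC blank.

```python
def greedy_decode(frame_probs, blank=0):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_probs]
    decoded = []
    prev = None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution or match
        prev_row = row
    return prev_row[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
hyp = greedy_decode(frames)   # best path 1,1,0,1,2,2 collapses to [1, 1, 2]
```

Passing lists of words to `error_rate` gives a WER, while passing strings (or lists of characters or phonemes) gives a CER or phoneme error rate.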
Implemented in Python. Original Code
I modified it to support batch decoding of phonemes.
Beam search improves phoneme accuracy by about 0.2%.
A phoneme-level language model is now integrated into the beam search decoder.
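For readers unfamiliar with the technique, here is a compact sketch of CTC prefix beam search with an optional language model hook. It is not the decoder used in this repository: `ctc_beam_search`, the `lm` callable (assumed to return the probability of a label given the prefix), and `lm_weight` are all hypothetical names, and plain probabilities are used instead of log-probabilities to keep the code short.

```python
from collections import defaultdict


def ctc_beam_search(probs, beam_size=4, blank=0, lm=None, lm_weight=1.0):
    """probs: list of frames, each a list of per-label probabilities."""
    # For each prefix keep (p_blank, p_non_blank): the total probability of
    # alignments that collapse to the prefix and end / do not end in blank.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for label, p in enumerate(frame):
                if label == blank:
                    # Blank keeps the prefix unchanged.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                    continue
                new_prefix = prefix + (label,)
                # The LM score is applied only when the prefix grows.
                lm_p = lm(prefix, label) ** lm_weight if lm else 1.0
                if prefix and prefix[-1] == label:
                    # A repeated label extends the prefix only after a blank,
                    next_beams[new_prefix][1] += p_b * p * lm_p
                    # otherwise it merges into the existing last label.
                    next_beams[prefix][1] += p_nb * p
                else:
                    next_beams[new_prefix][1] += (p_b + p_nb) * p * lm_p
        # Keep the beam_size most probable prefixes.
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    best_prefix, best_scores = max(beams.items(), key=lambda kv: sum(kv[1]))
    return list(best_prefix), sum(best_scores)


# Two frames, two labels (0 = blank). The single best path is blank-blank
# (0.36), but the alignments 1-blank, blank-1 and 1-1 together give 0.64.
seq, score = ctc_beam_search([[0.6, 0.4], [0.6, 0.4]], beam_size=2)
```

The toy input shows why beam search can beat greedy decoding: it sums over all alignments of a prefix instead of following only the single most probable path. When an `lm` is supplied, the returned score is no longer a normalized probability.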
- Combine with RNN-LM
- Beam search with RNN-LM
- The code in 863_corpus is a mess and needs to be reorganized.