The audio-recognition from haiminzhang

KaldiBasedSpeakerVerification
========================================
Author: Qianhui Wan
Version: 1.0.0
Date : 2018-01-23

Prerequisite
------------
1. Kaldi 5.3, as well as Altas and OpenFst required by Kaldi.
https://github.com/kaldi-asr/kaldi

2. libfvad, Voice activity detection (VAD) library, based on WebRTC's VAD engine.
https://github.com/dpirch/libfvad

Installation
------------
1. Install Kaldi 5.3:
git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream
cd kaldi

2. Install Kaldi's required libraries:
cd to /kaldi/tools and follow INSTALL instructions there.

3. Compile and finish Kaldi install:
cd to /kaldi/src and follow INSTALL instructions there.

4. Install libfvad:
git clone https://github.com/dpirch/libfvad
cd libfvad
./bootstrap
./configure
make
make install (perhaps sudo at this command)

5. Install KaldiBasedSpeakerVerification

cd KaldiBasedSpeakerVerification/src
*edit makefile; provide the correct locations for this project and the libraries.
make
(This will output 3 executables under /src: enroll, identifySpeaker and extractFeatures)

Project file structure (under KaldiBasedSpeakerVerification folder)
----------------------------------
/examples
contains enroll and test examples, along with example data

/examples/iv
contains i-vector features extracted from enrollment.(this can be empty before enrolling speakers, must have 2 files before testing)

/examples/mat
contains background model data, must have six files.

/scripts
contains scripts mainly used to create background model.

/src
contains code for 3 applications: creating a background model, enrolling speakers and speaker identification.

Main applications
-------------------------------------------------
/src/enroll.cpp
This program is used to extract speech features from one speaker.
Usage: enroll speakerId wavefile
The output should look like:
Not registered speaker: speakerId. Created a new spkid
or
Found registered speaker: speakerId. Updated speaker model

The wavefile should be in .wav format.

This will create/update two files in /iv: train_iv.ark and train_num_utts.ark.

/src/identifySpeaker.cpp
This program process a given audio clip and output person identification every ~3.2 seconds.
Usage: identifySpeaker wavefile
The output should look like:
Family membmer detected! Speaker: 225
Family membmer detected! Speaker: 225
Stanger detected!
Family membmer detected! Speaker: 227
Family membmer detected! Speaker: 227
...

It will also output the probability score for each segments -> this could be used to adjust the decision threshold due to different audio condition.

Examples
-------------------------------------------------
After installing all required applications, you can run the following examples to test if your installation is right.

1. make sure there is three folder in /examples
/example_data
/iv
/mat (due to the file size limit of GitHub, final.ie was zipped into several parts. To unzip, do: cat iepart* -> final.ie)

2. run ./test1Enroll.sh
This will enroll all speech files in /example_data/enroll
The output should look like:

The total active speech is 1.61 seconds.
No registered speaker: 174. Create a new spkid
Done.
The total active speech is 15 seconds.
Found registered speaker: 174. Update speaker model
Done.
The total active speech is 0.88 seconds.
No registered speaker: 84. Create a new spkid
Done.
The total active speech is 3.47 seconds.
Found registered speaker: 84. Update speaker model
Done.

3. run ./test1Test.sh
This will test speech /example_data/test/84/84-121550-0030.wav against all registered speaker
The output should look like:

Effective speech length: 2.605s.No family member detected. (score: 4.97931)
Effective speech length: 5.685s.Family member detected! Speaker: 84 (score: 33.7779)
Speech data is finished!
Done.

*Note:
There will also be outputs of kaldi log which look like:
LOG ([5.3.96~1-7ee7]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG ([5.3.96~1-7ee7]:ComputeDerivedVars():ivector-extractor.cc:204) Done.

This tells you one audio segment has been processed and can be omitted by setting kaldi verbose level.

Background Model Training
-------------------------------------
/src/extractFeatures
The program extracts 20-dim MFCC (with energy), append deltas and double deltas, and apply CMVN
Usage: extractFeatures wav.scp ark,scp:feat.ark,feat.scp
Input: wav.scp, a text list of speech file name and path
Output: feat.ark, feat.scp -> same as kaldi.

/scripts/data_prep.sh
usage: data_prep.sh path_to_speech path_to_info
prepare useful text file for later process, please refer to data_prep.sh for details

/scripts/utt2spk_to_spk2utt.pl
usage: utt2spk_to_spk2utt.pl utt2spk > spk2utt
create the spk2utt file with given utt2spk file

/scripts/train_ubm.sh
usage: train_ubm.sh path_to_feat path_to_mat
output: final.dubm, final.ubm
please refer to train_ubm.sh for details

/scripts/train_ivextractor.sh
usage: train_ivextractor.sh path_to_feat path_to_mat
output: final.ie
please refer to train_ivextractor.sh for details

/scripts/train_comp_plda.sh
usage: train_comp_plda.sh path_to_feat path_to_mat
output: final.plda, transform.mat, mean_vec
please refer to train_comp_plda.sh for details

The following folders will be created during running:
/dev_data
contains development dataset speech information, MFCC features and i-vectors

/mat
contains all trained models:
final.dubm, final.ubm, final.ie, final.plda, transform.mat, mean_vec

Note: The whole process can take several hours (e.g. 5 to 6 hours from VirtualBox-run CentOS version).
Note: All scripts need to modified manually for the path (same as examples), this can be avoided if you add all paths to environmental variables.

haiminzhang / audio-recognition Goto Github PK

audio-recognition's Introduction

audio-recognition's People

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent