
Speaker_Verification

TensorFlow implementation of Generalized End-to-End Loss for Speaker Verification (Kaggle, paperswithcode). The paper builds on the earlier work End-to-End Text-Dependent Speaker Verification.

Speaker Verification

  • Speaker verification performs a 1:1 check between an enrolled voice and a new voice. This task requires higher accuracy than speaker identification, which performs an N:1 check between N enrolled voices and a new voice.
  • There are two types of speaker verification: 1) text-dependent speaker verification (TD-SV) and 2) text-independent speaker verification (TI-SV). The former uses text-specific utterances for enrollment and verification, whereas the latter uses text-independent utterances.
  • At each forward step, the utterance similarity matrix is calculated and the integrated (GE2E) loss is used as the objective function (see Section 2.1 of the paper and the sketch below).
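A minimal NumPy sketch of that similarity matrix and the softmax variant of the GE2E loss (names, shapes, and the w, b starting values are my own illustration, not the repository's API): every utterance embedding is scored against every speaker centroid by scaled cosine similarity, and each row is pushed toward its own speaker's column.

    import numpy as np

    def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
        """embeddings: (N speakers, M utterances, D) of L2-normalized d-vectors."""
        N, M, D = embeddings.shape
        centroids = embeddings.mean(axis=1)                               # (N, D) speaker centroids
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
        # scaled cosine similarity of every utterance against every centroid
        sim = w * np.einsum('nmd,kd->nmk', embeddings, centroids) + b     # (N, M, N)
        # softmax variant: each utterance should be most similar to its own speaker's centroid
        log_softmax = sim - np.log(np.exp(sim).sum(axis=2, keepdims=True))
        # note: eq. (8)-(9) of the paper exclude the utterance itself from its own
        # speaker's centroid; that refinement is omitted here for brevity
        return -np.mean([log_softmax[j, :, j] for j in range(N)])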

Files

  • configuration.py : Argument parsing
  • data_preprocess.py : Extracts noise and performs STFT on raw audio. For each raw audio file, voice activity detection is performed via the librosa library.
  • utils.py : Contains various utility functions for training and testing.
  • model.py : Contains train and test functions.
  • main.py : After the dataset is prepared, run
python main.py --train True --model_path where_you_want_to_save                 # training
python main.py --train False --model_path the_model_path_used_at_training       # test

Data

  • Note: the authors of the paper used their own private dataset, which I could not obtain.
  • In this implementation, I used the public VCTK dataset (CSTR VCTK Corpus) and a noise-added VCTK dataset (from "Noisy speech database for training speech enhancement algorithms and TTS models").
  • The VCTK dataset includes speech data uttered by 109 native English speakers with various accents.
  • For TD-SV, I used the first audio file of each speaker, in which the speaker says "Call Stella". For each training and test sample, I added random noise extracted from the noise-added VCTK dataset.
  • For TI-SV, I used randomly selected utterances from each speaker. The silent parts of the raw audio files are trimmed, and then slicing is performed (a rough sketch of this preprocessing follows below).
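As a rough illustration of the preprocessing above (a sketch built on librosa, not the exact code in data_preprocess.py; the sample rate and frame settings here are illustrative), silence is trimmed, the STFT is taken, and a fixed-length log-mel slice is kept:

    import numpy as np
    import librosa

    def utterance_to_logmel(path, sr=16000, n_fft=512, hop=0.01, win=0.025,
                            n_mels=40, tisv_frame=180):
        """Load an utterance, trim leading/trailing silence, return a log-mel spectrogram slice."""
        utter, _ = librosa.load(path, sr=sr)
        utter, _ = librosa.effects.trim(utter, top_db=30)        # crude voice activity detection
        S = librosa.stft(utter, n_fft=n_fft,
                         hop_length=int(sr * hop), win_length=int(sr * win))
        mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        logmel = np.log10(mel_basis @ (np.abs(S) ** 2) + 1e-6)   # (n_mels, n_frames)
        return logmel[:, :tisv_frame]                            # keep one fixed-length slice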

Results

I trained the model on my laptop CPU. The model hyperparameters follow the paper:

  • 3 LSTM layers with 128 hidden nodes and 64 projection nodes (210,434 variables in total)
  • SGD with learning rate 0.01 and 0.5 decay
  • L2 norm gradient clipping at 3
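For reference, a stack like this can be written in TF1-style code roughly as follows (a minimal sketch only; the actual construction lives in model.py, and the placeholder shape is an assumption):

    import tensorflow as tf  # TF1.x-style API

    num_layers, num_hidden, num_proj, n_mels = 3, 128, 64, 40

    # input batch of log-mel frames, time-major: (time, N*M utterances, n_mels)
    batch = tf.placeholder(tf.float32, shape=[None, None, n_mels])

    # 3 LSTM layers with 128 hidden units, each projected down to 64 dimensions
    cells = [tf.nn.rnn_cell.LSTMCell(num_units=num_hidden, num_proj=num_proj)
             for _ in range(num_layers)]
    outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.MultiRNNCell(cells), batch,
                                   dtype=tf.float32, time_major=True)

    # the d-vector is the L2-normalized projection output at the last time step
    embedded = tf.nn.l2_normalize(outputs[-1], axis=1)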

To finish training and testing in time, I used a smaller batch (4 speakers x 5 utterances) than the paper. I used the first 85% of the dataset as the training set and the remaining part as the test set. Below, I used the softmax loss (the contrastive loss is also implemented in this code). In my environment, computing embeddings for 40 utterances takes less than 1 s.

  1. TD-SV
    For each utterance, random noise is added at each forward step. I tested the model after 60,000 iterations. The resulting Equal Error Rate (EER) is 0, so the model performs well on this small population.

The figure below contains a similarity matrix and its EER, FAR, and FRR. Each matrix corresponds to one speaker. If we call the first matrix A (5x4), then A[i,j] is the cosine similarity between the first speaker's i-th verification utterance and the j-th speaker's enrollment.
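A hedged sketch of how EER, FAR, and FRR can be read off such a similarity matrix (my own helper, not the repository's utility): sweep a threshold and take the point where the false acceptance and false rejection rates cross.

    import numpy as np

    def eer_from_similarity(S):
        """S: (N, M, N); S[i, j, k] scores speaker i's j-th verification utterance against speaker k."""
        N, M, _ = S.shape
        best = (1.0, 0.0, 1.0, 1.0)                               # (|FAR - FRR|, thres, FAR, FRR)
        for thres in np.linspace(0.0, 1.0, 101):
            accepted = S > thres
            genuine = sum(accepted[i, :, i].sum() for i in range(N))
            far = (accepted.sum() - genuine) / (N * M * (N - 1))  # impostor trials accepted
            frr = 1.0 - genuine / (N * M)                         # genuine trials rejected
            if abs(far - frr) < best[0]:
                best = (abs(far - frr), thres, far, frr)
        _, thres, far, frr = best
        return (far + frr) / 2, thres, far, frr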

  2. TI-SV
    Randomly selected utterances are used. I tested the model after 60,000 iterations. Here, the Equal Error Rate (EER) is 0.09.

LICENSE

MIT License

Contributors

honghe, janghyun1230, zh794390558


Issues

Inconsistency with equation (8) (possible missing division by M-1)

Hi!
First of all, I want to thank you for a great library - it saved me a lot of time!
I noticed that when you calculate center_except in similarity(), you follow equation (8) from the paper, which calculates the mean of all the elements except one, so there should be a summation and a division by M-1.
I see that you perform the summation and subtract the relevant element; however, there is no division by M-1. Is there a chance it is missing?
Thanks!
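For reference, equation (8) of the paper defines the leave-one-out centroid as the average of the remaining M-1 embeddings of the same speaker; a minimal NumPy sketch of that computation (my own helper, not the repository's similarity()):

    import numpy as np

    def center_except(embeddings):
        """Per-utterance exclusive centroids, eq. (8). embeddings: (N, M, D) -> (N, M, D)."""
        N, M, D = embeddings.shape
        total = embeddings.sum(axis=1, keepdims=True)     # (N, 1, D) per-speaker sums
        return (total - embeddings) / (M - 1)             # leave-one-out mean over the other M-1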

TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type int32 of argument 'x'

I passed embeddings (enrollment, verification) and checked the similarities between them.
It throws this error.

Traceback (most recent call last):
   File "just_see.py", line 178, in speaker_verification
    similarity_matrix = similarity(embedded=verif_embed, w=1, b=0, N=N, M=M, P=proj, center=enroll_embed)
  File "just_see.py", line 105, in similarity
    S = tf.abs(w)*S+b   # rescaling
  File "/env/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 866, in binary_op_wrapper
    return func(x, y, name=name)
  File "/env/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1131, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5042, in mul
    "Mul", x=x, y=y, name=name)
  File "/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 546, in _apply_op_helper
    inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type int32 of argument 'x'.

the noise_filenum

Hi! I have a question on your noise extraction part.
In configuration.py you've described "noise_filenum" as "how many noise files will you use". In data_preprocess.py you seem to extract noise from each paired noisy-clean file and save it as a *.npy file. In the TD-SV train and test, however, you leave "noise_filenum" at its default (16), which means you only ever add noise file #16 to any utterance.
So what does "how many noise files will you use" mean? I think it would be better described as "the index of the noise file you will use". Is my understanding correct?

Confusion between Enrollment and Verification batches

Suppose I want to test the voice/numpy array/features of one speaker against seven different speakers.

Does the data of the single speaker go in the enrollment or the verification batch? Similarly for the seven speakers.

I'm just confused since, in the code, both the enrollment and verification batches are created the same way, so it's difficult to figure out what's being measured against what.

Error while executing data_preprocess.py

Hi @Janghyun1230,
I got this error (processing up to speaker 14 was fine):
15th speaker processing...
Traceback (most recent call last):
File "data_preprocess.py", line 139, in
save_spectrogram_tisv()
File "data_preprocess.py", line 112, in save_spectrogram_tisv
utter, sr = librosa.core.load(utter_path, config.sr) # load utterance audio
File "/usr/local/lib/python3.5/dist-packages/librosa/core/audio.py", line 112, in load
with audioread.audio_open(os.path.realpath(path)) as input_file:
File "/usr/local/lib/python3.5/dist-packages/audioread/init.py", line 116, in audio_open
raise NoBackendError()
audioread.NoBackendError

Any clues?

Thanks in advance,

What are the files under the dataset?

For TI-SV, I downloaded the VCTK dataset. After decompressing, there are clean audio and noisy audio folders, for training and testing. For data preprocessing, do I put all audio files in one directory? What files should be stored in audio_path?

discrepancy in feature extraction process

Why have you not extracted the 40-dimensional mel filterbank features for TD-SV? You have, however, extracted these features for TI-SV.

I am referring to the two functions in data_preprocess.py namely
save_spectrogram_tdsv
and save_spectrogram_tisv

You have not used the librosa.filters.mel function in save_spectrogram_tdsv. Can you please elaborate on why you did so?

Why take first and last 180 frames from the utterance spectrogram?

I'm just wondering why you use the following lines:

utterances_spec.append(S[:, :config.tisv_frame])    # first 180 frames of partial utterance
utterances_spec.append(S[:, -config.tisv_frame:])   # last 180 frames of partial utterance

There are many examples where S.shape[1] < 360, which causes repeated frames in utterances_spec. Could this pose a problem?

Lack of batch shuffling during training

It seems that the batch is not shuffled during training. As a result, the loss decreases very quickly but the model learns nothing; it only learns the fixed structure of the similarity matrix, and after training the model does not work.
Each batch of data should be permuted and then un-permuted. Here is an example from the PyTorch version,
Pytorch_Speaker_Verification:

        mel_db_batch = torch.reshape(mel_db_batch, (hp.train.N*hp.train.M, mel_db_batch.size(2), mel_db_batch.size(3)))
        perm = random.sample(range(0, hp.train.N*hp.train.M), hp.train.N*hp.train.M)
        unperm = list(perm)
        for i,j in enumerate(perm):
            unperm[j] = i
        mel_db_batch = mel_db_batch[perm]
        #gradient accumulates
        optimizer.zero_grad()
        embeddings = embedder_net(mel_db_batch)
        embeddings = embeddings[unperm]
        embeddings = torch.reshape(embeddings, (hp.train.N, hp.train.M, embeddings.size(1)))

Why define a new graph when running the test?

To my knowledge, the test process could be simplified as follows:
restore the model and get the tensors we need, including the input utterances and the output embedding matrix, something like:

# code below may not work, just for illustrating the idea
saver.restore(sess, model_path)
x = tf.get_default_graph().get_tensor_by_name("x:0")
embedding_matrix = tf.get_default_graph().get_tensor_by_name("embedding_matrix:0")

enroll = sess.run(embedding_matrix, feed_dict={x:next_batch()})
verif = sess.run(embedding_matrix, feed_dict={x:next_batch(start=M)})

enroll_center = cal_center(enroll)

S = calculate_similarity_matrix(verif, enroll)

If the above process is right, why define a new graph in test()? Are there any differences besides the larger batch size?

Thanks.

weird inference results for similar speakers

Hi @Janghyun1230
I trained the model on the VCTK dataset (by reproducing your work).
For inference, I am trying to verify speakers from the LibriSpeech dataset, and I obtain bizarre results each time. For instance, below are results for the same speaker (I split that speaker's *.wav files into two different folders and fed them to the model). Hence N=2 (though in reality it is the same speaker) and M=4 utterances. The results below indicate that the model failed to detect that we are dealing with the same speaker.
Do you have any explanation for this? Should I train the model on a bigger dataset to get better results?
inference time for 16 utterences : 0.18s
[[[0.87 0.29]
[0.79 0.07]
[0.93 0.24]
[0.81 0.17]]

[[0.42 0.89]
[0.4 0.81]
[0.53 0.73]
[0.52 0.62]]]

EER : 0.00 (thres:0.54, FAR:0.00, FRR:0.00)

Inference different from paper

Hi,

Are you doing the following?

During inference time, for every utterance we apply a sliding window of fixed size (lb + ub)/2 = 160 frames with 50% overlap. We compute the d-vector for each window. The final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average (as shown in Figure 4).
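That sliding-window scheme could be sketched roughly as follows (a minimal sketch assuming a d_vector(frames) function that embeds one window; the 160-frame window and 50% overlap follow the quoted description):

    import numpy as np

    def utterance_dvector(spectrogram, d_vector, window=160):
        """Embed fixed-size windows with 50% overlap, L2-normalize them, and average element-wise.
        Assumes the utterance has at least `window` frames."""
        hop = window // 2
        windows = [spectrogram[:, start:start + window]
                   for start in range(0, spectrogram.shape[1] - window + 1, hop)]
        vecs = np.stack([d_vector(w) for w in windows])           # (n_windows, D)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)       # L2-normalize each window's d-vector
        return vecs.mean(axis=0)                                  # element-wise average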

Compare 2 d-vectors without center

Hi, can I compare two wav files by directly calculating the cosine similarity of their two embedded d-vectors, without centers, for open-set datasets? Since there is no enrollment process.
Thanks.
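For reference, comparing two utterances without centroids reduces to the cosine similarity of their d-vectors; a minimal sketch:

    import numpy as np

    def cosine_similarity(d1, d2):
        """Cosine similarity of two d-vectors; thresholding this gives a center-free comparison."""
        return float(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))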

Failed to interpret file '.../p254_237.wav' as a pickle

Hi,
I'm trying to run the code on the VCTK dataset but get this error in the train function of model.py. Any ideas?
In other words, why are you reading sample files during training with the npyio.load function? Isn't this function designed to read/write numpy arrays from/to disk?

  File "/home/mohsen/.local/lib/python3.5/site-packages/numpy/lib/npyio.py", line 447, in load
    return pickle.load(fid, **pickle_kwargs)
_pickle.UnpicklingError: bad pickle data
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/mohsen/.IntelliJIdea2018.3/config/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/mohsen/.IntelliJIdea2018.3/config/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/mohsen/workspace/Speaker_Verification/main.py", line 14, in <module>
    train(config.model_path)
  File "/home/mohsen/workspace/Speaker_Verification/model.py", line 69, in train
    _, loss_cur, summary = sess.run([train_op, loss, merged], feed_dict={batch: random_batch(utter_start=1), lr: config.lr * lr_factor})
  File "/home/mohsen/workspace/Speaker_Verification/utils.py", line 78, in random_batch
    utters = np.load(os.path.join(path, file))        # load utterance spectrogram of selected speaker
  File "/home/mohsen/.local/lib/python3.5/site-packages/numpy/lib/npyio.py", line 450, in load
    "Failed to interpret file %s as a pickle" % repr(file))
OSError: Failed to interpret file '/home/mohsen/workspace/Speaker_Verification/train_tisv/p254_237.wav' as a pickle

I'm facing a PermissionError when training the model

Here is the error description:

Traceback (most recent call last):
File "main.py", line 14, in
train(config.model_path)
File "D:\graduating\Speaker_Verification-master\model.py", line 66, in train
feed_dict={batch: random_batch(), lr: config.lr*lr_factor})
File "D:\graduating\Speaker_Verification-master\utils.py", line 76, in random_batch
utters = np.load(os.path.join(path, file)) # load utterance spectrogram of selected speaker
File "D:\anaconda\envs\tensorflow\lib\site-packages\numpy\lib\npyio.py", line 428, in load
fid = open(os_fspath(file), "rb")
PermissionError: [Errno 13] Permission denied: './train_tisv\p277'

I googled for this but found no solution.
Could anyone provide a solution, or has someone faced and solved the same error?

new speaker

Hello, can this project be used to identify a new speaker?
I mean a wav file that belongs to none of the enrollment utterances.

Result Visualization

After testing the trained model using the command:

python main.py --train False --model_path model_output

I get results as

inference time for 40 utterences : 1.64s
[[[ 0.73 -0.43 -0.3 0. ]
[ 0.65 -0.42 -0.47 -0.39]
[ 0.62 -0.42 -0.48 -0.4 ]
[ 0.96 -0.28 -0.14 -0.05]
[ 0.68 -0.54 -0.4 -0.08]]

[[ 0.11 0.83 0.37 -0.24]
[-0.08 0.86 0.49 -0.24]
[-0.08 0.87 0.48 -0.25]
[-0.15 0.96 0.69 0.08]
[-0.17 0.97 0.68 0.01]]

[[-0.04 0.65 0.89 -0.05]
[-0.02 0.7 0.94 -0.01]
[-0.03 0.69 0.94 -0.04]
[-0.14 0.79 0.96 0.08]
[-0.07 0.75 0.89 0.06]]

[[-0.16 -0.13 0.07 0.88]
[-0.05 -0.18 0.1 0.95]
[-0.05 -0.24 0.08 0.94]
[-0.05 -0.09 0.31 0.93]
[-0.06 -0.18 0.06 0.91]]]

EER : 0.10 (thres:0.65, FAR:0.10, FRR:0.10)

And I want to visualize these results, for example for a specific number of people at a specific timestamp, but I have no idea how to do it.

So, can anyone help me with this?
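One simple way to visualize such a result (a sketch using matplotlib; S is assumed to hold the (N, M, N) array printed above) is to draw each verification speaker's block as a heatmap:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_similarity(S):
        """S: (N, M, N) similarity matrix; one heatmap per verification speaker."""
        N = S.shape[0]
        fig, axes = plt.subplots(1, N, figsize=(3 * N, 3))
        for i, ax in enumerate(np.atleast_1d(axes)):
            im = ax.imshow(S[i], vmin=-1, vmax=1, cmap='viridis')
            ax.set_title('verif. speaker %d' % i)
            ax.set_xlabel('enrolled speaker')
            ax.set_ylabel('utterance')
        fig.colorbar(im, ax=axes, shrink=0.8)
        plt.show()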

ERROR

screenshot from 2018-12-26 18-00-00
screenshot from 2018-12-26 18-00-32
The model is not created, please help!

About log-mel feature extraction

When I feed 1000 ms of audio into the feature extraction process, with a frame length of 25 ms and a frame shift of 10 ms, there should be 98 frames in total, but your code shows 101 frames. Why?

Looking forward to your reply.
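For what it's worth, the mismatch is consistent with librosa's default centered STFT, which pads the signal by half a window on each side so that the frame count depends only on the hop length; a quick check assuming 16 kHz audio (hence a 400-sample window and a 160-sample hop):

    # 1000 ms of audio at 16 kHz, 25 ms window, 10 ms hop
    n_samples, win, hop = 16000, 400, 160

    # without padding: only frames that fit entirely inside the signal
    print(1 + (n_samples - win) // hop)   # 98  (the expected count)

    # with librosa.stft's default center=True padding (win // 2 on each side)
    print(1 + n_samples // hop)           # 101 (what the code reports)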
