
Speaker_Verification

TensorFlow implementation of Generalized End-to-End Loss for Speaker Verification (Kaggle, paperswithcode). The paper builds on the earlier work End-to-End Text-Dependent Speaker Verification.

Speaker Verification

  • Speaker verification performs a 1:1 check between an enrolled voice and a new voice. This task requires higher accuracy than speaker identification, which performs an N:1 check between N enrolled voices and a new voice.
  • There are two types of speaker verification: 1) text-dependent speaker verification (TD-SV) and 2) text-independent speaker verification (TI-SV). The former uses text-specific utterances for enrollment and verification, whereas the latter uses text-independent utterances.
  • At each forward step, the utterance similarity matrix is calculated and the integrated (GE2E) loss is used as the objective function (see Section 2.1 of the paper and the sketch below).
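A minimal NumPy sketch of that similarity matrix and the softmax variant of the GE2E loss (names, shapes, and the w, b starting values are my own illustration, not the repository's API): every utterance embedding is scored against every speaker centroid by scaled cosine similarity, and each row is pushed toward its own speaker's column.

    import numpy as np

    def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
        """embeddings: (N speakers, M utterances, D) of L2-normalized d-vectors."""
        N, M, D = embeddings.shape
        centroids = embeddings.mean(axis=1)                               # (N, D) speaker centroids
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
        # scaled cosine similarity of every utterance against every centroid
        sim = w * np.einsum('nmd,kd->nmk', embeddings, centroids) + b     # (N, M, N)
        # softmax variant: each utterance should be most similar to its own speaker's centroid
        log_softmax = sim - np.log(np.exp(sim).sum(axis=2, keepdims=True))
        # note: eq. (8)-(9) of the paper exclude the utterance itself from its own
        # speaker's centroid; that refinement is omitted here for brevity
        return -np.mean([log_softmax[j, :, j] for j in range(N)])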

Files

  • configuration.py : Argument parsing
  • data_preprocess.py : Extracts noise and performs STFT on raw audio. For each raw audio file, voice activity detection is performed via the librosa library.
  • utils.py : Contains various utility functions for training and testing.
  • model.py : Contains train and test functions.
  • main.py : After the dataset is prepared, run
python main.py --train True --model_path where_you_want_to_save                 # training
python main.py --train False --model_path the_model_path_used_at_training       # test

Data

  • Note: the authors of the paper used their own private dataset, which I could not obtain.
  • In this implementation, I used the public VCTK dataset (CSTR VCTK Corpus) and a noise-added VCTK dataset (from "Noisy speech database for training speech enhancement algorithms and TTS models").
  • The VCTK dataset includes speech data uttered by 109 native English speakers with various accents.
  • For TD-SV, I used the first audio file of each speaker, in which the speaker says "Call Stella". For each training and test sample, I added random noise extracted from the noise-added VCTK dataset.
  • For TI-SV, I used randomly selected utterances from each speaker. The silent parts of the raw audio files are trimmed, and then slicing is performed (a rough sketch of this preprocessing follows below).
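As a rough illustration of the preprocessing above (a sketch built on librosa, not the exact code in data_preprocess.py; the sample rate and frame settings here are illustrative), silence is trimmed, the STFT is taken, and a fixed-length log-mel slice is kept:

    import numpy as np
    import librosa

    def utterance_to_logmel(path, sr=16000, n_fft=512, hop=0.01, win=0.025,
                            n_mels=40, tisv_frame=180):
        """Load an utterance, trim leading/trailing silence, return a log-mel spectrogram slice."""
        utter, _ = librosa.load(path, sr=sr)
        utter, _ = librosa.effects.trim(utter, top_db=30)        # crude voice activity detection
        S = librosa.stft(utter, n_fft=n_fft,
                         hop_length=int(sr * hop), win_length=int(sr * win))
        mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        logmel = np.log10(mel_basis @ (np.abs(S) ** 2) + 1e-6)   # (n_mels, n_frames)
        return logmel[:, :tisv_frame]                            # keep one fixed-length slice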

Results

I trained the model on my laptop CPU. The model hyperparameters follow the paper:

  • 3 LSTM layers with 128 hidden nodes and 64 projection nodes (210,434 variables in total)
  • SGD with learning rate 0.01 and 0.5 decay
  • L2 norm gradient clipping at 3
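For reference, a stack like this can be written in TF1-style code roughly as follows (a minimal sketch only; the actual construction lives in model.py, and the placeholder shape is an assumption):

    import tensorflow as tf  # TF1.x-style API

    num_layers, num_hidden, num_proj, n_mels = 3, 128, 64, 40

    # input batch of log-mel frames, time-major: (time, N*M utterances, n_mels)
    batch = tf.placeholder(tf.float32, shape=[None, None, n_mels])

    # 3 LSTM layers with 128 hidden units, each projected down to 64 dimensions
    cells = [tf.nn.rnn_cell.LSTMCell(num_units=num_hidden, num_proj=num_proj)
             for _ in range(num_layers)]
    outputs, _ = tf.nn.dynamic_rnn(tf.nn.rnn_cell.MultiRNNCell(cells), batch,
                                   dtype=tf.float32, time_major=True)

    # the d-vector is the L2-normalized projection output at the last time step
    embedded = tf.nn.l2_normalize(outputs[-1], axis=1)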

To finish training and testing in time, I used a smaller batch (4 speakers x 5 utterances) than the paper. I used the first 85% of the dataset as the training set and the remaining part as the test set. Below, I used the softmax loss (the contrastive loss is also implemented in this code). In my environment, computing embeddings for 40 utterances takes less than 1 s.

  1. TD-SV
    For each utterance, random noise is added at each forward step. I tested the model after 60,000 iterations. The resulting Equal Error Rate (EER) is 0, so the model performs well on this small population.

The figure below contains a similarity matrix and its EER, FAR, and FRR. Each matrix corresponds to one speaker. If we call the first matrix A (5x4), then A[i,j] is the cosine similarity between the first speaker's i-th verification utterance and the j-th speaker's enrollment.
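A hedged sketch of how EER, FAR, and FRR can be read off such a similarity matrix (my own helper, not the repository's utility): sweep a threshold and take the point where the false acceptance and false rejection rates cross.

    import numpy as np

    def eer_from_similarity(S):
        """S: (N, M, N); S[i, j, k] scores speaker i's j-th verification utterance against speaker k."""
        N, M, _ = S.shape
        best = (1.0, 0.0, 1.0, 1.0)                               # (|FAR - FRR|, thres, FAR, FRR)
        for thres in np.linspace(0.0, 1.0, 101):
            accepted = S > thres
            genuine = sum(accepted[i, :, i].sum() for i in range(N))
            far = (accepted.sum() - genuine) / (N * M * (N - 1))  # impostor trials accepted
            frr = 1.0 - genuine / (N * M)                         # genuine trials rejected
            if abs(far - frr) < best[0]:
                best = (abs(far - frr), thres, far, frr)
        _, thres, far, frr = best
        return (far + frr) / 2, thres, far, frr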

  2. TI-SV
    Randomly selected utterances are used. I tested the model after 60,000 iterations. Here, the Equal Error Rate (EER) is 0.09.

LICENSE

MIT License

Contributors

honghe, janghyun1230, zh794390558


Issues

Inconsistency with equation (8) (possible missing division by M-1)

Hi!
First of all, I want to thank you for a great library - it saved me a lot of time!
I noticed that when you calculate center_except in similarity(), you follow equation (8) from the paper, which calculates the mean of all the elements except one, so there should be a summation and a division by M-1.
I see that you perform the summation and subtract the relevant element; however, there is no division by M-1. Is there a chance it is missing?
Thanks!
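For reference, equation (8) of the paper defines the leave-one-out centroid as the average of the remaining M-1 embeddings of the same speaker; a minimal NumPy sketch of that computation (my own helper, not the repository's similarity()):

    import numpy as np

    def center_except(embeddings):
        """Per-utterance exclusive centroids, eq. (8). embeddings: (N, M, D) -> (N, M, D)."""
        N, M, D = embeddings.shape
        total = embeddings.sum(axis=1, keepdims=True)     # (N, 1, D) per-speaker sums
        return (total - embeddings) / (M - 1)             # leave-one-out mean over the other M-1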

TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type int32 of argument 'x'

I passed embeddings (enrollment, verification) and checked the similarities between them.
It throws this error.

Traceback (most recent call last):
   File "just_see.py", line 178, in speaker_verification
    similarity_matrix = similarity(embedded=verif_embed, w=1, b=0, N=N, M=M, P=proj, center=enroll_embed)
  File "just_see.py", line 105, in similarity
    S = tf.abs(w)*S+b   # rescaling
  File "/env/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 866, in binary_op_wrapper
    return func(x, y, name=name)
  File "/env/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1131, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5042, in mul
    "Mul", x=x, y=y, name=name)
  File "/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 546, in _apply_op_helper
    inferred_from[input_arg.type_attr]))
TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type int32 of argument 'x'.

the noise_filenum

Hi! I have a question on your noise extraction part.
In configuration.py you've described "noise_filenum" as "how many noise files will you use". In data_preprocess.py you seem to extract noise from each paired noisy-clean file and save it as a *.npy file. In the TD-SV train and test, however, you leave "noise_filenum" at its default (16), which means you only ever add noise file #16 to any utterance.
So what does "how many noise files will you use" mean? I think it would be better described as "the index of the noise file you will use". Is my understanding correct?

Confusion between Enrollment and Verification batches

Suppose I want to test the voice/numpy array/features of one speaker against seven different speakers.

Does the data of the single speaker go in the enrollment or the verification batch? Similarly for the seven speakers.

I'm just confused since, in the code, both the enrollment and verification batches are created the same way, so it's difficult to figure out what's being measured against what.

Error while executing data_preprocess.py

Hi @Janghyun1230,
I got this error (processing up to speaker 14 was fine):
15th speaker processing...
Traceback (most recent call last):
File "data_preprocess.py", line 139, in
save_spectrogram_tisv()
File "data_preprocess.py", line 112, in save_spectrogram_tisv
utter, sr = librosa.core.load(utter_path, config.sr) # load utterance audio
File "/usr/local/lib/python3.5/dist-packages/librosa/core/audio.py", line 112, in load
with audioread.audio_open(os.path.realpath(path)) as input_file:
File "/usr/local/lib/python3.5/dist-packages/audioread/init.py", line 116, in audio_open
raise NoBackendError()
audioread.NoBackendError

Any clues?

Thanks in advance,

What are the files under the dataset?

For TI-SV, I downloaded the VCTK dataset. After decompressing, there are clean audio and noisy audio folders, for training and testing. For data preprocessing, do I put all audio files in one directory? What files should be stored in audio_path?

discrepancy in feature extraction process

Why have you not extracted the 40-dimensional mel filterbank features for TD-SV? You have, however, extracted these features for TI-SV.

I am referring to the two functions in data_preprocess.py namely
save_spectrogram_tdsv
and save_spectrogram_tisv

You have not used the librosa.filters.mel function in save_spectrogram_tdsv. Can you please elaborate on why you did so?

Why take first and last 180 frames from the utterance spectrogram?

I'm just wondering why you use the following lines:

utterances_spec.append(S[:, :config.tisv_frame])    # first 180 frames of partial utterance
utterances_spec.append(S[:, -config.tisv_frame:])   # last 180 frames of partial utterance

There are many examples where S.shape[1] < 360, which causes repeated frames in utterances_spec. Could this pose a problem?

Lack of batch shuffling during training

It seems that the batch is not shuffled during training. As a result, the loss decreases very quickly but the model learns nothing; it only learns the fixed structure of the similarity matrix, and after training the model does not work.
Each batch of data should be permuted and then un-permuted. Here is an example from the PyTorch version,
Pytorch_Speaker_Verification:

        mel_db_batch = torch.reshape(mel_db_batch, (hp.train.N*hp.train.M, mel_db_batch.size(2), mel_db_batch.size(3)))
        perm = random.sample(range(0, hp.train.N*hp.train.M), hp.train.N*hp.train.M)
        unperm = list(perm)
        for i,j in enumerate(perm):
            unperm[j] = i
        mel_db_batch = mel_db_batch[perm]
        #gradient accumulates
        optimizer.zero_grad()
        embeddings = embedder_net(mel_db_batch)
        embeddings = embeddings[unperm]
        embeddings = torch.reshape(embeddings, (hp.train.N, hp.train.M, embeddings.size(1)))

Why define a new graph when running the test?

To my knowledge, the test process could be simplified as follows:
restore the model and get the tensors we need, including the input utterances and the output embedding matrix, something like:

# code below may not work, just for illustrating the idea
saver.restore(sess, model_path)
x = tf.get_default_graph().get_tensor_by_name("x:0")
embedding_matrix = tf.get_default_graph().get_tensor_by_name("embedding_matrix:0")

enroll = sess.run(embedding_matrix, feed_dict={x:next_batch()})
verif = sess.run(embedding_matrix, feed_dict={x:next_batch(start=M)})

enroll_center = cal_center(enroll)

S = calculate_similarity_matrix(verif, enroll)

If the above process is right, why define a new graph in test()? Are there any differences besides the larger batch size?

Thanks.

weird inference results for similar speakers

Hi @Janghyun1230
I trained the model on the VCTK dataset (by reproducing your work).
For inference, I am trying to verify speakers from the LibriSpeech dataset, and I obtain bizarre results each time. For instance, below are results for the same speaker (I split that speaker's *.wav files into two different folders and fed them to the model). Hence N=2 (though in reality it is the same speaker) and M=4 utterances. The results below indicate that the model failed to detect that we are dealing with the same speaker.
Do you have any explanation for this? Should I train the model on a bigger dataset to get better results?
inference time for 16 utterences : 0.18s
[[[0.87 0.29]
[0.79 0.07]
[0.93 0.24]
[0.81 0.17]]

[[0.42 0.89]
[0.4 0.81]
[0.53 0.73]
[0.52 0.62]]]

EER : 0.00 (thres:0.54, FAR:0.00, FRR:0.00)

Inference different from paper

Hi,

Are you doing the following?

During inference time, for every utterance we apply a sliding window of fixed size (lb + ub)/2 = 160 frames with 50% overlap. We compute the d-vector for each window. The final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average (as shown in Figure 4).
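That sliding-window scheme could be sketched roughly as follows (a minimal sketch assuming a d_vector(frames) function that embeds one window; the 160-frame window and 50% overlap follow the quoted description):

    import numpy as np

    def utterance_dvector(spectrogram, d_vector, window=160):
        """Embed fixed-size windows with 50% overlap, L2-normalize them, and average element-wise.
        Assumes the utterance has at least `window` frames."""
        hop = window // 2
        windows = [spectrogram[:, start:start + window]
                   for start in range(0, spectrogram.shape[1] - window + 1, hop)]
        vecs = np.stack([d_vector(w) for w in windows])           # (n_windows, D)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)       # L2-normalize each window's d-vector
        return vecs.mean(axis=0)                                  # element-wise average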

Compare 2 d-vectors without center

Hi, can I compare two wav files by directly calculating the cosine similarity of their two embedded d-vectors, without centers, for open-set datasets? Since there is no enrollment process.
Thanks.
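For reference, comparing two utterances without centroids reduces to the cosine similarity of their d-vectors; a minimal sketch:

    import numpy as np

    def cosine_similarity(d1, d2):
        """Cosine similarity of two d-vectors; thresholding this gives a center-free comparison."""
        return float(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))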

Failed to interpret file '.../p254_237.wav' as a pickle

Hi,
I'm trying to run the code on the VCTK dataset but get this error in the train function of model.py. Any ideas?
In other words, why are you reading sample files during training with the npyio.load function? Isn't this function designed to read/write numpy arrays from/to disk?

  File "/home/mohsen/.local/lib/python3.5/site-packages/numpy/lib/npyio.py", line 447, in load
    return pickle.load(fid, **pickle_kwargs)
_pickle.UnpicklingError: bad pickle data
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/mohsen/.IntelliJIdea2018.3/config/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/mohsen/.IntelliJIdea2018.3/config/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/mohsen/workspace/Speaker_Verification/main.py", line 14, in <module>
    train(config.model_path)
  File "/home/mohsen/workspace/Speaker_Verification/model.py", line 69, in train
    _, loss_cur, summary = sess.run([train_op, loss, merged], feed_dict={batch: random_batch(utter_start=1), lr: config.lr * lr_factor})
  File "/home/mohsen/workspace/Speaker_Verification/utils.py", line 78, in random_batch
    utters = np.load(os.path.join(path, file))        # load utterance spectrogram of selected speaker
  File "/home/mohsen/.local/lib/python3.5/site-packages/numpy/lib/npyio.py", line 450, in load
    "Failed to interpret file %s as a pickle" % repr(file))
OSError: Failed to interpret file '/home/mohsen/workspace/Speaker_Verification/train_tisv/p254_237.wav' as a pickle

I'm facing a PermissionError when training the model

Here is the error description:

Traceback (most recent call last):
File "main.py", line 14, in
train(config.model_path)
File "D:\graduating\Speaker_Verification-master\model.py", line 66, in train
feed_dict={batch: random_batch(), lr: config.lr*lr_factor})
File "D:\graduating\Speaker_Verification-master\utils.py", line 76, in random_batch
utters = np.load(os.path.join(path, file)) # load utterance spectrogram of selected speaker
File "D:\anaconda\envs\tensorflow\lib\site-packages\numpy\lib\npyio.py", line 428, in load
fid = open(os_fspath(file), "rb")
PermissionError: [Errno 13] Permission denied: './train_tisv\p277'

I googled for this but found no solution.
Could anyone provide a solution, or has someone faced and solved the same error?

new speaker

Hello, can this project be used to identify a new speaker?
I mean a wav file that belongs to none of the enrollment utterances.

Result Visualization

After testing the trained model using the command:

python main.py --train False --model_path model_output

I get results as

inference time for 40 utterences : 1.64s
[[[ 0.73 -0.43 -0.3 0. ]
[ 0.65 -0.42 -0.47 -0.39]
[ 0.62 -0.42 -0.48 -0.4 ]
[ 0.96 -0.28 -0.14 -0.05]
[ 0.68 -0.54 -0.4 -0.08]]

[[ 0.11 0.83 0.37 -0.24]
[-0.08 0.86 0.49 -0.24]
[-0.08 0.87 0.48 -0.25]
[-0.15 0.96 0.69 0.08]
[-0.17 0.97 0.68 0.01]]

[[-0.04 0.65 0.89 -0.05]
[-0.02 0.7 0.94 -0.01]
[-0.03 0.69 0.94 -0.04]
[-0.14 0.79 0.96 0.08]
[-0.07 0.75 0.89 0.06]]

[[-0.16 -0.13 0.07 0.88]
[-0.05 -0.18 0.1 0.95]
[-0.05 -0.24 0.08 0.94]
[-0.05 -0.09 0.31 0.93]
[-0.06 -0.18 0.06 0.91]]]

EER : 0.10 (thres:0.65, FAR:0.10, FRR:0.10)

And I want to visualize these results, for example for a specific number of people at a specific timestamp, but I have no idea how to do it.

So, can anyone help me with this?
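One simple way to visualize such a result (a sketch using matplotlib; S is assumed to hold the (N, M, N) array printed above) is to draw each verification speaker's block as a heatmap:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_similarity(S):
        """S: (N, M, N) similarity matrix; one heatmap per verification speaker."""
        N = S.shape[0]
        fig, axes = plt.subplots(1, N, figsize=(3 * N, 3))
        for i, ax in enumerate(np.atleast_1d(axes)):
            im = ax.imshow(S[i], vmin=-1, vmax=1, cmap='viridis')
            ax.set_title('verif. speaker %d' % i)
            ax.set_xlabel('enrolled speaker')
            ax.set_ylabel('utterance')
        fig.colorbar(im, ax=axes, shrink=0.8)
        plt.show()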

ERROR

screenshot from 2018-12-26 18-00-00
screenshot from 2018-12-26 18-00-32
The model is not created, please help!

About log-mel feature extraction

When I feed 1000 ms of audio into the feature extraction process, with a frame length of 25 ms and a frame shift of 10 ms, there should be 98 frames in total, but your code shows 101 frames. Why?

Looking forward to your reply.
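For what it's worth, the mismatch is consistent with librosa's default centered STFT, which pads the signal by half a window on each side so that the frame count depends only on the hop length; a quick check assuming 16 kHz audio (hence a 400-sample window and a 160-sample hop):

    # 1000 ms of audio at 16 kHz, 25 ms window, 10 ms hop
    n_samples, win, hop = 16000, 400, 160

    # without padding: only frames that fit entirely inside the signal
    print(1 + (n_samples - win) // hop)   # 98  (the expected count)

    # with librosa.stft's default center=True padding (win // 2 on each side)
    print(1 + n_samples // hop)           # 101 (what the code reports)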
