
Speaker-Diarization

This project contains:

  • Text-independent speaker recognition module based on VGG-Speaker-recognition
  • Speaker diarization based on UIS-RNN
  • The code is largely borrowed from UIS-RNN and VGG-Speaker-recognition; this project links the two by generating speaker embeddings, and also provides an intuitive display panel

Prerequisites

  1. PyTorch 1.3.0
  2. Keras
  3. TensorFlow 1.8–1.15
  4. PyAudio (for installation on Windows, refer to pyaudio_portaudio)

Outline

1. Speaker recognition.

cd ghostvlad
python predict.py

The confusion (pairwise distance) matrix for utterances of 4 speakers, 3 utterances each, is shown below. Each 3×3 diagonal block compares a speaker with itself, and those within-speaker distances are clearly smaller than the cross-speaker ones:

0.00  0.32  0.40  | 0.70  0.62  0.76  | 0.81  0.83  0.76  | 0.92  0.83  0.89  |
0.32  0.00  0.48  | 0.68  0.58  0.76  | 0.87  0.84  0.83  | 0.92  0.82  0.86  |
0.40  0.48  0.00  | 0.71  0.65  0.74  | 0.79  0.81  0.72  | 0.90  0.84  0.85  |
********************************************************************************
0.70  0.68  0.71  | 0.00  0.35  0.30  | 0.78  0.81  0.76  | 0.80  0.81  0.80  |
0.62  0.58  0.65  | 0.35  0.00  0.45  | 0.76  0.71  0.73  | 0.82  0.77  0.77  |
0.76  0.76  0.74  | 0.30  0.45  0.00  | 0.83  0.83  0.80  | 0.83  0.84  0.80  |
********************************************************************************
0.81  0.87  0.79  | 0.78  0.76  0.83  | 0.00  0.40  0.46  | 0.76  0.80  0.86  |
0.83  0.84  0.81  | 0.81  0.71  0.83  | 0.40  0.00  0.45  | 0.80  0.78  0.82  |
0.76  0.83  0.72  | 0.76  0.73  0.80  | 0.46  0.45  0.00  | 0.85  0.85  0.84  |
********************************************************************************
0.92  0.92  0.90  | 0.80  0.82  0.83  | 0.76  0.80  0.85  | 0.00  0.41  0.44  |
0.83  0.82  0.84  | 0.81  0.77  0.84  | 0.80  0.78  0.85  | 0.41  0.00  0.41  |
0.89  0.86  0.85  | 0.80  0.77  0.80  | 0.86  0.82  0.84  | 0.44  0.41  0.00  |
********************************************************************************

Thanks to the VGG authors, who kindly provide the code and pre-trained model; their paper is UTTERANCE-LEVEL AGGREGATION FOR SPEAKER RECOGNITION IN THE WILD.
It is a novel idea to bring NetVLAD/GhostVLAD, widely used in image recognition, to speaker recognition; the previous state of the art was i-vector based, relying on GMM and pLDA models.

I have re-implemented the VGG speaker model in TensorFlow; see ghostvlad-speaker and the corresponding pretrained model.

This project only shows how to generate speaker embeddings with the pre-trained model for the later uis-rnn training stage; a rough sketch follows below.
For training the embedding model itself, refer to VGG-Speaker-Recognition.
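As an illustration of what the embedding step looks like, here is a minimal sketch of turning one wav file into a 512-dimensional speaker embedding. The network_eval.predict pattern mirrors the generate_embeddings.py snippet quoted later on this page; the spectrogram parameters and helper names are illustrative assumptions, not the repo's exact preprocessing.

```python
# Hedged sketch: one utterance -> one speaker embedding.
# `network_eval` is assumed to be the pre-trained Keras GhostVLAD model
# already loaded by the ghostvlad code; the STFT settings below are
# illustrative, not the repo's exact values.
import numpy as np
import librosa

def wav_to_spectrogram(path, sr=16000, n_fft=512, hop_length=160):
    """Load audio and return a magnitude spectrogram of shape (freq, time)."""
    y, _ = librosa.load(path, sr=sr)
    return np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

def embed_utterance(network_eval, wav_path):
    """Run the pre-trained network on one utterance and L2-normalize the result."""
    spec = wav_to_spectrogram(wav_path)
    batch = np.expand_dims(np.expand_dims(spec, 0), -1)  # (1, freq, time, 1)
    v = network_eval.predict(batch)                      # (1, 512)
    return v[0] / np.linalg.norm(v[0])
```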

Dataset

  1. http://www.openslr.org/38 contains 855 Mandarin Chinese speakers with 120 utterances each, i.e. 102,600 utterances in total.
  2. VCTK contains 109 English speakers.
  3. VoxCeleb1 contains 1,251 speakers.
  4. VoxCeleb2 contains 6,112 speakers.

To generate speaker embeddings for the next training stage (see the layout sketch below):

python generate_embeddings.py

You may need to change the dataset path to your own.
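For reference, a hedged sketch of the kind of dataset layout generate_embeddings.py expects: one sub-directory per speaker, each holding that speaker's wav files (SRC_PATH is the hard-coded variable discussed in the issues below). The walking code is only illustrative:

```python
# Assumed layout (one folder per speaker id):
#   SRC_PATH/
#     speaker_0001/utt_001.wav
#     speaker_0001/utt_002.wav
#     speaker_0002/utt_001.wav
#     ...
import os

SRC_PATH = '/path/to/your/dataset'   # change this to your own dataset root

def list_utterances(src_path):
    """Yield (speaker_id, wav_path) pairs for every utterance under src_path."""
    for speaker_id in sorted(os.listdir(src_path)):
        speaker_dir = os.path.join(src_path, speaker_id)
        if not os.path.isdir(speaker_dir):
            continue
        for fname in sorted(os.listdir(speaker_dir)):
            if fname.endswith('.wav'):
                yield speaker_id, os.path.join(speaker_dir, fname)
```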

2. Speaker diarization.

(diarization result figure)

Training

python train.py

The speaker embeddings generated by VGG are non-negative vectors containing many zero elements. uis-rnn seems to handle such data abnormally, as shown below:

Iter: 0  	Training Loss: nan    
Negative Log Likelihood: 7.3020	Sigma2 Prior: nan	Regularization: 0.0007
Iter: 10  	Training Loss: nan    
Negative Log Likelihood: nan	Sigma2 Prior: nan	Regularization: nan
Iter: 20  	Training Loss: nan    
Negative Log Likelihood: nan	Sigma2 Prior: nan	Regularization: nan

When I added a tiny bias (e.g. 0.00001) to each element of the vectors, the error disappeared:

Iter: 0  	Training Loss: -581.8732    
Negative Log Likelihood: 7.0125	Sigma2 Prior: -588.8864	Regularization: 0.0007
Iter: 10  	Training Loss: -614.1193    
Negative Log Likelihood: 1.7536	Sigma2 Prior: -615.8737	Regularization: 0.0007
Iter: 20  	Training Loss: -644.9244    
Negative Log Likelihood: 1.7123	Sigma2 Prior: -646.6375	Regularization: 0.0007
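A minimal sketch of that workaround, applied to the embeddings before they are handed to uis-rnn (the constant 1e-5 is the value mentioned above; variable names are illustrative):

```python
import numpy as np

EPS = 1e-5  # the tiny bias mentioned above

def debias(embeddings):
    """Add a small constant so no element of the VGG embedding is exactly zero.

    `embeddings` has shape (num_segments, 512); uis-rnn training produced NaNs
    when many elements were exactly zero.
    """
    return np.asarray(embeddings, dtype=np.float64) + EPS

# e.g. train_sequence_list = [debias(seq) for seq in train_sequence_list]
```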

Clustering

python speakerDiarization.py

The result (3 speakers) is shown below:

========= 0 =========
0:00.288 ==> 0:04.406
0:07.699 ==> 0:16.461
0:33.921 ==> 0:35.8
========= 1 =========
0:04.406 ==> 0:07.699
0:16.461 ==> 0:19.594
0:30.371 ==> 0:33.921
0:41.19 ==> 0:44.185
========= 2 =========
0:19.594 ==> 0:30.371
0:35.8 ==> 0:41.19

The final result is influenced by the size of each window and the overlap rate. When the overlap is too large, uis-rnn may produce fewer speakers, because consecutive speaker embeddings change smoothly; with less overlap it tends to produce more speakers. The window also cannot be too short: it must contain enough information to produce sufficiently discriminative speaker embeddings.
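To make the trade-off concrete, here is a small sketch of how a sliding window turns embedding_per_second and overlap_rate (the parameter names mentioned in the issues below) into window and hop sizes over spectrogram frames; the frame math is a simplified assumption, not the repo's exact code:

```python
def sliding_windows(num_frames, frames_per_second=100,
                    embedding_per_second=1.0, overlap_rate=0.5):
    """Yield (start_frame, end_frame) windows over a spectrogram.

    Simplified assumption: with a 10 ms hop, 1 second ~= 100 frames.
    A larger overlap_rate makes consecutive embeddings more similar
    (often fewer predicted speakers); a very short window gives embeddings
    that are not discriminative enough.
    """
    window = int(frames_per_second / embedding_per_second)   # frames per embedding
    hop = max(1, int(window * (1.0 - overlap_rate)))         # frames between windows
    start = 0
    while start + window <= num_frames:
        yield start, start + window
        start += hop

# Example: 60 s of audio, 1 embedding/s, 50% overlap -> a window every 0.5 s
print(len(list(sliding_windows(6000))))  # 119 windows
```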

Contributors

neozhangthe1, taylorlu


Issues

How to calculate DER, EER?

Hello!

Thank you @taylorlu for all your work here.
I am researching speaker diarization. I have followed all of your tutorials, but I don't know how to evaluate EER and DER.
Can you help me create the ground truth and the code for evaluation?
Thank you
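This repo does not ship a scorer, but one common external option for DER is pyannote.metrics: build a reference annotation from the ground-truth segments and a hypothesis from the diarization output, then score them. The segment times below are purely illustrative.

```python
# pip install pyannote.core pyannote.metrics
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground truth: who speaks when (times in seconds, labels are arbitrary ids)
reference = Annotation()
reference[Segment(0.0, 4.4)] = 'spk0'
reference[Segment(4.4, 7.7)] = 'spk1'

# System output, e.g. parsed from speakerDiarization.py's printed segments
hypothesis = Annotation()
hypothesis[Segment(0.3, 4.4)] = 'A'
hypothesis[Segment(4.4, 7.7)] = 'B'

metric = DiarizationErrorRate()
print('DER = {:.2%}'.format(metric(reference, hypothesis)))
```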

MemoryError

Hi taylorlu, could you help me out with this problem?
I followed your steps to train on a custom dataset. The code itself works well on the VCTK dataset (used for testing), but when I applied the same steps to my custom dataset it shows this error:
Traceback (most recent call last):
File "train.py", line 91, in
main()
File "train.py", line 87, in main
diarization_experiment(model_args, training_args, inference_args)
File "train.py", line 48, in diarization_experiment
model.fit(train_sequence_list, train_cluster_id_list, training_args)
File "/home/huynh_hoang/Speaker-Diarization/uisrnn/uisrnn.py", line 373, in fit
concatenated_train_sequence, concatenated_train_cluster_id, args)
File "/home/huynh_hoang/Speaker-Diarization/uisrnn/uisrnn.py", line 240, in fit_concatenated
num_permutations=args.num_permutations)
File "/home/huynh_hoang/Speaker-Diarization/uisrnn/utils.py", line 214, in resize_sequence
sub_sequences.append(sequence[sampled_idx_sets[j], :])
MemoryError: Unable to allocate array with shape (27848, 512) and data type float64

The loss of uis-rnn model

When I trained the uis-rnn model, I found that the loss curve was abnormal: it oscillated between -600 and -800 and never declined. I don't know how to solve this; can you give me some advice? I used https://github.com/HarryVolek/PyTorch_Speaker_Verification to extract the embeddings.
Part of my loss records:
Negative Log Likelihood: 62.7516 Sigma2 Prior: -702.1078 Regularization: 0.0011
Iter: 1250 Training Loss: -682.2758
Negative Log Likelihood: 67.6140 Sigma2 Prior: -749.8909 Regularization: 0.0011
Iter: 1260 Training Loss: -646.3773
Negative Log Likelihood: 75.9724 Sigma2 Prior: -722.3507 Regularization: 0.0011
Iter: 1270 Training Loss: -605.5015
Negative Log Likelihood: 80.3110 Sigma2 Prior: -685.8135 Regularization: 0.0011
Iter: 1280 Training Loss: -650.1578
Negative Log Likelihood: 93.3594 Sigma2 Prior: -743.5183 Regularization: 0.0011

Inaccurate start and end times of the obtained slices

Hi @taylorlu

Thanks for your awesome integration of the UIS-RNN code. We have trained a custom UIS-RNN model on our data and are getting somewhat decent accuracy, but the slice timings at speaker changes are off by some margin in the majority of cases (either the start time begins a bit early, before the particular speaker has started speaking, or the end time comes too early, while the particular speaker is still speaking).
Can you please shed some light on what might cause this?
Thanks in advance.

max( ) arg is an empty sequence

Hello, I get an error when I test with my trained model:
max_clusters = max([len(beam_state.mean_set) for beam_state in beam_set])
ValueError: max( ) arg is an empty sequence

Is there a problem in the model training stage?

Unrecognized args in arguments.py

Hi

I was trying to execute the main program speakerDiarization.py with a GPU by adding the argument --gpu 0 to the system call. This causes the ArgumentParser object in arguments.py to also receive --gpu and raise an unrecognized-argument error. I have fixed this issue by changing the following line:

super_parser.parse_args()

from super_parser.parse_args() to super_parser.parse_known_args(). There may be a better way to deal with this (e.g. changing the namespace of the call), but it works for me now. You might like to consider this option in your code.

Thank you for your contribution with this repository.
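A minimal, self-contained sketch of the difference (the --observation_dim flag here is illustrative; uisrnn's real arguments.py builds its parsers differently):

```python
import argparse

super_parser = argparse.ArgumentParser()
super_parser.add_argument('--observation_dim', type=int, default=512)

# parse_args() raises "unrecognized arguments: --gpu 0" on unknown flags;
# parse_known_args() returns (namespace, leftovers) and ignores the extras.
args, unknown = super_parser.parse_known_args(['--observation_dim', '512', '--gpu', '0'])
print(args.observation_dim, unknown)   # 512 ['--gpu', '0']
```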

AlreadyExistsError: Another metric with the same name already exists.

When I run speakerDiarization.py, I encounter this error; how can I solve it? Thanks!

File "C:\Users\LSM\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\eager\monitoring.py", line 124, in init
self._metric = self._metric_methods[self._label_length].create(*args)

AlreadyExistsError: Another metric with the same name already exists.

Slow performance?

How long should speakerDiarization.py take to run on a typical system with a GPU?

I ran the speakerDiarization.py example and it segmented the file correctly. However, it is very slow: it takes about twice the length of the wav file to run, which makes it impractical for large files. For example, the sample file is about 2 minutes long and took 4 minutes to process. I also tested a custom 3-minute wav file, and it took 6 minutes.

My system is a 2.80 GHz quad-core with 32 GB of memory.

Is this the typical processing time? Is there any way to speed up processing?
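One thing worth checking first (a generic sketch, not specific to this repo) is whether TensorFlow and PyTorch actually see a GPU; running the VGG embedding network purely on CPU could plausibly explain a 2x-real-time runtime:

```python
import torch
from tensorflow.python.client import device_lib

# TensorFlow 1.x: GPUs visible to the Keras/TF embedding model
print([d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])

# PyTorch: used by the uis-rnn stage
print(torch.cuda.is_available(), torch.cuda.device_count())
```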

How can I resolve 'killed' problem?

I generated speaker embeddings using the 4-person audio data you provided.
Then I ran "python train.py", but it just says Killed in the terminal.
How can I resolve this problem?

creating a new model

Can you tell me how to create a new model for my new dataset, which is in English, and how to predict using that newly created model? Could you please list the steps to create the new model?

Very poor performance on my own wav file, is there anything wrong?

I just did a simple test with my phone-call wav file, which is about 2.5 minutes long with only 2 speakers in total. However, with the pretrained model in this project it returns 3 speakers, and many slices contain voices from 2 speakers. I know that uis-rnn doesn't support setting the number of speakers, but the performance still seems too poor to be correct; has anybody else seen this?

Thanks if any suggestions.

Training with CPU mode

/home/sm/venv/lib/python3.6/site-packages/torch/cuda/init.py:118: UserWarning:
Found GPU0 Quadro K4200 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability that we support is 3.5.

warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
Traceback (most recent call last):
File "train.py", line 90, in
main()
File "train.py", line 86, in main
diarization_experiment(model_args, training_args, inference_args)
File "train.py", line 44, in diarization_experiment
model = uisrnn.UISRNN(model_args)
File "/home/sm/Speaker-Diarization/uisrnn/uisrnn.py", line 99, in init
sigma2 * torch.ones(self.observation_dim).to(self.device))
RuntimeError: CUDA error: no kernel image is available for execution on the device

Can anyone suggest an alternative way to perform training?
Thanks in advance

Real time diarization

Hi, this work is really interesting. I would like to ask two questions.

Is it possible to perform speaker diarization like this in real time? For example, while several people are speaking into a microphone, could each speaker be distinguished in real time on a continuous audio stream, so that the final graph is observed while it updates?

In order to improve the accuracy of the network, for example for Italian speaker recognition, should I train the network on a large dataset of Italian conversations after generating the embeddings?

Thank you in advance,
Giorgio

Sliding for long audios

I have tested the code with long audio files (20+ minutes) and realized that the algorithm does not cover the whole file. For instance, if the audio is 35 minutes long, the code returns 34:23 of audio. Also, after a certain point the speaker-change points slide by about 2-3 seconds, and later start to slide by 10+ seconds. Do you have any suggestions?

train.py: allow_pickle

Hi!

Found a minor issue in the train.py.

I got the error "Object arrays cannot be loaded when allow_pickle=False" at line 38. This can be solved by changing that line to:

train_data = np.load('./ghostvlad/training_data.npz', allow_pickle=True)

Please refer to the demo code of uisrnn for more details.

Speaker diarization gives more than two speakers while I have only two speakers in the audio file.

I have Hindi-English mixed audio with almost 130 speakers, each having 200 utterances of 4-10 seconds. I made d-vectors using the VGG speaker recognition model (the pre-trained one provided with VGG-Speaker-Recognition). I trained the model using train.py, but when I test, I find more than two speakers even though each test file contains only two speakers; in some cases it gives 3 or 4 speakers.
What should I do?
Why does it give me more speakers?
Where did I go wrong? Do I need to train the VGG speaker recognition model on my own data?

Please help.
Thanks.

Misalignment?

Hi!

I made a synthetic test .wav file by concatenating utterances of persons a and b from your 4persons sample dataset. However, there seems to be some misalignment in the visualization. For instance, the model only predicts speakers from 0 s to 16.216 s, but the audio is around 17.513 s long (please see the attached file sample_2persons.zip).

Is it because of some alignment issues in the model / visualization? Or because the CNN embeddings were used? Is there any way to fix this issue?

Thanks!

Module Error

When I execute speakerDiarization.py, a module-not-found error pops up while importing model:

import model as spkModel ModuleNotFoundError: No module named 'model'
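A hedged workaround, assuming the model module being imported is ghostvlad/model.py from this repo and the script is launched from the repository root:

```python
import os
import sys

# Make ghostvlad/ importable before `import model as spkModel`
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'ghostvlad'))

import model as spkModel  # now resolves to ghostvlad/model.py
```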

Using speaker diarization on mobile devices

Hello!

Thank you @taylorlu for all your work here, first off.

I'm working to get a handle on speaker diarization, and wanted to know if you had an idea of what might be involved in getting this system to work on mobile.

Assuming I could get the actual Pytorch model successfully loaded on a mobile device, either by using Pytorch's SDK for that directly or converting to CoreML on iOS or some such, what kind of audio preprocessing is needed when feeding each buffer of audio to the model?

If the question is too broad or far reaching, please let me know of any resources I might look at to gain some perspective, or any good spots in the code to examine what the preprocessing looks like for inference.

Thanks!

Cuda Out Of Memory when invoking train.py

Hi, I was trying to generate embeddings from a very small subset of the VoxCeleb dataset (around 200 MB). The process created a training_data.npz file (around 2 GB), which was then loaded during training (using uis-rnn). However, I encountered this error:

RuntimeError: CUDA out of memory. Tried to allocate 27.45 GiB (GPU 1; 39.59 GiB total capacity; 18.75 GiB already allocated; 19.56 GiB free; 18.77 GiB reserved in total by PyTorch)

The error does not occur when I try a smaller file. Any idea how to resolve this issue? Thank you in advance.
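Not a confirmed fix, but two training knobs worth trying: num_permutations appears in the uis-rnn call chain (see the MemoryError traceback earlier on this page), and batch_size is one of uisrnn's training arguments. Double-check the names and defaults in uisrnn/arguments.py before relying on this sketch:

```python
import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 512

# Smaller values mean less data concatenated per training step,
# which directly reduces peak GPU memory (at the cost of noisier updates).
training_args.batch_size = 5
training_args.num_permutations = 5

model = uisrnn.UISRNN(model_args)
# model.fit(train_sequence_list, train_cluster_id_list, training_args)
```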

feats = np.array(feats)[:,0,:] # [splits, embedding dim] IndexError: too many indices for array

for epoch in range(7000):  # Random choice utterances from whole wavfiles
    # A merged utterance contains [10,20] utterances
    splits_count = np.random.randint(1, 5, 1)
    print('this is splits_count', splits_count, splits_count[0])
    path_spks = random.sample(path_spk_tuples, splits_count[0])
    utterance_specs, utterance_speakers = load_data(path_spks, min_win_time=500, max_win_time=1600)
    feats = []
    for spec in utterance_specs:
        spec = np.expand_dims(np.expand_dims(spec, 0), -1)
        v = network_eval.predict(spec)
        feats += [v]
    print(np.array(feats))

    feats = np.array(feats)[:,0,:]  # [splits, embedding dim]
    train_sequence.append(feats)
    train_cluster_id.append(utterance_speakers)
    print("epoch:{}, utterance length: {}, speakers: {}".format(epoch, len(utterance_speakers), len(path_spks)))

np.savez('training_data', train_sequence=train_sequence, train_cluster_id=train_cluster_id)

I changed np.random.randint(1, 5, 1) to use the range 1 to 5, since I only have the 4-speaker dataset as given. I am getting an error at the line

feats = np.array(feats)[:,0,:] # [splits, embedding dim]

main()
File "generate_embeddings.py", line 196, in main
feats = np.array(feats)[:,0,:] # [splits, embedding dim]
IndexError: too many indices for array
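The IndexError means feats ended up empty (or with inconsistent shapes), so np.array(feats) is not the expected 3-D array and [:,0,:] fails. A hedged guard, to be placed inside the epoch loop quoted above:

```python
# Skip the epoch if load_data returned no usable utterance segments; otherwise
# np.array(feats) has shape (0,) and feats[:, 0, :] raises IndexError.
if len(feats) == 0:
    print('epoch {}: no utterances loaded, skipping'.format(epoch))
    continue

feats = np.array(feats)[:, 0, :]   # (num_segments, embedding_dim)
```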

How to save plot

Hi, first of all thanks for your great work.
I have an issue with saving the plot so I can compare results side by side; any idea how to do that?

syntax error in uisrnn.py

After generating embeddings, when I try to run train.py I face the error below:

File "train.py", line 19, in
import uisrnn
File "/home/sm/python-virtual-environments/Speaker-Diarization/uisrnn/init.py", line 24, in
from . import uisrnn
File "/home/sm/python-virtual-environments/Speaker-Diarization/uisrnn/uisrnn.py", line 162
self.logger.print(
^
SyntaxError: invalid syntax

Kindly help me in this regard.
Thanks in advance.

create new model to my new dataset

Hello
I'm trying to build a new model on the VCTK dataset. First I generated speaker embeddings with python generate_embeddings.py. Now I'm training the model with train.py, but I have a problem in the utils file, here:

for i, (train_sequence, train_cluster_id) in enumerate((train_sequences, train_cluster_ids)):
ValueError: too many values to unpack (expected 2)

Can I solve it?
Also, when I change embedding_per_second and overlap_rate, the number of speakers changes; why?
Thank you very much
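That ValueError is exactly what enumerate((train_sequences, train_cluster_ids)) produces once the lists hold more than two items, because it iterates over the 2-tuple itself. If the intent is to walk over aligned (sequence, cluster_id) pairs, zip is the usual fix (a hedged sketch, not necessarily the exact line in uisrnn/utils.py):

```python
# enumerate((a, b)) yields (0, a) and (1, b), so the inner unpacking tries to
# split the whole train_sequences list into two names. Iterate over pairs instead:
for i, (train_sequence, train_cluster_id) in enumerate(
        zip(train_sequences, train_cluster_ids)):
    print(i, train_sequence.shape, len(train_cluster_id))
```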

About the training data

Hi, I am a newbie, and I have two questions:

  1. Is the path of the training data the SRC_PATH variable in generate_embeddings.py, where each directory name indicates the speaker_id?
  2. Currently I only use short-utterance datasets like LibriSpeech (less than 10 s) for training. However, the paper uses two off-domain datasets for training, the 2000 NIST Speaker Recognition Evaluation and the ICSI Meeting Corpus, which are long-speech datasets. I am wondering how to use them in the code.

Thanks a lot!

Question about using dvector created by VGG to train UISRNN

After running the whole project, I reviewed the procedure and both papers (VGG and UIS-RNN), and noticed that UIS-RNN makes a basic assumption that the embeddings are generated by an RNN. However, VGG generates fixed-length embeddings with a CNN.
Could you tell me whether using CNN embeddings breaks the UIS-RNN assumption, and if not, why? Thanks a lot!

What is SRC_PATH & data_path ?

Hi!

In generate_embeddings.py, there is an argument named data_path, but it seems it is never used. There is another hard-coded variable called SRC_PATH. Should we change that to the path of our dataset (i.e. the directory containing the .wav files, such as ST-CMDS-20170001_1-OS from openslr)?

pre-shifted predicted sequence

Hello, I'm testing your framework on my test set.

With the pretrained model from your repository I did the following:
utterances -> webrtcvad -> get (start, end) pairs -> concatenate utterances randomly

I think there is a time-alignment issue.
When I use a d-vector-based embedder instead, there is no such issue.

(attached image omitted)

Handling silent speech segments

Hi,
Thanks for the awesome repository. I was wondering how your code handles silent segments in between speech during a conversation. Does it exclude them?

What does the uisrnn pytorch model output exactly and which variable holds that output?

Excuse my ignorance, but I am trying to wrap my head around the inner workings of the uisrnn model and I am stuck. More specifically, I would like to know what the model outputs when it receives the VGG speaker embedding features. I struggle with this because of how the model is structured: it all seems to be one continuous process, and I cannot tell where the PyTorch model's job starts and where it finishes. I looked into the uisrnn script and tried to trace the order in which the functions are executed, albeit with no success. To my understanding, the model outputs a sequence of "states" which are then processed and scored by a beam-search algorithm. The scores are then fed back into the model and the process continues until some stopping point is reached (?).

Figuring out what the model does with the VGG speaker embeddings it receives is challenging to say the least. Problem is, I do not know where to start. Which parts of the inference process depend solely on the pytorch model and which parts of the code handle the rest (beam states, scoring etc)? Which part of the uisrnn script is responsible for processing the VGG embeddings and which variable holds the results thereof?

So far I have figured out the following:

This code loads the uisrnn object in memory and loads the weights.

model_args, _, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 512
uisrnnModel = uisrnn.UISRNN(model_args)
uisrnnModel.load(SAVED_MODEL_NAME)

This snippet runs inference on features (embeddings)

predicted_label = uisrnnModel.predict([feats], inference_args)

Now comes the hard part:

In the uisrnn script we get the following function:

def predict_single(self, test_sequence, args):
    ...

From that point on I have no idea what is going on. What does the model output after it has received the features and at which point in the code do we get the result of that computation? Is it the mean and hidden variables in:


class CoreRNN(nn.Module):
  """The core Recurent Neural Network used by UIS-RNN."""

  def __init__(self, input_dim, hidden_size, depth, observation_dim, dropout=0):
    super(CoreRNN, self).__init__()
    ...
    return mean, hidden

or is it something else? Most importantly, is the model feeding itself the features only, or maybe the beam states too? I am so confused.

Understanding how exactly this code works could help with a variety of tasks, such as improving the code or turning the pytorch model into another format, in a more modular fashion. Any help is greatly appreciated.
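For what it's worth, at the API level the output is simpler than the internals: predict() returns, for each input sequence of embeddings, a sequence of integer speaker ids, one per embedding; the CoreRNN means/hidden states and the beam search are consumed internally and never surface to the caller. A short hedged sketch following the call pattern already quoted above:

```python
# feats: numpy array of shape (num_segments, 512), one VGG embedding per window
predicted_label = uisrnnModel.predict([feats], inference_args)

# For the single sequence passed in, the result holds one list of integer
# speaker ids, one id per embedding, e.g. [0, 0, 0, 1, 1, 0, 2, ...];
# equal ids mean "attributed to the same speaker".
for segment_index, speaker_id in enumerate(predicted_label[0]):
    print('segment {} -> speaker {}'.format(segment_index, speaker_id))
```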

File size exceed Zip

Hello everyone
(error screenshot omitted)

I set the epoch range to 4000 in generate_embeddings.py, but I get the error shown in the screenshot above:

for epoch in range(4000): # Random choice utterances from whole wavfiles
    # A merged utterance contains [10,20] utterances

Please guide me on how to set the range.
Thanks in advance

Unable to view figure

@gogyzzz @taylorlu @giorgionanfa
I am unable to view the results because of the following error:

speakerDiarization.py:204: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
p.plot.show()
Kindly help me in this regard.
Thanks in advance
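The agg backend cannot open a window but it can still write files, so one hedged workaround (assuming p.plot is the matplotlib.pyplot module, as the p.plot.show() call suggests) is to save the figure instead of showing it:

```python
# In speakerDiarization.py, after the diarization plot has been drawn:
p.plot.savefig('diarization_result.png', dpi=150, bbox_inches='tight')  # file name is illustrative
# p.plot.show()   # only works with a GUI backend / an attached display
```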

Using two different languages and two different sampling rate in single data set

Can I use two different languages to generate embeddings and then train the model with those embeddings? And what is the effect of using audio files with different sampling rates when generating embeddings?
Please guide me, because I am currently unable to generate embeddings; I get the error shown in the attached screenshot.
Thanks in advance
(error screenshot omitted)
