nttcslab-sp / eend-vector-clustering

This repository contains a set of codes to run (i.e., train, perform inference with, evaluate) a diarization method called EEND-vector-clustering.

License: Other
Hello, I found there is a parameter named num_speakers in train.yaml. Does that mean the number of speakers in the audio should equal num_speakers?
Recently, I have been comparing our algorithm against your paper's CALLHOME results. Could you kindly provide the RTTM hypotheses for CALLHOME from the original paper? Thanks a lot.
Hello there,
Thanks for your efforts in open-sourcing the code, it's vital for us trying to reproduce the result presented in the paper.
But I've come across a RuntimeError when adapting the model with our private data, which shows:
/*/EEND-vector-clustering/eend/pytorch_backend/train.py:186: RuntimeWarning: invalid value encountered in true_divide
fet_arr[spk] = org / norm
...
Traceback (most recent call last):
...
RuntimeError: The loss (nan) is not finite.
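The RuntimeWarning above is the usual NumPy 0/0 case: if a speaker has no accumulated embeddings, both the sum (`org`) and the count (`norm`) are zero, and the division produces NaN. A minimal reproduction (the variable names mirror the traceback; this is not the repo's code):

```python
import numpy as np

# Hypothetical reproduction: a speaker slot with no accumulated
# embeddings has an all-zero sum and a zero count.
org = np.zeros(4, dtype=np.float32)   # accumulated embedding, all zeros
norm = 0.0                            # number of embeddings for this speaker

fet = org / norm                      # RuntimeWarning: invalid value encountered

print(np.isnan(fet).all())            # True: the whole embedding row is NaN
```

Once such a NaN row enters the model's embedding table, any loss computed from it is NaN, which matches the "loss (nan) is not finite" error.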
After some debugging, I found the problem actually happens during the backpropagation step when an entry in the embedding array is left as all zeros:
EEND-vector-clustering/eend/pytorch_backend/train.py
Lines 173 to 186 in b3649ee
Since the embeddings are actually loaded from the dumped speaker embeddings generated by the save_spkv_lab.py script when adapting the model, I suspect there might be an issue in the save_spkv_lab function.
After some careful step-by-step checking with pdb, I found that silent-speaker labels are actually added to the all_labels variable when dumping the speaker embeddings:
EEND-vector-clustering/eend/pytorch_backend/infer.py
Lines 349 to 355 in b3649ee
Even when if torch.sum(t_chunked_t[sigma[i]]) > 0 holds, lab can still be -1, which is treated as a silent speaker according to the code in:
EEND-vector-clustering/eend/pytorch_backend/diarization_dataset.py
Lines 94 to 99 in b3649ee
Since these silent-speaker labels are -1 and Python lists support negative indexing, the issue is silently ignored when dumping the embeddings but causes exceptions once training begins.
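The failure mode is easy to demonstrate in isolation: a label of -1 is a perfectly valid negative index, so writing into that slot silently overwrites the last real speaker's row instead of raising an IndexError (illustrative only, not the repo's code):

```python
import numpy as np

# Hypothetical illustration: one row per speaker index.
fet_arr = np.zeros((3, 4))
emb = np.ones(4)

lab = -1                 # the "silent speaker" marker from the dataset
fet_arr[lab] += emb      # no error: silently accumulates into row 2

print(fet_arr[2])        # the last real speaker's row is now corrupted
```

This is why the bug stays invisible at dump time: nothing fails until the mis-attributed (or still all-zero) rows are consumed during training.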
I could fix this issue by appending a speaker label to all_labels only when lab >= 0 while saving speaker embeddings; the subsequent training process then continued smoothly, resulting in a well-performing model.
But before opening any PR, I would like to know whether you have ever come across this issue, or have any idea why it happens.
Thanks!
Hello, I was wondering whether you have evaluated different network architectures. I modified the network following the Transformer paper (number of layers, number of heads, and hidden-unit size), and I found that the results do not improve (and even get worse on some unseen test wavs) as the network becomes more complex.
Hi,
Thanks for this amazing work and open-sourcing it!
I do have a question regarding the number of speakers in the simulated training data. Do you use a fixed number of 3 or a maximum of 3? I saw in run.sh that you used 'simu_opts_num_speaker=3' for all elements of simu_opts_num_speaker_array, so the speaker number should be fixed at 3, right?
Btw, any chance you could share the trained/adapted models?
Cheers,
Xiang
Hello, when I modified the layer and head numbers of the transformer in train.yaml, I got a RuntimeError:
RuntimeError: shape '[128, -1, 12, 21]' is invalid for input of size 4915200
spk_loss_ratio: 0.03
spkv_dim: 256
max_epochs: 120
input_transform: logmel23_mn
lr: 0.001
optimizer: noam
num_speakers: 3
gradclip: 5
chunk_size: 150
batchsize: 128
num_workers: 8
hidden_size: 256
context_size: 7
subsampling: 10
frame_size: 200
frame_shift: 80
sampling_rate: 8000
noam_scale: 1.0
noam_warmup_steps: 25000
transformer_encoder_n_heads: 12
transformer_encoder_n_layers: 8
transformer_encoder_dropout: 0.1
seed: 777
feature_nj: 100
batchsize_per_gpu: 8
test_run: 0
I haven't gone through the model structure code yet. So these two parameters cannot be modified arbitrarily? Or are they related to another parameter (e.g., context_size)?
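The constraint at play here is most likely the standard multi-head attention one, not anything specific to this repo: hidden_size must be divisible by the number of heads, since each head gets hidden_size // n_heads dimensions. With hidden_size: 256 and transformer_encoder_n_heads: 12 the division leaves a remainder, which is consistent with the invalid reshape in the error. A quick check (illustrative sketch, not the repo's code):

```python
# Each attention head gets hidden_size // n_heads dimensions,
# so hidden_size must be divisible by n_heads.
hidden_size = 256
for n_heads in (4, 8, 12):
    head_dim, rem = divmod(hidden_size, n_heads)
    status = "OK" if rem == 0 else "invalid"
    print(f"n_heads={n_heads}: head_dim={head_dim}, remainder={rem} -> {status}")
# 256 / 12 leaves a remainder of 4, so reshaping the tensor into
# [..., 12, 21] cannot match its actual element count.
```

So heads like 4, 8, or 16 should work with hidden_size 256, while 12 cannot, unless hidden_size is changed accordingly.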
Hi, for some reason my training process was interrupted. I want to resume training from the latest checkpoint and continue on the old data. There is a parameter --spkv_lab: "file path of speaker vector with label and speaker ID conversion table for adaptation". Which file exactly does it mean? I tried featlab_chunk_indices.txt but failed, and I cannot find another suitable file... please help.
Thanks
When I run the CALLHOME recipe using the default config file, the GPU utilization is extremely low (less than 10%). Is this normal?