FS-EEND

The official PyTorch implementation of "Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors".

This work was accepted at ICASSP 2024.


Paper 🤩 | Issues 😅 | Lab 🙉 | Contact 😘

Introduction

This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method that operates in a frame-in, frame-out fashion. To detect a flexible number of speakers frame by frame and to extract/update their corresponding attractors, we propose a causal speaker embedding encoder together with an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism leverages a few future frames to effectively detect new speakers in real time and adaptively update the speaker attractors.

The proposed FS-EEND architecture
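
As an illustration of the look-ahead idea (a sketch, not the repository's code; the lookahead size is a hypothetical parameter here), the mask below lets every frame attend to all past frames plus a fixed number of future frames, which delays the output by that many frames while keeping the model streaming:

import torch

def causal_mask_with_lookahead(num_frames: int, lookahead: int) -> torch.Tensor:
    # Boolean matrix where entry (t, s) is True if frame t may attend to frame s,
    # i.e. s <= t + lookahead (all past frames plus `lookahead` future frames).
    t = torch.arange(num_frames)
    return t.unsqueeze(1) + lookahead >= t.unsqueeze(0)

# e.g. with lookahead=2, frame 0 may attend to frames 0..2.
# nn.MultiheadAttention's attn_mask uses True to mean "blocked", so pass ~mask.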

Get started

  1. Clone the FS-EEND code:

git clone https://github.com/Audio-WestlakeU/FS-EEND.git

  2. Prepare kaldi-style data by referring to here. Modify conf/xxx.yaml according to your own paths (a hedged sketch of the relevant config entries follows this list).

  3. Start training on simulated data:

python train_dia.py --configs conf/spk_onl_tfm_enc_dec_nonautoreg.yaml --gpus YOUR_DEVICE_ID

  4. Modify your pretrained model path in conf/spk_onl_tfm_enc_dec_nonautoreg_callhome.yaml.

  5. Finetune on CALLHOME data:

python train_dia_fintn_ch.py --configs conf/spk_onl_tfm_enc_dec_nonautoreg_callhome.yaml --gpus YOUR_DEVICE_ID

  6. Run inference (modify your own path for saving predictions in test_step in train/oln_tfm_enc_decxxx.py):

python train_diaxxx.py --configs conf/xxx_infer.yaml --gpus YOUR_DEVICE_ID --test_from_folder YOUR_CKPT_SAVE_DIR

  7. Evaluate:

# generate speech activity probabilities (diarization results)
cd visualize
python gen_h5_output.py

# calculate DERs (a rough frame-level DER illustration follows this list)
python metrics.py --configs conf/xxx_infer.yaml
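
For step 2, the exact layout of conf/xxx.yaml depends on the shipped config; the excerpt below is only a hedged sketch. The train_data_dir and val_data_dir keys are reported by users of these configs, while the example paths and comments are placeholders:

# hypothetical excerpt of conf/spk_onl_tfm_enc_dec_nonautoreg.yaml
train_data_dir: /path/to/kaldi_style/train   # kaldi-style dir (typically wav.scp, segments, utt2spk, rttm, ...)
val_data_dir: /path/to/kaldi_style/dev       # validation set in the same format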

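For step 7, metrics.py computes the DERs reported below. As a rough, self-contained illustration of frame-level DER (not the repository's implementation, and ignoring the speaker-permutation alignment a real scorer performs), the metric sums missed speech, false alarms, and speaker confusion, divided by the total reference speech:

import numpy as np

def frame_level_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    # ref, hyp: (frames, speakers) binary (0/1) activity matrices, assumed to
    # already share the same speaker order (permutation handling omitted).
    n_ref = ref.sum(axis=1)               # active reference speakers per frame
    n_hyp = hyp.sum(axis=1)               # active hypothesis speakers per frame
    n_correct = (ref & hyp).sum(axis=1)   # correctly attributed speakers per frame
    miss = np.maximum(n_ref - n_hyp, 0).sum()
    false_alarm = np.maximum(n_hyp - n_ref, 0).sum()
    confusion = (np.minimum(n_ref, n_hyp) - n_correct).sum()
    return (miss + false_alarm + confusion) / max(n_ref.sum(), 1)
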
Performance

Please note that we use Switchboard Cellular (Part 1 and 2) and the 2005-2008 NIST Speaker Recognition Evaluation (SRE) corpora to generate the simulated data (4054 speakers in total).

| Dataset  | DER (%) | ckpt                   |
|----------|---------|------------------------|
| Simu1spk | 0.6     | simu_avg_41_50epo.ckpt |
| Simu2spk | 4.3     | same as above          |
| Simu3spk | 9.8     | same as above          |
| Simu4spk | 14.7    | same as above          |
| CH2spk   | 10.0    | ch_avg_91_100epo.ckpt  |
| CH3spk   | 15.3    | same as above          |
| CH4spk   | 21.8    | same as above          |

The ckpts are the average of model parameters for the last 10 epochs.
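
A minimal sketch of such parameter averaging (an illustration, not the repository's script; the epoch=... file naming and the Lightning "state_dict" wrapper are assumptions) could look like:

import torch

def average_checkpoints(paths):
    # Average model parameters across several .ckpt files. Each file is assumed
    # to be either a raw state_dict or a Lightning checkpoint wrapping the
    # parameters under the "state_dict" key. Integer-typed buffers would need
    # special handling in practice; this sketch casts everything to float.
    avg = None
    for p in paths:
        ckpt = torch.load(p, map_location="cpu")
        state = ckpt["state_dict"] if "state_dict" in ckpt else ckpt
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# hypothetical usage: average the last 10 epochs and save the result
# torch.save(average_checkpoints(last_10_ckpt_paths), "avg.ckpt")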

If you want to check the performance of a ckpt on CALLHOME:

python train_dia_fintn_ch.py --configs conf/spk_onl_tfm_enc_dec_nonautoreg_callhome_infer.yaml --gpus YOUR_DEVICE_ID --test_from_folder YOUR_CKPT_SAVE_DIR

Note that the checkpoint-loading code in train_dia_fintn_ch.py needs to be changed from

ckpts = [x for x in all_files if (".ckpt" in x) and ("epoch" in x) and int(x.split("=")[1].split("-")[0])>=configs["log"]["start_epoch"] and int(x.split("=")[1].split("-")[0])<=configs["log"]["end_epoch"]]

state_dict = torch.load(test_folder + "/" + c, map_location="cpu")["state_dict"]

to

ckpts = [x for x in all_files if (".ckpt" in x)]

state_dict = torch.load(test_folder + "/" + c, map_location="cpu")
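
The averaged checkpoints released above appear to store the parameter dictionary directly, without the epoch=... file-name pattern or the Lightning "state_dict" wrapper, which is why the filename filtering and the ["state_dict"] indexing are dropped. A more tolerant variant (a sketch, not a patch shipped with the repo) that handles both cases would be:

# accept any .ckpt file and unwrap "state_dict" only if present
ckpts = [x for x in all_files if ".ckpt" in x]
ckpt = torch.load(test_folder + "/" + c, map_location="cpu")
state_dict = ckpt["state_dict"] if "state_dict" in ckpt else ckpt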

Reference code

Citation

If you want to cite this paper:

@misc{liang2023framewise,
      title={Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors}, 
      author={Di Liang and Nian Shao and Xiaofei Li},
      year={2023},
      eprint={2309.13916},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}


fs-eend's Issues

inference

Hello, for inference, when the instructions say to run python train_diaxxx.py, does that refer to python train_dia.py?

training time

Hello, I saw that your paper describes three training stages:
Step 1: first train on a 2-speaker dataset for 100 epochs,
Step 2: then on 1-4 speakers for 50 epochs,
Step 3: then domain finetuning for 100 epochs.
Could you share roughly how much training time each of the three stages took?
And what kind of hardware was the training run on?
Most EEND-related work I have seen requires substantial training resources, so I am surveying this and would appreciate the information. Thank you.

Evaluation

Hello, for evaluation, do I need to run gen_h5_output.py first and then compute DER? What is the input to gen_h5_output.py? I don't see the files the script needs among my training outputs.

About the dataset format

Hello! I am a student who moved into speaker diarization from another field. I am still confused by the kaldi data format used for diarization: I don't know the exact folder layout or the details of the file lists, and I don't know how to preprocess datasets other than CALLHOME (I have no CALLHOME data to refer to). I currently want to convert the CN-Celeb dataset into the CALLHOME-style format used for diarization. Could you share a screenshot or a directory tree that shows the dataset format?

idea

Hello, have you tried experiments that concatenate the frame-level speaker embeddings with speaker embeddings extracted by a pretrained speaker verification model? I am trying this approach, but the results have not met my expectations so far.

speaker_id

How is use_speaker_id in the code supposed to be used? Is it meant to mark the identity of each speaker?

train

Hello, where in the code is multi-GPU training implemented? Also, how should long recordings be handled during training to avoid running out of GPU memory? I tried splitting a long recording into multiple chunks and feeding them as a batch, but I still run out of memory, even with 10 A10 GPUs.

pre-trained model

Hello, author. I have submitted a request to obtain permission to download the pre-trained model for my research purposes. I kindly request your approval for this permission. Thank you.

use the pre-trained model to run inference on a dataset

Hello, I want to evaluate performance on AMI using the simu_avg_41_50epo.ckpt pretrained model, without finetuning.
In the process I have hit many problems and am not sure whether I am on the right track, so I would appreciate your guidance.
Here is what I have tried so far:
First, I prepared a kaldi-format AMI test set, made a copy of spk_onl_tfm_enc_dec_nonautoreg_infer.yaml named ami_infer.yaml, and changed its train_data_dir and val_data_dir to the location of my test set.
Then I ran python train_dia.py --configs conf/ami_infer.yaml --gpus 0 --test_from_folder FS-EEND_simu_41_50epo_avg_model and got an error that no ckpt file could be found, so I changed ckpts = [x for x in all_files if (".ckpt" in x) and ("epoch" in x) and int(x.split("=")[1].split("-")[0])>=configs["log"]["start_epoch"] and int(x.split("=")[1].split("-")[0])<=configs["log"]["end_epoch"]] to ckpts = [x for x in all_files if (".ckpt" in x)].
But then I got the following error:

Traceback (most recent call last):
  File "/mnt/HDD/HDD2/DTDwind/FS-EEND/train_dia.py", line 217, in <module>
    train(configs, gpus=setup.gpus, checkpoint_resume=setup.checkpoint_resume, test_folder=setup.test_from_folder)
  File "/mnt/HDD/HDD2/DTDwind/FS-EEND/train_dia.py", line 185, in train
    for name, param in state_dict.items():
AttributeError: 'float' object has no attribute 'items'

I found that the program could not read the values inside the ckpt correctly. After changing state_dict = torch.load(test_folder + "/" + c, map_location="cpu")["state_dict"] to state_dict = torch.load(test_folder + "/" + c, map_location="cpu"), the values were read correctly.

Next I hit TypeError: mel() takes 0 positional arguments but 3 were given, which I fixed by changing the call to mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels).
But then I got the following error:

raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'TransformerEncoderFusionLayer' object has no attribute 'self_attn'. Did you mean: 'self_attn1'?

This seems to indicate that the model stored in the ckpt is different from what the code expects. However, other people's issues suggest they can run the code fine, so I don't understand why I am hitting so many problems. Is there a step I am missing? Please point me to the correct way to run this.

finetune: amount of data

Hello, if I want to continue fine-tuning this model, how many hours of audio would you recommend?
And how many minutes, at minimum, should each speaker speak?

Looking forward to your reply, thank you.

requirement

Hello, could you provide a requirements.txt? Many dependencies have no specified version, which caused a lot of problems when reproducing the results.
