hitachi-speech / eend

End-to-End Neural Diarization

License: MIT License

Python 58.58% Shell 32.38% Perl 8.34% Makefile 0.69%

speaker-diarization end-to-end eend machine-learning chainer kaldi deep-learning

eend's Introduction

EEND (End-to-End Neural Diarization)

EEND (End-to-End Neural Diarization) is a neural-network-based speaker diarization method.

BLSTM EEND (INTERSPEECH 2019)
- https://www.isca-speech.org/archive/Interspeech_2019/abstracts/2899.html
Self-attentive EEND (ASRU 2019)
- https://ieeexplore.ieee.org/abstract/document/9003959/

An EEND extension for a variable number of speakers is also provided in this repository.

Self-attentive EEND with encoder-decoder based attractors
- https://arxiv.org/abs/2005.09921

Install tools

Requirements

NVIDIA CUDA GPU
CUDA Toolkit (8.0 <= version <= 10.1)

Install kaldi and python environment

cd tools
make

This command builds kaldi at tools/kaldi
- if you want to use a pre-built kaldi
```
cd tools
make KALDI=<existing_kaldi_root>
```
  This option makes a symlink at tools/kaldi
This command extracts miniconda3 at tools/miniconda3 and creates a conda environment named 'eend'
Then, it installs Chainer and cupy into the 'eend' environment
- use CUDA in /usr/local/cuda/
  - if you need to specify your CUDA path
```
cd tools
make CUDA_PATH=/your/path/to/cuda-8.0
```
    This command installs cupy-cudaXX according to your CUDA version. See https://docs-cupy.chainer.org/en/stable/install.html#install-cupy

Test recipe (mini_librispeech)

Configuration

Modify egs/mini_librispeech/v1/cmd.sh according to your job scheduler. If you use your local machine, use "run.pl". If you use Grid Engine, use "queue.pl". If you use SLURM, use "slurm.pl". For more information about cmd.sh, see http://kaldi-asr.org/doc/queue.html.

Data preparation

cd egs/mini_librispeech/v1
./run_prepare_shared.sh

Run training, inference, and scoring

./run.sh

If you use encoder-decoder based attractors [3], modify run.sh to use config/eda/{train,infer}.yaml
See RESULT.md and compare with your result.

CALLHOME two-speaker experiment

Configuration

Modify egs/callhome/v1/cmd.sh according to your job scheduler. If you use your local machine, use "run.pl". If you use Grid Engine, use "queue.pl". If you use SLURM, use "slurm.pl". For more information about cmd.sh, see http://kaldi-asr.org/doc/queue.html.
Modify egs/callhome/v1/run_prepare_shared.sh according to storage paths of your corpora.

Data preparation

cd egs/callhome/v1
./run_prepare_shared.sh
# If you want to conduct 1-4 speaker experiments, run below.
# You also have to set paths to your corpora properly.
./run_prepare_shared_eda.sh

Self-attention-based model using 2-speaker mixtures

./run.sh

BLSTM-based model using 2-speaker mixtures

local/run_blstm.sh

Self-attention-based model with EDA using 1-4-speaker mixtures

./run_eda.sh

References

[1] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe, "End-to-End Neural Speaker Diarization with Permutation-free Objectives," Proc. Interspeech, pp. 4300-4304, 2019

[2] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe, "End-to-End Neural Speaker Diarization with Self-attention," Proc. ASRU, pp. 296-303, 2019

[3] Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu, "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors," Proc. Interspeech, 2020

Citation

@inproceedings{Fujita2019Interspeech,
 author={Yusuke Fujita and Naoyuki Kanda and Shota Horiguchi and Kenji Nagamatsu and Shinji Watanabe},
 title={{End-to-End Neural Speaker Diarization with Permutation-free Objectives}},
 booktitle={Interspeech},
 pages={4300--4304},
 year=2019
}


eend's Issues

Adaptation error on AMI data: Invalid shape for monophonic audio: ndim=2, shape=(400000, 2)

Hi,

While using the AMI data (*.Mix-Headset.wav) for adaptation, I get the following error:
Traceback (most recent call last):
  File "/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/iterators/multiprocess_iterator.py", line 435, in fetch_batch
    batch_ret[0] = [self.dataset[idx] for idx in indices]
  File "/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/iterators/multiprocess_iterator.py", line 435, in <listcomp>
    batch_ret[0] = [self.dataset[idx] for idx in indices]
  File "/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/dataset/dataset_mixin.py", line 67, in __getitem__
    return self.get_example(index)
  File "/workspace/EEND/eend/chainer_backend/diarization_dataset.py", line 86, in get_example
    self.n_speakers)
  File "/workspace/EEND/eend/feature.py", line 249, in get_labeledSTFT
    Y = stft(data, frame_size, frame_shift)
  File "/workspace/EEND/eend/feature.py", line 156, in stft
    hop_length=frame_shift).T[:-1]
  File "/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/site-packages/librosa/core/spectrum.py", line 217, in stft
    util.valid_audio(y)
  File "/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/site-packages/librosa/util/utils.py", line 295, in valid_audio
    "ndim={:d}, shape={}".format(y.ndim, y.shape)
librosa.util.exceptions.ParameterError: Invalid shape for monophonic audio: ndim=2, shape=(400000, 2)

What I understand is that this error is a result of supplying stereo files in place of mono files. However, soxi, audacity, and python wave packages display the channel info as mono. I verified the shape using the librosa package independently for a few files and ndim= is never 2. Ex:
import librosa

filename = "IS1003a.Mix-Headset.wav"
audioData, sampleRate = librosa.load(filename)
print(audioData.shape)
# Output: (20142528,)

Setting mono=False in valid_audio of utils.py does not help.

Is the mixing of multiple headset files into a single file in AMI causing the issue? What other reasons are possible? Is there a way out? Kindly excuse me if this is not an EEND-specific issue. Any input regarding this would be helpful.

Thank You,
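
The shape in the error, (400000, 2), is what a reader returning (samples, channels) produces for a two-channel chunk. Below is a minimal sketch of a defensive downmix, assuming NumPy arrays laid out as (samples, channels). `to_mono` is a hypothetical helper, not part of EEND, and it does not explain why a file reported as mono yields two channels.

```python
import numpy as np

def to_mono(data):
    """Average channels if `data` is (samples, channels); pass 1-D audio through."""
    data = np.asarray(data)
    if data.ndim == 1:
        return data
    if data.ndim == 2:
        return data.mean(axis=1)
    raise ValueError("unexpected audio shape: {}".format(data.shape))

# Hypothetical two-channel chunk with the shape from the error message
stereo = np.zeros((400000, 2), dtype=np.float32)
print(to_mono(stereo).shape)  # (400000,)
```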

Run log is empty

train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train

Started at Mon Apr 22 17:20:17 CST 2024

python version: 3.7.16 (default, Jan 17 2023, 22:20:44) [GCC 11.2.0]
chainer version: 7.8.0
cupy version: 7.7.0
cuda version: 10010
cudnn version: 7605
namespace(attractor_decoder_dropout=0.1, attractor_encoder_dropout=0.1, attractor_loss_ratio=1.0, backend='chainer', batchsize=8, config=[<yamlargparse.Path object at 0x7fef1ab59c10>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, shuffle=False, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, use_attractor=False, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730 chunks
1863 chunks
GPU device 0 is used
Prepared model

What are the two numbers at the end of each line in "segment" file?

Thank you for sharing your code!

I would like to know exactly what the numbers at the end of each line in the segments file are. Do they represent the overlap times that were created, or the speech duration of the one speaker identified by the speaker ID at the beginning of the line?

Mohammad

run_prepare_shared.sh error

I encounter a problem when running run_prepare_shared.sh in egs/callhome.

I wonder whether this is because I didn't install kaldi correctly.
Can anyone help me?

Makefile cupy-cuda chainer

In the Makefile, I am not sure what line 36 is trying to do.

I get the following error:

miniconda3/envs/eend/bin/pip install cupy-cuda10==6.2.0 chainer==6.2.0
ERROR: Could not find a version that satisfies the requirement cupy-cuda10==6.2.0
ERROR: No matching distribution found for cupy-cuda10==6.2.0

Any timeline to support the newly-published papers?

I notice that some newly published papers greatly extend the original EEND. Is there a timeline for supporting those new functions?

Restrict the number of cores being used during inference

Is there a way to restrict the number of cores being used during inference? infer.py does not seem to have a parameter that can be tweaked to achieve this. On a 32-core machine, the inference process occupies all the resources, causing other important runs to crash. Since GPU-based inference is not an option, any insight on this would be helpful. Thank you.
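
One common approach, sketched here under the assumption that the CPU load comes from multithreaded BLAS/OpenMP kernels used by NumPy (not verified for infer.py), is to cap the native thread pools through environment variables before NumPy is imported. The variable names are the standard ones read by OpenMP, MKL, and OpenBLAS; `limit_threads` is a hypothetical helper.

```python
import os

def limit_threads(n):
    """Cap common native thread pools; must run before importing numpy."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)

limit_threads(4)
print(os.environ["OMP_NUM_THREADS"])  # 4
```

On Linux, pinning the process to specific cores with `os.sched_setaffinity` or the `taskset` utility is another option that works regardless of which library spawns the threads.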

Why is "load_segments_rechash" used instead of "load_segments" when loading the segments file? It causes a KeyError in the pipeline.

This is regarding the mini_librispeech recipe.
Line 254 in eend/feature.py, filtered_segments = kaldi_obj.segments[kaldi_obj.segments['rec'] == rec], raises a KeyError for 'rec' if load_segments_rechash is used at line 149 in eend/kaldi_data.py. This can be fixed by using filtered_segments = kaldi_obj.segments[rec], but I'm not sure whether we should change this.
Any help is welcome.
Please let me know which one to use and when. Thanks.

about data prepare

Thank you for your open source code.
When I run run_prepare_shared.sh, I run into some problems. If I set the nj parameter on line 90 to 100, the following errors occur:
~/projects/EEND-master/egs/mini_librispeech/v1/utils/validate_data_dir.sh: no such directory data/simu/data/train_clean_5_ns2_beta2_500
run.pl: job failed, log is in data/simu/.work/random_mixture_train_clean_5_ns2_beta2_500.log
utils/split_scp.pl: You are splitting into too many pieces! [reduce $nj (100) to be smaller than the number of lines (5) in data/simu/.work/mixture_train_clean_5_ns2_beta2_500.scp]
But if I set nj to 3, I get only a very small amount of mixed audio.

Another problem is that, whether I set nj to 3 or 100, there is always an error recorded in data/simu/.work/random_mixture_train_clean_5_ns2_beta2_500.log:
Traceback (most recent call last):
  File "../../../eend/bin/random_mixture.py", line 123, in <module>
    rir = rirs[random.choice(all_rirs)]
  File "/home/tp/anaconda3/envs/EEND/lib/python3.7/random.py", line 261, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence

How should I set the nj value, and how can I avoid the error in the log? Looking forward to your answer. Thanks!

Question about shuffle

In the implementation, acoustic features rather than embeddings are shuffled during training. Is that OK? The positional encoding for the Transformer-based encoder seems to be a meaningless feature.

Issues about data preparation

EEND-master/egs/mini_librispeech/v1/utils/validate_data_dir.sh: no such directory data/simu/data/train_clean_5_ns2_beta2_10
run.pl: job failed, log is in data/simu/.work/random_mixture_train_clean_5_ns2_beta2_10.log
utils/split_scp.pl: error: empty input scp file data/simu/.work/mixture_train_clean_5_ns2_beta2_10.scp
utils/split_scp.pl: You are splitting into too many pieces! [reduce $nj (10) to be smaller than the number of lines (0) in data/simu/.work/mixture_train_clean_5_ns2_beta2_10.scp]

I don't know how to solve this problem. Thanks a lot for your help!

What is the extension for files like segments, utt2spk?

What is the extension for files like segments, utt2spk? Are they text files?

Tcl_AsyncDelete error

Hello, some time ago I got the following error while training:

   0.55256 iters/sec. Estimated time to finish: 19 days, 0:00:23.591611.
     total [####..............................................]  9.99%
this epoch [#################################################.] 99.30%
    100700 iter, 9 epoch / 100 epochs
   0.59537 iters/sec. Estimated time to finish: 17 days, 15:10:34.544256.
Tcl_AsyncDelete: async handler deleted by the wrong thread
# Accounting: time=221865 threads=1

Do you have any ideas of why this could happen? Stackoverflow says that matplotlib should use 'Agg' backend, and it does.

How does scoring work in run_eda.sh?

I have trained a 2-spk model on custom dataset with an overall DER of 0.063. I would like to run inference and scoring on another custom dataset containing 2 spks.

The part that does scoring in run_eda.sh is

if [ $stage -le 8 ]; then
    echo "scoring at $scoring_dir"
    if [ -d $scoring_dir ]; then
        echo "$scoring_dir already exists. "
        echo " if you want to retry, please remove it."
        exit 1
    fi
    for dset in callhome2_spkall; do
        work=$scoring_dir/$dset/.work
        mkdir -p $work
        find $infer_dir/$dset -iname "*.h5" > $work/file_list_$dset
        for med in 1 11; do
        for th in 0.3 0.4 0.5 0.6 0.7; do
        make_rttm.py --median=$med --threshold=$th \
            --frame_shift=$infer_frame_shift --subsampling=$infer_subsampling --sampling_rate=$inf$
            $work/file_list_$dset $scoring_dir/$dset/hyp_${th}_$med.rttm
        md-eval.pl -c 0.25 \
            -r data/eval/$dset/rttm \
            -s $scoring_dir/$dset/hyp_${th}_$med.rttm > $scoring_dir/$dset/result_th${th}_med${med}_collar0.25 2>/dev/null || exit
        done
        done
    done
fi

I understand that make_rttm.py creates the hypotheses, which I was able to do. However, I am unable to figure out how the rescoring works. In run_eda.sh, rescoring is done using

md-eval.pl -c 0.25 -r data/eval/$dset/rttm -s $scoring_dir/$dset/hyp_${th}_$med.rttm > $scoring_dir/$dset/result_th${th}_med${med}_collar0.25 2>/dev/null || exit

I cannot find the file md-eval.pl which seems to be doing the rescoring here. Can anybody point me towards this particular file?
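
As a rough illustration of the hypothesis step only (scoring with md-eval.pl is not covered), the sketch below thresholds per-frame speaker probabilities and merges consecutive active frames into (start, duration) segments. It is an assumption about the general technique, not a copy of make_rttm.py, and it omits the median filter. The 0.1 s frame length assumes frame_shift=80, subsampling=10, and sampling_rate=8000, values that appear in the training logs on this page. `frames_to_segments` is a hypothetical name.

```python
def frames_to_segments(probs, threshold, frame_sec):
    """Return (start_sec, duration_sec) for each run of frames above threshold."""
    segments = []
    start = None
    # A -inf sentinel is never active, so it closes a run that reaches the end.
    for i, p in enumerate(list(probs) + [float("-inf")]):
        active = p > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((round(start * frame_sec, 2),
                             round((i - start) * frame_sec, 2)))
            start = None
    return segments

frame_sec = 80 * 10 / 8000  # 0.1 s per output frame under the stated assumptions
print(frames_to_segments([0.1, 0.6, 0.7, 0.2, 0.9], 0.5, frame_sec))
# [(0.1, 0.2), (0.4, 0.1)]
```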

How to evaluate long recordings with SA-EEND

The self-attentive EEND paper states: "we split the input audio recordings into non-overlapping 50-second segments. At the inference stage, we used the entire sequence for each recording."

As far as I understand, the receptive field of TransformerEncoder is limited by n_units. I can't just split the recording into 50-second segments and evaluate each segment separately, because the speaker labeling can differ between segments.

What is the right way to handle long recordings with the SA-EEND model?

Licensing Information

Hey
Thank you for releasing this github repository. It will be very helpful if you can add a Licence for your code.

Run log is empty and GPU usage is 0

There is no error in the run log, but the log stays empty, the job keeps showing pre-training, and GPU usage is 0.

best_score.sh file is missing

The model trains properly, but the run ends with:

inference at exp/diarize/infer/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train.avg8-10.infer
scoring at exp/diarize/scoring/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train.avg8-10.infer
./run.sh: line 147: best_score.sh: command not found
Could you please tell me where to find best_score.sh?

Question about CUDA toolkit version

Hi, thanks a lot for your work.
I have a question about the CUDA Toolkit version.

My experiments run on an NVIDIA RTX 3090 GPU, so I have to use CUDA Toolkit 11.1 or higher.
But the dependencies of your repo seem to require a CUDA Toolkit version from 8.0 to 10.1.

Can I run your recipe with CUDA Toolkit 11.1 or higher?

The SA-EEND-EDA experiment DER result on mini_librispeech

Excellent code!
I have one question: I tried the experiment in egs/mini_librispeech, but I got a DER of about 30.34.
I chose mini_librispeech because I lack the LDC data.
I trained SA-EEND-EDA directly on the mini_librispeech set (setting "speaker_num=2" in train.yaml), then ran inference (also setting "speaker_num=2" in infer.yaml) and the scoring command.
The other settings remain unchanged, so the number of epochs is 100.
I wonder whether I did something wrong.
If you have already tried this experiment on mini_librispeech, can you tell me your final DER?
I want to know whether I ran this experiment correctly.
Thanks a lot!!

How to resume training?

To resume training, do we only provide snapshot path for the --resume argument,
or do we also have to provide the --init_model argument?
If yes, what file do we input for init_model?

Confusion about how uttid is stored

Hi, I'm a little confused about how uttid is calculated and stored. In make_mixture.py, the times are multiplied by 100 when computing the uttid:
uttid = '{}{}{:07d}_{:07d}'.format(spkid, recid, int(startpos / args.rate * 100), int(endpos / args.rate * 100))
Why not do something like str(int(startpos / args.rate * 1000) / 1000) to store it more precisely?
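
To make the arithmetic concrete: dividing a sample position by the rate gives seconds, and multiplying by 100 gives centiseconds, so the ID has 10 ms resolution. With frame_shift=80 at 8000 Hz (values from the training logs on this page), one frame shift is also 10 ms; whether that is the authors' reason is not stated. The sketch below reuses the format string from the question with hypothetical inputs; `make_uttid` is an illustrative wrapper.

```python
def make_uttid(spkid, recid, startpos, endpos, rate):
    """Format string copied from the question; times are written in centiseconds."""
    return '{}{}{:07d}_{:07d}'.format(
        spkid, recid,
        int(startpos / rate * 100), int(endpos / rate * 100))

# Hypothetical values: samples 4000..12000 at 8000 Hz span 0.5 s..1.5 s.
print(make_uttid('spkA', 'recB', 4000, 12000, 8000))  # spkArecB0000050_0000150
```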

High memory usage

Hi there,
Thanks for the nice papers and implementation!
As part of a set of experiments, I'm trying to run this on a custom dataset with around 1500 recordings, accounting for a total of 100 hours of speech.
However, it seems to use far too much RAM. It filled 64G of RAM and almost 64G of swap, and I had to stop it to prevent the system from freezing.
Is this a known behavior?
Are there some workarounds to limit memory usage?
Thanks!

Note: the mini_librispeech recipe seems to run properly though, using around 2G of memory. There might be some issues with my setup/data.

Question about running lstm.sh

When I run the BLSTM script at eend/egs/mini_librispeech/v1/local/run_blstm, I get the following UserWarning: Shared memory size is too small.
Please set shared_mem option for MultiprocessIterator.
Expect shared memory size: 4298700 bytes.
Actual shared memory size: 3118900 bytes.
But I am using a V100 with 16 GB of memory!
The error output follows.

Exception in main training loop: 'cupy.cuda.memory.MemoryPointer' object is not iterable
Traceback (most recent call last):
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/training/trainer.py", line 316, in run
update()
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 175, in update
self.update_core()
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 189, in update_core
optimizer.update(loss_func, **in_arrays)
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/optimizer.py", line 825, in update
loss = lossfun(*args, **kwds)
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/models.py", line 588, in __call__
F.stack([dc_loss(em, t) for (em, t) in zip(ems, ts)]))
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/models.py", line 588, in <listcomp>
F.stack([dc_loss(em, t) for (em, t) in zip(ems, ts)]))
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/models.py", line 333, in dc_loss
[int(''.join(str(x) for x in t), base=2) for t in label.data]] = 1
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/models.py", line 333, in dc_loss
[int(''.join(str(x) for x in t), base=2) for t in label.data]] = 1
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "../../../eend/bin/train.py", line 82, in <module>
train(args)
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/train.py", line 223, in train
trainer.run()
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/training/trainer.py", line 349, in run
six.reraise(*exc_info)
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/six.py", line 719, in reraise
raise value
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/training/trainer.py", line 316, in run
update()
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 175, in update
self.update_core()
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 189, in update_core
optimizer.update(loss_func, **in_arrays)
File "/home/chenyafeng.cyf/EEND-master/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/optimizer.py", line 825, in update
loss = lossfun(*args, **kwds)
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/models.py", line 588, in __call__
F.stack([dc_loss(em, t) for (em, t) in zip(ems, ts)]))
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/models.py", line 588, in <listcomp>
F.stack([dc_loss(em, t) for (em, t) in zip(ems, ts)]))
File "/home/chenyafeng.cyf/EEND-master/eend/chainer_backend/models.py", line 333, in dc_loss
[int(''.join(str(x) for x in t), base=2) for t in label.data]] = 1
TypeError: 'cupy.cuda.memory.MemoryPointer' object is not iterable

What should I do? I have no idea how to solve it.

When I reproduced the code, I got the following error when running ./run.sh, what should I do?

When I reproduced the code, I got the following error when running ./run.sh, what should I do?
bash: line 1: 147388 Segmentation fault (core dumped) ( train.py -c conf/eda/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.eda_train ) 2>> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.eda_train/.work/train.log >> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.eda_train/.work/train.log
run.pl: job failed, log is in exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.eda_train/.work/train.log

The following is the content of train.log:
# train.py -c conf/eda/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.eda_train
# Started at Mon Dec 20 15:23:59 CST 2021
#
# Accounting: time=2 threads=1
# Ended (code 139) at Mon Dec 20 15:24:01 CST 2021, elapsed time 2 seconds

about the input_dim in infer stage

Thanks for the open source EEND code.

Something is wrong with my model at the inference stage. The data I used for training is sampled at 16 kHz, so I use the "logmel" feature. But at the inference stage, the computed input_dim is 3855 (it should be 40*15).

In the get_input_dim function, if the feature type is "logmel", the "else" branch runs. I can't follow the calculation in the "else" branch, so I am asking for help (for now I just force the input_dim).

infer.yaml
sampling_rate: 16000
frame_size: 400
frame_shift: 160
input_transform: logmel

def get_input_dim(
        frame_size,
        context_size,
        transform_type,
):
    if transform_type.startswith('logmel23'):
        frame_size = 23
    else:
        fft_size = 1 << (frame_size - 1).bit_length()
        frame_size = int(fft_size / 2) + 1
    input_dim = (2 * context_size + 1) * frame_size
    return input_dim
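
The "else" branch can be traced by hand. With frame_size=400 from the infer.yaml above, the next power of two at or above 400 is 512, which gives 512/2 + 1 = 257 FFT bins. Assuming context_size=7 (the value in the training logs on this page; the issue itself does not state it), the result is (2*7 + 1) * 257 = 3855, matching the reported number. So this function appears to handle only 'logmel23' specially and treats every other transform name, including 'logmel', as a raw spectrum. The sketch below re-runs the quoted function to confirm the arithmetic.

```python
def get_input_dim(frame_size, context_size, transform_type):
    """Copy of the function quoted in the issue, condensed for this walk-through."""
    if transform_type.startswith('logmel23'):
        frame_size = 23
    else:
        # Round frame_size up to the next power of two (the FFT size),
        # then count the non-redundant FFT bins: fft_size / 2 + 1.
        fft_size = 1 << (frame_size - 1).bit_length()
        frame_size = int(fft_size / 2) + 1
    # Each frame is spliced with context_size frames on both sides.
    return (2 * context_size + 1) * frame_size

# context_size=7 is an assumption taken from the logs elsewhere on this page.
print(get_input_dim(400, 7, 'logmel'))       # 3855
print(get_input_dim(200, 7, 'logmel23_mn'))  # 345
```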

Unable to identify speaker cluster from generated RTTM

I have trained a model using the callhome recipe. The generated RTTM looks as follows.

SPEAKER 2290120-audio 1 0.00 0.55 <NA> <NA> 2290120-audio_0 <NA>
SPEAKER 2290120-audio 1 0.60 1.05 <NA> <NA> 2290120-audio_0 <NA>
SPEAKER 2290120-audio 1 1.75 0.20 <NA> <NA> 2290120-audio_0 <NA>

Is there any way to identify who spoke when? For example, the RTTM generated by the Kaldi diarization recipe looks as follows:

SPEAKER 2290120-audio 1 0.00 0.55 <NA> <NA> A <NA>
SPEAKER 2290120-audio 1 0.60 1.05 <NA> <NA> B <NA>
SPEAKER 2290120-audio 1 1.75 0.20 <NA> <NA> A <NA>
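
In an RTTM SPEAKER line the eighth field is the speaker name, so turns can be grouped by that field. The sketch below does this for lines written with the standard `<NA>` token. The sample input is hypothetical: the second line is given a `_1` suffix to show two labels, whereas the output in the question contains only `_0`. Mapping such labels to real identities needs information outside the RTTM. `who_spoke_when` is an illustrative name.

```python
from collections import defaultdict

def who_spoke_when(rttm_lines):
    """Group (start, end) turns by the speaker-name field of SPEAKER lines."""
    turns = defaultdict(list)
    for line in rttm_lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start, dur = float(fields[3]), float(fields[4])
        turns[fields[7]].append((start, round(start + dur, 2)))
    return dict(turns)

rttm = [
    "SPEAKER 2290120-audio 1 0.00 0.55 <NA> <NA> 2290120-audio_0 <NA>",
    "SPEAKER 2290120-audio 1 0.60 1.05 <NA> <NA> 2290120-audio_1 <NA>",
    "SPEAKER 2290120-audio 1 1.75 0.20 <NA> <NA> 2290120-audio_0 <NA>",
]
for speaker, spans in who_spoke_when(rttm).items():
    print(speaker, spans)
```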

Can't find file parse_options.sh

How can I get the file referenced at line 32, ". parse_options.sh || exit", in egs/mini_librispeech/v1/run_prepare_shared.sh?

about the "get_free_gpus()" function

EEND/eend/chainer_backend/utils.py

Line 40 in b851eec

del gpus[busid]

When there are 2 processes running on the same GPU, "del gpus[busid]" raises a KeyError.
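
A minimal sketch of one way to make such a deletion tolerant, assuming a dictionary keyed by PCI bus ID like the `gpus` variable in the quoted line. This is not the repository's get_free_gpus; `remove_busy_gpus` and the sample data are hypothetical. The `strip()` call is included because a traceback elsewhere on this page shows a key with a leading space.

```python
def remove_busy_gpus(gpus, busy_busids):
    """Drop busy GPUs from a {bus_id: index} map, tolerating repeats and stray spaces."""
    free = dict(gpus)
    for busid in busy_busids:
        free.pop(busid.strip(), None)  # no KeyError if already removed or unknown
    return free

gpus = {"00000000:01:00.0": 0, "00000000:02:00.0": 1}
# Two processes on the same GPU list its bus ID twice.
busy = [" 00000000:01:00.0", " 00000000:01:00.0"]
print(remove_busy_gpus(gpus, busy))  # {'00000000:02:00.0': 1}
```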

PositionalEncoding is not used?

Hello, in transformer.py I found that pos_enc is initialized in the encoder but is not used in the forward pass?

Smoothing the activations at the output of the transformer

Hey there,
I was wondering if you encountered any issues related to smoothing the speaker activations predicted using the Transformer model. An encoder only transformer tends to output speaker activations which are not as smooth as the ones provided by other recurrent models (such as Bi-LSTMs and such).
Did you resort to some tricks for smoothing the output activations provided by the Transformer or this was not an issue at all?

Request for the 2-spk callhome result

Hello researchers. We would like to compare our results with EEND-EDA in the callhome 2-spk condition. For a fair comparison, could you kindly provide the original RTTM hypotheses?

The result is only used for academic illustration.

Thanks.

About the train loss

Hi, sorry to disturb you. I was reading your code in EEND/eend/chainer_backend/models.py. During training you do not apply the sigmoid; it is only used at inference. I tried adding it to the model as your article describes, but then the loss does not go down, and I don't know why. Thanks for your reply!

Problem running run.sh

I have finished running run_prep.sh, but when I start run.sh, the following problem appears:
training model at exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train.
run.pl: job failed, log is in exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log
The log file is empty. Does anyone know how to solve it?

Experiment results on multiple speakers dataset

Great paper! I enjoyed reading it and like the idea of using a simple model to solve the speaker diarization problem.

I noticed that your model can handle multiple speakers, and I wonder whether you have benchmarked it against state-of-the-art techniques on datasets with more than 2 speakers.
I would appreciate it if you could share experimental results on datasets with a larger number of speakers. :-)

Confusion about the calculation of DER

In general, diarization scoring tolerates a 250 ms error at the start and end of each segment, but the function "calc_diarization_error()" in "model.py" does not seem to take this into account. Did I miss something concerning this tolerance?
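
For reference, the collar idea can be sketched as a frame mask: frames within the collar around a reference segment boundary are excluded from scoring. This only illustrates the concept and is not how md-eval.pl or calc_diarization_error() is implemented. The scoring commands quoted elsewhere on this page pass -c 0.25 to md-eval.pl. `scored_frames` is a hypothetical helper that works in integer milliseconds.

```python
def scored_frames(n_frames, frame_ms, boundaries_ms, collar_ms=250):
    """Return a list of bools; False marks frames inside a collar around a boundary."""
    mask = []
    for i in range(n_frames):
        t = i * frame_ms  # frame start time in milliseconds
        in_collar = any(abs(t - b) <= collar_ms for b in boundaries_ms)
        mask.append(not in_collar)
    return mask

# Ten 100 ms frames with one reference boundary at 500 ms
print(scored_frames(10, 100, [500]))
# [True, True, True, False, False, False, False, False, True, True]
```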

Trained custom data on mini_librispeech recipe but inference just gives 1 speaker for whole audio file.

SPEAKER aaak 1   11.40    0.10 <NA> <NA> aaak_4 <NA>
SPEAKER aaak 1   14.00    0.10 <NA> <NA> aaak_4 <NA>

This is the hyp_0.3_1.rttm I got after scoring. For the entire aaak.wav file only aaak_4 speaker is detected.

"main/DER": 0.4484034770634306,
"validation/main/DER": 0.5290581162324649,

This is the DER after 200 epochs. Can someone help me understand why the inference is detecting just one speaker.

aaaa wav_8/aaaa.wav
aaab wav_8/aaab.wav

This is wav.scp (first 2 lines)

aaab-000521-000625 Khanna
aaab-000829-000923 Khanna

This is the utt2spk file

aaab-000521-000625 aaab 5.21 6.25
aaab-000829-000923 aaab 8.29 9.23

This is the segments file
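
For readers comparing these files: a Kaldi-style segments line has the form `<utterance-id> <recording-id> <start> <end>`, with times in seconds. A minimal parsing sketch using the first segments line above (`parse_segments_line` is a hypothetical helper):

```python
def parse_segments_line(line):
    """Split a Kaldi-style segments line: <utt-id> <rec-id> <start-sec> <end-sec>."""
    utt, rec, start, end = line.split()
    return {"utt": utt, "rec": rec, "start": float(start), "end": float(end)}

seg = parse_segments_line("aaab-000521-000625 aaab 5.21 6.25")
print(seg["rec"], round(seg["end"] - seg["start"], 2))  # aaab 1.04
```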

Question about pit loss calculation in EDA-EEND model

Thanks for your valuable work on EEND. In eend/chainer_backend/models.py (lines 491-497), why is the loss calculated twice, with the result finally replaced by the standard loss?

loss, labels = batch_pit_n_speaker_loss(ys_padded, ts_padded, n_speakers)

loss = standard_loss(ys, labels)

I am looking forward to your reply. Thanks!
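
As background, the permutation-free objective from [1] can be sketched in a few lines: the loss is evaluated for every ordering of the reference speakers and the smallest value is kept, together with the labels in that ordering. The code below is a toy single-sequence version in pure Python, not the repository's batch_pit_n_speaker_loss. One possible reading of the quoted lines is that the first call is used to obtain the best-permuted labels and the second computes the loss against them; that reading is not confirmed here.

```python
import itertools
import math

def bce(p, t):
    """Binary cross-entropy for one probability p and one 0/1 target t."""
    return -math.log(p) if t == 1 else -math.log(1.0 - p)

def pit_loss(probs, labels):
    """Try every speaker permutation of `labels`; return (min loss, permuted labels)."""
    n_spk = len(probs[0])
    best = None
    for perm in itertools.permutations(range(n_spk)):
        permuted = [[frame[s] for s in perm] for frame in labels]
        loss = sum(bce(p, t)
                   for pf, tf in zip(probs, permuted)
                   for p, t in zip(pf, tf)) / (len(probs) * n_spk)
        if best is None or loss < best[0]:
            best = (loss, permuted)
    return best

probs = [[0.9, 0.1], [0.8, 0.2]]  # per-frame probabilities for 2 output slots
labels = [[0, 1], [0, 1]]         # reference lists the speakers in the other order
loss, permuted = pit_loss(probs, labels)
print(permuted)  # [[1, 0], [1, 0]]
```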

Can this code support multi-gpu training?

Question about infer.py for EDA

Hello! I read the infer.py file and, as I understand it, it first divides the complete audio into chunks and feeds these chunks into the model. At the end, it stacks all the outputs to make the RTTM file.
out_chunks.append(ys[0].data)
......
out_chunks = [np.insert(o, o.shape[1], np.zeros((max_n_speakers - o.shape[1], o.shape[0])), axis=1) for o in out_chunks]
outdata = np.vstack(out_chunks)
I'm a little confused about how you make sure the speaker order is consistent across chunks for the EDA model. The attractors in EDA are generated dynamically for each chunk, and one speaker may be absent from another chunk of the same audio.
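
The padding step in the quoted lines can be read as follows: each chunk output has shape (frames, speakers), chunks with fewer speaker columns are zero-padded to the widest chunk, and the results are stacked along time. The sketch below shows an equivalent formulation with `np.pad`, using hypothetical chunk shapes. It only illustrates the padding and does nothing to align speaker order across chunks, which is the open question here. `stack_chunks` is an illustrative name.

```python
import numpy as np

def stack_chunks(out_chunks):
    """Zero-pad each (frames, speakers) chunk to the widest chunk, then stack in time."""
    max_n_speakers = max(o.shape[1] for o in out_chunks)
    padded = [np.pad(o, ((0, 0), (0, max_n_speakers - o.shape[1])), mode="constant")
              for o in out_chunks]
    return np.vstack(padded)

chunks = [np.ones((3, 1)), np.ones((2, 2))]  # hypothetical chunk outputs
out = stack_chunks(chunks)
print(out.shape)  # (5, 2)
```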

Adaptation Error

Hi,
I am getting the error (IndexError: index 3 is out of bounds for axis 1 with size 2) during adaptation in the run_eda.sh script. What could be the possible reasons for this?

How do I prepare my own data?

I know about files such as spk2utt, utt2spk, and wav.scp. Do I also need to create rttm and segments files? Is the segments file created by SAD?

Can't start training

I was testing the setup on mini_librispeech data. This is the log from when I started training:

# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train 
# Started at Thu Dec  5 19:24:21 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11)  [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7ffb7b99c610>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730  chunks
1863  chunks
Traceback (most recent call last):
  File "../../../eend/bin/train.py", line 72, in <module>
    train(args)
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/train.py", line 100, in train
    gpuid = use_single_gpu()
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 56, in use_single_gpu
    cvd = get_free_gpus()[0]
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 40, in get_free_gpus
    del gpus[busid]
KeyError: ' 00000000:01:00.0'
# Accounting: time=1 threads=1
# Ended (code 1) at Thu Dec  5 19:24:22 IST 2019, elapsed time 1 seconds

Can you suggest what is going wrong?

How to Run Callhome

(Speech) Sangram:v1 sing$ ./run_prepare_shared.sh
./run_prepare_shared.sh: line 34: Modify: command not found
./run_prepare_shared.sh: line 35: This: command not found
prepare kaldi-style datasets
/Users/sing/Postdoc/IMP/EEND/egs/callhome/v1/utils/validate_data_dir.sh: empty file spk2utt
mkdir: /callhome/.tmp/: Read-only file system
--2021-05-19 22:16:44-- http://www.openslr.org/resources/10/sre2000-key.tar.gz
Resolving www.openslr.org... 46.101.158.64
Connecting to www.openslr.org|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 340276 (332K) [application/x-gzip]
/callhome/.tmp: No such file or directory
/callhome/.tmp/sre2000-key.tar.gz: No such file or directory

Cannot write to '/callhome/.tmp/sre2000-key.tar.gz' (No such file or directory).
tar: Error opening archive: Failed to open '/callhome/.tmp//sre2000-key.tar.gz'
local/make_callhome.sh: line 29: /callhome/.tmp//reco.list: No such file or directory
local/make_callhome.sh: line 43: /callhome/.tmp//reco.list: No such file or directory
cp: directory /callhome does not exist
local/make_callhome.sh: line 50: /callhome/utt2spk: No such file or directory
local/make_callhome.sh: line 51: /callhome/spk2utt: No such file or directory
cp: /callhome/.tmp//sre2000-key/reco2num: No such file or directory
cp: directory /callhome does not exist
utils/validate_data_dir.sh: no such directory /callhome
mkdir: /callhome/.backup: Read-only file system
utils/fix_data_dir.sh: no such directory /callhome
copy_data_dir.sh: no such file /callhome/utt2spk
copy_data_dir.sh: no such file /callhome/utt2spk
local/make_callhome.sh: line 62: /callhome1/wav.scp: No such file or directory
Can't open /callhome/wav.scp: No such file or directory at utils/shuffle_list.pl line 37.
mkdir: /callhome1/.backup: Read-only file system
utils/fix_data_dir.sh: no such directory /callhome1
local/make_callhome.sh: line 65: /callhome2/wav.scp: No such file or directory
mkdir: /callhome2/.backup: Read-only file system
utils/fix_data_dir.sh: no such directory /callhome2
local/make_callhome.sh: line 68: /callhome1/reco2num_spk: No such file or directory
local/make_callhome.sh: line 70: /callhome2/reco2num_spk: No such file or directory
copy_data_dir.sh: no such file data/callhome1/utt2spk
awk: can't open file data/callhome1/reco2num_spk
source line number 1
Can't open data/callhome1/wav.scp: No such file or directory at utils/filter_scp.pl line 65.
awk: can't open file data/callhome/fullref.rttm
source line number 1
Empty list of recordings (bad file data/callhome1_spk2/segments)?
Empty list of recordings (bad file data/callhome1_spk2/segments)?
utils/data/get_reco2dur.sh: obtaining durations from recordings
utils/data/get_reco2dur.sh: successfully obtained recording lengths from sphere-file headers
usage: rm [-f | -i] [-dPRrvW] file ...
unlink file
utils/data/get_reco2dur.sh: computed data/callhome1_spk2/reco2dur
copy_data_dir.sh: no such file data/callhome2/utt2spk
awk: can't open file data/callhome2/reco2num_spk
source line number 1
Can't open data/callhome2/wav.scp: No such file or directory at utils/filter_scp.pl line 65.
awk: can't open file data/callhome/fullref.rttm
source line number 1
Empty list of recordings (bad file data/callhome2_spk2/segments)?
Empty list of recordings (bad file data/callhome2_spk2/segments)?
utils/data/get_reco2dur.sh: obtaining durations from recordings
utils/data/get_reco2dur.sh: successfully obtained recording lengths from sphere-file headers
usage: rm [-f | -i] [-dPRrvW] file ...
unlink file
utils/data/get_reco2dur.sh: computed data/callhome2_spk2/reco2dur
/Users/sing/Postdoc/IMP/EEND/egs/callhome/v1/utils/validate_data_dir.sh: no such file spk2utt
--2021-05-19 22:16:44-- http://www.openslr.org/resources/15/speaker_list.tgz
Resolving www.openslr.org... 46.101.158.64
Connecting to www.openslr.org|46.101.158.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163742 (160K) [application/x-gzip]
Saving to: 'data/local/speaker_list.tgz.3'

speaker_list.tgz.3 100%[===================================================>] 159.90K --.-KB/s in 0.08s

2021-05-19 22:16:45 (1.85 MB/s) - 'data/local/speaker_list.tgz.3' saved [163742/163742]

x speaker_list
find: /LDC/LDC2006S44/: No such file or directory
Error getting list of sph files at local/make_sre.pl line 23.
Smartmatch is experimental at local/make_swbd2_phase1.pl line 51.
Could not open /LDC/doc/callstat.tbl at local/make_swbd2_phase1.pl line 18.
Smartmatch is experimental at local/make_swbd2_phase2.pl line 53.
Could not open /LDC/LDC99S79/DISC1/doc/callstat.tbl at local/make_swbd2_phase2.pl line 18.
Smartmatch is experimental at local/make_swbd2_phase3.pl line 48.
Could not open /LDC/LDC2002S06/DISC1/docs/callstat.tbl at local/make_swbd2_phase3.pl line 18.
Smartmatch is experimental at local/make_swbd_cellular1.pl line 28.
Could not open /LDC/LDC2001S13/doc/swb_callstats.tbl at local/make_swbd_cellular1.pl line 18.
Smartmatch is experimental at local/make_swbd_cellular2.pl line 28.
Could not open /LDC/LDC2004S07/docs/swb_callstats.tbl at local/make_swbd_cellular2.pl line 18.
utils/combine_data.sh data/swb_sre_comb data/swbd_cellular1_train data/swbd_cellular2_train data/swbd2_phase1_train data/swbd2_phase2_train data/swbd2_phase3_train data/sre
utils/combine_data.sh: no such file data/swbd_cellular1_train/utt2spk
/Users/sing/Postdoc/IMP/EEND/egs/callhome/v1/utils/validate_data_dir.sh: empty file spk2utt
Preparing data/musan...
In music directory, processed 0 files: 0 had missing wav data
In speech directory, processed 0 files: 0 had missing wav data
In noise directory, processed 0 files: 0 had missing wav data
fix_data_dir.sh: no utterances remained: not proceeding further.
copy_data_dir.sh: no such file data/musan_noise/utt2spk
awk: can't open file /LDC/noise/free-sound/ANNOTATIONS
source line number 1
fix_data_dir.sh: no utterances remained: not proceeding further.
/Users/sing/Postdoc/IMP/EEND/egs/callhome/v1/utils/validate_data_dir.sh: Successfully validated data-directory data/simu_rirs_8k
/Users/sing/Postdoc/IMP/EEND/egs/callhome/v1/utils/validate_data_dir.sh: no such directory exp/segmentation_1a/tdnn_stats_asr_sad_1a/swb_sre_comb_seg
--nj 30 --graph-opts --min-silence-duration=0.03 --min-speech-duration=0.3 --max-speech-duration=10.0 --transform-probs-opts --sil-scale=0.1 --extra-left-context 79 --extra-right-context 21 --frames-per-chunk 150 --extra-left-context-initial 0 --extra-right-context-final 0 --acwt 0.3 data/swb_sre_comb exp/segmentation_1a/tdnn_stats_asr_sad_1a mfcc_hires exp/segmentation_1a/tdnn_stats_asr_sad_1a exp/segmentation_1a/tdnn_stats_asr_sad_1a/swb_sre_comb
utils/data/convert_data_dir_to_whole.sh: Data directory already does not contain segments. So just copying it.
copy_data_dir.sh: no such file data/swb_sre_comb/utt2spk
utils/fix_data_dir.sh: no such file data/swb_sre_comb_whole_hires/utt2spk
(Speech) Sangram:v1 sing$

Have you tried this on the AMI dataset?

Great paper! I like the idea, but I get a terrible result on the AMI dataset (a DER above 40%).

Is the BLSTM result better than the Transformer result?

I ran eend/egs/mini_librispeech/v1/run.sh and eend/egs/mini_librispeech/v1/local/run_blstm.sh. The result of run.sh (Transformer) matches RESULT.md (final DER = 29.96%). However, local/run_blstm.sh (BLSTM) gives DER = 17.02%, which is better than the Transformer. I also tried the CALLHOME corpus, with simulated mixtures generated from SRE2008 and SWBD Cellular 1, and there too the BLSTM outperforms the Transformer.

Note that the following warning appeared when I ran local/run_blstm.sh:

/workspace/EEND/tools/miniconda3/envs/eend/lib/python3.7/site-packages/chainer/iterators/multiprocess_iterator.py:629: UserWarning: Shared memory size is too small.
Please set shared_mem option for MultiprocessIterator.
Expect shared memory size: 4389780 bytes.
Actual shared memory size: 4171864 bytes.

I do not know what causes this. I would also like to know the results you got when running mini_librispeech/v1/local/run_blstm.sh.

Thank you very much.
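
The warning above asks for the `shared_mem` option of Chainer's MultiprocessIterator to be set. A minimal sketch of how a value could be derived from the warning text follows. The helper name `suggest_shared_mem`, the 25% headroom, and the 4096-byte alignment are all illustrative assumptions, not values taken from EEND or Chainer.

```python
import re


def suggest_shared_mem(warning_text, headroom=1.25, align=4096):
    """Parse the 'Expect shared memory size' line of the warning and
    return a padded byte count to try as the shared_mem value.

    headroom and align are arbitrary illustrative choices.
    """
    match = re.search(r"Expect shared memory size:\s*(\d+)\s*bytes",
                      warning_text)
    if match is None:
        raise ValueError("no expected size found in warning text")
    expected = int(match.group(1))
    padded = int(expected * headroom)
    # Round up to the next multiple of `align`.
    return -(-padded // align) * align


if __name__ == "__main__":
    warning = (
        "UserWarning: Shared memory size is too small.\n"
        "Please set shared_mem option for MultiprocessIterator.\n"
        "Expect shared memory size: 4389780 bytes.\n"
        "Actual shared memory size: 4171864 bytes.\n"
    )
    print(suggest_shared_mem(warning))
```

The resulting number would then be passed as `shared_mem=` where the training script constructs its `chainer.iterators.MultiprocessIterator`; where exactly that happens in EEND is not shown in this thread.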

train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train

Started at Mon Apr 22 17:20:17 CST 2024

The model trains properly, but the run ends with:

infer.yaml
sampling_rate: 16000
frame_size: 400
frame_shift: 160
input_transform: logmel

def get_input_dim(
        frame_size,
        context_size,
        transform_type,
):
    if transform_type.startswith('logmel23'):
        frame_size = 23
    else:
        fft_size = 1 << (frame_size - 1).bit_length()
        frame_size = int(fft_size / 2) + 1
    input_dim = (2 * context_size + 1) * frame_size
    return input_dim
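
To see what this function returns for the settings quoted above, here is a small worked example. The function body is copied from the post. The value `context_size=7` and the second call (a `logmel23_mn` transform with `frame_size=200`) are assumptions, borrowed from the training namespace dump in the first issue on this page, and may not match this poster's setup.

```python
def get_input_dim(frame_size, context_size, transform_type):
    """Input dimension after splicing (2 * context_size + 1) frames."""
    if transform_type.startswith('logmel23'):
        # 23 mel bins per frame, regardless of frame_size.
        frame_size = 23
    else:
        # Next power of two >= frame_size, then the number of FFT bins.
        fft_size = 1 << (frame_size - 1).bit_length()
        frame_size = int(fft_size / 2) + 1
    return (2 * context_size + 1) * frame_size


if __name__ == "__main__":
    # Settings from the quoted infer.yaml; context_size=7 is assumed.
    print(get_input_dim(400, 7, 'logmel'))       # 15 * 257
    # Hypothetical training-style settings for comparison.
    print(get_input_dim(200, 7, 'logmel23_mn'))  # 15 * 23
```

Because 'logmel' does not start with 'logmel23', the quoted settings take the FFT branch and give a much larger input dimension than a logmel23-type transform would. If the model was trained with a logmel23-type transform, a dimension mismatch at inference is one possible cause of the failure, though the post does not include the error message to confirm it.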