joonson / voxceleb_unsupervised
Augmentation adversarial training for self-supervised speaker recognition
Hi, thank you for your amazing work. I'm wondering whether there are instructions for loading both the face image frames and the speech segments.
File "/nfs/spk/voxceleb_unsupervised/DatasetLoader.py", line 66, in __getitem__
audio = loadWAVSplit(self.data_list[index], self.max_frames).astype(numpy.float)
File "/nfs/spk/voxceleb_unsupervised/DatasetLoader.py", line 198, in loadWAVSplit
raise e
File "/nfs/spk/voxceleb_unsupervised/DatasetLoader.py", line 194, in loadWAVSplit
startframe = random.sample(range(0, randsize), 2)
File "/nfs/project/tools/anaconda2/lib/python3.6/random.py", line 320, in sample
raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
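The error comes from `random.sample(range(0, randsize), 2)`, which raises `ValueError` whenever `randsize < 2`, i.e. when the utterance is too short to offer two distinct start positions for a pair of crops. A minimal sketch of a guard (the `audiosize`/`max_audio` names are assumptions mirroring the loader, not the repository's actual fix):

```python
import random

def pick_two_starts(audiosize, max_audio):
    """Pick two random, distinct start frames for a pair of crops.

    random.sample(range(0, randsize), 2) fails when randsize < 2,
    so short files fall back to two crops starting at position 0
    (in practice one would pad the audio instead).
    """
    randsize = audiosize - max_audio  # free room left for a crop
    if randsize < 2:
        return [0, 0]
    return random.sample(range(0, randsize), 2)
```

Padding short files up to `max_audio` before sampling is the more common fix, since it keeps the two crops distinct.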
Hello, thank you for the good resource.
As I understand the paper, gradient reversal is applied only during the embedding training phase.
However, in the source code the discriminator training phase also seems to go through the GRL layer, so I wonder why.
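For context, the layer in question is conceptually simple: identity in the forward pass, gradient negation (scaled by a coefficient) in the backward pass. A minimal NumPy sketch of the idea, not the repository's implementation:

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer (illustrative sketch).

    Forward: identity. Backward: multiply the incoming gradient
    by -lam, so the encoder is trained to *fool* the downstream
    discriminator while the discriminator itself trains normally.
    """
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # pass the input through unchanged
        return x

    def backward(self, grad_output):
        # flip and scale the gradient flowing back to the encoder
        return -self.lam * grad_output
```

Whether the discriminator's own update should also pass through the reversal depends on which parameters the optimizer steps on in that phase; if only the discriminator's weights are updated, the sign flip upstream of them has no effect on its update.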
Thanks for the great work. Can you describe specifically how the 1000 pre-computed RIR filters were generated?
Thanks for your work. When reproducing your paper, I can't match the reported results: AP+Noise+RIR EER = 9.56% and AP+AAT+Noise+RIR = 8.65%. I only get AP+Noise+RIR EER = 11.78% and AP+AAT+Noise+RIR = 10.08%. Is there anything else I need to pay attention to during training?
import numpy
from scipy import signal

def gen_echo(ref, rir, filterGain):
    # scale the RIR by filterGain (in dB), convolve, and trim to the input length
    rir = numpy.multiply(rir, pow(10, 0.1 * filterGain))
    echo = signal.convolve(ref, rir, mode='full')[:len(ref)]
    return echo
In this function, the rir array is float32 but ref is int16.
Can convolve operate on arrays of different dtypes?
Is there any problem from a signal-processing point of view?
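NumPy's type promotion handles the mixed int16/float32 inputs (the result is floating point), but an explicit cast makes the intent clear and avoids surprises when normalising afterwards. A sketch of that variant, not the authors' code; `np.convolve` is used here since it matches `scipy.signal.convolve` for 1-D inputs:

```python
import numpy as np

def gen_echo_float(ref, rir, filter_gain_db):
    # cast int16 PCM to float32 before filtering (sketch, not the repo's code)
    ref = np.asarray(ref, dtype=np.float32)
    # apply the gain in dB to the RIR
    rir = np.asarray(rir, dtype=np.float32) * 10.0 ** (0.1 * filter_gain_db)
    # full convolution, trimmed back to the input length
    return np.convolve(ref, rir, mode='full')[:len(ref)]
```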
Hello,
for ii in range(0, it - 1):
    if ii % args.test_interval == 0:
        clr = s.updateLearningRate(args.lr_decay)
This seems to apply one extra learning-rate decay when ii is zero.
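The off-by-one can be illustrated with a small counting sketch (illustrative only; the names follow the snippet above):

```python
def schedule_decays(it, test_interval):
    """Count learning-rate decays under the original and fixed loops.

    The original condition fires at ii == 0, decaying the rate once
    before any training has happened; requiring ii > 0 avoids that.
    """
    decays_original = sum(
        1 for ii in range(0, it - 1) if ii % test_interval == 0)
    decays_fixed = sum(
        1 for ii in range(0, it - 1) if ii > 0 and ii % test_interval == 0)
    return decays_original, decays_fixed
```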
Thank you.
What's the best unsupervised method you know of? And has anyone reproduced the AAT result?
Hi. Thank you very much for the good material!
In your code, I can't find the augmentation part (both the RIR and noise) employed in your paper.
In addition, I'm wondering whether the code for Augmentation Adversarial Training will be released.
Thank you.