SpeakerGAN

Introduction

This repository is an unofficial implementation of the SpeakerGAN paper by Mingming Huang ([email protected]) and Tiezheng Wang ([email protected]), with thanks to TongFeng for advice.

SpeakerGAN paper

SpeakerGAN: Speaker identification with conditional generative adversarial network, by Liyang Chen, Yifeng Liu, Wendong Xiao, Yingxue Wang, Haiyong Xie.

Usage

Step 1: VAD preprocessing.

$python vad.py filelist_with_absolute_path   # Saves a VAD-processed copy in the same directory, with '_vad' appended to the filename.
$cat filelist_with_absolute_path
/datasdc/librispeech/train-clean-100/458/126305/458-126305-0041.wav
/datasdc/librispeech/train-clean-100/4051/11218/4051-11218-0009.wav
/datasdc/librispeech/train-clean-100/7635/105409/7635-105409-0022.wav
.
.
.
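
For reference, the following is a minimal sketch of what the Step 1 pass could look like, assuming 16-bit mono PCM wav input (as in LibriSpeech) and webrtcvad in mode 3 (noted in the differences section below). The function and output naming are illustrative, not the repo's actual vad.py.

import sys
import wave
import webrtcvad

FRAME_MS = 30  # webrtcvad accepts 10, 20 or 30 ms frames

def keep_voiced(path):
    vad = webrtcvad.Vad(3)                         # mode 3 = most aggressive
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        pcm = wf.readframes(wf.getnframes())
        params = wf.getparams()
    frame_bytes = int(rate * FRAME_MS / 1000) * 2  # 2 bytes per 16-bit sample
    voiced = b"".join(
        pcm[i:i + frame_bytes]
        for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)
        if vad.is_speech(pcm[i:i + frame_bytes], rate)
    )
    out_path = path.replace(".wav", "_vad.wav")    # saved next to the input
    with wave.open(out_path, "wb") as out:
        out.setparams(params)                      # frame count is patched on close
        out.writeframes(voiced)

if __name__ == "__main__":
    with open(sys.argv[1]) as filelist:            # filelist_with_absolute_path
        for line in filelist:
            keep_voiced(line.strip())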

Step 2: train / test / generate:

python speakergan.py  # You may need to change the path to the VAD-preprocessed wav files.

Training took us about 65 hours on a single NVIDIA A100 (Ampere) GPU, with the help of a Redis cache.

Our results

Accuracy: 98.1955% on the test set, evaluated on the fixed first 1.6 seconds of each utterance with checkpoint model/2200_D.pkl.

Plots of test accuracy, loss_d / loss_g, and the learning rate schedule are included in the repository.

Generated samples from the Generator checkpoint model/2200_G.pkl (generated_feature plot) are also included.

Details of paper

The following are details about this paper.

================ input ==================

  1. feature: fbank, 8000 Hz, 25 ms frame, 10 ms shift. shape: (160, 64) (see the feature-extraction sketch after this list)

  2. dataset: LibriSpeech train-clean-100, 251 speakers (POIs)

  3. data preprocessing: VAD, mean and variance normalization, shuffling.

  4. 60% train / 40% test.
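
As referenced in item 1, here is a hedged sketch of the feature pipeline: 64-dim fbank at 8000 Hz, 25 ms frames with a 10 ms shift, mean/variance normalization, and truncation to 160 frames. torchaudio is an assumption here; the repo's actual frontend may differ.

import torch
import torchaudio

def extract_fbank(wav_path, num_frames=160):
    waveform, sr = torchaudio.load(wav_path)              # (1, num_samples)
    if sr != 8000:
        waveform = torchaudio.functional.resample(waveform, sr, 8000)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=8000.0,
        frame_length=25.0,     # ms
        frame_shift=10.0,      # ms
        num_mel_bins=64,
    )                                                     # (T, 64) log-mel fbank
    feats = (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)
    return feats[:num_frames]                             # (160, 64) when T >= 160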

================ model architecture ==================

  1. dataflow: data -> feature extraction -> G & D

  2. model architecture:

    G: gated CNN, encoder-decoder, Huber loss + adversarial loss

    D: ResNet blocks, temporal average pooling, FC, softmax, cross-entropy loss + adversarial loss

  3. G: shuffler layer, GLU

  4. D: ReLU (see the block sketch after this list)
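
The block sketch referenced above, in PyTorch: a gated-CNN layer with GLU for the generator and a plain ReLU residual block for the discriminator. Channel counts, kernel sizes and the use of BatchNorm are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Conv layer whose output is gated by GLU (generator side)."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        # produce 2*out_ch channels so GLU can gate them back down to out_ch
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel, padding=kernel // 2)

    def forward(self, x):
        return nn.functional.glu(self.conv(x), dim=1)

class ResBlock(nn.Module):
    """Residual block with ReLU activations (discriminator side)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))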

================ training ==================

  1. lr: 0.0005 for epochs 0-9, 0.0002 for epochs 9-49

  2. L(D): λ1 = λ2 = 1

  3. batch_size: 128 (differs from the paper)

  4. epochs: 2200 (differs from the paper)

  5. D train steps : G train steps = 4 : 1

  6. L_adv loss: label smoothing, real label 1 -> 0.7 ~ 1.0, fake label 0 -> 0.0 ~ 0.3 (see the training sketch after this list)
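
The training sketch referenced in item 6 covers only the adversarial part of the schedule: four D updates per G update and uniform label smoothing on the adversarial targets. The Huber and identification (cross-entropy) terms are omitted, and all names (D, G, d_opt, g_opt, g_input) are placeholders rather than the repo's actual objects; D is assumed to end in a sigmoid so BCELoss applies.

import torch

D_STEPS_PER_G = 4
adv_loss = torch.nn.BCELoss()

def smoothed_labels(batch_size, real):
    # real label 1 -> U(0.7, 1.0), fake label 0 -> U(0.0, 0.3)
    lo, hi = (0.7, 1.0) if real else (0.0, 0.3)
    return torch.empty(batch_size, 1).uniform_(lo, hi)

def train_step(D, G, d_opt, g_opt, real_feats, g_input, step):
    # discriminator update (every step)
    d_opt.zero_grad()
    fake_feats = G(g_input).detach()
    d_loss = adv_loss(D(real_feats), smoothed_labels(len(real_feats), True)) \
           + adv_loss(D(fake_feats), smoothed_labels(len(fake_feats), False))
    d_loss.backward()
    d_opt.step()

    # generator update (every D_STEPS_PER_G-th step)
    if step % D_STEPS_PER_G == 0:
        g_opt.zero_grad()
        fake_feats = G(g_input)
        # the generator tries to make D score its output as real
        g_loss = adv_loss(D(fake_feats), torch.ones(len(fake_feats), 1))
        g_loss.backward()
        g_opt.step()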

======== uncertainties or differences from the paper ========

  1. weight/bias initialization: we use xavier_uniform for weights and zeros for biases.

  2. we use PyTorch's built-in Huber loss.

  3. for shorter wavs, the paper pads with zeros; we pad by repeating the feature (see the padding sketch after this list).

  4. the exact gated CNN architecture.

  5. we use webrtcvad mode 3 for VAD preprocessing.

  6. Paper error 1: we think the paper is missing a plus sign in formula (5).

  7. Paper error 2: we think the structure of conv6 in the Generator is wrong; the output channel count should be 64.
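
The padding sketch referenced in item 3, assuming (T, 64) fbank matrices and a 160-frame target; the helper name is illustrative.

import torch

def pad_to_length(feats, target=160):
    """feats: (T, 64) fbank matrix; returns a (target, 64) matrix."""
    if len(feats) >= target:
        return feats[:target]
    # paper-style alternative: torch.nn.functional.pad(feats, (0, 0, 0, target - len(feats)))
    repeats = -(-target // len(feats))          # ceil division
    return feats.repeat(repeats, 1)[:target]    # this repo: repeat the feature, then crop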
