percepnet's Introduction

PercepNet

Unofficial implementation of PercepNet: A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech described in https://arxiv.org/abs/2008.04259

https://www.researchgate.net/publication/343568932_A_Perceptually-Motivated_Approach_for_Low-Complexity_Real-Time_Enhancement_of_Fullband_Speech

Todo

  • pitch estimation
  • Comb filter
  • ERBBand c++ implementation
  • Feature(r,g,pitch,corr) Generator(c++) for pytorch
  • DNNModel pytorch
  • DNNModel c++ implementation
  • Pretrained model
  • Postfiltering (done by @TeaPoly )

Requirements

  • CMake
  • Sox
  • Python>=3.6
  • Pytorch

Prepare sampledata

  1. Download and synthesize the DNS-Challenge 2020 dataset before executing utils/run.sh for training:
git clone -b interspeech2020/master https://github.com/microsoft/DNS-Challenge.git
  2. Follow the usage instructions in the DNS-Challenge repo (https://github.com/microsoft/DNS-Challenge) on the interspeech2020/master branch. Modify the save directories in DNS-Challenge/noisyspeech_synthesizer.cfg to point to sampledata/speech and sampledata/noise respectively (see the sketch below).
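For reference, a sketch of the cfg change; the section header and key names here are assumptions (they may differ between DNS-Challenge branches), so verify them against your copy of noisyspeech_synthesizer.cfg:

[noisy_speech]
# assumed key names -- check your branch's cfg
speech_dir: /path/to/PercepNet/sampledata/speech
noise_dir: /path/to/PercepNet/sampledata/noise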

Build & Training

This repository is tested on Ubuntu 20.04 (WSL2).

  1. Set up the CMake build environment
sudo apt-get install cmake
  2. Make a binary directory and build
mkdir bin && cd bin
cmake ..
make -j
cd ..
  3. Generate features for training with the sample data
bin/src/percepNet sampledata/speech/speech.pcm sampledata/noise/noise.pcm 4000 test.output
  4. Convert the output binary to h5
python3 utils/bin2h5.py test.output training.h5
  5. Train by running utils/run.sh
cd utils
./run.sh
  6. Dump the weights from PyTorch to a C++ header
python3 dump_percepnet.py model.pt
  7. Inference
cd bin
cmake ..
make -j1
cd ..
bin/src/percepNet_run test_input.pcm percepnet_output.pcm

Acknowledgements

@jasdasdf, @sTarAnna, @cookcodes, @xyx361100238, @zhangyutf, @TeaPoly, @rameshkunasi, @OscarLiau, @YangangCao, Jaeyoung Yang

IIP Lab. Sogang Univ

Reference

https://github.com/wil-j-wil/py_bank

https://github.com/dgaspari/pyrapt

https://github.com/xiph/rnnoise

https://github.com/mozilla/LPCNet

percepnet's People

Contributors

cookcodes, jaehyun-ko, jzi040941, teapoly, tsoliver


percepnet's Issues

RSS+NS based on PercepNet

Hi all,

I hope to discuss RSS+NS based on PercepNet. I want to combine the linear AEC part from WebRTC with the PercepNet implementation to improve the RSS/NS of WebRTC.

The padding in rnn_train.py

Dear Noah,
I noticed that in lines 23 & 24 of rnn_train.py, the final dimension of the data stays the same, but the parameter values must be different (I'm not sure whether this has an impact).
nn.Conv1d has a 'padding' parameter that can solve this problem, for example:
self.conv1 = nn.Conv1d(128, 512, 5, stride=1, padding=2)
self.conv2 = nn.Conv1d(512, 512, 3, stride=1, padding=1)

Tip: padding = (kernel_size - 1) / 2 for odd kernel sizes.
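A minimal check (hypothetical input shape) that this "same" padding preserves the time dimension:

import torch
import torch.nn as nn

x = torch.randn(1, 128, 2000)  # (batch, channels, time), hypothetical shape
conv1 = nn.Conv1d(128, 512, 5, stride=1, padding=2)
conv2 = nn.Conv1d(512, 512, 3, stride=1, padding=1)
print(conv2(conv1(x)).shape)   # torch.Size([1, 512, 2000]) -- time length kept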

comb filter M = 3 and PITCH_MAX_PERIOD = 768: how to meet the 40 ms look-ahead requirement?

In the paper:
"To achieve 40 ms look-ahead including the 10-ms overlap, we use M = 3"
A 40 ms look-ahead means there are only 30 ms of data to shift for the comb filter in the time domain:
30 ms of data = 3 * 480 = 1440 samples.
With M = 3 and PITCH_MAX_PERIOD = 768, the maximum shift is 3 * 768 = 2304 samples.

Now in the code, FRAME_LOOKAHEAD is set to 5, and 5 * 480 = 2400 > 2304 = 3 * 768, but this leads to 60 ms of look-ahead.

In the paper "Low-Complexity, Real-Time Joint Neural Echo Control and Speech Enhancement Based On PercepNet", M is changed to 2. But even with M = 2, 2 * 768 = 1536 > 3 * 480 = 1440.

So what are the correct values of M and PITCH_MAX_PERIOD to meet the 40 ms look-ahead requirement?
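A small sketch of the arithmetic (constants as used in this repo; 480 samples = 10 ms at 48 kHz):

FRAME_SIZE = 480                              # 10 ms at 48 kHz
PITCH_MAX_PERIOD = 768
COMB_M = 3
FRAME_LOOKAHEAD = 5

max_shift = COMB_M * PITCH_MAX_PERIOD         # 2304 samples the comb filter may reach ahead
budget_40ms = 3 * FRAME_SIZE                  # 1440 samples available with 40 ms look-ahead
budget_code = FRAME_LOOKAHEAD * FRAME_SIZE    # 2400 samples with FRAME_LOOKAHEAD = 5
print(max_shift <= budget_40ms)               # False -- hence the question
print(max_shift <= budget_code)               # True, at the cost of extra latency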

Lookahead in compute_rnn (c++)

Hi Noh,

We queue a buffer when training the model (maybe 500 frames), but the function compute_rnn looks like a frame-in, frame-out flow.
My question is: is there no need to prepare a buffer for inference?

Thanks,
Aaron

Am I using dump_percepnet.py right?

Dear Noah,
Thanks for sharing.
Following #4, I have finished training and obtained the model file using sampledata (model.pt, 30.3 MB).
Q1: Is the file size correct?
I ran: python3 ./dump_percepnet.py model.pt tmpList/a.c
and got:
printing layer fc
weight: Parameter containing:
tensor([[-5.8342e-02, 8.4117e-02, -1.8991e-02, ..., -1.0439e-01,
-3.0405e-02, 3.7125e-02],
[-8.3928e-02, -8.2344e-02, -9.2069e-02, ..., 1.8947e-02,
-1.1299e-01, -6.5784e-02],
[ 3.6998e-02, 8.9760e-02, 1.7038e-02, ..., 5.5876e-02,
8.1813e-02, 1.0908e-01],
...,
[-2.4296e-02, -1.0941e-02, -7.2806e-02, ..., 1.5993e-02,
-5.7701e-02, -1.0907e-01],
[-3.3082e-02, -9.1393e-02, -1.0323e-01, ..., -9.3106e-02,
7.7872e-02, -8.4516e-02],
[-3.9096e-02, 5.6298e-02, -4.1803e-02, ..., -5.2403e-02,
-4.0629e-02, 2.0898e-05]], requires_grad=True)
printing layer conv1
printing layer conv2
printing layer gru1
printing layer gru2
printing layer gru3
printing layer gru_gb
printing layer gru_rb
printing layer fc_gb
weight: Parameter containing:
tensor([[-0.0119, -0.0091, 0.0048, ..., -0.0063, 0.0110, -0.0173],
[-0.0055, 0.0052, -0.0083, ..., -0.0027, 0.0184, -0.0007],
[ 0.0111, 0.0031, 0.0160, ..., -0.0148, 0.0004, 0.0086],
...,
[-0.0202, 0.0177, 0.0110, ..., -0.0202, 0.0173, 0.0023],
[-0.0017, -0.0150, -0.0045, ..., 0.0106, 0.0158, 0.0015],
[-0.0185, 0.0009, 0.0129, ..., 0.0045, 0.0028, 0.0105]],
requires_grad=True)
printing layer fc_rb
weight: Parameter containing:
tensor([[ 0.0813, -0.0757, 0.0472, ..., 0.0742, -0.0321, 0.0692],
[ 0.0574, 0.0049, 0.0802, ..., 0.0282, 0.0149, 0.0733],
[ 0.0457, 0.0489, -0.0813, ..., 0.0040, 0.0310, 0.0222],
...,
[ 0.0067, -0.0674, 0.0267, ..., -0.0824, 0.0025, 0.0248],
[-0.0164, -0.0548, 0.0088, ..., 0.0619, -0.0342, 0.0319],
[ 0.0752, 0.0771, 0.0405, ..., 0.0106, -0.0278, 0.0479]],
requires_grad=True)
Q2: Does this mean it succeeded?

I got a.c (178 MB), but the file is not finished yet:
/* This file is automatically generated from a Pytorch model */

#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#include "nnet.h"
#include "nnet_data.h"

static const float fc_weights[8960] = {
……
const DenseLayer fc_gb = {
fc_gb_bias,
fc_gb_weights,
2560, 34, ACTIVATION_SIGMOID
};

static const float fc_rb_weights[4352] = {
……
The file ends without a closing '}'.

Q3: Is this a dump_percepnet.py error, or an error in my data and training process?

Hope to get your reply, thanks!

about power_noise_attenuation

I'm curious about the calculation of power_noise_attenuation, and I can't find a formula for this value in the paper. Can you explain how this value is obtained in your code?

about pitch coherence

hello,
In your code:
Exp[i] = Exp[i]/sqrt(1e-15+Ex[i]*Ep[i]);
while bandE[i] = sqrt(sum[i]) in the function compute_band_corr.

I think the code should be:
if (Ex[i]*Ep[i] == 0)
    Exp[i] = 0; /* (or 1?) */
else
    Exp[i] = Exp[i]/(1e-15+Ex[i]*Ep[i]);
Only this makes the pitch coherence Exp[i] equal 1 exactly when the signal X and its periodic component P are the same.
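A small numpy sketch of the point (hypothetical band signals; since bandE already stores square roots, dividing by the product rather than its square root bounds the coherence by 1 via Cauchy-Schwarz):

import numpy as np

x = np.random.randn(960)   # band signal (hypothetical)
p = 0.9 * x                # a perfectly correlated periodic component

# Mimic compute_band_energy/compute_band_corr: bandE stores sqrt of the sums.
Ex = np.sqrt(np.sum(x * x))
Ep = np.sqrt(np.sum(p * p))
Exp = np.sum(x * p)

print(Exp / np.sqrt(1e-15 + Ex * Ep))  # current code: far above 1
print(Exp / (1e-15 + Ex * Ep))         # suggested: 1.0 for identical signals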

loss increases and becomes nan

Hi, thanks for your excellent work.
I extracted features from speech (pcm, 12 GB) and noise (pcm, 9 GB), and set count to 10000000. Then I ran run_train.py and got the following output:

(screenshot: the training loss increases and eventually becomes nan)

Can you help me? Thanks again!

Data Creator (c++) for pytorch

Hi jzi040941! Thanks for sharing your code with us!
May I ask where I can find the code for the Data Creator (C++) for PyTorch?

The step of pitch filtering

Hi Noah,

I have been working on the PercepNet project recently, and I have read the code in denoise.cpp.
I am not sure about the "comb filtering" code below:

for (k = -COMB_M; k < COMB_M+1; k++) {
    for (i = 0; i < WINDOW_SIZE; i++)
        p[i] += st->comb_buf[COMB_BUF_SIZE - FRAME_SIZE*(FRAME_LOOKAHEAD) - WINDOW_SIZE - pitch_index*k + i]
                * common.comb_hann_window[k + COMB_M];
}

I was wondering how the pitch is computed against the previous frames here, and how this works?
Thank you!

Calculation of Pitch period

Thanks for this great job! I have a question about the input features of the DNN model.
In the original PercepNet paper, the pitch period T is used as a feature at the DNN input. In this implementation (denoise.cpp) it is calculated as (float)noisy->last_period/(PITCH_MAX_PERIOD-3*PITCH_MIN_PERIOD), while in the RNNoise implementation it is calculated as .01*(pitch_index-300). What is the difference between these two normalizations, and could you please explain the reasoning behind them? Thanks a lot.
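For a concrete comparison, a sketch (PITCH_MIN_PERIOD = 60 and PITCH_MAX_PERIOD = 768 are RNNoise's 48 kHz constants, assumed here):

PITCH_MIN_PERIOD, PITCH_MAX_PERIOD = 60, 768   # RNNoise constants at 48 kHz (assumed)

def percepnet_feature(last_period):
    return last_period / (PITCH_MAX_PERIOD - 3 * PITCH_MIN_PERIOD)   # divide by 588

def rnnoise_feature(pitch_index):
    return 0.01 * (pitch_index - 300)

for period in (60, 300, 768):
    print(period, round(percepnet_feature(period), 3), rnnoise_feature(period))
# Both are affine rescalings of the pitch period; they differ only in the
# offset and scale chosen to center/range the feature before the DNN.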

The step of compute_rnn function in rnn.c

Dear Noah,
Sorry to trouble you again.

Why is the input to the first GRU layer convout_buf rather than second_conv1d_out? I am confused that all the places use convout_buf as the input.

Another point:

#define MAX_NEURONS 128

/* concat for rb gru */
for (i = 0; i < MAX_NEURONS; i++) rb_gru_input[i] = rnn->convout_buf[i];
for (i = 0; i < MAX_NEURONS; i++) rb_gru_input[i + MAX_NEURONS] = rnn->gru3_state[i];
compute_gru(rnn->model->rb_gru, rnn->rb_gru_state, rb_gru_input);

This input length is inconsistent with the training script.

PESQ score comparison

If we use the RNNoise code to generate band-gain labels and reconstruct the signal from them, the PESQ score of the reconstruction is not good. Is there any way to increase the PESQ score for the labels?

pretrain model for test

Thank you very much for this good implementation of PercepNet. Can you upload your pretrained model?

Pitch filter output is not good

I have verified the pitch filter output: I did not observe noise reduction after the pitch filter, and the speech in the pitch filter output was distorted.

Saving the test file and converting to .wav

Hi, I have run your latest code and generated the output file.
Then I used sox to convert the 16-bit PCM output to a 48 kHz WAV file.
The converted enhanced signal does not sound good. To find the reason, I fed the STFT of the clean speech X into the frame_synthesis() function and generated clean speech using only the STFT and ISTFT operations. I converted it to WAV the same way and it still does not sound good. So I ran a simple test that ignores the zero frames generated in the first 6 steps, and that WAV sounds good. I therefore suggest adding
if (count > 5)
before saving the calculated features and output files.

Thanks!

Is this loss value correct?

I ran rnn_train.py and got the model's loss values below.
Are these values correct?
[1, 1] loss: 111971.305
[2, 1] loss: 95036.414
[3, 1] loss: 83421.258
[4, 1] loss: 75540.383
[5, 1] loss: 69769.328
[6, 1] loss: 65297.250
[7, 1] loss: 61768.656
[8, 1] loss: 59000.102
[9, 1] loss: 56840.609
[10, 1] loss: 55172.516
[11, 1] loss: 53897.754
[12, 1] loss: 52909.832
[13, 1] loss: 52113.910
[14, 1] loss: 51442.695
[15, 1] loss: 50844.719
...
[86, 1] loss: 33483.242
[87, 1] loss: 33400.211
[88, 1] loss: 33319.367
[89, 1] loss: 33241.418
[90, 1] loss: 33166.523
[91, 1] loss: 33093.812
[92, 1] loss: 33022.258
[93, 1] loss: 32951.094
[94, 1] loss: 32880.258
[95, 1] loss: 32810.066
[96, 1] loss: 32740.826
[97, 1] loss: 32672.703
[98, 1] loss: 32605.770
[99, 1] loss: 32539.883
[100, 1] loss: 32475.033
Finished Training
save model
If they are not correct, please give a reference value!
Thank you!

pitch correlation is too large

I found that FP16 training is hard to converge because the pitch correlation result is too large:

*pitch_corr = xcorr[best_pitch[0]];

But in Jean-Marc Valin's paper "A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet", the pitch correlation is between 0 and 1:

"LPCNet is designed to operate with 10-ms frames. Each frame includes 18 cepstral coefficients, a pitch period (between 16 and 256 samples), and a pitch correlation (between 0 and 1)."

I think this can be improved.
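One common remedy (my sketch, not this repo's code) is to normalize the cross-correlation by the energies of the frame and its delayed copy, which bounds the result to [-1, 1]:

import numpy as np

def normalized_pitch_corr(x, x_delayed):
    # Normalized cross-correlation; Cauchy-Schwarz bounds this to [-1, 1].
    num = np.dot(x, x_delayed)
    den = np.sqrt(np.dot(x, x) * np.dot(x_delayed, x_delayed)) + 1e-15
    return num / den

frame = np.random.randn(480)
print(normalized_pitch_corr(frame, frame))   # 1.0 for a perfectly periodic signal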

Out of bound in compute_frame_features()

Thanks for sharing your code!
I have run the demo to generate features with the speech and noise PCM files in sampledata. But I found that this code in the function compute_frame_features:

for (k = -COMB_M; k < COMB_M+1; k++) {
    for (i = 0; i < WINDOW_SIZE; i++)
        p[i] += st->comb_buf[COMB_BUF_SIZE - FRAME_SIZE*(COMB_M) - WINDOW_SIZE - pitch_index*k + i]
                * common.comb_hann_window[k + COMB_M];
}

may have an out-of-bounds problem. The index

COMB_BUF_SIZE - FRAME_SIZE*(COMB_M) - WINDOW_SIZE - pitch_index*k + i

can become larger than COMB_BUF_SIZE when pitch_index is large enough and k is less than zero (I have observed this).

Could you please check this and tell me if I have made a mistake.
Thank you again!

problem about calculation of Pitch

Hi,
In your code:
float T = noisy->last_period/(PITCH_MAX_PERIOD-3*PITCH_MIN_PERIOD);
Because both operands are integers, this performs integer division, which makes T either 1 or 0.

float T = (float)noisy->last_period/(PITCH_MAX_PERIOD-3*PITCH_MIN_PERIOD);
casts to float first and computes T correctly.
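To see the truncation, a sketch (Python's // mimics C's integer division for these positive values; PITCH_MIN_PERIOD = 60 and PITCH_MAX_PERIOD = 768 are assumed from RNNoise):

PITCH_MIN_PERIOD, PITCH_MAX_PERIOD = 60, 768
den = PITCH_MAX_PERIOD - 3 * PITCH_MIN_PERIOD   # 588

for last_period in (60, 300, 588, 768):
    print(last_period // den,    # C-style integer division: 0, 0, 1, 1
          last_period / den)     # float division: the intended feature value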

about ERB band

Hello, Mr. Noh,

I found that PercepNet uses triangular filters on the ERB band scale, and you have already caught that point and modified your repo.
(I found this while reading the Personalized PercepNet paper; in section 2 the authors say they used triangular filters in PercepNet.)

I think we can read the values of the ERB subbands off the figure in this post:
https://www.amazon.science/blog/how-amazon-chimes-challenge-winning-noise-cancellation-works

My estimate is:
eband5ms[33] = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 23, 26, 29, 33, 37, 41, 46, 52, 58, 64, 71, 81, 91, 103, 117, 133, 154, 177, 206, 241, 283, 336, 400}
In this array, one unit means 50 Hz. It may differ slightly from the truth, but it should work properly (checked in the sketch below).
I started from two assumptions to get this array:

  1. Each subband border is a multiple of 50 Hz: you can guess this from the implementation of the Bark scale in RNNoise, which uses a slightly modified Bark scale so that every subband border is a multiple of 50 Hz and every triangular filter has its min/max on the subband borders.

  2. Every subband is at least 100 Hz wide, to avoid containing just a single frequency bin.

Finally, I counted dots for an hour and got these values (from 0 to 5000 Hz there are about 280 dots).
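To sanity-check the estimate, a small sketch converting the array to Hz under the same one-unit-equals-50-Hz assumption:

eband5ms = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 23, 26, 29, 33, 37, 41,
            46, 52, 58, 64, 71, 81, 91, 103, 117, 133, 154, 177, 206,
            241, 283, 336, 400]

edges_hz = [50 * b for b in eband5ms]                    # one unit = 50 Hz
widths_hz = [hi - lo for lo, hi in zip(edges_hz, edges_hz[1:])]

print(edges_hz[0], edges_hz[-1])   # 100 20000 -- spans up to 20 kHz (fullband)
print(min(widths_hz))              # 100 -- consistent with assumption 2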
I hope that it can help you.

Regards,
Jaeyoung

missing argument:activation

When I dump weights with dump_percepnet.py, I get a "missing argument: activation" error at this line. It seems that for the dense or conv1d layer there should be an activation function name, shouldn't there?

How to train and infer model with sampleData

Hi, I'm struggling with your code: I can't train the model with the given sampleData. These are the steps I have done:

  1. Cloned your repository.
  2. Downloaded the sample data (VCTK for speech and DEMAND for noise), then copied all the audio files (.wav) into the speech and noise folders.
  3. Made the binary directory, built, ran feature generation, and converted the output binary to h5 following your README instructions.
  4. Ran the run.sh script in the utils folder.

However, it raises FileNotFoundError: No such file or directory: '/home/myname/code/PercepNet/utils/../training_set_sept12_500h/features/train.txt'
I also tried to create this file manually by running split_feature_dataset.py with the path to the sampledata directory, but it generated two blank files, train.txt and dev.txt. What have I done wrong? Thank you.

PercepNet

Hi,
This is good work, and I am looking forward to your completion of this meaningful project.
Keep it up!

Reading file error in train()

On the Windows platform, in denoise.cpp, train():
f1 = fopen(argv[1], "r"); f2 = fopen(argv[2], "r");
is used to read the noise and speech files (pcm or raw). But the mode should be set to "rb": in text mode on Windows, CRLF translation and the 0x1A end-of-file marker corrupt binary reads.

f1 = fopen(argv[1], "rb"); f2 = fopen(argv[2], "rb");

About data

Hi:
I would like to ask what 'count' is appropriate for feature generation if I use 120 hours of speech and 80 hours of noise as training data, per the paper?
Thanks!

about rnn_train.py

Hi, I have some questions about rnn_train.py. When I follow the README and try to run rnn_train.py, it says that four arguments are required: --train_filelist_path, --dev_filelist_path, --out_dir, and --config. How should I set these four arguments? For example, is train_filelist_path the path to a txt file listing the raw audio files I want to train on? Is dev_filelist_path the path to a txt file listing the dev audio files? Is out_dir just an empty directory? And how should I set config?

Another problem: when I follow the README, after executing mkdir bin && cd bin and then cmake .., a warning is shown: [WARNING] nnet_data.cpp is not exist. Do not generate inference executable...
Does this need to be resolved?

Thank you for your help

about window_size

hello,
I'm confused about window_size in the training file. Why do you reshape the input data from (2000, 138) to (2000/window_size, window_size, 138)? Can window_size take any value?
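A sketch of the reshape in question (shapes taken from the issue; window_size is not arbitrary, it must divide the number of frames for the reshape to be valid):

import torch

num_frames, num_feats, window_size = 2000, 138, 500   # hypothetical window_size
x = torch.randn(num_frames, num_feats)

assert num_frames % window_size == 0   # otherwise the reshape fails
seqs = x.reshape(num_frames // window_size, window_size, num_feats)
print(seqs.shape)                      # torch.Size([4, 500, 138])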

Discussion about PercepNet

I got an email from Yuyung Liau, who wanted me to open a discussion as a GitHub issue.
Since I have received a few emails about implementing PercepNet, I think it is better to share with more people rather than replying one by one, so I opened this issue.
Any kind of discussion about PercepNet is welcome!

Quantisation

Hello! I found that the weights are dumped in float32 format, which significantly impacts the inference speed of the model on CPU: in my test, processing 10 ms of audio takes 25 ms. Could you please clarify whether you have tested weight quantisation like the one implemented in the rnnoise model?
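For a quick experiment on the PyTorch side (a sketch, not something this repo ships; the C++ inference path uses its own dumped weights, so this only measures the Python model), dynamic int8 quantization of the dense layers is the usual first step:

import torch
import torch.nn as nn

# Hypothetical stand-in for the PercepNet DNN's dense layers.
model = nn.Sequential(nn.Linear(138, 512), nn.ReLU(), nn.Linear(512, 68))

# Converts Linear weights to int8 with dynamic activation quantization.
# (On recent PyTorch versions nn.GRU can be added to the module set too.)
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(qmodel)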

Out of bound in erb_band->nfftborder[i+1]?

ERBBand *erb_band = new ERBBand(WINDOW_SIZE, NB_BANDS-2, 0 /*low_freq*/, 20000 /*high_freq*/);

void compute_band_energy(float *bandE, const kiss_fft_cpx *X) {
    int i;
    float sum[NB_BANDS] = {0};
    for (i = 0; i < NB_BANDS; i++)
    {
        int j;
        int band_size;
        band_size = (erb_band->nfftborder[i+1] - erb_band->nfftborder[i]);

Here, when i is NB_BANDS-1, erb_band->nfftborder[i+1] will access erb_band->nfftborder[NB_BANDS], but in class ERBBand the size of nfftborder is NB_BANDS.

about loss = nan

Hi,
I have met the loss=nan problem too. Here is my solution.
Because the loss function is (sqrt(g_ground) - sqrt(g_hat))^2, the gradient becomes nan when g_hat is 0 (the derivative of the square root is unbounded at 0). The code below may fix the problem:

class CustomLoss(nn.Module):
    ...
    def forward(...):
        ...
        rb = targets[:, :, 34:68]

        # try to avoid nan: below a small threshold, replace pow() with a
        # linear ramp so the gradient stays finite at the singular point
        mask = gb_hat < 0.0003
        gamma_gb_hat = torch.FloatTensor(gb_hat.size()).type_as(gb_hat)
        gamma_gb_hat[mask] = 1290 * gb_hat[mask]
        mask = gb_hat >= 0.0003
        gamma_gb_hat[mask] = torch.pow(gb_hat[mask], gamma)

        mask = (1 - rb_hat) < 0.0003
        gamma_rb_hat = torch.FloatTensor(rb_hat.size()).type_as(rb_hat)
        gamma_rb_hat[mask] = 1290 * (1 - rb_hat[mask])
        mask = (1 - rb_hat) >= 0.0003
        gamma_rb_hat[mask] = torch.pow(1 - rb_hat[mask], gamma)

        return torch.mean(torch.pow(torch.pow(gb, gamma) - gamma_gb_hat, 2)) \
             + C4 * torch.mean(torch.pow(torch.pow(gb, gamma) - gamma_gb_hat, 4)) \
             + torch.mean(torch.pow(torch.pow(1 - rb, gamma) - gamma_rb_hat, 2))
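An alternative sketch of the same idea (my suggestion, not from the issue above): clamp the argument away from the singular point before taking the power, which avoids the masking entirely:

import torch

def safe_pow(x: torch.Tensor, gamma: float, eps: float = 1e-6) -> torch.Tensor:
    # clamp keeps d/dx x**gamma finite at x = 0 when gamma < 1
    return torch.pow(torch.clamp(x, min=eps), gamma)

gb_hat = torch.zeros(4, requires_grad=True)
safe_pow(gb_hat, 0.5).sum().backward()
print(gb_hat.grad)   # all zeros here, but finite -- no nan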
