
TAC's Issues

Could you please share the training script?

I am new to machine learning and want to improve by reproducing your work, but I have encountered many problems during training. Sorry to disturb you. Could you please share the training script if possible? It would help me a lot.

how to calculate SI-SNRi in multi-channel speech separation task

In the single-channel speech separation task, SI-SNRi is calculated as follows:
SI_SNR0 = calc_SISNR(source, estimate_source)
SI_SNR1 = calc_SISNR(source, mixture)
SI_SNRi = SI_SNR0 - SI_SNR1
But in the multi-channel speech separation task, what are the equivalents of source, estimate_source, and mixture?
Assume there are 2 speakers to be separated. Then source is the 2 raw speech audios, each containing 1 speaker, and estimate_source is the 2 separated audios, each containing 1 speaker.
If so, how should the SI-SNR between source and mixture be calculated, given that the mixture contains 2 channels of audio?

I hope I've made myself plain on this issue.
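For reference, here is a minimal sketch of one common convention for the multi-channel case: pick a reference microphone and compute the unprocessed baseline against that channel, averaging the improvement over speakers. The calc_sisnr helper, the choice of reference mic, and the use of per-speaker target signals are assumptions for illustration, not a protocol stated by the authors.

import numpy as np

def calc_sisnr(reference, estimate, eps=1e-8):
    # Scale-invariant SNR between two 1-D signals (zero-mean, then project).
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    s_target = np.dot(estimate, reference) / (np.dot(reference, reference) + eps) * reference
    e_noise = estimate - s_target
    return 10 * np.log10(np.sum(s_target ** 2) / (np.sum(e_noise ** 2) + eps))

def sisnri_multichannel(sources, estimates, mixture, ref_mic=0):
    # sources, estimates: (n_spk, T); mixture: (n_mic, T).
    # The unprocessed baseline is the chosen reference channel of the mixture.
    improvements = []
    for spk in range(sources.shape[0]):
        sisnr_sep = calc_sisnr(sources[spk], estimates[spk])
        sisnr_mix = calc_sisnr(sources[spk], mixture[ref_mic])
        improvements.append(sisnr_sep - sisnr_mix)
    return float(np.mean(improvements))

In practice the pairing between estimates and sources is usually resolved with a permutation-invariant criterion before the improvement is averaged.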

Use FasNet as Frontend

Hi,

I'm trying to use FasNet as a frontend to denoise and dereverberate the audio at the same time. I noticed that in create_dataset.py you save spk1_echoic_sig and spk2_echoic_sig as the labels, which I think refers to "Reverberant clean" in your paper (please correct me if I'm wrong). What should I do if I want to do dereverberation (like "Clean source" or "Mel-spectrogram" in your paper)?

To be more specific,

  1. Do I need to normalize or rescale the original audio? (I find that the energy of the original audio is much larger than that of the echoic one; the SI-SDR would even be positive when given the original audio.)
  2. What should I do to follow the shift-invariant training?
  3. If I want to learn the mel-spectrogram, what is the input, audio or mel-spectrogram? Do I need to flatten the last dimension to be compatible with the current loss function? (There will be one more dimension storing the mel-spectrogram features.)

how to understand the rescaling with snr in create_dataset.py

Hello,
I am a little confused about these operations:
"spk2 = spk2 / np.sqrt(np.sum(spk2**2)+1e-8) * 1e2" and
"noise = noise / np.sqrt(np.sum(noise**2)+1e-8) * np.sqrt(np.sum((spk1+spk2)**2)+1e-8)"
in create_dataset.py. Could you please explain why you multiply by "1e2" and by "np.sqrt(np.sum((spk1+spk2)**2)+1e-8)" at the end?
thanks
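A sketch of one reading of the quoted lines (an interpretation, not a verified excerpt of create_dataset.py): each speaker signal is rescaled to a fixed L2 norm of 1e2, and the noise is rescaled so that its energy matches the combined speech energy, i.e. roughly a 0 dB speech-to-noise ratio before any further SNR-dependent gain is applied.

import numpy as np

def unit_energy(x, eps=1e-8):
    # Rescale x to (approximately) unit L2 norm.
    return x / np.sqrt(np.sum(x ** 2) + eps)

# Toy signals standing in for real speech and noise.
spk1 = unit_energy(np.random.randn(16000)) * 1e2
spk2 = unit_energy(np.random.randn(16000)) * 1e2
noise = np.random.randn(16000)

# Match the noise energy to the total speech energy (about 0 dB SNR).
noise = unit_energy(noise) * np.sqrt(np.sum((spk1 + spk2) ** 2) + 1e-8)

mixture = spk1 + spk2 + noise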

Loss function

Hello
Can you please provide instructions on how to use the suggested loss function?
Is the loss used for training contained in the sdr.py file? If so, how should it be called? And if not, is there an implementation you can point to?
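For orientation, here is a minimal SI-SNR training loss sketch in PyTorch. Whether this matches the exact routine in sdr.py is an assumption; in the paper's setting it would normally be wrapped in permutation invariant training over the speakers.

import torch

def si_snr_loss(estimate, target, eps=1e-8):
    # estimate, target: (batch, T). Returns negative SI-SNR averaged over the batch.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    scale = torch.sum(estimate * target, dim=-1, keepdim=True) / \
            (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(torch.sum(s_target ** 2, dim=-1) /
                              (torch.sum(e_noise ** 2, dim=-1) + eps) + eps)
    return -si_snr.mean()

loss = si_snr_loss(torch.randn(4, 32000), torch.randn(4, 32000))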

how to generate the pkl?

Hi, I see that the data generation part needs to read the pkl file, but if I want to modify the room size and the number of mics, how do I generate the pkl? Does this mean I need to manually modify the pkl file you provided?
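If the pkl only stores simulation configurations, one way to experiment is to load it with the standard pickle module, inspect its structure, and write a modified copy. The file name below and the assumption that the file can be edited this way are illustrative, not taken from the repo.

import pickle

with open('configs.pkl', 'rb') as f:        # hypothetical file name
    configs = pickle.load(f)

print(type(configs))                         # inspect before assuming any structure
print(configs[0] if isinstance(configs, (list, tuple)) else configs)

# ...edit the loaded object (e.g. room size, mic count) as appropriate...

with open('configs_modified.pkl', 'wb') as f:
    pickle.dump(configs, f)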

How to use the dataset in the paper?

Hello, I am new to speech separation with Python. Can you tell me the details of using the dataset? I want to know how to feed the data during training, because I have seen the code that generates the data. The data naming format is 'spk1_mic'+str(mic+1)+'.wav', so how do I use the multi-channel information, and how should I set the batch size?
Thank you very much.
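A sketch of assembling one multi-channel training example: the (batch, n_mic, samples) layout is an assumption based on the forward(input, num_mic) interface discussed elsewhere in these issues, and the mixture file prefix is hypothetical.

import numpy as np
import soundfile as sf
import torch

def load_multichannel(prefix, n_mic):
    # Load 'prefix_mic1.wav' ... 'prefix_micN.wav' and stack to (n_mic, T).
    channels = [sf.read('%s_mic%d.wav' % (prefix, m + 1))[0] for m in range(n_mic)]
    return np.stack(channels, axis=0)

mixture = torch.from_numpy(load_multichannel('mixture', n_mic=6)).float()
batch = mixture.unsqueeze(0)            # (1, n_mic, T); a DataLoader builds larger batches
num_mic = torch.tensor([6.0])           # number of valid microphones per example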

can FasNet be used as a frontend for subsequent AEC

Hi Yi,

Great work. I've been following your research, and the model seems small and effective. I notice there are two tasks, ESE and ESS, but have you experimented with this beamformer for AEC (echo cancellation)? I suppose AEC is still substantially important in real scenarios.

Maybe this module could serve as a frontend for AEC, or be part of a larger network covering both AEC and denoising. Would it benefit AEC more than using the original noisy signal?

Any thoughts on that? I am working on far-field AEC and trying to integrate it into our network if possible. Looking forward to your reply.

BRs,

Yi

The network architecture related to dual-path RNN TasNet.

I tried to reimplement dual-path RNN TasNet by reading your paper.

DUAL-PATH RNN: EFFICIENT LONG SEQUENCE MODELING FOR TIME-DOMAIN SINGLE-CHANNEL SPEECH SEPARATION

Is the general structure published in this repository the same as dual-path RNN TasNet? I understand this repository is written for TAC.
There seem to be some improvements like gated outputs. Are these modules included in dual-path RNN TasNet?

TAC/FaSNet.py

Lines 15 to 21 in 96640a8

# gated output layer
self.output = nn.Sequential(nn.Conv1d(self.feature_dim, self.output_dim, 1),
                            nn.Tanh()
                            )
self.output_gate = nn.Sequential(nn.Conv1d(self.feature_dim, self.output_dim, 1),
                                 nn.Sigmoid()
                                 )
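These two branches are presumably combined as an elementwise product, which is the usual gated-output pattern; the snippet below is an illustrative, self-contained sketch rather than a verified excerpt of FaSNet.py.

import torch
import torch.nn as nn

feature_dim, output_dim = 64, 64
output = nn.Sequential(nn.Conv1d(feature_dim, output_dim, 1), nn.Tanh())
output_gate = nn.Sequential(nn.Conv1d(feature_dim, output_dim, 1), nn.Sigmoid())

features = torch.randn(2, feature_dim, 100)        # (batch, feature_dim, time)
gated = output(features) * output_gate(features)   # tanh branch modulated by a sigmoid gate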

Q about the data generation!

How do I generate the "original clean signal" as the training target when the distance from the speaker to the mic is not fixed?
I have tried to use h(f) = \alpha \exp(-j 2 \pi f d / c) (whose inverse transform gives an impulse response in the time domain) and fftconvolve it with the clean speech signal to obtain the direct-path label signal. However, my small Conv-TasNet does not work well with data generated this way. Can you give me some suggestions about this? Thank you very much.
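One way to realize the direct-path target described above is to apply the pure-delay transfer function in the frequency domain. The 1/r attenuation, the sampling rate, and the toy input below are assumptions for illustration, not the authors' data pipeline.

import numpy as np

def direct_path_target(clean, distance_m, fs=16000, c=343.0):
    # Delay and attenuate the clean signal according to the source-to-mic distance,
    # i.e. apply h(f) = alpha * exp(-j * 2 * pi * f * d / c) in the frequency domain.
    alpha = 1.0 / max(distance_m, 1e-3)             # simple 1/r attenuation (assumption)
    delay = distance_m / c                          # propagation delay in seconds
    out_len = len(clean) + int(round(delay * fs))
    n_fft = int(2 ** np.ceil(np.log2(out_len)))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    H = alpha * np.exp(-1j * 2 * np.pi * freqs * delay)
    return np.fft.irfft(np.fft.rfft(clean, n_fft) * H, n_fft)[:out_len]

target = direct_path_target(np.random.randn(16000), distance_m=1.5)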

num_mic array

Hi, I want to ask a question about the model's forward function. It has two inputs: one is the feature tensor called 'input', and the other is a tensor called 'num_mic'. I saw your example num_mic = torch.from_numpy(np.array([3, 2])).view(-1,).type(x.type()). Does that mean the batch contains two examples, one with 3 mics and another with 2 mics? Suppose I use an 8-channel dataset, should I set num_mic = np.array([8])? Thank you very much.
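One reading of this interface (not confirmed by the authors): num_mic stores the number of valid microphones for each utterance in the batch, so a batch of fixed 8-channel data would simply repeat 8.

import numpy as np
import torch

batch_size, n_mic, n_samples = 4, 8, 64000
x = torch.randn(batch_size, n_mic, n_samples)       # toy multi-channel batch
num_mic = torch.from_numpy(np.full(batch_size, 8)).view(-1,).type(x.type())
# est = model(x, num_mic)                            # model: the FaSNet/TAC instance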

how to run this project

Your work is awesome! I am interested in how to run this project with the model.
Would you mind adding more details about running these scripts?
