
Comments (5)

vivjay30 commented on August 29, 2024

Hi Yi, thanks for reaching out! We based the model on the Demucs architecture, which has a large number of parameters. It's actually quite likely that not all of these parameters are needed. In fact, it looks like they have released a light version of Demucs that is much smaller, so that would probably work. For the sake of research we went with the largest model, but it's true that this is probably not the most practical choice.

It should also be possible to use our idea to create the same cone-of-silence effect with a smaller network like Conv-TasNet. That would be an interesting follow-up, especially since Conv-TasNet is a much smaller and more practical network.


yluo42 commented on August 29, 2024

Hi Vivek,

Thanks for the clarification. I'm also curious about (1) the performance of the Demucs-only model without CoS, and (2) the effect of using different initial search regions. I think (1) is important for showing the value of CoS, and (2) can lead to a better understanding of both the number of iterations in the search procedure and the sensitivity to the initial choices. Have you done any ablations on these configurations?


vivjay30 commented on August 29, 2024

Regarding (1), this is actually a good question that we could add to the paper. If we look at a scenario with, for example, 2 voices and BG, then the performance of a simple multi-channel Demucs is similar to CoS. In fact, even TAC and Conv-TasNet are very close in performance, and if you look at the supplementary materials, you'll see that they outperform our method at lower sample rates. With that context, the key contribution of this work is not just improving separation quality but (1) making separation spatially grounded, so we choose whom to listen to, (2) moving beyond a network with a fixed number of output channels, which allows us to handle arbitrarily many speakers with a single network, and (3) performing explicit source localization with a separation network, which greatly outperforms a number of localization baselines.

Regarding (2), I don't think the initial search choice matters too much as long as you cover the entire search space. We start with 4 regions of 90 degrees, but you could start with 2 regions of 180 degrees, or just do a linear search at 2 or 5 degree resolution. The network is trained with more positives than negatives, so it ends up with a slightly larger window size than the target value. You can see this in the interactive demo on our website: sources often leak into neighboring regions. This is because we much prefer false positives, and searching more of the space, over false negatives that miss a source.
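
As a minimal sketch of that coarse-to-fine search (not the exact code in the repo), assuming a hypothetical `separate(mixture, center_deg, width_deg)` wrapper around the conditioned network that returns the extracted audio and a flag saying whether the cone contains a source:

```python
def cos_search(mixture, separate, start_regions=4, final_width=2.0):
    # Coarse-to-fine angular search: start with a few wide cones, keep
    # only the cones that contain a source, and halve the width each pass.
    width = 360.0 / start_regions                   # e.g. 4 cones of 90 degrees
    active = [width / 2 + i * width for i in range(start_regions)]
    active = [c for c in active if separate(mixture, c, width)[1]]
    while width > final_width:
        width /= 2
        survivors = []
        for center in active:
            for child in (center - width / 2, center + width / 2):
                _, has_source = separate(mixture, child % 360.0, width)
                if has_source:                      # prefer false positives here
                    survivors.append(child % 360.0)
        active = survivors
    return [(c, separate(mixture, c, width)[0]) for c in active]
```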


yluo42 commented on August 29, 2024

Thanks again for the detailed response! Sorry, I won't be registering for this year's NeurIPS, so I can only post my questions here rather than ask them at your presentation.

> With that context, the key contribution of this work is not just improving separation quality but (1) making separation spatially grounded, so we choose whom to listen to, (2) moving beyond a network with a fixed number of output channels, which allows us to handle arbitrarily many speakers with a single network, and (3) performing explicit source localization with a separation network, which greatly outperforms a number of localization baselines.

I fully agree with that. What I was trying to say is:

  • If CoS is an important factor for improving multi-channel separation in this framework (as the title is about CoS itself), then I would appreciate it if ablation results were available for the same model without CoS, or even with different backbone architectures (as CoS should be easily applicable to any model). I'm simply curious about the choice of Demucs as the backbone, given that Table 1 uses MC-TasNet and FaSNet-TAC as baselines.
  • There are other methods that use this space-splitting scheme for multi-channel separation, mainly with fixed beamformers (e.g. [1,2] below, which are not in your bibliography). The general idea is to pre-design a set of fixed beamformers that point to different angles in space, extract the sources arriving from those directions, and perform post-selection and post-enhancement on the beamformed outputs. The post-selection can keep more than one output, and the beampattern of the selected fixed beamformer can be used for SSL. I'd also like to learn how these systems compare to CoS (a minimal sketch of the idea follows this list).
  • There are also DOA-guided methods that use either an oracle DOA (or multimodal data) or DOA-like features to perform speech extraction (e.g. [3,4] below, which are also not in your bibliography). To compare with CoS, another possible pipeline is to first perform multi-source DOA estimation (which is definitely doable and can handle unknown numbers of sources), and then perform extraction for each estimated DOA. The entire procedure can also be done in an end-to-end fashion, bypassing some of the beamformer issues you mention in the paper (e.g. signal cancellation). I'm also curious how you differentiate CoS from such systems.
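
To make the fixed-beamformer idea above concrete, here is a minimal delay-and-sum bank; this is a sketch of the general scheme in [1,2], not the CoS method, and it assumes a simplified 2-D far-field model:

```python
import numpy as np

def delay_and_sum_bank(x, mic_xy, angles_deg, fs, c=343.0):
    # x: (n_mics, n_samples) mixture, mic_xy: (n_mics, 2) positions in meters.
    # Returns one beamformed signal per candidate look direction; a
    # post-selection step can then keep the highest-energy beams.
    n_mics, n = x.shape
    X = np.fft.rfft(x, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    outs = []
    for a in np.deg2rad(np.asarray(angles_deg, dtype=float)):
        d = np.array([np.cos(a), np.sin(a)])        # unit vector toward the source
        advances = mic_xy @ d / c                   # far-field arrival-time advances
        # Align the channels by compensating each advance with a phase shift.
        steer = np.exp(-2j * np.pi * freqs[None, :] * advances[:, None])
        outs.append(np.fft.irfft((X * steer).mean(axis=0), n=n))
    return np.stack(outs)                           # (n_angles, n_samples)
```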

> but you could start with 2 regions of 180 degrees, or just do a linear search at 2 or 5 degree resolution

I'm not sure whether a linear search at 2-5 degree resolution would create too many null outputs during training, forcing the model to collapse and always generate silent outputs. Have you done any experiments on this?

Two other questions come to mind:

  • You mention in Equation 1 that the BG can be either a point source or diffuse noise. In the case of a point-source noise located close to one speaker (e.g. a speaker holding a cellphone playing music), does this mean that the output of the model will not cancel out the noise?
  • I didn't find the reverberation conditions in the data simulation section. With strong reverberation (higher T60 values), the late reverberation components reflecting from a larger range of angles can be perceivable, while it seems that CoS uses only the accurate DOA (the angle of the direct-path signal) to select the proper training targets. Have you done a more detailed analysis of the model performance under different T60 ranges?

[1] Chen, Zhuo, et al. "Efficient integration of fixed beamformers and speech separation networks for multi-channel far-field speech separation." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[2] Chen, Zhuo, et al. "Cracking the cocktail party problem by multi-beam deep attractor network." 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.
[3] Gu, Rongzhi, et al. "Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information." Interspeech. 2019.
[4] Xu, Yong, et al. "Neural Spatio-Temporal Beamformer for Target Speech Separation." arXiv preprint arXiv:2005.03889 (2020).


vivjay30 commented on August 29, 2024

CoS should be easily applied to any model: This is true, and it's actually something we hope people explore in follow-up work. The pre-shift idea would work with any network. The global conditioning might have to be modified for a specific architecture, but it should also carry over. We chose Demucs because it was a recent work that showed promising results on 44.1 kHz separation, but we also strongly considered using Conv-TasNet as the backbone with the pre-shift and global conditioning for spatial separation. I agree that an ablation study would be interesting.
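
For intuition, here is a minimal sketch of the pre-shift step (simplified to a far-field model with integer-sample circular shifts; the actual implementation in the repo may differ):

```python
import torch

def preshift(x, mic_xy, angle_deg, fs, c=343.0):
    # x: (n_mics, n_samples) tensor, mic_xy: (n_mics, 2) positions in meters.
    # Delays each channel so that a far-field source at angle_deg lines up
    # across channels before the mixture is fed to any separation backbone.
    a = torch.deg2rad(torch.tensor(float(angle_deg)))
    d = torch.stack([torch.cos(a), torch.sin(a)])
    advances = (mic_xy @ d) / c                   # seconds each mic hears early
    shifts = torch.round(advances * fs).long()    # integer-sample approximation
    return torch.stack([torch.roll(ch, int(s)) for ch, s in zip(x, shifts)])
```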

first perform multi-source DOA estimation: It's true that you could first estimate the DOA of every source and then separate those regions. From a research perspective, it is nice that we can do everything with a single network. This could also be useful given memory constraints on real devices that might not support two separate networks. The explicit window-size parametrization also allows us to separate moving speakers, which would be more challenging with a two-step approach.
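
For contrast, the two-step pipeline under discussion would look roughly like this, where `estimate_doas` and `extract` are hypothetical stand-ins for the two networks, not functions from this repo:

```python
def two_step_separation(mixture, estimate_doas, extract):
    # Hypothetical two-step pipeline: a multi-source DOA estimator followed
    # by one extraction pass per estimated direction (two models in memory).
    return [(doa, extract(mixture, doa)) for doa in estimate_doas(mixture)]
```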

a linear search at 2-5 degree resolution will create too many null outputs: This problem actually occurs even when we train with different window sizes. At small degree resolutions we use a weighted loss; otherwise the model only produces silence. A linear search at 2-5 degree resolution during inference is independent of how we train the model, as long as the model can do that kind of separation.
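
One way to realize such a weighting is sketched below; the exact loss and weight values used in the repo may differ:

```python
import torch
import torch.nn.functional as F

def weighted_sep_loss(pred, target, silent_weight=0.1, eps=1e-8):
    # pred, target: (batch, n_samples) waveforms. Silent (all-zero) targets
    # are downweighted so the model doesn't collapse to always outputting
    # silence when most training cones contain no source.
    per_example = F.l1_loss(pred, target, reduction="none").mean(dim=-1)
    is_silent = target.abs().amax(dim=-1) < eps
    weights = torch.where(is_silent,
                          torch.full_like(per_example, silent_weight),
                          torch.ones_like(per_example))
    return (weights * per_example).sum() / weights.sum()
```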

point-source noise with its location close to one speaker: The model is still trained to output only the voice. However, in this scenario the model has no spatial information to exploit, so the output is essentially the same as running a single-channel Demucs.

model performance under different T60 ranges: We'll add these details to the supplementary material of the paper. In the synthetic experiments, the absorption parameter was varied significantly (see generate_dataset.py). In the synthetic and real environments we tried, it worked well enough to train based only on the true DOA; the reflections don't create "phantom" sources that get picked up by the search.
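
As an illustration of varying reverberation via the absorption parameter with pyroomacoustics (the actual ranges, room sizes, and array geometry in generate_dataset.py may differ):

```python
import numpy as np
import pyroomacoustics as pra

def simulate_example(voice, fs, rng=None):
    # Illustrative only: sample the wall absorption per example so the
    # training data covers a wide range of reverberation levels.
    rng = rng or np.random.default_rng()
    absorption = rng.uniform(0.1, 0.99)           # low absorption -> long T60
    room = pra.ShoeBox([6.0, 6.0, 2.5], fs=fs,
                       materials=pra.Material(absorption), max_order=10)
    room.add_source([2.0, 3.0, 1.5], signal=voice)
    mic_xy = pra.circular_2D_array(center=[3.0, 3.0], M=6, phi0=0, radius=0.0725)
    R = np.vstack([mic_xy, 1.5 * np.ones(6)])     # lift the array to z = 1.5 m
    room.add_microphone_array(pra.MicrophoneArray(R, room.fs))
    room.simulate()
    return room.mic_array.signals                 # (n_mics, n_samples)
```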

