Comments (2)
Hi @space-pope, thanks for your interest!
I think the prior context for that passage might help with understanding:
In preliminary experiments, we used features from later layers (22, 24, and the mean of the last several layers), which perform well on linear phone recognition tasks [6]. The idea was to improve nearest neighbors mapping by including more content information. However, this led to worse pitch and energy reconstruction. Recent work [15] confirms that later layers give poorer predictions of pitch, prosody, and speaker identity. Based on these observations, we found that using a layer with high correlation with speaker identification (layer 6 in WavLM-Large) was necessary for good speaker similarity and retention of the prosody information from the source utterance.
Here, the two references are meant to be understood together: [6] analyses several tasks for both WavLM-Base+ and WavLM-Large, while [15] considers additional tasks but only analyses the layer-wise contributions of WavLM-Base. From [6] we can see that there is an extremely strong correlation between WavLM-Base+ and WavLM-Large: if the last few layers of WavLM-Base are highly weighted for a task, we can expect the last few layers of WavLM-Large to also be highly weighted for the same task.
From [15] we know that the later layers of WavLM-Base perform poorly on pitch and energy reconstruction (important aspects of prosody). Taken together with [6] (what we referred to as 'these observations'), we can infer that the later layers of WavLM-Large will also struggle with pitch and energy reconstruction.
And, as you mention and as hinted in the passage, in preliminary experiments we did try several other layers and found layer 6 to be the best of the settings we tested -- but the other layers also perform reasonably (i.e. none of the layers are completely unusable). While we are not fully certain of the reason for this, we suspect it is because of the high weight layer 6 has for speaker identification [6]. The earlier layers might give better pitch and energy reconstruction, but they are lower-level features, and so yield slightly more artifacts after the k-means operation during vocoding. That is, we suspect that if pitch is too readily available, then the effective shuffling of WavLM frames after the k-means operation (which distorts the pitch information between adjacent frames) causes the output pitch contour to also be more distorted. However, there is much more room for investigation here, as we are not certain of all the effects at play.
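For intuition, the frame-wise nearest-neighbours mapping being discussed can be sketched in a few lines of numpy. This is an illustrative toy, not the repo's implementation: the function name, array shapes, and the use of cosine similarity over raw features are assumptions here.

```python
import numpy as np

def knn_convert(source, matching_set, k=4):
    """Replace each source frame with the mean of its k nearest
    neighbours (by cosine similarity) in the target-speaker matching set.

    source:       (T, D) source-utterance features
    matching_set: (N, D) target-speaker features
    """
    # Row-normalise so a plain dot product equals cosine similarity.
    s = source / np.linalg.norm(source, axis=1, keepdims=True)
    m = matching_set / np.linalg.norm(matching_set, axis=1, keepdims=True)
    sims = s @ m.T                            # (T, N) similarity matrix
    idx = np.argsort(-sims, axis=1)[:, :k]    # k best target frames per source frame
    return matching_set[idx].mean(axis=1)     # (T, D) converted features

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))     # 5 source frames, 8-dim features
tgt = rng.normal(size=(50, 8))    # 50 target-speaker frames
out = knn_convert(src, tgt, k=4)
print(out.shape)  # (5, 8)
```

Note that each output frame is an average of target frames chosen independently, so adjacent outputs can come from unrelated parts of the target speech -- the kind of frame shuffling that distorts fine-grained pitch information between neighbouring frames.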
Hope that helps a bit with some of the intuition behind why we use layer 6 :)
from knn-vc.
Wow... that's what I get for reading a paper, then several days later getting so laser-focused on a task that I go back to the paper, search for "layer 6", and read only two sentences and one reference around the search result. Thanks for the kind and detailed response to a half-baked question.
Fig. 2 from the original WavLM paper ([6]) does show that WavLM-Large's layerwise task contribution is sort of a "stretched" version of WavLM-Base+'s 12 layers, but the correlation isn't perfect (for example, layer 24 in Large seems to do well in the speaker ID task, but none of the later layers in Base does). Perhaps there's some interference from the semantic information captured by the later layers, and that makes the best-performing early layer a better choice for the VC task.
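That "stretched" comparison can be made concrete by resampling Base+'s 12 layer weights onto Large's 24 layer positions and correlating the two profiles. The weight values below are made-up numbers purely for illustration (the real ones would be read off Fig. 2 of the WavLM paper):

```python
import numpy as np

# Hypothetical speaker-ID layer weights for WavLM-Base+ (12 layers);
# illustrative values only, not taken from the paper.
base_w = np.array([0.02, 0.05, 0.20, 0.30, 0.18, 0.10,
                   0.05, 0.03, 0.02, 0.02, 0.02, 0.01])

# "Stretch" Base+'s profile onto 24 layer positions via linear interpolation.
x12, x24 = np.linspace(0, 1, 12), np.linspace(0, 1, 24)
stretched = np.interp(x24, x12, base_w)

# Pretend Large matches the stretched profile except that its layer 24
# also scores well on speaker ID (the imperfection noted above).
large_w = stretched.copy()
large_w[-1] += 0.15

r = np.corrcoef(stretched, large_w)[0, 1]
print(f"Pearson r = {r:.2f}")  # high, but below 1 because of the layer-24 bump
```

With real weights read from the figure, the same two-line interpolate-and-correlate check would quantify how far the Base+/Large correspondence actually holds.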
I guess my main remaining question, given the info in the original WavLM paper, is whether WavLM-Base+ layers 4/6 would perform similarly. No way to know except trying, I suppose :). Thanks again, and thanks for releasing your research and code; it's a creative and elegant use of pretrained models that doesn't add a lot of extra machinery to the process, which is refreshing in the current environment.
Related Issues (18)
- Link to paper
- WavLM Base+ over Large?
- Training HiFiGAN on higher quality data
- out_wav is a wav file?
- Torch Hub CPU inference support
- prematch_dataset run very slow
- prematch argument
- Choice for k
- Conversion output has very strong similarity to source audio.
- Considering context around source features
- An error when check input type
- bigvgan as vocoder
- SoX effect fails on Windows with SoundFile backend
- An error when loading models
- torchaudio version
- Discriminator checkpoint
- Output is a bit shaky, how to fix that?