Comments (2)
Hi @space-pope, thanks for your interest!
I think the prior context for that passage might help with understanding:
In preliminary experiments, we used features from later layers (22, 24, and the mean of the last several layers), which perform well on linear phone recognition tasks [6]. The idea was to improve nearest neighbors mapping by including more content information. However, this led to worse pitch and energy reconstruction. Recent work [15] confirms that later layers give poorer predictions of pitch, prosody, and speaker identity. Based on these observations, we found that using a layer with high correlation with speaker identification (layer 6 in WavLM-Large) was necessary for good speaker similarity and retention of the prosody information from the source utterance.
Here, the two references are meant to be understood together: [6] analyses several tasks for both WavLM-Base+ and WavLM-Large, while [15] considers additional tasks but only analyses the layer-wise contributions of WavLM-Base. From [6] we can see that there is an extremely strong correlation between WavLM-Base+ and WavLM-Large: if the last few layers of WavLM-Base are highly weighted for a task, we can expect the last few layers of WavLM-Large to also be highly weighted for the same task.
From [15] we know that the later layers of WavLM-Base perform poorly on pitch and energy reconstruction (important aspects of prosody). Taken together with [6] (what we referred to as 'these observations'), we can infer that the later layers of WavLM-Large will also struggle with pitch and energy reconstruction.
And, as you mention and as hinted in the passage, in preliminary experiments we did try several other layers and found layer 6 to be the best of the settings we tested -- but the other layers also perform reasonably (i.e. none of the layers are completely unusable). While we are not fully certain of the reason for this, we suspect it is because of the high weight layer 6 has for speaker identification [6]. The earlier layers might give better pitch and energy reconstruction, but they are lower-level features, and so yield slightly more artifacts after the k-means operation during vocoding. That is, we suspect that if pitch is too readily available, then the effective shuffling of WavLM frames after the k-means operation (which distorts the pitch information between adjacent frames) causes the output pitch contour to also be more distorted. However, there is much more room for investigation here, as we are not certain of all the effects at play.
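For intuition, the frame-wise nearest-neighbours mapping being discussed can be sketched in a few lines of numpy. This is an illustrative toy, not the repo's implementation: the function name, array shapes, and the use of cosine similarity over raw features are assumptions here.

```python
import numpy as np

def knn_convert(source, matching_set, k=4):
    """Replace each source frame with the mean of its k nearest
    neighbours (by cosine similarity) in the target-speaker matching set.

    source:       (T, D) source-utterance features
    matching_set: (N, D) target-speaker features
    """
    # Row-normalise so a plain dot product equals cosine similarity.
    s = source / np.linalg.norm(source, axis=1, keepdims=True)
    m = matching_set / np.linalg.norm(matching_set, axis=1, keepdims=True)
    sims = s @ m.T                            # (T, N) similarity matrix
    idx = np.argsort(-sims, axis=1)[:, :k]    # k best target frames per source frame
    return matching_set[idx].mean(axis=1)     # (T, D) converted features

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))     # 5 source frames, 8-dim features
tgt = rng.normal(size=(50, 8))    # 50 target-speaker frames
out = knn_convert(src, tgt, k=4)
print(out.shape)  # (5, 8)
```

Note that each output frame is an average of target frames chosen independently, so adjacent outputs can come from unrelated parts of the target speech -- the kind of frame shuffling that distorts fine-grained pitch information between neighbouring frames.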
Hope that helps a bit with some of the intuition behind why we use layer 6 :)
from knn-vc.
Wow... that's what I get for reading a paper, then several days later getting so laser-focused on a task that I go back to the paper, search for "layer 6", and read only two sentences and one reference around the search result. Thanks for the kind and detailed response to a half-baked question.
Fig. 2 from the original WavLM paper ([6]) does show that WavLM-Large's layerwise task contribution is sort of a "stretched" version of WavLM-Base+'s 12 layers, but the correlation isn't perfect (for example, layer 24 in Large seems to do well in the speaker ID task, but none of the later layers in Base does). Perhaps there's some interference from the semantic information captured by the later layers, and that makes the best-performing early layer a better choice for the VC task.
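That "stretched" comparison can be made concrete by resampling Base+'s 12 layer weights onto Large's 24 layer positions and correlating the two profiles. The weight values below are made-up numbers purely for illustration (the real ones would be read off Fig. 2 of the WavLM paper):

```python
import numpy as np

# Hypothetical speaker-ID layer weights for WavLM-Base+ (12 layers);
# illustrative values only, not taken from the paper.
base_w = np.array([0.02, 0.05, 0.20, 0.30, 0.18, 0.10,
                   0.05, 0.03, 0.02, 0.02, 0.02, 0.01])

# "Stretch" Base+'s profile onto 24 layer positions via linear interpolation.
x12, x24 = np.linspace(0, 1, 12), np.linspace(0, 1, 24)
stretched = np.interp(x24, x12, base_w)

# Pretend Large matches the stretched profile except that its layer 24
# also scores well on speaker ID (the imperfection noted above).
large_w = stretched.copy()
large_w[-1] += 0.15

r = np.corrcoef(stretched, large_w)[0, 1]
print(f"Pearson r = {r:.2f}")  # high, but below 1 because of the layer-24 bump
```

With real weights read from the figure, the same two-line interpolate-and-correlate check would quantify how far the Base+/Large correspondence actually holds.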
I guess my main remaining question, given the info in the original WavLM paper, is whether WavLM-Base+ layers 4/6 would perform similarly. No way to know except trying, I suppose :). Thanks again, and thanks for releasing your research and code; it's a creative and elegant use of pretrained models that doesn't add a lot of extra machinery to the process, which is refreshing in the current environment.
Related Issues (18)
- Link to paper
- WavLM Base+ over Large?
- Training HiFiGAN on higher quality data
- out_wav is a wav file?
- Torch Hub CPU inference support
- prematch_dataset run very slow
- prematch argument
- Choice for k
- Conversion output has very strong similarity to source audio.
- Considering context around source features
- An error when check input type
- bigvgan as vocoder
- SoX effect fails on Windows with SoundFile backend
- An error when loading models
- torchaudio version
- Discriminator checkpoint
- Output is a bit shaky, how to fix that?