
Comments (2)

RF5 commented on May 31, 2024

Hi @space-pope, thanks for your interest!

I think the prior context for that passage might help understanding:

> In preliminary experiments, we used features from later layers (22, 24, and the mean of the last several layers), which perform well on linear phone recognition tasks [6]. The idea was to improve nearest neighbors mapping by including more content information. However, this led to worse pitch and energy reconstruction. Recent work [15] confirms that later layers give poorer predictions of pitch, prosody, and speaker identity. Based on these observations, we found that using a layer with high correlation with speaker identification -- layer 6 in WavLM-Large -- was necessary for good speaker similarity and retention of the prosody information from the source utterance.

Here, the two references are meant to be understood together: [6] analyses several tasks for both WavLM-Base+ and WavLM-Large, while [15] considers additional tasks but only analyses the layer-wise contributions from WavLM-Base. From [6] we can see that there is an extremely strong correlation between WavLM-Base+ and WavLM-Large: e.g., if the last few layers of WavLM-Base+ are highly weighted for a task, we can expect the last few layers of WavLM-Large to also be highly weighted for that same task.

So, from [15] we know that the later layers of WavLM-Base perform poorly on pitch and energy reconstruction (important aspects of prosody); taken together with [6] (this is what we referred to as 'these observations'), we can infer that the later layers of WavLM-Large will also struggle with pitch and energy reconstruction.

And, as you mention and as hinted at in the passage, in preliminary experiments we did try several other layers, and found layer 6 to be the best of the settings we tested -- but the other layers also perform reasonably well (i.e. none of the layers are completely unusable). While we are not fully certain of the reason for this, we suspect it is because of the high weight layer 6 has for speaker identification [6]. The earlier layers might have better pitch and energy reconstruction, but they are lower-level features, and so yield slightly more artifacts after the kNN matching operation during vocoding. That is, we suspect that if pitch is too readily available, then the effective shuffling of WavLM frames after the kNN matching (which distorts the pitch information between adjacent frames) causes the output pitch contour to also be more distorted. However, there is much more room for investigation here, as we are not certain of all the effects at play.
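To make that matching step concrete, here is a minimal PyTorch sketch of kNN regression over layer-6 features -- an illustration rather than the exact code in this repo (the function name, the cosine-similarity choice, and k=4 are assumptions for the sketch):

```python
import torch
import torch.nn.functional as F

def knn_regression(source_feats: torch.Tensor, ref_feats: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest
    reference frames under cosine similarity.

    source_feats: (T, D) layer-6 features of the source utterance.
    ref_feats:    (N, D) layer-6 features of the reference speaker.
    """
    src = F.normalize(source_feats, dim=-1)
    ref = F.normalize(ref_feats, dim=-1)
    sims = src @ ref.T                    # (T, N) cosine similarities
    idx = sims.topk(k, dim=-1).indices    # k best reference frames per source frame
    return ref_feats[idx].mean(dim=1)     # (T, D) converted features
```

Note how each output frame is an average of k reference frames chosen independently per source frame: adjacent output frames can come from unrelated points in the reference audio, which is the frame-level 'shuffling' of pitch information described above.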

Hope that helps a bit with some of the intuition behind why we use layer 6 :)


space-pope commented on May 31, 2024

Wow...that's what I get for reading a paper then several days later getting so laser-focused on a task that I go back to the paper, search for "layer 6", and only read two sentences and one reference around the search result. Thanks for the kind and detailed response to a half-baked question.

Fig. 2 from the original WavLM paper ([6]) does show that WavLM-Large's layer-wise task contribution is sort of a "stretched" version of WavLM-Base+'s 12 layers, but the correlation isn't perfect (for example, layer 24 in Large seems to do well on the speaker ID task, but none of the later layers in Base+ does). Perhaps there's some interference from the semantic information captured by the later layers, and that makes the best-performing early layer a better choice for the VC task.
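That "stretched" comparison could even be made quantitative: resample the 12-layer profile to 24 points and correlate. A quick sketch (the helper name is made up, and the inputs are placeholders for weight values read off Fig. 2 of [6]):

```python
import numpy as np

def layer_profile_correlation(weights_base, weights_large):
    """Pearson correlation between a 12-layer and a 24-layer
    task-weight profile, after linearly "stretching" the shorter
    profile to the longer one's length."""
    xs_long = np.linspace(0.0, 1.0, len(weights_large))
    xs_short = np.linspace(0.0, 1.0, len(weights_base))
    stretched = np.interp(xs_long, xs_short, weights_base)
    return np.corrcoef(stretched, np.asarray(weights_large))[0, 1]
```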

I guess my main remaining question given the info in the original WavLM paper is whether WavLM-Base+ layers 4/6 would perform similarly. No way to know except trying, I suppose :). Thanks again -- and thanks for releasing your research and code; it's a creative and elegant use of pretrained models that doesn't add a lot of extra machinery to the process, which is refreshing in the current environment.
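For anyone wanting to try that experiment, swapping the extraction layer is a one-argument change with the Hugging Face transformers checkpoint for WavLM-Base+. A minimal sketch (the helper name and defaults are illustrative; this is not the knn-vc repo's own extraction code):

```python
import torch
from transformers import WavLMModel

# Load WavLM-Base+ (12 transformer layers) once, in eval mode.
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

def extract_features(waveform: torch.Tensor, layer: int = 6) -> torch.Tensor:
    """Return (T, D) frame features from one transformer layer.

    `waveform` is a (1, num_samples) float tensor of 16 kHz audio.
    hidden_states[0] is the CNN front-end output, so hidden_states[layer]
    is the output of transformer layer `layer`.
    """
    with torch.inference_mode():
        out = model(waveform, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)
```

Feeding `extract_features(wav, layer=4)` or `layer=6` features into a kNN matching step like the sketch above would be one way to answer the question empirically.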


