Comments (3)

liutaocode commented on August 22, 2024

Yes, your observation is correct. We also tested the number of reference frames concatenated in the first stage, varying it from 0 to 10, and found no significant difference. There are two main reasons:

  1. Most of the HDTF dataset consists of frontal faces, with fewer multi-angle shots, so the dataset difficulty is not high.
  2. The 512-dimensional motion latent already includes mouth-related information, so providing more reference frames is not very meaningful.

Because a talking-head model that supports arbitrary speakers always has at least one frame available as a reference, all of our results use a first-stage model with a "reference image" count of 1 (N=1) at inference, so that the available information is put to use.

My suggestions are as follows:

  • If your final scenario requires multiple angles or includes some background after expanding the original mouth mask, I recommend inputting as many frames as possible to the rendering stage for reference.
  • Otherwise, you can proceed without adding reference frames.
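
As a rough illustration of the first suggestion, here is a minimal sketch of how N reference frames could be fed to a rendering stage alongside the masked frame. It assumes the references are concatenated along the channel axis and that unused slots are zero-padded; the function names, the rectangular mask, and the array layout are hypothetical and are not taken from the DiffDub code.

```python
import numpy as np


def mask_mouth_region(frame, top, bottom, left, right):
    """Return a copy of a (C, H, W) frame with a rectangular region zeroed out."""
    masked = frame.copy()
    masked[:, top:bottom, left:right] = 0
    return masked


def build_render_input(masked_frame, reference_frames, max_refs):
    """Concatenate a masked frame with up to max_refs references along channels.

    masked_frame: array of shape (C, H, W).
    reference_frames: sequence of arrays, each of shape (C, H, W).
    Unused reference slots are zero-filled so the channel count stays fixed
    at C * (1 + max_refs), whatever number of references is supplied.
    """
    c, h, w = masked_frame.shape
    refs = list(reference_frames)[:max_refs]
    for ref in refs:
        if ref.shape != (c, h, w):
            raise ValueError("reference frames must match the masked frame shape")
    padding = [np.zeros((c, h, w), dtype=masked_frame.dtype)] * (max_refs - len(refs))
    return np.concatenate([masked_frame, *refs, *padding], axis=0)
```

With `max_refs=0` the function returns the masked frame unchanged, which corresponds to proceeding without reference frames.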

from diffdub.

kradkfl commented on August 22, 2024

Thanks! Did you find that, for in-the-wild predictions, the additional reference frames helped even with a model trained solely on HDTF? Or did training with HDTF encourage the model to ignore the reference frames?

liutaocode commented on August 22, 2024

Hello. We haven't run experiments beyond HDTF, but I can try to reason through the situation.

Intuitively, adding references during the diffusion rendering stage should help. The motion latent we use (512 dimensions) encodes both motion and color information, so it is hard for it to reconstruct the masked area perfectly in both respects. In theory, reference frames can improve reconstruction: the color information can come from the references, leaving the latent free to focus on motion.

This analysis is supported by the 7th row of Table 1 in reference [1], which shows that a 512-dimensional latent space alone is not sufficient to reconstruct an arbitrary facial image. However, when modeling a smaller area, such as the mouth, the situation may be different:

(1) For in-the-wild datasets, the latent space may struggle to accurately reproduce the area within the mask, as it needs to accommodate facial imagery from any individual. I suggest concatenating reference frames so that the latent space can focus more on motion.

(2) For HDTF, our tests with N ranging from 0 to 10 showed no significant difference, so adding references is likely unnecessary. Given that HDTF features only about 300 people, the diffusion model might easily learn the distribution of these individuals' mouth areas, leading to overfitting. This could be what you mentioned: "training with HDTF encourages the model to ignore the reference frames."
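
One way to probe whether a trained model actually ignores its reference frames is to render the same masked frame twice, once with the real references and once with noise in their place, and compare the outputs. The sketch below is a hypothetical diagnostic, not something we ran: `render_fn` stands in for whatever callable wraps the rendering stage, and the comparison is only meaningful if that callable is deterministic (for a diffusion sampler, the initial noise and seed would need to be fixed across both calls).

```python
import numpy as np


def reference_sensitivity(render_fn, masked_frame, references, rng):
    """Mean absolute output change when references are replaced by noise.

    render_fn: hypothetical callable (masked_frame, references) -> image array.
    A value near zero suggests the output does not depend on the references.
    """
    out_real = render_fn(masked_frame, references)
    # Swap every reference for Gaussian noise of the same shape and dtype.
    noise_refs = [rng.standard_normal(r.shape).astype(r.dtype) for r in references]
    out_noise = render_fn(masked_frame, noise_refs)
    return float(np.mean(np.abs(out_real - out_noise)))
```

Averaging this score over many frames, and restricting it to the masked mouth area, would give a steadier picture than a single frame.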

Reference:
[1] Preechakul, K., Chatthee, N., Wizadwongsa, S., et al. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 10619-10629.
