Hi! Thanks for open-sourcing your code, it's helpful to see a refere

Reference Image Concatenation about diffdub HOT 3 OPEN

kradkfl commented on August 22, 2024

Reference Image Concatenation

from diffdub.

Comments (3)

liutaocode commented on August 22, 2024 1

Yes, your observation is correct. In fact, we have also tested the number of frames stitched in this first stage, with the number ranging from 0 to 10. We found that the difference is not significant, and there are a few key reasons:

Most of the HDTF dataset consists of frontal faces, with fewer multi-angle shots, so the dataset difficulty is not high.
The 512-dimensional motion latent already includes mouth-related information, so providing more reference frames is not very meaningful.

Considering that a talking head model supporting arbitrary speakers will have at least one frame that can be used as a reference, all of our results use a model (first stage) with a "reference image" count of 1 (N=1) for inference, in order to maximize the utilization of information as much as possible.

My suggestions are as follows:

If your final scenario requires multiple angles or includes some background after expanding the original mouth mask, I recommend inputting as many frames as possible to the rendering stage for reference.
Otherwise, you can proceed without adding reference frames.

from diffdub.

kradkfl commented on August 22, 2024

Thanks! Did you find that, for in the wild predictions, the additional reference frames helped even with a model solely trained on HDTF? Or did training with HDTF encourage the model to ignore the reference frames during training?

from diffdub.

liutaocode commented on August 22, 2024

Hello. We haven't conducted experiments beyond HDTF, but I try to analyze the situation logically.

Intuitively, adding references during the diffusion rendering stage seems to be effective. Here, the motion latent space we use (512 dimensions) includes both motion and color information. So, it is challenging to perfectly reconstruct the masked area both in motion and color; theoretically, using some reference frames can result in better reconstruction, as the color information can be derived from the references, allowing the latent space to focus more on motion.

This analysis is supported by the 7th row of Table 1 in reference [1], which shows that a 512-dimensional latent space alone is not sufficient for restoring any arbitrary facial image. However, when modeling smaller areas, such as the mouth area, the situation may be different:

(1) For in-the-wild datasets, the latent space may struggle to accurately replicate areas within the mask, as it needs to accommodate facial imagery from any individual. I suggest adding concatenation to allow the latent space to focus more on motion.

(2) For HDTF, based on our tests with N ranging from 0 to 10 showing no difference, it is likely unnecessary. Given that HDTF features only about 300 people, the diffusion model might easily learn the distribution of these individuals's mouth area, leading to overfitting. This could be due to what you mentioned: "training with HDTF encourages the model to ignore the reference frames."

Reference:
[1] Preechakul, K., Chatthee, N., Wizadwongsa, S., et al. Diffuser Autoencoders: Toward a Meaningful and Decodable Representation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 10619-10629.

from diffdub.

Reference Image Concatenation about diffdub HOT 3 OPEN

Comments (3)

Related Issues (4)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent