Comments (3)
Yes, your observation is correct. We also tested the number of reference frames stitched (concatenated) in this first stage, with N ranging from 0 to 10, and found no significant difference. There are a few likely reasons:
- Most of the HDTF dataset consists of frontal faces, with fewer multi-angle shots, so the dataset difficulty is not high.
- The 512-dimensional motion latent already includes mouth-related information, so providing more reference frames is not very meaningful.
Since a talking head model that supports arbitrary speakers will always have at least one frame available as a reference, all of our results use a first-stage model with a "reference image" count of 1 (N=1) for inference, so that the available information is used as fully as possible.
My suggestions are as follows:
- If your final scenario requires multiple angles or includes some background after expanding the original mouth mask, I recommend inputting as many frames as possible to the rendering stage for reference.
- Otherwise, you can proceed without adding reference frames.
from diffdub.
Thanks! Did you find that, for in-the-wild predictions, the additional reference frames helped even with a model trained solely on HDTF? Or did training with HDTF encourage the model to ignore the reference frames?
from diffdub.
Hello. We haven't conducted experiments beyond HDTF, but I will try to analyze the situation logically.
Intuitively, adding references during the diffusion rendering stage seems likely to be effective. The motion latent space we use (512 dimensions) encodes both motion and color information. Because a single latent must carry both, it is challenging to perfectly reconstruct the masked area in motion and color at once. In theory, reference frames can improve reconstruction: the color information can be taken from the references, allowing the latent space to focus more on motion.
This analysis is supported by the 7th row of Table 1 in reference [1], which shows that a 512-dimensional latent space alone is not sufficient for restoring any arbitrary facial image. However, when modeling smaller areas, such as the mouth area, the situation may be different:
(1) For in-the-wild datasets, the latent space may struggle to accurately replicate areas within the mask, as it needs to accommodate facial imagery from any individual. I suggest concatenating reference frames so that the latent space can focus more on motion.
(2) For HDTF, our tests with N ranging from 0 to 10 showed no significant difference, so reference frames are likely unnecessary. Given that HDTF features only about 300 people, the diffusion model might easily learn the distribution of these individuals' mouth areas, leading to overfitting. This could be what you mentioned: "training with HDTF encourages the model to ignore the reference frames."
Reference:
[1] Preechakul, K., Chatthong, N., Wizadwongsa, S., et al. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 10619-10629.
from diffdub.