Comments (13)

YuanxunLu commented on July 21, 2024
  1. You're right. The function comments may have misled you because the code has been iterated many times (changes to the network structure, hyperparameters, etc.), and I forgot to update them. Please just follow the tensor shapes of the running code (you can print them during inference).
  2. The 18-frame delay can be found in the function 'generate_sequences', controlled by the parameter "frame_future". Yes, the LSTM will receive h0, h1, ..., hn and generate y0, y1, ..., yn. You can simply compare y17, y18, ..., yn with the corresponding ground truth (a sketch is given after this list).
  3. It is not fixed. As in 1, the code was iterated many times. In an early version I did some experiments using several clips (some data come sentence by sentence, as you know). If you have a consecutive audio clip with ground truth, there is no need to cut it.
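A minimal sketch of how that delay can be applied when computing the loss (this is not the repository's actual code; the shapes and the helper name are for illustration only):

```python
import torch

# Hypothetical sketch of the n-frame latency described in point 2: the LSTM
# receives audio features h0..hn and predicts y0..yn, but prediction y_t is
# compared against the ground truth at time t - frame_future, so the first
# `frame_future` predictions are dropped before computing the loss.
def delayed_l2_loss(pred, target, frame_future=18):
    """pred, target: (batch, T, feature_dim) tensors."""
    shifted_pred = pred[:, frame_future:, :]                          # y_17, ..., y_{T-1}
    shifted_target = target[:, :target.shape[1] - frame_future, :]    # gt_0, ..., gt_{T-1-18}
    return torch.mean((shifted_pred - shifted_target) ** 2)

# Dummy example: 4 sequences of 240 frames with 75-dim landmark features.
loss = delayed_l2_loss(torch.randn(4, 240, 75), torch.randn(4, 240, 75))
```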

Hope the above helps.

TimmmYang commented on July 21, 2024

Thank you! This helps me a lot.

TimmmYang commented on July 21, 2024

Hi, I still have a couple of questions about training the audio2feature model:

  1. How do you decide which landmarks are mouth-related? I found that a 68-point facial landmark model has fewer than 20 mouth points, while the paper uses 25 points. Are any other points included, such as eye or nose landmarks?

  2. I use 3DDFA to extract landmarks for the training videos, and the landmark values are just pixel locations. You take differences and normalize before feeding the input to the network, right? For example, Δv1 = v1 - v0, Δv2 = v2 - v0, ..., where v0 is the mean_pts3d (see the sketch after this list).

  3. How do you determine mean_pts3d? Should I just choose a neutral-expression frame of the target person and take its 3D landmark points, or use the mean over the whole dataset? Also, for frame_jump_stride=4, is this the frame increment between items? For example, with batch size 32, is the input tensor [32, 240*2, 512], where T=240 means frames 0-240 for item 1, 4-244 for item 2, ..., 124-364 for item 32?
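To make sure I understand point 2, here is a minimal sketch of the normalization I mean (variable names are just for illustration):

```python
import numpy as np

# Express each frame's landmarks as offsets from a fixed reference set
# (mean_pts3d) rather than as absolute positions.
def to_deltas(landmarks, mean_pts3d):
    """landmarks: (num_frames, num_points, 3); mean_pts3d: (num_points, 3)."""
    return landmarks - mean_pts3d[None, :, :]

# mean_pts3d could be the dataset mean or a single neutral-expression frame
# (point 3 above asks which one to use).
landmarks = np.random.rand(1000, 25, 3)
deltas = to_deltas(landmarks, landmarks.mean(axis=0))
```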

YuanxunLu commented on July 21, 2024
  1. I use a 73-landmark detector, which differs from the commonly used 68-landmark layout. The mouth-related set contains only the mouth landmarks. For 68 points, using just the mouth landmarks is fine. Of course, you can add more points if you find that works better in your experiments.
  2. I did 3D face tracking on each video and extracted these landmarks in 3D object space. Yes, the network learns the delta positions instead of the absolute positions.
  3. The mean_pts3d should be fixed for a given target, I think. Either way of choosing the mean landmarks should be OK (I tested both methods, and the results are similar). Frame jump is just an option to accelerate training; it decides how many frames the network sees, and how often, in one epoch (see the sketch below). I think this hyperparameter doesn't affect the performance much, but you can try it yourself.
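If it helps, here is a rough sketch matching your interpretation of the windowing (T and the stride are the values from your question; this is not the repository's actual dataset code):

```python
# Hypothetical sketch: sample i starts at frame i * frame_jump_stride and
# spans T consecutive frames, so the stride controls how many windows (and
# which frames) the network sees in one epoch.
def window_start_indices(num_frames, T=240, frame_jump_stride=4):
    return list(range(0, num_frames - T + 1, frame_jump_stride))

starts = window_start_indices(num_frames=10000)
print(starts[:3])  # [0, 4, 8] -> windows [0, 240), [4, 244), [8, 248)
```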

TimmmYang commented on July 21, 2024

Okay, thanks a lot!

TimmmYang commented on July 21, 2024

Hello, I am still confused about the coordinate conversion between 2D and 3D.

  • How do I set the camera parameters? I noticed that in project_landmarks you use camera_intrinsic and scale to convert 3d_pts to 2D landmarks. But how are camera_intrinsic and scale determined if I train on another dataset?

Right now I have a frame dataset with 3D points and head poses, but the 3D points range from 0 to 512, which I believe is in image coordinates. How should I process this dataset to train a new model?

YuanxunLu commented on July 21, 2024

It depends on your camera model, which in our setting is a perspective (pinhole) camera. I guess you currently use a scaled orthographic model (called weak perspective in some papers), since your 3D points range from 0 to 512.
The camera intrinsic and scale parameters should change according to your camera model.
If you are not familiar with camera models, I recommend checking Sec. 4.1 of the paper "3D Morphable Face Models - Past, Present and Future", or any other 3D-face-related paper.
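For concreteness, a minimal pinhole projection looks like the sketch below (the focal length and principal point are placeholder values, not our calibrated parameters):

```python
import numpy as np

# Project 3D points given in camera coordinates onto the image plane with a
# pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy.
def project_points(pts3d, fx, fy, cx, cy):
    """pts3d: (N, 3) array of camera-space points with Z > 0."""
    u = fx * pts3d[:, 0] / pts3d[:, 2] + cx
    v = fy * pts3d[:, 1] / pts3d[:, 2] + cy
    return np.stack([u, v], axis=1)

# Placeholder intrinsics for a 512x512 image; real values come from calibration.
pts2d = project_points(np.array([[0.01, 0.02, 0.6]]), fx=1200.0, fy=1200.0, cx=256.0, cy=256.0)
```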

TimmmYang commented on July 21, 2024

I read the papers you recommended, and I understand that I should set up a camera to transform the 3D landmarks from world coordinates to camera coordinates, and then use the camera intrinsics to compute 2D landmarks in image coordinates. But how do I place the camera in world coordinates? Can this be done automatically?

Also, in Section 4.1 of your paper you say that for camera calibration you use binary search to compute the focal length. Are there any open-source tools for this process? I read the reference paper (Cao 2013) and found no ready-made implementations.

In my case, all the 3D landmark values range from 0 to 512 because I got these points from a cropped video. I noticed that your camera's rotation and translation are all zero. Can I just place the camera at the middle of the image (which might not be accurate)? For example, for a point (x, y, z), the transformed point would be ((x-256)/256, (y-256)/256, (z-256)/256), since it might be good to map all values into [-1, 1] for training. I know I should also apply the same processing to the head pose and shoulder points, but I'm not sure whether that works.

Thanks!

YuanxunLu commented on July 21, 2024

Check the tool you used (its docs or its paper) to find out which camera/projection model it uses. Once you know the camera model, you know how to project the detected 3D points onto 2D images (just follow the formula in that tool's paper or docs). Since your 3D landmarks range from 0 to 512, I guess you can simply drop the z-coordinates.

Binary search is used to compute the perspective camera focal length f. I don't know of any open-source tools for it. Again, if you use a scaled orthographic camera, this step is not necessary.
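A rough sketch of that binary search (assuming the reprojection error is unimodal in f; reprojection_error is a hypothetical callback that fits the pose for a candidate focal length and returns the 2D fitting error):

```python
# Hypothetical sketch of calibrating the focal length by binary search:
# narrow the interval [f_low, f_high] toward the focal length that minimizes
# the reprojection error returned by the user-supplied callback.
def calibrate_focal_length(reprojection_error, f_low=300.0, f_high=5000.0, iters=30):
    for _ in range(iters):
        f_mid = 0.5 * (f_low + f_high)
        # Probe slightly to each side of the midpoint to decide which half
        # of the interval contains the minimum.
        if reprojection_error(f_mid - 1.0) < reprojection_error(f_mid + 1.0):
            f_high = f_mid
        else:
            f_low = f_mid
    return 0.5 * (f_low + f_high)

# Example with a toy error function whose minimum is at f = 1800:
print(calibrate_focal_length(lambda f: (f - 1800.0) ** 2))
```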

I think any transform of the landmarks for training is reasonable as long as it improves the experimental results.

TimmmYang commented on July 21, 2024

Hi, did you use the 18-frame latency both during training and inference, or did you train the model with zero latency and only apply the delay during inference?

I have been training the audio2feature model these days, and the problem is that the validation loss is much higher than the training loss. I use the default 80%/20% train/val split, but the validation loss is nearly 20 times larger than the training loss, as you can see from the plot:
[screenshot: training vs. validation loss curves]

YuanxunLu commented on July 21, 2024

Adding an n-frame latency is an effective scheme for generating better mouth shapes; it should be used both in training and testing.

I don't know your experiment settings (so it is hard to compare with mine; my validation loss is not as large as yours), but a higher validation loss is common. If you use only a few minutes of audio data, the model tends to overfit the training data. That is not a bad thing, because you actually want to learn the distribution of the training data. The key problem lies in how to map the input audio into the space of the training audio. More training data is better but hard to acquire; the Synthesizing Obama (SIGGRAPH 2017) paper did such an ablation study on training corpus size. With more data, the validation loss gets closer to the training loss.

Back to your plot: it is clear that when the training loss drops, the validation loss also drops, and this tendency is good. Usually, you can pick the checkpoint with the best validation loss for testing.

You can try to decrease the validation loss with more data, different training targets, or other methods.

TimmmYang commented on July 21, 2024

Thanks for your reply! I use experiment settings similar to yours (network structure, optimizer, learning rate, frame jump, n-frame latency, etc.), except that I use 20*3 values for the mouth shape (from a 68-point face alignment model). Because that model provides no camera information, I simply normalized the keypoint values by (x - 256)/256 so the data range from -1 to 1 (which is just a simple change of coordinates). I'm not sure whether that is appropriate.

Maybe I should do some tests on my trained model to see how it performs.
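For clarity, the normalization I mean is just this simple rescaling (a sketch, assuming a 512x512 crop; it is only a linear change of coordinates, not a camera model):

```python
import numpy as np

# Map landmark coordinates from the 512x512 crop into [-1, 1].
def normalize_landmarks(pts, image_size=512):
    half = image_size / 2.0
    return (pts - half) / half

print(normalize_landmarks(np.array([[256.0, 300.0, 240.0]])))  # values in [-1, 1]
```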

foocker commented on July 21, 2024

Was the final result good?
