yzhou359 / makeittalk
This project forked from adobe-research/makeittalk
License: Other
Thanks for your great work! The results are amazing!
I have a question about the choice of the sound encoder. As stated in the last paragraph of "Related Work" in your paper, you use Resemblyzer to extract the identity embedding, and AutoVC to extract the content embedding.
Here is my question: since AutoVC itself can decompose content and identity, why bother using Resemblyzer? Does it make a big difference?
Thank you in advance!
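Hello, I am training my own model with makeittalk. Are the files under src/dataset/utils/, such as STD_FACE_LANDMARKS.txt and MEAN_STD_AUTOVC_RETRAIN_AU.txt, generic, i.e. reusable for other datasets? If not, which file did you generate them from? I could not find the corresponding code in the project.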
Hey! This is awesome work on animating faces according to text. I wanted to know: if the image (just a sketch) is fixed and the incoming audio varies, could we build a custom lightweight model, like MobileNet, that runs the "Generate input data for inference" and "Audio-to-Landmarks prediction" steps in real time using only the browser's WebGL?
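Hello author, the AutoVC authors have open-sourced their training code. You made some modifications to the AutoVC model; could you contribute the training code for your modified AutoVC? Looking forward to your reply.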
Hi, good job! How do I train the model?
Hello! I ran into a problem while training the content branch.
I could not find a file named 'autovc_retrain_mel_train_au.pickle' in the dataset you uploaded to Google Drive,
so the code stopped at line 33 of '/src/dataset/audio2landmark/audio2landmark_dataset.py'.
Could you please tell me how to solve this problem?
By the way, what do 'align' and 'mel' mean in the filenames in your dataset? 🤔
Hi,
thanks again for this amazing work. In your opinion, can this approach replace approaches based on 3D Morphable Models? I am asking about quality only, setting aside its simplicity and fewer degrees of freedom.
Thanks
Hi, thanks for sharing. When extracting landmarks for the training data, did you use face_alignment.LandmarksType._3D or face_alignment.LandmarksType._2D?
The cartoon demo encountered the following errors:
ffmpeg version 4.2.2 Copyright (c) 2000-2019 the FFmpeg developers
built with gcc 9.2.1 (GCC) 20200122
configuration: --disable-static --enable-shared --enable-gpl --enable-version3 --enable-sdl2 --enable-fontconfig --enable-gnutls --enable-iconv --enable-libass --enable-libdav1d --enable-libbluray --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libtheora --enable-libtwolame --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libzimg --enable-lzma --enable-zlib --enable-gmp --enable-libvidstab --enable-libvorbis --enable-libvo-amrwbenc --enable-libmysofa --enable-libspeex --enable-libxvid --enable-libaom --enable-libmfx --enable-amf --enable-ffnvcodec --enable-cuvid --enable-d3d11va --enable-nvenc --enable-nvdec --enable-dxva2 --enable-avisynth --enable-libopenmpt
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
[image2 @ 000002659ffd11c0] Could find no file with path '%06d.tga' and index in the range 0-4
%06d.tga: No such file or directory
I am using Ubuntu 18.04
Thanks
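The ffmpeg message above says that no file matching the pattern '%06d.tga' was found in the directory ffmpeg ran in, which suggests the cartoon frames were never rendered or were written somewhere else. A minimal sketch for checking this before calling ffmpeg, assuming the frames are named 000000.tga, 000001.tga, and so on, as the pattern implies. The helper names and the directory argument are hypothetical, not part of the repository:

```python
import os
import re

def find_numbered_frames(frame_dir, ext="tga", digits=6):
    """Return the sorted frame indices found in frame_dir for names like 000000.tga."""
    pattern = re.compile(r"^(\d{%d})\.%s$" % (digits, re.escape(ext)))
    indices = []
    for name in os.listdir(frame_dir):
        match = pattern.match(name)
        if match:
            indices.append(int(match.group(1)))
    return sorted(indices)

def frames_ready_for_ffmpeg(frame_dir, ext="tga", digits=6):
    """True if a sequence that ffmpeg's '%06d.tga' pattern can open is present.

    By default ffmpeg only looks for the first frame at indices 0 to 4,
    which is the "range 0-4" mentioned in the error message.
    """
    indices = find_numbered_frames(frame_dir, ext, digits)
    return bool(indices) and indices[0] <= 4
```

If `frames_ready_for_ffmpeg` returns False for the directory ffmpeg is run in, the problem is in the rendering step that should have produced the .tga files, not in ffmpeg itself.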
Hello! Thanks for your great results!
I have a question about the data you trained on:
I can't find the exact resolution of the face crops in the "underlined" part of the data, nor how much of that data there is.
Could you mention it, please? ^^
P.S. Did you use the full VoxCeleb2 dataset for Img-to-Img training?
The training set is currently not provided on Google Drive. Could you describe it in text?
Thanks for sharing this work. I wonder whether I can use the code to drive a new cartoon image. I can annotate the landmarks by hand, but what does the file triangulation.txt mean? How can I get it for a new cartoon image?
Thanks a lot.
Hi,
I am looking for a way to get a good-quality "out.mp4" with the audio embedded in it (instead of test_audio_embed.mp4, which has lower quality).
If you have a clue about what to change in the code, thank you.
When will the Speaker-Aware Branch be released?
When I run the two jupyter notebooks, or even main_end2end.py, I get the following error:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-7-fc3b009acf6e> in <module>
76 for ain in ains:
77 os.system('ffmpeg -y -loglevel error -i examples/{} -ar 16000 examples/tmp.wav'.format(ain))
---> 78 shutil.copyfile('examples/tmp.wav', 'examples/{}'.format(ain))
79
80 # au embedding
c:\users\admin\appdata\local\programs\python\python37\lib\shutil.py in copyfile(src, dst, follow_symlinks)
118 os.symlink(os.readlink(src), dst)
119 else:
--> 120 with open(src, 'rb') as fsrc:
121 with open(dst, 'wb') as fdst:
122 copyfileobj(fsrc, fdst)
FileNotFoundError: [Errno 2] No such file or directory: 'examples/tmp.wav'
Where can I find this tmp.wav file?
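In the traceback above, `os.system` does not raise when ffmpeg fails or is not installed, so the script only notices at the following `shutil.copyfile` that `examples/tmp.wav` was never written. A sketch of a stricter version of that step, assuming the cause is a missing or failing ffmpeg. The helper names are hypothetical; the ffmpeg flags are the ones shown in the traceback:

```python
import shutil
import subprocess

def build_resample_command(src, dst, ffmpeg_bin="ffmpeg"):
    """Same ffmpeg call as in the traceback: resample src to 16 kHz and write dst."""
    return [ffmpeg_bin, "-y", "-loglevel", "error", "-i", src, "-ar", "16000", dst]

def resample_to_16k(src, dst, ffmpeg_bin="ffmpeg"):
    """Run ffmpeg and fail loudly instead of continuing without the output file."""
    if shutil.which(ffmpeg_bin) is None:
        raise RuntimeError("%s was not found on PATH; install it or pass its full path" % ffmpeg_bin)
    result = subprocess.run(build_resample_command(src, dst, ffmpeg_bin),
                            stderr=subprocess.PIPE, universal_newlines=True)
    if result.returncode != 0:
        raise RuntimeError("ffmpeg failed: " + result.stderr.strip())
    return dst
```

With this in place, a missing ffmpeg or an unreadable input file is reported at the point of failure rather than as a later FileNotFoundError for tmp.wav.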
The preprocessed dataset for training the Content Branch only includes 3 files, i.e. autovc_retrain_mel_test_au.pickle, autovc_retrain_mel_test_fl.pickle and emb.pickle. Is that right? When running the command 'python main_train_content.py --train', I get FileNotFoundError: [Errno 2] No such file or directory: autovc_retrain_mel_train_au.pickle.
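Is Chinese-language speech audio supported? Can the cartoon character image only be one of the example images, or can I use a cartoon image that I found myself?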
Hi, after struggling (as a first-timer) to get a working configuration of conda + pytorch + pyvision + cuda etc.,
I managed to run the script, but I'm facing another problem:
(makeittalk_env) vincent@denaes:~/Desktop/development/MakeItTalk-main$ python main_end2end.py --jpg examples/327-3275260_leonardo-dicaprio-png-famous-actor.png
Downloading: "https://www.adrianbulat.com/downloads/python-fan/3DFAN4-4a694010b9.zip" to /home/vincent/.cache/torch/hub/checkpoints/3DFAN4-4a694010b9.zip
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 91.9M/91.9M [03:41<00:00, 434kB/s]
Downloading: "https://www.adrianbulat.com/downloads/python-fan/depth-6c4283c0e0.zip" to /home/vincent/.cache/torch/hub/checkpoints/depth-6c4283c0e0.zip
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 224M/224M [09:08<00:00, 429kB/s]
Traceback (most recent call last):
File "main_end2end.py", line 72, in <module>
shapes = predictor.get_landmarks(img)
File "/home/vincent/anaconda3/envs/makeittalk_env/lib/python3.6/site-packages/face_alignment/api.py", line 110, in get_landmarks
return self.get_landmarks_from_image(image_or_path, detected_faces, return_bboxes, return_landmark_score)
File "/home/vincent/anaconda3/envs/makeittalk_env/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/vincent/anaconda3/envs/makeittalk_env/lib/python3.6/site-packages/face_alignment/api.py", line 138, in get_landmarks_from_image
image = get_image(image_or_path)
File "/home/vincent/anaconda3/envs/makeittalk_env/lib/python3.6/site-packages/face_alignment/utils.py", line 342, in get_image
if image.ndim == 2:
AttributeError: 'NoneType' object has no attribute 'ndim'
If you have a clue, thank you !
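`cv2.imread` returns `None` rather than raising when it cannot read a file, which is how `None` ends up inside `face_alignment` in the traceback above. One possible cause, assuming this version of `main_end2end.py` builds the path as `'examples/' + opt_parser.jpg` (as in the snippet quoted in another issue on this page), is that passing `--jpg examples/<name>.png` makes the script look for `examples/examples/<name>.png`. A small sketch of a path check to run before the image is loaded; `resolve_example_image` is a hypothetical helper, not a repository function:

```python
import os

def resolve_example_image(jpg_arg, examples_dir="examples"):
    """Return an existing image path for the --jpg argument, or raise a clear error.

    Accepts either a bare file name ('face.png') or a name that already
    carries the examples/ prefix ('examples/face.png').
    """
    candidates = [os.path.join(examples_dir, jpg_arg), jpg_arg]
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError("Image not found; tried: " + ", ".join(candidates))
```

If that assumption holds, running the command with the bare file name (without the `examples/` prefix) would be the first thing to try.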
Hi @yzhou359 , while using custom audio I'm getting this error in colab.
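Where do I input the audio? When I run the quick demo, it keeps failing with an error that the tmp.wav file does not exist. What is the tmp audio, and where does it come from?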
Looks like in the paper, the image-to-image translation is trained using a batch size of 16.
When using the code to do inference with a batch size of 16, the input tensor has shape (16, 6, 256, 256). However, the output of the image-to-image translation model is still (1, 6, 256, 256). Any idea what is causing that?
`
img = cv2.imread('examples/' + opt_parser.jpg)
predictor = face_alignment.FaceAlignment(face_alignment.LandmarksType._3D, device='cpu', flip_input=True)
shapes = predictor.get_landmarks(img)
if (not shapes or len(shapes) != 1):
    print('Cannot detect face landmarks. Exit.')
    exit(-1)
shape_3d = shapes[0]
if (opt_parser.close_input_face_mouth):
    util.close_input_face_mouth(shape_3d)
`
RuntimeError Traceback (most recent call last)
in
1 img =cv2.imread('examples/' + opt_parser.jpg)
----> 2 predictor = face_alignment.FaceAlignment(face_alignment.LandmarksType._3D, device='cpu', flip_input=True)
3 shapes = predictor.get_landmarks(img)
4 if (not shapes or len(shapes) != 1):
5 print('Cannot detect face landmarks. Exit.')
~\Anaconda3\lib\site-packages\face_alignment\api.py in __init__(self, landmarks_type, network_size, device, flip_input, face_detector, face_detector_kwargs, verbose)
83 network_name = '3DFAN-' + str(network_size)
84 self.face_alignment_net = torch.jit.load(
---> 85 load_file_from_url(models_urls.get(pytorch_version, default_model_urls)[network_name]))
86
87 self.face_alignment_net.to(device)
~\Anaconda3\lib\site-packages\torch\jit\_serialization.py in load(f, map_location, _extra_files)
159 cu = torch._C.CompilationUnit()
160 if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 161 cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
162 else:
163 cpp_module = torch._C.import_ir_module_from_buffer(
RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory
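The message "failed finding central directory" means the file handed to `torch.jit.load` is not a complete zip archive, which often happens when a checkpoint download was interrupted. A sketch that lists cached files that are not valid zip archives so they can be deleted and downloaded again. It assumes the checkpoints sit in the torch hub cache (`~/.cache/torch/hub/checkpoints` in the log of another issue on this page; the location may differ on Windows) and that every `.zip` file there should be a real archive. `find_broken_archives` is a hypothetical helper:

```python
import os
import zipfile

def find_broken_archives(cache_dir):
    """Return paths of *.zip files in cache_dir that lack a valid zip structure."""
    broken = []
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if name.endswith(".zip") and not zipfile.is_zipfile(path):
            broken.append(path)
    return broken
```

Removing the files it reports (for example with `os.remove`) and rerunning the script should trigger a fresh download, provided an incomplete download really is the cause.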
Thank you for sharing your work.
I want to train your model on my own dataset. I wonder how to preprocess the dataset.
Could you tell me how you did it?
Thanks for sharing; this is a good project. I have a question. The portrait image (256x256.jpg) is cropped from a larger image (1280x720.jpg) and used as input, and a 256x256.mp4 video is generated. Is there any way to put the result back into the original 1280x720 image, so that the generated video also includes the background around the 256x256 crop? Thanks.
Hello,
thanks for this amazing work and for sharing it. I need to understand how to get the Delaunay triangulation for the custom cartoon?
Thanks in advance and will really appreciate your response
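For a custom cartoon with hand-annotated landmarks, a Delaunay triangulation can be computed directly from the 2D landmark coordinates. Below is a small, dependency-free Bowyer-Watson sketch; `scipy.spatial.Delaunay(points).simplices` yields the same kind of index triples with less code. It assumes that triangulation.txt stores one triangle per line as three landmark indices, and that any extra points the cartoon demo uses (for example image-border points) are appended to the landmark list first. Neither assumption is confirmed by the authors in this thread.

```python
def _in_circumcircle(tri, p, pts):
    """True if point p lies strictly inside the circumcircle of triangle tri."""
    (ax, ay), (bx, by), (cx, cy) = pts[tri[0]], pts[tri[1]], pts[tri[2]]
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0.0:  # collinear vertices: no circumcircle
        return False
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (p[0] - ux) ** 2 + (p[1] - uy) ** 2 < (ax - ux) ** 2 + (ay - uy) ** 2

def delaunay_triangles(points):
    """Bowyer-Watson triangulation; returns sorted index triples into points.

    Sketch only: duplicate or collinear inputs are not handled, and the finite
    helper triangle can miss very thin triangles along the convex hull.
    """
    pts = [tuple(map(float, p)) for p in points]
    n = len(pts)
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    span = max(max(xs) - min(xs), max(ys) - min(ys), 1e-9)
    mx, my = (max(xs) + min(xs)) / 2.0, (max(ys) + min(ys)) / 2.0
    # Large helper triangle enclosing all points; its vertices get indices n..n+2.
    pts += [(mx - 20 * span, my - span), (mx, my + 20 * span), (mx + 20 * span, my - span)]
    triangles = [(n, n + 1, n + 2)]
    for i in range(n):
        bad = [t for t in triangles if _in_circumcircle(t, pts[i], pts)]
        # Edges used by exactly one bad triangle form the boundary of the hole.
        edge_count = {}
        for t in bad:
            for e in ((t[0], t[1]), (t[1], t[2]), (t[2], t[0])):
                key = tuple(sorted(e))
                edge_count[key] = edge_count.get(key, 0) + 1
        triangles = [t for t in triangles if t not in bad]
        triangles += [(u, v, i) for (u, v), c in edge_count.items() if c == 1]
    # Drop triangles that touch the helper vertices.
    return sorted(tuple(sorted(t)) for t in triangles if max(t) < n)
```

Each returned triple indexes into the input list, so writing one triple per line would produce a file in the assumed layout.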
1. What is "register"?
When training the content model, the paper says:
We also register the facial landmarks to a front-facing standard facial template using a
best-estimated affine transformation
Does this correspond to the following code in src/approaches/train_content.py?
''' register face '''
if (self.opt_parser.use_reg_as_std):
    landmarks = input_face_id.detach().cpu().numpy().reshape(68, 3)
    frame_t_shape = landmarks[self.t_shape_idx, :]
    T, distance, itr = icp(frame_t_shape, self.anchor_t_shape)
    landmarks = np.hstack((landmarks, np.ones((68, 1))))
    registered_landmarks = np.dot(T, landmarks.T).T
    input_face_id = torch.tensor(registered_landmarks[:, 0:3].reshape(1, 204), requires_grad=False,
                                 dtype=torch.float).to(device)
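When preparing the training data for the content model, i.e. the target facial landmark (fl) coordinates, do they need to be registered?
3. Is the speaker-aware model trained with the code in src/approaches/train_speaker_aware.py?
1) Why is the part that trains "Discriminator D_T" commented out? Does D need to be trained?
2) Does training this model require anything related to registration?
Section 4.2 of the paper says: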
we do not register the landmarks to a front-facing template since
here we are interested in learning the overall head motion.
However, src/approaches/train_speaker_aware.py still loads inputs_reg_fl, and I don't understand why.
3) Computing the loss:
fl_dis_pred = fl_dis_pred + face_id[0:1].detach()
loss_reg_fls = torch.nn.functional.l1_loss(fl_dis_pred, reg_fls_gt)
V = (fl_dis_pred + face_id[0:1]).view(-1, 68, 3) -> Shouldn't the "+ face_id" be dropped here?
I have three questions:
autovc_retrain_mel_test_au.pickle, the scale of facial landmark in (-1, 1).
(27, 28, 29, 30, 33, 36, 39, 42, 45) and only consider the displacement of the other points (mouth, jaw movements), factoring out head movement. But in file src\approaches\train_content.py, at line 176, you only register the chosen closing-lip landmark to the standard landmark in file src\dataset\utils\STD_FACE_LANDMARKS.txt; the other facial landmark frames are not aligned. Does this mean the speech content training data contains head movement?
Can your model accept a normal photo with transparency?
Hi, when you extract the content embedding of the vox dataset, do you use the same target identity embedding (autovc/retrain_version/obama_emb.txt) as that in the Obama dataset?
Thanks for your great project. As we know, the video frame rate is 25 or 29.97 fps, while the AutoVC output rate is 62.5 Hz. How do you temporally align the facial landmarks with the AutoVC output steps?
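One generic way to bridge the two rates is to resample the audio features at the video frame timestamps: at 25 fps each frame advances 62.5 / 25 = 2.5 feature steps, and at 29.97 fps roughly 2.09 steps. The sketch below uses linear interpolation for this. It is only an illustration of the idea, not necessarily how the authors align the data, and `resample_to_video_frames` is a hypothetical helper:

```python
def resample_to_video_frames(features, feature_rate, fps, num_frames):
    """Linearly interpolate per-step feature vectors at video frame timestamps."""
    if not features:
        raise ValueError("features must not be empty")
    last = len(features) - 1
    out = []
    for frame in range(num_frames):
        # Position of this video frame on the feature time axis, clamped to the end.
        pos = min(frame * feature_rate / fps, float(last))
        lo = int(pos)
        hi = min(lo + 1, last)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(features[lo], features[hi])])
    return out
```

For example, with 62.5 Hz features and 25 fps video, frame 1 falls halfway between feature steps 2 and 3 and receives their average. The opposite direction (upsampling landmarks to the feature rate) works the same way with the two rates swapped.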
Hello, what is the difference between the two files ANCHOR_T_SHAPE_9.txt and STD_FACE_LANDMARKS.txt? Is the only difference whether normalization was applied?
Can I use voices in any language?
No such file or directory: 'examples/pred_fls_examples/M6_04_16k_audio_embed.txt'. There is a similar closed issue on Windows but I am experiencing this on Colab.
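Hello author, I am very interested in MakeItTalk, but I am stuck on two problems. First, the mouth shapes do not match the Chinese audio I input; where should I start in order to locate this problem? Second, for a 24 s input audio, generating out.mp4 takes about 170-190 s, which is rather long. Can this be optimized to 50% of that time, and if so, how?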
I tried out the demo and received a video clip with 3 faces at the end. I want to post process the images. How can I download the rendered images only?
I couldn't find them. I guess they are deleted each time. How can I stop that?
Thanks for the release of this great repo!
I am interested in your proposed quantitative evaluation metrics (i.e., D-VL, D-A, D-LL, D-Rot/Pos). A standard set of evaluation metrics would make subsequent comparisons fairer. Is there any code for this part? Or can you provide more details (e.g., the definition of the mouth shape for the D-A metric) for computing those metrics?
Thanks for any reply!
I am just following the Colab example step by step and getting an error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-6-a3006e2fd807> in <module>()
9 import pickle
10 import face_alignment
---> 11 from thirdparty.autovc.AutoVC_mel_Convertor_retrain_version import AutoVC_mel_Convertor
12 import shutil
13 import time
ModuleNotFoundError: No module named 'thirdparty.autovc'
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
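`No module named 'thirdparty.autovc'` means Python cannot see a `thirdparty/autovc` directory from the notebook's working directory or from `sys.path`. In Colab this typically happens when the notebook is not running from the cloned repository root, or when the clone is incomplete; that is an assumption about the cause, not something confirmed in this thread. A sketch that checks for the directory and puts the repository root on `sys.path` (`add_repo_to_path` is a hypothetical helper):

```python
import os
import sys

def add_repo_to_path(repo_root):
    """Make `import thirdparty.autovc...` resolvable from any working directory."""
    repo_root = os.path.abspath(repo_root)
    package_dir = os.path.join(repo_root, "thirdparty", "autovc")
    if not os.path.isdir(package_dir):
        raise FileNotFoundError("Expected directory is missing: " + package_dir)
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root
```

Calling it with the path of the cloned repository before the import cell either fixes the import or reports exactly which directory is missing.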
Hello! I ran into the problem in the title when running the quick demo on Windows. What is this txt file? What might have gone wrong?
You used extract_f0_func_audiofile(audio_file, 'M') in here, which seems to be used in generating training data. However, extract_f0_func_audiofile(audio_file, 'F') is used at test time.
Is this intended behavior?
Hi yangzhou, did you use pos_pred when training the speaker_aware branch? pos_pred is defined here
MakeItTalk Quick Demo (natural human face animation)
Step 3/3: One-click to Run (just wait in seconds).
Error
Loaded Image...
---------------------------------------------------------------------------
SameFileError Traceback (most recent call last)
<ipython-input-10-fc3b009acf6e> in <module>()
76 for ain in ains:
77 os.system('ffmpeg -y -loglevel error -i examples/{} -ar 16000 examples/tmp.wav'.format(ain))
---> 78 shutil.copyfile('examples/tmp.wav', 'examples/{}'.format(ain))
79
80 # au embedding
/home/ifarkas/anaconda3/envs/makeittalk_env/lib/python3.6/shutil.py in copyfile(src, dst, follow_symlinks)
102 """
103 if _samefile(src, dst):
--> 104 raise SameFileError("{!r} and {!r} are the same file".format(src, dst))
105
106 for fn in [src, dst]:
SameFileError: 'examples/tmp.wav' and 'examples/tmp.wav' are the same file
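The SameFileError above occurs when the loop variable `ain` is itself `tmp.wav`, for example when a `tmp.wav` left over from an earlier run is picked up from `examples/` as an input. A sketch of an input filter that skips it, assuming the inputs are the `.wav` files in one directory. How `ains` is actually built is not shown in the traceback, and `list_audio_inputs` is a hypothetical helper:

```python
import os

def list_audio_inputs(audio_dir, tmp_name="tmp.wav"):
    """Return the .wav files in audio_dir, excluding the temporary resampling target."""
    return sorted(
        name for name in os.listdir(audio_dir)
        if name.lower().endswith(".wav") and name != tmp_name
    )
```

Deleting the stale `examples/tmp.wav` before rerunning the cell would have the same effect for a single run.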
Hello, the videos I generate with your released img2img pretrained model look poor. What could be causing this?