Hello everyone! Today we are going to build an interesting application that creates video messages from YouTube videos.
It is a simple application that replaces the original speech of a video with your own text. The program tries to clone the original voice, uses it to speak your text, and then uses machine learning to synchronize the lips with the new speech.
The application will be hosted on Amazon Web Services (AWS), which provides the computing power needed for these calculations.
We are going to use a SageMaker notebook to create this application.
There are two versions of this program:
- The notebook version (.ipynb), which is the extended version and the one we will look at first.
- The Python version (.py), which is the simplified version.
We will discuss the extended version to understand how things were done.
You should log in to your AWS account and create a SageMaker notebook, as explained in the previous blog post here. For this project I used an ml.g4dn.4xlarge instance. Open a new terminal with File > New > Terminal.
Then, in the terminal, clone the repository by typing the following:
git clone https://github.com/ruslanmv/Video-Speech-Generator-from-Youtube.git
Then enter the directory:
cd Video-Speech-Generator-from-Youtube
First of all, we need to install the environment for this application, called VideoMessage. It runs on python=3.7.13 and requires ffmpeg and git-lfs.
To do this, type the following in the terminal:
sh-4.2$ sh install.sh
Once the script is done, you can open the Video-Message-Generator.ipynb notebook and choose the VideoMessage kernel.
Before we start building the application, we need to manage the environments of this cloud service carefully. In a SageMaker notebook, we can check which environment the kernel is using:
import sys ,os
print(sys.prefix)
/home/ec2-user/anaconda3/envs/VideoMessage
However, if we want to install packages into our environment from the terminal, we have to be careful. Because SageMaker runs in its own container, we have to properly load the environment that we created before running any command in the SageMaker terminal:
Sagemaker = True
if Sagemaker:
    env = 'source activate python3 && conda activate VideoMessage &&'
else:
    env = ''
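The prefix above is prepended to every shell command in the rest of the notebook. As a minimal sketch of that idea, the helper below builds such a command string in one place. The name `with_env` is hypothetical and is not part of the repository.

```python
def with_env(command, env_prefix=""):
    """Prepend an environment-activation prefix to a shell command.

    `with_env` is a hypothetical helper for illustration; `env_prefix`
    plays the role of the `env` string defined above.
    """
    prefix = env_prefix.strip()
    if not prefix:
        # Nothing to activate: run the command as it is.
        return command
    # Make sure the prefix ends with '&&' so the two parts chain correctly.
    if not prefix.endswith("&&"):
        prefix += " &&"
    return "{} {}".format(prefix, command)


sagemaker_prefix = "source activate python3 && conda activate VideoMessage &&"
print(with_env("pip show numpy", sagemaker_prefix))
print(with_env("pip show numpy"))
```

The resulting string can be passed to `os.system` in the same way the notebook does.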
In this part we install all the dependencies of our custom program. Because we are installing them for the first time, we set is_first_time to True; on later runs, set it to False.
is_first_time = True
#Install dependency
# Download pretrained model
# Import the os module
import os
# Get the current working directory
parent_dir = os.getcwd()
print(parent_dir)
if is_first_time:
    # Directory
    directory = "sample_data"
    # Path
    path = os.path.join(parent_dir, directory)
    print(path)
    try:
        os.mkdir(path)
        print("Directory '% s' created" % directory)
    except Exception:
        print("Directory '% s'was already created" % directory)
    os.system('git clone https://github.com/Rudrabha/Wav2Lip')
    os.system('cd Wav2Lip &&{} pip install -r requirements.txt'.format(env))
/home/ec2-user/SageMaker/VideoMessageGen
/home/ec2-user/SageMaker/VideoMessageGen/sample_data
Directory 'sample_data'was already created
fatal: destination path 'Wav2Lip' already exists and is not an empty directory.
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting librosa==0.7.0
Downloading librosa-0.7.0.tar.gz (1.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 24.4 MB/s eta 0:00:00
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
.
.
.
Successfully built librosa audioread
Installing collected packages: llvmlite, tqdm, threadpoolctl, pycparser, pillow, numpy, joblib, audioread, torch, scipy, opencv-python, opencv-contrib-python, numba, cffi, torchvision, soundfile, scikit-learn, resampy, librosa
Successfully installed audioread-3.0.0 cffi-1.15.1 joblib-1.1.0 librosa-0.7.0 llvmlite-0.31.0 numba-0.48.0 numpy-1.17.1 opencv-contrib-python-4.6.0.66 opencv-python-4.1.0.25 pillow-9.2.0 pycparser-2.21 resampy-0.3.1 scikit-learn-1.0.2 scipy-1.7.3 soundfile-0.10.3.post1 threadpoolctl-3.1.0 torch-1.1.0 torchvision-0.3.0 tqdm-4.45.0
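The output above shows what happens on a second run: the directory already exists and `git clone` fails with a fatal error. A minimal sketch of a setup step that can be run repeatedly is shown below. The helper names `ensure_dir` and `clone_if_missing` are hypothetical and are not part of the repository.

```python
import os
import subprocess


def ensure_dir(path):
    """Create `path` if needed; do nothing when it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


def clone_if_missing(repo_url, target_dir):
    """Clone `repo_url` into `target_dir` unless the directory already exists.

    Returns True when a clone was run, False when it was skipped.
    """
    if os.path.isdir(target_dir):
        print("Skipping clone, '{}' already exists".format(target_dir))
        return False
    subprocess.run(["git", "clone", repo_url, target_dir], check=True)
    return True


# Example usage (needs git and network access, so it is left commented out):
# ensure_dir(os.path.join(os.getcwd(), "sample_data"))
# clone_if_missing("https://github.com/Rudrabha/Wav2Lip", "Wav2Lip")
```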
Now we download from Google the pretrained models that Wav2Lip uses:
from utils.default_models import ensure_default_models
from pathlib import Path
## Load the models one by one.
print("Preparing the models of Wav2Lip")
ensure_default_models(Path("Wav2Lip"))
Preparing the models of Wav2Lip
Wav2Lip/checkpoints/wav2lip_gan.pth
Wav2Lip/face_detection/detection/sfd/s3fd.pth
Then we need to download the repositories and packages used for voice generation:
if is_first_time:
    os.system('git clone https://github.com/Edresson/Coqui-TTS -b multilingual-torchaudio-SE TTS')
    os.system('{} pip install -q -e TTS/'.format(env))
    os.system('{} pip install -q torchaudio==0.9.0'.format(env))
    os.system('{} pip install -q youtube-dl'.format(env))
    os.system('{} pip install ffmpeg-python'.format(env))
    os.system('{} pip install gradio==3.0.4'.format(env))
    os.system('{} pip install pytube==12.1.0'.format(env))
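When the notebook kernel is already VideoMessage, another common way to make sure packages land in the right environment is to call pip through the interpreter that is running the notebook. The sketch below illustrates that alternative. It is not what the repository does, and the helper names are hypothetical.

```python
import subprocess
import sys


def pip_install_args(*packages, quiet=True):
    """Build the argument list for installing packages with the current interpreter."""
    args = [sys.executable, "-m", "pip", "install"]
    if quiet:
        args.append("-q")
    args.extend(packages)
    return args


def pip_install(*packages):
    """Run pip for the interpreter that is executing this code; return its exit code."""
    return subprocess.run(pip_install_args(*packages), check=False).returncode


# Only print the command here; call pip_install(...) to actually run it.
print(pip_install_args("gradio==3.0.4", "pytube==12.1.0"))
```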
Then we import the libraries and define two helper functions:
#this code for recording audio
from IPython.display import HTML, Audio
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg
from pytube import YouTube
import random
from subprocess import call
import os
from IPython.display import HTML
from base64 import b64encode
from IPython.display import clear_output
from datetime import datetime
def showVideo(path):
    mp4 = open(str(path),'rb').read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML("""
    <video width=700 controls>
        <source src="%s" type="video/mp4">
    </video>
    """ % data_url)
def time_between(t1, t2):
    FMT = '%H:%M:%S'
    t1 = datetime.strptime(t1, FMT)
    t2 = datetime.strptime(t2, FMT)
    delta = t2 - t1
    return str(delta)
To check that time_between works:
time_between("00:00:01","00:00:10" )
The result is as expected:
'0:00:09'
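One caveat: if the end time is earlier than the start time, the difference is negative and its string form ('-1 day, 23:59:51' for a 9 second gap) is not a usable duration. A minimal sketch of a stricter variant follows. The name `interval_seconds` is hypothetical and is not part of the repository.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def interval_seconds(start, end):
    """Return the number of seconds between two HH:MM:SS timestamps.

    Raises ValueError if `end` is not later than `start`.
    """
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    seconds = int(delta.total_seconds())
    if seconds <= 0:
        raise ValueError("end must be later than start")
    return seconds


print(interval_seconds("00:00:01", "00:00:10"))  # prints 9
```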
The next step is to create the functions that download videos from YouTube. I have defined two versions, one based on pytube and the other on youtube-dl.
def download_video(url):
    print("Downloading...")
    local_file = (
        YouTube(url)
        .streams.filter(progressive=True, file_extension="mp4")
        .first()
        .download(filename="youtube{}.mp4".format(random.randint(0, 10000)))
    )
    print("Downloaded")
    return local_file
def download_youtube(url):
    # Select a YouTube video
    # Find the YouTube video ID
    from urllib import parse as urlparse
    url_data = urlparse.urlparse(url)
    query = urlparse.parse_qs(url_data.query)
    YOUTUBE_ID = query["v"][0]
    url_download ="https://www.youtube.com/watch?v={}".format(YOUTUBE_ID)
    # Download the YouTube video with the given ID
    os.system("{} youtube-dl -f mp4 --output youtube.mp4 '{}'".format(env,url_download))
The difference is that youtube-dl takes much longer. It can be used to download YouTube videos at higher quality, but for now I prefer the better performance of pytube.
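The ID lookup in download_youtube assumes that the URL has a `v` query parameter and raises a KeyError otherwise. A minimal sketch of a more tolerant lookup is shown below, under the assumption that only standard watch URLs and youtu.be short links need to be handled. The name `extract_video_id` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the YouTube video ID from a watch URL or a youtu.be short link.

    Returns None when no ID can be found.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/").split("/")[0] or None
    # Standard links carry the ID in the "v" query parameter.
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


print(extract_video_id("https://www.youtube.com/watch?v=xw5dvItD5zY"))
```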
We can test both modules.
if is_first_time:
    URL = 'https://www.youtube.com/watch?v=xw5dvItD5zY'
    #URL = 'https://www.youtube.com/watch?v=uIaY0l5qV0c'
    #download_youtube(URL)
    download_video(URL)
Downloading...
Downloaded
Then we need to define some functions to clean up our files.
def cleanup():
    import pathlib
    import glob
    types = ('*.mp4','*.mp3', '*.wav') # the tuple of file types
    # Find the mp4, mp3 and wav files in the current directory
    junks = []
    for files in types:
        junks.extend(glob.glob(files))
    try:
        # Delete those files
        for junk in junks:
            print("Deleting",junk)
            # Set the path of the file to delete
            file = pathlib.Path(junk)
            # Call the unlink method on the path
            file.unlink()
    except Exception:
        print("I cannot delete the file because it is being used by another process")
def clean_data():
    # Import the modules used below
    import sys, os
    # Remember the starting directory
    home_dir = os.getcwd()
    # Directory that holds the trimmed files
    fd = 'sample_data/'
    # Build the full path to clean
    path_to_clean=os.path.join(home_dir,fd)
    print("Path to clean:",path_to_clean)
    # Try to enter the directory and delete its media files
    try:
        os.chdir(path_to_clean)
        print("Inside to clean", os.getcwd())
        cleanup()
    # Catch the exception, for example when the directory does not exist
    except Exception:
        print("Something wrong with specified directory. Exception- ", sys.exc_info())
    # Always return to the starting directory
    finally:
        print("Restoring the path")
        os.chdir(home_dir)
        print("Current directory is-", os.getcwd())
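The two functions above change the working directory before deleting files. As a minimal sketch of an alternative, the same cleanup can take the directory as an argument so that the working directory is never changed. The name `remove_media_files` is hypothetical and is not part of the repository.

```python
from pathlib import Path


def remove_media_files(directory, patterns=("*.mp4", "*.mp3", "*.wav")):
    """Delete files matching `patterns` in `directory` and return their names."""
    removed = []
    folder = Path(directory)
    if not folder.is_dir():
        # Nothing to clean when the directory does not exist.
        return removed
    for pattern in patterns:
        for file in folder.glob(pattern):
            try:
                file.unlink()
                removed.append(file.name)
            except OSError as error:
                # For example, the file is locked by another process.
                print("Could not delete {}: {}".format(file, error))
    return sorted(removed)
```

For example, `remove_media_files("sample_data")` would play the role of clean_data().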
The next step is to define a function that trims the YouTube videos.
def youtube_trim(url,start,end):
    # Delete any previously downloaded video
    cleanup()
    # Download the video
    #download_youtube(url) # with youtube-dl (slow)
    input_videos=download_video(url)
    # Get the current working directory
    parent_dir = os.getcwd()
    # Trim the video between start and end (both in HH:MM:SS format)
    # Note: the trimmed video must have a face in all frames
    interval = time_between(start, end)
    trimmed_video= parent_dir+'/sample_data/input_video.mp4'
    trimmed_audio= parent_dir+'/sample_data/input_audio.mp3'
    # Delete the trimmed files if they already exist
    clean_data()
    # Cut the video
    call(["ffmpeg","-y","-i",input_videos,"-ss", start,"-t",interval,"-async","1",trimmed_video])
    # Extract the audio from the trimmed video
    call(["ffmpeg","-i",trimmed_video, "-q:a", "0", "-map","a",trimmed_audio])
    # Preview the trimmed video
    print("Trimmed Video+Audio")
    return trimmed_video, trimmed_audio
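To make the two ffmpeg calls easier to read and to check, their argument lists can be built by small functions first. The sketch below mirrors the flags used above. The function names are hypothetical, and adding `-y` to the audio command is an assumption made here so that an existing file is overwritten.

```python
def build_trim_args(source, start, duration, target):
    """Return the ffmpeg argument list that cuts `duration` starting at `start`."""
    return [
        "ffmpeg", "-y",   # overwrite the target if it exists
        "-i", source,     # input video
        "-ss", start,     # start position, HH:MM:SS
        "-t", duration,   # length of the clip
        "-async", "1",
        target,
    ]


def build_audio_args(video, target):
    """Return the ffmpeg argument list that extracts the audio track of `video`."""
    return ["ffmpeg", "-y", "-i", video, "-q:a", "0", "-map", "a", target]


print(build_trim_args("youtube.mp4", "00:00:01", "0:00:09", "sample_data/input_video.mp4"))
```

Each list can then be passed to `subprocess.call`, as in youtube_trim.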
print("In our enviroment")
os.system('{} pip show pandas numpy'.format(env))
In our enviroment
Name: pandas
Version: 1.3.5
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /home/ec2-user/anaconda3/envs/VideoMessage/lib/python3.7/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: gradio, TTS
---
Name: numpy
Version: 1.19.5
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email:
License: BSD
Location: /home/ec2-user/anaconda3/envs/VideoMessage/lib/python3.7/site-packages
Requires:
Required-by: gradio, librosa, matplotlib, numba, opencv-contrib-python, opencv-python, pandas, resampy, scikit-learn, scipy, tensorboardX, torchvision, TTS, umap-learn
We also need to install a specific version of OpenCV.
if is_first_time:
    os.system('{} pip install opencv-contrib-python-headless==4.1.2.30'.format(env))
We need to clone the voice from the YouTube clip in order to reproduce the speech as realistically as possible.
import sys
VOICE_PATH = "utils/"
# add libraries into environment
sys.path.append(VOICE_PATH) # set this if VOICE is not installed globally
from utils.voice import *
Here we combine the cloned voice with the trimmed video. Using face detection and a GAN, we generate the motion of the lips.
def create_video(Text,Voicetoclone):
    # Synthesize the text with the cloned voice
    out_audio=greet(Text,Voicetoclone)
    current_dir=os.getcwd()
    clonned_audio = os.path.join(current_dir, out_audio)
    # Start crunching and preview the output
    # Note: only change these if you have to
    pad_top = 0#@param {type:"integer"}
    pad_bottom = 10#@param {type:"integer"}
    pad_left = 0#@param {type:"integer"}
    pad_right = 0#@param {type:"integer"}
    rescaleFactor = 1#@param {type:"integer"}
    nosmooth = False #@param {type:"boolean"}
    out_name ="result_voice.mp4"
    out_file="../"+out_name
    # Build the Wav2Lip command once and add --nosmooth only when requested
    command = '{} cd Wav2Lip && python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "../sample_data/input_video.mp4" --audio "../out/clonned_audio.wav" --outfile {} --pads {} {} {} {} --resize_factor {}'.format(env,out_file,pad_top ,pad_bottom ,pad_left ,pad_right ,rescaleFactor)
    if nosmooth:
        command += ' --nosmooth'
    is_command_ok = os.system(command)
    # A non-zero exit status means that the inference failed
    if is_command_ok > 0:
        print("Error : Ensure the video contains a face in all the frames.")
        out_file="./demo/tryagain1.mp4"
        return out_file
    else:
        print("OK")
    #clear_output()
    print("Creation of Video done")
    return out_name
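The long command string in create_video can also be assembled as a list of arguments, which avoids quoting problems. The following is a minimal sketch that uses only the flags that appear in the command above. The function name and its defaults are hypothetical, and it assumes that the process running it already uses the VideoMessage environment.

```python
def build_wav2lip_args(face, audio, outfile, pads=(0, 10, 0, 0),
                       resize_factor=1, nosmooth=False,
                       checkpoint="checkpoints/wav2lip_gan.pth"):
    """Return the argument list for Wav2Lip's inference.py.

    `pads` is (top, bottom, left, right), as in create_video.
    """
    args = [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face,
        "--audio", audio,
        "--outfile", outfile,
        "--pads", *[str(p) for p in pads],
        "--resize_factor", str(resize_factor),
    ]
    if nosmooth:
        args.append("--nosmooth")
    return args


args = build_wav2lip_args("../sample_data/input_video.mp4",
                          "../out/clonned_audio.wav",
                          "../result_voice.mp4")
print(" ".join(args))
```

The list could then be run with `subprocess.run(args, cwd="Wav2Lip")`, which takes the place of the `cd Wav2Lip &&` part of the shell command.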
Here we choose the video The King's Speech: King Charles III, and we trim its video and sound.
URL = 'https://www.youtube.com/watch?v=xw5dvItD5zY'
#Testing first time
trimmed_video, trimmed_audio=youtube_trim(URL,"00:00:01","00:00:10")
showVideo(trimmed_video)
Trimmed Video+Audio
size= 165kB time=00:00:09.01 bitrate= 150.2kbits/s speed=76.7x
video:0kB audio:165kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.210846%
Starting from the trimmed video of The King's Speech: King Charles III, we process everything together.
Text=' I am clonning your voice. Charles!. Machine intelligence is the last invention that humanity will ever need to make.'
Voicetoclone=trimmed_audio
print(Voicetoclone)
#Testing first time
outvideo=create_video(Text,Voicetoclone)
#Preview output video
print("Final Video Preview")
final_video= parent_dir+'/'+outvideo
print("Download this video from", final_video)
showVideo(final_video)
/home/ec2-user/SageMaker/VideoMessageGen/sample_data/input_audio.mp3
path url
/home/ec2-user/SageMaker/VideoMessageGen/sample_data/input_audio.mp3
> text: I am clonning your voice. Charles!. Machine intelligence is the last invention that humanity will ever need to make.
Generated Audio