Video Speech Generator from YouTube

Hello everyone, today we are going to build an interesting application that creates video messages from YouTube videos.

It is a simple application that replaces the original speech of the video with your text. The program tries to clone the original voice and uses it to speak your text, and the lips are synchronized with the new speech using machine learning.

The application will be hosted on Amazon Web Services (AWS) in order to perform the calculations.

We are going to use SageMaker Notebook to create this application.

There are two versions of this program:

The notebook version (.ipynb), which is the extended version that we are going to see first.
The Python version (.py), which is the simplified version.

We are going to discuss the extended version to understand how things were done.

Step 1 - Cloning the repository

You should log in to your AWS account and create a SageMaker Notebook, as explained in the previous blog post. For this project I used an ml.g4dn.4xlarge instance. Then open a new terminal via File > New > Terminal.

Then, in the terminal, we clone the repository by typing the following:

git clone https://github.com/ruslanmv/Video-Speech-Generator-from-Youtube.git

Then enter the directory:

cd Video-Speech-Generator-from-Youtube

Step 2 - Environment setup

First of all, we need to set up the environment for this application, called VideoMessage. It runs on python=3.7.13 and requires ffmpeg and git-lfs.

To do this, open a terminal and type:

sh-4.2$ sh install.sh

When the installation is done, you can open the Video-Message-Generator.ipynb notebook and choose the VideoMessage kernel.

Step 3 - Definition of variables.

To begin building the application, we need to manage the environments of this cloud service carefully. In a SageMaker Notebook, we can check which environment is active:

import sys ,os
print(sys.prefix)

/home/ec2-user/anaconda3/envs/VideoMessage

However, if we want to install packages into our environment from the terminal, we have to be careful: SageMaker runs in its own container, so in the SageMaker terminal we have to properly load the environment that we created.

Sagemaker = True
if Sagemaker :
    env='source activate python3 && conda activate VideoMessage &&'
else:
    env=''
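
The prefix defined above is prepended to every shell command in the later steps. As a minimal sketch of that pattern, here is a small helper; the name `with_env` and its `on_sagemaker` parameter are illustrative and not part of the repository.

```python
# Hypothetical helper (not in the original notebook): prepend the
# environment-activation prefix to a shell command string.
SAGEMAKER_PREFIX = "source activate python3 && conda activate VideoMessage &&"


def with_env(command, on_sagemaker=True):
    """Return the command, prefixed with the activation step on SageMaker."""
    prefix = SAGEMAKER_PREFIX if on_sagemaker else ""
    return "{} {}".format(prefix, command).strip()


print(with_env("pip show numpy"))
print(with_env("pip show numpy", on_sagemaker=False))
```

The resulting string can be passed to os.system in the same way the notebook passes its formatted commands.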

Step 4 - Setup of the dependencies

In this part we install all the dependencies of our custom program. Since we are installing for the first time, we set is_first_time to True; otherwise set it to False.

is_first_time = True

#Install dependency
# Download pretrained model
# Import the os module
import os
# Get the current working directory
parent_dir = os.getcwd()
print(parent_dir)
if is_first_time:
    # Directory 
    directory = "sample_data"
    # Path 
    path = os.path.join(parent_dir, directory) 
    print(path)
    try:
        os.mkdir(path)
        print("Directory '% s' created" % directory) 
    except Exception:
         print("Directory '% s'was already created" % directory)
    os.system('git clone https://github.com/Rudrabha/Wav2Lip')
    os.system('cd Wav2Lip &&{} pip install  -r requirements.txt'.format(env))

/home/ec2-user/SageMaker/VideoMessageGen
/home/ec2-user/SageMaker/VideoMessageGen/sample_data
Directory 'sample_data'was already created


fatal: destination path 'Wav2Lip' already exists and is not an empty directory.


Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting librosa==0.7.0
  Downloading librosa-0.7.0.tar.gz (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 24.4 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
.
.
.

Successfully built librosa audioread
Installing collected packages: llvmlite, tqdm, threadpoolctl, pycparser, pillow, numpy, joblib, audioread, torch, scipy, opencv-python, opencv-contrib-python, numba, cffi, torchvision, soundfile, scikit-learn, resampy, librosa
Successfully installed audioread-3.0.0 cffi-1.15.1 joblib-1.1.0 librosa-0.7.0 llvmlite-0.31.0 numba-0.48.0 numpy-1.17.1 opencv-contrib-python-4.6.0.66 opencv-python-4.1.0.25 pillow-9.2.0 pycparser-2.21 resampy-0.3.1 scikit-learn-1.0.2 scipy-1.7.3 soundfile-0.10.3.post1 threadpoolctl-3.1.0 torch-1.1.0 torchvision-0.3.0 tqdm-4.45.0
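
The log above shows git failing with "destination path 'Wav2Lip' already exists" when the cell is run a second time. One possible way to make the setup safe to re-run is sketched below. The function names `clone_if_missing` and `prepare_workspace` and the injectable `runner` parameter are assumptions made for illustration; they are not part of the repository.

```python
import os
import subprocess


def clone_if_missing(repo_url, dest, runner=subprocess.run):
    """Clone repo_url into dest unless dest already exists.

    Returns True when a clone was attempted, False when it was skipped.
    `runner` is injectable so the function can be tried without network access.
    """
    if os.path.isdir(dest):
        print("'{}' already exists, skipping clone".format(dest))
        return False
    runner(["git", "clone", repo_url, dest], check=True)
    return True


def prepare_workspace(parent_dir, runner=subprocess.run):
    """Create sample_data and fetch Wav2Lip under parent_dir, idempotently."""
    # exist_ok=True replaces the try/except around os.mkdir
    os.makedirs(os.path.join(parent_dir, "sample_data"), exist_ok=True)
    return clone_if_missing(
        "https://github.com/Rudrabha/Wav2Lip",
        os.path.join(parent_dir, "Wav2Lip"),
        runner,
    )
```

Calling prepare_workspace(os.getcwd()) a second time then skips the clone instead of printing the fatal error.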

Now we download from Google the pretrained models that Wav2Lip uses.

from utils.default_models import ensure_default_models
from pathlib import Path

## Load the models one by one.
print("Preparing the models of Wav2Lip")
ensure_default_models(Path("Wav2Lip"))

Preparing the models of Wav2Lip
Wav2Lip/checkpoints/wav2lip_gan.pth
Wav2Lip/face_detection/detection/sfd/s3fd.pth

Then we need to download some repositories for voice generation.

if is_first_time:
    os.system('git clone https://github.com/Edresson/Coqui-TTS -b multilingual-torchaudio-SE TTS')
    os.system('{} pip install -q -e TTS/'.format(env))
    os.system('{} pip install -q torchaudio==0.9.0'.format(env))
    os.system('{} pip install -q youtube-dl'.format(env))
    os.system('{} pip install ffmpeg-python'.format(env))
    os.system('{} pip install gradio==3.0.4'.format(env))
    os.system('{} pip install pytube==12.1.0'.format(env))

Then we import the libraries.

# Imports used for audio handling, downloading and display
import io
import os
import random
from base64 import b64decode, b64encode
from datetime import datetime
from subprocess import call

import ffmpeg
import numpy as np
from IPython.display import HTML, Audio, clear_output
from pytube import YouTube
from scipy.io.wavfile import read as wav_read

Step 5 - Definition of the modules used in this program

def showVideo(path):
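  # Read the file inside a context manager so the handle is closed afterwards
  with open(str(path), 'rb') as video_file:
    mp4 = video_file.read()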
  data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
  return HTML("""
  <video width=700 controls>
        <source src="%s" type="video/mp4">
  </video>
  """ % data_url)

def time_between(t1, t2):
    FMT = '%H:%M:%S'
    t1 = datetime.strptime(t1, FMT)
    t2 = datetime.strptime(t2, FMT)
    delta = t2 - t1
    return str(delta)

To check that it works:

time_between("00:00:01","00:00:10" )

The result is correct:

'0:00:09'
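
time_between returns a string, which is the form later passed to ffmpeg. If the end time is earlier than the start time, the string form of a negative timedelta looks like '-1 day, 23:59:51', which is not a usable duration. A minimal sketch of a check that could run before trimming is shown here; the name `interval_seconds` is illustrative and not part of the repository.

```python
from datetime import datetime


def interval_seconds(start, end, fmt="%H:%M:%S"):
    """Return the number of seconds between two HH:MM:SS timestamps.

    Raises ValueError when end is not later than start.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    seconds = int(delta.total_seconds())
    if seconds <= 0:
        raise ValueError("end must be later than start")
    return seconds


print(interval_seconds("00:00:01", "00:00:10"))  # 9
```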

The next step is to create the function that downloads videos from YouTube. I have defined two versions, one based on pytube and the other on youtube-dl.

def download_video(url):
    print("Downloading...")
    local_file = (
        YouTube(url)
        .streams.filter(progressive=True, file_extension="mp4")
        .first()
        .download(filename="youtube{}.mp4".format(random.randint(0, 10000)))
    )
    print("Downloaded")
    return local_file

def download_youtube(url):
    #Select a Youtube Video
    #find youtube video id
    from urllib import parse as urlparse
    url_data = urlparse.urlparse(url)
    query = urlparse.parse_qs(url_data.query)
    YOUTUBE_ID = query["v"][0]
    url_download ="https://www.youtube.com/watch?v={}".format(YOUTUBE_ID)
    # download the youtube with the given ID
    os.system("{} youtube-dl -f  mp4 --output youtube.mp4 '{}'".format(env,url_download))

The difference is that youtube-dl takes much longer. It can be used to download YouTube videos at higher quality, but for now I prefer the better performance of pytube.
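
Note that download_youtube reads the video ID from the "v" query parameter, so it only handles URLs of the form watch?v=... and raises a KeyError otherwise. A minimal sketch of a more tolerant lookup follows; the function name `extract_video_id` and the handling of youtu.be short links are assumptions added for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the YouTube video ID from a watch URL or a youtu.be link.

    Returns None when no ID can be found.
    """
    parsed = urlparse(url)
    # Short links carry the ID in the path: https://youtu.be/<id>
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    # Regular links carry the ID in the "v" query parameter
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


print(extract_video_id("https://www.youtube.com/watch?v=xw5dvItD5zY"))
```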

We can test both modules.

if is_first_time:
    URL = 'https://www.youtube.com/watch?v=xw5dvItD5zY'
    #URL = 'https://www.youtube.com/watch?v=uIaY0l5qV0c'
    #download_youtube(URL)
    download_video(URL)

Downloading...
Downloaded

Then we need to define some functions to clean up our files.

def cleanup():
    import pathlib
    import glob
    types = ('*.mp4','*.mp3', '*.wav') # the tuple of file types
    #Finding mp4 and wave files
    junks = []
    for files in types:
        junks.extend(glob.glob(files))
    try:    
        # Deleting those files
        for junk in junks:
            print("Deleting",junk)
            # Setting the path for the file to delete
            file = pathlib.Path(junk)
            # Calling the unlink method on the path
            file.unlink()               
    except Exception:
        print("I cannot delete the file because it is being used by another process")

def clean_data():
    # import the libraries used below
    import sys, os
    # remember the starting directory so it can be restored
    home_dir = os.getcwd()
    # directory that holds the trimmed video and audio
    fd = 'sample_data/'
    # Join various path components
    path_to_clean=os.path.join(home_dir,fd)
    print("Path to clean:",path_to_clean)
    # move into sample_data and delete the media files there
    try:
        os.chdir(path_to_clean)
        print("Inside to clean", os.getcwd())
        cleanup()
    # catch the exception, e.g. when the directory does not exist
    except Exception:
        print("Something wrong with specified directory. Exception- ", sys.exc_info())
    # always return to the starting directory
    finally:
        print("Restoring the path")
        os.chdir(home_dir)
        print("Current directory is-", os.getcwd())
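
clean_data changes the working directory so that cleanup can glob relative paths. As an alternative sketch under the same assumptions (only .mp4, .mp3 and .wav files are removed), the directory can be passed explicitly with pathlib so that no chdir is needed. The name `remove_media` is illustrative and not part of the repository.

```python
from pathlib import Path

MEDIA_PATTERNS = ("*.mp4", "*.mp3", "*.wav")


def remove_media(directory):
    """Delete media files directly inside `directory`; return the names removed."""
    removed = []
    for pattern in MEDIA_PATTERNS:
        for path in Path(directory).glob(pattern):
            try:
                path.unlink()
                removed.append(path.name)
            except OSError as error:
                # e.g. the file is in use by another process
                print("Could not delete", path, "-", error)
    return sorted(removed)
```

With this variant, remove_media("sample_data") would do the work of clean_data, and remove_media(".") the work of cleanup.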

The next step is to define a function that trims the YouTube videos.

def youtube_trim(url,start,end):
    #cancel previous youtube
    cleanup()
    #download youtube
    #download_youtube(url) # with youtube-dl (slow)
    input_videos=download_video(url)
    # Get the current working directory
    parent_dir = os.getcwd()
    # Trim the video between start and end (both given as HH:MM:SS)
    #Note: the trimmed video must have a face in all frames
    interval = time_between(start, end)
    trimmed_video= parent_dir+'/sample_data/input_video.mp4'
    trimmed_audio= parent_dir+'/sample_data/input_audio.mp3'
    #delete the trimmed files if they already exist
    clean_data()
    # cut the video  
    call(["ffmpeg","-y","-i",input_videos,"-ss", start,"-t",interval,"-async","1",trimmed_video])
    # cut the audio
    call(["ffmpeg","-i",trimmed_video, "-q:a", "0", "-map","a",trimmed_audio])
    #Preview trimmed video
    print("Trimmed Video+Audio")
    return trimmed_video, trimmed_audio
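
The two ffmpeg calls inside youtube_trim can also be built by a small pure function, which makes them easy to inspect before running. This is only a sketch: the name `build_trim_commands` is illustrative, and adding -y to the audio extraction step is a change from the original, where only the video cut overwrites existing files.

```python
def build_trim_commands(source, start, duration, video_out, audio_out):
    """Return the two ffmpeg argument lists used to trim a clip.

    The first cuts `duration` of video starting at `start`;
    the second extracts the audio track of the trimmed clip.
    """
    cut_video = ["ffmpeg", "-y", "-i", source, "-ss", start,
                 "-t", duration, "-async", "1", video_out]
    # -y is added here as an assumption so that reruns overwrite old audio
    extract_audio = ["ffmpeg", "-y", "-i", video_out,
                     "-q:a", "0", "-map", "a", audio_out]
    return cut_video, extract_audio


video_cmd, audio_cmd = build_trim_commands(
    "youtube.mp4", "00:00:01", "0:00:09",
    "sample_data/input_video.mp4", "sample_data/input_audio.mp3")
print(" ".join(video_cmd))
```

Each list can be passed to subprocess.call exactly as youtube_trim does.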

Step 7 - Simple check of pandas and numpy versions

print("In our environment")
os.system('{} pip show pandas numpy'.format(env))

In our environment
Name: pandas
Version: 1.3.5
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /home/ec2-user/anaconda3/envs/VideoMessage/lib/python3.7/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: gradio, TTS
---
Name: numpy
Version: 1.19.5
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: 
License: BSD
Location: /home/ec2-user/anaconda3/envs/VideoMessage/lib/python3.7/site-packages
Requires: 
Required-by: gradio, librosa, matplotlib, numba, opencv-contrib-python, opencv-python, pandas, resampy, scikit-learn, scipy, tensorboardX, torchvision, TTS, umap-learn

I need to select a specific version of OpenCV.

if is_first_time:
    os.system('{} pip install opencv-contrib-python-headless==4.1.2.30'.format(env))

Step 8 - Import the libraries for voice recognition

We need to clone the voice from the YouTube clip in order to reproduce the speech as realistically as possible.

import sys
VOICE_PATH = "utils/"
# add libraries into environment
sys.path.append(VOICE_PATH) # set this if VOICE is not installed globally

from utils.voice import *

Step 9 - Video creation

Here we mix the cloned voice with the trimmed video; using face recognition and a GAN, we generate the motion of the lips.

def create_video(Text,Voicetoclone):
    out_audio=greet(Text,Voicetoclone)
    current_dir=os.getcwd()
    clonned_audio = os.path.join(current_dir, out_audio)
    
    #Start Crunching and Preview Output
    #Note: Only change these, if you have to
    pad_top =  0#@param {type:"integer"}
    pad_bottom =  10#@param {type:"integer"}
    pad_left =  0#@param {type:"integer"}
    pad_right =  0#@param {type:"integer"}
    rescaleFactor =  1#@param {type:"integer"}
    nosmooth = False #@param {type:"boolean"}
    
    out_name ="result_voice.mp4"
    out_file="../"+out_name
    
    # Build the Wav2Lip command once; append --nosmooth only when requested
    command = ('{} cd Wav2Lip && python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth '
               '--face "../sample_data/input_video.mp4" --audio "../out/clonned_audio.wav" '
               '--outfile {} --pads {} {} {} {} --resize_factor {}').format(
                   env, out_file, pad_top, pad_bottom, pad_left, pad_right, rescaleFactor)
    if nosmooth:
        command += ' --nosmooth'
    is_command_ok = os.system(command)

    if is_command_ok > 0:
        print("Error : Ensure the video contains a face in all the frames.")
        out_file="./demo/tryagain1.mp4"
        return  out_file
    else:
        print("OK") 
    #clear_output()
    print("Creation of Video done")
    return out_name
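
The Wav2Lip call in create_video is assembled as one long shell string. As a sketch of an alternative, the same flags that appear in that string can be collected into an argument list. The function name `build_inference_args` and its default values are illustrative; only the flags themselves are taken from the command above.

```python
def build_inference_args(face, audio, outfile, pads=(0, 10, 0, 0),
                         resize_factor=1, nosmooth=False):
    """Return the argument list for Wav2Lip's inference.py.

    `pads` is (top, bottom, left, right), matching the order used above.
    """
    args = ["python", "inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
            "--face", face,
            "--audio", audio,
            "--outfile", outfile,
            "--pads"]
    args += [str(pad) for pad in pads]
    args += ["--resize_factor", str(resize_factor)]
    if nosmooth:
        args.append("--nosmooth")
    return args


print(build_inference_args("../sample_data/input_video.mp4",
                           "../out/clonned_audio.wav",
                           "../result_voice.mp4"))
```

If the VideoMessage environment is already active in the running process, such a list could be handed to subprocess.run with cwd="Wav2Lip"; on SageMaker the activation prefix from Step 3 is still needed, which is why the notebook keeps the shell string.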

Step 10 - Test trimmed video

Here we choose the video The King's Speech: King Charles III, and we trim its video and sound.

URL = 'https://www.youtube.com/watch?v=xw5dvItD5zY'
#Testing first time
trimmed_video, trimmed_audio=youtube_trim(URL,"00:00:01","00:00:10")
showVideo(trimmed_video)

Trimmed Video+Audio

size=     165kB time=00:00:09.01 bitrate= 150.2kbits/s speed=76.7x    
video:0kB audio:165kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.210846%

Step 11 - Add text + cloned voice + lip motion

From the trimmed video The King's Speech: King Charles III we process everything together.

Text=' I am clonning your voice. Charles!. Machine intelligence is the last invention that humanity will ever need to make.'
Voicetoclone=trimmed_audio
print(Voicetoclone)
#Testing first time
outvideo=create_video(Text,Voicetoclone)
#Preview output video
print("Final Video Preview")
final_video= parent_dir+'/'+outvideo
print("Download this video from", final_video)
showVideo(final_video)

/home/ec2-user/SageMaker/VideoMessageGen/sample_data/input_audio.mp3
path url
/home/ec2-user/SageMaker/VideoMessageGen/sample_data/input_audio.mp3
 > text:  I am clonning your voice. Charles!. Machine intelligence is the last invention that humanity will ever need to make.
Generated Audio
