Digital Twin Generation with Voice Cloning and Realistic Video

This project was developed for the Techolution AI Done Right Hackathon. The repository contains the code and instructions for creating a near real-time digital twin using voice cloning and realistic video generation. The goal is a digital clone of a person that combines their voice, expressions, and speech in a lifelike manner.

Problem Statement

The challenge presented in the hackathon involves creating AI models capable of the following:

  1. Advanced Neural Architectures: Utilizing state-of-the-art deep learning techniques, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs) for voice cloning and spoof video generation.

  2. Expressiveness: Developing models that can faithfully capture a wide range of emotions, accents, and speaking styles, allowing for expressive voice cloning and fluid video generation from a 2D image.

  3. Naturalness: Ensuring that the generated voice clones sound natural and human-like, while focusing on generating accurate lip sync and realistic video corresponding to cloned audio.

  4. Robustness: Enhancing the robustness of the AI models so that they perform well with limited training data and in challenging acoustic environments. For video, the goal is to minimize visible traces that reveal the output as synthetic.

  5. Real-Time Nature: Creating an ensemble of voice cloning and spoof video generation models that operate in near real-time, making it suitable for conversational AI applications.

Solution Architecture

We approached this problem by leveraging two separate components:

  1. Voice Cloning and Text-to-Speech (TTS): We use the Tortoise-TTS repository both to clone a voice and to generate speech from a text prompt. The user uploads a sample audio file of the voice to clone and specifies the text that the cloned voice should speak.

  2. Realistic Video Generation: We use the SadTalker repository to generate realistic lip-synced video. It takes an input image and the audio file produced by the voice cloning step, and generates a video in which the person in the image speaks that audio.

To meet the processing requirements of the two models and avoid crashes, we ran each component in its own Google Colab instance. In each instance we configured ngrok with Flask to create a publicly accessible URL, which the Streamlit application uses to reach that component.
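The two-stage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: the class name `DigitalTwinPipeline`, the two callable signatures, and the choice to pass raw bytes between stages are all hypothetical. In the actual project each stage would be a request to the Flask server behind the corresponding ngrok URL.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the real notebooks may expose different interfaces.
CloneVoice = Callable[[bytes, str], bytes]     # (voice sample, text prompt) -> speech audio
AnimateFace = Callable[[bytes, bytes], bytes]  # (portrait image, speech audio) -> video


@dataclass
class DigitalTwinPipeline:
    """Chains the voice-cloning stage into the video-generation stage."""

    clone_voice: CloneVoice
    animate_face: AnimateFace

    def run(self, voice_sample: bytes, prompt: str, image: bytes) -> bytes:
        if not prompt.strip():
            raise ValueError("text prompt must not be empty")
        # Stage 1: synthesize the prompt in the cloned voice.
        speech = self.clone_voice(voice_sample, prompt)
        # Stage 2: animate the image so that it lip-syncs to the synthesized speech.
        return self.animate_face(image, speech)
```

Keeping the stages as injected callables means the same chaining logic works whether the models run in two Colab instances, one instance, or locally.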

Running the Code

To run the complete system, follow these steps:

  1. Upload TorTTS_API.ipynb to one Colab instance and Vid_API.ipynb to another Colab instance.

  2. Configure ngrok in both instances so that each notebook exposes its API through a public URL.

  3. Enter the ngrok URLs generated in step 2 into the app.py file.

  4. Run the Streamlit application using the command: streamlit run app.py.

  5. In the Streamlit application, provide the following inputs:

    • Upload a sample audio file (.wav) of 10 to 15 seconds for voice cloning.
    • Specify a text prompt for generating speech.
    • Upload an image (.png) of the person whose voice is to be cloned.

     The system then generates a lip-synced video from the cloned speech and the uploaded image.

Demo.mp4
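The input requirements from step 5 (a .wav sample of 10 to 15 seconds, a .png image, and a text prompt) could be checked before anything is sent to the Colab instances. The following is a sketch only, using the Python standard library; the function names and error messages are hypothetical and do not come from app.py.

```python
import io
import wave
from typing import List

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def validate_inputs(wav_bytes: bytes, png_bytes: bytes, prompt: str,
                    min_s: float = 10.0, max_s: float = 15.0) -> List[str]:
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    try:
        duration = wav_duration_seconds(wav_bytes)
    except (wave.Error, EOFError):
        problems.append("audio is not a readable WAV file")
    else:
        if not min_s <= duration <= max_s:
            problems.append(
                f"audio is {duration:.1f}s long; expected {min_s:.0f} to {max_s:.0f}s"
            )
    if not png_bytes.startswith(PNG_SIGNATURE):
        problems.append("image is not a PNG file")
    if not prompt.strip():
        problems.append("text prompt is empty")
    return problems
```

Returning all problems at once, rather than raising on the first, lets a front end show the user everything that needs fixing in a single pass.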

Additional Notes

  • If you have access to powerful GPUs, you can combine TorTTS_API.ipynb and Vid_API.ipynb and run them in a single Colab instance, or on your local machine, for faster processing.

  • The Streamlit application is designed for local use, so users need to clone this repository and run the application on their own machines.
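Since the application runs locally and needs the two ngrok URLs, one possible alternative to editing app.py by hand is to read the URLs from environment variables and check them at startup. This is a hedged sketch, not how the repository does it: the helper `load_endpoint` and the variable names `TTS_API_URL` and `VIDEO_API_URL` are assumptions made for illustration.

```python
import os
from urllib.parse import urlparse


def load_endpoint(name: str, env=os.environ) -> str:
    """Read a base URL from the environment and check that it is an http(s) URL."""
    url = env.get(name, "").strip().rstrip("/")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"{name} must be set to an http(s) URL, got {url!r}")
    return url


# Hypothetical usage inside app.py (variable names are assumptions):
# TTS_API_URL = load_endpoint("TTS_API_URL")
# VIDEO_API_URL = load_endpoint("VIDEO_API_URL")
```

Failing early with a clear message avoids a confusing connection error later, when the first request is sent to a missing or mistyped URL.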

Contributors

  • goblincomet
