kpister / oratio

Open Source Video Localization Pipeline

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

video-localization translation

oratio's Introduction

Open Oratio

An open source pipeline to translate .mp4 video files to .mov video files in 20 different languages.

Generate quality video and podcast localizations at scale.

Setup

Most important:

Python 3.7 or newer (check with python --version)

pip install -r docs/requirements.txt

Also install rubberband: brew install rubberband

Then follow the instructions in docs/ for AWS and gcloud integration. Make sure to set the names of the S3 or gcloud bucket where you will store your audio: set the AWS_BUCKET_NAME and GCLOUD_BUCKET_NAME constants in src/constants/constants.py.

Optional Setup

Install ImageMagick if you want text overlay: brew install imagemagick

Set up pre-commit if you want to contribute: pre-commit install

Test the setup: pre-commit run. This should invoke black and run_tests.py, but both should be skipped until there are code changes.

Running the pipeline

python src/main.py tests/test_config.yaml will test your setup to make sure everything is in the right place.

After test_config.yaml starts working, make your own project folder in media/prod and edit the config.yaml to get going! Check out my test video in media/prod/kaiser to familiarize yourself with the setup.

python src/main.py will use the default config.yaml provided in the home directory.

Understanding the Repo

Start with src/main.py. Run it. Read it.

Follow the commands it executes with a debugger.

Then check out src/client.py. This is our biggest piece of abstraction, and especially if you are adding an API feature, you'll want a good understanding of what it is doing.

src/config.py and src/video_project.py have important setup information and maintain the state of the project.

File structure

. home
/docs - contains documentation on ideas; most documentation is in the relevant .py files
/src - contains source code for the pipeline
/src/api - the neural APIs we work with, abstracted in client.py
/media - contains input and output media
/media/dev - stores temporary files made during translation
/media/prod - stores the finalized input and output files
/media/test - stores test input files
/tests - unit tests for the pipeline

Metrics

Performance (speed) Performance (accuracy)

oratio's Issues

Integrate a segment-level quality score

Machine translation is integrated, but of course some of the translations are bad.

With an instant segment-level quality score, Oratio can implement various flavours of "hybrid" translation:

create a priority queue for human post-editors
send segments below a certain quality to human post-editors

It can also monitor aggregate quality and more system- or process-level issues (bad segmentation, wrong language, issues with content types...).
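The routing idea above can be sketched as follows. This is a minimal illustration, assuming every segment already carries a quality score between 0 and 1 from some scoring service. The Segment class, the route_segments function, and the 0.7 threshold are hypothetical and are not part of Oratio or of any provider's API.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(order=True)
class Segment:
    # Hypothetical container: a lower score means a worse translation,
    # so the worst segments sort first in the heap.
    score: float
    text: str = field(compare=False)


def route_segments(
    segments: List[Segment], threshold: float = 0.7
) -> Tuple[List[Segment], List[Segment]]:
    """Split segments into (needs_review, accepted).

    needs_review is ordered worst-first so that human post-editors
    see the riskiest segments before the borderline ones.
    """
    heap = [s for s in segments if s.score < threshold]
    heapq.heapify(heap)
    needs_review = [heapq.heappop(heap) for _ in range(len(heap))]
    accepted = [s for s in segments if s.score >= threshold]
    return needs_review, accepted
```

The mean of all scores in a project would be one simple aggregate to monitor for the system-level issues mentioned above.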

I'd suggest the ModelFront API.

Full-disclosure: I'm a co-founder of ModelFront.

Increase music volume during silence

We know where each track is silent (since we add it ourselves) and we can create a numpy array full of the value to scale the music by.
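A minimal sketch of such a gain array, assuming the music is a mono float numpy array and the speech positions are known as sample index pairs. The function names and the default gain values are illustrative assumptions, not existing Oratio code.

```python
import numpy as np


def music_gain_mask(n_samples, speech_spans, speech_gain=0.3, silence_gain=1.0):
    """Return a per-sample gain array for the music track.

    speech_spans is a list of (start_sample, end_sample) pairs marking
    where speech is playing; everything else counts as silence.
    """
    mask = np.full(n_samples, silence_gain, dtype=np.float32)
    for start, end in speech_spans:
        # Clamp both ends into [0, n_samples] so bad spans cannot wrap around.
        lo = min(max(start, 0), n_samples)
        hi = min(max(end, 0), n_samples)
        mask[lo:hi] = speech_gain
    return mask


def apply_gain(music, mask):
    # Element-wise scaling; music and mask must have the same length.
    return music * mask
```

A hard step between the two gain values may be audible, so a short fade at each boundary is a likely refinement.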

Allow per language providers

Some providers do better on certain languages or offer better options. For example, Google has a lot of female-only voices.

Write helper function to improve locale/language distinction

The project class should have a helper function which takes

locale or language
include input or only targets

example uses:

for locale in self.get_targets(type=LOCALE, include_input=True)

for lang_code in self.get_targets(type=LANGUAGE_CODE, include_input=False)
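One possible shape for this helper is sketched below. It assumes locales look like "en-US" and that the language code is the part before the first hyphen. The Project and TargetType names and their attributes are hypothetical stand-ins for the real project class.

```python
from enum import Enum
from typing import List


class TargetType(Enum):
    LOCALE = "locale"
    LANGUAGE_CODE = "language_code"


class Project:
    """Hypothetical stand-in for the real project class."""

    def __init__(self, input_locale: str, target_locales: List[str]):
        self.input_locale = input_locale
        self.target_locales = target_locales

    def get_targets(
        self, type: TargetType = TargetType.LOCALE, include_input: bool = False
    ) -> List[str]:
        # The parameter is named `type` to match the usage in the issue text.
        locales = list(self.target_locales)
        if include_input:
            locales.insert(0, self.input_locale)
        if type is TargetType.LOCALE:
            return locales
        # Reduce "es-MX" to "es" and drop duplicates while keeping order.
        codes: List[str] = []
        for locale in locales:
            code = locale.split("-")[0]
            if code not in codes:
                codes.append(code)
        return codes
```

Deduplicating language codes is a design choice here: two Spanish locales would otherwise trigger the same translation twice.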

Add better test coverage

Especially the AWS implementations aren't tested.

Add background noise to tts

Making the TTS sound more natural is 90% of our product. Even with the recent developments, we are dropping all background audio while the TTS is playing. We could add slight background tracks for cases where background music isn't provided to us.

Low priority, should go with a general rework/clarification of how we handle sound and track generation.

Audio only files are broken

Audio only has several broken features in config.py.

I'll investigate further to see where it is broken. My hunch is that some of the recent original-sentences work should be treated as video only.

It could be related to the video-only flags: either config.py is not properly accounting for them, or project.py is not properly abstracted.

Add caption input/output

We will want to allow captions to be input with the video (since some YouTubers will have those) and also output captions in the foreign languages.
These are stored in SRT files, which have the following format:

<section number>
<start time> --> <end time>
<Caption text
Can be multiple lines>

Example:

1
00:00:00,000 --> 00:00:04,400
Here is the first caption of the video.

2
00:00:06,200 --> 00:01:00,000
Then we have a really long
caption that takes up most of the video.
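Reading and writing this format can be sketched in a few lines. The Caption class and the function names below are illustrative assumptions rather than existing Oratio code, and the parser only handles the plain layout shown above.

```python
import re
from dataclasses import dataclass
from typing import List

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


@dataclass
class Caption:
    index: int
    start: float  # seconds from the start of the video
    end: float
    text: str  # may contain newlines


def parse_time(stamp: str) -> float:
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    h, m, s, ms = (int(g) for g in match.groups())
    return h * 3600 + m * 60 + s + ms / 1000


def format_time(seconds: float) -> str:
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def parse_srt(content: str) -> List[Caption]:
    captions = []
    # Sections are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip incomplete sections
        start, end = lines[1].split("-->")
        captions.append(
            Caption(int(lines[0]), parse_time(start), parse_time(end), "\n".join(lines[2:]))
        )
    return captions


def write_srt(captions: List[Caption]) -> str:
    blocks = [
        f"{c.index}\n{format_time(c.start)} --> {format_time(c.end)}\n{c.text}"
        for c in captions
    ]
    return "\n\n".join(blocks) + "\n"
```

For output captions, the translated text would replace Caption.text while the index and timings are carried over or re-timed to the synthesized audio.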

Tag individual sentences as speakers

We need to start considering multispeaker transitions.
One step here would be tagging each sentence with an identity that tracks which synthesis model is being used (e.g. en-US-wavenet-A, en-US-wavenet-B or AWS Dave). This will allow for manual implementation of multispeaker videos before we start working with people.
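A minimal sketch of what the tagging could look like, assuming a sentence is a small record with timing information. The Sentence class, its fields, and group_by_voice are hypothetical. The voice names are the ones mentioned above and are treated as opaque strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sentence:
    text: str
    start: float  # seconds
    end: float
    # Identity of the synthesis voice; set by hand for multispeaker videos.
    voice_id: str = "en-US-wavenet-A"


def group_by_voice(sentences: List[Sentence]) -> Dict[str, List[Sentence]]:
    """Bucket sentences per voice so each voice can be synthesized together."""
    groups: Dict[str, List[Sentence]] = defaultdict(list)
    for sentence in sentences:
        groups[sentence.voice_id].append(sentence)
    return dict(groups)
```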

Experiment with gcloud translation apis

https://cloud.google.com/translate/docs/reference/rest/v2/translate

Some features to play with

Batch Translation
https://cloud.google.com/translate/docs/advanced/batch-translation
Glossaries
https://cloud.google.com/translate/docs/advanced/glossary
AutoML model
https://cloud.google.com/translate/docs/advanced/translating-text-v3#automl-model

Add AWS translate API

Mimic input audio volume

Listening to some vloggers, they often modulate the volume of their voice - could be cool to add this feature.

An initial step would be tagging sentences at 3 different volume levels and applying numpy masks.
Bigger steps:
If we can tag word-to-word or phrase-to-phrase translations, we can better connect these.
If we can quantify the amplitude of the input audio, we could make a continuous volume mask.
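The continuous mask could be approximated with a windowed RMS envelope of the input audio, stretched to the length of the synthesized audio. This is only a sketch under stated assumptions: mono float audio, non-overlapping windows, and a hypothetical floor parameter that keeps quiet passages audible. None of these names exist in Oratio.

```python
import numpy as np


def rms_envelope(audio, window):
    """RMS amplitude of each non-overlapping window of `window` samples."""
    n_windows = len(audio) // window
    frames = np.asarray(audio[: n_windows * window], dtype=np.float64)
    frames = frames.reshape(n_windows, window)
    return np.sqrt((frames ** 2).mean(axis=1))


def volume_mask(input_audio, n_out, window=1024, floor=0.5):
    """Per-sample gain in [floor, 1] that follows the input loudness."""
    env = rms_envelope(input_audio, window)
    if env.size == 0 or env.max() == 0:
        return np.ones(n_out)  # too short or silent: leave volume unchanged
    gains = floor + (1.0 - floor) * (env / env.max())
    # Stretch the window-level gains over the output length.
    x_out = np.linspace(0, len(gains) - 1, n_out)
    return np.interp(x_out, np.arange(len(gains)), gains)
```

Multiplying the synthesized audio by this mask applies the modulation. Stretching over the whole clip assumes the input and output are roughly aligned in time, which per-sentence masks would handle better.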

Allow for more precise timing

AWS gives us hundredths of a second for STT. We should make that our standard and let gcloud simply be slightly less precise, instead of truncating the AWS timings.
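One way to standardize on hundredths is to store every timestamp as an integer number of centiseconds, whatever precision the provider reported. The helpers below are a hypothetical sketch and assume the provider timestamps have already been extracted as numbers or numeric strings in seconds.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_centiseconds(seconds) -> int:
    """Convert seconds (str, int or float) to whole hundredths of a second."""
    # Going through str avoids binary float artifacts such as 1.3 -> 1.29999...
    value = Decimal(str(seconds))
    return int((value * 100).to_integral_value(rounding=ROUND_HALF_UP))


def to_seconds(centiseconds: int) -> float:
    return centiseconds / 100
```

A coarser provider value such as 1.3 becomes 130, and a finer one such as "0.04" becomes 4, so neither needs to be truncated to match the other.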