Automatic Prosody Annotation with Pre-Trained Text-Speech Model

This is the official PyTorch implementation of the following paper:

Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li, Deng Cai, Dong Yu

Abstract: Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. The experimental results on both automatic evaluation and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of automatic prosodic boundary annotations is comparable to human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems that use manual ones.

Visit our demo page for audio samples.

This repository provides our prosody estimation model (with Conformer-Char as the audio encoder) and the code for inference. The training code is not released due to the company's confidentiality policy; please refer to our paper for technical details.

Getting Started

Dependencies

For data preparation in STEP I, you need to install Kaldi for feature extraction. For inference in STEP II, a few Python packages are required (Python 3.6). Install them with

pip install -r requirements.txt

I. Data Preparation

The audio data must be preprocessed before being fed into the model. First, install the Kaldi toolkit. Then extract audio features with Kaldi using the following commands. The 80-dimensional FBank features and 3-dimensional pitch features are concatenated to form the input feature.

echo "input raw_audio.wav" > tmp.scp
compute-fbank-feats --num-mel-bins=80 scp:tmp.scp ark:fbk.ark
compute-kaldi-pitch-feats scp:tmp.scp ark:- | process-kaldi-pitch-feats ark:- ark:pitch.ark
paste-feats --length-tolerance=3 ark:fbk.ark ark:pitch.ark ark,scp:feature.ark,feature.scp

These commands save the extracted features in feature.ark, which can be read through the index file feature.scp.
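To process more than one recording, the single-line tmp.scp above can be generated for a whole folder, and the resulting feature.scp can be inspected from Python. The following is a minimal sketch, not part of the released code. The helper names write_wav_scp and read_feature_scp are hypothetical, and the sketch assumes the usual Kaldi text layout of one "<utterance-id> <location>" entry per line, where entries written with "ark,scp:" point to "<ark path>:<byte offset>". It only reads the index and does not decode the feature matrices themselves.

```python
from pathlib import Path

# 80 FBank dimensions + 3 pitch dimensions, as described in the text above.
FEATURE_DIM = 80 + 3


def write_wav_scp(wav_dir, scp_path):
    """Write one '<utt_id> <wav_path>' line per .wav file found in wav_dir.

    Hypothetical helper: the utterance ID is assumed to be the file name
    without its extension.
    """
    wav_files = sorted(Path(wav_dir).glob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as f:
        for wav in wav_files:
            f.write(f"{wav.stem} {wav}\n")
    return [wav.stem for wav in wav_files]


def read_feature_scp(scp_path):
    """Map each utterance ID to (ark_path, byte_offset) from a .scp index.

    Hypothetical helper: if an entry has no numeric offset, the whole
    location string is kept and the offset is None.
    """
    index = {}
    with open(scp_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, location = line.split(maxsplit=1)
            # Split on the last ':' so that paths containing ':' still work.
            ark_path, sep, offset = location.rpartition(":")
            if sep and offset.isdigit():
                index[utt_id] = (ark_path, int(offset))
            else:
                index[utt_id] = (location, None)
    return index
```

For example, write_wav_scp("wavs", "tmp.scp") would list every .wav file under a folder named wavs (an assumed name), after which the Kaldi commands above can be run unchanged. Calling read_feature_scp("feature.scp") afterwards gives a quick check that every utterance received a feature entry.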

II. Inference

Download the project, put the feature files mentioned above in the folder "data", then run the inference code with

python code/main.py

The results will be stored in "prediction_save/test.txt". Note that the labels 0-5 correspond to CC, LW, PW, PPH, and IPH in the paper, respectively.
