IPA Transcriber - Auto-transcribe arbitrary languages into phonemic IPA

This Ruby script leverages the IPA dictionary databases from the ipa-dict project to automatically convert orthographic text in a variety of languages into phonemic transcription in the International Phonetic Alphabet (IPA).

Take the following text (from here) for example:

Goat, Dog, and Cow were great friends. One day they went on a journey in a taxi.

Using the English (US) dictionary, this would be converted to:

ˈɡoʊt ˈdɔɡ ˈænd ˈkaʊ ˈwɝ ˈɡɹeɪt ˈfɹɛndz ˈwən ˈdeɪ ˈðeɪ ˈwɛnt ˈɑn ˈeɪ ˈdʒɝni ˈɪn ˈeɪ ˈtæksi

Results can be adjusted by supplying an optional custom vocabulary list (the -w option described below).
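The word-by-word lookup described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names parse_dictionary and transcribe are hypothetical, and the dictionary line format (a word, a tab, then one or more pronunciations wrapped in slashes) is an assumption about the ipa-dict data files.

```ruby
# Hypothetical sketch: the real ipa_transcriber.rb may be structured differently.

# Parse dictionary lines of the assumed form "word<TAB>/ipa/" into a Hash.
def parse_dictionary(lines)
  lines.each_with_object({}) do |line, dict|
    word, ipa = line.chomp.split("\t", 2)
    next if word.nil? || ipa.nil? || ipa.empty?
    # Keep only the first pronunciation when several are listed, and drop the slashes.
    dict[word.downcase] = ipa.split(",").first.strip.delete("/")
  end
end

# Replace each word with its dictionary entry; unknown words are left unchanged.
def transcribe(text, dict)
  text.scan(/[[:alpha:]']+/).map { |w| dict.fetch(w.downcase, w) }.join(" ")
end

sample = ["goat\t/ˈɡoʊt/", "dog\t/ˈdɔɡ/", "and\t/ˈænd/", "cow\t/ˈkaʊ/"]
dict = parse_dictionary(sample)
puts transcribe("Goat, Dog, and Cow", dict)   # ˈɡoʊt ˈdɔɡ ˈænd ˈkaʊ
```

The sketch lowercases with the built-in String#downcase for brevity, and it simply drops punctuation, which matches the sample output shown above.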

Requirements

Ruby's built-in .upcase and .downcase methods did not support non-ASCII text before Ruby 2.4, so this script requires the alternative versions provided by the UnicodeUtils package:

gem install unicode_utils
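As a rough illustration of why Unicode-aware case conversion matters when matching words against a dictionary, the sketch below lowercases non-ASCII text. The helper name normalize_case is hypothetical, and the fallback to the built-in String#downcase (which handles Unicode on Ruby 2.4 and later) is an assumption added here so the example runs even without the gem.

```ruby
# Prefer UnicodeUtils when the gem is installed; otherwise fall back to the
# built-in String#downcase, which handles Unicode on Ruby 2.4 and later.
begin
  require "unicode_utils/downcase"
rescue LoadError
  # Gem not installed: the fallback branch below is used instead.
end

# normalize_case is a hypothetical helper name, not part of the script.
def normalize_case(word)
  if defined?(UnicodeUtils)
    UnicodeUtils.downcase(word)
  else
    word.downcase
  end
end

puts normalize_case("ÉCOLE")   # école
puts normalize_case("Ärger")   # ärger
```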

Usage

To convert some text in a file, just execute ipa_transcriber.rb and provide an input file and IPA dictionary to use as a basis for conversion:

./ipa_transcriber.rb -f [TEXTFILE] -i [DICTIONARY]

See below for details on command-line options and example invocations.

Options

The following options are available. The -f and -i options are mandatory, but -w is optional:

  • -f, --filename FILE: Source file (specify a source text file to convert)
  • -i, --ipa-dict DICT: IPA dictionary file (specify the location of the IPA dictionary file to use for the language to convert from)
  • -w, --wordlist LIST: Optional custom word list (an additional list of words and IPA pronunciations to use for words that don't match the provided dictionary file -- e.g., proper names, nonce words, loanwords, etc.)
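For readers curious how options like these are typically handled in Ruby, here is a minimal sketch using the standard library's OptionParser. The helper name parse_options and the choice to raise ArgumentError for missing mandatory options are assumptions made for illustration; the script itself may do this differently.

```ruby
require "optparse"

# parse_options is a hypothetical helper; option names follow the list above.
def parse_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ipa_transcriber.rb -f FILE -i DICT [-w LIST]"
    opts.on("-f", "--filename FILE", "Source file") { |v| options[:filename] = v }
    opts.on("-i", "--ipa-dict DICT", "IPA dictionary file") { |v| options[:ipa_dict] = v }
    opts.on("-w", "--wordlist LIST", "Optional custom word list") { |v| options[:wordlist] = v }
  end
  parser.parse!(argv.dup)

  # -f and -i are mandatory; -w may be omitted.
  missing = [:filename, :ipa_dict].reject { |key| options.key?(key) }
  raise ArgumentError, "missing option(s): #{missing.join(', ')}" unless missing.empty?
  options
end

p parse_options(%w[-f story.txt -i en_US.txt])
```

Passing only -f in this sketch raises an error naming the missing ipa_dict option, which mirrors the rule that both -f and -i must be given.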

Examples

The following examples assume that you have cloned, or downloaded and extracted, the ipa-dict repository to your home folder.

Transcribe some English (US) text into IPA:

./ipa_transcriber.rb -f ~/english.txt -i ~/ipa-dict/data/en_US.txt

Transcribe some French (Standard) text into IPA:

./ipa_transcriber.rb -f ~/french.txt -i ~/ipa-dict/data/fr_FR.txt

Transcribe some Japanese text into IPA:

./ipa_transcriber.rb -f ~/japanese.txt -i ~/ipa-dict/data/ja.txt

Notes

  • The automated IPA transcription will generally need manual tweaking, both to disambiguate homographs (e.g., "read" or "bow") and to fill in words not found in the IPA dictionary. Some of this work can be reduced by using the -w option to supply a custom list of special words used in a particular text.
  • Languages whose orthographies do not use spaces to separate words (such as Chinese and Japanese) will need to be manually spaced before converting to IPA. There are tools available that can automate this process to some extent, but their results will need to be carefully reviewed as parsing errors are common.
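One way to approach the manual review described in these notes is to list the words that neither the dictionary nor the custom word list covers. The sketch below is only an illustration: lookup and unknown_words are hypothetical helpers, the sample entries are made up, and treating the custom list as a fallback after the main dictionary is an assumption based on the description of -w.

```ruby
# Hypothetical helpers; both hashes map a lowercase word to its IPA string.

# Look a word up in the main dictionary first, then in the custom word list.
def lookup(word, dict, custom = {})
  key = word.downcase
  dict[key] || custom[key]
end

# Collect words found in neither source, so they can be reviewed by hand
# or added to a custom word list.
def unknown_words(text, dict, custom = {})
  text.scan(/[[:alpha:]']+/)
      .reject { |w| lookup(w, dict, custom) }
      .map(&:downcase)
      .uniq
end

# Illustrative entries only, not taken from any real dictionary file.
dict = { "went" => "ˈwɛnt", "to" => "ˈtu" }
custom = { "nairobi" => "naɪˈɹoʊbi" }
p unknown_words("Goat went to Nairobi", dict, custom)   # ["goat"]
```

Homographs cannot be detected this way, since they do appear in the dictionary; they still need to be checked against the context by hand.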

Contributing

This project was developed to support the creation of Storybooks Speech and Hearing, and has been used to convert a corpus of stories in more than a dozen languages. PRs and other contributions to expand functionality for other use cases are more than welcome!

To do

  • Allow for one-off conversion of text on the command-line
  • Handle conversion of text from STDIN
  • Add config file to allow setting default language

License

MIT.

Contributors

dohliam