Code Monkey home page Code Monkey logo

h2p-parser's Introduction

Heteronym to Phoneme Parser

Python package CodeQL codecov FOSSA Status

Fast parsing of English Heteronyms to Phonemes using contextual part-of-speech.

Provides the ability to convert heteronym graphemes to their phonetic pronunciations.

Designed to be used in conjunction with other fixed grapheme-to-phoneme dictionaries such as CMUdict

This package also offers a combined Grapheme-to-Phoneme dictionary, combining the functionality of fixed lookups handled by CMUdict and context-based parsing as offered by this module.

Usage

1. Combined Grapheme-to-Phoneme dictionary

The CMUDictExt class combines a pipeline for context-based heteronym parsing to phonemes and a fixed dictionary lookup replacement using the CMU Pronouncing Dictionary.

Example:

from h2p_parser.cmudictext import CMUDictExt

CMUDictExt = CMUDictExt()

# Parsing replacements for a line. This can be one or more sentences.
line = CMUDictExt.convert("The cat read the book. It was a good book to read.")
# -> "{DH AH0} {K AE1 T} {R EH1 D} {DH AH0} {B UH1 K}. {IH1 T} {W AA1 Z} {AH0} {G UH1 D} {B UH1 K} {T UW1} {R IY1 D}."

Additional optional parameters are available when defining a CMUDictExt instance:

Parameter Type Default Value Description
cmu_dict_path str None Path to a custom CMUDict file in .txt format
h2p_dict_path str None Path to a custom H2p Dictionary file in .json format. See the example.json for the expected format.
cmu_multi_mode int 0 Default selection index for CMUDict entries with multiple pronunciations as donated by the (1) or (n) format
process_numbers bool True Toggles conversion of some numbers and symbols to their spoken pronunciation forms. See numbers.py for details on what is covered.
phoneme_brackets bool True Surrounds phonetic words with curly brackets i.e. {R IY1 D}
unresolved_mode str keep Unresolved word resolution modes:
keep - Keeps the text-form word in the output.
remove - Removes the text-form word from the output.
drop - Returns the line as None if any word is unresolved.

2. Heteronym-to-Phoneme parsing only

To use only the core heteronym-to-phoneme parsing functions, without fixed dictionary support, use H2p class.

Example:

from h2p_parser.h2p import H2p

h2p = H2p(preload=True) # preload flag improves first-inference performance

# checking if a line contains a heteronym
state = h2p.contains_het("There are no heteronyms in this line.")
# -> False

# replacing a single line
line = h2p.replace_het("I read the book. It was a good book to read.")
# -> "I {R EH1 D} the book. It was a good book to {R IY1 D}."

# replacing a list of lines
lines = h2p.replace_het_list(["I read the book. It was a good book to read.",
                              "Don't just give the gift; present the present.",
                              "If you were to reject the product, it would be a reject."])
# -> ["I {R EH1 D} the book. It was a good book to {R IY1 D}.",
#     "Don't just give the gift; {P R IY0 Z EH1 N T} the {P R EH1 Z AH0 N T}.",
#     "If you were to {R IH0 JH EH1 K T} the product, it would be a {R IY1 JH EH0 K T}."]

Note: Depending on your performance requirements, there is a speed improvement for processing large text line batches by using replace_het_list() with a list of all text lines, instead of making repeated calls to replace_het(). See the performance section for more details and guidelines for optimizations.

License

The code in this project is released under Apache License 2.0.

FOSSA Status

h2p-parser's People

Contributors

fossabot avatar ionite34 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.