
g2p's Introduction


g2pE: A Simple Python Module for English Grapheme To Phoneme Conversion

  • [v.2.0] We removed TensorFlow from the dependencies. After all, it changes its APIs quite often, and we don't expect you to have a GPU. Instead, NumPy is used for inference.

This module converts English graphemes (spelling) to phonemes (pronunciation). Grapheme-to-phoneme conversion is essential in tasks such as speech synthesis. Unlike languages such as Spanish or German, where a word's pronunciation can largely be inferred from its spelling, English spelling is often a poor guide to pronunciation. The most reliable approach, then, is to consult a dictionary. However, this approach has at least two problems. First, it cannot disambiguate homographs, words with a single spelling but multiple pronunciations. (See a below.) Second, the word may simply be missing from the dictionary. (See b below.)

  • a. I refuse to collect the refuse around here. (rɪ|fju:z as verb vs. |refju:s as noun)
  • b. I am an activationist. (activationist: a newly coined word meaning "a person who designs and implements programs of treatment or therapy that use recreation and activities to help people whose functional abilities are affected by illness or disability," from WORD SPY)

For the first issue, fortunately many homographs, if not all, can be disambiguated using their part of speech. For words not in the dictionary, however, we must make our best guess using our knowledge. In this project, we employ a deep-learning seq2seq model, originally built with TensorFlow (as of v2.0, inference runs on NumPy; see the note above).

Algorithm

  1. Spells out Arabic numerals and some currency symbols (e.g. $200 -> two hundred dollars). (This is borrowed from Keith Ito's code.)
  2. Attempts to retrieve the correct pronunciation for heteronyms based on their POS.
  3. Looks up the CMU Pronouncing Dictionary for non-homographs.
  4. For OOVs, predicts their pronunciations using our neural net model.
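The lookup order in steps 2-4 can be sketched in plain Python. The function and dictionary names below are illustrative, not the module's actual internals:

```python
def g2p_pipeline(word, pos, homograph2features, cmudict, predict_oov):
    """Illustrative sketch of the lookup order above (names are hypothetical)."""
    if word in homograph2features:                 # step 2: POS-based heteronym lookup
        pron1, pron2, pos1 = homograph2features[word]
        return pron1 if pos.startswith(pos1) else pron2
    if word in cmudict:                            # step 3: CMU dictionary lookup
        return cmudict[word]
    return predict_oov(word)                       # step 4: neural net for OOVs
```

The dictionary lookups are cheap; only the OOV branch runs the neural model.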

Environment

  • python 3.x

Dependencies

  • numpy >= 1.13.1
  • nltk >= 3.2.4
  • python -m nltk.downloader "averaged_perceptron_tagger" "cmudict"
  • inflect >= 0.3.1
  • Distance >= 0.1.3

Installation

pip install g2p_en

OR

python setup.py install

The required NLTK data packages will be downloaded automatically on your first run.

Usage

from g2p_en import G2p

texts = ["I have $250 in my pocket.", # number -> spell-out
         "popular pets, e.g. cats and dogs", # e.g. -> for example
         "I refuse to collect the refuse around here.", # homograph
         "I'm an activationist."] # newly coined word
g2p = G2p()
for text in texts:
    out = g2p(text)
    print(out)
>>> ['AY1', ' ', 'HH', 'AE1', 'V', ' ', 'T', 'UW1', ' ', 'HH', 'AH1', 'N', 'D', 'R', 'AH0', 'D', ' ', 'F', 'IH1', 'F', 'T', 'IY0', ' ', 'D', 'AA1', 'L', 'ER0', 'Z', ' ', 'IH0', 'N', ' ', 'M', 'AY1', ' ', 'P', 'AA1', 'K', 'AH0', 'T', ' ', '.']
>>> ['P', 'AA1', 'P', 'Y', 'AH0', 'L', 'ER0', ' ', 'P', 'EH1', 'T', 'S', ' ', ',', ' ', 'F', 'AO1', 'R', ' ', 'IH0', 'G', 'Z', 'AE1', 'M', 'P', 'AH0', 'L', ' ', 'K', 'AE1', 'T', 'S', ' ', 'AH0', 'N', 'D', ' ', 'D', 'AA1', 'G', 'Z']
>>> ['AY1', ' ', 'R', 'IH0', 'F', 'Y', 'UW1', 'Z', ' ', 'T', 'UW1', ' ', 'K', 'AH0', 'L', 'EH1', 'K', 'T', ' ', 'DH', 'AH0', ' ', 'R', 'EH1', 'F', 'Y', 'UW2', 'Z', ' ', 'ER0', 'AW1', 'N', 'D', ' ', 'HH', 'IY1', 'R', ' ', '.']
>>> ['AY1', ' ', 'AH0', 'M', ' ', 'AE1', 'N', ' ', 'AE2', 'K', 'T', 'IH0', 'V', 'EY1', 'SH', 'AH0', 'N', 'IH0', 'S', 'T', ' ', '.']

References

If you use this code for research, please cite:

@misc{g2pE2019,
  author = {Park, Kyubyong & Kim, Jongseok},
  title = {g2pE},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2p}}
}


May, 2018.

Kyubyong Park & Jongseok Kim

g2p's People

Contributors

kyubyong, nicklambourne, ozmig77


g2p's Issues

Bug: Throws NumOutOfRangeError for numbers longer than 36 digits

A NumOutOfRangeError is thrown when numbers longer than 36 digits are converted.
This makes the system fragile when the input text is not known in advance.

To reproduce:

>>> from g2p_en import G2p
>>> g2p_en = G2p()
>>> g2p_en('999999999999999999999999999999999999')  # This works
['N', 'AY1', 'N', ' ', 'HH', 'AH1', 'N', ...]
>>> g2p_en('1000000000000000000000000000000000000')  # This doesn't
...
inflect.NumOutOfRangeError

This error stems from the inflect module, which isn't part of this project's source code.
Workaround: remove numbers longer than 36 digits instead of transforming them into words. This loses some edge-case information, but it keeps the transcribed output clean.
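The workaround can be sketched with a regular expression; the 36-digit limit here simply reflects the inflect behaviour reported above:

```python
import re

MAX_DIGITS = 36  # longest digit run inflect is reported to handle

def strip_long_numbers(text: str) -> str:
    """Drop digit runs longer than MAX_DIGITS before passing text to g2p (lossy)."""
    return re.sub(r"\d{%d,}" % (MAX_DIGITS + 1), "", text)
```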

File ('homographs.en' ) is not closed after opening by construct_homograph_dictionary

File ('homographs.en') remains open after use. Please consider using a context manager to close it automatically, as in this example: https://book.pythontips.com/en/latest/context_managers.html

The actual warning:

  File "/home/dev/.local/lib/python3.8/site-packages/g2p_en/g2p.py", line 35, in construct_homograph_dictionary
    for line in codecs.open(f, 'r', 'utf8').read().splitlines():
ResourceWarning: unclosed file <_io.BufferedReader name='/home/dev/.local/lib/python3.8/site-packages/g2p_en/homographs.en'>
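A minimal sketch of the suggested fix, assuming a pipe-separated `HEADWORD|PRON1|PRON2|POS1` layout for homographs.en; the `with` block guarantees the file is closed even if parsing raises:

```python
import codecs

def construct_homograph_dictionary(path):
    """Parse homographs.en, closing the file automatically via a context manager."""
    homograph2features = {}
    # Assumed line layout: HEADWORD|PRON1|PRON2|POS1 (pipe-separated)
    with codecs.open(path, "r", "utf8") as f:
        for line in f.read().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            headword, pron1, pron2, pos1 = line.split("|")
            homograph2features[headword.lower()] = (pron1.split(), pron2.split(), pos1)
    return homograph2features
```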

Should UW be included in the phoneme set?

It seems g2p.phonemes operates under the general rule of excluding the 'parent' category when its stressed variants exist. For example, AA is not included since its variants AA0, AA1, AA2 are in the set; the same holds for AE, AH, AW, AY, etc. UW seems to be the only exception. Furthermore, in simple frequency analyses on sizable corpora (not super rigorous), bare UW never occurs while its variants do. I wonder if the phoneme set can safely forgo UW.
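One quick way to check the claim is a small helper (not part of the package) that lists 'parent' symbols whose stressed variants are also present:

```python
def redundant_parents(phonemes):
    """Return unstressed 'parent' symbols whose stressed variants are also present."""
    stressed_bases = {p[:-1] for p in phonemes if p[-1].isdigit()}
    return sorted(p for p in phonemes if not p[-1].isdigit() and p in stressed_bases)
```

Running this on g2p.phonemes should flag exactly the exceptions described above.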

NLTK dependencies are downloaded during the import phase

Unfortunately, NLTK dependencies are downloaded at import time, which makes it very difficult to configure NLTK data paths and download directories beforehand.

I think simply moving the following code from the top level into the G2p object initialization should solve the issue.

try:
    nltk.data.find('taggers/averaged_perceptron_tagger.zip')
except LookupError:
    nltk.download('averaged_perceptron_tagger')
try:
    nltk.data.find('corpora/cmudict.zip')
except LookupError:
    nltk.download('cmudict')

Maybe we could also provide nltk_data_directory as an optional argument and pass it to the nltk.download invocations?

Apostrophe converts to space. Gives funny pronunciation in TTS

Hi I've noticed that

e.g. I'm
...converts to...
'AY1', ' ', 'AH0', 'M'

This gives a funny pronunciation in the TTS system I'm working on: it reads the space as a break instead of pronouncing the word as one unit.

Is this the normal behaviour?

Is there a workaround?

Thanks in advance.
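One possible workaround (not built into g2p_en): expand common contractions to full words before calling g2p, so every resulting word is covered by the dictionary. The expansion table below is a hypothetical, deliberately incomplete example, and note that blindly expanding "it's" vs. possessive forms can slightly change the wording:

```python
import re

# Hypothetical, incomplete expansion table; extend as needed.
CONTRACTIONS = {"i'm": "I am", "it's": "it is", "i'll": "I will", "won't": "will not"}

def expand_contractions(text):
    """Replace known contractions with their full forms before g2p conversion."""
    return re.sub(r"\w+'\w+",
                  lambda m: CONTRACTIONS.get(m.group(0).lower(), m.group(0)),
                  text)
```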

heteronym problem

Hi

I have an example heteronym, 'wind':

a) The 'wind' is blowing.

b) Let's all 'wind' down.

In both cases, g2p produces the same pronunciation for 'wind'.

Is there a problem with the POS tagging? How can we get the correct pronunciations, [W IH1 N D] (noun) and [W AY1 N D] (verb)?

Handling of abbreviations and other characters

When converting e.g. this sentence:
"TTS comes with pretrained models, tools for measuring dataset quality and already used in 20+ languages for products and research projects."

In that sentence, the abbreviation "TTS" is not converted letter by letter (instead it's converted to an incorrect word), and the "+" in "20+" is omitted.

I originally filed this here: espnet/espnet#2990 (comment) . Is this handling of abbreviations and other special characters something that could be added to g2p?
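A possible pre-processing workaround, sketched outside g2p_en itself: split all-caps tokens into single letters before conversion, so that each letter gets looked up by name in the dictionary:

```python
import re

def spell_out_acronyms(text: str) -> str:
    """Split all-caps tokens (e.g. 'TTS') into single letters before g2p."""
    return re.sub(r"\b([A-Z]{2,})\b", lambda m: " ".join(m.group(1)), text)
```

This doesn't cover the "+" case, which would need a separate symbol-to-word substitution step.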

ImportError: cannot import name 'g2p'

Hello,
Thank you for making this handy package. I installed the package using the setup.py file and tried importing it but I get an import error.

>>> from g2p_en import g2p
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/g2p_en/__init__.py", line 1, in <module>
    from g2p import g2p, Session
ImportError: cannot import name 'g2p'

I also tried uninstalling the package and installing it again using pip but the problem persisted. Am I doing something wrong?

Thanks in advance.
Matt

Model Attributes

@Kyubyong Do you have any write-up of the details of this model? Clearly it's a seq2seq model using GRU units, but do you have details on the number of layers, layer shapes, etc., and on the model's constraints?

CPU usage

I found G2p exhausts all my CPUs. Is there any way to configure resource usage?

Wrong parsing

g2p module can't parse sentences like:

g2p("HTTP")
['T', 'AE1', 'P', 'T', 'IY1']

g2p("RFC")
['R', 'EH1', 'F', 'S', 'IY1']

and can't split words like taxidriver -> taxi driver.

Is it possible to tune predict function of g2p to get a correct grapheme to phoneme conversion for abbreviations and composite words?

Issue after installing

I've installed g2p_en on my macOS system using setup.py

I created a test_g2p.py file containing the sample code provided in the README. When I run python test_g2p.py I receive the error below. I'm using python 3.7.

Any idea of how I proceed?

Traceback (most recent call last):
  File "test_g2p.py", line 1, in <module>
    from g2p_en import G2p
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/g2p_en-2.0.0-py3.7.egg/g2p_en/__init__.py", line 1, in <module>
    from .g2p import G2p
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/g2p_en-2.0.0-py3.7.egg/g2p_en/g2p.py", line 21, in <module>
    nltk.data.find('taggers/averaged_perceptron_tagger.zip')
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/data.py", line 660, in find
    return ZipFilePathPointer(p, zipentry)
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/data.py", line 506, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/data.py", line 1055, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Users/Tristan/anaconda3/lib/python3.7/zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "/Users/Tristan/anaconda3/lib/python3.7/zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Unnecessary POS Tagging causing slow speed?

Hi, thank you for this awesome library, I wish I'd discovered it months ago. The way you've set it up makes it seamless to convert to ARPA. The one thing keeping me from integrating it into my work is the speed. It takes around 1ms to translate an in-vocabulary phrase like "Hello this is a test of the public broadcasting system". The code I'm currently using, which doesn't do POS tagging and requires an extra preprocessing step for OOV items, takes 32µs for the same phrase.

Has anyone tried to optimize the library? I think the bottleneck is that nltk.pos_tag is called for all of the words, but if I'm not mistaken its result is only used when one of the words is in the homograph dictionary (homograph2features). Could the code be changed to do POS tagging only when one of the words appears in the homograph dictionary?
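The proposed optimization can be sketched like this (names are illustrative; the module's actual internals may differ):

```python
def tag_if_needed(words, homographs, pos_tagger):
    """Run the (slow) POS tagger only when the sentence contains a homograph."""
    if any(w.lower() in homographs for w in words):
        return pos_tagger(words)
    return [(w, None) for w in words]  # tags are never consulted in this case
```

For in-vocabulary sentences without homographs, this skips the tagger entirely, which is where the reported millisecond goes.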



For I'm, it's, I'll, you're, I've, I'd

>>> g2p('It\'s')
['IH1', 'T', ' ', 'EH1', 'S'] # Should be ['IH1', 'T', ' ', 'S']
>>> g2p('I\'m')
['AY1', ' ', 'AH0', 'M'] # Should be ['AY1', ' ', 'M']

Suggested mapping from contraction suffix to phoneme(s):

's -> S / Z
'll -> L
've -> V
'd -> D
're -> R
't -> T
'm -> M
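A sketch of that suggestion: attach the contraction's phonemes directly to the base word's pronunciation rather than sending the suffix through the dictionary. The voicing rule for 's here is a simplification (voiceless S after voiceless consonants, Z otherwise), and this is not part of g2p_en:

```python
# Voiceless consonants in ARPABET, used for the simplified 's voicing rule.
VOICELESS = {"P", "T", "K", "F", "TH", "S", "SH", "CH", "HH"}

SUFFIX_PHONES = {"'ll": ["L"], "'ve": ["V"], "'d": ["D"],
                 "'re": ["R"], "'t": ["T"], "'m": ["M"]}

def contraction_phones(base_phones, suffix):
    """Append a contraction's phonemes directly to the base word's pronunciation."""
    if suffix == "'s":
        last = base_phones[-1].rstrip("012")  # drop stress digits
        return base_phones + (["S"] if last in VOICELESS else ["Z"])
    return base_phones + SUFFIX_PHONES[suffix]
```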

predict homograph words phoneme

Some words have different pronunciations even with the same POS tag, for example:

YOURSELF Y_B AO1_I R_I S_I EH0_I L_I F_E
YOURSELF Y_B ER0_I S_I EH1_I L_I F_E
YOURSELF Y_B UH0_I R_I S_I EH1_I L_I F_E

(from Kaldi's LibriSpeech align_lexicon.txt)

How do I get the right pronunciation? I would greatly appreciate any help.

Train the model for new languages

Hi, thank you for making the code open source.
I want to check whether we can train the model on new languages. It would be nice if you could point me to some documentation. Thank you.

PER/WER performance of neural net model

Hi there, is there any recorded performance of PER or WER of the neural network model used on OOV words? Would be helpful in comparing the underlying approach to more recent architectures for the task of G2P.

Thanks!

Reverse g2p

Hey, maybe a stupid question but haven't been able to find any good resources.

What is the best way to map the output of this model back to readable English?

Thanks
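In the absence of a dedicated phoneme-to-grapheme model, one simple approach is to invert a pronunciation dictionary (e.g. CMUdict) and look phoneme sequences up directly; sequences not in the dictionary would still need a trained reverse model. A sketch:

```python
def build_reverse_dict(pron_dict):
    """Invert a pronunciation dictionary: phoneme tuple -> list of spellings."""
    reverse = {}
    for word, phones in pron_dict.items():
        reverse.setdefault(tuple(phones), []).append(word)
    return reverse
```

Note the inverse is one-to-many: homophones ("two", "too") share a phoneme sequence, so the lookup returns a list of candidates.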

What is each phoneme in IPA terms?

Describing the pronunciation of words is usually done with the International Phonetic Alphabet (IPA), which uses Unicode characters.
However, this package outputs ASCII characters, and there exist multiple mappings from Unicode IPA to ASCII.
Which one are you using?

The README says you are using the CMU Pronouncing Dictionary, which in turn is based on ARPABET. Can I thus safely assume you are using that subset of ARPABET for all results?
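Assuming the standard CMUdict/ARPABET correspondences, a mapping back to IPA might look like the following; the table is deliberately partial and illustrative, not shipped with the package:

```python
# Partial ARPABET-to-IPA table (standard correspondences; extend as needed).
ARPABET_TO_IPA = {"AA": "ɑ", "AE": "æ", "AH": "ʌ", "IY": "i", "UW": "u",
                  "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "HH": "h",
                  "K": "k", "P": "p", "S": "s", "T": "t", "Z": "z"}

def to_ipa(phones):
    """Strip stress digits and map each ARPABET symbol to its IPA character."""
    return "".join(ARPABET_TO_IPA[p.rstrip("012")] for p in phones)
```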

Can you put back the train algo?

I have been using this library for six months now and must say it is really useful. Good job! However, I noticed you included the training file in v1 but have since removed it. I am really interested in training my own sequences, and even the p2g direction. I can use the git history to get the earlier version, but I was hoping you could add it back in case there are any improvements.

Ved

Is `distance` required in the dependencies?

Thanks for the package, it is really convenient!
It seems to me that distance is never imported despite being listed in setup.py.
The example snippet still produces the expected results after uninstalling distance. Is it required? Can it safely be dropped?

Paper to cite?

Is there a paper I can cite for this work (in addition to adding a footnote of the Git repo)? Thanks.

Mid-word hyphens are removed, should be treated similar to spaces

For example, running G2P on "text-to-speech" returns ['T', 'EH1', 'K', 'S', 'T', 'S', 'P', 'EH2', 'K'], the same as "texttospeech", when it should return something closer to ['T', 'EH1', 'K', 'S', 'T', ' ', 'T', 'UW1', ' ', 'S', 'P', 'IY1', 'CH'], the result for "text to speech" (though the stress could use some adjustment).

Simple workaround for now: use .replace("-", " ") on the input being passed in.
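A slightly more targeted variant of that workaround replaces only hyphens that sit between word characters, leaving standalone dashes (e.g. " - ") untouched:

```python
import re

def split_hyphens(text: str) -> str:
    """Replace intra-word hyphens with spaces so each part is converted separately."""
    return re.sub(r"(?<=\w)-(?=\w)", " ", text)
```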
