
g2p's Introduction


g2pE: A Simple Python Module for English Grapheme To Phoneme Conversion

  • [v.2.0] We removed TensorFlow from the dependencies. After all, it changes its APIs quite often, and we don't expect you to have a GPU. Instead, NumPy is used for inference.

This module converts English graphemes (spelling) to phonemes (pronunciation). Grapheme-to-phoneme conversion is essential in tasks such as speech synthesis. Unlike languages such as Spanish or German, where a word's pronunciation can largely be inferred from its spelling, English spelling is often a poor guide to pronunciation. The most reliable approach, then, is to consult a dictionary. However, this approach has at least two problems. First, it cannot disambiguate homographs, words with a single spelling but multiple pronunciations. (See a below.) Second, the word may simply be missing from the dictionary. (See b below.)

  • a. I refuse to collect the refuse around here. (rɪ|fju:z as verb vs. |refju:s as noun)
  • b. I am an activationist. (activationist: a newly coined word meaning "a person who designs and implements programs of treatment or therapy that use recreation and activities to help people whose functional abilities are affected by illness or disability," from WORD SPY)

For the first issue, fortunately many homographs, if not all, can be disambiguated using their part of speech. For words not in the dictionary, however, we must make our best guess using our knowledge. In this project, we employ a deep-learning seq2seq model, originally built with TensorFlow (as of v2.0, inference runs on NumPy; see the note above).

Algorithm

  1. Spells out Arabic numerals and some currency symbols (e.g. $200 -> two hundred dollars). (This is borrowed from Keith Ito's code.)
  2. Attempts to retrieve the correct pronunciation for heteronyms based on their POS.
  3. Looks up the CMU Pronouncing Dictionary for non-homographs.
  4. For OOVs, predicts their pronunciations using our neural net model.
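The lookup order in steps 2-4 can be sketched in plain Python. The function and dictionary names below are illustrative, not the module's actual internals:

```python
def g2p_pipeline(word, pos, homograph2features, cmudict, predict_oov):
    """Illustrative sketch of the lookup order above (names are hypothetical)."""
    if word in homograph2features:                 # step 2: POS-based heteronym lookup
        pron1, pron2, pos1 = homograph2features[word]
        return pron1 if pos.startswith(pos1) else pron2
    if word in cmudict:                            # step 3: CMU dictionary lookup
        return cmudict[word]
    return predict_oov(word)                       # step 4: neural net for OOVs
```

The dictionary lookups are cheap; only the OOV branch runs the neural model.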

Environment

  • python 3.x

Dependencies

  • numpy >= 1.13.1
  • nltk >= 3.2.4
  • python -m nltk.downloader "averaged_perceptron_tagger" "cmudict"
  • inflect >= 0.3.1
  • Distance >= 0.1.3

Installation

pip install g2p_en

OR

python setup.py install

The required NLTK data packages will be downloaded automatically on your first run.

Usage

from g2p_en import G2p

texts = ["I have $250 in my pocket.", # number -> spell-out
         "popular pets, e.g. cats and dogs", # e.g. -> for example
         "I refuse to collect the refuse around here.", # homograph
         "I'm an activationist."] # newly coined word
g2p = G2p()
for text in texts:
    out = g2p(text)
    print(out)
>>> ['AY1', ' ', 'HH', 'AE1', 'V', ' ', 'T', 'UW1', ' ', 'HH', 'AH1', 'N', 'D', 'R', 'AH0', 'D', ' ', 'F', 'IH1', 'F', 'T', 'IY0', ' ', 'D', 'AA1', 'L', 'ER0', 'Z', ' ', 'IH0', 'N', ' ', 'M', 'AY1', ' ', 'P', 'AA1', 'K', 'AH0', 'T', ' ', '.']
>>> ['P', 'AA1', 'P', 'Y', 'AH0', 'L', 'ER0', ' ', 'P', 'EH1', 'T', 'S', ' ', ',', ' ', 'F', 'AO1', 'R', ' ', 'IH0', 'G', 'Z', 'AE1', 'M', 'P', 'AH0', 'L', ' ', 'K', 'AE1', 'T', 'S', ' ', 'AH0', 'N', 'D', ' ', 'D', 'AA1', 'G', 'Z']
>>> ['AY1', ' ', 'R', 'IH0', 'F', 'Y', 'UW1', 'Z', ' ', 'T', 'UW1', ' ', 'K', 'AH0', 'L', 'EH1', 'K', 'T', ' ', 'DH', 'AH0', ' ', 'R', 'EH1', 'F', 'Y', 'UW2', 'Z', ' ', 'ER0', 'AW1', 'N', 'D', ' ', 'HH', 'IY1', 'R', ' ', '.']
>>> ['AY1', ' ', 'AH0', 'M', ' ', 'AE1', 'N', ' ', 'AE2', 'K', 'T', 'IH0', 'V', 'EY1', 'SH', 'AH0', 'N', 'IH0', 'S', 'T', ' ', '.']

References

If you use this code for research, please cite:

@misc{g2pE2019,
  author = {Park, Kyubyong & Kim, Jongseok},
  title = {g2pE},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2p}}
}


May, 2018.

Kyubyong Park & Jongseok Kim

g2p's People

Contributors

kyubyong, nicklambourne, ozmig77


g2p's Issues

Bug: Throws NumOutOfRangeError for numbers longer than 36 digits

A NumOutOfRangeError is thrown when numbers longer than 36 digits are converted.
This makes the system fragile when the input text is not known in advance.

To reproduce:

>>> from g2p_en import G2p
>>> g2p_en = G2p()
>>> g2p_en('999999999999999999999999999999999999')  # This works
['N', 'AY1', 'N', ' ', 'HH', 'AH1', 'N', ...]
>>> g2p_en('1000000000000000000000000000000000000')  # This doesn't
...
inflect.NumOutOfRangeError

This error stems from the inflect module, which isn't part of this project's source code.
Workaround: remove numbers longer than 36 digits instead of transforming them into words. This loses some edge-case information, but it keeps the transcribed output clean.
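The workaround can be sketched with a regular expression; the 36-digit limit here simply reflects the inflect behaviour reported above:

```python
import re

MAX_DIGITS = 36  # longest digit run inflect is reported to handle

def strip_long_numbers(text: str) -> str:
    """Drop digit runs longer than MAX_DIGITS before passing text to g2p (lossy)."""
    return re.sub(r"\d{%d,}" % (MAX_DIGITS + 1), "", text)
```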

File ('homographs.en' ) is not closed after opening by construct_homograph_dictionary

File ('homographs.en') remains open after use. Please consider using a context manager to close it automatically, as in this example: https://book.pythontips.com/en/latest/context_managers.html

The actual warning:

  File "/home/dev/.local/lib/python3.8/site-packages/g2p_en/g2p.py", line 35, in construct_homograph_dictionary
    for line in codecs.open(f, 'r', 'utf8').read().splitlines():
ResourceWarning: unclosed file <_io.BufferedReader name='/home/dev/.local/lib/python3.8/site-packages/g2p_en/homographs.en'>
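A minimal sketch of the suggested fix, assuming a pipe-separated `HEADWORD|PRON1|PRON2|POS1` layout for homographs.en; the `with` block guarantees the file is closed even if parsing raises:

```python
import codecs

def construct_homograph_dictionary(path):
    """Parse homographs.en, closing the file automatically via a context manager."""
    homograph2features = {}
    # Assumed line layout: HEADWORD|PRON1|PRON2|POS1 (pipe-separated)
    with codecs.open(path, "r", "utf8") as f:
        for line in f.read().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            headword, pron1, pron2, pos1 = line.split("|")
            homograph2features[headword.lower()] = (pron1.split(), pron2.split(), pos1)
    return homograph2features
```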

Should UW be included in the phoneme set?

It seems g2p.phonemes operates under the general rule of excluding the 'parent' category when its stressed variants exist. For example, AA is not included since its variants AA0, AA1, AA2 are in the set; the same holds for AE, AH, AW, AY, etc. UW seems to be the only exception. Furthermore, in simple frequency analyses on sizable corpora (not super rigorous), bare UW never occurs while its variants do. I wonder if the phoneme set can safely forgo UW.
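One quick way to check the claim is a small helper (not part of the package) that lists 'parent' symbols whose stressed variants are also present:

```python
def redundant_parents(phonemes):
    """Return unstressed 'parent' symbols whose stressed variants are also present."""
    stressed_bases = {p[:-1] for p in phonemes if p[-1].isdigit()}
    return sorted(p for p in phonemes if not p[-1].isdigit() and p in stressed_bases)
```

Running this on g2p.phonemes should flag exactly the exceptions described above.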

NLTK dependencies are downloaded during the import phase

Unfortunately, NLTK dependencies are downloaded at import time, which makes it very difficult to configure NLTK data paths and download directories beforehand.

I think simply moving the following code from the top level into the G2p object initialization should solve the issue.

try:
    nltk.data.find('taggers/averaged_perceptron_tagger.zip')
except LookupError:
    nltk.download('averaged_perceptron_tagger')
try:
    nltk.data.find('corpora/cmudict.zip')
except LookupError:
    nltk.download('cmudict')

Maybe we could also provide nltk_data_directory as an optional argument and pass it to the nltk.download invocations?

Apostrophe converts to space. Gives funny pronunciation in TTS

Hi I've noticed that

e.g. I'm
...converts to...
'AY1', ' ', 'AH0', 'M'

This gives a funny pronunciation in the TTS system I'm working on: it reads the space as a break instead of pronouncing the word as one unit.

Is this the normal behaviour?

Is there a workaround?

Thanks in advance.
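One possible workaround (not built into g2p_en): expand common contractions to full words before calling g2p, so every resulting word is covered by the dictionary. The expansion table below is a hypothetical, deliberately incomplete example, and note that blindly expanding "it's" vs. possessive forms can slightly change the wording:

```python
import re

# Hypothetical, incomplete expansion table; extend as needed.
CONTRACTIONS = {"i'm": "I am", "it's": "it is", "i'll": "I will", "won't": "will not"}

def expand_contractions(text):
    """Replace known contractions with their full forms before g2p conversion."""
    return re.sub(r"\w+'\w+",
                  lambda m: CONTRACTIONS.get(m.group(0).lower(), m.group(0)),
                  text)
```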

heteronym problem

Hi

I have an example heteronym, 'wind':

a) The 'wind' is blowing.

b) Let's all 'wind' down.

In both cases, g2p produces the same pronunciation for 'wind'.

Is there a problem with the POS tagging? How can we get the correct pronunciations, [W IH1 N D] (noun) and [W AY1 N D] (verb)?

Handling of abbreviations and other characters

When converting e.g. this sentence:
"TTS comes with pretrained models, tools for measuring dataset quality and already used in 20+ languages for products and research projects."

In that sentence, the abbreviation "TTS" is not converted letter by letter (instead it's converted to an incorrect word), and the "+" in "20+" is omitted.

I originally filed this here: espnet/espnet#2990 (comment) . Is this handling of abbreviations and other special characters something that could be added to g2p?
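A possible pre-processing workaround, sketched outside g2p_en itself: split all-caps tokens into single letters before conversion, so that each letter gets looked up by name in the dictionary:

```python
import re

def spell_out_acronyms(text: str) -> str:
    """Split all-caps tokens (e.g. 'TTS') into single letters before g2p."""
    return re.sub(r"\b([A-Z]{2,})\b", lambda m: " ".join(m.group(1)), text)
```

This doesn't cover the "+" case, which would need a separate symbol-to-word substitution step.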

ImportError: cannot import name 'g2p'

Hello,
Thank you for making this handy package. I installed the package using the setup.py file and tried importing it but I get an import error.

>>> from g2p_en import g2p
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/g2p_en/__init__.py", line 1, in <module>
    from g2p import g2p, Session
ImportError: cannot import name 'g2p'

I also tried uninstalling the package and installing it again using pip but the problem persisted. Am I doing something wrong?

Thanks in advance.
Matt

Model Attributes

@Kyubyong Do you have any write-up of the details of this model? Clearly it's a seq2seq model using GRU units, but do you have details on the number of layers, layer shapes, etc., and on the model's constraints?

CPU usage

I found G2p exhausts all my CPUs. Is there any way to configure resource usage?

Wrong parsing

g2p module can't parse sentences like:

g2p("HTTP")
['T', 'AE1', 'P', 'T', 'IY1']

g2p("RFC")
['R', 'EH1', 'F', 'S', 'IY1']

and can't split words like taxidriver -> taxi driver.

Is it possible to tune predict function of g2p to get a correct grapheme to phoneme conversion for abbreviations and composite words?

Issue after installing

I've installed g2p_en on my macOS system using setup.py

I created a test_g2p.py file containing the sample code provided in the README. When I run python test_g2p.py I receive the error below. I'm using python 3.7.

Any idea of how I proceed?

Traceback (most recent call last):
  File "test_g2p.py", line 1, in <module>
    from g2p_en import G2p
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/g2p_en-2.0.0-py3.7.egg/g2p_en/__init__.py", line 1, in <module>
    from .g2p import G2p
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/g2p_en-2.0.0-py3.7.egg/g2p_en/g2p.py", line 21, in <module>
    nltk.data.find('taggers/averaged_perceptron_tagger.zip')
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/data.py", line 660, in find
    return ZipFilePathPointer(p, zipentry)
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/data.py", line 506, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/Tristan/anaconda3/lib/python3.7/site-packages/nltk/data.py", line 1055, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Users/Tristan/anaconda3/lib/python3.7/zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "/Users/Tristan/anaconda3/lib/python3.7/zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Unnecessary POS Tagging causing slow speed?

Hi, thank you for this awesome library, I wish I'd discovered it months ago. The way you've set it up makes it seamless to convert to ARPA. The one thing keeping me from integrating it into my work is the speed. It takes around 1ms to translate an in-vocabulary phrase like "Hello this is a test of the public broadcasting system". The code I'm currently using, which doesn't do POS tagging and requires an extra preprocessing step for OOV items, takes 32µs for the same phrase.

Has anyone tried to optimize the library? I think the bottleneck is that nltk.pos_tag is called for all of the words, but if I'm not mistaken its result is only used when one of the words is in the homograph dictionary (homograph2features). Could the code be changed to do POS tagging only when one of the words appears in the homograph dictionary?
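The proposed optimization can be sketched like this (names are illustrative; the module's actual internals may differ):

```python
def tag_if_needed(words, homographs, pos_tagger):
    """Run the (slow) POS tagger only when the sentence contains a homograph."""
    if any(w.lower() in homographs for w in words):
        return pos_tagger(words)
    return [(w, None) for w in words]  # tags are never consulted in this case
```

For in-vocabulary sentences without homographs, this skips the tagger entirely, which is where the reported millisecond goes.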



For I'm, it's, I'll, you're, I've, I'd

>>> g2p('It\'s')
['IH1', 'T', ' ', 'EH1', 'S'] # Should be ['IH1', 'T', ' ', 'S']
>>> g2p('I\'m')
['AY1', ' ', 'AH0', 'M'] # Should be ['AY1', ' ', 'M']

Suggested mapping from contraction suffix to phoneme(s):

's -> S / Z
'll -> L
've -> V
'd -> D
're -> R
't -> T
'm -> M
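A sketch of that suggestion: attach the contraction's phonemes directly to the base word's pronunciation rather than sending the suffix through the dictionary. The voicing rule for 's here is a simplification (voiceless S after voiceless consonants, Z otherwise), and this is not part of g2p_en:

```python
# Voiceless consonants in ARPABET, used for the simplified 's voicing rule.
VOICELESS = {"P", "T", "K", "F", "TH", "S", "SH", "CH", "HH"}

SUFFIX_PHONES = {"'ll": ["L"], "'ve": ["V"], "'d": ["D"],
                 "'re": ["R"], "'t": ["T"], "'m": ["M"]}

def contraction_phones(base_phones, suffix):
    """Append a contraction's phonemes directly to the base word's pronunciation."""
    if suffix == "'s":
        last = base_phones[-1].rstrip("012")  # drop stress digits
        return base_phones + (["S"] if last in VOICELESS else ["Z"])
    return base_phones + SUFFIX_PHONES[suffix]
```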

predict homograph words phoneme

Some words have different pronunciations even with the same POS tag, for example:

YOURSELF Y_B AO1_I R_I S_I EH0_I L_I F_E
YOURSELF Y_B ER0_I S_I EH1_I L_I F_E
YOURSELF Y_B UH0_I R_I S_I EH1_I L_I F_E

(from Kaldi's LibriSpeech align_lexicon.txt)

How do I get the right pronunciation? I would greatly appreciate any help.

Train the model for new languages

Hi, thank you for making the code open source.
I want to check whether we can train the model on new languages. It would be nice if you could point me to some documentation. Thank you.

PER/WER performance of neural net model

Hi there, is there any recorded performance of PER or WER of the neural network model used on OOV words? Would be helpful in comparing the underlying approach to more recent architectures for the task of G2P.

Thanks!

Reverse g2p

Hey, maybe a stupid question but haven't been able to find any good resources.

What is the best way to map the output of this model back to readable English?

Thanks
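In the absence of a dedicated phoneme-to-grapheme model, one simple approach is to invert a pronunciation dictionary (e.g. CMUdict) and look phoneme sequences up directly; sequences not in the dictionary would still need a trained reverse model. A sketch:

```python
def build_reverse_dict(pron_dict):
    """Invert a pronunciation dictionary: phoneme tuple -> list of spellings."""
    reverse = {}
    for word, phones in pron_dict.items():
        reverse.setdefault(tuple(phones), []).append(word)
    return reverse
```

Note the inverse is one-to-many: homophones ("two", "too") share a phoneme sequence, so the lookup returns a list of candidates.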

What is each phoneme in IPA terms?

Describing the pronunciation of words is usually done with the International Phonetic Alphabet (IPA), which uses Unicode characters.
However, this package outputs ASCII characters, and there exist multiple mappings from Unicode IPA to ASCII.
Which one are you using?

The README says you are using the CMU Pronouncing Dictionary, which in turn is based on ARPABET. Can I thus safely assume you are using that subset of ARPABET for all results?
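Assuming the standard CMUdict/ARPABET correspondences, a mapping back to IPA might look like the following; the table is deliberately partial and illustrative, not shipped with the package:

```python
# Partial ARPABET-to-IPA table (standard correspondences; extend as needed).
ARPABET_TO_IPA = {"AA": "ɑ", "AE": "æ", "AH": "ʌ", "IY": "i", "UW": "u",
                  "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "HH": "h",
                  "K": "k", "P": "p", "S": "s", "T": "t", "Z": "z"}

def to_ipa(phones):
    """Strip stress digits and map each ARPABET symbol to its IPA character."""
    return "".join(ARPABET_TO_IPA[p.rstrip("012")] for p in phones)
```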

Can you put back the train algo?

I have been using this library for six months now and must say it is really useful. Good job! However, I noticed you included the training file in v1 but have since removed it. I am really interested in training my own sequences, and even the p2g direction. I can use the git history to get the earlier version, but I was hoping you could add it back in case there are any improvements.

Ved

Is `distance` required in the dependencies?

Thanks for the package, it is really convenient!
It seems to me that distance is never imported despite being listed in setup.py.
The example snippet still produces the expected results after uninstalling distance. Is it required? Can it safely be dropped?

Paper to cite?

Is there a paper I can cite for this work (in addition to adding a footnote of the Git repo)? Thanks.

Mid-word hyphens are removed, should be treated similar to spaces

For example, running G2P on "text-to-speech" returns ['T', 'EH1', 'K', 'S', 'T', 'S', 'P', 'EH2', 'K'], the same as "texttospeech", when it should return something closer to ['T', 'EH1', 'K', 'S', 'T', ' ', 'T', 'UW1', ' ', 'S', 'P', 'IY1', 'CH'], the result for "text to speech" (though the stress could use some adjustment).

Simple workaround for now: use .replace("-", " ") on the input being passed in.
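A slightly more targeted variant of that workaround replaces only hyphens that sit between word characters, leaving standalone dashes (e.g. " - ") untouched:

```python
import re

def split_hyphens(text: str) -> str:
    """Replace intra-word hyphens with spaces so each part is converted separately."""
    return re.sub(r"(?<=\w)-(?=\w)", " ", text)
```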
