jacksonllee / pycantonese Goto Github PK

Cantonese Linguistics and NLP

License: MIT License

Python 100.00%

pycantonese cantonese computational-linguistics natural-language-processing nlp linguistics python jyutping stop-words word-segmentation

pycantonese's Introduction

PyCantonese: Cantonese Linguistics and NLP in Python

Full Documentation: https://pycantonese.org

PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):

Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Parsing Cantonese text
Stop words
Word segmentation
Part-of-speech tagging

Download and Install

To download and install the stable, most recent version:

$ pip install --upgrade pycantonese

Ready for more? Check out the Quickstart page.

Consulting

If your team would like professional assistance in using PyCantonese, freelance consulting and training services are available for both academic and commercial groups. Please email Jackson L. Lee.

Support

If you have found PyCantonese useful and would like to offer support, buying me a coffee would go a long way!

How to Cite

PyCantonese is authored and maintained by Jackson L. Lee.

Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. PyCantonese: Cantonese Linguistics and NLP in Python. Proceedings of the 13th Language Resources and Evaluation Conference.

@inproceedings{lee-etal-2022-pycantonese,
   title = "PyCantonese: Cantonese Linguistics and NLP in Python",
   author = "Lee, Jackson L.  and
      Chen, Litong  and
      Lam, Charles  and
      Lau, Chaak Ming  and
      Tsui, Tsz-Him",
   booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
   month = june,
   year = "2022",
   publisher = "European Language Resources Association",
   language = "English",
}

License

MIT License. Please see LICENSE.txt in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.

The rime-cantonese data (release 2021.05.16) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.

Logo

The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).

Acknowledgments

Wonderful resources with a permissive license that have been incorporated into PyCantonese:

HKCanCor
rime-cantonese

Individuals who have contributed pull requests, bug reports, and other feedback (in alphabetical order of last names):

@cathug
Francis Bond
Jenny Chim
Eric Dong
@g-traveller
@graphemecluster
Rachel Han
Ryan Lai
@laubonghaudoi
Katrina Li
Kevin Li
@ZhanruiLiang
Hill Ma
@richielo
@rylanchiu
Stephan Stiller
Robin Yuen

Changelog

Please see CHANGELOG.md.

Setting up a Development Environment

The latest code under development is available on GitHub at jacksonllee/pycantonese. To obtain this version for experimental features or for development:

$ git clone https://github.com/jacksonllee/pycantonese.git
$ cd pycantonese
$ pip install -e ".[dev]"

To run tests and styling checks:

$ pytest
$ flake8 src tests
$ black --check src tests

To build the documentation website files:

$ python docs/source/build_docs.py

pycantonese's People

Contributors

Stargazers

Watchers

Forkers

killvung iou2much standby hkbu-kennycheng algoricky beethovenvirus x-ccs irisyanfguo utmcontent frank-zzm kwx4github j-chim willchen05 chunyfong hkbu-victor zausiu chaaklau canclid computational-linguistics-research xbsdsongnan wendonggan cheenid judyfong ebell495 yingfenging laplacekorea rizaziz allensmile fighting41love zhanruiliang edong zhengwuma russelluo odoochain amorjnyh alexisszabo maxmax2016 ishine

pycantonese's Issues

more transparent function names needed

The function names jyutping and yale do not seem to reflect what they do and---in particular---what the input and output are. Perhaps better and more transparent function names should be used.

(h.t. Stephan Stiller)

Parse chinese character to jyutping

I've gone through the documentation but did not find how to convert Chinese sentence to jyutping. For example something like this.

import pycantonese as pc

pc.parse_to_jyutping("我係香港人")
'ngo5 hai6 hoeng1 gong2 jan4'

hkcancor search result is empty

`import pycantonese as pc

corpus = pc.hkcancor()
print(len(corpus.words()))
print(len(corpus.characters()))

aa = corpus.search(nucleus='aa')
print(len(aa))
`

I tried build-in hkcancor corpus, but the result is empty:

149781 -> len(corpus.words())
0 -> len(corpus.characters())
0 -> len(aa)

I am using pycantonese 2.0.0 version
Could you please have a look :)

About token

Your work is really great! I want to do some LDA work on the Cantonese data. Can I use your library to finish the tokenizing work? How? Thanks a lot!

Windows Installer for 2.0

Hi there,

Any chance for a windows installer?

unable to run file directly

Describe the bug
running in python environment is ok but not when it is in a py file

To reproduce
running "python cantonese.py" will result in error/crash, while cantonese.py only has 2 lines:

import pycantonese
pycantonese.characters_to_jyutping('香港人講廣東話') # Hongkongers speak Cantonese

but directly running in python environment was ok:

Expected behavior
A clear and concise description of what you expected to happen instead.

Screenshots
If applicable, add screenshots to help explain your problem.

System (please complete the following information):

Operating System: [e.g. Windows, MacOS, Linux]
Windows 11 and Windows 10
PyCantonese version: [What you see after you run import pycantonese; print(pycantonese.__version__)]
pycantonese 3.4.0

Additional context
Add any other context about the problem here.

misplaced tone mark for Yale "ng"

In Jyutping-to-Yale conversion, the tone mark on syllabic "ng" is misplaced. Currently:

>>> pc.yale("ng5")
['ńgh']

The tone diacritic is now on "n", but it should be on "g" based on the usual Yale convention.

(h.t. Stephan Stiller)

Can't distinguish high falling tone from high level tone

Officially, Yale romanization should distinguish high level tone ā from high falling tone à (jyutping doesn't). Would it be possible to make pycantonese do that?

[Feature Request] Caching after calling `.read_chat(url)`

Feature you are interested in and your specific question(s):
While using .read_chat(url), the ZIP file is downloaded, extracted and parsed every time the function is executed.
Execution time and download time can be saved by caching the files in a local folder like ~/.cache/pycantonese/chatdata/, just like HuggingFace's .from_pretrained(model) and datasets.load_dataset() (and many other similar functions).

What you are trying to accomplish with this feature or functionality:
Decrease execution time, Increase performance.

Additional context:

Looking for "name conversion"

Hello,

please excuse any ignorance on the topic of romanization of Cantonese, as I neither know the language nor it's pronunciation rules. My use case is merely transliterating actor and role names for movie and drama series purposes.

I've tried several packages, but I keep seeing the same differences. For these romanizations, tones are not used, but this is minor. Right now I have only one example at hand, because I don't use it often and forgot about the earlier cases, but I will try to find more if this is not a "structural" problem, but specific to the character.

The example is Bruce Leung / Leung Siu Lung which comes up as:

In [4]: pc.characters2jyutping('梁小龍')
Out[4]: ['loeng4', 'siu2', 'lung4']

The loeng4 versus leung puzzles me and I can also not find any sources / documentation that could explain the difference to me, because basically I don't know what to search for :) As a speaker of Dutch, English and German, in none of the languages I would pronounce loeng or leung the same.

Are these name conversions based on different rules or even a different system entirely?

Documentation links to the old hkcancor site, but it has been moved to github.

Describe the bug

The tagset documentation links to a site which no long exists.
http://compling.hss.ntu.edu.sg/hkcancor/

It should link to:

https://github.com/fcbond/hkcancor

To reproduce
Click on the link http://compling.hss.ntu.edu.sg/hkcancor/

Expected behavior
It should go to the new URL

Screenshots
If applicable, add screenshots to help explain your problem.

System (please complete the following information):

Operating System: [e.g. Windows, MacOS, Linux]
PyCantonese version: [What you see after you run import pycantonese; print(pycantonese.__version__)]

Additional context
Add any other context about the problem here.

grep -R compling *
CHANGELOG.md:* [The Hong Kong Cantonese Corpus](http://compling.hss.ntu.edu.sg/hkcancor/) is included in the package.
docs/tutorials/lee-pycantonese-2021-05-16.ipynb:    "PyCantonese is shipped with the [Hong Kong Cantonese Corpus](http://compling.hss.ntu.edu.sg/hkcancor/) (HKCanCor, CC BY license). We are going to use this corpus a lot in this tutorial."
docs/tutorials/lee-pycantonese-2021-05-16.ipynb:    "* `pos`: part-of-speech tag (see the [HKCanCor documentation](http://compling.hss.ntu.edu.sg/hkcancor/) for the POS tagset)\n",
docs/tutorials/lee-pycantonese-2021-05-16.ipynb:    "1. What is the part-of-speech tag for classifiers. Check the [HKCanCor documentation](http://compling.hss.ntu.edu.sg/hkcancor/).\n",
docs/searches.html:<p>For the part-of-speech tagset used by HKCanCor, see <a class="reference external" href="http://compling.hss.ntu.edu.sg/hkcancor/">here</a>.</p>
docs/changelog.html:<li><p><a class="reference external" href="http://compling.hss.ntu.edu.sg/hkcancor/">The Hong Kong Cantonese Corpus</a> is included in the package.</p></li>
docs/generated/pycantonese.pos_tagging.hkcancor_to_ud.html:are described at <a class="reference external" href="http://compling.hss.ntu.edu.sg/hkcancor/">http://compling.hss.ntu.edu.sg/hkcancor/</a>).
docs/generated/pycantonese.pos_tag.html:<a class="reference external" href="http://compling.hss.ntu.edu.sg/hkcancor/">http://compling.hss.ntu.edu.sg/hkcancor/</a>.</p></li>
docs/source/changelog.rst:* `The Hong Kong Cantonese Corpus <http://compling.hss.ntu.edu.sg/hkcancor/>`_ is included in the package.
docs/source/pos_tagging.rst:(`46 of which are described <http://compling.hss.ntu.edu.sg/hkcancor/>`_).
docs/source/searches.rst:For the part-of-speech tagset used by HKCanCor, see `here <http://compling.hss.ntu.edu.sg/hkcancor/>`_.
docs/source/data.rst:`Hong Kong Cantonese Corpus <http://compling.hss.ntu.edu.sg/hkcancor/>`_
docs/_modules/pycantonese/pos_tagging/hkcancor_to_ud.html:<span class="c1"># HKCanCor tagset: http://compling.hss.ntu.edu.sg/hkcancor/</span>
docs/_modules/pycantonese/pos_tagging/hkcancor_to_ud.html:<span class="sd">    are described at http://compling.hss.ntu.edu.sg/hkcancor/).</span>
docs/_modules/pycantonese/pos_tagging/tagger.html:<span class="sd">          http://compling.hss.ntu.edu.sg/hkcancor/.</span>
docs/_sources/data.rst.txt:`Hong Kong Cantonese Corpus <http://compling.hss.ntu.edu.sg/hkcancor/>`_
docs/_sources/changelog.rst.txt:* `The Hong Kong Cantonese Corpus <http://compling.hss.ntu.edu.sg/hkcancor/>`_ is included in the package.
docs/_sources/pos_tagging.rst.txt:(`46 of which are described <http://compling.hss.ntu.edu.sg/hkcancor/>`_).
docs/_sources/searches.rst.txt:For the part-of-speech tagset used by HKCanCor, see `here <http://compling.hss.ntu.edu.sg/hkcancor/>`_.
docs/pos_tagging.html:(<a class="reference external" href="http://compling.hss.ntu.edu.sg/hkcancor/">46 of which are described</a>).
docs/data.html:<a class="reference external" href="http://compling.hss.ntu.edu.sg/hkcancor/">Hong Kong Cantonese Corpus</a>
src/pycantonese/pos_tagging/tagger.py:          http://compling.hss.ntu.edu.sg/hkcancor/.
src/pycantonese/pos_tagging/hkcancor_to_ud.py:# HKCanCor tagset: http://compling.hss.ntu.edu.sg/hkcancor/
src/pycantonese/pos_tagging/hkcancor_to_ud.py:    are described at http://compling.hss.ntu.edu.sg/hkcancor/).
src/pycantonese/data/hkcancor/README.md:http://compling.hss.ntu.edu.sg/hkcancor/
src/pycantonese/data/hkcancor/README.md:([here](http://compling.hss.ntu.edu.sg/hkcancor/data/LICENSE),
tests/test_docs.py:        # "http://compling.hss.ntu.edu.sg/hkcancor/",  # TODO: Is the site down?

Corpus contains no characters

Hi there,
I am using the default corpus and found that there are 149781 words but 0 characters.
What can be causing this issue? I am using pycantonese version 2.0.

I am running Python 3.52 in a virtual environment.

Cheers

Jyutping to TIPA: allow optional overriding of the TIPA symbols

The library provides a hard coded dict for the Jyutping-to-TIPA conversion, but the chosen mapping for particular symbols may not be exactly what the user wants. An optional workaround for user-provided overrides is needed.

Where are all the profanities?

I tried to look up the 門氏五虎將, or some other phases like 仆街 or "Collect skin", but none of these are available.

Should I implement them?

Thank you

possible to add a custom lookup dict for characters_to_jyutping

Describe the bug
I read this and understand the corpora used for characters_to_jyutping are.
(i) the HKCanCor corpus data included in the PyCantonese library, and (ii) the rime-cantonese data
https://pycantonese.org/jyutping.html

The issue I found is, it seems at least one word, if converted to jyutping, give an incorrect jyutping result?

To reproduce
pycantonese.characters_to_jyutping('到')
[('到', 'dou2')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou2')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

Expected behavior
according to here. https://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/
到 should be dou3, so expected results are:
pycantonese.characters_to_jyutping('到')
[('到', 'dou3')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou3')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

I wonder if there is any way to resolve this problem, so pycantonese.characters_to_jyutping will return dou3 for 到 and 感到?
Thanks!

中英混合句子分詞嗰陣會將所有英文單詞連埋一齊

輸入

import pycantonese
pycantonese.pos_tag(pycantonese.segment("我今晚會 have dinner at home"))

輸出係

[('我', 'PRON'), ('今晚', 'ADV'), ('會', 'AUX'), ('havedinnerathome', 'VERB')]

可以睇到 havedinnerathome 成個變成咗一個動詞。如果想還原句子就做唔到。可唔可以喺保留英文單詞之間空格嘅前提下分詞？

Parsing Error occurred when using Yip-Matthews Bilingual Corpus

Describe the bug
When I try to use the Yip-Matthews Bilingual Corpus, the following error occurs:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/tljh/user/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/tljh/user/lib/python3.9/concurrent/futures/process.py", line 202, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/opt/tljh/user/lib/python3.9/concurrent/futures/process.py", line 202, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/chat.py", line 1430, in _parse_chat_str
    utterances = self._get_utterances(all_tiers)
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/chat.py", line 1449, in _get_utterances
    utterance_line = _clean_utterance(tiermarker_to_line[participant_code])
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/_clean_utterance.py", line 195, in _clean_utterance
    utterance = _drop(utterance, "> [/]", "<", ">", "left")
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/_clean_utterance.py", line 118, in _drop
    paren_i = _find_paren(
  File "/home/jupyter-raptor/.local/lib/python3.9/site-packages/pylangacq/_clean_utterance.py", line 112, in _find_paren
    raise ValueError(f"no matching paren: {s}, {target}, {opposite}, {direction}")
ValueError: no matching paren: see my babe [/] babe , <, >, left
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[3], line 7
      5 url = "https://childes.talkbank.org/data/Biling/CHCC.zip"
      6 url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
----> 7 corpus = pycantonese.read_chat(url)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pycantonese/corpus.py:423, in read_chat(path, match, exclude, encoding)
    402 @_params_in_docstring("match", "exclude", "encoding", class_method=False)
    403 def read_chat(
    404     path: str, match: str = None, exclude: str = None, encoding: str = _ENCODING
    405 ) -> CHATReader:
    406     """Read Cantonese CHAT data files.
    407 
    408     Parameters
   (...)
    421     :class:`~pycantonese.CHATReader`
    422     """
--> 423     return pylangacq_read_chat(
    424         path, match=match, exclude=exclude, encoding=encoding, cls=CHATReader
    425     )

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1846, in read_chat(path, match, exclude, encoding, cls)
   1844 path_lower = path.lower()
   1845 if path_lower.endswith(".zip"):
-> 1846     return cls.from_zip(path, match=match, exclude=exclude, encoding=encoding)
   1847 elif os.path.isdir(path):
   1848     return cls.from_dir(path, match=match, exclude=exclude, encoding=encoding)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1127, in Reader.from_zip(cls, path, match, exclude, extension, encoding, parallel, use_cached, session)
   1124         with zipfile.ZipFile(zip_path) as zfile:
   1125             zfile.extractall(unzip_dir)
-> 1127     reader = cls.from_dir(
   1128         unzip_dir,
   1129         match=match,
   1130         exclude=exclude,
   1131         extension=extension,
   1132         encoding=encoding,
   1133         parallel=parallel,
   1134     )
   1136 # Unzipped files from `.from_zip` have the unwieldy temp dir in the file path.
   1137 for f in reader._files:

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1057, in Reader.from_dir(cls, path, match, exclude, extension, encoding, parallel)
   1055             continue
   1056         file_paths.append(os.path.join(dirpath, filename))
-> 1057 return cls.from_files(
   1058     sorted(file_paths),
   1059     match=match,
   1060     exclude=exclude,
   1061     encoding=encoding,
   1062     parallel=parallel,
   1063 )

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:1009, in Reader.from_files(cls, paths, match, exclude, encoding, parallel)
   1006 else:
   1007     strs = [_open_file(p) for p in paths]
-> 1009 return cls.from_strs(strs, paths, parallel=parallel)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:187, in _params_in_docstring.<locals>.real_decorator.<locals>.wrapper(*args, **kwargs)
    185 @functools.wraps(func)
    186 def wrapper(*args, **kwargs):
--> 187     return func(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:970, in Reader.from_strs(cls, strs, ids, parallel)
    966     raise ValueError(
    967         f"strs and ids must have the same size: {len(strs)} and {len(ids)}"
    968     )
    969 reader = cls()
--> 970 reader._parse_chat_strs(strs, ids, parallel)
    971 return reader

File ~/.local/lib/python3.9/site-packages/pylangacq/chat.py:254, in Reader._parse_chat_strs(self, strs, file_paths, parallel)
    252 if parallel:
    253     with cf.ProcessPoolExecutor() as executor:
--> 254         self._files = collections.deque(
    255             executor.map(self._parse_chat_str, strs, file_paths)
    256         )
    257 else:
    258     self._files = collections.deque(
    259         self._parse_chat_str(s, f) for s, f in zip(strs, file_paths)
    260     )

File /opt/tljh/user/lib/python3.9/concurrent/futures/process.py:559, in _chain_from_iterable_of_lists(iterable)
    553 def _chain_from_iterable_of_lists(iterable):
    554     """
    555     Specialized implementation of itertools.chain.from_iterable.
    556     Each item in *iterable* should be a list.  This function is
    557     careful not to keep references to yielded objects.
    558     """
--> 559     for element in iterable:
    560         element.reverse()
    561         while element:

File /opt/tljh/user/lib/python3.9/concurrent/futures/_base.py:608, in Executor.map.<locals>.result_iterator()
    605 while fs:
    606     # Careful not to keep a reference to the popped future
    607     if timeout is None:
--> 608         yield fs.pop().result()
    609     else:
    610         yield fs.pop().result(end_time - time.monotonic())

File /opt/tljh/user/lib/python3.9/concurrent/futures/_base.py:438, in Future.result(self, timeout)
    436     raise CancelledError()
    437 elif self._state == FINISHED:
--> 438     return self.__get_result()
    440 self._condition.wait(timeout)
    442 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /opt/tljh/user/lib/python3.9/concurrent/futures/_base.py:390, in Future.__get_result(self)
    388 if self._exception:
    389     try:
--> 390         raise self._exception
    391     finally:
    392         # Break a reference cycle with the exception in self._exception
    393         self = None

ValueError: no matching paren: see my babe [/] babe , <, >, left

To reproduce

Execute the following codes:

import pycantonese
url = "https://childes.talkbank.org/data/Biling/YipMatthews.zip"
corpus = pycantonese.read_chat(url)

The above error appears.

Expected behavior
Expected the corpus can be used without error, just like Child Heritage Chinese Corpus, Guthrie Bilingual Corpus, HKU-70 Corpus, Lee-Wong-Leung Corpus, Leo Corpus and Paidologos Corpus: Cantonese.

All links are checked, only the Yip-Matthews Bilingual Corpus shows an error.

System (please complete the following information):

Operating System: Ubuntu 18.04
PyCantonese version: 3.4.0

Additional context
Running in Jupyterhub

Jyutping to IPA support

Feature you are interested in and your specific question(s):

Is there any method that does jyutping to ipa ? I know there's a jyutping to tipa method now, would be great if also have jyutping to ipa.

What you are trying to accomplish with this feature or functionality:
I am currently helping to prepare the data for training the cantonese part of a multilingual pl-bert for the open source StyleTTS2 model. link. We need a grapheme to phoneme library for zh-yue/zh language using the wikipedia dataset.

We have yet to find a good enough quality g2p library, tried espeak-ng, some deep learning library, that fits into the StyleTTS2 format. So we are attempting to use the pycantonese characters_to_jyutping method, then convert from jyutping_to_ipa.

Additional context:

Return POS and the character?

Hi,
It is possible to return all of the sentences in character from the corpus with its pos tag?

.cha file word segmentation

Hello, may I please know if it would be possible to word segment a .cha file, or if better, a zip folder containing .cha files? Thank you very much!

Anyway to off the segmentation and just do the jyutping char by char?

The speed is a bit slow, and I am just looking for jyutping, may be the most frequent jyuping of the character? Thanks

Copy-paste error in tagger implementation

Describe the bug
Code location:
https://github.com/jacksonllee/pycantonese/blob/main/src/pycantonese/pos_tagging/tagger.py#L262
From the context, i+2 should be used but it has i-2 currently.
I tried to fix and and regen the model pickle file, but that fails some tests which I don't know how to proceed.

To reproduce

Change i-2 to i+2 in the linked line and the line below.
Regenerate model by running train_tagger.py
Run tests `python -m pytest tests/test_parsing.py

Expected behavior
All tests pass.

System (please complete the following information):

Operating System: MacOS
PyCantonese version: 3.4.0

Simplified Chinese characters not supported

I try to use the jyutping to convert characters to jyutping, but I found some character can be convert:
for example:
txt='昆省急救服务中心嘅医护人员昆省警方。'
the output is:
[('昆', 'gwan1'), ('省', 'saang2'), ('急救', 'gap1gau3'), ('服', 'fuk6'), ('务', None), ('中心', 'zung1sam1'), ('嘅', 'ge3'), ('医', 'ai3'), ('护', None), ('人', 'jan4'), ('员', None), ('昆', 'gwan1'), ('省', 'saang2'), ('警方', 'ging2fong1'), ('。', None)]
you can see that ‘务’，‘护’，‘员’ are None

Corpus access

Dear Jackson,

We are a group of NLP researchers who's interested in Chinese varieties.
Could you advise us on how to gain access to Prof. Luke's corpus?

Regards,
Liling

Segmenter removes space of English words in code-mixed sentence

Describe the bug
Segmenter removes space of English words in code-mixed sentence, for example this sentence:

這是Career Centre

To reproduce
Here is the code:

import pycantonese
from pycantonese.word_segmentation import Segmenter
segmenter = Segmenter()
pyseg = pycantonese.segment("這是Career Centre", cls=segmenter)
for word in pyseg:
    print(word)

The output is:

這是
CareerCentre

Expected behavior
The expected output is:

這是
Career Centre

這是
Career
Centre

System (please complete the following information):

Operating System: macOS Sonoma 14.0 (23A344)
PyCantonese version: 3.4.0

`hkcancor_to_ud` typo: "G1": "V" should be "VERB"

The _MAP for the pos_tagging.hkcancor_to_ud function has a typo that incorrectly outputs the V as a UD tag when VERB is intended. This breaks downstream tasks that rely on pycantonese to convert hkcancor labels into UD.

Jyutping codas {i,u} and Yale "h" for low tones

Jyutping codas {i,u} with low tones aren't handled correctly in Jyutping-to-Yale conversion, specifically for the position of "h" signaling a low tone in Yale. Currently:

>>> import pycantonese as pc
>>> pc.yale("caau4")
['chàahu'] # incorrect -- should be ['chàauh'] instead

allow search criteria to be regular expressions in addition to plain strings

This will make the search functions more flexible, e.g., allowing searches like "find all words whose part of speech tags begin with "N" for some sort of nouns".

retain capitalization and whitespace in Jyutping-to-Yale

In Jyutping-to-Yale conversion, if the input Jyutping string has capitalization that mimics orthographic conventions in English (e.g., uppercase for proper names), then we may want to allow an option to retain capitalization. (h.t. Stephan Stiller)

While we are on the subject of retaining the input style, we may also want to consider retaining input whitespace (most probably for word segmentation).

How can i join you ?

hello, I am very interested in this project.
I want to join, what can I do for this project?

Not compatible with pylangacq version 0.13

Code to reproduce

import pycantonese as pc                                                                                      
pc.word_segmentation.Segmenter()   

Traceback (most recent call last):                                                                                
  File "<stdin>", line 1, in <module>                                                                             
  File "/home/lib/python3.7/site-packages/pycantonese/word_segmentation.py", line 46, in __init__
    self.fit(hkcancor().sents())                                                                                  
  File "/home/lib/python3.7/site-packages/pycantonese/corpus.py", line 334, in hkcancor
    return CantoneseCHATReader(data_path, encoding="utf8")                                                        
  File "/home/lib/python3.7/site-packages/pycantonese/corpus.py", line 31, in __init__
    super(CantoneseCHATReader, self).__init__(*filenames, encoding=encoding)
TypeError: __init__() got an unexpected keyword argument 'encoding'

Downgrading to pylangacq==0.12.0 works as a workaround

Windows support

I can successfully use pycantonese using wsl but crash in windows cmd/powershell

code:

import pycantonese
print(pycantonese.segment("廣東話容唔容易學？"))

Traceback:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
    self.fit(hkcancor().words(by_utterances=True))
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
    reader = _HKCanCorReader.from_dir(data_dir)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
    return func(*args, **kwargs)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    prepare(preparation_data)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
Traceback (most recent call last):
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
    reader._parse_chat_strs(strs, ids, parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
  File "<string>", line 1, in <module>
    executor.map(self._parse_chat_str, strs, file_paths)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
    _fixup_main_from_path(data['init_main_from_path'])
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    results = super().map(partial(_process_chunk, fn),
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
    exitcode = _main(fd, parent_sentinel)
    prepare(preparation_data)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    prepare(preparation_data)
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    _fixup_main_from_path(data['init_main_from_path'])
    return _run_module_code(code, init_globals, run_name,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    _fixup_main_from_path(data['init_main_from_path'])
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    _run_code(code, mod_globals, init_globals,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    main_content = runpy.run_path(main_path,
    exec(code, run_globals)
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    return _run_module_code(code, init_globals, run_name,
Traceback (most recent call last):
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
  File "<string>", line 1, in <module>
    _run_code(code, mod_globals, init_globals,
    self._start_queue_management_thread()
    return _run_module_code(code, init_globals, run_name,
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
    _run_code(code, mod_globals, init_globals,
    return Segmenter()
    exec(code, run_globals)
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
Traceback (most recent call last):
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
    self._adjust_process_count()
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
  File "<string>", line 1, in <module>
    self.fit(hkcancor().words(by_utterances=True))
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    exec(code, run_globals)
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    prepare(preparation_data)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    p.start()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
    cls = _get_default_segmenter()
    _fixup_main_from_path(data['init_main_from_path'])
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    reader = _HKCanCorReader.from_dir(data_dir)
    exitcode = _main(fd, parent_sentinel)
    self._popen = self._Popen(self)
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    prepare(preparation_data)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
    return Segmenter()
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
    return Popen(process_obj)
    _fixup_main_from_path(data['init_main_from_path'])
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
    self.fit(hkcancor().words(by_utterances=True))
    prepare(preparation_data)
    return _run_module_code(code, init_globals, run_name,
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    prep_data = spawn.get_preparation_data(process_obj._name)
    self.fit(hkcancor().words(by_utterances=True))
    _fixup_main_from_path(data['init_main_from_path'])
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    return func(*args, **kwargs)
    reader = _HKCanCorReader.from_dir(data_dir)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
    _run_code(code, mod_globals, init_globals,
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
    _check_not_importing_main()
    reader = _HKCanCorReader.from_dir(data_dir)
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    exec(code, run_globals)
    raise RuntimeError('''
    return func(*args, **kwargs)
    return _run_module_code(code, init_globals, run_name,
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    return _run_module_code(code, init_globals, run_name,
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
    _run_code(code, mod_globals, init_globals,

  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
    _run_code(code, mod_globals, init_globals,
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
    return cls.from_files(
    exec(code, run_globals)
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    reader._parse_chat_strs(strs, ids, parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    print(pycantonese.segment("廣東話容唔容易學？"))
    return func(*args, **kwargs)
    exec(code, run_globals)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
    return Segmenter()
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    executor.map(self._parse_chat_str, strs, file_paths)
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    print(pycantonese.segment("廣東話容唔容易學？"))
    return cls.from_strs(strs, paths, parallel=parallel)
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    self.fit(hkcancor().words(by_utterances=True))
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    results = super().map(partial(_process_chunk, fn),
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
    return cls.from_strs(strs, paths, parallel=parallel)
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    reader = _HKCanCorReader.from_dir(data_dir)
    self.fit(hkcancor().words(by_utterances=True))
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
    return func(*args, **kwargs)
    reader = _HKCanCorReader.from_dir(data_dir)
    self.fit(hkcancor().words(by_utterances=True))
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    return func(*args, **kwargs)
    reader._parse_chat_strs(strs, ids, parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
    reader = _HKCanCorReader.from_dir(data_dir)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    reader._parse_chat_strs(strs, ids, parallel)
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    executor.map(self._parse_chat_str, strs, file_paths)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    return func(*args, **kwargs)
    self._start_queue_management_thread()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    executor.map(self._parse_chat_str, strs, file_paths)
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    self._adjust_process_count()
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    results = super().map(partial(_process_chunk, fn),
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
Traceback (most recent call last):
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    results = super().map(partial(_process_chunk, fn),
    prepare(preparation_data)
    exitcode = _main(fd, parent_sentinel)
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    p.start()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    return func(*args, **kwargs)
  File "<string>", line 1, in <module>
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    _fixup_main_from_path(data['init_main_from_path'])
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    prepare(preparation_data)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
    _fixup_main_from_path(data['init_main_from_path'])
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    reader._parse_chat_strs(strs, ids, parallel)
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    return Popen(process_obj)
    self._start_queue_management_thread()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
    main_content = runpy.run_path(main_path,
    return _run_module_code(code, init_globals, run_name,
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prepare(preparation_data)
Traceback (most recent call last):
    reader._parse_chat_strs(strs, ids, parallel)
    self._start_queue_management_thread()
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
  File "<string>", line 1, in <module>
    return _run_module_code(code, init_globals, run_name,
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
    executor.map(self._parse_chat_str, strs, file_paths)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _fixup_main_from_path(data['init_main_from_path'])
    _run_code(code, mod_globals, init_globals,
    self._adjust_process_count()
    reader._parse_chat_strs(strs, ids, parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
    _check_not_importing_main()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    self._adjust_process_count()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    executor.map(self._parse_chat_str, strs, file_paths)
    exec(code, run_globals)
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    p.start()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
    exitcode = _main(fd, parent_sentinel)
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    executor.map(self._parse_chat_str, strs, file_paths)
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
    results = super().map(partial(_process_chunk, fn),
    _run_code(code, mod_globals, init_globals,
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
    print(pycantonese.segment("廣東話容唔容易學？"))
    raise RuntimeError('''
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
    return _run_module_code(code, init_globals, run_name,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    p.start()
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
    results = super().map(partial(_process_chunk, fn),
    exec(code, run_globals)
    prepare(preparation_data)

    results = super().map(partial(_process_chunk, fn),
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
    _run_code(code, mod_globals, init_globals,
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    return Popen(process_obj)
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
    _fixup_main_from_path(data['init_main_from_path'])
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    return Popen(process_obj)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    exec(code, run_globals)
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
    cls = _get_default_segmenter()
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    self.fit(hkcancor().words(by_utterances=True))
    main_content = runpy.run_path(main_path,
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
    prep_data = spawn.get_preparation_data(process_obj._name)
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
    _check_not_importing_main()
    self._start_queue_management_thread()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
    self._start_queue_management_thread()
    reader = _HKCanCorReader.from_dir(data_dir)
    cls = _get_default_segmenter()
    return _run_module_code(code, init_globals, run_name,
    _check_not_importing_main()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
    raise RuntimeError('''
    self._start_queue_management_thread()
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    self._adjust_process_count()
    self.fit(hkcancor().words(by_utterances=True))
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.    self._adjust_process_count()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
    return func(*args, **kwargs)
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
    _run_code(code, mod_globals, init_globals,
    raise RuntimeError('''

  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.    exec(code, run_globals)
    p.start()
    reader = _HKCanCorReader.from_dir(data_dir)
    self.fit(hkcancor().words(by_utterances=True))
    self._adjust_process_count()

  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
    p.start()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    return func(*args, **kwargs)
    self._popen = self._Popen(self)
    reader = _HKCanCorReader.from_dir(data_dir)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
    self._popen = self._Popen(self)
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    p.start()
    cls = _get_default_segmenter()
Traceback (most recent call last):
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    return Popen(process_obj)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
  File "<string>", line 1, in <module>
    return cls.from_files(
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
    return Popen(process_obj)
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    return func(*args, **kwargs)
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 116, in spawn_main
    prep_data = spawn.get_preparation_data(process_obj._name)
    return Popen(process_obj)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    self.fit(hkcancor().words(by_utterances=True))
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
    _check_not_importing_main()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
    exitcode = _main(fd, parent_sentinel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    _check_not_importing_main()
    return cls.from_strs(strs, paths, parallel=parallel)
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 125, in _main
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    reader = _HKCanCorReader.from_dir(data_dir)
    prepare(preparation_data)
    reader._parse_chat_strs(strs, ids, parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 236, in prepare
    raise RuntimeError('''
    return cls.from_strs(strs, paths, parallel=parallel)
    raise RuntimeError('''
    _check_not_importing_main()
    return func(*args, **kwargs)
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
    _fixup_main_from_path(data['init_main_from_path'])

  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
    return func(*args, **kwargs)

    executor.map(self._parse_chat_str, strs, file_paths)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    raise RuntimeError('''
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
    return cls.from_files(
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    return func(*args, **kwargs)
    reader._parse_chat_strs(strs, ids, parallel)
    main_content = runpy.run_path(main_path,
    results = super().map(partial(_process_chunk, fn),
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 265, in run_path
    reader._parse_chat_strs(strs, ids, parallel)

  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    return _run_module_code(code, init_globals, run_name,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 97, in _run_module_code
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
    _run_code(code, mod_globals, init_globals,
  File "D:\anaconda3\envs\lyricscrawler\lib\runpy.py", line 87, in _run_code
    executor.map(self._parse_chat_str, strs, file_paths)
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    reader._parse_chat_strs(strs, ids, parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    exec(code, run_globals)
  File "E:\VisualStudioProject\FYP\lyrics_web_crawler\test.py", line 2, in <module>
    executor.map(self._parse_chat_str, strs, file_paths)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
    results = super().map(partial(_process_chunk, fn),
    self._start_queue_management_thread()
    executor.map(self._parse_chat_str, strs, file_paths)
    print(pycantonese.segment("廣東話容唔容易學？"))
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
    results = super().map(partial(_process_chunk, fn),
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 111, in segment
    self._adjust_process_count()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    cls = _get_default_segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 66, in _get_default_segmenter
    p.start()
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
    results = super().map(partial(_process_chunk, fn),
    return Segmenter()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\word_segmentation.py", line 43, in __init__
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
    self.fit(hkcancor().words(by_utterances=True))
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pycantonese\corpus.py", line 372, in hkcancor
    self._start_queue_management_thread()
    self._start_queue_management_thread()
    return Popen(process_obj)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
    reader = _HKCanCorReader.from_dir(data_dir)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    self._adjust_process_count()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
    self._adjust_process_count()
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
    self._start_queue_management_thread()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    p.start()
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 1010, in from_dir
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
    _check_not_importing_main()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    self._adjust_process_count()
    p.start()
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 608, in _adjust_process_count
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
    return cls.from_files(
    raise RuntimeError('''
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.    p.start()
    return func(*args, **kwargs)
    return Popen(process_obj)
    prep_data = spawn.get_preparation_data(process_obj._name)

  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\process.py", line 121, in start
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 962, in from_files
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    self._popen = self._Popen(self)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\context.py", line 327, in _Popen
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    return Popen(process_obj)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    return cls.from_strs(strs, paths, parallel=parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 153, in wrapper
    _check_not_importing_main()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
    return func(*args, **kwargs)
    raise RuntimeError('''
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 923, in from_strs
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.    _check_not_importing_main()
  File "D:\anaconda3\envs\lyricscrawler\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main

    reader._parse_chat_strs(strs, ids, parallel)
  File "D:\anaconda3\envs\lyricscrawler\lib\site-packages\pylangacq\chat.py", line 213, in _parse_chat_strs
    executor.map(self._parse_chat_str, strs, file_paths)
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 674, in map
    results = super().map(partial(_process_chunk, fn),
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\_base.py", line 608, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 645, in submit
    self._start_queue_management_thread()
  File "D:\anaconda3\envs\lyricscrawler\lib\concurrent\futures\process.py", line 584, in _start_queue_management_thread
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.    self._adjust_process_count()

Jyutping-to-Yale output format

Currently, Jyutping-to-Yale conversion always takes a string as input and returns a list of strings, regardless of the number of syllables:

>>> import pycantonese as pc
>>> pc.yale("hoeng1")
['hēung']
>>> pc.yale("hoeng1gong2")
['hēung', 'góng']

Perhaps it would be desirable to make the input and output data structures consistent, e.g., a string for both input and output. The following changes are planned:

Set as default the string for both input and output of the yale() function. Allow an optional parameter to allow a list (of strings for individual syllables) to be the output.
Potential Yale ambiguities: The new default string output has to be checked for ambiguities like Jyutping "hei3hau6" (氣候 climate) --> Yale "heihauh", technically ambiguous between "hei'hauh" and "heih'auh". Probably the apostrophe as the syllable separator is going to be consistently used when a potential ambiguity is detected.

(h.t. Stephan Stiller)

分詞器速度太慢

目前個 .segment()效率有啲低，好似唔係最優算法。@graphemecluster @ZhanruiLiang 之後可能會開個 PR 睇下點優化。另外順便解決埋 #32 嘅分詞問題。

Upcoming new function: parse_text

I'm finishing a new function called parse_text, which takes raw Cantonese text, parses* the text with all the available PyCantonese functions (= conversion from characters to Jyutping, word segmentation, and part-of-speech tagging), and outputs a corpus object that facilitates data access and search operations. While I'm still polishing the code and there's no source code on GitHub yet, I'm opening this issue to introduce the design of the parse_text function and get feedback from potential users -- please leave your comments in this issue if you have any questions or thoughts! I'm going to leave this issue up for a month or so.

*Calling this a "parsing" function is perhaps futuristic, especially for those of you who may insist that parsing be restricted to mean syntactic parsing. When (read: some day...) PyCantonese can get the syntactic and semantic relations from segmented words, all this functionality will be wrapped under parse_text as well. :-)

The following focuses on two aspects of the new parse_text function:

What inputs are possible?
What knobs are there to customize the parsing? (only for word segmentation and part-of-speech tagging for now)

Input 1: A Plain String

If you have unprocessed Cantonese text (prose, conversational data, etc.), then you can pass in a plain Python string to parse_text:

In [1]: import pycantonese

In [2]: data = "你食咗飯未呀？食咗喇！你聽日得唔得閒呀？"

In [3]: corpus = pycantonese.parse_text(data)

In [4]: for s in corpus.to_strs():
   ...:     print(s)
   ...:
*X:	你 食 咗 飯 未 呀 ？
%mor:	PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ？
*X:	食 咗 喇 ！
%mor:	VERB|sik6 PART|zo2 PART|laa1 ！
*X:	你 聽日 得 唔 得閒 呀 ？
%mor:	PRON|nei5 ADV|ting1jat6 VERB|dak1 ADV|m4 ADJ|dak1haan4 PART|aa4 ？

Note:

The output of parse_text is a CHAT corpus object. All methods and attributes for a CHAT corpus object will work (check out this tutorial).
Since CHAT is designed for conversational data and your input data is a string, parse_text attempts simple utterance segmentation (by the punctuation marks {"，", "！", "。"} as well as the EOL character "\n").
A dummy participant "X" is assigned to each utterance.

Input 2: A List of Strings

If you want to control utterance segmentation on your own, you can provide parse_text with a list of strings instead of a single string. Each string in the list will be treated as an utterance:

In [1]: import pycantonese

In [2]: data = [
   ...:     "你食咗飯未呀？",
   ...:     "食咗喇！你聽日得唔得閒呀？",
   ...: ]

In [3]: corpus = pycantonese.parse_text(data)

In [4]: for s in corpus.to_strs():
   ...:     print(s)
   ...:
*X:	你 食 咗 飯 未 呀 ？
%mor:	PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ？
*X:	食 咗 喇 ！ 你 聽日 得 唔 得閒 呀 ？
%mor:	VERB|sik6 PART|zo2 PART|laa1 ！ PRON|nei5 ADV|ting1jat6 VERB|dak1 ADV|m4 ADJ|dak1haan4 PART|aa4 ？

Input 3: A List of Tuples of Strings

If your data has participant information and don't want "X" to show up as the dummy participant for the parsed utterances, you can provide parse_text with a list of tuples of strings. In each tuple, the first element is the participant, and the second one is the unparsed utterance string:

In [1]: import pycantonese

In [2]: data = [
   ...:     ("小麗", "你食咗飯未呀？"),
   ...:     ("小怡", "食咗喇！你聽日得唔得閒呀？"),
   ...: ]

In [3]: corpus = pycantonese.parse_text(data)

In [4]: for s in corpus.to_strs():
   ...:     print(s)
   ...:
*小麗:	你 食 咗 飯 未 呀 ？
%mor:	PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ？
*小怡:	食 咗 喇 ！ 你 聽日 得 唔 得閒 呀 ？
%mor:	VERB|sik6 PART|zo2 PART|laa1 ！ PRON|nei5 ADV|ting1jat6 VERB|dak1 ADV|m4 ADJ|dak1haan4 PART|aa4 ？

Customizing Word Segmentation

The parse_text function has an optional argument called segment_kwargs. You can pass in a dictionary here to customize the behavior of word segmentation. The key-value pairs in this dictionary are passed as keyword arguments to the underlying segment function.

In [1]: import pycantonese

In [2]: from pycantonese.word_segmentation import Segmenter

In [3]: segmenter = Segmenter(allow={"得唔得閒"})

In [4]: data = [
   ...:     ("小麗", "你食咗飯未呀？"),
   ...:     ("小明", "食咗喇！你聽日得唔得閒呀？"),
   ...: ]

In [5]: corpus = pycantonese.parse_text(data, segment_kwargs={"cls": segmenter})

In [6]: for s in corpus.to_strs():
   ...:     print(s)
   ...:
*小麗:	你 食 咗 飯 未 呀 ？
%mor:	PRON|nei5 VERB|sik6 PART|zo2 NOUN|faan6 ADV|mei6 PART|aa4 ？
*小明:	食 咗 喇 ！ 你 聽日 得唔得閒 呀 ？
%mor:	VERB|sik6 PART|zo2 PART|laa1 ！ PRON|nei5 ADV|ting1jat6 VERB|dak1m4dak1haan4 PART|aa4 ？

Note the difference in the way "得唔得閒" is segmented between here and previous examples.

Customizing Part-of-Speech Tagging

The parse_text function has an optional argument called pos_tag_kwargs. You can pass in a dictionary here to customize the behavior of part-of-speech tagging. The key-value pairs in this dictionary are passed as keyword arguments to the underlying pos_tag function.

In [1]: import pycantonese

In [2]: data = [
   ...:     ("小麗", "你食咗飯未呀？"),
   ...:     ("小明", "食咗喇！你聽日得唔得閒呀？"),
   ...: ]

In [3]: corpus = pycantonese.parse_text(data, pos_tag_kwargs={"tagset": "hkcancor"})

In [4]: for s in corpus.to_strs():
   ...:     print(s)
   ...:
*小麗:	你 食 咗 飯 未 呀 ？
%mor:	R|nei5 V|sik6 U|zo2 N|faan6 D|mei6 Y|aa4 ？
*小明:	食 咗 喇 ！ 你 聽日 得 唔 得閒 呀 ？
%mor:	V|sik6 U|zo2 Y|laa1 ！ R|nei5 T|ting1jat6 V|dak1 D|m4 A|dak1haan4 Y|aa4 ？

add a conversion table between Jyutping and other romanization systems

Error parsing hng6

Describe the bug
A clear and concise description of what the bug is.
Error thrown when calling pycantonese.parse_jyutping('hng6')

To reproduce
Steps to reproduce the behavior, including the complete stack trace if possible:
pycantonese.parse_jyutping('hng6')

Expected behavior
A clear and concise description of what you expected to happen instead.
Should return [Jyutping(onset='h', nucleus='ng', coda='', tone='6')]

Screenshots
If applicable, add screenshots to help explain your problem.

System (please complete the following information):

Operating System: [e.g. Windows, MacOS, Linux]
PyCantonese version: [What you see after you run import pycantonese; print(pycantonese.__version__)]

Additional context
Add any other context about the problem here.

Jyutping "eu"

Jyutping "eu" should be rendered in Yale as "ew" (cf. Matthews and Yip 2011), but currently in the code (and incorrectly):
Jyutping "eu" --> Yale "eu" (which would correspond to "oe"/"eo" in Jyutping)
h/t Stephan Stiller

UnicodeDecodeError on Windows

If you're on Windows, you may hit UnicodeDecodeError when importing pycantonese. The problem has been resolved at the GitHub source code, and I've made a pre-release to PyPI. So for now the workaround is to use this pre-release version and the problem should go away:

$ pip install --pre --upgrade pycantonese

I'll keep this issue open in case anyone runs into it and finds the GitHub bug tracker here. Going to close it once I make an actual release on PyPI. Thanks, everyone!

import error occurs when import pycantonese

I first use pip to install pycantonese:
pip install pycantonese

When I use "import pycantonese", the following error occured:

ImportError Traceback (most recent call last)
in
----> 1 import pycantonese as pc
2
3 import os
4
5 import re

~/anaconda3/lib/python3.7/site-packages/pycantonese/init.py in
1 import pkg_resources
2
----> 3 from pycantonese.corpus import hkcancor, read_chat, CHATReader
4 from pycantonese.jyutping.characters import (
5 characters_to_jyutping,

~/anaconda3/lib/python3.7/site-packages/pycantonese/corpus.py in
4 from typing import List, Optional, Union
5
----> 6 from pylangacq.chat import Reader, _params_in_docstring
7 from pylangacq.chat import read_chat as pylangacq_read_chat
8 from pylangacq.objects import Gra

~/anaconda3/lib/python3.7/site-packages/pylangacq/init.py in
1 import pkg_resources
2
----> 3 from pylangacq.chat import read_chat, Reader
4
5

~/anaconda3/lib/python3.7/site-packages/pylangacq/chat.py in
21 from requests.packages.urllib3.util.retry import Retry
22 from dateutil.parser import parse as parse_date
---> 23 from dateutil.parser import ParserError
24
25 import pylangacq

ImportError: cannot import name 'ParserError' from 'dateutil.parser' (/Users/libaiqi/anaconda3/lib/python3.7/site-packages/dateutil/parser/init.py)

Does Word Segmentation give position of the vocabularies?

Feature you are interested in and your specific question(s):
I'm studying Word Segmentation of PyCantonese (https://pycantonese.org/word_segmentation.html), does the function return also the start & end position of the vocabulary?

What you are trying to accomplish with this feature or functionality:
I would like to achieve:

import pycantonese
from pycantonese.word_segmentation import Segmenter
segmenter = Segmenter()
result = pycantonese.segment("廣東話容唔容易學？", cls=segmenter)
print(result)

Current result:

['廣東話', '容', '唔', '容易', '學', '？']

Would like to have the following result (with the start & end position):

[('廣東話', 0, 3), ('容', 3, 4), ('唔', 4, 5), ('容易', 5, 7), ('學', 7, 8), ('？', 8, 9)]

Thanks.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

jacksonllee / pycantonese Goto Github PK

pycantonese's Introduction

PyCantonese: Cantonese Linguistics and NLP in Python

Download and Install

Consulting

Support

Links

How to Cite

License

Logo

Acknowledgments

Changelog

Setting up a Development Environment

pycantonese's People

Contributors

Stargazers

Watchers

Forkers

pycantonese's Issues

Input 1: A Plain String

Input 2: A List of Strings

Input 3: A List of Tuples of Strings

Customizing Word Segmentation

Customizing Part-of-Speech Tagging

Recommend Projects

Recommend Topics

Recommend Org