openpecha / botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

Home Page: https://botok.readthedocs.io/

License: Apache License 2.0

Python 100.00%

computational-linguistics nlp nlp-library pybo python tibetan tibetan-language tibetan-nlp token tokenizer

botok's Introduction

OpenPecha Toolkit

Description • Owner • Install • Docs

Description

The OpenPecha Toolkit provides a state-of-the-art solution for distributed standoff annotations on moving texts, in which the base layer can be edited without affecting the annotations. This is made possible by the OpenPecha native format, opf (OpenPecha Format), together with a collection of importers that parse existing texts into opf and exporters that convert opf texts into any format (.epub, .docx, .pdf, etc.).

Owner

@10zinten

💾 Install

Stable version:

`pip install openpecha`

Daily development version:

`pip install git+https://github.com/OpenPecha/Openpecha-Toolkit`

Docs

Documentation: docs
If you have any problems with openpecha-toolkit, please open an issue.

Developer Installation.

git clone https://github.com/OpenPecha-dev/openpecha-toolkit.git
cd openpecha-toolkit
pip install -r requirements-dev.txt
pip install -e .
pre-commit install

Testing

PYTHONPATH=.:$PYTHONPATH pytest tests


botok's Issues

Sentences and Paragraphs as Token attributes

The sentence_tokenizer() and paragraph_tokenizer() should add attributes about sentences in the Token objects directly instead of creating a new list of Tokens embedded in tuples.

One idea is to use the _ attribute of Token objects to store two key/value pairs: sent/word_num and par/word_num.
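The proposal could be sketched as follows. This is only an illustration: the `Token` class is a hypothetical stand-in rather than botok's own, and storing a (sentence number, word number) tuple under the key `sent` is one possible reading of "sent/word_num".

```python
class Token:
    """Hypothetical stand-in for a token; only the fields needed here."""

    def __init__(self, text):
        self.text = text
        self._ = {}  # free-form attribute store, as suggested in the issue


def annotate_sentences(tokens, is_sentence_end):
    """Write (sentence_num, word_num) into each token's _ dict, in place."""
    sent_num, word_num = 0, 0
    for tok in tokens:
        tok._["sent"] = (sent_num, word_num)
        word_num += 1
        if is_sentence_end(tok):
            # The next token starts a new sentence.
            sent_num += 1
            word_num = 0
    return tokens


tokens = [Token(t) for t in ["a", "b", "།", "c", "།"]]
annotate_sentences(tokens, lambda tok: tok.text == "།")
positions = [tok._["sent"] for tok in tokens]
```

The token list keeps its original shape, so callers that only need words are unaffected, and a `par` key could be filled in the same way.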

Github Actions for CI

I've recently moved to using Github's integrated CI for all my projects. The experience is very good in comparison to the old way (e.g. Travis etc.).

It's not enabled by default, but you can do it here.

I could move the current CI to Github, as well as have a look if there are some obvious improvements we could make in the process.

integrate tests in setup.py

As explained in

https://docs.pytest.org/en/latest/goodpractices.html

The frequency resources are not in the package

They have to be added to package_data in setup.py.

pypi package / travis deploys

Hey guys, really great to see that things are moving :) For PyPI, it would be a good idea to set up Travis CI first and then, as a final step, add automatic deployment to PyPI when certain criteria are fulfilled (for example, a merge from master to a production branch or something like that). Let me know when is a good time to do some testing, and I'll open a PR with a travis.yml based on that. I can try to include a very basic getting-started guide in the PR. What do you think?

test_tokenizer.py fail

the function test_split_token() returns:

Loading Trie...
Time: 0.0014331340789794922
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-c086f7958c84> in <module>()
----> 1 test_split_token()

<ipython-input-2-b4146ab2cb9c> in test_split_token()
     49     tok = Tokenizer(trie)
     50     tokens = tok.tokenize(PyBoTextChunks('གཏན་གྱི་བདེ་བའི་རྒྱུ།'))
---> 51     tok_utils = TokenSplit(tokens)
     52     tok_utils.split_affixed_particles()
     53     for t in tokens:

~/dev/mimic3/pybo/BoTokenUtils.py in __init__(self, tokens)
      8     def __init__(self, tokens):
      9         self.tokens = tokens
---> 10         self.matcher = BoMatcher()
     11 
     12     def split_affixed_particles(self):

TypeError: __init__() missing 1 required positional argument: 'query'

statistics performance with tokenizer.list_word_types

As it stands, Text(doc).list_word_types performs both tokenization and a statistical operation (basic word frequency). In a typical workflow I might first tokenize and then compute some statistics on the result. This would be quite painful with a bigger doc, as I would basically have to spend twice the time.

May I suggest we separate statistics into its own class, that accepts as its input any of the outputs from Text(doc). That way we can offer other common things like co-occurrence and ngram statistics.
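A minimal sketch of such a class, assuming the tokenizer output has already been reduced to a plain list of word strings. The name `TokenStats` and its methods are hypothetical and not part of pybo.

```python
from collections import Counter


class TokenStats:
    """Hypothetical statistics helper that works on already tokenized text."""

    def __init__(self, words):
        self.words = list(words)

    def frequencies(self):
        """Basic word frequency."""
        return Counter(self.words)

    def ngrams(self, n=2):
        """Frequency of consecutive n-word tuples."""
        shifted = (self.words[i:] for i in range(n))
        return Counter(zip(*shifted))

    def cooccurrences(self, window=2):
        """Frequency of unordered word pairs at most `window` words apart."""
        pairs = Counter()
        for i, word in enumerate(self.words):
            for other in self.words[i + 1:i + 1 + window]:
                pairs[tuple(sorted((word, other)))] += 1
        return pairs


stats = TokenStats(["a", "b", "a", "b", "c"])
freq = stats.frequencies()
bigrams = stats.ngrams(2)
pairs = stats.cooccurrences(window=1)
```

Because the class only reads a list of words, the text is tokenized once and every statistic reuses that result.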

I would be happy to make a PR for such a class if you think it's a good idea. I think it will be good also in terms of keeping the namespace of Text() clean as well, where by the way you have done a fantastic job. It's rare to see a python package with this level of standard for namespace clarity.

labels

I suggest that we start to use labeling in the following way:

there are three concepts: priority labels, context labels, and other labels
priority labels have three classes: issues, improvements to current features, new features
each priority class has three priorities: must-do, maybe, unlikely
context labels can have as many classes as needed to describe the various contexts in the application
context labels are always white, and priority labels range from faint yellow to dark red
each ticket gets labeled for its priority and context
other labels are used when classification with the above scheme is not possible (e.g. "discussion")
other labels are always grey

If you think this is a good idea, I'm happy to set it up.

Missing character when updating from pybo 0.4.0 to pybo 0.6.0, BoTokenizer to WordTokenizer

With pybo 0.4.0 and the BoTokenizer I'm able to tokenize the text that I'm working with.
With pybo 0.6.0 and the WordTokenizer I get the following error.

!pip install pybo==0.6.0
tok =pybo.WordTokenizer('POS')
...
tokens = [t for t in tok.tokenize(f.read()) if t.type != "non-bo" and t.pos != "punct"]

ValueError: The char "࿖" is expected to be in the tibetan table, but is not.

four.pdf
six.pdf
Attached are some outputs for jupyter notebooks, with the pybo version manually changed.

CQLMatcher can not match last token

As shown below, the matcher cannot match the last token.

import pybo

token1 = pybo.Token()
token1.pos = 'NOUN'

token2 = pybo.Token()
token2.pos = 'VERB'

# could find first token
matcher = pybo.CQLMatcher('[pos="NOUN"]')
slices = matcher.match([token1, token2])
print(slices)  # [(0, 0)]

# could not find second token (last token)
matcher = pybo.CQLMatcher('[pos="VERB"]')
slices = matcher.match([token1, token2])
print(slices)  # []
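For a regression test, the expected behaviour can be pinned down with a small stand-in matcher. The sketch below is not pybo's CQLMatcher: `Tok` and `match_sequence` are hypothetical, and patterns are plain lists of POS tags. It shows one common source of this kind of bug: the scan has to include the final start position (`range(n - m + 1)`); dropping the `+ 1` would skip the last token. Whether that is the actual cause here is not established.

```python
class Tok:
    """Hypothetical token carrying only a POS tag."""

    def __init__(self, pos):
        self.pos = pos


def match_sequence(tokens, pattern):
    """Return inclusive (start, end) slices where the POS pattern matches."""
    n, m = len(tokens), len(pattern)
    slices = []
    # n - m + 1 start positions, so the last token is also a candidate.
    for start in range(n - m + 1):
        if all(tokens[start + k].pos == pattern[k] for k in range(m)):
            slices.append((start, start + m - 1))
    return slices


tokens = [Tok("NOUN"), Tok("VERB")]
first = match_sequence(tokens, ["NOUN"])
last = match_sequence(tokens, ["VERB"])
both = match_sequence(tokens, ["NOUN", "VERB"])
```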

docstrings

Hey Guys, great to see some action here :) Tried the tokenizer and seems to do a good job :)

May I suggest that you add docstrings to all functions as you go? It takes much less time to do it together with publishing the code than afterward, and it also makes adoption by others a lot more likely! If you can get something raw in place, I'm happy to improve it from there. That will then act as a basis for a proper user manual.

Thanks so much!

tokenizer gives IndexError

The below comes up for one volume in Rinchen Terdzo, in a scan of the whole body of texts. I tried to manually reproduce but could not.

The line that is causing it seems to be:

tokens.append(self.chunks_to_token([c_idx]))

in tokenizer.py.

The trace shows that this does not appear to be the same issue as #8.

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-221-9b0503e8f2ce> in <module>()
----> 1 word2vec_pipebig('rt_raw_clean.txt')

<ipython-input-218-52a921d05e0c> in word2vec_pipebig(filename, save_model)
    196                 model = word2vec_pipeline(line)
    197             else:
--> 198                 model = word2vec_pipeline(line, build_model=model)
    199             x += 1
    200 

<ipython-input-218-52a921d05e0c> in word2vec_pipeline(docs, epochs, skipgrams, workers, save, from_file, build_model)
    136         docs = read_file(docs)
    137 
--> 138     tokens = tokenize(docs)
    139     sentences = word2vec_prep(tokens)
    140 

<ipython-input-218-52a921d05e0c> in tokenize(text)
     61             pass
     62 
---> 63     return tok.tokenize(out, split_affixes=False)
     64 
     65 

~/dev/astetik_test/lib/python3.6/site-packages/pybo/__init__.py in tokenize(self, string, split_affixes)
     50         """
     51         preprocessed = PyBoTextChunks(string)
---> 52         tokens = self.tok.tokenize(preprocessed, split_affixes=split_affixes)
     53         if self.lemmatize:
     54             LemmatizeTokens().lemmatize(tokens)

~/dev/astetik_test/lib/python3.6/site-packages/pybo/tokenizer.py in tokenize(self, pre_processed, split_affixes, debug)
    135                     current_node = None
    136 
--> 137                 tokens.append(self.chunks_to_token([c_idx]))
    138 
    139             # END OF INPUT

~/dev/astetik_test/lib/python3.6/site-packages/pybo/tokenizer.py in chunks_to_token(self, syls, tag, ttype)
    180         if len(syls) == 1:
    181             # chunk format: ([char_idx1, char_idx2, ...], (type, start_idx, len_idx))
--> 182             token_syls = [self.pre_processed.chunks[syls[0]][0]]
    183             token_type = self.pre_processed.chunks[syls[0]][1][0]
    184             token_start = self.pre_processed.chunks[syls[0]][1][1]

IndexError: list index out of range

POS-tagging a list of tokens that have already been tokenized

Hi, I'm wondering whether it is possible for pybo to POS-tag a list of tokens that have already been tokenized, instead of an input string of running text.

Missing syllables and punctuation

Please have a look at the following script:

int and bool

CQL and fsa both don't take int and bool as attribute values, we need that though

tests failing because of LemmatizeTokens().lemmatize(tokens)

Do you have an idea of why this might be happening?

import pybo as bo

# 1. PREPARATION 

# 1.1. Initializing the tokenizer
tok = bo.BoTokenizer('POS')

# 1.2. Loading in text
input_str = '༄༅། །རྒྱ་གར་སྐད་དུ། བོ་དྷི་སཏྭ་ཙརྻ་ཨ་བ་ཏ་ར། བོད་སྐད་དུ། བྱང་ཆུབ་སེམས་དཔའི་སྤྱོད་པ་ལ་འཇུག་པ། །སངས་རྒྱས་དང་བྱང་ཆུབ་སེམས་དཔའ་ཐམས་ཅད་ལ་ཕྱག་འཚལ་ལོ། །བདེ་གཤེགས་ཆོས་ཀྱི་སྐུ་མངའ་སྲས་བཅས་དང༌། །ཕྱག་འོས་ཀུན་ལའང་གུས་པར་ཕྱག་འཚལ་ཏེ། །བདེ་གཤེགས་སྲས་ཀྱི་སྡོམ་ལ་འཇུག་པ་ནི། །ལུང་བཞིན་མདོར་བསྡུས་ནས་ནི་བརྗོད་པར་བྱ། །'

# -------------------------

# 2. CREATING THE OBJECTS 

# 2.1. creating pre_processed object
pre_processed = bo.PyBoTextChunks(input_str)

# 2.2. creating tokens object
tokens = tok.tokenize(input_str)

The error it throws is this:

Traceback (most recent call last):
  File "./test_script.py", line 23, in <module>
    tokens = tok.tokenize(input_str)
  File "/home/travis/build/mikkokotila/pybo/pybo/__init__.py", line 54, in tokenize
    LemmatizeTokens().lemmatize(tokens)
  File "/home/travis/build/mikkokotila/pybo/pybo/lemmatizetoken.py", line 23, in lemmatize
    if token.unaffixed_word:
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 88, in unaffixed_word
    return self.cleaned_content
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 64, in cleaned_content
    cleaned = '་'.join([''.join([self.content[idx] for idx in syl]) for syl in self.syls])
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 64, in <listcomp>
    cleaned = '་'.join([''.join([self.content[idx] for idx in syl]) for syl in self.syls])
  File "/home/travis/build/mikkokotila/pybo/pybo/token.py", line 64, in <listcomp>
    cleaned = '་'.join([''.join([self.content[idx] for idx in syl]) for syl in self.syls])
IndexError: string index out of range

using unicode data

Instead of having custom groups in bostrings.py, it would probably be good to use the unicode data. I've gathered the interesting ones in this file, which could probably be moved in pybo. Not any kind of emergency though...

How to initialize the tokenizer without the POS tagging feature?

The doc gives examples of how to initialize the tokenizer together with the part-of-speech capability, but how can the tokenizer be initialized without the POS-tagging feature? And would that improve the speed of the tokenizer?

General Roadmap Discussion

I now have a somewhat better understanding of the paradigm you are following. It seems that tokenization performance is now much better and the code is much cleaner. Excellent work!

Regarding the preprocessing, I did a simple test with a single short made up chunk of text:

'འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་'

Here are some results:

%timeit -n1 pre_processed = PyBoTextChunks(text)
4.51 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit tokens = tok.tokenize(pre_processed)
124 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit tagged = ['"{}"'.format(w.content) for w in tokens]
5.2 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

As it becomes apparent, at the moment the pre-processing is taking most of the time. This seems ok, as spacy, for example, takes several times longer to create a spacy document (which I understand involves doing many more things).
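Outside a notebook, the same per-stage comparison could be reproduced with the standard `timeit` module. The sketch below uses stand-in functions (`preprocess` and `tokenize` are hypothetical placeholders, not pybo's PyBoTextChunks or Tokenizer), so only the measuring pattern carries over, not the numbers.

```python
import timeit


def preprocess(text):
    # Stand-in for the chunking stage: split on the tsek.
    return text.split("་")


def tokenize(chunks):
    # Stand-in for the tokenizing stage: drop empty chunks.
    return [chunk for chunk in chunks if chunk]


def time_stage(func, *args, repeat=5, number=100):
    """Best observed time per call, in seconds."""
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(totals) / number


text = "འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་" * 25
chunks = preprocess(text)
timings = {
    "preprocess": time_stage(preprocess, text),
    "tokenize": time_stage(tokenize, chunks),
}
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1e6:.1f} µs per call")
```

Timing each stage on its own input keeps the cost of one stage from being counted in the next.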

Because each step scales linearly (i.e. problems will arise with bigger sets of text), I was wondering whether you guys had looked into Cython. That's what a lot of these tools end up using, as it gives pure C speeds in many cases without changing the code much. Yesterday I tried compiling the Tokenizer with Cython without changing the code at all, and it became 20-25% faster. Changing just one line of code (declaring the idx variable as an int for Cython) gave another 5% increase in performance. Where the real power lies, though, is in actually writing for Cython, where it's not uncommon to get 10-50x performance gains. You guys seem to have a good skill level in programming, so it would be very smooth for you to move to Cython at this early stage (later it would be more painful, of course). Another, simpler performance improvement would be to use NumPy arrays instead of lists.

What do you think?

change toadd_filenames and todel_filenames to a folder path

This would give great flexibility in organizing the entries to add or delete dynamically.

For example, one file could hold words to delete in any context, and another could hold words specific to a given file, context, or theme.
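A sketch of how folder-based loading might work, assuming one entry per line in UTF-8 `.txt` files. The function names, folder names, and file layout are hypothetical and only illustrate the idea.

```python
import tempfile
from pathlib import Path


def load_entries(folder):
    """Collect one entry per non-empty line from every .txt file in folder."""
    entries = set()
    for path in sorted(Path(folder).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                entries.add(line)
    return entries


def apply_modifications(base_words, add_dir, del_dir):
    """Add every entry found in add_dir, then remove every entry in del_dir."""
    return (set(base_words) | load_entries(add_dir)) - load_entries(del_dir)


# Small demonstration with throwaway folders.
with tempfile.TemporaryDirectory() as tmp:
    add_dir = Path(tmp) / "to_add"
    del_dir = Path(tmp) / "to_del"
    add_dir.mkdir()
    del_dir.mkdir()
    (add_dir / "general.txt").write_text("x\ny\n", encoding="utf-8")
    (del_dir / "theme.txt").write_text("b\n", encoding="utf-8")
    result = apply_modifications({"a", "b"}, add_dir, del_dir)
```

Dropping a new file into either folder then changes the word list without touching any configuration.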

colibri for gramm'n

This seems to go very well, actually. So far things are working as well as they did with English. I have not tried some of the more advanced features yet, but they should be OK, I guess. Let's see. Do you guys use notebooks? If yes, I might create a simple overview with some ready-made convenience functions included.

yaml fails to import

Looks like in some cases there is an import error for yaml.

This is resolved by:

pip install pyyaml

additional affix combinations

There are a few affix combinations that are not taken into account by pybo, namely

འིའམ
འིའང
འོའམ
འོའང

and possibly

འིའོའམ
འིའོའང

འིའམ is present in the Tengyur, I haven't checked the others, but just in case...
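As an illustration, the combinations listed above could be detected at the end of a syllable with a longest-match-first check. This is only a sketch: `split_affix` is a hypothetical helper, not pybo's affix handling, and it does not verify that the remaining stem is a valid word.

```python
# Longest first, so that "འིའོའམ" is tried before its own ending "འོའམ".
AFFIX_COMBINATIONS = sorted(
    ["འིའམ", "འིའང", "འོའམ", "འོའང", "འིའོའམ", "འིའོའང"],
    key=len,
    reverse=True,
)


def split_affix(syllable):
    """Return (stem, affix); affix is '' when no listed combination matches."""
    for affix in AFFIX_COMBINATIONS:
        if syllable.endswith(affix) and len(syllable) > len(affix):
            return syllable[: -len(affix)], affix
    return syllable, ""


short_case = split_affix("པ" + "འིའམ")
long_case = split_affix("པ" + "འིའོའམ")
no_affix = split_affix("བདེ")
```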

finding sentence limits

While it seems quite reasonable to cut on naro + shad, there are so many edge cases where the proper cut is difficult to find that it would be helpful to have some code doing that in pybo. I'm thinking of the various types of punctuation that occur at the beginning of sentences rather than at the end, etc. I have some code that does it here plus some tests here (only part of this code is interesting, namely getAllBreakingCharsIndexes), but it could certainly be improved (and better documented, mea culpa!). This could then be combined with some heuristics to find actual sentences, not just shunits.

tokenizer fails

I've installed with pypi and I'm doing...

import pybo as bo

# initialize the tokenizer
tok = bo.BoTokenizer('POS')

# load a string to a variable
input_str = 'འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་'

# tokenize the input
tokens = tok.tokenize(input_str)

# show the results
tokens

...at which point I get:

IndexError                                Traceback (most recent call last)
~/dev/astetik_test/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381                 if cls in self.type_pprinters:
    382                     # printer registered in self.type_pprinters
--> 383                     return self.type_pprinters[cls](obj, self, cycle)
    384                 else:
    385                     # deferred printer

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    559                 p.text(',')
    560                 p.breakable()
--> 561             p.pretty(x)
    562         if len(obj) == 1 and type(obj) is tuple:
    563             # Special case for 1-item tuples.

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    398                         if cls is not object \
    399                                 and callable(cls.__dict__.get('__repr__')):
--> 400                             return _repr_pprint(obj, self, cycle)
    401 
    402             return _default_pprint(obj, self, cycle)

~/dev/astetik_test/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    693     """A pprint that just redirects to the normal repr function."""
    694     # Find newlines and replace them with p.break_()
--> 695     output = repr(obj)
    696     for idx,output_line in enumerate(output.splitlines()):
    697         if idx:

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in __repr__(self)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in <listcomp>(.0)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

~/dev/astetik_test/lib/python3.6/site-packages/pybo/token.py in <listcomp>(.0)
     60         out += '\nsyl chars in content'
     61         if self.syls:
---> 62             out += '(' + ' '.join([''.join([self.content[char] for char in syl]) for syl in self.syls]) + '): '
     63         else:
     64             out += ': '

IndexError: string index out of range

If I don't assign the result of tok.tokenize(input_str) to a variable, the error occurs at that step instead.

handling genitive case (and maybe other cases too)

It looks like right now the mode of operation is where the case particle འི་ is separated from the word it belongs to. For example སེམས་དཔའི becomes སེམས་དཔ འི which of course renders the word incomprehensible in a programmatic sense. Is there a reason to handle it this way by default? I understand there might be benefit for having it as an option though.

This discussion then goes on to all འབྲེལ་སྒྲ་ and maybe other cases as well? I did not look yet, but seeing how often ར་ comes up, I'm guessing ལ་དོན་ is handled in the same way? I guess at least with འབྲེལ་སྒྲ་ it's similar to having an english language NLP treating 's as a word as opposed to a modifier that belongs to the root word. Or am I missing something?

What's the tagset used by pybo?

Hi, the tagset used by pybo is not documented. It seems to me that pybo uses the UD POS tags, but the set is not identical.

Some additional POS tags are:
OOV (unknown words?) -> X?
OTHER (punctuation marks and symbols?) -> SYM/X?
non-word (non-tibetan word or punctuation marks?) -> X?

And "punct" is lowercase, which should be mapped to PUNCT (as per the description of UD POS tags)

I'm not sure whether there are other POS tags used, could you please list all possible POS tags and give a simple description of them?
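Until the tagset is documented, the tentative mapping suggested above could be applied as a post-processing step. The sketch below takes the question-marked guesses from this issue at face value, which is an assumption. For OTHER it picks SYM, although X is equally plausible.

```python
# Tentative mapping taken from the guesses in this issue, not from pybo docs.
TENTATIVE_UD_MAP = {
    "OOV": "X",
    "OTHER": "SYM",  # the issue suggests SYM or X; SYM is an arbitrary pick
    "non-word": "X",
    "punct": "PUNCT",
}


def to_ud(tag):
    """Map a pybo tag to a UD tag, leaving unknown tags unchanged."""
    return TENTATIVE_UD_MAP.get(tag, tag)


mapped = [to_ud(tag) for tag in ["NOUN", "punct", "OOV", "non-word"]]
```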

Sentence tokenization and detokenization

Since I don't speak Tibetan, I'm not sure whether pybo can do sentence tokenization and detokenization for Tibetan text. It seems to me that Tibetan sentences are separated by whitespace, so is it okay to just split sentences on whitespace (without resorting to machine learning approaches for real sentence boundary detection)?

The trailing whitespace is preserved in the tokenized text, but after I've removed it from the output, is it possible to detokenize the list of tokens back into a string of text (with whitespace added back between sentences)?

usage.py prints redundant characters ᛃᛃᛃ

When I run usage.py code on Jupyter Notebook, I get this:

Loading Trie...
Time: 5.155517816543579
" ཤི་"/VERBᛃᛃᛃ, "བཀྲ་ཤིས་  "/NOUNᛃᛃᛃ, "tr"/non, " བདེ་་ལེ གས"/NOUNᛃᛃᛃ, "།"/punct, " བཀྲ་ཤིས་"/NOUNᛃᛃᛃ, "བདེ་ལེགས་"/NOUNᛃᛃᛃ, "ཀཀ"/non

Are these ᛃᛃᛃ redundant, or is something not printing properly? It seems like I'm getting a response for the whole original string, so I'm guessing they are redundant?

NONE error when trying to match int or bool token attributes

Trying to match int and bool with cql creates a NONE error. This seems to happen somewhere in the fsa file. It's an issue since it stops us from matching all our Token attributes
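One possible direction, sketched independently of pybo's CQL and fsa code: convert the value written in the query to an int or bool before comparing it with the token attribute. The helper names and the convention that query values arrive as strings are assumptions made for this example.

```python
def coerce(raw_value):
    """Turn 'True'/'False' into bool and digit strings into int."""
    if raw_value in ("True", "False"):
        return raw_value == "True"
    try:
        return int(raw_value)
    except ValueError:
        return raw_value


def attribute_matches(token, attr, raw_value):
    """Compare a token attribute with a query value, respecting its type."""
    expected = coerce(raw_value)
    actual = getattr(token, attr, None)
    # bool is a subclass of int in Python, so require identical types.
    return type(actual) is type(expected) and actual == expected


class Tok:
    """Hypothetical token with str, int and bool attributes."""

    def __init__(self, pos, length, affixed):
        self.pos = pos
        self.len = length
        self.affixed = affixed


tok = Tok("NOUN", 1, True)
results = (
    attribute_matches(tok, "pos", "NOUN"),
    attribute_matches(tok, "len", "1"),
    attribute_matches(tok, "affixed", "True"),
    attribute_matches(tok, "len", "True"),
)
```

The explicit type check keeps a query for `True` from matching an attribute whose value is the integer 1.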

Warning issued after upgrading PyYAML to 5.1

Hi, after I've upgraded PyYAML to 5.1, the following warning is issued while using pybo:

>>> import pybo
>>> bo_tokenizer = pybo.BoTokenizer('GMD')

Warning (from warnings module):
  File "D:\Python\lib\site-packages\pybo\config.py", line 95
    self.config = yaml.load(g.read())
YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

Warning (from warnings module):
  File "D:\Python\lib\site-packages\pybo\lemmatizetoken.py", line 44
    parsed_yaml = yaml.load(f.read())
YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Building Trie... (20 s.)

I'm not sure whether pybo should modify its code to suppress the warning or just pin the PyYAML dependency at a lower major version (the main features of pybo are currently not affected).

Missing lemma for numbers

Here is the output after tokenization with the latest botok.

suggestion for token conventions

Now we get:

content: "༄༅། "
char types: |punct|punct|punct|space|
type: punct
start in input: 0
length: 4
syl chars in content: None
tag: punct
POS: punct

...and only some of the labels correspond to the actual attribute in the token object. For example, POS is 'pos' and start in input is 'start'. Maybe it would be better to adopt:

content
char_types
type
start_in_input
len
syl_chars_in_content
tag
pos

...and have these used for both the token object attributes and the printed output. What do you think?
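One way to keep printed labels and attribute names from drifting apart is to generate the printout from the attribute names themselves. A minimal sketch with a hypothetical `Token` dataclass that uses the names proposed above (this is not pybo's Token class):

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Token:
    """Hypothetical token whose printout is derived from its field names."""

    content: str = ""
    char_types: str = ""
    type: str = ""
    start_in_input: int = 0
    len: int = 0
    syl_chars_in_content: Optional[list] = None
    tag: str = ""
    pos: str = ""

    def __str__(self):
        # Each printed label is exactly the attribute name.
        return "\n".join(
            f"{f.name}: {getattr(self, f.name)!r}" for f in fields(self)
        )


tok = Token(
    content="༄༅། ",
    char_types="|punct|punct|punct|space|",
    type="punct",
    start_in_input=0,
    len=4,
    tag="punct",
    pos="punct",
)
labels = [line.split(":")[0] for line in str(tok).splitlines()]
```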

symbol considered as token content

There ought to be three tokens, respectively having as token.content བསྐྱུང, ྈ and ཡིད་.

tokens = pybo.BoTokenizer('POS').tokenize('བསྐྱུངྈ ཡིད་')
print(tokens)
# [content: "བསྐྱུངྈ ཡིད་"
# char types: |cons|cons|sub-cons|sub-cons|vow|cons|symbol|space|cons|vow|cons|tsek|
# type: syl
# start in input: 0
# length: 12
# syl chars in content(བསྐྱུངྈཡིད): [[0, 1, 2, 3, 4, 5, 6, 8, 9, 10]]
# tag: non-word
# POS: non-word]

Sentencize a list of tokens that have been manually tokenized by adding spaces

Hi, I'm wondering whether it is possible to conduct sentence tokenization on a list of tokens that have already been tokenized (without breaking the original word tokenization)?

I tried the answer in #38, but it seems that it no longer works in pybo 0.6.4.

>>> text = 'བཀུར་བ ར་ མི་འགྱུར་ ཞིང༌ ། ། བརྙས་བཅོས་ མི་ སྙན་ རྗོད་པ ར་ བྱེད ། ། དབང་ དང་ འབྱོར་པ་ ལྡན་པ་ ཡི ། ། རྒྱལ་རིགས་ ཕལ་ཆེ ར་ བག་མེད་པ ས ། ། མྱོས་པ འི་ གླང་ཆེན་ བཞིན་ དུ་ འཁྱམས ། ། དེ་ ཡི་ འཁོར་ ཀྱང་ དེ་ འདྲ ར་ འགྱུར ། ། གཞན་ ཡང་ རྒྱལ་པོ་ རྒྱལ་རིགས་ ཀྱི ། ། སྤྱོད་པ་ བཟང་ངན་ ཅི་འདྲ་བ ། ། དེ་ འདྲ འི་ ཚུལ་ ལ་ བལྟས་ ནས་ སུ ། ། འབངས་ རྣམས་ དེ་དང་དེ་ འདྲ་ སྟེ ། ། རྒྱལ་པོ་ ནོ ར་ ལ་ བརྐམས་ གྱུར་ ན ། ། ནོ ར་ གྱིས་ རྒྱལ་ཁྲིམས་ བསླུ་བ ར་ རྩོམ ། ། མི་བདག་ གཡེམ་ ལ་ དགའ་ གྱུར་ ན ། ། འཕྱོན་མ འི་ ཚོགས་ རྣམས་ མགོ་འཕང་ མཐོ ། ། ཕྲ་མ ར་ ཉན་ ན་ དབྱེན་ གྱིས་ གཏོར ། ། བརྟག་དཔྱད་ མི་ ཤེས་ རྫུན་ གྱིས་ སླུ ། ། ང་ ལོ་ ཡང་ན་ ཀུན་ གྱིས་ བསྐྱོད ། ། ངོ་དག ར་ བརྩི་ ན་ ཟོལ་ཚིག་ སྨྲ ། ། དེ་དང་དེ་ ལ་སོགས་པ་ ཡི ། ། མི་བདག་ དེ་ ལ་ གང་ གང་ གིས ། ། བསླུ་བ ར་ རུང་བ འི་ སྐབས་ མཐོང་ ན ། ། གཡོན་ཅན་ ཚོགས་ ཀྱིས་ ཐབས་ དེ་ སེམས ། ། མི་ རྣམས་ རང་འདོད་ སྣ་ཚོགས་ ལ ། ། རྒྱལ་པོ་ ཀུན་ གྱི་ ཐུན་མོང་ ཕྱིར ། ། རྒྱལ་པོས་ བསམ་ གཞིགས་ མ་ བྱས་ ན ། ། ཐ་མ ར་ རྒྱལ་སྲིད་ འཇིག་པ ར་ འགྱུར ། ། ཆེན་པོ འི་ གོ་ས ར་ གནས་པ་ ལ ། ། སྐྱོན་ ཀྱང་ ཡོན་ཏན་ ཡིན་ཚུལ་ དུ ། ། འཁོར་ ངན་ རྣམས་ ཀྱིས་ ངོ་བསྟོད་ སྨྲ ། ། དེ་ཕྱིར་ སྐྱོན་ཡོན་ ཤེས་པ་ དཀའ ། ། ལྷག་པ ར་ རྩོད་ལྡན་ སྙིགས་མ འི་ ཚེ ། ། འཁོར་ གྱི་ ནང་ ན་མ་ རབས་ མང༌ ། ། སྐྱོན་ ཡང་ ཡོན་ཏན་ ལྟར་ མཐོང་ ལ ། ། རང་འདོད་ ཆེ་ ཞིང་ རྒྱལ་པོ་ བསླུ ། ། ཆུས་ དང་ འཁོར་ གྱི་ བདེ་ ཐབས་ ལ ། ། བསམ་ གཞིགས་ བྱེད་པ་ དཀོན་པ འི་ ཕྱིར ། ། རྒྱལ་པོས་ ལེགས་པ ར་ དཔྱད་ ནས་ སུ ། ། བདེན་པ འི་ ངག་ ལས'
>>> tokens = text.split()
>>> pybo.sentence_tokenizer(tokens)
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    pybo.sentence_tokenizer(tokens)
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 16, in sentence_tokenizer
    sent_indices = get_sentence_indices(tokens)
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 43, in get_sentence_indices
    sentence_idx = extract_chunks(is_endpart_n_punct, tokens, 0, previous_end)
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 142, in extract_chunks
    if test(subtokens[n - 1], token):
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 80, in is_endpart_n_punct
    return is_ending_part(token1) and token2.chunk_type == 'PUNCT'
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 75, in is_ending_part
    return token and token.pos == 'PART' \
AttributeError: 'str' object has no attribute 'pos'
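The traceback shows that sentence_tokenizer reads attributes such as pos and chunk_type, so it expects Token objects rather than plain strings. When only a list of strings is available, a purely punctuation-based grouping is possible without pybo. The sketch below is such a fallback (hypothetical, and much cruder than pybo's logic): it closes a segment after each run of shad tokens and never alters the word tokenization.

```python
SHAD = "།"


def group_on_shad(words):
    """Group pre-tokenized words into segments that end with a run of shads."""
    segments, current = [], []
    seen_shad = False
    for word in words:
        if word == SHAD:
            current.append(word)
            seen_shad = True
            continue
        if seen_shad:
            # First word after a run of shads starts a new segment.
            segments.append(current)
            current = []
            seen_shad = False
        current.append(word)
    if current:
        segments.append(current)
    return segments


text = "བཀུར་བ ར་ མི་འགྱུར་ ཞིང༌ ། ། བརྙས་བཅོས་ མི་ སྙན་ རྗོད་པ ར་ བྱེད ། །"
words = text.split()
segments = group_on_shad(words)
```

For verse like the sample, this yields one segment per line of verse, which may or may not coincide with a sentence.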

How to add my own dictionary

Hi all,
When I use the 'pybo' tools to segment Tibetan words, there are many OOV words, so I would like to know how to add my own dictionary to improve word segmentation accuracy.
Thanks!

Failed to tokenize text with pybo 0.6.3

Hi, after I upgraded pybo to 0.6.3, it seems that the tokenizer does not work anymore.

>>> import pybo
>>> tok = pybo.WordTokenizer('POS')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    tok = pybo.WordTokenizer('POS')
  File "D:\Python\lib\site-packages\pybo\tokenizers\wordtokenizer.py", line 33, in __init__
    main, custom = config.get_tok_data_paths(profile, modifs=modifs, mode=mode)
  File "D:\Python\lib\site-packages\pybo\config.py", line 86, in get_tok_data_paths
    files = self.config['tokenizers']['profiles'][profile]
KeyError: 'tokenizers'

I've also noticed that there are some API changes. Is it okay for me to just change BoTokenizer to WordTokenizer?

Also, I'm quite interested in the new Tibetan sentence and paragraph tokenizer (isn't every single line of text a paragraph in Tibetan?).

Cache and reuse temporary files to speed up initialization

When initializing BoTokenizer, POS_trie.pickled and pybo.yaml are generated in the working directory. Is it possible to cache these files so that there would be no need to regenerate them after the first initialization? (I'm not sure whether this would speed up the initialization process, though.)

And I think that it would be better to store the temporary files in a fixed position (e.g. C:/Users/UserName/Documents on Windows), instead of the working directory.

Also, this would require pybo to check whether the temporary files were generated by an older version of pybo (hence the need to update them even if a cached version already exists).
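The caching idea, including the version check, might look like the sketch below. Everything here is hypothetical: `load_or_build`, the file naming scheme, and the stand-in `build_trie` are illustrations, not pybo code. In real use the cache directory would be a fixed per-user location instead of a temporary one.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_dir, name, version, build):
    """Return cached data for this version, building and storing it on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The version is part of the file name, so an upgrade forces a rebuild.
    cache_file = cache_dir / f"{name}-{version}.pickle"
    if cache_file.exists():
        with cache_file.open("rb") as handle:
            return pickle.load(handle)
    data = build()
    with cache_file.open("wb") as handle:
        pickle.dump(data, handle)
    return data


build_calls = []


def build_trie():
    # Stand-in for the expensive trie construction.
    build_calls.append(1)
    return {"བཀྲ་ཤིས་": "NOUN"}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build(tmp, "POS_trie", "0.6.4", build_trie)
    second = load_or_build(tmp, "POS_trie", "0.6.4", build_trie)  # cache hit
    upgraded = load_or_build(tmp, "POS_trie", "0.6.5", build_trie)  # rebuilt
```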

Path issue after frozen with PyInstaller on macOS

Hi, the path in the following file is resolved to something like
Users/username/Desktop/program.app/Contents/MacOS/pybo/textunits/../resources/bo_uni_table.csv
instead of
Users/username/Desktop/program.app/Contents/MacOS/pybo/resources/bo_uni_table.csv
after pybo is frozen into my program using PyInstaller on macOS, crashing the program on startup

https://github.com/Esukhia/pybo/blob/81bfeb19311d1dd44f9bbb31d891155353377b07/pybo/textunits/charcategories.py#L9

This issue occurs on Linux as well (tested on Ubuntu 16.04), but does not happen on Windows.

I solved this issue by modifying this line before freezing pybo into my program:
table_path = Path(__file__).parent.parent / "resources/bo_uni_table.csv"

So is it okay for pybo to modify its codebase to help freezing on macOS and Linux?
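The difference between the two forms can be shown with pathlib alone, since joining a ".." segment does not collapse it. The sketch assumes the original line joined "../resources/bo_uni_table.csv" onto the module's directory, which is what the resolved path in the report suggests. The "/app" prefix is a made-up install location.

```python
from pathlib import PurePosixPath

# Hypothetical location of the module inside a frozen application.
module_file = PurePosixPath("/app/pybo/textunits/charcategories.py")

# Joining ".." keeps the segment literally in the path.
with_dotdot = module_file.parent / "../resources/bo_uni_table.csv"

# Going up with .parent produces a path without "..".
with_parent = module_file.parent.parent / "resources/bo_uni_table.csv"

print(with_dotdot)  # /app/pybo/textunits/../resources/bo_uni_table.csv
print(with_parent)  # /app/pybo/resources/bo_uni_table.csv
```

A path containing ".." can only be followed if the directory before it actually exists on disk, which presumably is not the case for textunits inside the frozen bundle on macOS and Linux.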

default value for Token#pos

Right now, it is None, which triggers an error when needing to concatenate the content of this field for tokens that don't have a pos (such as non-bo tokens).

Should be an empty string instead.
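The difference is easy to reproduce with a stand-in class (hypothetical, not pybo's Token): string concatenation works with an empty string but raises TypeError with None.

```python
class Token:
    """Hypothetical token whose pos defaults to an empty string."""

    def __init__(self, content, pos=""):
        self.content = content
        self.pos = pos


def tag_string(tokens):
    """Render tokens as content/pos pairs by plain concatenation."""
    return " ".join(t.content + "/" + t.pos for t in tokens)


# A token without a POS (for example a non-bo token) no longer breaks this.
rendered = tag_string([Token("བཀྲ་ཤིས་", "NOUN"), Token("tr")])

# With None as the value, the same concatenation fails.
try:
    tag_string([Token("tr", pos=None)])
    none_fails = False
except TypeError:
    none_fails = True
```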

Huge memory cost when initializing the tokenizer

Hi, the memory cost of initializing the tokenizer is extremely high on my 64-bit Windows 10 machine: ~1600MB for the GMD profile, ~600MB for the POS profile, and ~500MB for the tsikchen profile, regardless of whether the trie is built from scratch or loaded from the existing trie file.

And the memory in use does not decrease after the tokenizer is initialized, is this the expected behavior?

Remove trailing whitespace in tokens

When the text to be tokenized includes non-tibetan words that use whitespace as the word boundary (for example english words), there's a trailing whitespace for each token in the results.

While it is easy for users of pybo to strip the trailing whitespace with rstrip(), I think this should be the default behavior.

If the tokens are to be detokenized into running text later, then a specialized detokenizer is needed to add the whitespace back only to non-tibetan words that use whitespace as the word boundary.
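One way to get both clean tokens and lossless detokenization is to keep the stripped whitespace next to each token instead of discarding it. A sketch over plain strings (the helper names are hypothetical and a pybo Token would need an extra attribute for this):

```python
def split_trailing_space(token):
    """Return (text, trailing_whitespace) for one raw token string."""
    stripped = token.rstrip()
    return stripped, token[len(stripped):]


def detokenize(pairs):
    """Rebuild the running text from (text, trailing_whitespace) pairs."""
    return "".join(text + space for text, space in pairs)


raw_tokens = ["བཀྲ་ཤིས་", "hello ", "world  ", "།"]
pairs = [split_trailing_space(t) for t in raw_tokens]
clean = [text for text, _ in pairs]
restored = detokenize(pairs)
```

Only tokens that actually had trailing whitespace get it back, so Tibetan tokens stay untouched.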

sanskrit entries don't seem to be inflected

ཡུལ་ ཀཽ་ $ཤཱཾ་ $བཱིར་ འོངས་ ཏེ །_ ཀཽ་ཤཱཾ་བཱི་ ན་

is it related to the fact this word is added dynamically?

Unicode normalisation

Add a Unicode normalization method for bad/ambiguous unicode:
https://github.com/Esukhia/derge-tengyur/blob/c45da1faabef28ff0b037557499bf07946e5c3ab/scripts/error-report.py#L44
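As a baseline, the standard library already maps canonically equivalent sequences to a single form. The sketch below uses `unicodedata.normalize`. It does not reproduce the checks in the linked script, which may cover cases that standard normalization does not address. The wrapper name `normalize_bo` is hypothetical.

```python
import unicodedata


def normalize_bo(text):
    """Apply standard Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u0f43"        # TIBETAN LETTER GHA as a single code point
decomposed = "\u0f42\u0fb7"   # TIBETAN LETTER GA + TIBETAN SUBJOINED LETTER HA

# The two spellings differ as raw strings but agree after normalization.
same_before = precomposed == decomposed
same_after = normalize_bo(precomposed) == normalize_bo(decomposed)
```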

pybo 0.6.0 tokenizer failed for འིའོ

Here is an example that reproduces this error.

# examples on which it fails
str_01 = 'ལོག་པའི་བཤེས་གཉེན་དང་གྲུབ་མཐའ་ངན་པས་བརྟགས་པ་སོགས་ལ་མ་བལྟོས་པར་ཐོག་མེད་ནས་བདག་དང་བདག་གིར་འཛིན་པའི་བག་ཆགས་ཡོངས་སུ་གོམས་པས་ཕུང་པོ་ལ་དམིགས་ནས་ངའོ་བདག་གོ་འདི་ནི་ངའིའོ་བདག་གིའོ་སྙམ་དུ་ཡིད་ལ་བྱེད་པ་ཅན་ཐ་ན་བྱ་དང་རི་དྭགས་ཚུན་ཆོད་ལའང་འབྱུང་བ་གང་ཡིན་པ་ནི་འཇིག་ལྟ་ལྷན་སྐྱེས་ཞེས་བྱ་ལ། ༼༢༧༣ན༽དེའི་ཡུལ་ནི་འཇིག་ལྟ་ལྷན་སྐྱེས་ཀྱི་ཡུལ་གྱི་བདག་སྟེ་གྲུབ་མཐས་བཏགས་པ་མ་ཡིན་ནོ། །'
str_02 = 'ཁ་ཅིག ཨི་ཎ་འགྲོ་བའི་བྱིངས་ལ་པྲ་ཏི་སྔོན་དུ་བཞག ཎ་རྗེས་སུ་འབྲེལ། གཞན་ལས་ཀྱང་མཐོང་ངོ་། །ཀྭི་པའིའོ། །ཞེས་པས་ཀྭི་པའི་རྐྱེན་བྱིན། ཀ་ཡིག་ནི་ཀ་ལ་ཡ་ཎ་གཞན་ཡང་སྟེ་ཡ་ལ་བརྗོད་པ་བོར་བའོ། །ཞེས་པའི་དོན་པ་ཡིག་ནི། བྱིངས་ཀྱི་མཐའ་པ་རྗེས་སུ་འབྲེལ་པ་ལ་ཏའོ། །ཞེས་པའི་དོན་དེ་ཉིད་ཀྱི་ཕྱིར། ཏ་ཨཱ་ག་མ་བྱིན། བི་ཡིག་ནི་སྦྱར་བ་མིན་པའི་ཝིའི་དབྱིའོ། །ཞེས་པས་དབྱི། ཨུཏ་དང་གོ་སོགས་ལས་ཡ་ཏའོ། །ཞེས་པས་དེ་ཕན་གྱི་ཡཏ་རྐྱེན་དུ་བྱིན། ཏ་ཡིག་ནི་ཏའི་དེའི་དུས་ཀྱིའོ། །ཞེས་པའི་ཁྱད་པར་གྱི་དོན་ཡིན་པས་དབྱི། ཡང་ན། དེར་ལེགས་པ་ལའང་ཡའོ། །ཞེས་པས། དེ་ཕན་གྱི་ཡ་རྐྱེན་དུ་བྱིན། རྟགས་ཀྱི་དོན་གྱི་ཚིག་ལ་རྣམ་དབྱེ་དང་པོའོ། །ཞེས་པས་སི་བྱིན། སིའི་ཡིག་རྗེས་འབྲེལ། ས་རྣམ་བཅད་དུ་བཏང་། སྔར་བཞིན་མཚམས་སྦྱར་བས། པྲ་ཏཱི་ཏྱཿ ས་མུཏྤཱ་ད་སྔར་ལྟར་སྦྱིན། མིང་རྣམས་ཀྱི་སྦྱོར་བའི་དོན་ནི་བསྡུ་བའོ། །དེར་གནས་རྣམ་དབྱེ་རྣམས་དབྱི་བར་བྱའོ། །ཞེས་པས་ཚིག་སྡུད་བྱས་ནས་རྣམ་དབྱེ་ཕྱིས་པས། པྲ་ཏཱི་ཏྱ་ས་མུཏྤཱ་ད། ཞེས་པ་སྟེ་པྲ་ཏི་ནི་ཟློས་པ་སྟེ། སོ་སོ་སོ་སོ་དང་། ཨི་ཏི་ནི་འགྲོ་བ་དང་། ཆས་པ་དང་། འཇིག་པའི་དོན་དང་། ཡ་ནི་རུང་བའམ། འོས་པ་སྟེ་སོ་སོ་སོ་སོར་འགྲོ་ཞིང་འཇིག་པ་དང་ལྡན་པ་རྣམས་ཀྱི་འབྱུང་བ་ཞེས་བྱ་བའི་དོན་ཏོ། །ཞེས་འཆད་པར་བྱེད་དོ། །'
str_03 = 'སྤྱིར་རྣམ་རྟོག་ཆོས་སྐུ་སོགས་དགོངས་ལྡེམ་དགོངས་ཀྱིས་གསུངས་པའི་ཆོས་འདི་ཟབ་ལ་རྒྱ་ཆེ་བ་ཡིན། ཉན་ཐོས་པ་ལ་སྒྲ་ཇི་བཞིན་པ་རེ་བཅོམ་ལྡན་འདས་ཀྱིས་གསུངས་པ་དེའང་གོ་ཚད་དུ་གདའ། རྒྱུའི་ཐེག་ཆེན་ལ་དེ་ལས་བློ་ཆེ་བ་ལ་དགོངས་ལྡེམ་དུ་གསུངས་པ་མང་བས་ཆོས་ཟབ་ཏུ་སོང༌། དེ་ལ་ཡང་འཁོར་ལོ་སྔ་ཕྱིའི་དགོངས་ལྡེམ་དགོངས་ལ་ཕུལ་ཆེ་ཆུང་ཡོད། འབྲས་བུའི་ཐེག་ཆེན་བློ་ཤིན་ཏུ་ཆེ་བ་ལ་མཐའ་གཞན་འགོག་པའི་རིགས་པས་དགོངས་སྡེམ་དགོངས་ཁོ་ནར་གསུངས་པས། ཆོས་ཀྱི་ཟབ་རྒྱས་ཐམས་ཅད་ཁྲལ་ཁྲོལ་དུ་བཏང་བས་ན། བཅོམ་ལྡན་འདས་ཕྱག་ན་རྡོ་རྗེ་ལྟ་བུ་ཞིག་མིན་པ་སུ་ཡིན་གྱི་བློ་ལ་མི་འབབ། དེས་ན་བརྒྱུད་པ་འདིའི་རྣམ་རྟོག་ཆོས་སྐུའི་སྐད་སོགས་ཟབ་ལ་རྒྱ་ཆེ་བས་སུ་ཡིན་གྱི་ཡུལ་མིན། དེས་ན་སེམས་འདི་དང་སེམས་སྣང་འདི་སེམས་ཅན་རང་ཁམས་ལ་སྣང་བདེན་གཉིས་ཀར་ཞེན་པའི་མིག་བསླད་པོའི་ཕ་རོལ་བོའི་རྒྱུད་ཀྱི་སྒྱུ་མའི་རྟ་གླང་ལྟར་སྣང༌། འཕགས་པ་ལ་སྒྱུ་མ་མཁན་རང་གིས་མཐོང་བ་ལྟར་སྣང་ཡང་བདེན་ཞེན་མེད། སངས་རྒྱས་ལ་ནི་མིག་མ་བསླད་པའི་རྒྱུད་ཀྱི་སྒྱུ་མའི་རྟ་གླང་མ་མཐོང་བ་བཞིན་སེམས་དང་སེམས་སྣང་ཅི་ཡང་མཐོང་བ་ཡོད་པ་མ་ཡིན། ཁ་ཅིག་འདི་ཆོས་སྐུ།འདིའི་སྣང་ཆ་འདི་དེའི་རང་འོད་ཡིན་ཟེར་བ་མང་པོ་འདུག་སྟེ། འདི་འདྲའི་རྟོག་དཔྱོད་བག་རེམ་བྱས་ན་ཐར་པའི་སྲོག་རྩ་ཨེ་ཆད་བསམ་མནོ་ནི་གཏོང་དགོས་པར་འདུག །སྐུ་དབང་ནི་དྲུང་པ་རྣམ་པ་ཆེ། འོ་ན་སངས་རྒྱས་ལ་མཁྱེན་པ་མེད་པས་ཀུན་མཁྱེན་ཡེ་ཤེས་ཀྱང་མེད་པར་ཐལ་སྙམ་ན། དེ་དག་པའི་རྟེན་འབྲེལ་གྱིས་ཡོད་པ་ལྟར་སྣང་མོད། ཡོད་པ་ལྟར་སྣང་བ་དེ་མི་རིགས་ཏེ། དེར་སེམས་སེམས་སྣང་དང་བཅས་པ་གདོད་མ་ནས་གཏན་མེད་པའི་ཕྱིར་ཞེ་ན། མ་ཁྱབ་སྟེ། སེམས་ཅན་ལ་ཡང་སེམས་སེམས་སྣང་གདོད་མ་ནས་གཏན་མེད་ཀྱང་མ་དག་པའི་རྟེན་འབྲེལ་ལ་ཡོད་པ་ལྟར་སྣང་འདུག་ན་ཅི་སྟེ་འགལ། གྲགས་པ་ཅན་མང་པོ་རྟེན་འབྲེལ་གྱི་ཐད་འདིར་ཕྱིན་པ་དང༌། ཝ་རྒན་མ་ཁྱིས་བདས་པ་འདྲ་རེ་འོང་གིན་འདུག །དེ་ཐམས་ཅད་གོ་ཡུལ་གྱི་ལྟ་བ་སྤྱི་རྒྱ་གཅོད་པ་ཡིན་ཏེ། བྱིན་རླབས་ཀྱི་སེམས་འཛིན་ཐབས་ཤིག་མེད་ན། སྟོད་ལུང་རྒྱ་དམར་བས་ཁོ་བོ་དབུ་མའི་ལྟ་བ་ལ་ཕུ་ཐག་ཆོད་ཀྱང༌། སེམས་འཛིན་གྱི་གནད་ཅིག་མ་ཤེས་པས་སེམས་ལས་སུ་མ་རུང༌། བལ་པོ་རྒྱ་གར་དུར་ཁྲོད་བ་བྱ་བའི་གྲུབ་ཐོབ་ཅིག་བྱོན་གདའ་བ་ན་སོ་རྒས་ཀྱང་བསྙེགས་དགོས་གསུངས་ན། ༼ལོ་རྒྱུས་འདི་དུས་མཁྱེན་བཀའ་འབུམ་ན་འདུག༽ དེའི་ཕྱིར་བརྗོད་པ། ལུས་རྡོ་རྗེ་སྐྱིལ་ཀྲུང་སོགས་རྣམ་སྣང་ཆོས་བདུན་ལས། གཞན་ཉལ་བརྐྱང་སོགས་སྤྱོད་ལམ་གྱི་དུས་སུའང་མི་བྱེད། 
ངག་གསོལ་འདེབས་སྨྲ་བཅད་མ་གཏོགས་སྤྱོད་ལམ་གྱི་དུས་སུའང་མི་བྱེད། ཡིད་ལ་རྣམ་རྟོག་རང་དགའ་བ་སྤྱོད་ལམ་གྱི་དུས་སུའང་མི་བྱེད། རང་ཆོས་དང་མི་མཐུན་པའི་ཚུལ་འདི་ཙམ་ལ་ཆོས་ལྟར་སྣང་བྱེད་པ་འདི་ཀུན་ཀྱང་རང་ལ་བལྟོས་ནས་སངས་རྒྱས་ཙམ་དུ་བལྟས་ཏེ་ལོག་རྟོག་སྐད་ཅིག་ཀྱང་མི་བྱེད། རང་གི་སྦྱངས་ཡོན་གྱི་དམ་བཅའ་ཅི་ནུས་རེ་ལ་གཞི་བཟུང་ནས་གོང་དུ་བཤད་པའི་གོ་བ་རྣམས་ལ་སེམས་འཛིན་ཐུབ་ན་རབ་སྟེ། དེ་མ་ཟིན་ན་ཞི་གནས་སྐྱེས་པ་གཅིག་པུས་ཆོག །ལྷག་མཐོང་གི་དམིགས་པ་སླར་ཁྲིད་མི་དགོས་པ་ཞིག་ཡོང་བ་ཡིན། དེ་ལྟར་མ་ནུས་ན་སྐྱབས་སེམས་དང་འབྲེལ་བར་བྱས་ནས། རང་གི་སེམས་རིག་ཙམ་འདི་ཀརྨ་པའི་སྐུ་ངོ་བོ་སྙིང་རྗེ། རྣམ་པ་བྱམས་པ། མཚན་ཉིད་དགའ་བ། བྱ་བ་བཏང་སྙོམས་པའི་ངོ་བོར་བསྒོམ་པས་སེམས་ལ་ཚད་མེད་བཞི་ལྡན་གྱི་ཏིང་ངེ་འཛིན་ལ་ཅི་འདོད་ཀྱི་བར་འཇོག་ཐུབ་པ་འབྱུང༌། དེ་ནས་རང་ལུས་སྟོང་སིང་ངེ་བར་བསྒོམས་ཏེ། རང་གི་སེམས་རིག་ཙམ་འདི་ཀརྨ་པའི་གསུང་མི་འཇིགས་པ་སྦྱིན་པ། ཉོན་མོངས་པ་ཞི་བ། སྙན་ལ་འཇེབས་པ། སྡུག་བསྔལ་དང་ཡི་མུག་འཕྲོག་པའི་ངོ་བོར་བསྒོམས་པས་སེམས་ལ་འདོད་པའི་ཆགས་པ་དང་བྲལ་ཞིང༌། ཆོས་བརྒྱད་ཀྱི་བློ་ཞི་ནས་ཏིང་ངེ་འཛིན་རྩེ་གཅིག་པ་བུམ་ནང་གི་མར་མེ་ལྟ་བུ་ཅི་ནུས་སུ་མཉམ་པར་འཇོག་ནུས་པ་འབྱུང༌། དེས་ན་རང་སེམས་ཀྱང་སྟོང་པ་ཉིད་ཀྱི་ངང་ཁོ་ནར་བསྡུས་ཏེ། དེའི་སྟོང་པ་ཉིད་ཀྱི་ངང་ལས་ཀརྨ་པའི་ཐུགས་ཀུན་མཁྱེན་པ། དྲི་མ་མེད་པ། མ་སྐྱེས་པ། ཐ་སྙད་ཀྱི་ཡུལ་མིན་པར་བསྒོམས་ཏེ། མྱོང་བ་སྐྱེས་པ་ན་དེ་རེ་རེའི་སྟེང་དུ་མ་བཅོས་པར་བསྲི་ཞིང་ལྷོད་གློད་པས། རང་སེམས་ལ་ཡུལ་སྣང་འདི་རྒྱལ་པོ་ལ་བློན་པོས་ངོ་ལོག་བྱས་པ་འདྲ་བ་ཞིག་གི་ཉམས་འབྱུང༌། དེ་ནས་རང་སེམས་ཆགས་ཡུལ་མེད་པས་རྒྱལ་པོ་རྒྱལ་ས་ནས་བུད་པ་བཞིན་སེམས་རྨེག་མེད་དུ་བུད་འགྲོ། དེ་ནས་རྒྱལ་ས་གཞན་གྱི་དབང་དུ་བྱས་པ་བཞིན་སྤྲོས་མེད་ཕྱག་རྒྱ་ཆེན་པོ་སྟོང་པ་ཉིད་དེ། རྒྱལ་པོ་རྒྱལ་ཁྲིའི་ཁ་ན་ཕྱིན། ཟ་ཁ་བཅོ་བརྒྱད་ནི་དར། ལང་ལང་བྱེད། ལིང་ལིང་བྱེད་འདུག་གོ །ངས་ནི་དེ་ལས་མི་ཟེར། དེ་ནས་སྟོང་པ་དེ་ཉིད་ཀྱི་ངང་ནས་བླ་མ་ཀརྨ་པའི་སྐུ་གསུང་ཐུགས་གང་དུ་བཤད་པ་ལྟ་བུ་ཡོངས་རྫོགས་ཞིག་ལ་མཉམ་པར་བཞག་པས། མཉམ་པར་འཇོག་བྱེད་ཀྱི་ཤེས་པ་དེ་རིམ་བཞིན་ཤེད་ཇེ་ཆུང་ཇེ་ཡལ་དུ་འགྲོ་བས། དེའི་དུས་མ་བཅོས་པར་གློད་ཐུབ་པ་ཞིག་དགོས་པ་ཡིན། ༼འདིར་རྩོལ་བ་དང་བཅོས་བཅོས་དང་སྒྲིམ་སྒྲིམ་བྱས་ན་རྣམ་རྟོག་ཚོ་ལ་ཟན་མར་གཡུ་བྱིན་འདུག་གོ༽ དེར་གློད་ཐུབ་ན་འཁོར་འདས་ཀྱི་རྒྱབ་གྱེས་མཚམས། རྣམ་ཤེས་ཡེ་ཤེས་ཀྱི་གཡུལ་འགྱེད་པ། 
བླ་མ་དང་གདུལ་བྱའི་བྱིན་རླབས་འཇུག་མཚམས་ཤིག་འདིར་ཡོད་པ་ཡིན། རྣམ་རྟོག་ལ་སྒོམ་གྱི་ཐིལ་བྱེད་ཟེར་བ་ཀུན་འདི་ཚོ་ལ་སླེབས་ན་ཁ་རྒྱག་པ་ལ་ཀྲིག་འབྱར་ཡོང་གི་ཡོད། བསགས་ཆུང་ཁ་ཅིག །སྒོམ་ཤོར་སོང་སྙམ་སྟེ་དམིགས་པ་ལ་གསལ་འདེབས་པ་སྐད་དང༌། ཡུལ་ལ་མཉམ་པར་འཇོག་ཅི་ཐུབ་ཐུབ་བྱེད་པ་ཀུན་ཡོད་དེ། དེ་ཀུན་བཅོས་མ་བྱ་བ་ཡིན། གློད་མ་ཤེས་པའི་སྐད་ཡིན། དེར་གློད་ཤེས་ཀྱིས་གློད་ཐུབ་ན་སེམས་ཅན་དང་འཁོར་བ་བྱ་བ་འདི་ཀུན་མེ་ཆེན་གྱི་དཀྱིལ་ན་ཆབ་རོམ་བཞག་པ་འདྲ་བ་འོང་ཚོད་དུ་གདའ། གཞན་སེམས་ལ་གློད་ཟེར་ནས་སེམས་བྲེང་མ་ཆད་སྐྱེར་འཇུག་པ་ལ་རྗེས་ཀྱི་རྟོག་པ་རེས་ལྷན་པ་བསླན་པ་ལྟ་བུ་ལ་མི་ཟེར། ཆོས་འདི་ཀུན་གྱི་རྒྱུས་ཡོད་པ་ལ་སྒོམ་ཆེན་རང་ཞིག་དགོས། སློབ་མ་ཆོས་པ་བྱ་བ་དེ་བླ་མ་སངས་རྒྱས་ཀྱིས་ཀྱང་བླ་མར་ཁུར་བ་འདྲ་ཞིག་དགོས་པ་ཡིན། ངས་དེ་ལ་ཕན་མི་ཐོགས་པར་གདའ་ཟེར་ཏེ་ཞལ་ནས་ཤུགས་ནར་ཐོན་པ་ཞིག་བྱུང་ཕྱིན་ཆོས་པ་ཡིན་པ་སྐད་དུ་བཅོས་ཀྱང༌། སྐྱེ་བ་ཕྱི་མར་རྡོ་རྗེ་དམྱལ་བར་ལན་གཅིག་འབྱོན་དགོས་པ་ཨེ་ཡིན། དེ་ལྟར་ནའང་རྣལ་འབྱོར་འདན་མ་དེ་ལ་ནུས་པ་ཞིག་ཡོད་དེ། འདིར་འབད་དགེ་བས་བདག་གི་བླ་མ་མཐོང་ཐོས་དོན་དེ་འགྲུབ་པར་གྱུར་ཅིག །དབྱངས་ཅན་བཟང་པོའིའོ།། །ཤུ་བྷཾ། །'
str_04 = 'དེ་ལ་འདིར་འཁོར་ལོའི་ཡིག་འབྲུ་དང་ཆོས་ཀྱི་དབྱིངས་ཀྱི་སྔགས་སོ། །ཧཱུཾ་བཛྲ་དྷྲིཀ྄་ཅེས་པ་ལ་སོགས་སྔགས་ནི། རྡོ་རྗེ་འཛིན་དང་རྒྱལ་བའི་རྒྱལ། །རིན་ཆེན་འཛིན་དང་ཆགས་ཆེན་མཚོན། །ཤེས་རབ་འཛིན་ཅེས་བྱ་བ་སྟེ། །འདི་རྣམས་ཀྱི་དོན་ནི་གོང་དུ་རིགས་ལྔའི་སྐབས་སུ་བཤད་ཟིན་ནོ། །མཱུཾ་དྷ་ཏུ་ཤྭ་ཞེས་བྱ་བ་ལ་སོགས་པ། མཱུཾ་ནི་མོཀྵ་སྟེ་ངོ་བོ་ཉིད་ཀྱི་མཁའ་རང་བཞིན་གྱི་རྣམ་པར་ཐར་པའོ། །ལཱཾ་ནི་ལོ་ཙ་ན་སྟེ་སྤྱན་ནོ། །མཱཾ་ནི་མཱ་མ་ཀཱི་སྟེ་བདག་ཉིད་མའོ། །པཱཾ་ནི་པན་དར་སྟེ་དཀར་མོའོ། །ཏཱཾ་ནི་ཏ་ར་སྟེ་སྒྲོལ་མའོ། །ཀླད་ཀོར་རྣམས་ནི་བྱང་ཆུབ་སེམས་ཀྱི་ཐིག་ལེ་སྟེ་ས་བོན་གྱི་དོན་ནོ། །གཞན་ལ་ཡང་དེ་བཞིན་དུ་ཤེས་པར་བྱའོ། །དྷ་ཏུ་ཤྭ་རི་ནི་དབྱིངས་ཀྱི་དབང་ཕྱུག་མའོ། །དྭེ་ཥ་ར་ཏི་ཞེས་བྱ་བ་ལ་སོགས་པ་ནི་ཞེ་སྡང་དགའ་མ། གཏི་མུག་དགའ་མ། འདོད་ཆགས་དགའ་མ། རྡོ་རྗེ་དགའ་མ་སྟེ། ཞེ་སྡང་གི་རིགས་ལ་སོགས་པའི་གཙོ་བོ་རྣམས་མཉེས་པར་བྱེད་པས་ཞེ་སྡང་དགའ་མ་ལ་སོགས་པའོ། །རྣམ་པ་གཅིག་ཏུ་མི་གཉིས་པའི་ཤེས་རབ་ཆེན་པོ་ཉིད་དུག་གསུམ་རྡོ་རྗེའི་སྒྲས་བརྗོད་པ་ཡིན་ལ། དེ་ཉིད་དགའ་མ་སྟེ་ཤེས་རབ་མ་ཡིན་པའི་ཕྱིར་རོ། །ཇི་སྐད་དུ་གསང་བ་འདུས་པ་ལས། གཉིས་མེད་ཆོས་ཀྱི་ཡེ་ཤེས་ལས། །ཕྱིར་རོལ་ང་རྒྱལ་རྨོངས་ཞེས་བྱ། །དེ་ལ་ཕན་ཚུན་ཐུག་པ་ནི། །ཞེ་སྡང་ཞེས་ནི་བསྟན་པ་ཡིན། །མཚན་ཉིད་འདོད་ཆགས་ཀུན་ཞེན་པ། །ཡེ་ཤེས་འདི་ནི་རྡོ་རྗེའོ། །གཏི་མུག་ཞེ་སྡང་འདོད་ཆགས་དང༌། །རྡོ་རྗེ་རྟག་ཏུ་དགའ་མར་སྦྱར། །ཞེས་གསུངས་པ་ལྟ་བུའོ། །ཀྵིཾ་ཧི་རཱ་ཛ་ཡ་ཞེས་བྱ་བ་ལ་སོགས་པ་ནི་སེམས་དཔའ་རྣམས་ཀྱི་གསང་སྔགས་ཏེ། དེ་ལ་ཀྵིཾ་ཞེས་བྱ་བ་སའི་ས་བོན་ནོ། །དེ་བཞིན་དུ་ཏྲཾ་ནི་རིན་པོ་ཆེའིའོ། །ཧྲཱི་ནི་པདྨའིའོ། །ཛིཾ་ནི་རྒྱལ་བའི་ས་བོན་ནོ། །སྔགས་ཀྱི་དོན་ནི་སའི་རྒྱལ་པོ། སྟོང་པའི་སྙིང་པོ། པདྨའི་ཕྱག །རྒྱལ་བྱེད་ཕྱག་ཅེས་ཟེར་ཏེ། སེམས་དཔའ་བཞིའི་དོན་དང་སྔར་བཤད་པ་བཞིན་དུ་སྦྱར་རོ། །ཧཱུཾཾ་ལ་སྱ་ས་མ་ཡ་སྟྭཾ། ཞེས་བྱ་བ་ལ་སོགས་པ་ནི་ཡུལ་བཞིའི་སེམས་མ་རྣམས་ཀྱི་སྙིང་པོ་སྟེ། དེ་ལ་རིགས་བཞི་དང༌། ནང་གི་སེམས་མ་བཞི་དང༌། སྦྱོར་བའི་བརྡ་བཞི་རྣམས་དང་སྦྱར་ནས་སེམས་མ་བཞིའི་རང་བཞིན་དུ་སྟོན་ཏེ། དེ་ལའང་ཧཱུཾ་ཏྲཱཾཿཧྲཱིཿཨཱཿབཞི་ནི་རིགས་བཞིའི་དོན་ནོ། །ལ་སྱ་མ་ལེ་གཱིརྟི་ནཱིརྟི་བཞི་ནི་ནང་གི་ལྷ་མོ་བཞི་སྟེ། དེ་ཡང་སྒེག་མོ་ནི་འདོད་ཆགས་ཀྱི་ཡུལ། ཕྲེང་བ་ནི་འཁྱུད་པ། གླུ་ནི་དགའ་བར་གྱུར་པ། གར་ནི་བསྐྱོད་པར་གྱུར་བའོ། །'

# tokenizer code
t = Text(str_02, tok_params={'profile': 'POS'})
tokens = t.tokenize_words_raw_text

Error traceback. Note: all four examples above produce the same traceback.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-18-7c632fc48944> in <module>
      1 t = Text(str_02, tok_params={'profile': 'POS'})
----> 2 tokens = t.tokenize_words_raw_text

~/ML/env/lib/python3.6/site-packages/pybo/text/text.py in tokenize_words_raw_text(self)
     95     def tokenize_words_raw_text(self):
     96         config = {'profile': 'GMD'}
---> 97         return self.__process('basic_cleanup', 'word_tok', 'words_raw_text', 'plaintext', tok_params=config)
     98 
     99     @property

~/ML/env/lib/python3.6/site-packages/pybo/text/text.py in __process(self, preprocessor, tokenizer, modifier, formatter, tok_params)
    129             return pipeline.pipe_file(self.input, self.out_file)
    130         else:
--> 131             return pipeline.pipe_str(self.input)
    132 
    133     @staticmethod

~/ML/env/lib/python3.6/site-packages/pybo/text/pipelinebase.py in pipe_str(self, text)
     36                                                self.tok_params['profile'],
     37                                                modifs=modifs,
---> 38                                                mode=mode)
     39         else:
     40             elts = self.pipes['tok'][self.tok](text)

~/ML/env/lib/python3.6/site-packages/pybo/text/tokenize.py in word_tok(text, profile, modifs, mode)
     19 def word_tok(text: str, profile, modifs=None, mode='internal') -> List[PyboToken]:
     20     tok = get_wordtokenizer(profile, modifs, mode)
---> 21     return tok.tokenize(text)
     22 
     23 

~/ML/env/lib/python3.6/site-packages/pybo/tokenizers/wordtokenizer.py in tokenize(self, string, split_affixes, debug)
     53         MergeDagdra().merge(tokens)
     54 
---> 55         self._get_default_lemma(tokens)
     56         return tokens
     57 

~/ML/env/lib/python3.6/site-packages/pybo/tokenizers/wordtokenizer.py in _get_default_lemma(token_list)
     67                 if t.affix and not t.affix_host:
     68                     part = ''.join([''.join(syl) for syl in t.syls])
---> 69                     t.lemma = part_lemmas[part] + TSEK
     70                 elif not t.affix and t.affix_host:
     71                     t.lemma = t.text_unaffixed + AA + TSEK if t.affixation['aa'] else t.text_unaffixed + TSEK

KeyError: 'འིའོ'
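The traceback points at an unguarded dictionary lookup for the affixed part. A defensive lookup would at least avoid the crash. The sketch below is not pybo's code: the lemma table and the function name are hypothetical, and falling back to the surface form is only a stopgap, not the correct lemma for stacked particles:

```python
TSEK = "་"

# Hypothetical lemma table; the real part_lemmas in pybo may differ.
PART_LEMMAS = {"འི": "གི", "འོ": "འོ"}


def default_affix_lemma(part, part_lemmas=PART_LEMMAS):
    """Return the lemma of an affixed particle, falling back to its surface form."""
    return part_lemmas.get(part, part) + TSEK


known = default_affix_lemma("འི")      # found in the table
stacked = default_affix_lemma("འིའོ")  # not in the table, no KeyError
```

A proper fix would probably need to recognize that the failing part is two particles stacked together.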

word2vec implementation in Tibetan

Carving out this part from the discussion in #6.

I've got gensim word2vec built and working to some extent on a small pybo-tokenized corpus. I had not used gensim before, so there are several things I still need to get my head around, but things look promising and I'm getting similarities out, some of which may even make sense. Next I need to try training on a much bigger dataset with more epochs and see what happens. gensim can stream documents from disk for the corpus, so I might work towards training on the whole of the Rinchen Terdzo, which I have on my local machine.

Using the model that comes out from gensim, I also tested 'Annoy' nearest neighbor approximation which seems very promising. There is also LDA available out-of-the-box in gensim, so will look into that as well.
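For the streaming part, gensim's Word2Vec accepts any restartable iterable of token lists as its sentences argument. A minimal sketch of such an iterable follows, assuming the corpus has already been tokenized into text files with space-separated tokens and one sentence per line. The class name and file layout are assumptions, not part of pybo or gensim:

```python
from pathlib import Path


class TokenizedCorpus:
    """Restartable iterable that yields one list of tokens per non-empty line."""

    def __init__(self, root, pattern="*.txt"):
        self.root = Path(root)
        self.pattern = pattern

    def __iter__(self):
        # Files are read lazily, so the whole corpus never sits in memory.
        for path in sorted(self.root.rglob(self.pattern)):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:
                        yield tokens
```

Because `__iter__` reopens the files each time, the corpus can be traversed once per training epoch.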

syllable boundary bug

from pybo import BoTokenizer

# Tokenize with the POS profile and print each token with its tag
tok = BoTokenizer('POS')
tokens = tok.tokenize('སྟབས་ཡག དེ་ལྟར་བྱོས་ཤིག')
tagged = ['"{}"/{}'.format(w.content, w.pos) for w in tokens]
print(', '.join(tagged))

outputs:

"སྟབས་"/NOUN, "ཡག དེ་"/non-word, "ལྟར་"/OTHER, "བྱོས་"/VERB, "ཤིག"/PART

The problem is that in this case, we want the space to be recognized as a word boundary since here, the expected shad is inhibited by ག.
An additional test should be added to match ཀ/ག/ཤ + space to be a valid syllable boundary.

Edit: ཀ ཁྱིམ་ and ཤ ཞེས་ show similarly unexpected behaviour. Both of these strings were detected as non-words by pybo in Tengyur texts.
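The proposed test could look roughly like the following, written as a standalone pre-splitting step rather than as a change to pybo's chunking code. The pattern and function name are illustrative, and the sketch only covers a bare ཀ, ག or ཤ directly followed by spaces:

```python
import re

# Spaces directly preceded by ཀ, ག or ཤ stand in for the omitted shad.
INHIBITED_SHAD = re.compile(r"(?<=[ཀགཤ]) +")


def split_on_inhibited_shad(text):
    """Split text where a space replaces a shad after ཀ, ག or ཤ."""
    return [chunk for chunk in INHIBITED_SHAD.split(text) if chunk]


chunks = split_on_inhibited_shad("སྟབས་ཡག དེ་ལྟར་བྱོས་ཤིག")
```

Inside the tokenizer, the same condition would mark the space as a syllable boundary instead of splitting the string.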

multi-threading

Currently we're running everything on a single thread. I wonder if there is a straightforward way to provide a wrapper that allows multi-threading (or even distributing) tokenization.
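One possible shape for such a wrapper, using the standard library's concurrent.futures. The `tokenize_parallel` name and the `tokenize` callable are placeholders; any function that maps a string to tokens would fit, and `str.split` stands in for the real tokenizer here:

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_parallel(texts, tokenize, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Apply `tokenize` to each text concurrently, preserving input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(tokenize, texts))


docs = ["a b c", "d e"]
results = tokenize_parallel(docs, str.split)
```

Threads are unlikely to speed up CPU-bound pure-Python tokenization because of the GIL. Passing `ProcessPoolExecutor` as `executor_cls` is the usual alternative, provided the `tokenize` callable can be pickled and each worker can afford to load its own trie.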

Trie's handling of word lists that contain both པར་(photo) and པར་(particle)

པར་ meaning "photo", with a noun POS tag, and པར་ with a particle POS tag and an affixed particle occupy the same entry in the trie. When a word list contains both the noun and the particle, the trie only holds the data of the last one entered.

– Change Token objects so that each sense's data is stored as one value in a dict keyed by an int (0, 1, ...) that ranks the senses: sense 0 is preferred over sense 1, and so on.

– Later, add information such as a probability calculated from context to decide which sense is correct.

– This would be calculated from the list of tokens, where each token carries all the possible senses of its entry. Here, a sense means all the information pertaining to an entry (affixation, pos, freq, etc.).
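A small sketch of the proposed structure, with a plain dict standing in for the data held at a trie leaf. The class name, method names and the sense fields are illustrative only:

```python
class SenseStore:
    """Keeps every sense of a form instead of overwriting the previous one."""

    def __init__(self):
        self._entries = {}

    def add(self, form, **data):
        """Register a new sense; senses are ranked by insertion order."""
        senses = self._entries.setdefault(form, {})
        senses[len(senses)] = data

    def senses(self, form):
        """Return all senses of a form as {rank: data}."""
        return self._entries.get(form, {})

    def preferred(self, form):
        """Return the data of sense 0, or None if the form is unknown."""
        return self.senses(form).get(0)


store = SenseStore()
store.add("པར་", pos="NOUN", sense="photo")
store.add("པར་", pos="PART", affixed=True)
```

A later disambiguation step could reorder or score these ranks from context without changing how entries are stored.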

Add folia output to pybo

https://github.com/proycon/spacy2folia/tree/master/spacy2folia