
JapaneseTokenizers's People

Contributors

chezou, kensuke-mitsuzawa, yusukefs


JapaneseTokenizers's Issues

MacOS support?

It seems this package lacks macOS support?

I installed with

pip install JapaneseTokenizer
make install
make install_neologd

During make install I received the following error:

install_tokenizers.sh: line 89: ldconfig: command not found

And during make install_neologd I got:

[install-mecab-ipadic-NEologd] :     unxz is not found.
make: *** [install_neologd] Error 1

And while trying to run the example starter code, I got

[Y/12/06 15:03:43]ERROR - mecab_wrapper.py#__CallMecab:137: ('',)
[Y/12/06 15:03:43]ERROR - mecab_wrapper.py#__CallMecab:138: Possibly Path to userdict is invalid. Check the path
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
  File "/Users/rpryzant/kana/venv/lib/python2.7/site-packages/JapaneseTokenizer/mecab_wrapper/mecab_wrapper.py", line 45, in __init__
    self.mecabObj = self.__CallMecab()
  File "/Users/rpryzant/kana/venv/lib/python2.7/site-packages/JapaneseTokenizer/mecab_wrapper/mecab_wrapper.py", line 139, in __CallMecab
    raise subprocess.CalledProcessError(returncode=-1, cmd="Failed to initialize Mecab object")
subprocess.CalledProcessError: Command 'Failed to initialize Mecab object' returned non-zero exit status -1

I am running macOS 10.13.1

Make Janome tokenizer the standard POS tagger

Summary

MeCab has been the standard POS tagger for a long time, but it requires a lot of work to install.

So, instead of MeCab, the Janome tagger is a good choice as the standard.

The MeCab tagger will become a 'plugin'-style tagger.

Same module for Python 2/Python 3

Background

It is costly to maintain separate Python 2 and Python 3 files.

Solution

Mecab -> use a different Python package depending on the Python version
juman & jumanpp & kytea -> put both Python versions into the same file
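The MeCab case above amounts to choosing a binding at import time based on the interpreter version. A minimal sketch of the idea (the package names shown are illustrative of the approach, not necessarily what this project ships):

```python
import sys

# Choose the MeCab binding depending on the interpreter version.
# (A sketch of the idea; the actual packages this project uses may differ.)
if sys.version_info[0] >= 3:
    MECAB_PACKAGE = 'mecab-python3'
else:
    MECAB_PACKAGE = 'mecab-python'

print(MECAB_PACKAGE)
```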

pyknp returns an error in the Travis build environment

    result = self.juman.analysis(input_str)
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 128, in analysis
    return self.juman(input_str)
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 121, in juman
    result = MList(self.juman_lines(input_str))
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 116, in juman_lines
    return self.socket.query(input_str, pattern=self.pattern)
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 41, in query
    return recv.strip().decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

Segmentation

Hi, how do we segment sentences from a paragraph of Japanese text?
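This package tokenizes words rather than splitting sentences, but a simple rule-based splitter on Japanese sentence-ending punctuation can be written with the standard library alone. A hedged sketch (not part of this package's API; the function name is made up for illustration):

```python
import re

def split_sentences(paragraph):
    """Split a Japanese paragraph on 。！？ while keeping the punctuation."""
    sentences = re.findall(r'[^。！？]+[。！？]?', paragraph)
    return [s for s in sentences if s.strip()]

print(split_sentences('今日は晴れです。明日は雨でしょう。'))
# -> ['今日は晴れです。', '明日は雨でしょう。']
```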

User-unfriendly error message about neologdn.

This message is too ambiguous.

Exception: You could not call neologd dictionary bacause you do NOT install the package neologdn.

should be

Exception: You could not call neologd dictionary bacause you do NOT install the package neologdn. run pip install neologdn
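One way to implement the suggestion is to catch the ImportError and re-raise with the install command included. A sketch of the pattern (the helper name is illustrative, not this package's code):

```python
def load_neologdn():
    """Import neologdn, or raise a message that tells the user the fix."""
    try:
        import neologdn
        return neologdn
    except ImportError:
        raise ImportError(
            'The neologd dictionary cannot be used because the package '
            'neologdn is not installed. Run: pip install neologdn')
```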

Import error of pyknp

Traceback (most recent call last):
  File "generate_theme1.py", line 3, in <module>
    from JapaneseTokenizer import MecabWrapper
  File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/__init__.py", line 2, in <module>
    from JapaneseTokenizer.juman_wrapper import JumanWrapper
  File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/juman_wrapper/__init__.py", line 2, in <module>
    from .juman_wrapper import JumanWrapper
  File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/juman_wrapper/juman_wrapper.py", line 11, in <module>
    from pyknp import MList
ImportError: No module named 'pyknp'

Fail in installing pyknp

Installed /opt/conda/lib/python3.5/site-packages/kytea-0.1.3-py3.5-linux-x86_64.egg
Searching for pyknp
Reading https://pypi.python.org/simple/pyknp/
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
Couldn't find index page for 'pyknp' (maybe misspelled?)
No local packages or working download links found for pyknp
error: Could not find suitable distribution for Requirement.parse('pyknp')

Stopword filtering does NOT work when the stopword is a hankaku (half-width) word

In [4]: input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
In [17]: mecab_wrapper.tokenize(input_sentence).filter(stopwords=['SMAP']).convert_list_object()
Out[17]: 
['1',
 '0',
 '日',
 '放送',
 'の',
 '「',
 '中居',
 '正広',
 'の',
 'ミ',
 'に',
 'なる',
 '図書館',
 '」',
 '(',
 'テレビ朝日',
 '系',
 ')',
 'で',
 '、',
 'SMAP',
 'の',
 '中居',
 '正広',
 'が',
 '、',
 '篠原',
 '信一',
 'の',
 '過去',
 'の',
 '勘違い',
 'を',
 '明かす',
 '一幕',
 'が',
 'ある',
 'た',
 '。']

'SMAP' still appears in the tokenized output.
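A likely cause: the tokenizer normalizes input to full-width (zenkaku) characters, so the token becomes 'ＳＭＡＰ' while the stopword list holds half-width 'SMAP', and an exact comparison fails. Normalizing both sides before comparing, e.g. with NFKC, avoids the mismatch. A sketch, not this package's actual implementation:

```python
import unicodedata

def filter_stopwords(tokens, stopwords):
    """Drop tokens whose NFKC-normalized form matches a stopword."""
    normalized_stops = {unicodedata.normalize('NFKC', w) for w in stopwords}
    return [t for t in tokens
            if unicodedata.normalize('NFKC', t) not in normalized_stops]

# Full-width 'ＳＭＡＰ' is filtered even though the stopword is half-width.
print(filter_stopwords(['ＳＭＡＰ', 'の', '中居'], ['SMAP']))
# -> ['の', '中居']
```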

Error installing neologdn

Best match: neologdn 0.2.1
Processing neologdn-0.2.1.tar.gz
Writing /tmp/easy_install-mdYouk/neologdn-0.2.1/setup.cfg
Running neologdn-0.2.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-mdYouk/neologdn-0.2.1/egg-dist-tmp-9BvPzi
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
cc1plus: error: unrecognized command line option ‘-std=c++11’
error: Setup script exited with error: command 'gcc' failed with exit status 1

Error during install on MacOS Mojave

I encounter the error when I try to install the package on MacOS Mojave.

    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include -arch x86_64 -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include -arch x86_64 -I/usr/local/Cellar/mecab/0.996/include -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include/python3.7m -c MeCab_wrap.cpp -o build/temp.macosx-10.7-x86_64-3.7/MeCab_wrap.o
    warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
    MeCab_wrap.cpp:3051:10: fatal error: 'stdexcept' file not found
    #include <stdexcept>
             ^~~~~~~~~~~
    1 warning and 1 error generated.
    error: command 'gcc' failed with exit status 1

Solution

The main reason for the error is that the C/C++ compiler is too old.

Updating the C/C++ compiler fixes it.

  1. Install the latest compiler with brew install gcc
  2. Point new symbolic links at the latest gcc compiler: ln -s /usr/local/bin/gcc-8 /usr/local/bin/gcc and ln -s /usr/local/bin/g++-8 /usr/local/bin/g++
  3. Put this line in your shell profile file (in my case ~/.bash_profile): export PATH=$PATH:/usr/local/bin
  4. Refresh your terminal, e.g. source ~/.bash_profile
  5. Try to install the package again

Any problem with gcc in Travis?

MeCab_wrap.cxx:8434:80: error: ‘MECAB_ONE_BEST’ was not declared in this scope
MeCab_wrap.cxx:8435:77: error: ‘MECAB_NBEST’ was not declared in this scope
MeCab_wrap.cxx:8436:79: error: ‘MECAB_PARTIAL’ was not declared in this scope
MeCab_wrap.cxx:8437:85: error: ‘MECAB_MARGINAL_PROB’ was not declared in this scope
MeCab_wrap.cxx:8438:83: error: ‘MECAB_ALTERNATIVE’ was not declared in this scope
MeCab_wrap.cxx:8439:82: error: ‘MECAB_ALL_MORPHS’ was not declared in this scope
MeCab_wrap.cxx:8440:89: error: ‘MECAB_ALLOCATE_SENTENCE’ was not declared in this scope
MeCab_wrap.cxx:8441:84: error: ‘MECAB_ANY_BOUNDARY’ was not declared in this scope
MeCab_wrap.cxx:8442:86: error: ‘MECAB_TOKEN_BOUNDARY’ was not declared in this scope
MeCab_wrap.cxx:8443:84: error: ‘MECAB_INSIDE_TOKEN’ was not declared in this scope
error: Setup script exited with error: command 'gcc' failed with exit status 1

Jumanpp goes down sometimes.

Cause

Unknown. This is mainly because the jumanpp server script is not stable.

Solution

Bundle a jumanpp server in this package.

if timeout:
    try-start-jumanpp-server
else:
    exception
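The restart-on-timeout idea above can be sketched as a small retry helper. Names here are illustrative: in the real handler, `restart` would relaunch the jumanpp server process rather than a plain callable:

```python
def call_with_restart(call, restart, retries=2):
    """Call `call()`; on timeout, run `restart()` and try again.

    After `retries` failed attempts the last timeout is re-raised.
    """
    last_error = None
    for _ in range(retries):
        try:
            return call()
        except TimeoutError as e:
            last_error = e
            restart()  # e.g. try to start the jumanpp server again
    raise last_error
```

Wrapping the socket query like this means one unstable server round-trip does not kill a whole batch job.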

Unix process handler raises an exception when given a lot of text data

https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers/blob/master/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper.py#L196-L200

This warning message appears twice:

[Y/09/29 16:54:36]WARNING - jumanpp_wrapper.py#call_juman_interface:197: Re-starting unix process because it takes longer time than 30 seconds...
[Y/09/29 16:55:06]WARNING - jumanpp_wrapper.py#call_juman_interface:197: Re-starting unix process because it takes longer time than 30 seconds...

It seems that the final exception is this:

Traceback (most recent call last):
  File "/share/data/home/kensuke_mitsuzawa/fuman-ds-py-academic-service/conda-env/lib/python3.5/site-packages/pexpect-4.2.1-py3.5.egg/pexpect/spawnbase.py", line 150, in read_nonblocking
    s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error

De-normalize after tokenization

Issue

String normalization runs for juman & jumanpp: all カタカナ are converted into 全角カタカナ, and all numeric expressions into 全角数字.

However, 全角カタカナ & 全角数字 are not the normal way to write Japanese text.

Solution

全角カタカナ -> 半角カタカナ
全角数字 -> 半角数字

after tokenization
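For the numeric part of the proposal, full-width digits can be mapped back to half-width with a translation table from the standard library. A sketch (half-width katakana conversion needs a much larger table or a helper library, so only the digit case is shown):

```python
# Map full-width digits (U+FF10–U+FF19) back to ASCII digits.
ZEN_TO_HAN_DIGITS = str.maketrans('０１２３４５６７８９', '0123456789')

def denormalize_digits(text):
    return text.translate(ZEN_TO_HAN_DIGITS)

print(denormalize_digits('１０日放送'))  # -> '10日放送'
```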

Bottlenecks in using Jumanpp

  1. It tends to raise an exception while processing the first request.
  2. It tends to raise an exception when we try to process a huge amount of text.

Improvement

1

It's better to run a command that processes dummy text right after the package initializes a jumanpp process.

2

It's better to put an automated restart procedure in the jumanpp process handler.
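The first improvement can be sketched as a warm-up call in the wrapper's constructor. All names here are illustrative, not this package's real classes:

```python
class TokenizerWrapper(object):
    """Sketch: send one dummy request right after starting the process."""

    DUMMY_TEXT = 'テスト'

    def __init__(self, process):
        self.process = process
        # Warm-up: the first request to jumanpp tends to fail, so burn
        # one dummy request during initialization.
        try:
            self.process(self.DUMMY_TEXT)
        except Exception:
            pass  # a warm-up failure should not break construction

    def tokenize(self, text):
        return self.process(text)
```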

Error when loading MecabWrapper

Hi Kensuke Mitsuzawa,

I encountered some errors when using JapaneseTokenizer. I installed the package with both make commands.

Here is the error when I run code:
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')

(screenshot of the error)

Please let me know where I went wrong. Thank you very much.

A bug when port is a str for the jumanpp server

Traceback (most recent call last):
  File "/Users/kensuke-mi/Desktop/analysis_work/fuman-ds-py-fuman2vector/job_scripts/train_word2vec_jumanpp.py", line 55, in <module>
    port=config_obj.get('Tokenizer', 'jumanpp_port'))
  File "/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper_python3.py", line 73, in __init__
    self.jumanpp_obj = JumanppClient(hostname=server, port=port, timeout=timeout)
  File "/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper_python3.py", line 30, in __init__
    self.sock.connect((hostname, port))
TypeError: an integer is required (got type str)
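The fix is to coerce the port read from the config file to int before handing it to socket.connect. A sketch of the pattern (the helper name is made up for illustration):

```python
import socket

def make_connection(hostname, port, timeout=30):
    """Accept the port as str or int; socket.connect needs an int."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    sock.connect((hostname, int(port)))  # int() fixes the str-port bug
    return sock
```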

Failed to install the package because of a dependent package

Issue

The setup fails because of a compile error in neologdn.

    Complete output from command /Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/sp/z0_0lktj7nn2s31db2dt5md40000gq/T/pip-build-_kqnspts/neologdn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/sp/z0_0lktj7nn2s31db2dt5md40000gq/T/pip-hylasjm2-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'neologdn' extension
    creating build
    creating build/temp.macosx-10.6-x86_64-3.5
    /usr/bin/clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/include -arch x86_64 -I/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/include/python3.5m -c neologdn.cpp -o build/temp.macosx-10.6-x86_64-3.5/neologdn.o -std=c++11
    neologdn.cpp:255:10: fatal error: 'unordered_map' file not found
    #include <unordered_map>
             ^
    1 error generated.
    error: command '/usr/bin/clang' failed with exit status 1

Solution

Try to avoid installing neologdn when a compile error occurs.

A bug in filtering words by P.O.S.

Error case

Word filtering by P.O.S. does NOT work under a specific P.O.S. condition.

The case is pos_condition = ('名詞', '一般', ) versus a word whose P.O.S. is ('名詞', '非自立', '一般').
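One plausible matching rule is prefix matching: a condition matches a word when the condition tuple is a prefix of the word's P.O.S. tuple. Under that rule ('名詞', '一般') does not match ('名詞', '非自立', '一般'), which is a likely source of the confusion. A sketch of that semantics, not this package's actual implementation:

```python
def pos_matches(condition, word_pos):
    """True when `condition` is a prefix of the word's P.O.S. tuple."""
    return word_pos[:len(condition)] == condition

print(pos_matches(('名詞', '一般'), ('名詞', '一般', '*')))       # True
print(pos_matches(('名詞', '一般'), ('名詞', '非自立', '一般')))  # False
```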

Cannot install because of a dependency problem

MacBook-Pro% pip install JapaneseTokenizer
Requirement already satisfied (use --upgrade to upgrade): JapaneseTokenizer in /Users/kensuke-mi/Desktop/analysis_work/python_morphology_splitters
Requirement already satisfied (use --upgrade to upgrade): future in /Users/kensuke-mi/.pyenv/versions/3.5.1/lib/python3.5/site-packages/future-0.15.2-py3.5.egg (from JapaneseTokenizer)
Requirement already satisfied (use --upgrade to upgrade): six in /Users/kensuke-mi/.pyenv/versions/3.5.1/lib/python3.5/site-packages (from JapaneseTokenizer)
Collecting mecab-python (from JapaneseTokenizer)
  Downloading mecab-python-0.996.tar.gz (40kB)
    100% |████████████████████████████████| 40kB 6.4MB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python/setup.py", line 18, in <module>
        include_dirs=cmd2("mecab-config --inc-dir"),
      File "/private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python/setup.py", line 10, in cmd2
        return string.split (cmd1(str))
    AttributeError: module 'string' has no attribute 'split'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python
You are using pip version 7.1.2, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

mecab-config: command not found error while installing on Mac

Hi,

Thank you for making this package available. When I try to install, I get the following error. Please let me know how I can resolve it.

Collecting mecab-python3
Using cached https://files.pythonhosted.org/packages/ac/48/295efe525df40cbc2173748eb869290e81a57e835bc41f6d3834fc5dad5f/mecab-python3-0.996.1.tar.gz
Complete output from command python setup.py egg_info:
/bin/sh: mecab-config: command not found
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/rd/qqy1bpm93qj1qcmrj8624qz91pyzr5/T/pip-build-pqo_16ve/mecab-python3/setup.py", line 29, in
inc_dir = mecab_config("--inc-dir")
File "/private/var/folders/rd/qqy1bpm93qj1qcmrj8624qz91pyzr5/T/pip-build-pqo_16ve/mecab-python3/setup.py", line 27, in mecab_config
return os.popen("mecab-config " + arg).readlines()[0].split()
IndexError: list index out of range

Thank You
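The error means the mecab binary itself is not installed, so its mecab-config helper is missing from PATH and the Python binding's setup.py cannot locate the headers. On macOS, installing MeCab and a dictionary first usually resolves it. A hedged sketch using Homebrew:

```shell
# Install MeCab and the IPA dictionary before the Python binding.
brew install mecab mecab-ipadic

# Verify that mecab-config is now on PATH.
mecab-config --version

# Then retry the binding.
pip install mecab-python3
```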

Add tokenizer Sudachi

Background

The Works Applications team released their own morphological analyzer called "Sudachi".
Sudachi has quite useful features for business users.
It would be convenient if we could call it from this package.

Design

They released a Python implementation of Sudachi.

It's easy if we call this package. The main drawback is that SudachiPy does not work in Python 2.x.

Deployment

SudachiPy requires its dictionary file to be deployed manually.

We would like to make this automatic somehow.

Travis could not install boost / jumanpp

Problem

The Travis environment could not install the Boost library correctly. That causes the installation of Jumanpp to fail, and the test cases fail.
The error log is:

checking for boostlib >= 1.57... configure: We could not detect the boost libraries (version 1.57 or higher). If you have a staged boost library (still not installed) please specify $BOOST_ROOT in your environment and do not give a PATH to --with-boost option.  If you are sure you have boost installed, then check your version number looking in <boost/version.hpp>. See http://randspringer.de/boost for more documentation.
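On a Travis Ubuntu image, Boost can usually be installed via apt before building Jumanpp. A hedged sketch of the install step; note the requirement is Boost >= 1.57, so the distribution package on older images may still be too old, in which case $BOOST_ROOT must point at a newer build as the message suggests:

```shell
# .travis.yml-style install step (sketch)
sudo apt-get update
sudo apt-get install -y libboost-all-dev
```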
