
JapaneseTokenizers's People

Contributors

chezou, kensuke-mitsuzawa, yusukefs


JapaneseTokenizers's Issues

MacOS support?

It seems this package lacks macOS support?

I installed with

pip install JapaneseTokenizer
make install
make install_neologd

During make install I received the following error:

install_tokenizers.sh: line 89: ldconfig: command not found

And during make install_neologd I got:

[install-mecab-ipadic-NEologd] :     unxz is not found.
make: *** [install_neologd] Error 1

And while trying to run the example starter code, I got

[Y/12/06 15:03:43]ERROR - mecab_wrapper.py#__CallMecab:137: ('',)
[Y/12/06 15:03:43]ERROR - mecab_wrapper.py#__CallMecab:138: Possibly Path to userdict is invalid. Check the path
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
  File "/Users/rpryzant/kana/venv/lib/python2.7/site-packages/JapaneseTokenizer/mecab_wrapper/mecab_wrapper.py", line 45, in __init__
    self.mecabObj = self.__CallMecab()
  File "/Users/rpryzant/kana/venv/lib/python2.7/site-packages/JapaneseTokenizer/mecab_wrapper/mecab_wrapper.py", line 139, in __CallMecab
    raise subprocess.CalledProcessError(returncode=-1, cmd="Failed to initialize Mecab object")
subprocess.CalledProcessError: Command 'Failed to initialize Mecab object' returned non-zero exit status -1

I am running macOS 10.13.1

Make Janome tokenizer the standard POS tagger

Summary

MeCab has been the standard POS tagger for a long time, but it requires a lot of work to install.

So, instead of MeCab, the Janome tagger is a good choice as the standard.

The MeCab tagger will become a 'plugin'-style tagger.

Same module for Python 2/Python 3

Background

It is costly to maintain separate Python 2 and Python 3 files.

Solution

Mecab -> use a different Python package depending on the Python version
juman & jumanpp & kytea -> put both Python versions into the same file
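The MeCab case above amounts to choosing a binding at import time based on the interpreter version. A minimal sketch of the idea (the package names shown are illustrative of the approach, not necessarily what this project ships):

```python
import sys

# Choose the MeCab binding depending on the interpreter version.
# (A sketch of the idea; the actual packages this project uses may differ.)
if sys.version_info[0] >= 3:
    MECAB_PACKAGE = 'mecab-python3'
else:
    MECAB_PACKAGE = 'mecab-python'

print(MECAB_PACKAGE)
```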

pyknp returns an error in the Travis build environment

    result = self.juman.analysis(input_str)
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 128, in analysis
    return self.juman(input_str)
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 121, in juman
    result = MList(self.juman_lines(input_str))
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 116, in juman_lines
    return self.socket.query(input_str, pattern=self.pattern)
  File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 41, in query
    return recv.strip().decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

Segmentation

Hi, how do we segment sentences from a paragraph of Japanese text?
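This package tokenizes words rather than splitting sentences, but a simple rule-based splitter on Japanese sentence-ending punctuation can be written with the standard library alone. A hedged sketch (not part of this package's API; the function name is made up for illustration):

```python
import re

def split_sentences(paragraph):
    """Split a Japanese paragraph on 。！？ while keeping the punctuation."""
    sentences = re.findall(r'[^。！？]+[。！？]?', paragraph)
    return [s for s in sentences if s.strip()]

print(split_sentences('今日は晴れです。明日は雨でしょう。'))
# -> ['今日は晴れです。', '明日は雨でしょう。']
```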

User-unfriendly error message about neologdn.

This message is too ambiguous.

Exception: You could not call neologd dictionary bacause you do NOT install the package neologdn.

should be

Exception: You could not call neologd dictionary bacause you do NOT install the package neologdn. run pip install neologdn
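One way to implement the suggestion is to catch the ImportError and re-raise with the install command included. A sketch of the pattern (the helper name is illustrative, not this package's code):

```python
def load_neologdn():
    """Import neologdn, or raise a message that tells the user the fix."""
    try:
        import neologdn
        return neologdn
    except ImportError:
        raise ImportError(
            'The neologd dictionary cannot be used because the package '
            'neologdn is not installed. Run: pip install neologdn')
```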

Import error of pyknp

Traceback (most recent call last):
  File "generate_theme1.py", line 3, in <module>
    from JapaneseTokenizer import MecabWrapper
  File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/__init__.py", line 2, in <module>
    from JapaneseTokenizer.juman_wrapper import JumanWrapper
  File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/juman_wrapper/__init__.py", line 2, in <module>
    from .juman_wrapper import JumanWrapper
  File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/juman_wrapper/juman_wrapper.py", line 11, in <module>
    from pyknp import MList
ImportError: No module named 'pyknp'

Fail in installing pyknp

Installed /opt/conda/lib/python3.5/site-packages/kytea-0.1.3-py3.5-linux-x86_64.egg
Searching for pyknp
Reading https://pypi.python.org/simple/pyknp/
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
Couldn't find index page for 'pyknp' (maybe misspelled?)
No local packages or working download links found for pyknp
error: Could not find suitable distribution for Requirement.parse('pyknp')

Stopword filtering does NOT work when the stopword is a hankaku (half-width) word

In [4]: input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
In [17]: mecab_wrapper.tokenize(input_sentence).filter(stopwords=['SMAP']).convert_list_object()
Out[17]: 
['1',
 '0',
 '日',
 '放送',
 'の',
 '「',
 '中居',
 '正広',
 'の',
 'ミ',
 'に',
 'なる',
 '図書館',
 '」',
 '(',
 'テレビ朝日',
 '系',
 ')',
 'で',
 '、',
 'SMAP',
 'の',
 '中居',
 '正広',
 'が',
 '、',
 '篠原',
 '信一',
 'の',
 '過去',
 'の',
 '勘違い',
 'を',
 '明かす',
 '一幕',
 'が',
 'ある',
 'た',
 '。']

'SMAP' still appears in the tokenized output.
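A likely cause: the tokenizer normalizes input to full-width (zenkaku) characters, so the token becomes 'ＳＭＡＰ' while the stopword list holds half-width 'SMAP', and an exact comparison fails. Normalizing both sides before comparing, e.g. with NFKC, avoids the mismatch. A sketch, not this package's actual implementation:

```python
import unicodedata

def filter_stopwords(tokens, stopwords):
    """Drop tokens whose NFKC-normalized form matches a stopword."""
    normalized_stops = {unicodedata.normalize('NFKC', w) for w in stopwords}
    return [t for t in tokens
            if unicodedata.normalize('NFKC', t) not in normalized_stops]

# Full-width 'ＳＭＡＰ' is filtered even though the stopword is half-width.
print(filter_stopwords(['ＳＭＡＰ', 'の', '中居'], ['SMAP']))
# -> ['の', '中居']
```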

Error installing neologdn

Best match: neologdn 0.2.1
Processing neologdn-0.2.1.tar.gz
Writing /tmp/easy_install-mdYouk/neologdn-0.2.1/setup.cfg
Running neologdn-0.2.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-mdYouk/neologdn-0.2.1/egg-dist-tmp-9BvPzi
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
cc1plus: error: unrecognized command line option ‘-std=c++11’
error: Setup script exited with error: command 'gcc' failed with exit status 1

Error during install on MacOS Mojave

I encounter the error when I try to install the package on MacOS Mojave.

    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include -arch x86_64 -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include -arch x86_64 -I/usr/local/Cellar/mecab/0.996/include -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include/python3.7m -c MeCab_wrap.cpp -o build/temp.macosx-10.7-x86_64-3.7/MeCab_wrap.o
    warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
    MeCab_wrap.cpp:3051:10: fatal error: 'stdexcept' file not found
    #include <stdexcept>
             ^~~~~~~~~~~
    1 warning and 1 error generated.
    error: command 'gcc' failed with exit status 1

Solution

The main reason for the error is that the C/C++ compiler is too old.

Updating the C/C++ compiler fixes it.

  1. Install the latest compiler with brew install gcc
  2. Point new symbolic links at the latest gcc compiler: ln -s /usr/local/bin/gcc-8 /usr/local/bin/gcc and ln -s /usr/local/bin/g++-8 /usr/local/bin/g++
  3. Put this line in your shell profile file (in my case ~/.bash_profile): export PATH=$PATH:/usr/local/bin
  4. Refresh your terminal, e.g. source ~/.bash_profile
  5. Try to install the package again

Any problem with gcc in Travis?

MeCab_wrap.cxx:8434:80: error: ‘MECAB_ONE_BEST’ was not declared in this scope
MeCab_wrap.cxx:8435:77: error: ‘MECAB_NBEST’ was not declared in this scope
MeCab_wrap.cxx:8436:79: error: ‘MECAB_PARTIAL’ was not declared in this scope
MeCab_wrap.cxx:8437:85: error: ‘MECAB_MARGINAL_PROB’ was not declared in this scope
MeCab_wrap.cxx:8438:83: error: ‘MECAB_ALTERNATIVE’ was not declared in this scope
MeCab_wrap.cxx:8439:82: error: ‘MECAB_ALL_MORPHS’ was not declared in this scope
MeCab_wrap.cxx:8440:89: error: ‘MECAB_ALLOCATE_SENTENCE’ was not declared in this scope
MeCab_wrap.cxx:8441:84: error: ‘MECAB_ANY_BOUNDARY’ was not declared in this scope
MeCab_wrap.cxx:8442:86: error: ‘MECAB_TOKEN_BOUNDARY’ was not declared in this scope
MeCab_wrap.cxx:8443:84: error: ‘MECAB_INSIDE_TOKEN’ was not declared in this scope
error: Setup script exited with error: command 'gcc' failed with exit status 1

Jumanpp goes down sometimes.

Cause

Unknown. This is mainly because the jumanpp server script is not stable.

Solution

Bundle a jumanpp server in this package.

if timeout:
    try-start-jumanpp-server
else:
    exception
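The restart-on-timeout idea above can be sketched as a small retry helper. Names here are illustrative: in the real handler, `restart` would relaunch the jumanpp server process rather than a plain callable:

```python
def call_with_restart(call, restart, retries=2):
    """Call `call()`; on timeout, run `restart()` and try again.

    After `retries` failed attempts the last timeout is re-raised.
    """
    last_error = None
    for _ in range(retries):
        try:
            return call()
        except TimeoutError as e:
            last_error = e
            restart()  # e.g. try to start the jumanpp server again
    raise last_error
```

Wrapping the socket query like this means one unstable server round-trip does not kill a whole batch job.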

Unix process handler raises an exception when given a lot of text data

https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers/blob/master/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper.py#L196-L200

This warning message appears twice:

[Y/09/29 16:54:36]WARNING - jumanpp_wrapper.py#call_juman_interface:197: Re-starting unix process because it takes longer time than 30 seconds...
[Y/09/29 16:55:06]WARNING - jumanpp_wrapper.py#call_juman_interface:197: Re-starting unix process because it takes longer time than 30 seconds...

It seems that the final exception is this:

Traceback (most recent call last):
  File "/share/data/home/kensuke_mitsuzawa/fuman-ds-py-academic-service/conda-env/lib/python3.5/site-packages/pexpect-4.2.1-py3.5.egg/pexpect/spawnbase.py", line 150, in read_nonblocking
    s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error

De-normalize after tokenization

Issue

String normalization runs for juman & jumanpp: all カタカナ are converted into 全角カタカナ, and all numeric expressions into 全角数字.

However, 全角カタカナ & 全角数字 are not the normal way to write Japanese text.

Solution

全角カタカナ -> 半角カタカナ
全角数字 -> 半角数字

after tokenization
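For the numeric part of the proposal, full-width digits can be mapped back to half-width with a translation table from the standard library. A sketch (half-width katakana conversion needs a much larger table or a helper library, so only the digit case is shown):

```python
# Map full-width digits (U+FF10–U+FF19) back to ASCII digits.
ZEN_TO_HAN_DIGITS = str.maketrans('０１２３４５６７８９', '0123456789')

def denormalize_digits(text):
    return text.translate(ZEN_TO_HAN_DIGITS)

print(denormalize_digits('１０日放送'))  # -> '10日放送'
```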

Bottlenecks in using Jumanpp

  1. It tends to raise an exception while processing the first request.
  2. It tends to raise an exception when we try to process a huge amount of text.

Improvement

1

It's better to run a command that processes dummy text right after the package initializes a jumanpp process.

2

It's better to put an automated restart procedure in the jumanpp process handler.
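The first improvement can be sketched as a warm-up call in the wrapper's constructor. All names here are illustrative, not this package's real classes:

```python
class TokenizerWrapper(object):
    """Sketch: send one dummy request right after starting the process."""

    DUMMY_TEXT = 'テスト'

    def __init__(self, process):
        self.process = process
        # Warm-up: the first request to jumanpp tends to fail, so burn
        # one dummy request during initialization.
        try:
            self.process(self.DUMMY_TEXT)
        except Exception:
            pass  # a warm-up failure should not break construction

    def tokenize(self, text):
        return self.process(text)
```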

Error when loading MecabWrapper

Hi Kensuke Mitsuzawa,

I encountered some errors when using JapaneseTokenizer. I installed the package with both make commands.

Here is the error when I run code:
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')

(screenshot of the error)

Please let me know where I went wrong. Thank you very much.

A bug when port is a str for the jumanpp server

Traceback (most recent call last):
  File "/Users/kensuke-mi/Desktop/analysis_work/fuman-ds-py-fuman2vector/job_scripts/train_word2vec_jumanpp.py", line 55, in <module>
    port=config_obj.get('Tokenizer', 'jumanpp_port'))
  File "/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper_python3.py", line 73, in __init__
    self.jumanpp_obj = JumanppClient(hostname=server, port=port, timeout=timeout)
  File "/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper_python3.py", line 30, in __init__
    self.sock.connect((hostname, port))
TypeError: an integer is required (got type str)
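The fix is to coerce the port read from the config file to int before handing it to socket.connect. A sketch of the pattern (the helper name is made up for illustration):

```python
import socket

def make_connection(hostname, port, timeout=30):
    """Accept the port as str or int; socket.connect needs an int."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    sock.connect((hostname, int(port)))  # int() fixes the str-port bug
    return sock
```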

Failed to install the package because of a dependent package

Issue

The setup fails because of a compile error in neologdn.

    Complete output from command /Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/sp/z0_0lktj7nn2s31db2dt5md40000gq/T/pip-build-_kqnspts/neologdn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/sp/z0_0lktj7nn2s31db2dt5md40000gq/T/pip-hylasjm2-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'neologdn' extension
    creating build
    creating build/temp.macosx-10.6-x86_64-3.5
    /usr/bin/clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/include -arch x86_64 -I/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/include/python3.5m -c neologdn.cpp -o build/temp.macosx-10.6-x86_64-3.5/neologdn.o -std=c++11
    neologdn.cpp:255:10: fatal error: 'unordered_map' file not found
    #include <unordered_map>
             ^
    1 error generated.
    error: command '/usr/bin/clang' failed with exit status 1

Solution

Try to avoid installing neologdn when a compile error occurs.

A bug in filtering words by P.O.S.

Error case

Word filtering by P.O.S. does NOT work under a specific P.O.S. condition.

The case is pos_condition = ('名詞', '一般', ) versus a word whose P.O.S. is ('名詞', '非自立', '一般').
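One plausible matching rule is prefix matching: a condition matches a word when the condition tuple is a prefix of the word's P.O.S. tuple. Under that rule ('名詞', '一般') does not match ('名詞', '非自立', '一般'), which is a likely source of the confusion. A sketch of that semantics, not this package's actual implementation:

```python
def pos_matches(condition, word_pos):
    """True when `condition` is a prefix of the word's P.O.S. tuple."""
    return word_pos[:len(condition)] == condition

print(pos_matches(('名詞', '一般'), ('名詞', '一般', '*')))       # True
print(pos_matches(('名詞', '一般'), ('名詞', '非自立', '一般')))  # False
```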

Cannot install because of a dependency problem

MacBook-Pro% pip install JapaneseTokenizer
Requirement already satisfied (use --upgrade to upgrade): JapaneseTokenizer in /Users/kensuke-mi/Desktop/analysis_work/python_morphology_splitters
Requirement already satisfied (use --upgrade to upgrade): future in /Users/kensuke-mi/.pyenv/versions/3.5.1/lib/python3.5/site-packages/future-0.15.2-py3.5.egg (from JapaneseTokenizer)
Requirement already satisfied (use --upgrade to upgrade): six in /Users/kensuke-mi/.pyenv/versions/3.5.1/lib/python3.5/site-packages (from JapaneseTokenizer)
Collecting mecab-python (from JapaneseTokenizer)
  Downloading mecab-python-0.996.tar.gz (40kB)
    100% |████████████████████████████████| 40kB 6.4MB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python/setup.py", line 18, in <module>
        include_dirs=cmd2("mecab-config --inc-dir"),
      File "/private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python/setup.py", line 10, in cmd2
        return string.split (cmd1(str))
    AttributeError: module 'string' has no attribute 'split'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python
You are using pip version 7.1.2, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

mecab-config: command not found error while installing on Mac

Hi,

Thank you for making this package available. When I try to install, I get the following error. Please let me know how I can resolve it.

Collecting mecab-python3
Using cached https://files.pythonhosted.org/packages/ac/48/295efe525df40cbc2173748eb869290e81a57e835bc41f6d3834fc5dad5f/mecab-python3-0.996.1.tar.gz
Complete output from command python setup.py egg_info:
/bin/sh: mecab-config: command not found
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/rd/qqy1bpm93qj1qcmrj8624qz91pyzr5/T/pip-build-pqo_16ve/mecab-python3/setup.py", line 29, in
inc_dir = mecab_config("--inc-dir")
File "/private/var/folders/rd/qqy1bpm93qj1qcmrj8624qz91pyzr5/T/pip-build-pqo_16ve/mecab-python3/setup.py", line 27, in mecab_config
return os.popen("mecab-config " + arg).readlines()[0].split()
IndexError: list index out of range

Thank You
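The error means the mecab binary itself is not installed, so its mecab-config helper is missing from PATH and the Python binding's setup.py cannot locate the headers. On macOS, installing MeCab and a dictionary first usually resolves it. A hedged sketch using Homebrew:

```shell
# Install MeCab and the IPA dictionary before the Python binding.
brew install mecab mecab-ipadic

# Verify that mecab-config is now on PATH.
mecab-config --version

# Then retry the binding.
pip install mecab-python3
```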

Add tokenizer Sudachi

Background

The Works Applications team released their own morphological analyzer called "Sudachi".
Sudachi has quite useful features for business users.
It would be convenient if we could call it from this package.

Design

They released a Python implementation of Sudachi.

It's easy if we call this package. The main drawback is that SudachiPy does not work in Python 2.x.

Deployment

SudachiPy requires its dictionary file to be deployed manually.

We would like to make this automatic somehow.

Travis could not install boost / jumanpp

Problem

The Travis environment could not install the Boost library correctly. That causes the installation of Jumanpp to fail, and the test cases fail.
The error log is:

checking for boostlib >= 1.57... configure: We could not detect the boost libraries (version 1.57 or higher). If you have a staged boost library (still not installed) please specify $BOOST_ROOT in your environment and do not give a PATH to --with-boost option.  If you are sure you have boost installed, then check your version number looking in <boost/version.hpp>. See http://randspringer.de/boost for more documentation.
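On a Travis Ubuntu image, Boost can usually be installed via apt before building Jumanpp. A hedged sketch of the install step; note the requirement is Boost >= 1.57, so the distribution package on older images may still be too old, in which case $BOOST_ROOT must point at a newer build as the message suggests:

```shell
# .travis.yml-style install step (sketch)
sudo apt-get update
sudo apt-get install -y libboost-all-dev
```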
