
mecab's People

Contributors

humem, kou, massongit, qnighy, rokurosatp, shogo82148, taku910, tetsuok, tianjianjiang


mecab's Issues

OSX Unable to have both libmecab and libMeCab (from mecab-java) due to case insensitive file system

What steps will reproduce the problem?
1. Install mecab
2. Install mecab-java

What is the expected output? What do you see instead?

expect to see libmecab.dylib from mecab
expect to see libMeCab.dylib from mecab-java

instead I see libMeCab.so from mecab-java (but it should be libMeCab.dylib)


What version of the product are you using? On what operating system?

mecab-0.996.tar.gz
mecab-java-0.996.tar.gz
OS X Mavericks 10.9.3

Please provide any additional information below.

You can create a dylib for mecab-java using:

g++ -dynamiclib -undefined suppress -flat_namespace *.o -o libMeCab.dylib


However, because the OS X file system is case-insensitive, it isn't possible to have
both libmecab (from mecab) and libMeCab (from mecab-java) installed side by side.
JNI loading will therefore fail, because the real mecab library and the JNI wrapper
cannot exist at the same time.

Would officially recompiling and renaming the libMeCab.so to libMeCabJNI.dylib 
help?

Original issue reported on code.google.com by [email protected] on 9 Jul 2014 at 2:02

Memory leak when using the python wrapper and the input string is too long

A memory leak occurs when the following conditions are fulfilled:

  • the python wrapper is used (the "-C (allocate sentence)" option is ON)
  • the same lattice instance is reused in every loop iteration
  • the input is over 5534 bytes (as reported by sys.getsizeof)

How to reproduce

  • versions

    • Python 3.5.1
    • mecab of 0.996
  • code

    import MeCab
    import os
    import psutil
    import sys
    pid = os.getpid()
    py = psutil.Process(pid)
    
    
    class CheckMemoryLeak():
        def __init__(self):
            self.lattice = MeCab.Lattice()
    
        def mecab_set_sentence(self, text):
            self.lattice.set_sentence(text)
    
    
    if __name__ == '__main__':
        Mecab = CheckMemoryLeak()
        sentence = 'あ' * 2730
        print('input bytes:', sys.getsizeof(sentence))
        while True:
            Mecab.mecab_set_sentence(sentence)
            memoryUse = py.memory_info()[0]
            print('memory use:', memoryUse)
  • result

    input bytes: 5534
    memory use: 13950976
    ・・・(about 10 times mecab_set_sentence)
    memory use: 14221312
    ・・・(about 10 times mecab_set_sentence)
    memory use: 14491648
    ・・・(after 30 seconds)
    memory use: 2043158528
    

However, in the case of the following code

sentence = 'あ' * 2729
  • result
    input bytes: 5532
    memory use: 13950976
    ・・・(about 10 times mecab_set_sentence)
    memory use: 14155776
    ・・・(after 30 seconds)
    memory use: 14155776
    ・・・(after 10 minutes)
    memory use: 14155776
    

Probable Cause

  • It is not checked that the number of bytes of input_str is less than or equal to BUF_SIZE.
  • A memory leak appears to occur when a string larger than BUF_SIZE is allocated after an area of only BUF_SIZE has been allocated.
  • BUF_SIZE, MIN_INPUT_BUFFER_SIZE, and MAX_INPUT_BUFFER_SIZE cannot be changed via a settings file or command-line options; only input-buffer-size is configurable.

// Allocation code in question: alloc() hands out memory from a free list whose
// chunk size is BUF_SIZE, but the requested size is never checked against BUF_SIZE.
char *alloc(size_t size) {
  if (!char_freelist_.get()) {
    char_freelist_.reset(new ChunkFreeList<char>(BUF_SIZE));
  }
  return char_freelist_->alloc(size + 1);
}

char *strdup(const char *str, size_t size) {
  char *n = alloc(size + 1);
  std::strncpy(n, str, size + 1);
  return n;
}

Temporary solution

  1. Edit BUF_SIZE in mecab/mecab/src/common.h (lines 72 to 74 at 3a07c4e)

  • before

    #define MIN_INPUT_BUFFER_SIZE 8192
    #define MAX_INPUT_BUFFER_SIZE (8192*640)
    #define BUF_SIZE 8192
  • after

    #define MIN_INPUT_BUFFER_SIZE 16384
    #define MAX_INPUT_BUFFER_SIZE (16384*640)
    #define BUF_SIZE 16384
  2. Rebuild & reinstall
make
sudo make install

Proposed solution

The underlying problem is that execution does not stop even when the memory leak occurs.

  • Also warn from the python wrapper when the input string exceeds BUF_SIZE (a caller-side workaround is sketched below).
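
Until such a warning exists, one possible caller-side guard is the minimal sketch below. It is not part of MeCab: safe_set_sentence and MAX_SENTENCE_BYTES are hypothetical names, and the conservative 8000-byte limit is an assumption chosen to stay below the default BUF_SIZE of 8192 with some headroom.

    import MeCab

    # Conservative limit with headroom below the default BUF_SIZE (8192)
    # defined in mecab/src/common.h; the exact safe threshold is an assumption.
    MAX_SENTENCE_BYTES = 8000

    def safe_set_sentence(lattice, text):
        """Refuse input whose UTF-8 byte length could overflow the buffer."""
        encoded = text.encode('utf-8')
        if len(encoded) > MAX_SENTENCE_BYTES:
            raise ValueError('input is %d bytes; split it before set_sentence'
                             % len(encoded))
        lattice.set_sentence(text)

    lattice = MeCab.Lattice()
    safe_set_sentence(lattice, 'あ' * 2729)  # under the limit used in the report above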

Kanji do not appear.

What steps will reproduce the problem?
1. Installed MeCab and Python binding.
2. Tried example code from 
http://mecab.googlecode.com/svn/trunk/mecab/doc/bindings.html
3. See result below.

What is the expected output? What do you see instead?
The basic issue is that no kanji appear.

>>> print m.parse ("今おはよう。")
?   ??????? ?   ̾??-????       
?お?   ?お?   ?お?   ????-????       
??  ??  ??  ̾??-??ͭ̾??-?ȿ?      
??う。    ??う。    ??う。    ????-????       
EOS

What version of the product are you using? On what operating system?
MeCab 0.994 on OS X Mountain Lion with Python 2.7

Please provide any additional information below.
email: [email protected]

Original issue reported on code.google.com by [email protected] on 26 Aug 2012 at 7:09

The adjective conjugated form 「正しく」 is treated as an adverb

When text containing the adjective conjugated form 正しく (ただしく) is morphologically analyzed with ipadic, it is treated as the adverb 正しく (まさしく).

$ mecab -v
mecab of 0.996

$ mecab -D
filename:	/usr/local/lib/mecab/dic/ipadic/sys.dic
version:	102
charset:	utf8
type:	0
size:	392126
left size:	1316
right size:	1316

$ echo "全てのコマンドが正しく動くかを確認するには、以下のコマンドを実行します。" | mecab
全て	名詞,副詞可能,*,*,*,*,全て,スベテ,スベテ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
コマンド	名詞,一般,*,*,*,*,コマンド,コマンド,コマンド
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
正しく	副詞,一般,*,*,*,*,正しく,マサシク,マサシク
動く	動詞,自立,*,*,五段・カ行イ音便,基本形,動く,ウゴク,ウゴク
か	助詞,副助詞/並立助詞/終助詞,*,*,*,*,か,カ,カ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
確認	名詞,サ変接続,*,*,*,*,確認,カクニン,カクニン
する	動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
、	記号,読点,*,*,*,*,、,、,、
以下	名詞,非自立,副詞可能,*,*,*,以下,イカ,イカ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
コマンド	名詞,一般,*,*,*,*,コマンド,コマンド,コマンド
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
実行	名詞,サ変接続,*,*,*,*,実行,ジッコウ,ジッコー
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS

node->feature output format inconsistent

node->feature (comma separated) basically has 9 fields, but I noticed it sometimes only has 7 fields. For example, the token FIFA for the sample sentence below.

I have a couple of questions:

  • With regard to FIFA, can we assume that only the last two fields (読み, 発音) are missing?
  • I basically parse a sentence into nodes without further parsing the node->feature string; is there an easier way to get the 原形 and 読み from a node object? (A defensive-parsing sketch follows after the output below.)

  1. sample sentence
    国際サッカー連盟(FIFA)の幹部が汚職により逮捕、29日にFIFA会長選が予定通り開催
  2. output
    国際 名詞,一般,*,*,*,*,国際,コクサイ,コクサイ
    サッカー 名詞,一般,*,*,*,*,サッカー,サッカー,サッカー
    連盟 名詞,一般,*,*,*,*,連盟,レンメイ,レンメイ
    ( 記号,括弧開,*,*,*,*,(,(,(
    FIFA 名詞,固有名詞,組織,*,*,*,*
    ) 記号,括弧閉,*,*,*,*,),),)
    の 助詞,連体化,*,*,*,*,の,ノ,ノ
    幹部 名詞,一般,*,*,*,*,幹部,カンブ,カンブ
    が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
    汚職 名詞,一般,*,*,*,*,汚職,オショク,オショク
    により 助詞,格助詞,連語,*,*,*,により,ニヨリ,ニヨリ
    逮捕 名詞,サ変接続,*,*,*,*,逮捕,タイホ,タイホ
    、 記号,読点,*,*,*,*,、,、,、
    29 名詞,数,*,*,*,*,*
    日 名詞,接尾,助数詞,*,*,*,日,ニチ,ニチ
    に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
    FIFA 名詞,一般,*,*,*,*,*
    会長 名詞,一般,*,*,*,*,会長,カイチョウ,カイチョー
    選 名詞,接尾,一般,*,*,*,選,セン,セン
    が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
    予定 名詞,サ変接続,*,*,*,*,予定,ヨテイ,ヨテイ
    通り 名詞,接尾,一般,*,*,*,通り,ドオリ,ドーリ
    開催 名詞,サ変接続,*,*,*,*,開催,カイサイ,カイサイ
    EOS
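
Until the cause of the missing fields is settled, defensive parsing on the caller side avoids index errors. The sketch below is not part of the MeCab API: node_features is a hypothetical helper, and it assumes ipadic's nine-field feature layout with 原形 at index 6 and 読み at index 7, padding missing trailing fields with '*'.

    import MeCab

    NUM_FIELDS = 9  # ipadic: 品詞,細分類1,細分類2,細分類3,活用型,活用形,原形,読み,発音

    def node_features(node):
        """Split node.feature and pad missing trailing fields with '*'."""
        fields = node.feature.split(',')
        fields += ['*'] * (NUM_FIELDS - len(fields))
        return fields

    tagger = MeCab.Tagger()
    tagger.parse('')  # known workaround for the first-call parseToNode issue in 0.996
    node = tagger.parseToNode('国際サッカー連盟(FIFA)の幹部が汚職により逮捕')
    while node:
        features = node_features(node)
        print(node.surface, '原形:', features[6], '読み:', features[7])
        node = node.next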

Build fails with LTO

Using the CXXFLAGS: -flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing

/bin/sh ../libtool  --tag=CXX   --mode=link x86_64-pc-linux-gnu-g++  -march=native -fstack-protector-all -O2 -pipe -fdiagnostics-color=always -frecord-gcc-switches -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -fstack-clash-protection -flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing  -Wformat -Werror=format-security -Wall  -no-undefined -version-info 2:0:0 -Wl,-O1 -Wl,--as-needed -flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing -Wl,--defsym=__gentoo_check_ldflags__=0 -o libmecab.la -rpath /usr/lib64 viterbi.lo tagger.lo utils.lo eval.lo iconv_utils.lo dictionary_rewriter.lo dictionary_generator.lo dictionary_compiler.lo context_id.lo connector.lo nbest_generator.lo writer.lo string_buffer.lo param.lo tokenizer.lo char_property.lo dictionary.lo feature_index.lo lbfgs.lo learner_tagger.lo learner.lo libmecab.lo  -lpthread -lpthread  -lstdc++ 
libtool: link: x86_64-pc-linux-gnu-g++  -fPIC -DPIC -shared -nostdlib /usr/lib/gcc/x86_64-pc-linux-gnu/13/../../../../lib64/crti.o /usr/lib/gcc/x86_64-pc-linux-gnu/13/crtbeginS.o  .libs/viterbi.o .libs/tagger.o .libs/utils.o .libs/eval.o .libs/iconv_utils.o .libs/dictionary_rewriter.o .libs/dictionary_generator.o .libs/dictionary_compiler.o .libs/context_id.o .libs/connector.o .libs/nbest_generator.o .libs/writer.o .libs/string_buffer.o .libs/param.o .libs/tokenizer.o .libs/char_property.o .libs/dictionary.o .libs/feature_index.o .libs/lbfgs.o .libs/learner_tagger.o .libs/learner.o .libs/libmecab.o   -Wl,--as-needed -lpthread -L/usr/lib/gcc/x86_64-pc-linux-gnu/13 -L/usr/lib/gcc/x86_64-pc-linux-gnu/13/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-pc-linux-gnu/13/../../../../x86_64-pc-linux-gnu/lib -L/usr/lib/gcc/x86_64-pc-linux-gnu/13/../../.. -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-pc-linux-gnu/13/crtendS.o /usr/lib/gcc/x86_64-pc-linux-gnu/13/../../../../lib64/crtn.o  -march=native -fstack-protector-all -O2 -fdiagnostics-color=always -frecord-gcc-switches -flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing -Werror=format-security -Wl,-O1 -flto=4 -Werror=odr -Werror=lto-type-mismatch -Werror=strict-aliasing -Wl,--defsym=__gentoo_check_ldflags__=0   -Wl,-soname -Wl,libmecab.so.2 -o .libs/libmecab.so.2.0.0
iconv_utils.h:19:7: error: type 'struct Iconv' violates the C++ One Definition Rule [-Werror=odr]
   19 | class Iconv {
      |       ^
iconv_utils.h:19:7: note: a different type is defined in another translation unit
   19 | class Iconv {
      |       ^
iconv_utils.h:22:11: note: the first difference of corresponding definitions is field 'ic_'
   22 |   iconv_t ic_;
      |           ^
iconv_utils.h:24:7: note: a field of same name but different type is defined in another translation unit
   24 |   int ic_;
      |       ^
iconv_utils.h:19:7: note: type 'void *' should match type 'int'
   19 | class Iconv {
      |       ^
iconv_utils.h:36:8: error: type of 'convert' does not match original declaration [-Werror=lto-type-mismatch]
   36 |   bool convert(std::string *);
      |        ^
iconv_utils.cpp:110:6: note: 'convert' was previously declared here
  110 | bool Iconv::convert(std::string *str) {
      |      ^
iconv_utils.cpp:110:6: note: code may be misoptimized unless '-fno-strict-aliasing' is used
lto1: some warnings being treated as errors
lto-wrapper: fatal error: x86_64-pc-linux-gnu-g++ returned 1 exit status

Downstream report: https://bugs.gentoo.org/924569
Full logs: build.log

error on making user dictionary.

What steps will reproduce the problem?
1. making user-dictionary.
  $ /usr/local/libexec/mecab/mecab-dict-index -m model.def -d . -u user2.csv -f utf-8 -t utf-8 -a user.csv 
  model.def is not a binary model. reopen it as text mode...
  dictionary.cpp(183) [cid.left_size() == matrix.left_size() && cid.right_size() == matrix.right_size()] Context ID files(./left-id.def or ./right-id.def may be broken: 1999 2894 2894 1999

What is the expected output? What do you see instead?
  $ /usr/local/libexec/mecab/mecab-dict-index -m model.def -d . -u user2.csv -f utf-8 -t utf-8 -a user.csv 
  model.def is not a binary model. reopen it as text mode...
  reading user.csv ... 
  done!


What version of the product are you using? On what operating system?
  mecab-0.996.tar.gz, Ubuntu 13.10


Please provide any additional information below.
  It looks like a bug. Perhaps it can be fixed like this:

  dictionary.cpp:182, 355  
  [before]
  ------------------------------------------------------------
  CHECK_DIE(cid.left_size()  == matrix.left_size() &&
            cid.right_size() == matrix.right_size())
  ------------------------------------------------------------
  [after]
  ------------------------------------------------------------
  CHECK_DIE(cid.left_size()  == matrix.right_size() &&
            cid.right_size() == matrix.left_size())

Original issue reported on code.google.com by [email protected] on 10 Mar 2014 at 10:39

WPATH_FORCE() not defined on windows when compiling with msvc.

I'm trying to compile on Windows (x64) using the make.bat file found in .\mecab\mecab\src\ and had the following errors occur. I noticed that the WPATH(path) gets preprocessed to WPATH_FORCE(path) which is never defined. Is there a reason for this?
compiler version: MSVC 14.32.31326

feature_index.cpp(532): error C3861: 'WPATH_FORCE': identifier not found
feature_index.cpp(540): error C3861: 'WPATH_FORCE': identifier not found
feature_index.cpp(621): error C3861: 'WPATH_FORCE': identifier not found
feature_index.cpp(672): error C3861: 'WPATH_FORCE': identifier not found

#ifdef _WIN32
#ifdef __GNUC__
#define WPATH_FORCE(path) (MeCab::Utf8ToWide(path).c_str())
#define WPATH(path) (path)
#else
//Windows Path using msvc (WPATH_FORCE not defined)
//#define WPATH_FORCE(path) (path)
#define WPATH(path) WPATH_FORCE(path)
#endif
#else
#define WPATH_FORCE(path) (path)
#define WPATH(path) (path)
#endif

Ruby-1.8 compatibility fix

What steps will reproduce the problem?
1. including "ruby/version.h"
2. but it's missing in ruby-1.8.x.
3. failed to build with ruby-1.8.x.

What is the expected output? What do you see instead?
succeed to build with ruby-1.8 too.

What version of the product are you using? On what operating system?
mecab-ruby-0.996

Please provide any additional information below.

You should not depend on ruby's version,
use feature macros instead.

--- MeCab_wrap.cpp.orig 2013-02-17 17:24:16.000000000 +0000
+++ MeCab_wrap.cpp
@@ -1856,8 +1856,7 @@ static VALUE mMeCab;

 /* Workaround for ruby1.9.x */
 #if defined SWIGRUBY
-#include "ruby/version.h"
-#if RUBY_API_VERSION_CODE >= 10900
+#if HAVE_RUBY_ENCODING_H
 #include "ruby/encoding.h"
 #define rb_str_new rb_external_str_new
 #endif


Original issue reported on code.google.com by [email protected] on 28 Feb 2013 at 12:01

Support for Ruby2.7?

Hi,

Thanks for your amazing work on mecab.
Ruby 2.7 was recently released in December 2019. mecab doesn't yet support it and fails to build against Ruby 2.7.

Here are the logs: https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=951349;filename=mecab_0.996-9_amd64-2020-02-14T22%3A49%3A53Z.build;msg=5

And here's the Debian bug report against mecab: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951349

It'd be great if you could look at it and extend support to Ruby2.7 as well.

Failure initializing Tagger has no error message

If you try to initialize a Tagger object with an output format that doesn't exist (like -Oasdf), initialization will fail. On the command line this prints a reasonable error message, but if you call MeCab as a library (for example from a wrapper in another language), getGlobalError returns an empty string. This can be confusing if the problem is just a typo in the name of an output format or some other minor issue.

The root cause of this issue is that when an invalid output format is used the global error message is set twice. First it's set here, in ModelImpl::open:

setGlobalError(error.c_str());

At this point everything is correct. But this function was called from createTagger, and when that sees that creating the Model failed it sets the error message again, this time to an empty string:

setGlobalError(tagger->what());

I first became aware of this issue due to trouble in mecab-python3.

An easy fix is to not set the global error in createTagger. I'm not sure that's always correct, so it might be better to not set the global error if the new error string is empty - a check like that already exists in one version of setGlobalError, but not in the version normally used.
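
For reference, a minimal sketch of how the symptom shows up from Python, assuming mecab-python3's behavior of raising RuntimeError when Tagger initialization fails; the exact exception text depends on the wrapper version.

    import MeCab

    try:
        MeCab.Tagger('-Oasdf')  # output format that does not exist
    except RuntimeError as exc:
        # The command-line tool prints a clear message for the same typo, but the
        # propagated global error string here is empty, hiding the actual cause.
        print('Tagger initialization failed:', repr(exc))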

matrix right/left dimension checking is inconsistent (compiling user dictionary/assigning user dict costs)

with unidic as downloaded:

$ mecab/bin/mecab-dict-index.exe -m mecab/dic/unidic-csj/model.bin -d mecab/dic/unidic-csj/ -f utf-8 -c utf-8 -a userdict.csv -u user2.csv
dictionary.cpp(183) [cid.left_size() == matrix.left_size() && cid.right_size() == matrix.right_size()] Context ID files(mecab/dic/unidic-csj/\left-id.def or mecab/dic/unidic-csj/\right-id.def may be broken: 7074 8407 8407 7074

after swapping right-id and left-id files:

$ mecab/bin/mecab-dict-index.exe -m mecab/dic/unidic-csj/model.bin -d mecab/dic/unidic-csj/ -f utf-8 -c utf-8 -a userdict.csv -u user2.csv
reading userdict.csv ... dictionary.cpp(212) [lid >= 0 && rid >= 0 && matrix.is_valid(lid, rid)] invalid ids are found lid=7151 rid=6170

Python wrapper: surface text garbled in first call to parseToNode

What steps will reproduce the problem?

    $ python
    Python 2.7.3 (default, Aug  1 2012, 05:14:39)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> result = ""
    >>> import MeCab
    >>> t = MeCab.Tagger()
    >>> n = t.parseToNode("結晶系は正方晶系。")
    >>> result = ""
    >>> while n is not None:
    ...     result += n.surface
    ...     n = n.next
    ...
    >>> assert result == "結晶系は正方晶系。", repr(result)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError: '\x01rf\xff\xff\xff\xff\xff\xff\xff'
    >>>

What is the expected output? What do you see instead?

    The assertion should succeed (no exception thrown).

What version of the product are you using? On what operating system?

    MeCab version 0.996 on Ubuntu Precise.

Please provide any additional information below.

    On my machine the above code always reproduces the problem,
    but other code structures, such as assigning the text to a
    variable before parsing or moving the test code into a function
    definition, cause the test to run correctly.

    This bug only affects the initial call to a tagger and only if
    the call is parseToNode. The following incantation is a reliable
    workaround:

    >>> t = MeCab.Tagger()
    >>> t.parse("")

    The tagger can then be used as normal.


Original issue reported on code.google.com by [email protected] on 18 Mar 2013 at 1:03

“'gcc' failed with exit status 1” when trying to install Mecab with PyPy docker image

My Docker image builds just fine when trying to install Mecab with CPython using image tag python:3.8-slim, but it fails with PyPy.

My Dockerfile:

FROM pypy:3-7

RUN pypy -m ensurepip --default-pip

ENV PYTHONDONTWRITEBYTECODE 1
ENV FLASK_APP "main.py"
ENV PYTHONUNBUFFERED 1

RUN mkdir /app
WORKDIR /app

# Install Mecab
RUN apt-get update && apt-get -y install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file sudo

# Set up Mecab
RUN git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
RUN echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a

COPY Pip* /app/

RUN pip install --upgrade pip && \
    pip install pipenv && \
    pipenv install --dev --system --deploy --ignore-pipfile

ADD . /app

And the full error output:

Installing initially failed dependencies...
[InstallError]:   File "/opt/pypy/site-packages/pipenv/cli/command.py", line 253, in install
[InstallError]:       site_packages=state.site_packages
[InstallError]:   File "/opt/pypy/site-packages/pipenv/core.py", line 2063, in do_install
[InstallError]:       keep_outdated=keep_outdated
[InstallError]:   File "/opt/pypy/site-packages/pipenv/core.py", line 1312, in do_init
[InstallError]:       pypi_mirror=pypi_mirror,
[InstallError]:   File "/opt/pypy/site-packages/pipenv/core.py", line 900, in do_install_dependencies
[InstallError]:       retry_list, procs, failed_deps_queue, requirements_dir, **install_kwargs
[InstallError]:   File "/opt/pypy/site-packages/pipenv/core.py", line 796, in batch_install
[InstallError]:       _cleanup_procs(procs, failed_deps_queue, retry=retry)
[InstallError]:   File "/opt/pypy/site-packages/pipenv/core.py", line 703, in _cleanup_procs
[InstallError]:       raise exceptions.InstallError(c.dep.name, extra=err_lines)
[pipenv.exceptions.InstallError]: Collecting mecab-python3==0.996.5
[pipenv.exceptions.InstallError]:   Using cached mecab-python3-0.996.5.tar.gz (65 kB)
[pipenv.exceptions.InstallError]: Building wheels for collected packages: mecab-python3
[pipenv.exceptions.InstallError]:   Building wheel for mecab-python3 (setup.py): started
[pipenv.exceptions.InstallError]:   Building wheel for mecab-python3 (setup.py): finished with status 'error'
[pipenv.exceptions.InstallError]:   ERROR: Command errored out with exit status 1:
[pipenv.exceptions.InstallError]:    command: /opt/pypy/bin/pypy3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/setup.py'"'"'; __file__='"'"'/tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-9i8vw475
[pipenv.exceptions.InstallError]:        cwd: /tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/
[pipenv.exceptions.InstallError]:   Complete output (59 lines):
[pipenv.exceptions.InstallError]:   running bdist_wheel
[pipenv.exceptions.InstallError]:   running build
[pipenv.exceptions.InstallError]:   running build_py
[pipenv.exceptions.InstallError]:   creating build
[pipenv.exceptions.InstallError]:   creating build/lib.linux-x86_64-3.6
[pipenv.exceptions.InstallError]:   creating build/lib.linux-x86_64-3.6/MeCab
[pipenv.exceptions.InstallError]:   copying src/MeCab/__init__.py -> build/lib.linux-x86_64-3.6/MeCab
[pipenv.exceptions.InstallError]:   warning: build_py: byte-compiling is disabled, skipping.
[pipenv.exceptions.InstallError]:
[pipenv.exceptions.InstallError]:   running build_ext
[pipenv.exceptions.InstallError]:   Extension build configuration adjusted:
[pipenv.exceptions.InstallError]:    include_dirs = ['/usr/include']
[pipenv.exceptions.InstallError]:    library_dirs = ['/usr/lib/x86_64-linux-gnu']
[pipenv.exceptions.InstallError]:    libraries    = ['mecab', 'stdc++']
[pipenv.exceptions.InstallError]:    swig_opts    = ['-O', '-builtin', '-c++', '-py3', '-I/usr/include']
[pipenv.exceptions.InstallError]:   building 'MeCab._MeCab' extension
[pipenv.exceptions.InstallError]:   swigging src/MeCab/MeCab.i to src/MeCab/MeCab_wrap.cpp
[pipenv.exceptions.InstallError]:   swig -python -O -builtin -c++ -py3 -I/usr/include -o src/MeCab/MeCab_wrap.cpp src/MeCab/MeCab.i
[pipenv.exceptions.InstallError]:   /usr/include/mecab.h:136: Warning 302: Identifier 'surface' redefined by %extend (ignored),
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab.i:74: Warning 302: %extend definition of 'surface'.
[pipenv.exceptions.InstallError]:   /usr/include/mecab.h:848: Warning 302: Identifier 'set_sentence' redefined by %extend (ignored),
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab.i:105: Warning 302: %extend definition of 'set_sentence'.
[pipenv.exceptions.InstallError]:   creating build/temp.linux-x86_64-3.6
[pipenv.exceptions.InstallError]:   creating build/temp.linux-x86_64-3.6/src
[pipenv.exceptions.InstallError]:   creating build/temp.linux-x86_64-3.6/src/MeCab
[pipenv.exceptions.InstallError]:   gcc -pthread -DNDEBUG -O2 -fPIC -I/usr/include -I/opt/pypy/include -c src/MeCab/MeCab_wrap.cpp -o build/temp.linux-x86_64-3.6/src/MeCab/MeCab_wrap.o -Wno-unused-variable
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp: In function ‘void SwigPyBuiltin_SetMetaType(PyTypeObject*, PyTypeObject*)’:
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:3444:11: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘ob_base’; did you mean ‘tp_base’?
[pipenv.exceptions.InstallError]:        type->ob_base.ob_base.ob_type = metatype;
[pipenv.exceptions.InstallError]:              ^~~~~~~
[pipenv.exceptions.InstallError]:              tp_base
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp: At global scope:
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:8464:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:    };
[pipenv.exceptions.InstallError]:    ^
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:8697:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:    };
[pipenv.exceptions.InstallError]:    ^
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:8978:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:    };
[pipenv.exceptions.InstallError]:    ^
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:9223:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:    };
[pipenv.exceptions.InstallError]:    ^
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:9445:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:    };
[pipenv.exceptions.InstallError]:    ^
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:9681:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:    };
[pipenv.exceptions.InstallError]:    ^
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp: In function ‘PyObject* PyInit__MeCab()’:
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:10401:16: error: ‘PyDescr_NewGetSet’ was not declared in this scope
[pipenv.exceptions.InstallError]:      this_descr = PyDescr_NewGetSet(SwigPyObject_type(), &this_getset_def);
[pipenv.exceptions.InstallError]:                   ^~~~~~~~~~~~~~~~~
[pipenv.exceptions.InstallError]:   src/MeCab/MeCab_wrap.cpp:10401:16: note: suggested alternative: ‘PyDescrObject’
[pipenv.exceptions.InstallError]:      this_descr = PyDescr_NewGetSet(SwigPyObject_type(), &this_getset_def);
[pipenv.exceptions.InstallError]:                   ^~~~~~~~~~~~~~~~~
[pipenv.exceptions.InstallError]:                   PyDescrObject
[pipenv.exceptions.InstallError]:   error: command 'gcc' failed with exit status 1
[pipenv.exceptions.InstallError]:   ----------------------------------------
[pipenv.exceptions.InstallError]:   ERROR: Failed building wheel for mecab-python3
[pipenv.exceptions.InstallError]:   Running setup.py clean for mecab-python3
[pipenv.exceptions.InstallError]: Failed to build mecab-python3
[pipenv.exceptions.InstallError]: Installing collected packages: mecab-python3
[pipenv.exceptions.InstallError]:     Running setup.py install for mecab-python3: started
[pipenv.exceptions.InstallError]:     Running setup.py install for mecab-python3: finished with status 'error'
[pipenv.exceptions.InstallError]:     ERROR: Command errored out with exit status 1:
[pipenv.exceptions.InstallError]:      command: /opt/pypy/bin/pypy3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/setup.py'"'"'; __file__='"'"'/tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-784d25jp/install-record.txt --single-version-externally-managed --compile --install-headers /opt/pypy/include/mecab-python3
[pipenv.exceptions.InstallError]:          cwd: /tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/
[pipenv.exceptions.InstallError]:     Complete output (60 lines):
[pipenv.exceptions.InstallError]:     running install
[pipenv.exceptions.InstallError]:     running build
[pipenv.exceptions.InstallError]:     running build_py
[pipenv.exceptions.InstallError]:     creating build
[pipenv.exceptions.InstallError]:     creating build/lib.linux-x86_64-3.6
[pipenv.exceptions.InstallError]:     creating build/lib.linux-x86_64-3.6/MeCab
[pipenv.exceptions.InstallError]:     copying src/MeCab/__init__.py -> build/lib.linux-x86_64-3.6/MeCab
[pipenv.exceptions.InstallError]:     copying src/MeCab/MeCab.py -> build/lib.linux-x86_64-3.6/MeCab
[pipenv.exceptions.InstallError]:     warning: build_py: byte-compiling is disabled, skipping.
[pipenv.exceptions.InstallError]:
[pipenv.exceptions.InstallError]:     running build_ext
[pipenv.exceptions.InstallError]:     Extension build configuration adjusted:
[pipenv.exceptions.InstallError]:      include_dirs = ['/usr/include']
[pipenv.exceptions.InstallError]:      library_dirs = ['/usr/lib/x86_64-linux-gnu']
[pipenv.exceptions.InstallError]:      libraries    = ['mecab', 'stdc++']
[pipenv.exceptions.InstallError]:      swig_opts    = ['-O', '-builtin', '-c++', '-py3', '-I/usr/include']
[pipenv.exceptions.InstallError]:     building 'MeCab._MeCab' extension
[pipenv.exceptions.InstallError]:     swigging src/MeCab/MeCab.i to src/MeCab/MeCab_wrap.cpp
[pipenv.exceptions.InstallError]:     swig -python -O -builtin -c++ -py3 -I/usr/include -o src/MeCab/MeCab_wrap.cpp src/MeCab/MeCab.i
[pipenv.exceptions.InstallError]:     /usr/include/mecab.h:136: Warning 302: Identifier 'surface' redefined by %extend (ignored),
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab.i:74: Warning 302: %extend definition of 'surface'.
[pipenv.exceptions.InstallError]:     /usr/include/mecab.h:848: Warning 302: Identifier 'set_sentence' redefined by %extend (ignored),
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab.i:105: Warning 302: %extend definition of 'set_sentence'.
[pipenv.exceptions.InstallError]:     creating build/temp.linux-x86_64-3.6
[pipenv.exceptions.InstallError]:     creating build/temp.linux-x86_64-3.6/src
[pipenv.exceptions.InstallError]:     creating build/temp.linux-x86_64-3.6/src/MeCab
[pipenv.exceptions.InstallError]:     gcc -pthread -DNDEBUG -O2 -fPIC -I/usr/include -I/opt/pypy/include -c src/MeCab/MeCab_wrap.cpp -o build/temp.linux-x86_64-3.6/src/MeCab/MeCab_wrap.o -Wno-unused-variable
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp: In function ‘void SwigPyBuiltin_SetMetaType(PyTypeObject*, PyTypeObject*)’:
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:3444:11: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘ob_base’; did you mean ‘tp_base’?
[pipenv.exceptions.InstallError]:          type->ob_base.ob_base.ob_type = metatype;
[pipenv.exceptions.InstallError]:                ^~~~~~~
[pipenv.exceptions.InstallError]:                tp_base
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp: At global scope:
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:8464:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:      };
[pipenv.exceptions.InstallError]:      ^
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:8697:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:      };
[pipenv.exceptions.InstallError]:      ^
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:8978:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:      };
[pipenv.exceptions.InstallError]:      ^
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:9223:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:      };
[pipenv.exceptions.InstallError]:      ^
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:9445:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:      };
[pipenv.exceptions.InstallError]:      ^
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:9681:1: error: too many initializers for ‘PyHeapTypeObject’ {aka ‘_heaptypeobject’}
[pipenv.exceptions.InstallError]:      };
[pipenv.exceptions.InstallError]:      ^
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp: In function ‘PyObject* PyInit__MeCab()’:
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:10401:16: error: ‘PyDescr_NewGetSet’ was not declared in this scope
[pipenv.exceptions.InstallError]:        this_descr = PyDescr_NewGetSet(SwigPyObject_type(), &this_getset_def);
[pipenv.exceptions.InstallError]:                     ^~~~~~~~~~~~~~~~~
[pipenv.exceptions.InstallError]:     src/MeCab/MeCab_wrap.cpp:10401:16: note: suggested alternative: ‘PyDescrObject’
[pipenv.exceptions.InstallError]:        this_descr = PyDescr_NewGetSet(SwigPyObject_type(), &this_getset_def);
[pipenv.exceptions.InstallError]:                     ^~~~~~~~~~~~~~~~~
[pipenv.exceptions.InstallError]:                     PyDescrObject
[pipenv.exceptions.InstallError]:     error: command 'gcc' failed with exit status 1
[pipenv.exceptions.InstallError]:     ----------------------------------------
[pipenv.exceptions.InstallError]: ERROR: Command errored out with exit status 1: /opt/pypy/bin/pypy3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/setup.py'"'"'; __file__='"'"'/tmp/pip-install-7gu0in6y/mecab-python3_30c04743845644dab48b7bc8db7a8877/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-784d25jp/install-record.txt --single-version-externally-managed --compile --install-headers /opt/pypy/include/mecab-python3 Check the logs for full command output.
ERROR: Couldn't install package: mecab-python3
 Package installation failed...

I tried adding gcc to the installation but it did not solve the problem. Is Mecab not compatible with PyPy?

Installing mecab

I know there must be a very simple solution for this, but as I don't know it, I am asking.

How do I install taku910/mecab?

I got the following error while installing external libraries for LASER (https://github.com/facebookresearch/LASER)
automatic installation of the Japanese tokenizer mecab may be tricky
Please install it manually from https://github.com/taku910/mecab

The installation directory should be /home/appuser/sheshank.k/projects/laser/tools-external/mecab

Problems when training

Hello.
I am trying to use 'mecab-cost-train'. I have a corpus file that contains over 10,000,000 lines.
When run, it uses over 300GB of memory, so it is impossible to finish 'mecab-cost-train'.
Why does it require so much memory?
My guess is that generating too many 'EncoderLearnerTagger' objects is the cause of this problem (line 115:

alpha.resize(psize);
)

Are there any solutions for training a corpus that contains over 10,000,000 lines?
Thank you.

Output Format

I'm trying to parse sentences from a corpus, and in the output there are numbers as shown in the attached image (e.g., '0', '0', '1,2', etc.). The file describing the output format states:

%f[N1,N2,N3...] | Output the N1st, N2nd, and N3rd feature fields, with "," as the delimiter.

%FC[N1,N2,N3...] | Output the N1st, N2nd, and N3rd feature fields, with C as the delimiter. However, empty fields are omitted. (Example) %F-[0,1,2]

In this case, are these what the numbers are displaying? If so, what do the numbers mean exactly?

Also, is there any dependency tagging that we can use to extract nominal phrases?

(Attached image: MeCab Output)
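
As a side note on the %f directive quoted above, here is a minimal sketch of driving it from the Python wrapper. It assumes an ipadic setup, where %f[0] and %f[1] are the POS and POS-subcategory fields; the sentence is arbitrary.

    import MeCab

    # -F sets the per-node output format: %m is the surface form, %f[0] and
    # %f[1] are the first two comma-separated feature fields, and \t / \n are
    # escape sequences interpreted by MeCab itself.
    tagger = MeCab.Tagger(r'-F%m\t%f[0],%f[1]\n')
    print(tagger.parse('国際サッカー連盟の幹部が逮捕された。'))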

Problems when training.

Hi, first I would like to thank Taku for his awesome mecab.
I'm training MeCab from scratch to make it analyse Chinese sentences thanks to
this website http://www.onaneet.org/blog/archives/4020, but I have some
trouble while doing it.

First, I prepared the files 
- dicrc
- char.def
- unk.def
- rewrite.def
- feature.def
as explained on onaneet.
Then I prepared a training corpus for chinese and used mecab-dict-index.
Everithing perfect here.
But when running mecab-cost-train, if the training corpus has more than around
700 sentences, the program stops without any error on stderr.

The problem is that 700 sentences is a bit small for training, isn't it?
And this is an unexpected bug...

I used the Windows version mecab-0.996.exe on Windows Server 2008 R2 Standard
for x64 processors.

Original issue reported on code.google.com by [email protected] on 18 Jul 2013 at 7:53

Japanese morphological analyzer

What steps will reproduce the problem?
1. Step 1: Installation (Error: 'mecab-config' is not recognized as an internal
or external command, operable program or batch file.)


What is the expected output? What do you see instead?
I expect parts of speech for Japanese words.

What version of the product are you using? On what operating system?
Microsoft Windows XP professional Version 2002

Please provide any additional information below.
perl version - perl 5, version 14, subversion 3

Original issue reported on code.google.com by [email protected] on 18 Dec 2013 at 7:46

Attachments:

swig version used for mecab-python too old.

What steps will reproduce the problem?
1. The version of swig used for mecab-python 0.996 is 1.3.40 from 2009.
2. This is not compatible with python3 due to MeCab_wrap.cxx:2453:51: error: 
‘PyCObject_Import’.
3. The newer swig 2.0.9 from 2013 solved this problem.

See: http://sourceforge.net/p/swig/bugs/1104/
     https://bugzilla.redhat.com/show_bug.cgi?id=623865

For your reference, the updated files MeCab.py and MeCab_wrap.cxx are attached.

Please rerun the latest swig on the mecab package to regenerate these files and release a package using them.

What version of the product are you using? On what operating system?
mecab-python 0.996
Debian GNU/Linux 7.1

Note on minor issues:

The setup.py is outdated. Please drop the deprecated string module and use the built-in str methods, which are compatible with newer Python 2.X and Python 3. (Oh, the variable name "str" needs to be avoided.) See the attached setup.py.

Also, it may make your life easier if you put mkdir in swig/Makefile of the mecab source, as attached.

Original issue reported on code.google.com by [email protected] on 12 Aug 2013 at 4:58

Attachments:

mecab-dict-index '-a' option overwrites user-specified costs/ids unexpectedly

mecab-dict-index '-a' option overwrites user-specified costs/ids unexpectedly.

The expected behavior of the '-a' option is that blank fields are filled in automatically, while user-specified ones are kept as they are.

Below is an example.

  1. Prepare a foo.csv file for making a user dictionary, as follows.

    田町,,,3000,名詞,固有名詞,地域,一般,,,田町,タマチ,タマチ

  2. Execute the following line (before doing this, you need to get ipadic dictionary and its model file)

    mecab-dict-index -m mecab-ipadic.model -d ipadic -u foo2.csv -f euc-jp -t euc-jp -a foo.csv

  3. then, you get the following output in foo2.csv

    田町,1293,1293,8067,名詞,固有名詞,地域,一般,,,田町,タマチ,タマチ

As you see, the user-specified cost, 3000, is overwritten by 8067.

The expected output, in this case, is:

  • 田町,1293,1293,3000,名詞,固有名詞,地域,一般,,,田町,タマチ,タマチ

Max Grouping Size off-by-one error

Minor error, but the maximum length of unknown-word tokens is one greater than the number reported or specified with the -M flag. So by default it is 25 characters, even though the help says 24. I ran into this in the issue below; a quick way to check this from Python is sketched after the link.

polm/cutlet#21
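
A small sketch for checking the reported behavior from Python, assuming an ipadic-style setup in which a long run of out-of-vocabulary characters is grouped into a single unknown-word token; the test string, the chosen -M values, and the helper name are arbitrary.

    import MeCab

    def longest_token_length(tagger, text):
        """Return the character length of the longest surface form produced."""
        node = tagger.parseToNode(text)
        longest = 0
        while node:
            longest = max(longest, len(node.surface))
            node = node.next
        return longest

    text = 'ゑ' * 40  # a run of characters unlikely to form dictionary words
    for m in (10, 24):
        tagger = MeCab.Tagger('-M %d' % m)
        tagger.parse('')  # workaround for the 0.996 first-call parseToNode issue
        print('-M', m, '-> longest token:', longest_token_length(tagger, text), 'chars')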

Words do not get divided properly when small letters (捨て仮名) are included in the word

Version: 0.996
OS: Ubuntu 18.04 (Windows Subsystem for Linux)

Hello,

I have found some cases where a run of hiragana is analyzed as one word (probably an unknown word) when small letters (捨て仮名) are included in it.

Example:

$ mecab
出ておりますでしょうか
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でしょ  助動詞,*,*,*,特殊・デス,未然形,です,デショ,デショ
う      助動詞,*,*,*,不変化型,基本形,う,ウ,ウ
か      助詞,副助詞/並立助詞/終助詞,*,*,*,*,か,カ,カ
EOS

出ておりますでしょうかあっ
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でしょ  助動詞,*,*,*,特殊・デス,未然形,です,デショ,デショ
う      助動詞,*,*,*,不変化型,基本形,う,ウ,ウ
か      助詞,副助詞/並立助詞/終助詞,*,*,*,*,か,カ,カ
あっ    感動詞,*,*,*,*,*,あっ,アッ,アッ
EOS

出ておりますでしょうかあっはいはいはい
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でし    助動詞,*,*,*,特殊・デス,連用形,です,デシ,デシ
ょうかあっはいはいはい  名詞,一般,*,*,*,*,*
EOS

出ておりますでしょうかあっはいはいはいじゃっすいませんちょっとお待ちいただけたら
出      動詞,自立,*,*,一段,連用形,出る,デ,デ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
おり    動詞,非自立,*,*,五段・ラ行,連用形,おる,オリ,オリ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
でし    助動詞,*,*,*,特殊・デス,連用形,です,デシ,デシ
ょうかあっはいはいはいじゃっすいませんちょっとお        名詞,一般,*,*,*,*,*
待ち    名詞,接尾,一般,*,*,*,待ち,マチ,マチ
いただけ        動詞,自立,*,*,一段,連用形,いただける,イタダケ,イタダケ
たら    助動詞,*,*,*,特殊・タ,仮定形,た,タラ,タラ
EOS

With a small number of characters after 「ょ」, mecab can still divide the text into 「でしょ」 and the rest, which is what we want.

However, if there are too many hiragana characters after 「ょ」, mecab lumps all the hiragana after 「ょ」 into a single token until a kanji character appears.

I couldn't find any report on this so far. Apologies if it is a duplicate.
Is there a way to suppress this behavior?

Thanks!

mecab-dict-gen crashes after a long time

After parameter learning from a corpus and a dictionary, neither of which is particularly big, I try to generate the dictionary from the built model (CRF parameter file) like below:

F/seed$ mecab-dict-gen -m csj_f.mdl -o ../
csj_f.mdl is not a binary model. reopen it as text mode...
reading ./unk.def ... 36
reading ./csj_dic.csv ... 35243
emitting ../left-id.def/ ../right-id.def
emitting ../unk.def ... 36
emitting ../csj_dic.csv ... 35243
emitting matrix : 3% |#

but without success: it crashes with just the message 'killed'.

The parameter file is 352M with 5 million lines, while the dictionary is 2M with 40 thousand items. mecab-dict-gen then takes a long time, about 5 minutes for every 1% of progress, and frustratingly, at around 50%, i.e. after about 8 hours, it gets killed.

First of all, I wonder what makes it take so long and whether there is a way to investigate or debug it. Perhaps the parameter file is unusually big? If there is any recipe for avoiding this type of problem, please advise. If you need more info, please get back to me.

[mecab-dict-index] error

Hi.
When I run 'mecab-dict-index', an error occurs.
The log output is as follows.

==============================================================================

reading ./ETN.csv ... 14
reading ./LISTEN_NER.csv ... 2081
reading ./Preanalysis.csv ... 5
reading ./TV_fullKorean_dict.csv ... 1687814
reading ./NP.csv ... 342
reading ./EF.csv ... 1820
reading ./XSA.csv ... 20
reading ./MM.csv ... 453
reading ./keyword.csv ... 276
reading ./XPN.csv ... 83
reading ./unk_word 1 1 0 (2nd).csv ... 276
reading ./Inflect.csv ... 44850
reading ./VA.csv ... 2360
reading ./XSV.csv ... 24
reading ./keyword_etc.csv ... 222
reading ./Place.csv ... 30300
reading ./LISTEN_unk_word 1 1 9.csv ... 254
reading ./LISTEN_KEYWORD.csv ... 2
reading ./sejong21_word.csv ... 846637
reading ./NNP.csv ... 2371
reading ./Hanja.csv ... 124570
reading ./EP.csv ... 51
reading ./KOR_ENG_csv.csv ... 60365
reading ./sejong21_verbal2.csv ... 15160
reading ./Foreign.csv ... 11599
reading ./NR.csv ... 482
reading ./NNB.csv ... 140
reading ./LISTEN_unk_word.csv ... 254
reading ./Wikipedia.csv ... 36763
reading ./sejong21_fusion.csv ... 1321382
reading ./VCN.csv ... 7
reading ./NNG.csv ... 205269
reading ./MAG.csv ... 14244
reading ./Person-actor.csv ... 99237
reading ./Symbol.csv ... 16
reading ./VCP.csv ... 9
reading ./VX.csv ... 125
reading ./Person.csv ... 196461
reading ./Group.csv ... 3176
reading ./XSN.csv ... 124
reading ./ETM.csv ... 133
reading ./NorthKorea.csv ... 3
dictionary.cpp(472) [da.build(str.size(), const_cast<char **>(&str[0]), &len[0], &val[0], &progress_bar_darts) == 0] unkown error in building double-array

==============================================================================

dictionary.cpp lines 472-476 look like this:

for (size_t i = 0; i < dic.size(); ++i) {
  tbuf.append(reinterpret_cast<const char*>(dic[i].second),
              sizeof(Token));
  delete dic[i].second;
}

==============================================================================

This error occurred when I added 'TV_fullKorean_dict.csv', which contains 1,687,814 entries.
The file size is 165.8MB.
Is there any limit on the csv file size?

Thank you

When training, speed of reading corpus is very slow

Hi.
I am now training on a corpus that I have made.
The corpus size exceeds 100MB.
Thanks to multithreading support, the training itself is very fast, but reading the corpus is very slow; it does not seem to be multithreaded.

What can I do to speed up reading the corpus?
Thank you

mecab-python causes RuntimeError on Windows64bit/Python3 environment

I tried to use mecab-python on Windows, but it causes a RuntimeError (the Python version is 3.4).

>>> import MeCab
>>> t = MeCab.Tagger()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Develop\Source\mecab\mecab\python\MeCab.py", line 433, in __init__
    this = _MeCab.new_Tagger(*args)
RuntimeError
  • I used the code from the GitHub master branch (mecab/mecab/python).
  • I attached a fix for Windows (referring to this site, etc.).
  • I tried to regenerate the binding code with swig (swig 3.0.7).

Below is the patched branch that I used.

icoxfog417/mecab#windows

But none of the fixes work (in addition, it works fine on 32-bit).

Java/OSX: MeCab::Tagger::parseToNode() as the very first call returns a broken pointer for Node::surface()

  • What steps will reproduce the problem?

    • brew install mecab
    • brew install mecab-ipadic
    • build the java-binding of 0.996
    • modify mecab/mecab/java/test.java to skip System.out.println(tagger.parse(str)); such that the next line Node node = tagger.parseToNode(str); will be the very first step right after MeCab::Tagger is initialized.
  • What is the expected output?

    太郎 名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー
    は 助詞,係助詞,*,*,*,*,は,ハ,ワ
    二郎 名詞,固有名詞,人名,名,*,*,二郎,ジロウ,ジロー
    に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
    この 連体詞,*,*,*,*,*,この,コノ,コノ
    本 名詞,一般,*,*,*,*,本,ホン,ホン
    を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
    渡し 動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ
    た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
    。 記号,句点,*,*,*,*,。,。,。

  • What do you see instead?

    名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー
    ]z( 助詞,係助詞,*,*,*,*,は,ハ,ワ
    )Lorg/ 名詞,固有名詞,人名,名,*,*,二郎,ジロウ,ジロー
    cha 助詞,格助詞,一般,*,*,*,に,ニ,ニ
    sen/me 連体詞,*,*,*,*,*,この,コノ,コノ
    cab 名詞,一般,*,*,*,*,本,ホン,ホン
    /Pa 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
    th;し 動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ
    た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
    。 記号,句点,*,*,*,*,。,。,。

  • What version of the product are you using?

    • mecab-0.996.tar.gz
    • mecab-java-0.996.tar.gz
  • On what operating system?
    Mac OS X El Capitan 10.11.4

Intel C++ compiler does not compile trunk.

What steps will reproduce the problem?
1. Compile trunk with Intel C++ compiler.

What is the expected output? What do you see instead?
Compilation errors stating duplicate explicit instantiations.

What version of the product are you using? On what operating system?
trunk, Scientific Linux 6.4.

Please provide any additional information below.
patch: https://gist.github.com/gwtnb/6546990

Original issue reported on code.google.com by [email protected] on 13 Sep 2013 at 5:20

Mark releases via git tags

It would be nice if you could add git tags for the releases. This would enable making packages directly from github without having to rely on the google drive link provided on your website.

Running configure for mecab-ipadic gives an error saying matrix.def is missing

The steps to reproduce are as follows.

First, build mecab with the steps below. The --with-charset="utf8" configure option follows the installation instructions for the Mac OS X binary.

git clone https://github.com/taku910/mecab
cd mecab
./configure --with-charset="utf8"
make
make check
sudo make install

Next, trying to build mecab-ipadic resulted in an error.

cd ../mecab-ipadic
./configure --with-charset="utf-8"

The error message is as follows.

configure: error: cannot find sources (matrix.def) in . or ..

I don't know what matrix.def is at all, but mecab-jumandic has a file with the same name, so I tried building with that and the build did go through.

ln -s ../mecab-jumandic/matrix.def
./configure --with-charset="utf-8"
make
sudo make install

However, when I try 「すもももももももものうち」, the 「もも」 after 「すもも」 and 「も」 is not segmented correctly.

$ mecab
すもももももももものうち
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も  助詞,係助詞,*,*,*,*,も,モ,モ
も  助詞,係助詞,*,*,*,*,も,モ,モ
も  助詞,係助詞,*,*,*,*,も,モ,モ
も  助詞,係助詞,*,*,*,*,も,モ,モ
も  助詞,係助詞,*,*,*,*,も,モ,モ
もの  名詞,非自立,一般,*,*,*,もの,モノ,モノ
うち  名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

I also looked at https://github.com/taku910/mecab/blob/6b392e3960a4f5562e18742cb390ae1e22353d2a/mecab-ipadic/INSTALL, but my steps do not seem to be wrong.

Just in case, I also deleted the matrix.def symlink and ran configure without any options, and got the same error.

$ rm matrix.def
$ ./configure
configure: error: cannot find sources (matrix.def) in . or ..

So, could you tell me the correct way to build mecab-ipadic?

How to set --input-buffer-size when using -p option

First I want to thank you so much for developing this software! I really admire your efforts and shining achievements!!

I would like to ask a question about --input-buffer-size.

When I use the -p option to use partial parsing mode, it seems that --input-buffer-size is ignored. Without the -p option, I can set a larger --input-buffer-size and parse longer lines. But with -p, MeCab returns an empty string when I try to parse longer lines, regardless of the --input-buffer-size value.

Is there any way to set the --input-buffer-size value when I use the -p option? Any help will be greatly appreciated.

Thank you.

Don't specify node-format option when using UniDic

If we specify the node-format option when using UniDic, the content of this option won't be reflected in the analysis results unless we also specify -O "" as an argument.

Example:

$ echo "このタスクにキミをアサインしておいたから。" | mecab -d /var/lib/mecab/dic/unidic -F%m\t%t,%f[12]\n # When -O "" isn't specified
この	コノ	コノ	此の	連体詞		
タスク	タスク	タスク	タスク-task	名詞-普通名詞-一般		
に	ニ	ニ	に	助詞-格助詞		
キミ	キミ	キミ	君-代名詞	代名詞		
を	オ	ヲ	を	助詞-格助詞		
アサイン	アサイン	アサイン	アサイン-assign	名詞-普通名詞-サ変可能	
し	シ	スル	為る	動詞-非自立可能	サ行変格	連用形-一般
て	テ	テ	て	助詞-接続助詞		
おい	オイ	オク	置く	動詞-非自立可能	五段-カ行	連用形-イ音便
た	タ	タ	た	助動詞	助動詞-タ	終止形-一般
から	カラ	カラ	から	助詞-接続助詞		
。			。	補助記号-句点		
EOS
$ echo "このタスクにキミをアサインしておいたから。" | mecab -d /var/lib/mecab/dic/unidic -O "" -F%m\t%t,%f[12]\n # When -O "" is specified
この	6,和
タスク	7,外
に	6,和
キミ	7,和
を	6,和
アサイン	7,外
し	6,和
て	6,和
おい	6,和
た	6,和
から	6,和
。	3,記号
EOS

Mecab algorithm (Mecabアルゴリズム)

Is there a document somewhere that describes the Mecab algorithm?

Or could someone give a simple one-paragraph description?

I need this functionality in my website and phone apps for teaching languages (www.jtlanguage.com). I want to generalize it for other languages also. I need it without license problems. Therefore I want to create my own C# implementation.

Thank you.

Undefined reference to '__imp__ZN5MeCab12createTaggerEPKc' when building example.cpp

I tried to build the example.cpp file, but got these errors:

C:/Program Files (x86)/MeCab/sdk/example.cpp:14: undefined reference to `__imp__ZN5MeCab12createTaggerEPKc'

C:/Program Files (x86)/MeCab/sdk/example.cpp:15: undefined reference to `__imp__ZN5MeCab14getTaggerErrorEv'

C:/Program Files (x86)/MeCab/sdk/example.cpp:18: undefined reference to `__imp__ZN5MeCab14getTaggerErrorEv'

C:/Program Files (x86)/MeCab/sdk/example.cpp:24: undefined reference to `__imp__ZN5MeCab14getTaggerErrorEv'

C:/Program Files (x86)/MeCab/sdk/example.cpp:27: undefined reference to `__imp__ZN5MeCab14getTaggerErrorEv'

C:/Program Files (x86)/MeCab/sdk/example.cpp:33: undefined reference to `__imp__ZN5MeCab14getTaggerErrorEv'

What is the problem, and how can I solve it? Any help will be appreciated!

0.995 causes Mojibake on Mac OS X

What steps will reproduce the problem?
1. I installed mecab with Homebrew on Mac OS X 10.8.2.
2. brew downloaded 0.995 and built a binary, and brew installed mecab-ipadic-2.7.0-20070801.
3. Then I tried to use mecab in the console.

What is the expected output? What do you see instead?

expected

これはテスト
これ  名詞,代名詞,一般,*,*,*,これ,コレ,コレ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
テスト   名詞,サ変接続,*,*,*,*,テスト,テスト,テスト

result

これはテスト
これは   ̾??,????,*,*,*,*,*
テスト   ̾??,????,*,*,*,*,*
EOS


What version of the product are you using? On what operating system?

OS:  Mac OS X 10.8.2.
Mecab: 0.995


Please provide any additional information below.

0.994 works fine.

Original issue reported on code.google.com by [email protected] on 1 Mar 2013 at 5:25

A word that is not matched (当たらない言葉)

Why isn't 「歓送迎会」 matched by either IPAdic or UniDic? It should be in the dictionary, right?

歓送迎会
歓送 名詞,サ変接続,*,*,*,*,歓送,カンソウ,カンソー
迎 名詞,固有名詞,地域,一般,*,*,迎,ムカエ,ムカエ
会 名詞,接尾,一般,*,*,*,会,カイ,カイ
EOS

The validation of Dictionary::assignUserDictionaryCosts() is inappropriate

Problem

When using the UniDic dictionary and attempting to estimate the cost of user dictionaries, a validation error occurs at the following location.

CHECK_DIE(cid.left_size() == matrix.left_size() &&
          cid.right_size() == matrix.right_size())
    << "Context ID files("
    << left_id_file
    << " or "
    << right_id_file << " may be broken: "
    << cid.left_size() << " " << matrix.left_size() << " "
    << cid.right_size() << " " << matrix.right_size();

dictionary.cpp(184) [cid.left_size() == matrix.left_size() && cid.right_size() ==
matrix.right_size()] Context ID files(C:/Program Files/MeCab/dic/unidic-csj-3.1.1-
full\left-id.def or C:/Program Files/MeCab/dic/unidic-csj-3.1.1-full\right-id.def
may be broken: 18552 15629 20859 15389

Causes and Solutions

This issue is due to the fact that the context_id is not unique for each line in the left_id_file (right_id_file). For instance, the left_id_file of unidic-csj-3.1.1-full is as follows:

7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,促音形,*,1,*,*
7845 名詞,固有名詞,人名,姓,*,*,*,*,固,ツ促,基本形,*,1,*,*

Therefore, at the above-mentioned location, validation must be performed using the number of unique context_ids, not cid.left_size() (the number of lines in the left_id_file).

And it seems that the left and right are also reversed. Ideally, I believe it should be as follows:

  CHECK_DIE(cid.right_context_id_unique_size() == matrix.left_size() &&
            cid.left_context_id_unique_size()  == matrix.right_size())

A workaround for estimating the cost of user dictionaries involves only rewriting the first line of matrix.def and then rebuilding the user dictionary after cost estimation (pointed out in https://zenn.dev/zagvym/articles/28056236903369).
However, I believe that fixing the aforementioned validation location is the fundamental solution.

mecab-python 0.99 test.py fails

What steps will reproduce the problem?
1. First of all, when I execute test.py (in mecab-python
   0.99 tarball) I see some encoding error like:

+ 
PYTHONPATH=/home/tasaka1/rpmbuild/INSTROOT/python-mecab-0.99-0.990.fc-foo-tasaka
1/usr/lib/python2.7/site-packages
+ /usr/bin/python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xe5' in file test.py on line 7, but no 
encoding declared; see http://www.python.org/peps/pep-0263.html for details
エラー: /home/tasaka1/rpmbuild/INSTROOT/rpm-tmp.WkohPT 
の不正な終了ステータス (%check)

  It seems that test.py needs some encoding settings.

2. And even if I set encoding on test.py, now with mecab-python
   0.99 test fails like:

+ /bin/sed -i.encoding -e '1s|^\(.*\)$|\1\n# coding=UTF-8|' test.py
+ 
PYTHONPATH=/home/tasaka1/rpmbuild/INSTROOT/python-mecab-0.99-0.990.fc-foo-tasaka
1/usr/lib/python2.7/site-packages
+ /usr/bin/python test.py
0.99
.....
.....
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    len = n.sentence_length;
  File "/home/tasaka1/rpmbuild/INSTROOT/python-mecab-0.99-0.990.fc-foo-tasaka1/usr/lib/python2.7/site-packages/MeCab.py", line 127, in <lambda>
    __getattr__ = lambda self, name: _swig_getattr(self, Node, name)
  File "/home/tasaka1/rpmbuild/INSTROOT/python-mecab-0.99-0.990.fc-foo-tasaka1/usr/lib/python2.7/site-packages/MeCab.py", line 54, in _swig_getattr
    raise AttributeError(name)
AttributeError: sentence_length
エラー: /home/tasaka1/rpmbuild/INSTROOT/rpm-tmp.Capq73 
の不正な終了ステータス (%check)

  It seems that this is because mecab_node_t::sentence_length
  was removed in 0.99.

What is the expected output? What do you see instead?
test.py should succeed.


What version of the product are you using? On what operating system?
mecab 0.99 / mecab-python 0.99
Python 2.7.2

Please provide any additional information below.
With encoding issue fixed, mecab 0.98 / mecab-python 0.98 test.py
succeeds.


Original issue reported on code.google.com by [email protected] on 8 Jan 2012 at 5:41
