The lindera from lindera-morphology

It has become difficult to manage the related packages in separate repositories, so we will merge them into one repository.
The target repositories are:

Create a user dictionary package

Move the user dictionary functionality contained in the lindera-ipadic-builder package into a separate package.

For example: lindera-user-dic-builder

Lindera doesn’t build

Currently, we can’t import lindera in the latest version. It doesn’t build, and since the change has been pushed as a minor version, it probably broke every project relying on lindera.

...
    Checking lindera-decompress v0.13.5
    Checking bstr v0.2.17
    Checking lindera-core v0.13.5
    Checking csv v1.1.6
   Compiling character_converter v2.1.0
    Checking lindera-unidic-builder v0.13.5
    Checking lindera-ipadic-builder v0.13.5
    Checking lindera-dictionary v0.13.5
    Checking lindera-ko-dic-builder v0.13.5
    Checking lindera-cc-cedict-builder v0.13.5
   Compiling lindera-ipadic v0.13.5
    Checking lindera v0.13.5
error[E0599]: no variant or associated item named `DictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:64:40
   |
64 |             _ => Err(LinderaErrorKind::DictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^
   |                                        |
   |                                        variant or associated item not found in `lindera_core::error::LinderaErrorKind`
   |                                        help: there is a variant with a similar name: `DictionaryLoadError`

error[E0599]: no variant or associated item named `UserDictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:84:40
   |
84 |             _ => Err(LinderaErrorKind::UserDictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `lindera_core::error::LinderaErrorKind`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera` due to 2 previous errors

To reproduce the issue, check out https://github.com/meilisearch/charabia at commit 82c9f3b.

Support compressed user dictionary

Support for user dictionary in the CLI

Modify the CLI to allow user dictionary to be specified.

Hello,
I'm trying to build a tokenizer app that supports Korean and Japanese with the lindera module. Japanese seems to be supported by default, but Korean requires building a dictionary with https://github.com/lindera-morphology/lindera-ko-dic-builder.

Is there a guide for using it?

Separate the dictionary into another package

Separate the dictionary into another package.
This is a preparation for using multiple dictionaries in the future.

Make some functions private in Formatter

Some functions in Formatter do not need to be public.

Add GitHub Actions Integration

Add GitHub Actions integration like mosuka/bayard#94, along with some refactoring, as follows:

- Add GitHub Actions integration
- Make the output format an enum
- Make the tokenize mode an enum
- Optimize the build script
- Update the Dockerfile

Prepare a trait for implementing each dictionary builder

The same functions are duplicated in each dictionary builder package.
Because this causes maintenance issues, we will prepare a trait and implement a dictionary builder structure for each dictionary in its own package.
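
A minimal sketch of how such a trait could remove the duplication is shown below. The names DictionaryBuilder, parse_row, build_entries, FormatA, and FormatB are hypothetical and are not taken from Lindera. The column positions are also made up for illustration. The idea is that the shared logic lives in a default method, while each builder supplies only its format-specific part.

```rust
/// Hypothetical shared interface for dictionary builders (not Lindera's API).
pub trait DictionaryBuilder {
    /// Format-specific step: parse one CSV row into (surface, word cost).
    fn parse_row(&self, row: &str) -> Option<(String, i32)>;

    /// Shared step, written once: skip blank lines and collect the rows
    /// that the format-specific parser accepts.
    fn build_entries(&self, source: &str) -> Vec<(String, i32)> {
        source
            .lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .filter_map(|line| self.parse_row(line))
            .collect()
    }
}

/// Illustrative format that keeps the cost in column 1 (assumption).
pub struct FormatA;

/// Illustrative format that keeps the cost in column 2 (assumption).
pub struct FormatB;

/// Helper used by both illustrative builders. The naive split does not
/// handle quoted CSV fields.
fn row_with_cost_at(row: &str, cost_index: usize) -> Option<(String, i32)> {
    let fields: Vec<&str> = row.split(',').collect();
    let surface = fields.first()?.to_string();
    let cost = fields.get(cost_index)?.trim().parse::<i32>().ok()?;
    Some((surface, cost))
}

impl DictionaryBuilder for FormatA {
    fn parse_row(&self, row: &str) -> Option<(String, i32)> {
        row_with_cost_at(row, 1)
    }
}

impl DictionaryBuilder for FormatB {
    fn parse_row(&self, row: &str) -> Option<(String, i32)> {
        row_with_cost_at(row, 2)
    }
}
```

With this shape, adding another dictionary format only requires a new parse_row implementation, and build_entries stays in one place.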

UniDic support

Delete SystemDict

SystemDict doesn't seem to be used anywhere.

https://github.com/bayard-search/lindera/blob/master/src/core/system_dict.rs

Automate release tasks

Update workflows.

regression.yml: Run tests on three platforms (Linux/Windows/OSX) for each push/pull request.
periodic.yml: Run tests periodically on the stable/beta/nightly versions of Rust.
release.yml: When a tag is created, release it to GitHub and publish to crates.io.

Docs for 0.10 failed

It seems the process to generate the documentation for 0.10 failed:

https://docs.rs/crate/lindera/0.10.0/builds/515760

Support CC-CEDICT user dictionary

Avoid building dictionaries not specified in features

Avoid building dictionaries not specified in features.
For example, if --features=ipadic, only lindera-ipadic will be built as a built-in dictionary.

Enrich word details

The token contains the text and its details, but the details only include the reading.
They do not include the part of speech or other information, so we need to add them.

https://github.com/bayard-search/lindera/blob/446d5f9c491a1dd64a990832d16aacf3e700007d/src/core/tokenizer.rs#L125-L128

https://github.com/bayard-search/lindera/blob/446d5f9c491a1dd64a990832d16aacf3e700007d/src/core/word_entry.rs#L27-L29

NEologd support

Add dictionary builder for CC-CEDICT

Add a dictionary builder for CC-CEDICT to support the Chinese language.
https://note.com/case_k/n/n88b0ffcefd09

lindera-ipadic randomly fails during build

When compiling lindera, we frequently get a build error:

 error: failed to run custom build command for `lindera-ipadic v0.10.0`

Caused by:
  process didn't exit successfully: `D:\a\milli\milli\target\release\build\lindera-ipadic-caf28ea0e76b9e29\build-script-build` (exit code: 1)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=Cargo.toml

  --- stderr
  Error: Custom { kind: UnexpectedEof, error: TarError { desc: "failed to iterate over archive", io: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" } } }

It seems to be related to dictionaries.

Any idea what the cause could be? Perhaps the Google Drive download? 🤔

Upgrade Yada to 0.4.x

Yada has extended its maximum offset limit.

takuyaa/yada#17

Add Python bindings

Make Lindera available from Python.

Add analyzer framework

Compresses dictionaries for morphological analysis by default.

#126

Restore the missing file.

Support tokenization modes (normal and search) in the Lindera CLI.

Lindera offers two modes. Change the Lindera CLI so that the mode can be specified.

https://github.com/bayard-search/lindera/blob/d6a534be83188236953aa58365ab3d6446601326/src/core/tokenizer.rs#L76-L80
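
As a rough sketch of how a CLI could accept the mode as a string, the enum below implements FromStr so that a command-line value can be parsed directly. The type name Mode, its variants, and the accepted strings "normal" and "search" are assumptions for illustration; the actual types in Lindera may differ.

```rust
use std::str::FromStr;

/// Hypothetical tokenization mode selected from the command line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Normal,
    Search,
}

impl FromStr for Mode {
    type Err = String;

    /// Accepts "normal" or "search", ignoring ASCII case.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "normal" => Ok(Mode::Normal),
            "search" => Ok(Mode::Search),
            other => Err(format!("unknown tokenization mode: {}", other)),
        }
    }
}
```

A CLI can then call "search".parse::<Mode>() on the argument value and report the error string when the value is not recognized.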

Support UniDic user dictionary

Add API document

Add field_length argument to parse_unk()

Currently, parse_dictionary_entry only accepts entries whose length is 11.
But the length depends on the dictionary builder, e.g. unidic is 10 and ko-dic is 12.

So the length should be specified as an argument.
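
One way to make the expected length configurable can be sketched as follows. The helper name parse_fields and its signature are hypothetical and are not Lindera's actual functions. The caller passes the field count for its dictionary (11, 10, or 12 in the cases mentioned above) instead of relying on a hard-coded value.

```rust
/// Hypothetical helper (not Lindera's API): split one dictionary line and
/// check it against the field count expected by the calling builder.
/// The naive split on ',' does not handle quoted CSV fields.
pub fn parse_fields(line: &str, field_length: usize) -> Result<Vec<&str>, String> {
    let fields: Vec<&str> = line.split(',').collect();
    if fields.len() != field_length {
        return Err(format!(
            "expected {} fields but found {}: {}",
            field_length,
            fields.len(),
            line
        ));
    }
    Ok(fields)
}
```

Each builder would then call parse_fields(line, 10) or parse_fields(line, 12) as appropriate, so a mismatch is reported with the length that builder actually expects.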

Migrate module directory tree from the 2015 edition to the 2018 edition

Migrate module directory tree from the 2015 edition to the 2018 edition.

Dictionary builders expect costs[1] to be backward_size

Each dictionary builder (e.g. ipadic, neologd, unidic, and ko) stores forward_size in costs[0] and backward_size in costs[1].

However, the load method of ConnectionCostMatrix reads backward_size from conn_data[0].

IPADIC, IPADIC-NEologd, and UniDic have the same forward and backward sizes, so they are unaffected.
But ko-dic has different values, so the cost method returns the wrong value.
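
The effect of reading the wrong header element can be illustrated with a small sketch. The struct name ConnectionCosts, the i16 header, and the row-major layout (forward_id * backward_size + backward_id) are assumptions made for this example and may not match Lindera's actual binary format. The point is only that the row stride must come from the second header element.

```rust
/// Hypothetical layout: a header [forward_size, backward_size] followed by
/// forward_size * backward_size costs in row-major order (assumption).
pub struct ConnectionCosts {
    forward_size: usize,
    backward_size: usize,
    costs: Vec<i16>,
}

impl ConnectionCosts {
    /// Reads forward_size from data[0] and backward_size from data[1].
    pub fn load(data: &[i16]) -> Option<Self> {
        let forward = *data.first()?;
        let backward = *data.get(1)?;
        if forward < 0 || backward < 0 {
            return None;
        }
        let forward_size = forward as usize;
        let backward_size = backward as usize;
        let costs = data.get(2..)?.to_vec();
        if costs.len() != forward_size * backward_size {
            return None;
        }
        Some(Self {
            forward_size,
            backward_size,
            costs,
        })
    }

    /// Looks up a cost; the row stride is backward_size, not forward_size.
    pub fn cost(&self, forward_id: usize, backward_id: usize) -> Option<i16> {
        if forward_id >= self.forward_size || backward_id >= self.backward_size {
            return None;
        }
        self.costs
            .get(forward_id * self.backward_size + backward_id)
            .copied()
    }
}
```

When both sizes are equal, using either header element as the stride gives the same index, which is why the problem only shows up for a dictionary whose sizes differ.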

docs.rs build failure

docs.rs build failure.

#62 (comment)

Change the project name again

The name Mokuzu has a similar pronunciation to mozc, so I want to avoid confusion.
Since this project is a fork of kuromoji-rs, change the name to be derived from kuromoji.

Enrich word detail

#28

Support ko-dic user dictionary

Build error in benches

   Compiling lindera v0.5.1 (/Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/lindera)
error[E0599]: no function or associated item named `default_normal` found for struct `lindera::tokenizer::Tokenizer` in the current scope
 --> lindera/benches/bench.rs:8:40
  |
8 |         let mut tokenizer = Tokenizer::default_normal();
  |                                        ^^^^^^^^^^^^^^ function or associated item not found in `lindera::tokenizer::Tokenizer`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera`.

Can't build lindera-ipadic on Raspberry Pi 4B

Lindera-ipadic is a requirement of the zola static website generator written in Rust.

During the zola build, it fails while building lindera-ipadic with this error:
memory allocation of 805306368 bytes failed
error: could not compile lindera-ipadic.

Environment: Raspberry Pi 4B, 4GB memory, debian.

I have tried to give it more contiguous memory by rebooting and trying again with a fresh system and no user apps running. Even then, the system apparently can't give it 800MB (!) of presumably contiguous memory. free -mh shows 2.7GB free, but not contiguous, I imagine.

Zola developers have asked me to report this to you. They do not think lindera-ipadic requires 800MB to build.

Thanks.

Move lindera-cli to another repository

Move lindera-cli to another repository.
Currently, the lindera-cli package is managed in the lindera repository as a member of the workspace.
We want to keep the lindera repository for library crates only and move binary crates like lindera-cli to a separate repository.

Publishing on crates.io

Publish on crates.io.
But cargo publish failed due to the following error:

error: api errors (status 200 OK): max upload size is: 10485760

Reconsider default LZMA dependency without any option to avoid it

Issue

PR #139, introduced in v0.9.0, makes LZMA (rust-lzma or lzma-rs) a mandatory dependency.
This forces all users to install the external library liblzma to be able to compile Lindera.

In comparison, v0.8.1 only requires adding lindera to the project's Cargo.toml.

Context

In Meilisearch we plan to use Lindera to tokenize Japanese texts, but we don't want to ask our users to install external libraries manually, in order to keep Meilisearch easy to install and easy to use.

Potential solutions

- reconsider #139
- choose a compression library that doesn't need a manually installed library (vendoring or rust library)
  - flate2
  - snap
  - brotli ? (🤷)
- provide a feature flag to choose the compression method

Thanks for maintaining Lindera 😊

Question for user dictionary parsing when using non-compressed local dictionary

When using a non-compressed local dictionary together with a user dictionary, build_user_dict fails with the error "user dictionary path is not set.". I think the related code is here, and I want to confirm whether it is OK to fall back to IpadicBuilder for user dictionary parsing when using a local dictionary.

Downloading and decompressing dictionaries takes a lot of time

Hey @mosuka,

We recently faced compilation slowdowns at Meilisearch and investigated. We found out that lindera-ipadic was taking a lot of time, probably to download the mecab-ipadic-2.7.0-20070801.tar.gz tarball from SourceForge.

If you want to see how long it takes on our side, you can run the commands below and open the generated HTML report.

rustup update
cargo +nightly build --timings

But as we can see, the CPU is idle for a long time when it builds.

Rename project

Support output in JSON format

Support output in JSON format #26

Unable to download UniDic from clrd.ninjal.ac.jp

error: failed to run custom build command for `lindera-unidic v0.13.5 (/home/minoru/github.com/lindera-morphology/lindera/lindera-unidic)`

Caused by:
  process didn't exit successfully: `/home/minoru/github.com/lindera-morphology/lindera/target/debug/build/lindera-unidic-0a9382db4954e5bf/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=Cargo.toml

  --- stderr
  Error: Transport(Transport { kind: ConnectionFailed, message: Some("tls connection init failed"), url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("clrd.ninjal.ac.jp")), port: None, path: "/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip", query: None, fragment: None }), source: Some(Custom { kind: InvalidData, error: InvalidCertificateData("invalid peer certificate: UnknownIssuer") }) })

Support user dictionary

Currently, Lindera does not support user dictionaries. Rebuilding the system dictionary to register new terms in the morphological dictionary is too much of a burden for light users.
So we're going to support a simple user dictionary like Kuromoji's.
