lindera-morphology / lindera
A multilingual morphological analysis library.
License: MIT License
It has become difficult to manage the related packages in separate repositories, so we will merge them into one repository.
The target repositories are:
Separate the functionality of the user dictionary contained in the lindera-ipadic-builder
package into separate packages.
For example: lindera-user-dic-builder
Bump up version to 0.7.0.
Currently, we can’t import lindera in the latest version. It doesn’t build, and since the change has been pushed as a minor version, it probably broke every project relying on lindera.
...
Checking lindera-decompress v0.13.5
Checking bstr v0.2.17
Checking lindera-core v0.13.5
Checking csv v1.1.6
Compiling character_converter v2.1.0
Checking lindera-unidic-builder v0.13.5
Checking lindera-ipadic-builder v0.13.5
Checking lindera-dictionary v0.13.5
Checking lindera-ko-dic-builder v0.13.5
Checking lindera-cc-cedict-builder v0.13.5
Compiling lindera-ipadic v0.13.5
Checking lindera v0.13.5
error[E0599]: no variant or associated item named `DictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
--> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:64:40
|
64 | _ => Err(LinderaErrorKind::DictionaryTypeError
| ^^^^^^^^^^^^^^^^^^^
| |
| variant or associated item not found in `lindera_core::error::LinderaErrorKind`
| help: there is a variant with a similar name: `DictionaryLoadError`
error[E0599]: no variant or associated item named `UserDictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
--> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:84:40
|
84 | _ => Err(LinderaErrorKind::UserDictionaryTypeError
| ^^^^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `lindera_core::error::LinderaErrorKind`
For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera` due to 2 previous errors
To reproduce the issue, check out https://github.com/meilisearch/charabia at commit 82c9f3b.
Modify the CLI to allow a user dictionary to be specified.
Hello,
I'm trying to build a tokenizer app that supports Korean and Japanese with the lindera module. It seems Japanese is supported by default, but Korean requires building a dictionary with https://github.com/lindera-morphology/lindera-ko-dic-builder.
Is there a guide on how to use it?
Separate the dictionary into another package.
This is a preparation for using multiple dictionaries in the future.
Some functions in Formatter do not need to be public.
Add GitHub Actions Integration like mosuka/bayard#94 and some refactoring as follows:
The same functions are duplicated in each dictionary builder package.
To ease maintenance, we will prepare traits and implement a dictionary builder structure for each dictionary builder in its own package.
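One way the trait-based layout could look is sketched below. This is only an illustration, not Lindera's actual API: the trait name `DictionaryBuilder`, its methods, and the two unit structs are assumptions made for the example. Shared logic sits in a default method, so each package only supplies what differs.

```rust
/// Hypothetical trait shared by all dictionary builder packages.
pub trait DictionaryBuilder {
    /// Dictionary-specific part: each builder reports its own name.
    fn name(&self) -> &'static str;

    /// Shared part: written once as a default method instead of being
    /// copied into every builder package. Naive comma split, no quoting.
    fn split_row(&self, line: &str) -> Vec<String> {
        line.split(',').map(|field| field.trim().to_string()).collect()
    }
}

/// Stand-in for the builder living in the ipadic builder package.
pub struct IpadicBuilder;

/// Stand-in for the builder living in the ko-dic builder package.
pub struct KoDicBuilder;

impl DictionaryBuilder for IpadicBuilder {
    fn name(&self) -> &'static str {
        "ipadic"
    }
}

impl DictionaryBuilder for KoDicBuilder {
    fn name(&self) -> &'static str {
        "ko-dic"
    }
}
```

Code that works with any dictionary can then take `&dyn DictionaryBuilder` (or a generic parameter) and never needs to know which package the builder came from.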
SystemDict doesn't seem to be used anywhere.
https://github.com/bayard-search/lindera/blob/master/src/core/system_dict.rs
Update workflows.
regression.yml: Run tests on three platforms (Linux/Windows/OSX) for each push/pull request.
periodic.yml: Run tests periodically on the stable/beta/nightly versions of Rust.
release.yml: When a tag is created, release it to GitHub and publish to crates.io.
It seems the process that generates the documentation for 0.10 failed.
Avoid building dictionaries not specified in features.
For example, if --features=ipadic, only lindera-ipadic will be built as a built-in dictionary.
The token contains the text and its details, but the details hold only the readings.
They do not contain the part of speech or other information, so these need to be added.
Add a dictionary builder for CC-CEDICT to support the Chinese language.
https://note.com/case_k/n/n88b0ffcefd09
When compiling lindera, we frequently get a build error:
error: failed to run custom build command for `lindera-ipadic v0.10.0`
Caused by:
process didn't exit successfully: `D:\a\milli\milli\target\release\build\lindera-ipadic-caf28ea0e76b9e29\build-script-build` (exit code: 1)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rerun-if-changed=Cargo.toml
--- stderr
Error: Custom { kind: UnexpectedEof, error: TarError { desc: "failed to iterate over archive", io: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" } } }
It seems to be related to dictionaries.
Any idea what the reason could be? Maybe the Google Drive download? 🤔
Yada has extended its maximum offset limit.
Make Lindera available from Python.
Compress dictionaries for morphological analysis by default.
Restore the missing file.
Lindera offers two modes. Change the Lindera CLI so that the mode can be specified.
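A minimal sketch of how a CLI could turn a mode argument into a typed value is shown below. The enum, its variant names (`Normal`, `Decompose`), and the accepted strings are assumptions made for illustration; the issue text does not name the two modes or the flag.

```rust
use std::str::FromStr;

/// Hypothetical tokenization mode selected on the command line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Normal,
    Decompose,
}

impl FromStr for Mode {
    type Err = String;

    /// Accepts the mode name case-insensitively and rejects anything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "normal" => Ok(Mode::Normal),
            "decompose" => Ok(Mode::Decompose),
            other => Err(format!("unknown mode: {}", other)),
        }
    }
}
```

With `FromStr` in place, an argument string can be converted with `arg.parse::<Mode>()`, and an unknown value surfaces as an error message rather than a silent default.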
Currently, parse_dictionary_entry only accepts entries whose length is 11.
However, the length depends on the dictionary builder, e.g. 10 for unidic and 12 for ko-dic.
So the expected length should be passed as an argument.
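The idea of passing the expected length as an argument could be sketched as follows. The function `parse_entry` and the error type are hypothetical stand-ins, not the real `parse_dictionary_entry` signature, and the naive comma split ignores CSV quoting.

```rust
/// Hypothetical error reported when a row has the wrong number of fields.
#[derive(Debug, PartialEq)]
pub struct FieldCountError {
    pub expected: usize,
    pub found: usize,
}

/// Splits one dictionary row on commas and checks the field count against
/// `expected_fields`, which the caller (the dictionary builder) supplies
/// instead of relying on a hard-coded 11.
pub fn parse_entry(row: &str, expected_fields: usize) -> Result<Vec<&str>, FieldCountError> {
    let fields: Vec<&str> = row.split(',').collect();
    if fields.len() != expected_fields {
        return Err(FieldCountError {
            expected: expected_fields,
            found: fields.len(),
        });
    }
    Ok(fields)
}
```

Under this sketch, a unidic builder would call `parse_entry(row, 10)` and a ko-dic builder `parse_entry(row, 12)`, using the counts quoted in the issue.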
Migrate module directory tree from the 2015 edition to the 2018 edition.
Each dictionary builder (e.g. ipadic, neologd, unidic, and ko) writes forward_size to costs[0] and backward_size to costs[1].
However, the load method of ConnectionCostMatrix reads backward_size from conn_data[0].
For IPADIC, IPADIC-neologd, and Unidic, the forward and backward sizes are the same, so the mismatch goes unnoticed.
For ko, however, the two sizes differ, so the cost method returns the wrong value.
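The effect of reading the wrong header slot can be seen in a small sketch. The struct below is a simplified stand-in, not Lindera's actual ConnectionCostMatrix: it assumes the data is a slice of i16 values with forward_size in slot 0, backward_size in slot 1, and the costs stored row by row (one row per forward id) after that.

```rust
/// Simplified, hypothetical connection cost matrix.
pub struct ConnectionCostMatrix {
    backward_size: usize,
    costs: Vec<i16>,
}

impl ConnectionCostMatrix {
    /// Assumed layout: [forward_size, backward_size, costs...].
    pub fn load(conn_data: &[i16]) -> Self {
        let forward_size = conn_data[0] as usize;
        // The fix: backward_size comes from slot 1, not slot 0.
        let backward_size = conn_data[1] as usize;
        let costs = conn_data[2..].to_vec();
        assert_eq!(costs.len(), forward_size * backward_size);
        ConnectionCostMatrix { backward_size, costs }
    }

    /// Looks up the cost for a (forward_id, backward_id) pair.
    pub fn cost(&self, forward_id: usize, backward_id: usize) -> i16 {
        self.costs[forward_id * self.backward_size + backward_id]
    }
}
```

For data `[2, 3, 10, 11, 12, 20, 21, 22]`, `cost(1, 0)` indexes element 3 and yields 20. If backward_size were taken from slot 0 (value 2), the same lookup would index element 2 and yield 12, which is the kind of wrong value described for ko. When both sizes are equal, the two readings coincide and the error stays hidden.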
docs.rs build failure.
The name Mokuzu has a similar pronunciation to mozc, so I want to avoid confusion.
Since this project is a fork of kuromoji-rs, change the name to be derived from kuromoji.
Compiling lindera v0.5.1 (/Users/johtani/IdeaProjects/rust-workspace/lindera-workspace/lindera/lindera)
error[E0599]: no function or associated item named `default_normal` found for struct `lindera::tokenizer::Tokenizer` in the current scope
--> lindera/benches/bench.rs:8:40
|
8 | let mut tokenizer = Tokenizer::default_normal();
| ^^^^^^^^^^^^^^ function or associated item not found in `lindera::tokenizer::Tokenizer`
error: aborting due to previous error
For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera`.
Lindera-ipadic is a requirement of the zola static website generator written in Rust.
During the zola build, it fails while building lindera-ipadic with this error:
memory allocation of 805306368 bytes failed
error: could not compile lindera-ipadic.
Environment: Raspberry Pi 4B, 4GB memory, debian.
I have tried to give it more contiguous memory by rebooting and trying again with a fresh system and no user apps running. Even then, the system apparently can't give it 800MB (!) of presumably contiguous memory. free -mh shows 2.7GB free, but not contiguous, I imagine.
Zola developers have asked me to report this to you. They do not think lindera-ipadic requires 800MB to build.
Thanks.
Move lindera-cli to another repository.
Currently, the lindera-cli package is managed in the lindera repository as a member of the workspace.
Keep the lindera repository for library crates only and move binary crates like lindera-cli to a separate repository.
Publish on crates.io.
But cargo publish failed due to the following error:
error: api errors (status 200 OK): max upload size is: 10485760
PR #139, introduced in v0.9.0, makes LZMA (rust-lzma or lzma-rs) a mandatory dependency.
This forces all users to install the external library liblzma to be able to compile Lindera.
In comparison, v0.8.1 only requires adding lindera to the project's Cargo.toml.
In Meilisearch we plan to use Lindera to tokenize Japanese texts, but we don't want to ask our users to install external libraries manually, in order to keep Meilisearch easy to install and easy to use.
Thanks for maintaining Lindera 😊
When using a non-compressed local dictionary together with a user dictionary, build_user_dict fails with the error "user dictionary path is not set." I think the related code is here, and I want to confirm whether it is OK to fall back to IpadicBuilder for parsing the user dictionary when a local dictionary is used.
Hey @mosuka,
We recently ran into compilation slowdowns at Meilisearch. After investigating, we found that lindera-ipadic was taking a lot of time, probably to download the mecab-ipadic-2.7.0-20070801.tar.gz tarball from SourceForge.
If you want to look at the time it takes on our side, you can run the commands below and open the generated HTML report.
rustup update
cargo +nightly build --timings
But as we can see, the CPU is idle for a long time when it builds.
Support output in JSON format #26
error: failed to run custom build command for `lindera-unidic v0.13.5 (/home/minoru/github.com/lindera-morphology/lindera/lindera-unidic)`
Caused by:
process didn't exit successfully: `/home/minoru/github.com/lindera-morphology/lindera/target/debug/build/lindera-unidic-0a9382db4954e5bf/build-script-build` (exit status: 1)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rerun-if-changed=Cargo.toml
--- stderr
Error: Transport(Transport { kind: ConnectionFailed, message: Some("tls connection init failed"), url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("clrd.ninjal.ac.jp")), port: None, path: "/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip", query: None, fragment: None }), source: Some(Custom { kind: InvalidData, error: InvalidCertificateData("invalid peer certificate: UnknownIssuer") }) })
Currently, Lindera does not support user dictionaries. Rebuilding the system dictionary to register a new term in the morphological dictionary is too much of a burden for light users.
So we're going to support a simple user dictionary like the one in Kuromoji.