odakaui / vocabulist

A vocabulary database for learning Japanese

License: MIT License

Rust 98.32% Python 1.68%

vocabulist

So, you're learning Japanese.

You've mastered Hiragana and Katakana and you've decided to build your vocabulary. You do some research online and everyone says, "Use flashcards!" So, you do a bit of searching and download Anki.

Now, you're all set.

Armed with your knowledge of Hiragana and your trusty Anki install, you set out to create flashcards.

You go online and find yourself a Japanese frequency list. Then, starting from the top, you pick a term, look up the definition on jisho.org, and finally input the data into Anki.

But you're lazy, and after about 100 expressions, you decide that learning Japanese actually isn't for you and decide to pick up Spanish instead.

Or maybe you're more disciplined than the rest of us and input 1000s of terms into Anki. You eagerly start memorizing the terms, only to find that you struggle to learn them with no context.

Or even better, you memorize 1000s of flashcards. However, when you start reading, you find that the terms you've memorized don't show up in the text you're reading.

Well, here's one possible solution to your problems in a sea of infinite solutions.

Introducing vocabulist. A Japanese frequency list personalized just for you.

Simply copy the text you want to read into text file(s).
Import the files into vocabulist.
Turn them into flashcards with a single command.

It's as easy as one, two, three.

vocabulist automatically creates flashcards based on the frequency of the terms in the imported text. So you don't have to worry about spending valuable time learning a term that you'll never see again.
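
The frequency ranking described above can be sketched roughly as follows. This is a minimal illustration and not vocabulist's actual code: it assumes the text has already been split into terms by a tokenizer, and the function name `rank_by_frequency` is hypothetical.

```rust
use std::collections::HashMap;

/// Count how often each term occurs and return the terms ordered from
/// most to least frequent. Ties are broken by comparing the terms
/// themselves so that the output is deterministic.
/// (Hypothetical helper, not part of vocabulist.)
fn rank_by_frequency(terms: &[&str]) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for term in terms {
        *counts.entry(*term).or_insert(0) += 1;
    }

    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(term, count)| (term.to_string(), count))
        .collect();
    // Highest count first, then by term for a stable order.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

Taking the first N entries of the returned list would give the N most frequent terms, which is the set a "generate N flashcards" step would start from.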

CHANGELOG

Please see the CHANGELOG for a release history.

Installation

Currently, the only way to install vocabulist is to clone the repository and build it from source. You need a Rust installation to compile it. vocabulist compiles with Rust 1.44.1 (stable) or newer and tracks the latest stable release of the Rust compiler.

To build vocabulist:

$ git clone https://github.com/odakaui/vocabulist.git
$ cd vocabulist
$ cargo build --release
$ mkdir -p "${HOME}/.vocabulist_rs"
$ mv jmdict.db "${HOME}/.vocabulist_rs/"

Features

Import terms from a .txt file containing Japanese text or a directory of .txt files.
List [x] terms in the database.
Generate [x] flashcards starting from the most frequent.
Sync the database with Anki to avoid creating flashcards for duplicate terms.
Exclude/Include terms in list and flashcard generation functionality.

vocabulist is a work in progress. Right now it has a few shortcomings. It only works with Japanese. It only works with the JMdict sqlite3 database provided in the repository. It only works with Anki.

I have plans to address some of these issues in the future.

vocabulist's Issues

Remove Windows Build from Travis

Speed up build:

Remove the Windows target from Travis.
Only run cargo build --release on tagged commits.
Run cargo build on all branches.

Show help when run without arguments

Right now, nothing happens when the program is run without arguments.

Instead, show help.
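
One way to get this behavior, sketched with the standard library only. The help text and the function name `help_if_no_args` are placeholders made up for illustration; the real command names and any argument-parsing crate the project uses are not assumed here.

```rust
/// Placeholder help text (hypothetical; the real usage string may differ).
const HELP: &str = "Usage: vocabulist <COMMAND>";

/// Given the raw argument list (program name at index 0), return the
/// help text when the user supplied no arguments at all.
fn help_if_no_args(args: &[String]) -> Option<&'static str> {
    if args.len() <= 1 {
        Some(HELP)
    } else {
        None
    }
}
```

In `main`, the arguments could be collected with `std::env::args().collect::<Vec<String>>()` and the returned text printed before exiting.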

Unit/Integration testing

Add unit/integration tests for all code.

Fix error message when the definition could not be filtered by POS

The error message should read:

"Could not filter definition by POS"

Setup distribution pipeline

Figure out how the releases are going to be distributed.

Through cargo?
Through downloads?
Installer script?

Also add GitHub Actions to the project to build and test?

List by jumanpp POS

Right now you can only list expressions. It would be nice to be able to list expressions based on POS.

Right now, since only one tokenizer is supported, it would be jumanpp POS. However, if multiple tokenizers are supported in the future, this will have to change.

Possibly get the POS values from the POS table in the database?

GUI

Implement a GUI for vocabulist.

Pros

Make it more accessible for non developers.
Make document syntax highlighting and hover over definitions possible.
Make it possible to add and edit definitions for terms.

Cons

Harder to keep cross platform?
Larger package size.
More complexity.
Bloat?

If I do implement a user interface, it would probably be built on Electron or another cross-platform framework.

I also have dreams of an iOS and Android app.

Exclude/Include expressions based on POS

Add the ability to exclude/include expressions based on POS.

Right now you can only exclude/include expressions with a file of expressions separated by newlines.
Add the ability to exclude/include expressions based on a file of POS values separated by newlines.
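
A rough sketch of what POS-based exclusion could look like. The `Term` struct, the function names, and the POS labels used in the usage note are illustrative assumptions, not the project's actual types or values.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of a term with its part of speech.
struct Term {
    expression: String,
    pos: String,
}

/// Parse a newline-separated list, trimming whitespace and skipping
/// blank lines.
fn parse_list(contents: &str) -> HashSet<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Keep only the terms whose POS is not in the excluded set.
fn exclude_by_pos(terms: Vec<Term>, excluded: &HashSet<String>) -> Vec<Term> {
    terms
        .into_iter()
        .filter(|term| !excluded.contains(&term.pos))
        .collect()
}
```

For example, a file containing 助詞 on its own line would drop every term tagged 助詞 while leaving terms tagged 名詞 untouched. Including by POS would be the same filter with the condition inverted.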

Fix Flashcard Warnings

Only give "cannot be filtered by kanji" if the term also "cannot be filtered by pos".

Report progress when generating flashcards

Right now there is no feedback when the program is generating flashcards.

Implement a progress bar giving the number of flashcards left to generate.
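
A minimal text-only sketch of such a progress bar. The function name `render_progress` and the output format are assumptions; a dedicated progress-bar crate would work just as well.

```rust
/// Render a textual progress bar that also reports how many
/// flashcards are left to generate.
fn render_progress(done: usize, total: usize, width: usize) -> String {
    let done = done.min(total);
    // Treat an empty job as complete to avoid dividing by zero.
    let filled = if total == 0 { width } else { done * width / total };
    format!(
        "[{}{}] {}/{} ({} left)",
        "#".repeat(filled),
        "-".repeat(width - filled),
        done,
        total,
        total - done
    )
}
```

Printing the result prefixed with `\r` after each generated flashcard would redraw the bar in place on most terminals.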

Create separate file paths for debug and release

Create separate paths for jmdict.db, vocabulist_rs.db and config.toml for debug vs release code.

This way the user can have the release version installed while still being able to test the new version.
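
One possible way to separate the paths, sketched below. The debug directory name `.vocabulist_rs_debug` and both function names are assumptions for illustration; only `.vocabulist_rs` and the three file names come from this page.

```rust
use std::path::{Path, PathBuf};

/// Pick the data directory for a build profile.
/// `.vocabulist_rs_debug` is a made-up name for the debug location.
fn data_dir(home: &Path, debug_build: bool) -> PathBuf {
    if debug_build {
        home.join(".vocabulist_rs_debug")
    } else {
        home.join(".vocabulist_rs")
    }
}

/// Paths to the three files named in the issue, in a fixed order:
/// jmdict.db, vocabulist_rs.db, config.toml.
fn data_files(home: &Path, debug_build: bool) -> [PathBuf; 3] {
    let dir = data_dir(home, debug_build);
    [
        dir.join("jmdict.db"),
        dir.join("vocabulist_rs.db"),
        dir.join("config.toml"),
    ]
}
```

Calling `data_files(home, cfg!(debug_assertions))` would select the debug paths under a plain `cargo build` and the release paths under `cargo build --release`, since `debug_assertions` is enabled by default only in the dev profile.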

Sync database with Anki

Sync database with Anki.

Right now the user has to manually go into the database using sqlite3 and reset the in_anki, is_learned, and is_excluded fields.
The program should do that with the sync command.
Expressions that are in Anki will be marked as in_anki = 1.
Expressions of notes that are suspended in Anki will be marked as is_excluded = 1.
Expressions of notes that are mastered will be marked as is_learned = 1.

There might be a problem with marking expressions as is_learned. Anki could potentially have multiple cards per note. It appears that you can search for a card and then get the note for that card. However, it is unclear how Anki defines mastered. This is potentially an additional feature that should have its own issue.
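
The mapping from Anki state to database flags described above could be sketched like this. The `AnkiNote` and `SyncFlags` types are hypothetical, and because it is unclear how "mastered" should be determined, it is simply taken as an input here.

```rust
/// Hypothetical summary of one note as reported by Anki.
struct AnkiNote {
    suspended: bool,
    /// How this is decided is an open question in the issue.
    mastered: bool,
}

/// The three database fields the sync command would write.
#[derive(Debug, PartialEq)]
struct SyncFlags {
    in_anki: u8,
    is_excluded: u8,
    is_learned: u8,
}

/// Compute the flags for an expression. `None` means the expression
/// has no note in Anki, so every flag is reset to 0.
fn sync_flags(note: Option<&AnkiNote>) -> SyncFlags {
    match note {
        None => SyncFlags { in_anki: 0, is_excluded: 0, is_learned: 0 },
        Some(n) => SyncFlags {
            in_anki: 1,
            is_excluded: u8::from(n.suspended),
            is_learned: u8::from(n.mastered),
        },
    }
}
```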

Parse file and create report

Add the ability to parse a file and spit out a report.

The report should list all of the tokens in the file.
The tokens can be filtered based on in_anki, is_learned, and is_excluded.
The definitions for the token and each sentence for the token will be placed under each token.

It should be possible to output the report to the console, Markdown, HTML, and possibly other formats such as PDF.

Documentation

Add comments on code. Update README.md. Create a wiki?

Transition Tokenizer to Nagisa

Change dependency from Jumanpp to Nagisa.

Decouple code so that a Chinese parser like Lac, Jieba, or HanLP can be used.

Notes with expressions that do not have kanji in them have blank readings

When using the generate command, notes with expressions that do not contain kanji end up with blank readings.

The expected behavior is that the notes will have a reading that is the same as the expression.
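
The expected behavior amounts to a simple fallback, sketched here. The function name `reading_for` is hypothetical, and the sketch assumes the dictionary lookup yields either no reading or an empty one for kana-only expressions.

```rust
/// Use the dictionary reading when there is one; otherwise fall back
/// to the expression itself (the kana-only case).
fn reading_for(expression: &str, reading: Option<&str>) -> String {
    match reading {
        Some(r) if !r.trim().is_empty() => r.to_string(),
        _ => expression.to_string(),
    }
}
```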

Add Reset Definition Action

The dictionary needs to be accessed if a user wants to "reset" the definition of a term.

Originally posted by @odakaui in #30 (comment)

If the definitions are stored in the database along with the terms as in #48, then the user might need to "reset" the definition for a single term or all terms.

They might need to do this for one of the following reasons, or another reason altogether.

They change dictionaries. (from a Japanese-English dictionary to a Japanese-Japanese dictionary)
They add an additional dictionary.
They sync a definition(s) from Anki.
They edit or delete a definition.
They add a definition and want to revert it back to the dictionary definition.

Update Installation Section of README to Include Homebrew Instructions

Add Homebrew Instructions to README

Fix Error when Using Old/Misconfigured Config File

When the config file does not contain the correct fields you get this error.

Error: Error { inner: ErrorInner { kind: Custom, line: Some(3), col: 0, at: Some(124), message: "missing field `backend`", key: [] } }

Make the program fail gracefully.
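
The message above has the shape of an error's `Debug` representation, which is what Rust prints when `main` returns an `Err`. A sketch of a friendlier path is to catch the error and print its `Display` form instead. The function name `config_error_message` and the message wording are assumptions; the sketch works with any error type that implements `Display`.

```rust
use std::fmt::Display;

/// Build a one-line, human-readable message from any error that
/// implements `Display`, rather than dumping its `Debug` form.
fn config_error_message<E: Display>(path: &str, err: E) -> String {
    format!("error: invalid config file '{}': {}", path, err)
}
```

In `main`, the result of loading the config could be matched on, with the `Err` branch printing this message via `eprintln!` and then calling `std::process::exit(1)`.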

Rewrite Tokenizer from Scratch in Rust

I would like to rewrite the tokenizer in Rust so that I do not have to rely on external dependencies.

It will be a separate crate that could be used by itself.
It will be written in Rust.
It will use the unidic or another tokenizer dictionary.
It will return the surface string, normalized version and pos of each word.
It will be fast and efficient.
It will be licensed under either the MIT or Apache 2.0 License.

I'll need to do some research.

I already looked at sudachi clone, but it doesn't appear that you can put the dictionary at any path.
I also looked at yoin but I'd like to try my hand at writing it myself for the learning experience.

toml-config-file

Create a toml config file to store the settings.

Clean up and Refactor code

Change Dictionary Path to /usr/local/share/vocabulist/jmdict.db

Mark notes as learned during database sync with Anki

Expressions of notes that are mastered should be marked as is_learned = 1.

Give Multiple Dictionary Options

Give the user the ability to choose a different dictionary.

Either a Japanese to Japanese dictionary.
A Chinese dictionary.
Any other type of dictionary.

Create a trait for the dictionaries to decouple the code and make it easier to write different dictionary backends.
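
Such a trait might look roughly like the sketch below. The trait shape, the method names, and the in-memory backend are all assumptions for illustration; a real backend would query JMdict or another dictionary database.

```rust
use std::collections::HashMap;

/// Hypothetical interface every dictionary backend would implement.
trait Dictionary {
    fn name(&self) -> &str;
    /// Return all definitions for a term, or an empty list if unknown.
    fn lookup(&self, term: &str) -> Vec<String>;
}

/// Toy backend that keeps its entries in memory.
struct InMemoryDictionary {
    name: String,
    entries: HashMap<String, Vec<String>>,
}

impl Dictionary for InMemoryDictionary {
    fn name(&self) -> &str {
        &self.name
    }

    fn lookup(&self, term: &str) -> Vec<String> {
        self.entries.get(term).cloned().unwrap_or_default()
    }
}

/// Query several dictionaries and keep only those with a result.
fn lookup_all(dictionaries: &[Box<dyn Dictionary>], term: &str) -> Vec<(String, Vec<String>)> {
    dictionaries
        .iter()
        .map(|d| (d.name().to_string(), d.lookup(term)))
        .filter(|(_, definitions)| !definitions.is_empty())
        .collect()
}
```

With this shape, the rest of the program depends only on `Dictionary`, so a Japanese-Japanese or Chinese dictionary could be added as another `impl` without touching the callers.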

Create temporary anki decks for studying a specific file

Add the ability to parse a text file and then create a deck in anki from it.

The deck name could be the file name or something else.
The expressions in the deck wouldn't be added to the database?
Expressions wouldn't be added to the deck if they already exist in anki?
Expressions wouldn't be added if they are mastered in anki?

Pros:

Allow the user to get the definitions for a term without querying a second database.
Speed up listing terms with their definitions.
Allow the user to sync existing definitions for a given term to Anki.
Allow the user to store definitions from multiple dictionaries for a single term.
Allow the user to use multiple dictionaries. (#30)

Cons:

Require changes to the database, specifically the join table.
Increase the amount of time required during an import operation.
Require changes to the way dictionaries are queried.

Sync Command Printing Duplicate Terms to Console

For some reason the sync command is printing three of each term in Anki.

まさか
まさか
まさか
浴びる
浴びる
浴びる

I guess this happens when a note has multiple cards.

Solution: Deduplicate the card list.
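
A minimal sketch of that deduplication, keeping the first occurrence of each term and preserving the original order. The function name is hypothetical, and the sketch assumes the sync step has the terms as a list of strings.

```rust
use std::collections::HashSet;

/// Remove repeated terms while keeping the order of first appearance.
fn dedup_preserving_order(terms: &[String]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut unique = Vec::new();
    for term in terms {
        // `insert` returns false when the term was already present.
        if seen.insert(term.as_str()) {
            unique.push(term.clone());
        }
    }
    unique
}
```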