Spark-Reader

A tool to assist non-native speakers in reading Japanese. Currently still in beta.

Looking to download the latest version? Click here to go to the release page! Feedback is greatly appreciated.

EDICT2 file included for convenience, see this page for more information and its licence. All other code is released under the included licence.

What this is

Spark Reader is intended as an alternative to things like Rikaisama and Chiitrans.

Instead of normal translation software, which attempts to give you an English equivalent of Japanese text, Spark Reader and similar programs give you the tools needed to actually understand the original Japanese sentence by letting you see the readings of Kanji, look up words and get help with things like grammar.

It's mainly designed for use with Visual Novels, but anything you can copy text from will work with Spark Reader (websites, books, etc.)

Features

  • Edict, Epwing and custom dictionaries supported.
  • Customisable: Many aesthetic options are available for things like colours, fonts and sizes, along with learning options like Furigana display or showing your Heisig keywords for Kanji.
  • Known word tracking: Words you know can be shown in blue and have furigana disabled.
  • Heavy Anki integration: Export words, lines and their context for making your own Anki flashcards, and import the words you already know from your existing Anki decks.
  • Multiplayer: Read along with your friends over voice chat and Spark Reader will tell you if you're ahead or behind the others.

Features still in development

  • Powerful and customisable word splitter: Choose between Kuromoji and 'assisted rikaikun' mode, or disable word splitting entirely.
  • Import character names and their readings from VNDB.
  • A built-in, memory based texthooker for programs and games that don't work with tools like ITH.

Build requirements

  • Compiled using IntelliJ and Java 8.
  • Uses eb4j (original, not the one on Github) for Epwing support.
  • Uses JNA for the memory based text hook and other native features.
  • Uses JUnit and Hamcrest for tests.

Building

The contents of src must be built against eb4j (from osdn) and jna (from maven). The correct versions of the libraries are indicated in the IntelliJ project files.

IntelliJ has to be configured correctly. This includes the location of the JDK. The official JDK 8 download from oracle is known to work.

The IntelliJ project will compile, but it assumes that the libraries are located in a particular place on your filesystem already. You can get the libraries from osdn and maven with help from google. Make sure to get the correct versions.

The compiled output must be linked into a .jar file. The IntelliJ project has this set up already, but again, it assumes the libraries are present in the right place on your filesystem. The build extracts the full contents of the libraries into the .jar file.

You don't have to follow the instructions below if you know what you're doing, but they'll work:

  • Download the correct versions of eb4j and jna and place them in the location that the IntelliJ project wants them
  • In IntelliJ, ensure that the Spark Reader.jar artifact is set to be included in the project build
  • Press the project build button
  • If you encounter any build errors, delete and re-clone the repository (keeping the libraries you downloaded) and try again. If you have any source changes, back them up first, but don't carry over any project file changes.

Contributors

laurensweyn, wareya


spark-reader's Issues

Idea: cosmetic-only segmentation with "mouseover" mode

One of the main drawbacks of Spark Reader is the fact that you need to click the interface to interact with the dictionary or word splitter. * A second drawback is that the word splitter acts as a default interpretation rather than as a guide, which is more notable with the Kuromoji branch.

*This is actually really annoying with certain games that use focus logic and raw input to determine whether they received an input event rather than the WM's input bubbling system, so clicking spark reader's overlay might bubble through to the VN and make it advance to the next text box.

The way to handle the first is relatively simple: just add a way for mousing over words to bring up the dropdown menu, and make it work on a character level instead of a word level if word splitting is disabled, basically rikai emulation. I know this is against SR's current UX philosophy, but it's something to consider.
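
For reference, finding the character under the mouse is pretty simple with AWT font metrics. This is just a sketch, not Spark Reader's actual code; 'metrics' would come from whatever component renders the line, and the lookup hook itself is left out:

import java.awt.FontMetrics;

// Sketch only: map a mouse x position to a character index in the rendered
// line, so a rikai-style lookup can start from that character.
static int charIndexAt(String line, FontMetrics metrics, int mouseX, int textStartX)
{
    int x = textStartX;
    for (int i = 0; i < line.length(); i++)
    {
        int width = metrics.charWidth(line.charAt(i));
        if (mouseX < x + width)
            return i; // the mouse is over this character
        x += width;
    }
    return -1; // the mouse is past the end of the line
}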

A way to handle the second is a new idea. Instead of the word splits being actual word splits, they would just be a visual indicator of where the word splitter (or kuromoji) thinks the word boundaries are. This has three advantages:

One, it would make it easier for people inexperienced with Japanese to realize that there are some mistakes the parser can't avoid making.

Two, you would no longer have to click to interact with the word splitter, but people new to japanese that are just outright confused by whatever they're reading would still be able to rely on the word splitter or segmenter for help in splitting up long strings of hiragana.

Three, the kuromoji branch could run without any of the word recognition code needed by the "guided rikai" parsing model. Segments would just be wherever kuromoji puts segmentations between words, and things like 彼がいない would appear like 彼|が|い|ない instead of 彼|がい|ない, without having to rely on any weird heuristics or blacklists at all. The parser would make fewer mistakes, because it's doing less work, or none of its own work at all, just using a segmenter.
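
To make the third point concrete, a cosmetic boundary list could be derived directly from kuromoji's tokens, something like the sketch below (using the kuromoji-ipadic Tokenizer; drawing the markers is left out):

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

import java.util.ArrayList;
import java.util.List;

// Sketch: collect the character offsets where kuromoji ends a token.
// These would only be drawn as visual markers, never used as real word splits.
static List<Integer> cosmeticBoundaries(String line)
{
    List<Integer> boundaries = new ArrayList<>();
    int offset = 0;
    for (Token token : new Tokenizer().tokenize(line))
    {
        offset += token.getSurface().length();
        boundaries.add(offset); // e.g. 彼がいない gets markers after 彼, が, い
    }
    return boundaries;
}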

I think that a real rikai emulation mode is a necessary feature for Spark Reader to be adopted by certain communities, because rikai's UX has a very optimal feel for people who know a lot of japanese. Combining that with segmentation that's only visual would also make it easier for other people to recommend SR, because they wouldn't have to worry about people getting confused by the parser at all as long as they use certain settings.

Finally, it might be a good idea for SR to prompt for the UX style that the user wants the first time it boots (then tell them that they can change it through the options, of course).

Some weird problem with stagr/stagk in jmdict

In JMDict's definition for 昨夜, the two senses are respectively restricted to kanji-only and readings-only, so if you run into the kanji spelling 昨夜, Spark Reader won't pick up on any definitions.

<keb>昨夜</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf21</ke_pri>
</k_ele>
<r_ele>
<reb>ゆうべ</reb>
<re_pri>news1</re_pri>
<re_pri>nf18</re_pri>
</r_ele>
<r_ele>
<reb>ゆう</reb>
<re_restr>夕</re_restr>
<re_pri>news1</re_pri>
<re_pri>nf02</re_pri>
</r_ele>
<r_ele>
<reb>さくや</reb>
<re_restr>昨夜</re_restr>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf21</re_pri>
</r_ele>
<sense>
<stagk>夕べ</stagk>
<stagk>夕</stagk>
<pos>&n-adv;</pos>
<pos>&n-t;</pos>
<gloss>evening</gloss>
</sense>
<sense>
<stagr>ゆうべ</stagr>
<stagr>さくや</stagr>
<s_inf>usu. 昨夜</s_inf>
<gloss>last night</gloss>
</sense>

Possible fixes: ignore reading restrictions when looking up words in pure kanji? Only ignore absolute mismatches like "wrong kanji entirely"? Maybe only if there's no other way to get a sense?
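
The "only if there's no other way to get a sense" option could look roughly like this (a sketch; Sense and restrictedSpellings() are made-up stand-ins, not Spark Reader's real classes):

import java.util.ArrayList;
import java.util.List;

interface Sense
{
    List<String> restrictedSpellings(); // stagk/stagr restrictions, empty if none
}

// Sketch: apply the restrictions normally, but if they filter out every sense
// (as with the kanji spelling 昨夜 above), fall back to showing all senses
// rather than none.
static List<Sense> sensesFor(String spelling, List<Sense> senses)
{
    List<Sense> matching = new ArrayList<>();
    for (Sense sense : senses)
    {
        if (sense.restrictedSpellings().isEmpty()
            || sense.restrictedSpellings().contains(spelling))
            matching.add(sense);
    }
    return matching.isEmpty() ? senses : matching;
}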

Version of Java Required to Run Sparkreader

I'm sure I'm just being an idiot, but I've installed five different Java runtimes, and I can't get the 0.8 release to run. Which version do I need for 64-bit Windows 10?

Again, really sorry to be asking such a dumb question, but I'm stuck.

Edict entry ID is parsed incorrectly

This is a mild issue for now, but it basically breaks compatibility with any files that store edict IDs directly. If this is going to get fixed, it should be fixed ASAP, before spark reader becomes too popular.

According to the format page: http://www.edrdg.org/jmdict/edict_doc.html#IREF02

The field has the format: EntLnnnnnnnnX.
The EntL is a unique string to help identify the field.
The "X", if present, indicates that an audio clip of the entry reading is available from the JapanesePod101.com site.

The current code is something like:

String IDCode = bits[bits.length - 1].replaceFirst("Ent", "")

It should be something like:

String IDCode = bits[bits.length - 1].replaceFirst("EntL", "").replaceFirst("X", "");

Ran into this because I decided to print the ID in the definition popdown and noticed massive negative values, from hashing failed conversions. I did that because I want to make a blacklist for certain surface forms (spellings) to not be allowed to be interpreted as certain Edict entries (e.g. はそう as 半挿).

An alternative is to store the existing interpretation of the ID (where it falls back to a hash) in existing files so that compatibility isn't broken, but to use the real edict id for new stuff like the blacklist I want to make. I wouldn't really like this, but it's a reasonable compromise. You could also detect if preferredDefs has bogus IDs in it and automatically change them to the right version by finding which definition for that word has the same id-string-sans-Ent hash, but that's complicated.
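
For reference, a stricter parse of the field described above could be as simple as this (just a sketch, not what's in the repo):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: parse "EntLnnnnnnnnX" strictly, returning -1 when the field doesn't
// match at all (where the current code falls back to hashing instead).
static final Pattern ENTL_ID = Pattern.compile("^EntL(\\d+)(X?)$");

static long parseEdictId(String field)
{
    Matcher m = ENTL_ID.matcher(field);
    return m.matches() ? Long.parseLong(m.group(1)) : -1;
}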

Question: preferred definitions, should they be based on dictionary (deconjugated) form or surface (as it is in text) form?

Right now they're based on the dictionary form. This makes it a lot faster to set up preferred definitions since you don't have to set it on every conjugation of a given verb you run into, but there are situations where multiple preferred words can end up showing up for the same surface form because of deconjugation. Deconjugation is the only place that the surface form and dictionary form differ, so it seems like an inherent problem to me.

Maybe spark reader could get away with marking definitions as "good" instead of "preferred", so you would have multiple preferred definitions in cases like this, and it wouldn't care about the specific word the definition pops up on. This would also let users basically pull all the definitions they like for a given word to the top of the list, and if you added a "not good" thing too, they could also push definitions they don't like to the bottom. And maybe, just maybe, you could also have actual preferences for specific surface forms, on top of the "good definition / bad definition" thing, which would care about the specific word again.
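
As a rough illustration of the "good / not good" ordering idea (Definition and getId() are made-up stand-ins, and the two sets would hold whatever the user has marked):

import java.util.Comparator;
import java.util.Set;

interface Definition
{
    long getId();
}

// Sketch: sort definitions so "good" ones float to the top and "not good"
// ones sink to the bottom, regardless of which surface form they appear on.
static Comparator<Definition> byPreference(Set<Long> good, Set<Long> notGood)
{
    return Comparator.comparingInt((Definition def) ->
        good.contains(def.getId()) ? 0
        : notGood.contains(def.getId()) ? 2
        : 1);
}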

The issue here is that doing what I described in the above paragraph would make it very complicated to use spark reader effectively, which is why I'm asking about this issue as a question instead of a request.

This is only tangential/unrelated to the blacklist, which is to prevent parsing mistakes like はそう being a single segment, and seems to work well. But the blacklist basically has to be based on the surface form only because that's what it's for, and it feels gross for the blacklist and preferred definitions to use a different version of the same word to determine whether it counts.

For what it's worth, I'm already using my branch with preferred definitions based on the surface form instead of the dictionary form, and it seems to work well, it's just tedious. Being able to mark specific definitions as "good" without caring about the word itself would make it less tedious, but more complicated.

One interesting note is that basing preferred definitions on surface form instead of dictionary form makes them more robust against changes to the deconjugator, since you avoid a specific rare issue: if you mark definitions as preferred for two different dictionary forms, and a deconjugator change later puts both definitions on the same surface form (because that surface form now deconjugates to both dictionary forms), it's basically arbitrary which definition shows up first in the list.

This is basically a special case of the "multiple preferred definitions for the same conjugated word" issue, except that it means the behavior of the program changes under the user's feet instead of behaving differently for a word that they didn't set a preferred definition for yet. And if you set a specific one of those definitions as preferred, what happens to the other one? It was valid for its dictionary form, just for a different word. Is it no longer preferred?

Build instructions?

Trying to build it (using intellij idea and jdk 8) to see what I can do with it, and it looks as though there isn't a project level build configuration. Alright, what if I try to build the .jar artifact manually?

Error:Failed to load project configuration: cannot read file C:\Users\wareya\dev\spark\.idea\modules.xml: C:\Users\wareya\dev\spark\.idea\modules.xml (The system cannot find the file specified)

Said file is listed in the .gitignore

Generating an ant build causes it to not find jna when I go to compile, even though it's in the right place for the project settings screen to find it

I'm not really a java dev, so I don't know what's missing or what I'm doing wrong; my impression was that things like maven are normally used to handle dependencies, but I don't know anything about that. FWIW, I come from c/++ where the normal way of doing things is so awful that I learned to put my dependencies right in my repository and compile them on the fly as needed.

Not registering clipboard changes

I realize support for this was probably dropped years ago, but I figured I'd try anyways on the off chance you might know what's up.

I've had this problem pretty much forever, I think. It happens on every version, as far as I can tell. Spark reader will be reading new text from the clipboard and parsing it just fine, until a point where it seems to stop registering changes in the clipboard, seemingly with no rhyme or reason. This might happen 1 minute in, or 5 minutes in, but it pretty much always happens, and it will not register new text copied for quite a few lines. Sometimes I can kind of force it to get back to working by copying random bits of text, but not always. This bug makes the program very unreliable to use, but I don't know of any tool that does exactly what Spark Reader does, so I've just been putting up with it.

Do you have an inkling what this could be about? For reference, I use plain old Textractor to get my text hooked and copied to clipboard.

Question: old deconjugator

I re-added the old deconjugator, which was straightforward enough.

Then I went to put back the same validity tests the old deconjugator had. Because they're done outside the old deconjugator and reach into StdRule and ValidWord (I purged impliedTag in the new deconjugator, etc), I had to make a lot of changes, and it's honestly kind of nasty; the amount of code is just ridiculous and I had to add unsafe casts.

wareya@4db4b9c

It's up to you if you want me to make the new deconjugator PR with or without the original validity tests. It seems to work fine with the new one, so my vote is for without the original validity tests since they'd just make it harder to maintain.

Here's the commit that re-adds the old deconjugator (without the old validity tests): wareya@ff106ec

Text in brackets being removed from Edict definitions

I imagine this has to do with the new tag parsing system.
I noticed this mainly because all my custom definitions for names end in (first name) or (surname), which no longer displays.
Even then, EDICT uses brackets for a lot of non-tag things too. Entries are numbered if more than one exists, and extra comments are often included as well, or info on special cases etc.
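
One way to sidestep that, sketched here, would be to only strip a bracketed chunk when everything inside it parses as a known EDICT tag, and leave everything else (names, sense numbers, comments) alone. The tag set below is truncated to a few examples:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: drop "(...)" groups only when every comma-separated element inside
// is a recognised tag; keep things like "(surname)" or "(1)" verbatim.
static final Set<String> KNOWN_TAGS =
    new HashSet<>(Arrays.asList("n", "v1", "v5r", "adj-i", "exp", "uk", "P"));
static final Pattern BRACKETED = Pattern.compile("\\(([^)]*)\\)");

static String stripTagsOnly(String gloss)
{
    Matcher m = BRACKETED.matcher(gloss);
    StringBuffer out = new StringBuffer();
    while (m.find())
    {
        boolean allTags = true;
        for (String part : m.group(1).split(","))
            allTags &= KNOWN_TAGS.contains(part.trim());
        m.appendReplacement(out, allTags ? "" : Matcher.quoteReplacement(m.group()));
    }
    m.appendTail(out);
    return out.toString();
}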

This is sort of the result of the annoying way that EDICT2 is formatted... One of the reasons I plan to switch this over to the more detailed XML version at some point. Not a big issue since it will be replaced anyway, but just pointing it out.

Also, the user dictionary saving works, but it isn't adding the new entries to the in-memory data structure. This could probably easily be fixed by going to the dictionary, removing all entries from the 'custom' source, and then adding them all back on clicking save changes. And since Swing isn't thread safe, it doesn't like the definition table being updated by another window, though again this UI may be separate from the Settings screen and perhaps redone for things like VNDB name import later.

Experimental changes

My master branch has these changes right now:

  • A blacklist (spelling-definition pair)
  • More text rendering options (complicates things)
  • Variable width support (works, but very hacky)
  • Sticking to the inside of a window
  • Frequency info (based on spelling+reading unlike rikai's)
  • To get frequency info, I changed how edict definitions are loaded: readings are associated with their spellings instead of being piled together, which should be done anyway (the way edict definitions are loaded right now is basically incorrect)
  • Minor deconjugator fixes

What should be merged? I should clean things up first, but the more I clean up what I add, the more I end up refactoring other parts of spark reader.

User Input improvements

This is a general list of improvements I feel like Spark Reader could use regarding interaction with the UI, to be implemented one day.

  • Keyboard shortcuts. Most games don't seem to respond to unmapped keys, perhaps these could be used to work with Spark Reader to scroll through words, definitions and so on. Being fully controllable by keyboard shortcuts could be great, especially with games that don't like how Spark Reader handles focus, but even some actions to aid users without a scroll wheel (like me when I'm stuck with my laptop's terrible trackpad) can be useful.
  • Click-drag word select support. For manual splitting, perhaps dragging from the start of the word to the end could put split points on both ends and automatically select the word. Right click and drag would do the same but open the right click menu on that word immediately, for quick export for example. I've come close to deciding to implement this one, but I still haven't gotten around to it.
  • A menu bar (like File, Edit, Help etc.) instead of the current '三' button on the top right. This would only appear when the mouse is over the furigana bar and have quick access to commonly changed settings, allow you to hook to VN windows and so on when these become features. With some of the plans I have, the current 三 menu is going to be pretty cluttered, and I haven't found myself using the button anyway; I've just been using the furigana right click shortcut.

Moving over to MVC

So the code is becoming a bit of a mess - the mix of graphics and UI code inside classes that the deconjugator deals with is making things hard to work with. Some features, like a headless text analyser, or scrolling through all definitions continuously, can't really work well with how things are now.

So I've been thinking to convert things over to something like MVC. This is a non-trivial rewrite, which is why I'm putting up this issue as a warning of sorts.

Since exams are in full swing where I live, I won't be able to start on this until early December at the soonest.

It would take some work, but I think it would be worth it in the long run. Many of the features often suggested to me would suddenly become a lot easier, code maintenance would be simpler, providing more UI options becomes far simpler and code could be reused more easily for other projects (e.g. a general purpose offline dictionary tool).
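
For the headless analyser case specifically, the split might end up looking something like this (just an interface sketch to show the direction, with made-up names, not a commitment to any particular structure):

import java.util.List;

// Sketch: the model exposes parsing without knowing anything about Swing, so
// a headless analyser or an alternative UI could reuse it directly.
interface TextAnalyser
{
    List<ParsedWord> parse(String line);
}

interface ParsedWord
{
    String surface();           // the text as it appears in the line
    String dictionaryForm();    // deconjugated form used for lookups
    List<String> definitions(); // matching dictionary entries, already formatted
}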

Many bugs

I'm currently going over and adapting the JMdict parser for my own purposes, and in doing so I keep finding bugs, many small ones but occasionally a bigger one. The overall code quality seems low (about how I used to program 5 years ago so don't feel too offended).

So now I found a bug where re_restr is interpreted wrong. The DTD states "In its absence, all readings apply to all kanji elements." but your code doesn't link any readings to kanji in its absence (e.g. entry 々). I also found places where you throw hard exceptions when just 'yet unencountered items' appear in the XML data, which is always a bad idea for production code (for example JMParser#readCDATA).

Do you want to keep updated about every little thing I find?

Word splitter performance issues

Perhaps it's best I write up my bigger planned changes publicly instead of keeping them hidden away inside a text file.

Anyhow, with the new deconjugator, things have gotten quite slow. Quite a lot of that has to do with the program assuming splitting is a fairly fast operation. As this is no longer the case, some work may be needed to prevent splitting things from scratch every time.

The main changes I'm planning to make are these:

  1. Store how lines are split in the backlog, instead of just the original text.
  2. When the user places a manual split, all unchanged text up to the split (or up to the word intersected by the split) should be copied over from the last split operation.
  3. After reaching the split performed above, when the auto splitter reaches a split that again matches what was there before, the rest can again be copied from the last split operation.

This has the advantage of keeping furigana and selected definitions when adding splits, and keeping splits correct in the log, along with the performance changes.
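
A rough sketch of point 1, assuming split results can be cached per line of text (the names and the splitter hook are made up, not the real classes):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: remember how each backlog line was split so redrawing the backlog
// (or re-splitting after a manual split) doesn't run the slow splitter again.
class SplitCache
{
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> splitter;

    SplitCache(Function<String, List<String>> splitter)
    {
        this.splitter = splitter;
    }

    List<String> segmentsFor(String line)
    {
        return cache.computeIfAbsent(line, splitter);
    }

    void invalidate(String line)
    {
        cache.remove(line); // e.g. after the user places a manual split on it
    }
}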

In addition, not having the UI become unresponsive during splitting could be a good idea, even if it's just a loading icon. In the ideal scenario, splitting happens on a separate thread and words appear on the UI as they are split, letting the user interact with them while splitting is still going, as if the VN is scrolling the text. That would be pretty hard to pull off though, and I have nowhere near enough free time to get that all working.

I'm holding back on these changes due to the work on Kuromoji currently being done, so I'm just sharing future plans for now.

Automatic Line Breaks

This is just a shameless feature request, but it would be really great if longer lines could be broken up to fit within the set max width of the text box. I've encountered lines already that are wider than my 1920px monitor can display, for example.
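
For what it's worth, a minimal version could measure the rendered width and break greedily between characters (Japanese has no spaces, so that's usually acceptable). A sketch using AWT's FontMetrics, not the project's actual rendering code:

import java.awt.FontMetrics;
import java.util.ArrayList;
import java.util.List;

// Sketch: greedily break a line into rows that each fit within maxWidth
// pixels when drawn with the given font metrics.
static List<String> wrapToWidth(String line, FontMetrics metrics, int maxWidth)
{
    List<String> rows = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < line.length(); i++)
    {
        char c = line.charAt(i);
        if (current.length() > 0
            && metrics.stringWidth(current.toString() + c) > maxWidth)
        {
            rows.add(current.toString());
            current.setLength(0);
        }
        current.append(c);
    }
    if (current.length() > 0)
        rows.add(current.toString());
    return rows;
}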

Question: kuromoji

I recently figured out how to use kuromoji, and theoretically speaking, I could see what happens if I force the segmenter to only create a segmentation if it's in a place where kuromoji put one. This would still allow the deconjugator to work, and it would fix cases like Nはこいつ, which my fork (and presumably the original) of spark reader currently splits wrong. Basically it would prevent the parser from making segments that are inconsistent with where kuromoji draws the boundaries between lexemes.

If I wanted to just fix this one example, I would probably add a way to say "if はこ is followed by いつ, rewrite them into は followed by こいつ", which is really awful. Either that or making the segmenter much smarter than it is right now, which would basically reimplement half of mecab or kuromoji.

The problem is, kuromoji is like, really big. The smallest version of it, kuromoji-ipadic, comes in at 13MB. If kuromoji were added in the way described above, it would probably be easy to make two builds of spark reader: one with kuromoji compiled in, one without it. If it were added in an invasive way that lets the segmenter rely on kuromoji more, you couldn't do that unless you were to encapsulate and abstract kuromoji's segmenting behavior and provide a "dummy" version of it if kuromoji is compiled out or not available.
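
The encapsulation mentioned above could be as small as something like this (a sketch with made-up names; the kuromoji-backed implementation would wrap its Tokenizer, and the dummy one keeps the current behaviour when kuromoji is compiled out):

import java.util.HashSet;
import java.util.Set;

// Sketch: hide the segmenter hint behind an interface so a build without
// kuromoji can ship a permissive no-op implementation instead.
interface SegmentHinter
{
    // character offsets where a word boundary is allowed to occur in the line
    Set<Integer> allowedBoundaries(String line);
}

class PermissiveHinter implements SegmentHinter
{
    @Override
    public Set<Integer> allowedBoundaries(String line)
    {
        Set<Integer> all = new HashSet<>();
        for (int i = 0; i <= line.length(); i++)
            all.add(i); // without kuromoji, every position is a legal boundary
        return all;
    }
}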

Spark reader's "dumb" segmenter is very good at separating terms that don't inflect and don't have confusing overlap, but kuromoji is good at syntactically complex stuff like strings of hiragana and conjugations. In the few cases kuromoji might do something wrong with separating terms, like complicated katakana names, I don't know the ways it might go wrong. Presumably, the fact that spark reader lets the user "fix" parsing mistakes would gloss over the problem.

There's also the fact that kuromoji is, itself, more code, so if it were added in a simple way, it would probably just slow the segmenting process down. If kuromoji were added in an invasive way, it might actually make the segmenting process faster, since the only time you would have to look forward in kuromoji's segment list is if you're looking at a segment that can inflect (<-- big deal!)
