Comments (9)
Changes 2 and 3 will basically not work with kuromoji since the context near the end/beginning of the string affects how kuromoji segments text.
IMO, the "real" solution would be to make the deconjugator run forwards instead of backwards, but it's a bit hard to implement and test that to see if it's a real solution. There's also the problem that the deconjugator passes around so much memory right now.
Kuromoji doesn't seem to have a noticeable performance impact, mostly because it makes up for the segmentation pass by increasing the size of the chunks that the word splitter iterates over. But I have an overpowered CPU, so I can't really tell that easily unless the performance problem is super bad.
EDIT: Actually for some reason kuromoji makes parsing /faster/.
from spark-reader.
Yea, I would think that kuromoji might even improve the performance, though I haven't tested it.
An alternative to 2 and 3 would be to keep the words matched last time in a HashSet. For every match, before putting it in the slow deconjugator and dictionary searchers, it could check if we've had that word last time.
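A rough sketch of that idea, in case it helps the discussion. Everything here is hypothetical: `MatchCache`, `isWord`, `endParse`, and the `slowLookup` predicate (a stand-in for the deconjugator plus dictionary search) are made-up names, not existing Spark Reader code.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: remembers which candidates matched during the previous parse. */
class MatchCache {
    private Set<String> previous = new HashSet<>(); // matches from the last parse
    private Set<String> current = new HashSet<>();  // matches collected during this parse
    private final Predicate<String> slowLookup;     // stand-in for deconjugator + dictionary search

    MatchCache(Predicate<String> slowLookup) {
        this.slowLookup = slowLookup;
    }

    /** Returns true if the candidate is a known word, checking the cache before the slow path. */
    boolean isWord(String candidate) {
        if (previous.contains(candidate) || current.contains(candidate)) {
            current.add(candidate); // keep it alive for the next parse
            return true;
        }
        boolean found = slowLookup.test(candidate);
        if (found) {
            current.add(candidate);
        }
        return found;
    }

    /** Call once a parse is finished so this parse's matches become "last time". */
    void endParse() {
        previous = current;
        current = new HashSet<>();
    }
}
```

Note that this only caches positive matches; candidates that fail still go through the slow path every time, so a second set of known misses might be needed if those dominate.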
I'm not sure how you could make something that deconjugates run forward since it can't really tell where the dictionary word ends and conjugations start until it reverses them, though I could be wrong. It could certainly be less of a brute force. HashSets can't do wildcard searches like our brains can, but perhaps if it became a Binary Search Tree of sorts that could be possible.
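For what it's worth, Java's `TreeSet` is a balanced binary search tree, and it can already answer "does any stored word start with this prefix?". A minimal sketch, assuming the dictionary headwords fit in memory as plain strings; `PrefixIndex` and its methods are hypothetical names, not existing code:

```java
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch: prefix queries over dictionary words using a sorted set. */
class PrefixIndex {
    private final NavigableSet<String> words = new TreeSet<>();

    void add(String word) {
        words.add(word);
    }

    /** True if at least one stored word starts with the given prefix. */
    boolean hasWordWithPrefix(String prefix) {
        // The smallest stored word >= prefix starts with the prefix
        // whenever any stored word does.
        String candidate = words.ceiling(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }
}
```

With something like this, the splitter could stop extending a candidate as soon as no word has that prefix, though that only covers prefix matching on dictionary forms, not the conjugated endings.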
The performance right now on my Intel Atom laptop, on the master branch at least, is not too great, especially for larger blocks of text.
from spark-reader.
Try out the kuromoji branch, I want to know how it performs on your end. If it works well I'll make a priority of fixing up how I integrated having two build artifacts and make a pull request.
from spark-reader.
It seems the slowdown I've been experiencing is mainly due to Epwing dictionary lookups slowing things down during splitting: without Epwing dictionaries loaded, the text appears near-instantly, as it did before. I had assumed it was the deconjugator, since things have been slow ever since that change, but with University work piling up I never actually went back and thoroughly checked what was going on.
Kuromoji certainly isn't causing any performance issues as far as I can tell, though. I may do some measurements comparing System.nanoTime() before and after splitting text when I'm done looking into the Epwing issue to really check, but I suspect it's faster than without, as it should result in fewer brute-force attempts.
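The measurement itself could look roughly like this, using `System.nanoTime()` around the call. `SplitTimer` and `timeMillis` are hypothetical names, and the workload in `main` is only a placeholder for the real splitting call:

```java
/** Hypothetical sketch: wall-clock timing of a single task with System.nanoTime(). */
class SplitTimer {
    /** Runs the task once and returns the elapsed time in milliseconds. */
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Placeholder workload; swap in the real text-splitting call.
        Runnable splitText = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append('あ');
            }
        };
        // Warm up first so JIT compilation does not dominate the measurement.
        for (int i = 0; i < 5; i++) {
            splitText.run();
        }
        System.out.printf("split took %.3f ms%n", timeMillis(splitText));
    }
}
```

Running the same text several times with and without kuromoji and comparing the numbers would give a fairer picture than a single run.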
from spark-reader.
Try with something pathological, too. Like 721 characters of やっぱり覚えてやがったな. I actually get a slowdown like that, even without an epwing dictionary.
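One way to build such an input, reading "721 characters of" as the phrase repeated and cut off at 721 characters (that reading is an assumption). `PathologicalInput` and `repeatToLength` are hypothetical helper names:

```java
/** Hypothetical sketch: builds a long test input by repeating a short phrase. */
class PathologicalInput {
    /** Repeats the phrase and truncates so the result has exactly `length` chars. */
    static String repeatToLength(String phrase, int length) {
        if (phrase.isEmpty() || length < 0) {
            throw new IllegalArgumentException("need a non-empty phrase and a non-negative length");
        }
        StringBuilder sb = new StringBuilder(length + phrase.length());
        while (sb.length() < length) {
            sb.append(phrase);
        }
        sb.setLength(length); // cut off the overshoot from the last repetition
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = repeatToLength("やっぱり覚えてやがったな", 721);
        System.out.println(input.length());
    }
}
```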
from spark-reader.
It seems that at some point, hasEpwingDef has ended up all over the text splitter. It was originally only supposed to match on the first user segment (so if a word wasn't in EDICT, you'd middle click the start of the word and it would find it in Epwing if available) since having it match on every possible segment would be far too slow. Yet, it seems to be matching on every possible segment now.
I still haven't found the time to go through all the changes in the text splitter you've made with the new deconjugator and now kuromoji. I should really get in there with the debugger and write up Javadocs for all those functions when I figure them out.
Also not sure how you managed to set up this basic/heavy thing. It seems they're projects within the project? And somehow IntelliJ downloaded kuromoji from Maven even though there's no maven pom file. Not sure how that works, but it's pretty cool.
from spark-reader.
Ah, so that's how the epwing thing is supposed to work. The master branch still works that way, right? I removed the "start of phrase" check from the kuromoji branch since I just refactored the word splitter (I had to make a pretty major change to work around something), I can make it avoid using epwing more often if necessary.
Yeah, the basic/heavy stuff is a mess. I'm going to make it so that the normal spark reader artifact works basically the same way it used to, and the "heavy" module links it in directly. I'll make sure to comment more of my code while I'm at it.
IntelliJ is actually pretty good at dealing with Maven projects, yeah, I'm impressed.
from spark-reader.
Well, after merging the recursive deconjugator, the master branch gained the bug. Before that merge, it should only attempt epwing lookup on the very first word and then any additional word that starts with a manual split.
That feature was really only there for the sake of completion for those who liked their Epwing dictionaries (I don't see why; they don't even tell you basic things like what words are nouns or verbs, so words can't be deconjugated) and most words in Epwing show up in Edict anyhow. As much as I dislike it, I still feel like some functionality like it needs to be there for it to have true Epwing integration.
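To make the pre-merge rule concrete, here is a minimal sketch of the check as described above: only try Epwing on the very first segment or on a segment that starts at a manual split. `EpwingGate`, `shouldTryEpwing`, and both parameters are hypothetical; the real splitter may track this differently.

```java
import java.util.Set;

/** Hypothetical sketch: decides whether a candidate segment may use the slow Epwing lookup. */
class EpwingGate {
    /**
     * @param segmentStart index in the text where the candidate segment begins
     * @param manualSplits indices where the user placed a manual split
     * @return true only for the first segment or a segment starting at a manual split
     */
    static boolean shouldTryEpwing(int segmentStart, Set<Integer> manualSplits) {
        return segmentStart == 0 || manualSplits.contains(segmentStart);
    }
}
```

Every other segment would then fall back to EDICT only, which keeps the number of Epwing lookups proportional to the number of manual splits rather than to the number of candidate segments.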
from spark-reader.
I don't doubt that I might've broken it, I don't have an epwing dictionary set up in spark reader. Sure enough, my "initial segment" code doesn't have the firstSection check.
from spark-reader.