Perhaps it's best I write up my bigger planned changes publicly instead of keeping the

Word splitter performance issues about spark-reader HOT 9 CLOSED

laurensweyn commented on June 6, 2024

Word splitter performance issues

from spark-reader.

Comments (9)

wareya commented on June 6, 2024

Changes 2 and 3 will basically not work with kuromoji since the context near the end/beginning of the string affects how kuromoji segments text.

IMO, the "real" solution would be to make the deconjugator run forwards instead of backwards, but it's a bit hard to implement and test that to see if it's a real solution. There's also the problem that the deconjugator passes around so much memory right now.

Kuromoji doesn't seem to have a noticable performance impact, mostly because it makes up for the segmentation pass by increasing the size of the chunks that the word splitter iterates over. But I have an overpowered CPU so I can't really tell that easily unless the performance problem is super bad.

EDIT: Actually for some reason kuromoji makes parsing /faster/.

from spark-reader.

LaurensWeyn commented on June 6, 2024

Yea, I would think that kuromoji might even improve the performance, though I haven't tested it.

An alternative to 2 and 3 would be to keep the words matched last time in a HashSet. For every match, before putting it in the slow deconjugator and dictionary searchers, it could check if we've had that word last time.

I'm not sure how you could make something that deconjugates run forward since it can't really tell where the dictionary word ends and conjugations start until it reverses them, though I could be wrong. It could certainly be less of a brute force. HashSets can't do wildcard searches like our brains can, but perhaps if it became a Binary Search Tree of sorts that could be possible.

The performance right now on my Intel Atom laptop, on the master branch at least, is not too great, especially for larger blocks of text.

from spark-reader.

wareya commented on June 6, 2024

Try out the kuromoji branch, I want to know how it performs on your end. If it works well I'll make a priority of fixing up how I integrated having two build artifacts and make a pull request.

from spark-reader.

LaurensWeyn commented on June 6, 2024

It seems the slowdown I've been experiencing is mainly due to some issue regarding lookup in Epwing dictionaries slowing it down during splitting, as not having Epwing dictionaries loaded has the text appear near instantly as it has before. I've been assuming it's the deconjugator since it's been since that change that it's been slow, but with University work piling up I've never actually gone back and thourougly checked what was going on.

Kuromoji certainly isn't causing any performance issues as far as I can tell though. I may do some measurements comparing System.nanos() before and after splitting text when I'm done looking into the Epwing issue to really check, but I suspect it's faster than without as it should result in less brute force attempts.

from spark-reader.

wareya commented on June 6, 2024

Try with something pathological, too. Like 721 characters of やっぱり覚えてやがったな. I actually get a slowdown like that, even without an epwing dictionary.

from spark-reader.

LaurensWeyn commented on June 6, 2024

It seems that at some point, hasEpwingDef has ended up all over the text splitter. It was originally only supposed to match on the first user segment (so if a word wasn't in EDICT, you'd middle click the start of the word and it would find it in Epwing if available) since having it match on every possible segment would be far too slow. Yet, it seems to be matching on every possible segment now.

I still haven't found the time to go through all the changes in the text splitter you've made with the new deconjugator and now kuromoji. I should really get in there with the debugger and write up Javadocs for all those functions when I figure them out.

Also not sure how you managed to set up this basic/heavy thing. It seems they're projects within the project? And somehow IntelliJ downloaded kuromoji from Maven even though there's no maven pom file. Not sure how that works, but it's pretty cool.

from spark-reader.

wareya commented on June 6, 2024

Ah, so that's how the epwing thing is supposed to work. The master branch still works that way, right? I removed the "start of phrase" check from the kuromoji branch since I just refactored the word splitter (I had to make a pretty major change to work around something), I can make it avoid using epwing more often if necessary.

Yeah, the basic/heavy stuff is a mess. I'm going to make it so that the normal spark reader artifact works basically the same way it used to, and the "heavy" module links it in directly. I'll make sure to comment more of my code while I'm at it.

intellij is actually pretty good at dealing with maven projects, yeah, I'm impressed.

from spark-reader.

LaurensWeyn commented on June 6, 2024

Well, after merging the recursive deconjugator, the master branch gained the bug. Before that merge, it should only attempt epwing lookup on the very first word and then any additional word that starts with a manual split.

That feature was really only there for the sake of completion for those who liked their Epwing dictionaries (I don't see why; they don't even tell you basic things like what words are nouns or verbs, so words can't be deconjugated) and most words in Epwing show up in Edict anyhow. As much as I dislike it, I still feel like some functionality like it needs to be there for it to have true Epwing integration.

from spark-reader.

wareya commented on June 6, 2024

I don't doubt that I might've broken it, I don't have an epwing dictionary set up in spark reader. Sure enough, my "initial segment" code doesn't have the firstSection check.

from spark-reader.

Word splitter performance issues about spark-reader HOT 9 CLOSED

Comments (9)

Related Issues (19)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent