Comments (3)
I'd be in favour of providing both index bases, if they are already available internally. This avoids re-parsing bytes to characters on the user side.
from nlprule.
Do you actually use character indices or byte indices in `cargo-spellcheck`? Would you have to convert from byte indices to char indices if the `Suggestion` `.start` and `.end` indices were byte indices?
if they are already available internally.
For internal nlprule computation char indices are never needed (I think) so converting from char to byte as early as possible (i.e. when building the binaries) is possible. Using byte indices everywhere in Rust and only converting from byte to char at the boundary to Python made sense to me.
But you're right, it's worth thinking about providing both in the public API. I agree that that would make it nicer for a user. But for computation in nlprule I would like to be consistent in whether byte or char indices are used and ideally use bytes.
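To illustrate what "converting from byte to char at the boundary" could look like, here is a minimal sketch. `byte_to_char_range` is a hypothetical helper written for this example, not a function from nlprule, and it assumes the byte offsets fall on UTF-8 char boundaries.

```rust
/// Hypothetical helper (not part of nlprule): converts a byte range into
/// the equivalent char range for the same text.
///
/// Panics if either offset is not on a UTF-8 char boundary, because
/// slicing a `str` at such an offset panics.
fn byte_to_char_range(text: &str, byte_start: usize, byte_end: usize) -> (usize, usize) {
    // The number of chars in the prefix is the char index where the span starts.
    let char_start = text[..byte_start].chars().count();
    // Add the number of chars inside the span to get its exclusive char end.
    let char_end = char_start + text[byte_start..byte_end].chars().count();
    (char_start, char_end)
}
```

For the text `"héllo wörld"`, the word `wörld` occupies bytes 7..13 but chars 6..11, so `byte_to_char_range("héllo wörld", 7, 13)` returns `(6, 11)`. Each call is linear in the length of the text, which is one reason to do the conversion once at the API boundary rather than repeatedly on the user side.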
Do you actually use character indices or byte indices in `cargo-spellcheck`? Would you have to convert from byte indices to char indices if the `Suggestion` `.start` and `.end` indices were byte indices?
Yes, it's converted early on into character indices, and soon™ will be grapheme-aware as well, but that can be a layer on top of characters. It simplifies iteration significantly and is also required to properly align to spans provided by `syn` and `ra`, iirc.
For internal nlprule computation char indices are never needed (I think) so converting from char to byte as early as possible (i.e. when building the binaries) is possible. Using byte indices everywhere in Rust and only converting from byte to char at the boundary to Python made sense to me.
I am just saying that having character based APIs is a nice feature, since it won't break with simple emojis which are multibyte characters.
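A small sketch of why this matters with emojis, going in the other direction (char indices to byte indices). `char_to_byte_range` is a hypothetical helper for illustration only and is not taken from nlprule or cargo-spellcheck.

```rust
/// Hypothetical helper (not part of nlprule): converts a char range into
/// the equivalent byte range, or `None` if the range is out of bounds
/// or reversed.
fn char_to_byte_range(text: &str, char_start: usize, char_end: usize) -> Option<(usize, usize)> {
    if char_start > char_end {
        return None;
    }
    // Byte offset of every char boundary, including the end of the string.
    let boundaries: Vec<usize> = text
        .char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()))
        .collect();
    Some((*boundaries.get(char_start)?, *boundaries.get(char_end)?))
}
```

With `let text = "I \u{1F980} Rust";` the crab emoji is a single char but four bytes, so the word `Rust` sits at chars 4..8 and bytes 7..11. Using the char index 4 directly as a byte offset would land inside the emoji (`text.is_char_boundary(4)` is `false`), and slicing there would panic.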
But you're right, it's worth thinking about providing both in the public API. I agree that that would make it nicer for a user. But for computation in nlprule I would like to be consistent in whether byte or char indices are used and ideally use bytes.
Related Issues (20)
- Modularizing the crate HOT 4
- Make rayon optional
- Token as returned by pipe() is relative to the sentence boundaries HOT 6
- Improve loading speed (of regex?) - cli usecase HOT 13
- Usability of the rules API degraded from 0.4.6 to 0.5.1 HOT 1
- oob access since 0.5.3 HOT 6
- Support for older glibc HOT 8
- Grammar check fails HOT 3
- panic in `Regex::regex()` HOT 5
- Compile error in build.rs from README.md HOT 3
- Support Rules written in Rust HOT 1
- Coalesced words - tokenization HOT 1
- Readme link to languagetool HOT 1
- Support for AnnotatedText HOT 10
- Clarify license statement HOT 5
- Document how to load custom rulesets HOT 4
- Support distinguishing between grammar and style errors HOT 3
- Be more responsible about network requests HOT 2
- Single Or Pural
- Support python 3.11