
Comments (6)

BurntSushi commented on May 30, 2024

Could these perhaps be disabled by default so that this crate is more of a drop-in replacement for the standard library's str type?

I think this is more of a philosophical stance, right? If I were to, say, embed the DFA search runtime from regex-automata into bstr and thereby remove the dependencies on regex-automata and byteorder (but probably would need to keep lazy_static), would you still be asking this question? If not, then why not?

Switching the defaults is something I'm possibly open to. (And in particular, this is good timing, since I hope to put out a bstr 1.0 release sometime soonish, and changing this default is a breaking change.) The main problem I have with that is that I'd rather surface Unicode-aware APIs by default. While regex-automata is primarily used to implement the Unicode segmentation algorithms, it's also used in other places, like for the implementation of trim. I also partially anticipate that it may be used for other things such as case conversion, although I'm not sure. If trim were the only concern, then that could be implemented without using regex-automata.

Generally my view here was that things like graphemes, words and sentences should be available by default because---especially graphemes---they are often what you want for correctness reasons. Indeed, part of the motivation for bstr is to serve as a single one-stop-shop for these sorts of Unicode APIs. And this is interwoven with the idea of providing UTF-8-by-convention APIs, because most of the Unicode algorithms in the Rust ecosystem are implemented on &str and it's really hard to adapt or use them on &[u8].
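To make the correctness point concrete, here is a small standard-library-only sketch. It does not use bstr's API; the names `code_points` and `DECOMPOSED_E_ACUTE` are made up for illustration. It shows that counting code points is not the same as counting what a reader perceives as one character.

```rust
/// Counts Unicode scalar values, which is what `str::chars` iterates over.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

/// "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
/// A reader sees a single character here, i.e. one grapheme cluster.
const DECOMPOSED_E_ACUTE: &str = "e\u{0301}";
```

Here `code_points(DECOMPOSED_E_ACUTE)` is 2 and the byte length is 3, even though the text renders as a single "é". A grapheme iterator is what reports it as one unit.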

I think at a high level, I feel like "Unicode by default" is the philosophically better choice in general. regex makes the same choice: everything is Unicode aware by default. Because text is hard and people get it wrong and consistently forget about corner cases. This means that folks who are aware of the trade off and care about slimming their dependency tree will need to take explicit action, and I guess I kind of feel like that's OK. I'd rather that than people who don't possess a deep understanding of text missing out. I grant that I'm being hand wavy here, but it's because reasonable people can probably disagree about what the right default is in this case.

Does it use the unicode features of bstr?

Yes, it uses bstr to trim whitespace via Unicode's White_Space property: https://github.com/BurntSushi/rust-csv/blob/70c8600b29349f9ee0501577284d8300ae9c8055/src/byte_record.rs#L374

(I have been considering removing the use of bstr from csv, since its dependency tree has gotten much bigger than I'd like, and I think the White_Space property is small enough where its handling can just be inlined.)
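As a rough idea of what inlining could look like, here is a minimal sketch. It is not code from csv or bstr; the function names are hypothetical, and it assumes that trimming should simply stop at the first byte sequence that is not valid UTF-8. The code point list is Unicode's White_Space property.

```rust
/// Returns true if `c` has the Unicode White_Space property.
fn is_white_space(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{00A0}'
            | '\u{1680}'
            | '\u{2000}'..='\u{200A}'
            | '\u{2028}'
            | '\u{2029}'
            | '\u{202F}'
            | '\u{205F}'
            | '\u{3000}'
    )
}

/// Decodes the first UTF-8 encoded char of `bytes`, returning it with its length.
/// Returns None if `bytes` is empty or does not start with valid UTF-8.
fn first_char(bytes: &[u8]) -> Option<(char, usize)> {
    for len in 1..=bytes.len().min(4) {
        // The shortest valid prefix is exactly one encoded char.
        if let Ok(s) = std::str::from_utf8(&bytes[..len]) {
            let c = s.chars().next()?;
            return Some((c, len));
        }
    }
    None
}

/// Decodes the last UTF-8 encoded char of `bytes`, returning it with its length.
/// Returns None if `bytes` is empty or does not end with valid UTF-8.
fn last_char(bytes: &[u8]) -> Option<(char, usize)> {
    for len in 1..=bytes.len().min(4) {
        let start = bytes.len() - len;
        // The shortest valid suffix is exactly one encoded char.
        if let Ok(s) = std::str::from_utf8(&bytes[start..]) {
            return s.chars().next().map(|c| (c, len));
        }
    }
    None
}

/// Trims White_Space from both ends of a byte string that is UTF-8 by
/// convention. Invalid bytes are never trimmed (an assumption of this sketch).
pub fn trim_white_space(mut bytes: &[u8]) -> &[u8] {
    while let Some((c, len)) = first_char(bytes) {
        if !is_white_space(c) {
            break;
        }
        bytes = &bytes[len..];
    }
    while let Some((c, len)) = last_char(bytes) {
        if !is_white_space(c) {
            break;
        }
        bytes = &bytes[..bytes.len() - len];
    }
    bytes
}
```

The whole property fits in a single `matches!` expression, which is the sense in which it is small enough to inline without pulling in any tables or an automaton.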

from bstr.

BurntSushi commented on May 30, 2024

For Unicode handling, as the feature name suggests. :-) It's also documented in the README.

lazy_static isn't particularly heavy. regex-automata might be, but its default features are disabled. Its only non-optional dependency is byteorder. When regex-automata is compiled without its default features, it becomes quite light-weight. All it will have is the DFA search runtime. All the DFA building code falls away.

If you're looking for more specifics, then regex-automata is used to implement the grapheme/word/sentence segmentation algorithms. For example, here's the regex for grapheme segmentation. Those regexes are compiled into DFAs and embedded into the executable. They are then loaded via lazy_static.


BurntSushi commented on May 30, 2024

I don't see a ton of room for improvement here to be honest. Pretty much any kind of Unicode handling is always going to require a bit of fat somewhere. In this case, regex-automata is no worse and perhaps even better than what unicode-segmentation does. Namely, there are no separate Unicode tables. Instead, everything is built right into the automaton, which is also minimized (via Hopcroft). Combined with using a sparse representation (which bstr does for the bigger regexes), I'm pretty sure you're getting pretty close to the minimal amount of space needed to implement these algorithms. The trade off here is that there's an extra dependency that you see compiling. I am generally pretty sympathetic to that concern, which is why I've spent a lot of time keeping my dependency trees small, but it is not something that I optimize for at the expense of everything else.

I think moving forward, there is some potential for removing the lazy_static and byteorder dependencies. I'm already exploring the removal of the latter in the 0.2 release of regex-automata, since I will be bumping the MSRV to Rust 1.36 (which includes the endian/integer conversion routines added to std).
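As a sketch of what such a replacement could look like (an assumption about the general shape, not regex-automata's actual code), the standard library's `u32::from_le_bytes` can stand in for a byteorder-style little-endian read. The helper name `read_u32_le` is hypothetical.

```rust
/// Reads a little-endian u32 from the first four bytes of `bytes`.
/// Returns None if fewer than four bytes are available.
fn read_u32_le(bytes: &[u8]) -> Option<u32> {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes.get(..4)?);
    Some(u32::from_le_bytes(buf))
}
```

For example, `read_u32_le(&[0x78, 0x56, 0x34, 0x12])` yields `Some(0x12345678)`, and a slice that is too short yields `None` instead of panicking.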

lazy_static I think will be trickier to remove. In theory, a sufficiently expressive const fn feature should be enough, since loading a DFA into memory is by design simple, cheap and pure with no allocation. The other possibility is if lazy types get added to std, then those could be used instead.
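For reference, the standard library now provides `std::sync::OnceLock`, which covers the "lazy types in std" route. A minimal sketch follows, assuming a made-up format (one header byte holding the state count, then transition bytes) and placeholder bytes rather than a real serialized DFA. The names `Dfa`, `load` and `dfa` are hypothetical and are not bstr's or regex-automata's API.

```rust
use std::sync::OnceLock;

/// Placeholder for serialized automaton bytes embedded in the executable.
/// A real table would be generated ahead of time; these values are made up.
static DFA_BYTES: [u8; 4] = [2, 0, 1, 1];

/// A toy "loaded" automaton that only borrows the embedded bytes.
struct Dfa {
    state_count: usize,
    transitions: &'static [u8],
}

/// Loading is cheap and allocation-free: split off the header, keep a borrow.
fn load(bytes: &'static [u8]) -> Dfa {
    let (head, rest) = bytes.split_first().expect("embedded table is non-empty");
    Dfa {
        state_count: *head as usize,
        transitions: rest,
    }
}

/// Returns the shared automaton, running `load` at most once.
fn dfa() -> &'static Dfa {
    static CELL: OnceLock<Dfa> = OnceLock::new();
    CELL.get_or_init(|| load(&DFA_BYTES))
}
```

Every call to `dfa()` returns the same `&'static Dfa`, which is the role lazy_static plays in the setup described above.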

In theory, memchr could also be made optional, likely at the cost of a significant performance decrease in almost all searching routines in the vast majority of common cases.


tbu- commented on May 30, 2024

Thanks for the thorough answer.

I'm looking for a small crate that makes dealing with almost UTF8 strings nicer, so that I can work with them like with std's str type.

From the crate documentation:

This library bundles in a few more Unicode operations, such as grapheme, word and sentence iterators. More operations, such as normalization and case folding, may be provided in the future.

Take the following from a not-yet user, not really informed about the history of this crate: Could these perhaps be disabled by default so that this crate is more of a drop-in replacement for the standard library's str type?


tbu- commented on May 30, 2024

I went to the top ten reverse dependencies of bstr, the first one is your crate csv:

https://github.com/BurntSushi/rust-csv/blob/70c8600b29349f9ee0501577284d8300ae9c8055/Cargo.toml

Does it use the unicode features of bstr?

The ripgrep-related crates probably use those. Other than your crates, I only see rlua (which does manage to disable default dependencies) and cargo-release (which does not disable the default dependency, but I guess it doesn't use the unicode data either).


BurntSushi commented on May 30, 2024

I'm going to close this out. I still feel largely the same as I did when I wrote my comments above, and I don't see it changing necessarily.

