Code Monkey home page Code Monkey logo

Comments (43)

joshtriplett avatar joshtriplett commented on May 31, 2024 15

I sincerely hope that the version of bstr after 1.0 is simply std (or, more precisely, core and alloc as appropriate).

Thank you for such a fundamental building block of the ecosystem.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024 6

do the BStr and BString types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.

I think so. There are two main use cases I envision for them:

  1. They provide a target for trait impls that is distinct from &[u8] and Vec<u8>. This is significant because the latter may be covered by blanket impls. (This is why serde_bytes exists. One can use bstr instead of serde_bytes since bstr has optional serde support.
  2. You can use BString and &BStr types internally instead of Vec<u8> and &[u8] to automatically benefit from their std::fmt::Debug impls. That way, you don't have to write custom debug impls.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024 6

Folks, I've updated the top comment of this issue to include a list of breaking changes (all of which are minor). My plan is to push out the 1.0 release on Monday, July 11, 2022. So please, if you have comments or feedback or other ideas for breaking changes, now is the time to do it.

I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.

For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself. (Which ended up taking almost a full day of work.) It's important to remember this when offering to maintain a project. It's not as simple as me just clicking a button and giving someone permissions. :-)

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024 4

Done!

1.0 tag: https://github.com/BurntSushi/bstr/releases/tag/1.0.0

Blog post: https://blog.burntsushi.net/bstr/

from bstr.

lopopolo avatar lopopolo commented on May 31, 2024 3

Another place that BString and &BStr are useful is to wrap bytes before using them in assert_eq! in tests to get more readable debug output.

from bstr.

thomcc avatar thomcc commented on May 31, 2024 2

My main question would be: do the BStr and BString types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.

That said, part of the reason I don't use them much has been due to the fact that the library is pre-1.0 and these seem the most likely item to be removed to me.

In principal I like the idea of being able to specify something is a BString rather than a Vec<u8> -- in compound types like Vec<Vec<u8>> it can be a bit non-obvious what this represents, whereas Vec<BString> is much more clear (to me, anyway).

I think the main other changes I can think of would be API additions, and thus compatible.

from bstr.

Byron avatar Byron commented on May 31, 2024 2

For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself.

That's certainly true, yet I find it worth clarifying that my offer wasn't limited to the 1.0 release. In any case, thanks for this wonderful crate and the continued maintenance efforts!

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024 1

Ah perfect. My time is super limited and devoting pretty much all of what little free time I have to regex. But I'll do my best to take a detour and whip bstr into a 1.0 release. I think it has had more than enough time to bake and there don't appear to be any obvious design flaws when compared to alternatives.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024 1

dddd0 on reddit asked:

Might as well ask here: Why does for_byte_record_with_terminator in BufReadExt consume self / the underlying BufRead? It doesn't seem to be necessary to me.

Indeed, this is I believe a good point. In fact, all of the methods on BufReadExt consume self. I believe the only ones we actually want to consume self are byte_lines and byte_records, since they each return an iterator generic over Self. But the other methods all execute closures, and so I suspect &mut self would be more appropriate there. (It's likely this isn't a show-stopping issue in practice since you can probably always convert your T: BufReadExt to a &mut T: BufReadExt and get away with calling the for_ methods without actually consuming your reader, but it does look like a wart.)

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024 1

I found myself creating little helper methods for that throughout the codebase.

Can you create a new issue with these? That is, just copy and paste your helper routines. (Or if they fit better in an existing issue, that's fine too.)

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024 1

With respect to MSRV, if Rust 1.60 is too new for some folks, then I suppose they could stick with bstr 0.2 until Rust 1.60 is no longer too new.

from bstr.

TethysSvensson avatar TethysSvensson commented on May 31, 2024 1

How do you feel about Rust 1.60 as the MSRV?

I think it's probably the right choice in this situation, given that you are trying to stabilize.

In planus, our base policy is to only bump the MSRV to versions that are at least six months old, which would mean that we could not start using bstr until October. That being said, it's a policy we try to follow, but break when there is sufficient reason to do so.

(While writing this, I am also realizing that our MSRV simultaneously says 4 versions of rust and 6 months, which are two very different things)

from bstr.

Lokathor avatar Lokathor commented on May 31, 2024 1

September update on Rust versions in the Linux world:

1.60 seems a completely reasonable MSRV to start with for 1.0. Linux distros will catch up to 1.60 "very soon", and the 0.2 version is still usable until then.

You could also document that the MSRV will stay up to date with Debian Stable once they eventually get past 1.60, or whatever other reference point.

from bstr.

lopopolo avatar lopopolo commented on May 31, 2024 1

Congrats @BurntSushi! This is huge. Thank you so much for all of the effort put into this high quality, foundational crate. bstr makes so much possible for me and my projects. πŸ™‡

from bstr.

thomcc avatar thomcc commented on May 31, 2024

Yep, that all makes sense to me, and I hadn't considered point 1 at all.

from bstr.

Byron avatar Byron commented on May 31, 2024

Now that gitoxide is thinking about stability as well it became clear that without a 1.0 bstr release this couldn't ever happen. I am glad to see this is actively explored here and will be looking forward to the eventual 1.0 release.

For reference, not using &BStr in gitoxide would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding. Turning these back into &[u8] would be something so regrettable that it simply has to be avoided.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

For reference, not using &BStr in gitoxide would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding.

Could you say more about this? In particular, I would like to hear more about its importance in public APIs.

from bstr.

Byron avatar Byron commented on May 31, 2024

A good example for its use is a freshly parsed commit whose memory is backed by a buffer somewhere else. The goal is to make clear that these fields represent user-readable strings, but in Rust there is no way to do this without enforcing an encoding or a custom type. That custom type is BStr and once I found it there was no way I would roll my own.

It also works nicely for the mutable version of such a commit.

When testing, working with BString/BStr felt natural and worked exactly as if one would expect.

Note that I put the bstr crate into public APIs knowing it goes against the prominent advice provided in the readme file, that's how indispensable it was πŸ˜….

From there it only proliferated and it's now used in no less than 13 gitoxide crates.

Could you say more about this? In particular, I would like to hear more about its importance in public APIs.

To find a more concise answer: bstr is just too practical and there is a huge void for strings without known encoding in Rust that it seems to fill well for my use.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

@Byron Ah wow, that is great feedback. I agree that using BStr/BString in those APIs is probably the right thing, because otherwise the derived Debug representation of those types is going to be totally useless I imagine.

I don't think there is much work to be done to get bstr to 1.0. I'll do my best to prioritize it in the next month or so.

When do you plan to release 1.0 versions of crates that depend on bstr?

from bstr.

Byron avatar Byron commented on May 31, 2024

from bstr.

chrysn avatar chrysn commented on May 31, 2024

Having this stable would also enable its use with the riot-wrappers crate, where for example process names are exposed (they're by all probability ASCII even, but an unsafe assume-it-is opens up for UB if a C user on the same system does something weird, and converting them to &str means pulling in UTF-8 checks, and either fallible operations or panics).

No hurry from me (I'll need to wait with this for my next API bump anyway), just a data point, and thanks for working on this!

from bstr.

Byron avatar Byron commented on May 31, 2024

Since there hasn't been a lot of movement in the past 9 months and since bstr is a very, very important dependency for gitoxide and probably countless other crates, I wonder what I can do to help. The reason for me getting a bit more pushy is the ongoing work to get gitoxide into cargo.

I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.

Thank a lot :)!

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

Note that I've published 1.0.0-pre.1 to crates.io. It's not on docs.rs yet though.

from bstr.

omac777 avatar omac777 commented on May 31, 2024

I read and understand you want bstr to allow for more possibilities around bytes:
-string of bytes conforming to utf-8
-string of bytes non-conforming to utf-8, but it's encoding is yet to be determined.

Yesterday, an important and lengthy cloud backup failed to complete. Why? Because the name of either the file or directory contained MACOS emojis and the backup's destination service does not like those kinds of characters(non-conformant). It's related because the file system on MACOS/UNIX/BSD accepts those character ranges be it visible or invisible. Linux xfs/btrfs/ext4 don't have any issues with those character ranges either.

Here is short specific example of an error caused by non-conformance. It may not seem like a big thing, but it is considering it's for a company's backup of data and it delays the backup for the entire set of data. Human intervention is necessary at this point.

rsync --archive someSourceDir/ /mnt/tapedrive/someDestinationDir/

within some subdirectory however there is a file called "blah:foo" notice the colon in there. The backup continues for hours then errors out on the ':' character and does not backup the file in question. If it's a directory name with a colon, it does not backup the directory in question. In other words it complicates matters. Why is that? LTFS formatted tape file systems don't like that ':' colon character in file/directory names. So the use case scenario with an emoji in a filename or directory name brought up another error but this time not on tape backups, but on cloud backups. These are big deals on traditional UNIX-based file systems, but it's a big deal on anything that isn't. I'm guessing AWS S3 and other cloud services have issues like this.

The above brings me to suggest these in the bstr API for helping determine if filenames and directory names are going to be ok before they land on said backup storage destination:
bstr::Is_LTFS_Conformant()
bstr::Is_S3_Conformant()
bstr::Is_Google_Cloud_BitbucketName_Conformant()
bstr::Is_Optical_UDF_FS_Conformant()

Oddly enough the only way I've found to circumvent these issues is by tar'ing someSourceDir and then rsync'ing the tar to said destination backup storage, but that's not always the ideal situation. It would be better to find those non-conformant filenames/directory names beforehand and propose some auto-correcting measures to conform to the respective storage range of characters and such.

Why am I mentioning it here? It's because strings, bstr, path are the usual go-to types for holding filenames and directory names. It would be good to have such helper tools available to help everybody save time when doing their own backups without experiencing any errors.

Also to further prevent these kinds of interoperability issues, why not place in the filesystem drivers new constraints that disallow emojis and other unwanted character ranges as filenames and directory names from the beginning. i.e. the openfile/createfile would disallow ':', emojis and such on every operating system. Get all the operating system and storage providers to collaborate on this once and for all.

Thank you for listening. Cheers.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

@omac777 Let's please try to keep this thread focused on 1.0 concerns. This isn't an RFC for any arbitrary feature requests, but issues specific to a 1.0 release. I've moved your comment into #109 and responded there.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

OK, the docs for what's currently on master have been published: https://docs.rs/bstr/1.0.0-pre.1/bstr/

(I haven't made the changes to stop consuming self for some of the methods on BufReadExt yet.)

from bstr.

Byron avatar Byron commented on May 31, 2024

I think helpful to address would be issues that could potentially lead to breaking changes before 1.0. Not truly understanding if this is the case or not with what I am about to state, let me share one convenience issue in particular: I am having trouble sometimes to create a BStr instance, usually in situations where type inference bails out preventing "string".into() to work.

Recently I saw code that had to do [].as_bstr() to get an empty &BStr because type inference wouldn't know that a BStr was required. Maybe it's easy enough to have a constructor for empty BStr, and also a function to create new BStr instances directly, like bstr::BStr::empty() or bstr::BStr::new("hello"), the latter similar to std::path::Path::new(…).

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

@Byron I think these are the related issues for that, right? #84 and #86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.

Basically, when I created BString and BStr, I specifically tried to leave those types as devoid of methods as I could, since their principle purpose is to act as a Deref point for Vec<u8> and &[u8], respectively, and as a target for trait impls that want "byte string" specific behavior (like std::fmt::Debug, serde::Serialize and serde::Deserialize). Adding more methods on to them makes the deref situation a little more dicey. But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.

One possible "idiom" here is that I believe B("").as_bstr() will always work. But if we can convince ourselves that a BStr::new is OK (with basically the same signature as B, except it returns a &BStr), then I'd be more than happy to add that. Thankfully, it's not a breaking change though.

from bstr.

Byron avatar Byron commented on May 31, 2024

@Byron I think these are the related issues for that, right? #84 and #86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.

#84 is definitely related, maybe having other ways to create a &BStr would reduce some pressure there.

But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.

Thanks for your openness! It's certainly not the case here either as I am far from an actual investigation, all I have is nearly every-day usage and some returning pains. I think I was struck by fear that fixing any of these would lead to another breaking change which better happens soon then.

One possible "idiom" here is that I believe B("").as_bstr() will always work. But if we can convince ourselves that a BStr::new is OK (with basically the same signature as B, except it returns a &BStr), then I'd be more than happy to add that.

BStr::new() seems like the most intuitive, and for now it doesn't appear to conflict with the Deref targets either. I'd be willing to deal with it when it happens, and take the risk, for more convenience now. It's certainly a small thing as well, I found myself creating little helper methods for that throughout the codebase.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

I've added another breaking change: restructure serde feature flags

Now that bstr has an 'alloc' feature, we need to rethink how we setup
the serde feature flags. Previously, all we had was 'std' and 'no std'.
But now we have 'std', 'alloc only' and 'core only'. In particular, 'no
std' is split into 'alloc only' and 'core only', since neither one bring
in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std',
and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

I've added another breaking change: several methods on BufReadExt now take &mut self instead of self.

from bstr.

epage avatar epage commented on May 31, 2024

Now that bstr has an 'alloc' feature, we need to rethink how we setup
the serde feature flags. Previously, all we had was 'std' and 'no std'.
But now we have 'std', 'alloc only' and 'core only'. In particular, 'no
std' is split into 'alloc only' and 'core only', since neither one bring
in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std',
and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.

Would it be worth bumping the MSRV to 1.60 and use weak / namespaced feature flags?

This way you have

  • serde1 could possibly be named serde
  • std depends on serde?/std
  • alloc depends on serde?/alloc
  • Optional dependencies that are not meant to be features can be properly hidden with dep:

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

@epage Yeah I mused about this here: #111 (comment)

Basically, I kind of feel like 1.60 is really way too new. My first approximation is to track Debian stable (which is at Rust 1.48), but in principle, I'm okay with something newer.

Unfortunately, there does seem to be a fundamental conflict here. Namely, if we move forward with the existing trichotomy of serde features, then it's going to have to stay that way since I think moving to the scheme you propose would be a breaking change. (Not the MSRV bump, that's fine, but changing the features around.) Since I don't see myself putting out a bstr 2.0 any time soon, it would effectively freeze bstr into a slightly sub-optimal feature configuration. But if I do adopt the nicer feature setup, then I'm being somewhat aggressive with the MSRV.

I personally don't mind being a bit aggressive, but I suppose it depends on my users.

@lopopolo @thomcc @Byron @TethysSvensson How do you feel about Rust 1.60 as the MSRV?

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

Given there's a question open here, I'm going to delay the 1.0 release another week.

from bstr.

epage avatar epage commented on May 31, 2024

Speaking of

ByteVec::into_os_string now returns Result<OsString, FromUtf8Error> instead of Result<OsString, Vec>.

In the docs, it says

  1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8 by accessing their underlying 16-bit integer representation. Unfortunately, this isn’t zero cost (it introduces a second WTF-8 decoding step) and it’s not clear this is a good thing to do, since WTF-8 should ideally remain an internal implementation detail.
    ...

While this library may provide facilities for (1) in the future, currently, this library only provides facilities for (2) and (3).

(1) could be simplified, depending on what is done with OsStr, like rust-lang/rust#95290

Should the existing functions have more specific names to open the door for the more general functions to be added later, if possible?

Non-blocker for 1.0: would it also be worth pointing people to os_str_bytes to help people who want (1)?

EDIT: Now split out as #115, #116

from bstr.

epage avatar epage commented on May 31, 2024

As a usability note, I wrote an application with BStr in the internal APis and kept getting tripped up that I needed as_bstr() or .map(ByteSlice::as_bstr) at the end of any transformation I did on a BStr to get it back into its own type. In the ideal world, the output type would match the input type.

I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.

EDIT: Split out into #117

from bstr.

lopopolo avatar lopopolo commented on May 31, 2024

@BurntSushi 1.60.0 MSRV works for me. Applications I build with bstr target the latest stable and library crates I've built with bstr have a "bump MSRV in minor versions" policy which I'd be happy to exercise to accommodate a higher bstr MSRV.

On a related note, I've been using the weak features and namespaced features in recent cargo and have found them to be quite pleasant to work with. Using serde1 = ["dep:serde", ...] means that a --features serde flag is no longer valid like it is now.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

As a usability note, I wrote an application with BStr in the internal APis and kept getting tripped up that I needed as_bstr() or .map(ByteSlice::as_bstr) at the end of any transformation I did on a BStr to get it back into its own type. In the ideal world, the output type would match the input type.

I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.

Right yeah. That's what bstr 0.1 was: https://docs.rs/bstr/0.1.4/bstr/

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

@lopopolo Yeah the new Cargo feature stuff is awesome. I very much like that it gives more control over what's a public feature.

@epage See #5 and #8 for more details on why bstr 0.1 -> 0.2 went from concrete BString/BStr types to extension traits on Vec<u8>/[u8]. It is slightly sad that if you have a BStr and call an extension trait method that returns a subslice of it that you get a &[u8] back instead. I'm not sure there is any simple machinery that would let me fix that while maintaining the existing extension trait API on [u8].

Happy to answer more questions on that but would prefer opening a new issue for it if you have more questions.

from bstr.

Byron avatar Byron commented on May 31, 2024

How do you feel about Rust 1.60 as the MSRV?

I'd be OK with it, thanks for asking.

from bstr.

lopopolo avatar lopopolo commented on May 31, 2024

Hi @BurntSushi I know this is late in the game for 1.0, but I wanted to float an idea of an optional dependency on simdutf8 for a SIMD-accelerated fast path to crate::utf8::validate:

bstr/src/utf8.rs

Lines 463 to 471 in 0d9d222

/// Returns OK if and only if the given slice is completely valid UTF-8.
///
/// If the slice isn't valid UTF-8, then an error is returned that explains
/// the first location at which invalid UTF-8 was detected.
pub fn validate(slice: &[u8]) -> Result<(), Utf8Error> {
// The fast path for validating UTF-8. It steps through a UTF-8 automaton
// and uses a SIMD accelerated ASCII fast path on x86_64. If an error is
// detected, it backs up and runs the slower version of the UTF-8 automaton
// to determine correct error information.

I'm not sure how you'd want to structure this, but maybe an optional performance feature a la regex (that maybe does nothing for 1.0 but leaves the possibility for additional deps open in subsequent releases?).

Edit: coming back to this, I'm actually not sure whether I should expect the DFA in bstr or SIMD routines in simdutf8 to be faster. I just associate SIMD with 🏎️

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

Yeah that's not a 1.0 concern. I'm unlikely to take a dependency for that. Instead, I would rather see the implementation ported to bstr or maybe even memchr. That said, I've never looked at it in depth, so I'm unsure of its complexity.

Feel free to open a new issue about this to avoid cluttering up the 1.0 release thread.

For this thread, I would really like to keep it scoped to 1.0 concerns.

from bstr.

BurntSushi avatar BurntSushi commented on May 31, 2024

OK, bstr 1.0.0-pre.3 is up: https://docs.rs/bstr/1.0.0-pre.3/bstr/

I've rejiggered/simplified the features (just a serde feature and no more serde1-{std,alloc,core} goop) and bumped the MSRV to Rust 1.60.

My new 1.0 release date is September 6, 2022. If there are any last comments on backward incompatible changes, now is the time to make them. :)

from bstr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.