Comments (43)
I sincerely hope that the version of bstr
after 1.0 is simply std
(or, more precisely, core
and alloc
as appropriate).
Thank you for such a fundamental building block of the ecosystem.
from bstr.
do the BStr and BString types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.
I think so. There are two main use cases I envision for them:
- They provide a target for trait impls that is distinct from
&[u8]
andVec<u8>
. This is significant because the latter may be covered by blanket impls. (This is whyserde_bytes
exists. One can usebstr
instead ofserde_bytes
sincebstr
has optional serde support. - You can use
BString
and&BStr
types internally instead ofVec<u8>
and&[u8]
to automatically benefit from theirstd::fmt::Debug
impls. That way, you don't have to write custom debug impls.
from bstr.
Folks, I've updated the top comment of this issue to include a list of breaking changes (all of which are minor). My plan is to push out the 1.0 release on Monday, July 11, 2022. So please, if you have comments or feedback or other ideas for breaking changes, now is the time to do it.
I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.
For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself. (Which ended up taking almost a full day of work.) It's important to remember this when offering to maintain a project. It's not as simple as me just clicking a button and giving someone permissions. :-)
from bstr.
Done!
1.0 tag: https://github.com/BurntSushi/bstr/releases/tag/1.0.0
Blog post: https://blog.burntsushi.net/bstr/
from bstr.
Another place that BString
and &BStr
are useful is to wrap bytes before using them in assert_eq!
in tests to get more readable debug output.
from bstr.
My main question would be: do the BStr
and BString
types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.
That said, part of the reason I don't use them much has been due to the fact that the library is pre-1.0 and these seem the most likely item to be removed to me.
In principal I like the idea of being able to specify something is a BString
rather than a Vec<u8>
-- in compound types like Vec<Vec<u8>>
it can be a bit non-obvious what this represents, whereas Vec<BString>
is much more clear (to me, anyway).
I think the main other changes I can think of would be API additions, and thus compatible.
from bstr.
For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself.
That's certainly true, yet I find it worth clarifying that my offer wasn't limited to the 1.0 release. In any case, thanks for this wonderful crate
and the continued maintenance efforts!
from bstr.
Ah perfect. My time is super limited and devoting pretty much all of what little free time I have to regex
. But I'll do my best to take a detour and whip bstr
into a 1.0 release. I think it has had more than enough time to bake and there don't appear to be any obvious design flaws when compared to alternatives.
from bstr.
Might as well ask here: Why does for_byte_record_with_terminator in BufReadExt consume self / the underlying BufRead? It doesn't seem to be necessary to me.
Indeed, this is I believe a good point. In fact, all of the methods on BufReadExt
consume self
. I believe the only ones we actually want to consume self
are byte_lines
and byte_records
, since they each return an iterator generic over Self
. But the other methods all execute closures, and so I suspect &mut self
would be more appropriate there. (It's likely this isn't a show-stopping issue in practice since you can probably always convert your T: BufReadExt
to a &mut T: BufReadExt
and get away with calling the for_
methods without actually consuming your reader, but it does look like a wart.)
from bstr.
I found myself creating little helper methods for that throughout the codebase.
Can you create a new issue with these? That is, just copy and paste your helper routines. (Or if they fit better in an existing issue, that's fine too.)
from bstr.
With respect to MSRV, if Rust 1.60 is too new for some folks, then I suppose they could stick with bstr 0.2
until Rust 1.60 is no longer too new.
from bstr.
How do you feel about Rust 1.60 as the MSRV?
I think it's probably the right choice in this situation, given that you are trying to stabilize.
In planus, our base policy is to only bump the MSRV to versions that are at least six months old, which would mean that we could not start using bstr until October. That being said, it's a policy we try to follow, but break when there is sufficient reason to do so.
(While writing this, I am also realizing that our MSRV simultaneously says 4 versions of rust and 6 months, which are two very different things)
from bstr.
September update on Rust versions in the Linux world:
- https://packages.debian.org/bullseye/rustc
- Stable is 1.48
- Testing 1.59
- https://packages.fedoraproject.org/pkgs/rust/rust/
- Fedora 35 is at 1.55
- Fedora 37 is at 1.63
1.60 seems a completely reasonable MSRV to start with for 1.0. Linux distros will catch up to 1.60 "very soon", and the 0.2 version is still usable until then.
You could also document that the MSRV will stay up to date with Debian Stable once they eventually get past 1.60, or whatever other reference point.
from bstr.
Congrats @BurntSushi! This is huge. Thank you so much for all of the effort put into this high quality, foundational crate. bstr makes so much possible for me and my projects. π
from bstr.
Yep, that all makes sense to me, and I hadn't considered point 1 at all.
from bstr.
Now that gitoxide
is thinking about stability as well it became clear that without a 1.0 bstr
release this couldn't ever happen. I am glad to see this is actively explored here and will be looking forward to the eventual 1.0 release.
For reference, not using &BStr
in gitoxide
would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding. Turning these back into &[u8]
would be something so regrettable that it simply has to be avoided.
from bstr.
For reference, not using &BStr in gitoxide would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding.
Could you say more about this? In particular, I would like to hear more about its importance in public APIs.
from bstr.
A good example for its use is a freshly parsed commit whose memory is backed by a buffer somewhere else. The goal is to make clear that these fields represent user-readable strings, but in Rust there is no way to do this without enforcing an encoding or a custom type. That custom type is BStr
and once I found it there was no way I would roll my own.
It also works nicely for the mutable version of such a commit.
When testing, working with BString/BStr
felt natural and worked exactly as if one would expect.
Note that I put the bstr
crate into public APIs knowing it goes against the prominent advice provided in the readme file, that's how indispensable it was π
.
From there it only proliferated and it's now used in no less than 13 gitoxide
crates.
Could you say more about this? In particular, I would like to hear more about its importance in public APIs.
To find a more concise answer: bstr
is just too practical and there is a huge void for strings without known encoding
in Rust that it seems to fill well for my use.
from bstr.
@Byron Ah wow, that is great feedback. I agree that using BStr
/BString
in those APIs is probably the right thing, because otherwise the derived Debug
representation of those types is going to be totally useless I imagine.
I don't think there is much work to be done to get bstr
to 1.0
. I'll do my best to prioritize it in the next month or so.
When do you plan to release 1.0
versions of crates that depend on bstr
?
from bstr.
from bstr.
Having this stable would also enable its use with the riot-wrappers crate, where for example process names are exposed (they're by all probability ASCII even, but an unsafe assume-it-is opens up for UB if a C user on the same system does something weird, and converting them to &str means pulling in UTF-8 checks, and either fallible operations or panics).
No hurry from me (I'll need to wait with this for my next API bump anyway), just a data point, and thanks for working on this!
from bstr.
Since there hasn't been a lot of movement in the past 9 months and since bstr
is a very, very important dependency for gitoxide
and probably countless other crates, I wonder what I can do to help. The reason for me getting a bit more pushy is the ongoing work to get gitoxide
into cargo
.
I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.
Thank a lot :)!
from bstr.
Note that I've published 1.0.0-pre.1
to crates.io. It's not on docs.rs yet though.
from bstr.
I read and understand you want bstr to allow for more possibilities around bytes:
-string of bytes conforming to utf-8
-string of bytes non-conforming to utf-8, but it's encoding is yet to be determined.
Yesterday, an important and lengthy cloud backup failed to complete. Why? Because the name of either the file or directory contained MACOS emojis and the backup's destination service does not like those kinds of characters(non-conformant). It's related because the file system on MACOS/UNIX/BSD accepts those character ranges be it visible or invisible. Linux xfs/btrfs/ext4 don't have any issues with those character ranges either.
Here is short specific example of an error caused by non-conformance. It may not seem like a big thing, but it is considering it's for a company's backup of data and it delays the backup for the entire set of data. Human intervention is necessary at this point.
rsync --archive someSourceDir/ /mnt/tapedrive/someDestinationDir/
within some subdirectory however there is a file called "blah:foo" notice the colon in there. The backup continues for hours then errors out on the ':' character and does not backup the file in question. If it's a directory name with a colon, it does not backup the directory in question. In other words it complicates matters. Why is that? LTFS formatted tape file systems don't like that ':' colon character in file/directory names. So the use case scenario with an emoji in a filename or directory name brought up another error but this time not on tape backups, but on cloud backups. These are big deals on traditional UNIX-based file systems, but it's a big deal on anything that isn't. I'm guessing AWS S3 and other cloud services have issues like this.
The above brings me to suggest these in the bstr API for helping determine if filenames and directory names are going to be ok before they land on said backup storage destination:
bstr::Is_LTFS_Conformant()
bstr::Is_S3_Conformant()
bstr::Is_Google_Cloud_BitbucketName_Conformant()
bstr::Is_Optical_UDF_FS_Conformant()
Oddly enough the only way I've found to circumvent these issues is by tar'ing someSourceDir and then rsync'ing the tar to said destination backup storage, but that's not always the ideal situation. It would be better to find those non-conformant filenames/directory names beforehand and propose some auto-correcting measures to conform to the respective storage range of characters and such.
Why am I mentioning it here? It's because strings, bstr, path are the usual go-to types for holding filenames and directory names. It would be good to have such helper tools available to help everybody save time when doing their own backups without experiencing any errors.
Also to further prevent these kinds of interoperability issues, why not place in the filesystem drivers new constraints that disallow emojis and other unwanted character ranges as filenames and directory names from the beginning. i.e. the openfile/createfile would disallow ':', emojis and such on every operating system. Get all the operating system and storage providers to collaborate on this once and for all.
Thank you for listening. Cheers.
from bstr.
@omac777 Let's please try to keep this thread focused on 1.0 concerns. This isn't an RFC for any arbitrary feature requests, but issues specific to a 1.0 release. I've moved your comment into #109 and responded there.
from bstr.
OK, the docs for what's currently on master have been published: https://docs.rs/bstr/1.0.0-pre.1/bstr/
(I haven't made the changes to stop consuming self
for some of the methods on BufReadExt
yet.)
from bstr.
I think helpful to address would be issues that could potentially lead to breaking changes before 1.0. Not truly understanding if this is the case or not with what I am about to state, let me share one convenience issue in particular: I am having trouble sometimes to create a BStr
instance, usually in situations where type inference bails out preventing "string".into()
to work.
Recently I saw code that had to do [].as_bstr()
to get an empty &BStr
because type inference wouldn't know that a BStr
was required. Maybe it's easy enough to have a constructor for empty BStr
, and also a function to create new BStr
instances directly, like bstr::BStr::empty()
or bstr::BStr::new("hello")
, the latter similar to std::path::Path::new(β¦)
.
from bstr.
@Byron I think these are the related issues for that, right? #84 and #86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.
Basically, when I created BString
and BStr
, I specifically tried to leave those types as devoid of methods as I could, since their principle purpose is to act as a Deref
point for Vec<u8>
and &[u8]
, respectively, and as a target for trait impls that want "byte string" specific behavior (like std::fmt::Debug
, serde::Serialize
and serde::Deserialize
). Adding more methods on to them makes the deref situation a little more dicey. But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.
One possible "idiom" here is that I believe B("").as_bstr()
will always work. But if we can convince ourselves that a BStr::new
is OK (with basically the same signature as B
, except it returns a &BStr
), then I'd be more than happy to add that. Thankfully, it's not a breaking change though.
from bstr.
@Byron I think these are the related issues for that, right? #84 and #86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.
#84 is definitely related, maybe having other ways to create a &BStr
would reduce some pressure there.
But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.
Thanks for your openness! It's certainly not the case here either as I am far from an actual investigation, all I have is nearly every-day usage and some returning pains. I think I was struck by fear that fixing any of these would lead to another breaking change which better happens soon then.
One possible "idiom" here is that I believe B("").as_bstr() will always work. But if we can convince ourselves that a BStr::new is OK (with basically the same signature as B, except it returns a &BStr), then I'd be more than happy to add that.
BStr::new()
seems like the most intuitive, and for now it doesn't appear to conflict with the Deref
targets either. I'd be willing to deal with it when it happens, and take the risk, for more convenience now. It's certainly a small thing as well, I found myself creating little helper methods for that throughout the codebase.
from bstr.
I've added another breaking change: restructure serde feature flags
Now that bstr has an 'alloc' feature, we need to rethink how we setup
the serde feature flags. Previously, all we had was 'std' and 'no std'.
But now we have 'std', 'alloc only' and 'core only'. In particular, 'no
std' is split into 'alloc only' and 'core only', since neither one bring
in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std',
and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.
from bstr.
I've added another breaking change: several methods on BufReadExt
now take &mut self
instead of self
.
from bstr.
Now that bstr has an 'alloc' feature, we need to rethink how we setup
the serde feature flags. Previously, all we had was 'std' and 'no std'.
But now we have 'std', 'alloc only' and 'core only'. In particular, 'no
std' is split into 'alloc only' and 'core only', since neither one bring
in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std',
and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.
Would it be worth bumping the MSRV to 1.60 and use weak / namespaced feature flags?
This way you have
serde1
could possibly be namedserde
std
depends onserde?/std
alloc
depends onserde?/alloc
- Optional dependencies that are not meant to be features can be properly hidden with
dep:
from bstr.
@epage Yeah I mused about this here: #111 (comment)
Basically, I kind of feel like 1.60 is really way too new. My first approximation is to track Debian stable (which is at Rust 1.48), but in principle, I'm okay with something newer.
Unfortunately, there does seem to be a fundamental conflict here. Namely, if we move forward with the existing trichotomy of serde features, then it's going to have to stay that way since I think moving to the scheme you propose would be a breaking change. (Not the MSRV bump, that's fine, but changing the features around.) Since I don't see myself putting out a bstr 2.0
any time soon, it would effectively freeze bstr
into a slightly sub-optimal feature configuration. But if I do adopt the nicer feature setup, then I'm being somewhat aggressive with the MSRV.
I personally don't mind being a bit aggressive, but I suppose it depends on my users.
@lopopolo @thomcc @Byron @TethysSvensson How do you feel about Rust 1.60 as the MSRV?
from bstr.
Given there's a question open here, I'm going to delay the 1.0 release another week.
from bstr.
Speaking of
ByteVec::into_os_string now returns Result<OsString, FromUtf8Error> instead of Result<OsString, Vec>.
In the docs, it says
- One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8 by accessing their underlying 16-bit integer representation. Unfortunately, this isnβt zero cost (it introduces a second WTF-8 decoding step) and itβs not clear this is a good thing to do, since WTF-8 should ideally remain an internal implementation detail.
...While this library may provide facilities for (1) in the future, currently, this library only provides facilities for (2) and (3).
(1) could be simplified, depending on what is done with OsStr
, like rust-lang/rust#95290
Should the existing functions have more specific names to open the door for the more general functions to be added later, if possible?
Non-blocker for 1.0: would it also be worth pointing people to os_str_bytes
to help people who want (1)?
EDIT: Now split out as #115, #116
from bstr.
As a usability note, I wrote an application with BStr
in the internal APis and kept getting tripped up that I needed as_bstr()
or .map(ByteSlice::as_bstr)
at the end of any transformation I did on a BStr
to get it back into its own type. In the ideal world, the output type would match the input type.
I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.
EDIT: Split out into #117
from bstr.
@BurntSushi 1.60.0 MSRV works for me. Applications I build with bstr
target the latest stable and library crates I've built with bstr
have a "bump MSRV in minor versions" policy which I'd be happy to exercise to accommodate a higher bstr
MSRV.
On a related note, I've been using the weak features and namespaced features in recent cargo and have found them to be quite pleasant to work with. Using serde1 = ["dep:serde", ...]
means that a --features serde
flag is no longer valid like it is now.
from bstr.
As a usability note, I wrote an application with
BStr
in the internal APis and kept getting tripped up that I neededas_bstr()
or.map(ByteSlice::as_bstr)
at the end of any transformation I did on aBStr
to get it back into its own type. In the ideal world, the output type would match the input type.I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.
Right yeah. That's what bstr 0.1 was: https://docs.rs/bstr/0.1.4/bstr/
from bstr.
@lopopolo Yeah the new Cargo feature stuff is awesome. I very much like that it gives more control over what's a public feature.
@epage See #5 and #8 for more details on why bstr 0.1 -> 0.2
went from concrete BString
/BStr
types to extension traits on Vec<u8>
/[u8]
. It is slightly sad that if you have a BStr
and call an extension trait method that returns a subslice of it that you get a &[u8]
back instead. I'm not sure there is any simple machinery that would let me fix that while maintaining the existing extension trait API on [u8]
.
Happy to answer more questions on that but would prefer opening a new issue for it if you have more questions.
from bstr.
How do you feel about Rust 1.60 as the MSRV?
I'd be OK with it, thanks for asking.
from bstr.
Hi @BurntSushi I know this is late in the game for 1.0, but I wanted to float an idea of an optional dependency on simdutf8
for a SIMD-accelerated fast path to crate::utf8::validate
:
Lines 463 to 471 in 0d9d222
I'm not sure how you'd want to structure this, but maybe an optional performance
feature a la regex
(that maybe does nothing for 1.0 but leaves the possibility for additional deps open in subsequent releases?).
Edit: coming back to this, I'm actually not sure whether I should expect the DFA in bstr or SIMD routines in simdutf8 to be faster. I just associate SIMD with ποΈ
from bstr.
Yeah that's not a 1.0 concern. I'm unlikely to take a dependency for that. Instead, I would rather see the implementation ported to bstr or maybe even memchr. That said, I've never looked at it in depth, so I'm unsure of its complexity.
Feel free to open a new issue about this to avoid cluttering up the 1.0 release thread.
For this thread, I would really like to keep it scoped to 1.0 concerns.
from bstr.
OK, bstr 1.0.0-pre.3
is up: https://docs.rs/bstr/1.0.0-pre.3/bstr/
I've rejiggered/simplified the features (just a serde
feature and no more serde1-{std,alloc,core}
goop) and bumped the MSRV to Rust 1.60.
My new 1.0 release date is September 6, 2022. If there are any last comments on backward incompatible changes, now is the time to make them. :)
from bstr.
Related Issues (20)
- Character-oriented slicing and length APIs HOT 2
- Consider using `once_cell` instead of `lazy_static` HOT 3
- Ensure crate abides by the Unicode Data Files and Software License HOT 4
- bstr pulls in serde when `serde` feature is not specified HOT 6
- bstr-bench Cargo.lock out of sync
- Can the MSRV be lowered to 1.57 (or lower)? HOT 3
- Clarify None case in bstr::decode_utf8 HOT 1
- Feature request: `impl Deserialize for Box<BStr>`
- bstr 1.3.0 with `impl AsRef<BStr> for BStr` breaks some folks downstream HOT 1
- candidate versions found which didn't match: 0.2.17, 0.2.16, 0.2.15, ... HOT 5
- Possible panic safety issues in insert_str HOT 2
- Complementary ByteSlice functions addition - find_not_byte / rfind_not_byte HOT 1
- Use clippy in CI? HOT 2
- Intradoc links are broken when building with no default features HOT 3
- re-enable miri tests
- Accept array of str for split_str HOT 1
- remove `Borrow<BStr> for String` impls (and similar) in a semver compatible release HOT 9
- Add unescape_ascii fn HOT 4
- Display implementation doesn't respect Formatter options
- `bstr::Split` should implement clone. HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bstr.