
bstr's Issues

Comparing Cow<BStr> to BStr does not fail as expected

bstr version: 0.2.14
rustc version: 1.49.0

When executing the following test, I would expect the bstr assertion to fail, but it passes, while the str comparison fails as expected.

use std::borrow::Cow;
use bstr::ByteSlice;

#[test]
fn test_cows() {
    let c1 = Cow::from(b"hello bstr".as_bstr());
    let c2 = b"goodbye bstr".as_bstr();
    assert_eq!(c1, c2);

    let c3 = Cow::from("hello str");
    let c4 = "goodbye str";
    assert_eq!(c3, c4);
}

no_std + alloc cargo feature configuration

I would like to use bstr to implement a conventionally UTF-8 String data structure to be used in a Rust implementation of Ruby.

As I've been implementing the core data structures in Artichoke, I've made as many of them as I can #[no_std] (for example, this Array implementation built with Vec).

The ByteVec extension trait has several methods I'd like to use instead of reimplementing myself, such as ByteVec::remove_char. Unfortunately, the ByteVec impl is gated on the std cargo feature which will require my crate to link to std despite my interest in only linking to alloc. For this String data structure, I am not interested in any of the PathBuf or OsString APIs.
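For illustration, here is a minimal sketch of the kind of no_std + alloc crate this would enable; the split of the std feature into a separate alloc feature is hypothetical:

// Sketch only: bstr currently gates the ByteVec impl behind `std`,
// so this assumes a hypothetical `alloc` cargo feature.
#![no_std]
extern crate alloc;

use alloc::vec::Vec;
use bstr::ByteVec;

/// Remove the character starting at byte index 0, if any.
pub fn drop_first_char(buf: &mut Vec<u8>) {
    if !buf.is_empty() {
        buf.remove_char(0);
    }
}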

I think this might be a breaking change for 1.0, so I wanted to raise this as soon as I ran into it.

Feat: Box<[u8]> -> Box<BStr>

It's currently possible to do &[u8] -> &BStr and Vec<u8> -> BString, but a way to construct owned, heap-allocated, fixed-capacity byte strings via Box<[u8]> -> Box<BStr> is not yet provided.
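A minimal sketch of the conversion, assuming BStr is #[repr(transparent)] over [u8] (which the existing &[u8] -> &BStr conversion already relies on):

use bstr::BStr;

fn box_bytes_to_box_bstr(bytes: Box<[u8]>) -> Box<BStr> {
    // SAFETY: sound only because BStr is a transparent wrapper around
    // [u8], so the allocation's layout is unchanged by the cast.
    unsafe { Box::from_raw(Box::into_raw(bytes) as *mut BStr) }
}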

Distinguish between invalid and valid-but-incomplete utf-8

I have &[u8] that represents a section of a streaming read of some text. I'd like to convert it (possibly lossily) to a String as if it had all arrived in a single chunk. The issue is that the chunk may end part-way-through a valid utf8 sequence, and naively converting will corrupt any character unlucky enough to get sliced in this manner. To avoid that, I need to be able to distinguish if the final utf8 sequence is invalid or just incomplete.

Right now I do something like this:

let (ch, size) = bstr::decode_last_utf8(incoming_buffer);
let to_process = if ch.is_some() {
    // Ends with valid utf8, no worries.
    incoming_buffer
} else {
    // Ends with either invalid or incomplete utf8; figure out which.
    let (valid, invalid) = incoming_buffer.split_at(incoming_buffer.len() - size);
    if utf8_valid_prefix(invalid) {
        // Incomplete: only process the valid part and leave
        // the rest for next time.
        add_to_start_of_next_buffer(invalid);
        valid
    } else {
        // Process the whole thing -- we'll replace the invalid bit at the end
        // with the replacement char.
        incoming_buffer
    }
};

Where utf8_valid_prefix looks like this:

fn utf8_valid_prefix(s: &[u8]) -> bool {
    let (head, tail) = if let Some((&h, t)) = s.split_first() {
        (h, t)
    } else {
        return true;
    };
    let seqlen = match head {
        0b0000_0000..=0b0111_1111 => 1,
        0b1000_0000..=0b1011_1111 => return false,
        0b1100_0000..=0b1101_1111 => 2,
        0b1110_0000..=0b1110_1111 => 3,
        0b1111_0000..=0b1111_0111 => 4,
        0b1111_1000..=0b1111_1111 => return false,
    };
    s.len() < seqlen
        && tail
            .iter()
            .all(|b| (0b1000_0000..=0b1011_1111).contains(b))
}

This works fine [0] (assuming I didn't mangle it too badly trying to simplify/clean it up for the issue), but it feels redundant given that bstr has already done this (or something equivalent) when decoding, and it feels like the library can probably help more.

Anyway, I don't have strong opinions on what an API for this should look like, but I figured I'd bring it up.

One easy option might be to add e.g. ByteSlice::is_char_boundary(&self, index: usize) -> bool and ByteSlice::utf8_sequence_len(&self, starting_at: usize) -> Option<usize>. Maybe also a prev_char_boundary? Had these existed, I probably wouldn't be filing this issue, but implementing them for not-necessarily-valid utf8 might end up hard or produce confusing results, and they might not be obvious functions to look for if you aren't familiar with how utf8 is encoded. I'm open to thoughts though.

Also, apologies in advance if something for this exists and I'm just missing it...


[0]: Actually, I suspect the last part (where we check that tail is valid) has a bug: it can cause us to take a sequence which is both invalid and incomplete and turn it into two replacement characters, whereas if we had processed the whole buffer in one go we would only have one. I'm not really worried about that, since the data was already invalid at that position, but it might be an indication that "distinguish between invalid and incomplete" isn't quite the right framing.

More compatible escapes for control characters

Current Debug implementation yields something like this:

"\u{0}\u{0}\u{0} ftypisom\u{0}\u{0}\u{2}\u{0}isomiso2avc1mp"

And I'd like it to be like this:

"\0\0\0 ftypisom\0\0\x02\0isomiso2avc1mp"

Edit: these are the first bytes of an mp4 file, so it's quite a real-world example.
This is not only compatible with Rust (even for str constants), but also with C, Python, Bash, and probably more languages. It's also a bit shorter.

  1. What do you think of changing the representation?
  2. If changing is okay, what should be done with non-printable chars > 0xff? There are two choices: keep them as the current \u{..} or always escape with \x. I would prefer the latter, because when working with binary data, looking at individual bytes makes more sense, and when working with textual data, these character codes are rarely informative enough.

Reverse word-based iterators?

First, thanks for creating bstr, it's great!

I'm working with data whose long lines contain the relevant information at the end. It would be useful to be able to iterate over words in reverse, e.g. a WordsReverse or FieldsReverse. Currently I handle this using rsplit on ' ', but this doesn't cope with multiple spaces and feels brittle (a workaround sketch follows).
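With the current API, getting the last field robustly means walking the whole line forward; a dedicated reverse iterator could start from the end instead:

use bstr::ByteSlice;

// fields() copes with runs of whitespace, but has no reverse
// counterpart, so last() must walk every field from the front.
fn last_field(line: &[u8]) -> Option<&[u8]> {
    line.fields().last()
}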

Are there any technical reasons these structs don't exist? If not, would you be open to a PR to add them?

bstr transmute

Isn't transmuting an &[u8] into an &BStr only safe if BStr is #[repr(transparent)]?

ByteSlice should offer more functions taking u8

I'm using bstr to help parse a few formats, some of which are text (but not necessarily utf8) and some of which are binary but contain strings.

I have a few things that I wished were present:

  • The various split functions should have variants for passing a byte instead of just str/char. I ended up using split_str(b"\0") and split_str(b"\xff") a few times, which is going to be less efficient than directly invoking memchr (see the sketch after this list).

  • Versions of fields_with/trim_start_with/trim_end_with which pass their function the byte instead, and don't bother with UTF-8 decoding.
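For the first bullet, a sketch of what splitting on a raw byte looks like via the memchr crate (which bstr already builds on), versus a one-byte substring search:

use memchr::memchr;

// Split around the first occurrence of `b`, if any.
fn split_once_byte(haystack: &[u8], b: u8) -> Option<(&[u8], &[u8])> {
    let i = memchr(b, haystack)?;
    Some((&haystack[..i], &haystack[i + 1..]))
}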

It seems possible that you're more interested in this being useful for the probably-text case than for binary data (e.g. the emphasis is on the str, and not the b). If that's the case, sorry for this and the next bug I'm going to file!

Add BString convenience constructors (from Vec)

It's not clear to me from #5 whether the wrappers are going to continue to live on (IMO they should; the points you made in the first comment are quite good and still apply), but if BString is going to live on, it would be nice if it implemented some of the same constructors as Vec.

Particularly, a Default impl, BString::new, and BString::with_capacity all seem very simple to implement and would improve ergonomics enough to justify their existence. The first one is especially helpful, as without it you cannot derive Default on a type containing a BString.
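A sketch of what those could look like inside the crate (hypothetical additions, relying only on the existing From<Vec<u8>> impl):

// Hypothetical constructors mirroring Vec's.
impl BString {
    pub fn new() -> BString {
        BString::from(Vec::<u8>::new())
    }

    pub fn with_capacity(capacity: usize) -> BString {
        BString::from(Vec::<u8>::with_capacity(capacity))
    }
}

impl Default for BString {
    fn default() -> BString {
        BString::new()
    }
}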

And ofc feel free to close this if BString is slated for removal soon.

remove_char bug?

use bstr::ByteVec;
let mut s = Vec::from("1☃☃☃");
assert_eq!(s.remove_char(3), '☃');

rustc 1.53.0 (53cb7b09b 2021-06-17) error detail:

thread 'test4' panicked at 'assertion failed: `(left == right)`
  left: `'�'`,
 right: `'☃'`'

feat: to_{capitalized,titlecase}, to_{capitalized,titlecase}_into, make_ascii_{capitalized,titlecase}

I'm particularly interested in titlecase support since it is not provided by the std.

For the capitalize cases, I believe they could be implemented like:

trait ByteSlice {
    fn make_ascii_capitalized(&mut self) {
        if let Some((head, tail)) = self.as_bytes_mut().split_first_mut() {
            head.make_ascii_uppercase();
            tail.make_ascii_lowercase();
        }
    }

    fn to_capitalized_into(&self, buf: &mut Vec<u8>) {
        let mut bytes = self.as_bytes();
        // This allocation assumes that in the common case, capitalizing
        // and lowercasing `char`s do not change the length of the
        // `Vec`.
        buf.reserve(bytes.len());
        match bstr::decode_utf8(bytes) {
            (Some(ch), size) => {
                // Converting a UTF-8 character to uppercase may yield
                // multiple codepoints.
                for ch in ch.to_uppercase() {
                    buf.push_char(ch);
                }
                bytes = &bytes[size..];
            }
            (None, size) if size == 0 => return,
            (None, size) => {
                let (substring, remainder) = bytes.split_at(size);
                buf.extend_from_slice(substring);
                bytes = remainder;
            }
        }
        while !bytes.is_empty() {
            let (ch, size) = bstr::decode_utf8(bytes);
            if let Some(ch) = ch {
                // Converting a UTF-8 character to lowercase may yield
                // multiple codepoints.
                for ch in ch.to_lowercase() {
                    buf.push_char(ch);
                }
                bytes = &bytes[size..];
            } else {
                let (substring, remainder) = bytes.split_at(size);
                buf.extend_from_slice(substring);
                bytes = remainder;
            }
        }
    }
}

The above code is MIT licensed and comes from https://github.com/artichoke/artichoke/blob/fc87277afda4ac3db59ea9c080bbb5f5170b6d10/spinoso-string/src/lib.rs#L1305-L1353.

Adding license info

Hello @BurntSushi! Would it be possible to add some more license info?

  • In src/unicode, it appears that you've extracted some data from Unicode.
  • In src/utf8.rs, you mention that code is based off of https://bjoern.hoehrmann.de/utf-8/decoder/dfa/, but don't reference what license that code falls under.

Could you add these licenses to the crate, and reference this in the Cargo.toml / readme? Thanks so much!

consider using extension traits on Vec<u8>/&[u8]

Currently, in 0.1 of bstr, the primary way to use and manipulate byte strings is with the BString (owned, growable) and BStr (borrowed slice) types. However, a major alternative design to using explicit types is to define extension traits that add more methods to the Vec<u8> and &[u8] types.

Reflecting back, I don't think I quite gave the extension trait path a thorough review before going with the explicit BString/BStr types. In particular, I perceive a few key advantages to using explicit types:

  1. Having distinct types provides some "semantic" meaning that the bytes should be treated as a string rather than just an arbitrary collection of bytes.
  2. Having a convenient Debug representation that prioritizes the "stringness" of BString/BStr over the "slice of u8" representation shown for Vec<u8>/&[u8]. For example, "abc" instead of [97, 98, 99].
  3. As a riff on (2), there may be other traits that one wants to implement "specially" for byte strings as opposed to "slice of u8." Serde comes to mind here.

If (1) were the only benefit, I think I could be persuaded to drop that line of reasoning, although it does appeal to me aesthetically. However, in my view, (2) is a fairly significant benefit, and it's one of the most important ergonomic improvements that I look forward to whenever I bring bstr in to one of my crates. Otherwise, I fairly consistently define my own helper functions for printing out byte strings when I don't have bstr, and it's honestly a pain. Especially when Vec<u8>/&[u8] are part of some other type.

With that said, in the course of actually using bstr in crates, I've come to the belief that using extension traits would make using string oriented APIs much more seamless and more ergonomic overall, with the notable exception of the aforementioned problems with the debug representation. In particular, using BString/BStr often requires annoying conversion routines between Vec<u8>/&[u8]. e.g., Most of the raw I/O APIs in std want a &[u8], so you wind up needing to write my_byte_string.as_bytes() quite a bit, which is annoying.

Moreover, using BString/BStr really motivates one to use them in as many places as possible, because of aforementioned annoying conversions. But this isn't always desirable, because you might want to expose APIs in terms of &[u8] for various reasons, including, but not limited to, not adding a public dependency on bstr. If we were using extension traits instead, then you could just import the traits and start using the new APIs immediately.
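In miniature, the extension-trait design under discussion looks like this (the trait name ByteSlice and the find method are used for illustration):

use bstr::ByteSlice;

fn main() {
    // Methods come from the imported trait, so they work directly on
    // &[u8]; no conversion to a wrapper type is needed.
    let data: &[u8] = b"foo bar baz";
    assert_eq!(data.find("bar"), Some(4));
}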

One possible alternative to this would be to implement Deref and DerefMut for BString/BStr, which would eliminate the various as_bytes() calls, but you'd still need to explicitly construct byte strings. Moreover, this kind of feels like an abuse of Deref.

Another benefit of extension traits is that the API surface area of bstr could be quite a bit smaller, since many of the methods on BString/BStr are just re-exports of methods by the same name on Vec<u8>/&[u8].

Overall, my sense is that this crate would be better if it used extension traits. To mitigate (but not completely solve) my Debug problem, we could keep the BString/BStr types, but remove all of their inherent methods, make them implement Deref and add an appropriate Debug impl. You still have to explicitly convert from Vec<u8>/&[u8], which is a little annoying, but I expect their use would be more limited and the Deref impl would make them overall more convenient to use.

Obviously, this is a fairly large breaking change to the current API, but given the only consumer (that I know of) is me, I think it's okay to do this. The library is called an experiment after all, and if we're going to make this change, then now would be the time to do it.

Pinging @joshtriplett and @eddyb, who I believe might have thoughts on this. (Please ping others that you think might have an opinion on this.)

Should CLASSES and STATES_FORWARD be const?

The decoding process seems to rely on those two arrays. It doesn't look like they get mutated anywhere, so in that context should they be const instead of static? I have tested this change and it does not appear to break anything.

Also, it seems that decode's unsafe block relies on the decode_step logic being correct. In that case, should a comment be added at the top of decode_step to warn that faulty logic in that function would result in UB?

And again, being pretty new to programming and to Rust, I may not know the best practices for these two cases, and I may have missed something.

Lifetime constraints on e.g. `ByteSlice::splitn_str` are too restrictive

They prevent things like:

use bstr::ByteSlice;

pub trait SliceExt<C> {
    fn split2(&self, c: C) -> Option<(&Self, &Self)>;
}

impl SliceExt<&[u8]> for [u8] {
    fn split2(&self, b: &[u8]) -> Option<(&[u8], &[u8])> {
        let mut iter = self.splitn_str(2, b);
        match (iter.next(), iter.next()) {
            (Some(a), Some(b)) => Some((a, b)),
            _ => None,
        }
    }
}

impl SliceExt<String> for [u8] {
    fn split2(&self, b: String) -> Option<(&[u8], &[u8])> {
        let mut iter = self.splitn_str(2, &b);
        match (iter.next(), iter.next()) {
            (Some(a), Some(b)) => Some((a, b)),
            _ => None,
        }
    }
}

The splitter doesn't need to have the same lifetime as the slice that is being split.

Small documentation issues.

I noticed some small documentation problems:

  • The comment on fields_with is truncated:

    /// If this byte

  • The comment on Split refers to an F which does not exist.

    bstr/src/ext_slice.rs

    Lines 3283 to 3284 in 1d7dc1f

    /// `'a` is the lifetime of the byte string being split, while `F` is the type
    /// of the predicate, i.e., `FnMut(char) -> bool`.

A from_str(_radix) analogue

Working with mostly-UTF-8 byte strings, the two biggest things my current project is missing are substring matching (which this crate provides) and integer parsing (which std provides only for str).

This is easily worked around by calling str::from_utf8 first, because if the input is not valid UTF-8 then it would never make it through a from_str/parse or from_str_radix anyway.
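That workaround, sketched (the UTF-8 validation is the pass a bstr-native parse could skip):

// Validate as UTF-8 first, then reuse str's parser.
fn parse_u32(bytes: &[u8]) -> Option<u32> {
    std::str::from_utf8(bytes).ok()?.parse().ok()
}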

But it would be convenient to have these functions in bstr as well, to avoid the overhead from that redundant from_utf8 call.

Using `ByteVec` consuming trait methods on BString fails with "cannot move out of dereference of BString"

My actual goal is to convert a BString into a String, in a way that errors if it's not UTF-8. Basically String::from_utf8(my_bstring.into_inner()).

However, when trying it, I see 'cannot move out of dereference of BString', which makes sense as all functionality is implemented via Deref.

The problem is that my API uses BStr and sometimes BString, so it would be desirable to be able to dissolve the BString once again.
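A workaround sketch that avoids moving out through the Deref: convert back to Vec<u8> first (this assumes the From<BString> for Vec<u8> impl, which takes the BString by value):

use bstr::BString;

fn into_string(b: BString) -> Result<String, std::string::FromUtf8Error> {
    // Vec::from consumes the BString, so no deref-move is involved.
    String::from_utf8(Vec::from(b))
}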

Here is the playground link.

Thanks for your advice.

What is the intended Debug escape behavior for the BStrs that contain the Unicode replacement character?

I've based some ident parsing code on the fmt::Debug impl for &BStr, which currently looks like this:

bstr/src/impls.rs

Lines 331 to 339 in f56685b

for (s, e, ch) in self.char_indices() {
    if ch == '\u{FFFD}' {
        for &b in self[s..e].as_bytes() {
            write!(f, r"\x{:X}", b)?;
        }
    } else {
        write!(f, "{}", ch.escape_debug())?;
    }
}

For my usecase, this wasn't quite right because all Unicode characters outside of the ASCII range are valid ident characters. This means the replacement character itself is a valid ident char if it appears in the source byteslice. I ended up with something like this:

use bstr::ByteSlice;
use std::char::REPLACEMENT_CHARACTER;

// The UTF-8 encoding of U+FFFD, for checking byte spans below.
const REPLACEMENT_CHARACTER_BYTES: [u8; 3] = [0xEF, 0xBF, 0xBD];

fn is_ident_char(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_' || !ch.is_ascii()
}

fn is_ident_until(name: &[u8]) -> Option<usize> {
    // Empty strings are not idents.
    if name.is_empty() {
        return Some(0);
    }
    for (start, end, ch) in name.char_indices() {
        match ch {
            // `char_indices` uses the Unicode replacement character to indicate
            // the current char is invalid UTF-8. However, the replacement
            // character itself _is_ valid UTF-8 and a valid Ruby identifier.
            //
            // If `char_indices` yields a replacement char and the byte span
            // matches the UTF-8 encoding of the replacement char, continue.
            REPLACEMENT_CHARACTER if name[start..end] == REPLACEMENT_CHARACTER_BYTES[..] => {}
            // Otherwise, we've gotten invalid UTF-8, which means this is not an
            // ident.
            REPLACEMENT_CHARACTER => return Some(start),
            ch if !is_ident_char(ch) => return Some(start),
            _ => {}
        }
    }
    None
}

The current implementation of Debug will always output the replacement character as three byte escapes. Is this intended?

Rename `BString::pop` to `pop_char`?

I think both "pops char" and "pops u8" are plausible semantics for pop, so perhaps it makes sense to disambiguate the name? We can also make this symmetric and have pop_byte, push_byte, pop_char, push_char.

Add parsing related functions and traits.

In the str type, we have:

pub fn parse<F>(&self) -> Result<F, <F as FromStr>::Err>
where
    F: FromStr, 

Sometimes I just want to parse a u32 from a byte slice, but currently with the std lib I can only first perform a checked conversion from &[u8] to &str, and then call str::parse. That's a performance loss, because I don't really care whether the byte slice is valid UTF-8.

Is it possible to add similar functions and traits to bstr?

Add more From impls for constructing Cow

I'm constructing a set of error messages that use Cow<'static, [u8]> because the vast majority of messages are static byte strings, but some require dynamic allocation (for example by deferring to the Display impl of a wrapped error). The actual message field is a Cow<'static, BStr> to make use of BStr's debug implementation.

The constructors for the error structs are all From impls on owned and borrowed String and Vec<u8>.

Trying to construct a Cow<'_, BStr> from a Vec<u8> or &'a [u8] is a bit verbose, because the compiler gets confused when you chain buf.into().into(). I end up having to write:

impl From<Vec<u8>> for Message {
    fn from(message: Vec<u8>) -> Self {
        Self(Cow::Owned(message.into()))
    }
}

impl From<&'static [u8]> for Message {
    fn from(message: &'static [u8]) -> Self {
        Self(Cow::Borrowed(message.into()))
    }
}

Can we add a direct set of impls for constructing a Cow?

impl<'a> From<Vec<u8>> for Cow<'a, BStr> {
    fn from(bytes: Vec<u8>) -> Self {
        Cow::Owned(bytes.into())
    }
}
impl<'a> From<&'a [u8]> for Cow<'a, BStr> {
    fn from(bytes: &'a [u8]) -> Self {
        Cow::Borrowed(bytes.into())
    }
}

And maybe impl<'a> From<Cow<'a, [u8]>> for Cow<'a, BStr>.

These impls would also make it easier to turn a String/&str Cow into a BStr Cow.

the `Bytes` iterator should have an `as_slice` method

The underlying iterator std::slice::Iter has an as_slice method, and it would be very convenient if that method were re-exported on Bytes, as this would allow easy conversion between &[u8] and Bytes. That would make writing parsers easier, since they often need to parse a few bytes and then return the unparsed remainder of the byte slice (example use case, currently without usage of the Bytes iterator).
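A sketch of the requested method, assuming Bytes wraps a std::slice::Iter<'a, u8> internally (the field name `it` is hypothetical):

// Hypothetical addition; simply forwards to slice::Iter::as_slice.
impl<'a> Bytes<'a> {
    pub fn as_slice(&self) -> &'a [u8] {
        self.it.as_slice()
    }
}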

A helper macro for bytestring concatenation/formatting

Hi there,

I've just released an article about bytestrings and encoding in general here: https://www.reddit.com/r/rust/comments/gz33u6/not_everything_is_utf8/, at the end of which I mention my intention of creating a format_bytes! macro to... format bytes.

I was wondering whether you had built that functionality somewhere I could re-use, or - if not - whether you think it would be a good fit for the bstr crate, barring maybe a cargo feature for the added compilation dependencies of proc macros.

(Copy-pasted from an earlier exchange to preserve its history)

Someday: implement `ByteSlice` for `[u8; N]` using const generics

In the distant future, when bstr can use const generics, the following would reduce the amount of B(b"blah") you need to apply to literals by a decent amount, letting you use ByteSlice methods directly on b"foo".

(Of course, you still need B for the &str literal case, e.g. B("😅") and such, but in practice that case isn't as bad, since as_bytes() also works)

impl<const N: usize> ByteSlice for [u8; N] {
    #[inline]
    fn as_bytes(&self) -> &[u8] {
        self
    }

    #[inline]
    fn as_bytes_mut(&mut self) -> &mut [u8] {
        self
    }
}

(Well, in practice you also need impl<const N: usize> Sealed for [u8; N] {}, but that's not important).

Nothing to do for now, but would be a nice thing to add when (or if, I guess) the MSRV gets to 1.51.

Can’t move out of BString using `into` methods

Because ByteVec methods operate through deref on BString, it is impossible to use the ones with move semantics (into_string etc.) to move out of a BString, which is confusing.

use bstr::{BString, ByteVec};

fn main() {
    let x = BString::from("hello world");
    x.into_string_lossy(); // cannot move out of dereference of `BString`
}

Consider impl AsRef<OsStr> for BStr/BString

I'd like to be able to pass a Vec<BString> directly to Command::args, or to other methods that want a &[T] for some T: AsRef<OsStr>. The easiest way to do that would be if BString implemented AsRef<OsStr>.

I realize that doing this would mask issues with invalid UTF-16 on Windows, but from what I understand, those are incredibly uncommon to encounter and difficult to create (unlike invalid UTF-8 on UNIX, which is quite easy to end up with).
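In the meantime, a workaround sketch using bstr's existing OsStr conversion (to_os_str is zero-cost on Unix; on Windows it requires the bytes to be valid UTF-8):

use bstr::{BString, ByteSlice};
use std::process::Command;

fn command_with_args(program: &str, args: &[BString]) -> Command {
    let mut cmd = Command::new(program);
    for arg in args {
        // to_os_str comes from ByteSlice, reached through BString's Deref.
        cmd.arg(arg.to_os_str().expect("valid platform string"));
    }
    cmd
}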

Would you consider providing such an impl?

Q: Perf difference when using bstr vs regular Strings

Hi,

Will there be any performance difference when using bstr (i.e., without the UTF8 validation) compared to regular Rust Strings?

If so, is it noticeable enough that I should opt for bstr when my program is parsing hundreds of strings out of binary data (which itself will be gigabytes)?

Thanks in advance for your answers

Unicode width of a byte slice?

I have some probably-utf8 bytes I'd like to know the display width for when rendered with a fixed-width font.

I could use bstr to iterate over the chars and then apply the unicode-width crate to each one, but this is less efficient than an implementation that skips over chunks that are all printable ASCII (which is the common case); a sketch of that baseline follows.
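The char-at-a-time baseline, sketched with the unicode-width crate (invalid bytes decode to U+FFFD, which this counts at its regular width):

use bstr::ByteSlice;
use unicode_width::UnicodeWidthChar;

// Sum the display width of each decoded char; no ASCII fast path.
fn display_width(bytes: &[u8]) -> usize {
    bytes.chars().map(|ch| ch.width().unwrap_or(0)).sum()
}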

This seems like similar rationale to why some of the other unicode things are in here, but I wanted to know if a PR implementing this would be accepted before doing anything.


There are some subtleties which might make it less desirable to support this, so I'll be up front about them: the value for width is defined to differ in CJK and non-CJK contexts (see http://www.unicode.org/reports/tr11/#ED6). In particular, there are a number of "ambiguous" characters, which are considered "wide" in CJK contexts and "narrow" otherwise.

So I think the API would look something like fn unicode_width(&self, cjk: bool) -> usize. It's worth noting that the unicode-width crate exposes separate functions, but having one be the default feels a little non-ideal, and this reduces the API surface anyway. An enum would also work but the only name I can think of for the non-cjk variant is, well, NonCjk (to be clear: Latin wouldn't be accurate). So, a bool punts on having to make that decision, even though it might be considered slightly opaque.


For some background on why I want this: my actual use case involves printing, in a terminal application**, some data which has evolved from "ascii-by-convention" to "utf8-by-convention"*.

* In general, bstr has been more or less perfect for this kind of data, which is common in old unixy systems and formats. This also means I only really need the non-cjk version of this, but providing one without the other seems pretty bad to me.

** All the normal terminal caveats like term locale and escape codes and the like are already handled or not relevant, and I'm aware that "width" is ultimately dependent on many things including the font, the renderer being used, etc.

`is_leading_utf8_byte` returns wrong answer for 0xf8..=0xff (utf-8 invalid bytes are not leading bytes)

is_leading_utf8_byte seems to assume its input byte came from a stream of valid UTF-8 encoded text:

bstr/src/utf8.rs

Lines 797 to 802 in 91edb3f

fn is_leading_utf8_byte(b: u8) -> bool {
    // In the ASCII case, the most significant bit is never set. The leading
    // byte of a 2/3/4-byte sequence always has the top two most significant
    // bits set.
    (b & 0b1100_0000) != 0b1000_0000
}

Specifically, it returns true for bytes in the range 0xf8..=0xff, which is wrong. These bytes are not leading bytes, nor are they trailing bytes. They're bytes which can only appear in sequences of invalid utf-8 -- there's no case where they appear in a stream of valid utf8-encoded text, but obviously bstr doesn't want to rely on this.

This is an internal function so it may not matter, but I have a hard time understanding much of the code in this module (I am not someone who looks at a DFA state table and sees meaning in the numbers), so I am unsure whether or not it can cause problems... It's also totally possible this is handled in some other way, I was just finally getting around to doing #44 and noticed this.

(I did try some trivial tests to try to get weird things to happen, to no avail, but figured it was better to report it -- when I look at those tables my eyes glaze over, but hopefully the code's author understands it!)

Adding a CharOrRaw iterator

Hi @BurntSushi.

Recently I was trying to use this library to implement a Huffman coding app over a tar file, so I want a char when the input is valid UTF-8 and the raw byte when it is invalid. The docs mention that I could use the CharIndices iterator and manually access the invalid bytes, but that sounds like a bit of boilerplate.
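The decoding step such an iterator would wrap, sketched with bstr::decode_utf8 (the CharOrRaw name is the proposal's, not an existing bstr type):

enum CharOrRaw<'a> {
    Char(char),
    Raw(&'a [u8]),
}

// Decode one item from the front, returning it plus the bytes consumed.
fn next_item(bytes: &[u8]) -> Option<(CharOrRaw<'_>, usize)> {
    match bstr::decode_utf8(bytes) {
        (_, 0) => None, // empty input
        (Some(ch), size) => Some((CharOrRaw::Char(ch), size)),
        (None, size) => Some((CharOrRaw::Raw(&bytes[..size]), size)),
    }
}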

I have created a pull request adding such an iterator. It is important to note that I am an inexperienced programmer; in fact, if it is accepted, this will be my first contribution to an open source project.

Is there a fundamental reason why such an iterator would be impractical or could not be included in this project?

RFC: 1.0 release?

For those coming here that don't know what bstr is: it is a string library for &[u8]. The essential difference between the strings in bstr and the &str type in std is that bstr treats &[u8] as conventionally UTF-8 instead of requiring that it be UTF-8. Its main utility is in contexts where you believe your data is UTF-8 (but it might not be completely UTF-8) and you either don't have any information about what its actual encoding is or do not want to pay for the UTF-8 validity check. A common example of this is reading data from files. The bstr documentation says a lot more.

This issue is about releasing 1.0. Since I do not currently have any plans for a 2.0, I would like to get as many eyes on this as possible. If you have any feedback with respect to API breaking changes, I would love to hear about it.

OK, so I promise that the 1.0 release is imminent. Here is an exhaustive list of planned breaking API changes, all of which are currently present on master (some brought in via #104, others via #123):

  • Bytes::as_slice is renamed to Bytes::as_bytes.
  • ByteVec::into_os_string now returns Result<OsString, FromUtf8Error> instead of Result<OsString, Vec<u8>>.
  • ByteVec::into_path_buf now returns Result<PathBuf, FromUtf8Error> instead of Result<PathBuf, Vec<u8>>.
  • Find<'a> has been changed to Find<'h, 'n>, which represents the lifetimes of both the haystack and the needle, instead of the shorter of the two.
  • FindReverse<'a> has been changed to FindReverse<'h, 'n>, which represents the lifetimes of both the haystack and the needle, instead of the shorter of the two.
  • Split<'a> has been changed to Split<'h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
  • SplitReverse<'a> has been changed to SplitReverse<'h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
  • SplitN<'a> has been changed to SplitN<'h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
  • SplitNReverse<'a> has been changed to SplitNReverse<'h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
  • ByteSlice::fields is now gated behind the unicode feature. Previously, it was available unconditionally.
  • serde1 has been renamed to serde1-std, and serde1-nostd has been split into serde1-alloc and serde1-core.
  • BufReadExt::for_byte_line now accepts &mut self instead of self.
  • BufReadExt::for_byte_record now accepts &mut self instead of self.
  • BufReadExt::for_byte_line_with_terminator now accepts &mut self instead of self.
  • BufReadExt::for_byte_record_with_terminator now accepts &mut self instead of self.
  • The OsStr and Path conversion routines had their API docs tweaked slightly so that they could defer to a possible OsStr::as_bytes (and OsStr::from_bytes) routine in the future, if it's added. But their behavior otherwise currently remains the same.
  • The serde1-* features have been dropped. bstr now just has a serde feature and uses the new dep: and pkg? syntax so that it will combine as one would expect with other features.
  • ByteSlice::copy_within_str has been removed, since slice::copy_within has been stable since Rust 1.37. slice::copy_within does the exact same thing.

Assuming this is an exhaustive list, and given that these are all very minor changes, I'm hopeful that migration from 0.2 to 1.0 will be very easy, hopefully requiring no changes at all in most cases.

My plan is to release 1.0 on September 6, 2022 (previously planned for July 11, 2022, then July 18, 2022). If you have feedback to give, please do so now. :-)

Note that I've published 1.0.0-pre.3 to crates.io. Docs: https://docs.rs/bstr/1.0.0-pre.3

(Below is the message I initially wrote a couple years ago. I'm way late to the party.)


It has been almost a year since I released bstr 0.2 with the major breaking change of moving most of the routines to extension traits. It seems like this has been a success. Namely, bstr's reverse dependency list is growing. More generally, I personally like working with the new API better than the old one. While I still bemoan the loss of a distinct type and its corresponding Debug impl as the One True Byte String, I think the benefits of the extension trait API have ended up outweighing that cost.

I've been giving some thought to bstr's API and its future evolution, and nothing immediately comes to mind in terms of breaking changes. That is, most everything I can think of are API additions, or at worst, deprecations. The only breaking change I can think of is to more carefully audit which API routines are available when unicode mode is disabled. I just want to make sure I'm not boxing myself into any corners there. (e.g., Some extant implementations might currently rely on std for its Unicode support, but it may wind up being the case that we want to re-implement some of those, which will require bringing in our own Unicode data. If that occurs, then those APIs should be gated behind the unicode feature.)

Otherwise, my feeling is that, unless I hear otherwise, I will make a 1.0 release in a few months. June 2020 is the 1 year anniversary of the 0.2 release, so that sounds like as good a time as any.

Thoughts?

cc @thomcc I know you've done some work on bstr and are actually using it, so would definitely appreciate if you have any thoughts here! Mostly what I'm looking for are things that we might want to do that will break the current API. While 1.0 doesn't necessarily mean "breaking changes must stop," I generally try to commit to a long termish period of stability for each major version in core libraries.

api: const fn BString::new

I'm using BString to implement an output capture struct and leverage the BString debug impl.

I'm currently adding const to as many zero argument new constructors in my code as I can. Since BString is a thin wrapper around Vec, I expected to have an analogous pub const fn new() method available.
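A sketch of the requested constructor; Vec::new has been a const fn since Rust 1.39, so a thin wrapper could expose it (the inner field name `bytes` is hypothetical):

impl BString {
    // Hypothetical: constructs the inner Vec directly, which only
    // works inside the crate.
    pub const fn new() -> BString {
        BString { bytes: Vec::new() }
    }
}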

There is no workaround for this since Into::into is not const.

Searching functions for byte ranges?

I don't have a concrete use for this, but when writing the byteset code it occurred to me that people might want to use it for e.g. searching for the next ASCII digit, or the next lowercase ASCII character (it seems to me this is more generally useful for ranges of u8 than of char, but I could be wrong).

These can be accomplished with the byteset functions, but less efficiently than dedicated functions for finding bytes in a range.

One of the flags to PCMPESTRI does allow for range checks, and defining such a thing in earlier SSEs is easier for ranges than for arbitrary byte sets.

The byteset functions could also possibly autodetect this; in many cases it would be cheap (e.g. b"0123456789"), but for some it would take an extra pass over the byte table afterwards to look for consecutive runs (e.g. b"0918273645"), which seems unfortunate.

Additionally, having to type out the members of the set is less syntactically convenient than b'0'..=b'9' (or whatever).

Not sure if these are worth adding. Again, I don't really have a use, so maybe it's worth waiting for someone who does. And it's unclear how extensive you'd like the searching capabilities to be on ByteSlice anyway. Thoughts?

Add function for counting number of times a given byte appears in a byte_string

I wanted this to count lines; it seems like a good fit for this library and, AFAICT, isn't already offered.

I was initially going to do this as just memchr_iter(...).count() and submit a PR, with the hope that in practice it would be enough faster than the naive haystack.iter().filter(|&&t| t == needle).count() to be fine on its own. That said, when I added benchmarks for it, it turns out not to be any better than the naive version in a lot of cases (which makes sense: you don't want to pay the overhead of entering the loop so often).
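For reference, the memchr_iter version is just:

use memchr::memchr_iter;

// Count occurrences of `needle` by counting memchr's matches.
fn count_byte(haystack: &[u8], needle: u8) -> usize {
    memchr_iter(needle, haystack).count()
}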

Anyway, https://github.com/thomcc/bstr/tree/count_byte has benchmarks set up and such if that helps. I can also submit that as a PR if you'd rather take its implementation and optimize later.

The results of the benchmark on my machine are

bstr/count_byte/missing time:   [10.839 ns 10.944 ns 11.081 ns]
                        thrpt:  [84.131 GiB/s 85.185 GiB/s 86.011 GiB/s]
                 change:
                        time:   [-0.6860% +2.3142% +5.4224%] (p = 0.13 > 0.05)
                        thrpt:  [-5.1435% -2.2618% +0.6907%]
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

bstr/count_byte/normal  time:   [148.13 us 149.44 us 151.18 us]
                        thrpt:  [3.1541 GiB/s 3.1909 GiB/s 3.2190 GiB/s]
                 change:
                        time:   [-0.4056% +1.6184% +4.1171%] (p = 0.14 > 0.05)
                        thrpt:  [-3.9543% -1.5926% +0.4072%]
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

bstr/count_byte/frequent
                        time:   [4.4084 us 4.4287 us 4.4499 us]
                        thrpt:  [214.53 MiB/s 215.56 MiB/s 216.55 MiB/s]
                 change:
                        time:   [-2.4995% +0.8563% +4.8377%] (p = 0.68 > 0.05)
                        thrpt:  [-4.6144% -0.8490% +2.5635%]
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

std/count_byte/missing  time:   [276.77 ns 279.35 ns 282.68 ns]
                        thrpt:  [3.2979 GiB/s 3.3372 GiB/s 3.3683 GiB/s]
                 change:
                        time:   [+0.2130% +3.5072% +6.5533%] (p = 0.04 < 0.05)
                        thrpt:  [-6.1503% -3.3884% -0.2125%]
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe

std/count_byte/normal   time:   [137.53 us 138.13 us 138.77 us]
                        thrpt:  [3.4362 GiB/s 3.4520 GiB/s 3.4670 GiB/s]
                 change:
                        time:   [-0.9794% +1.2840% +3.7289%] (p = 0.29 > 0.05)
                        thrpt:  [-3.5948% -1.2677% +0.9891%]
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

std/count_byte/frequent time:   [284.68 ns 286.55 ns 288.44 ns]
                        thrpt:  [3.2321 GiB/s 3.2533 GiB/s 3.2747 GiB/s]
                 change:
                        time:   [-1.0976% +2.6418% +5.8560%] (p = 0.15 > 0.05)
                        thrpt:  [-5.5320% -2.5738% +1.1098%]
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

which are somewhat damning. In particular, only being faster in the case where the item being searched for is missing doesn't seem that useful.

(I also started poking at an optimized usize-bytes-at-a-time implementation, but couldn't work out the bugs in it, and at this point have spent too much time on this -- that said, if looking at someone's half-working SWAR code sounds useful somehow, then https://gist.github.com/thomcc/ab4c9d4509e58912feaf0c538a071b81 is available).

Support UTF8 sequence length

I think it would be good for bstr to expose a free function that provides some decoding-specific information about what a given byte means in the context of utf8.

Accessing this info is low level, but it has various use cases -- some examples include finding a place to start parsing from given an index, or finding a legal cutoff position if you need to truncate a buffer, etc. (Let me know if you want more cases; I feel like I run into this a fair bit when working with partially invalid utf8.)

Specifically, something like this:

// If `b` indicates the start of a utf8 sequence,
// returns `Some(sequence_len)`. Returns `None` for all other cases.
pub fn utf8_sequence_len(b: u8) -> Option<usize>;

Or... Maybe. I'd kinda like to distinguish between valid-but-not-leading and always-invalid bytes. Returning an enum maybe? Thoughts and bikeshedding welcome, I think in practice this would be useful, but also wanted to keep the things small and simple.
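A sketch of that classification by leading-byte ranges; note that the strict bounds mean always-invalid bytes (0xC0, 0xC1, 0xF5..=0xFF) also return None, which is exactly the valid-but-not-leading vs. always-invalid distinction the enum idea would surface:

pub fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2), // leading byte of a 2-byte sequence
        0xE0..=0xEF => Some(3), // leading byte of a 3-byte sequence
        0xF0..=0xF4 => Some(4), // leading byte of a 4-byte sequence
        _ => None,              // continuation or always-invalid byte
    }
}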


That said, I do feel strongly that this should not be methods on ByteSlice like ByteSlice::is_char_boundary(&self, index: usize) -> bool and ByteSlice::utf8_sequence_len(&self, index: usize) -> Option<usize> (mentioning these mostly because I suggested them in #42) -- I think those two would be very confusing in practice:

  • ByteSlice::is_char_boundary would have to return different results from str::is_char_boundary even for a fully valid UTF-8 byte slice (example: index == len). Having the caller get the byte in question avoids this issue. (Renaming it doesn't really solve the problem either -- it still seems like it could cause confusion if 0/len are not considered boundaries.)

  • ByteSlice::utf8_sequence_len(&self, idx) could behave in too many ways -- specifically, I don't know whether it would only read self[idx] or also consider other bytes nearby (e.g. if it's not a leading byte). Making it a top-level function taking only a u8 removes this ambiguity -- there's reasonably only one thing it could do.

Can't call ByteVec functions if ByteSlice is in scope

I'm sure that I'm just doing something wrong, but I can't seem to figure this out; sorry if it's just my limited understanding of Rust.

Basically I want to do some simple searches against an OsStr and thought I would use this library since it handles this nicely. This is the most distilled version I could come up with that highlights the issue (I'm doing a bit more in my actual code, but not too much):

use bstr::{ByteSlice, ByteVec};
use std::ffi::OsStr;

fn main() {
    let ostr = OsStr::new("test");
    let fname = ByteVec::from_os_str_lossy(ostr);

    if fname.ends_with_str(b"ext") && fname.contains_str(b"words") {
        println!("found")
    }
}

Trying to compile that returns this:

error[E0283]: type annotations required: cannot resolve `_: bstr::ext_vec::ByteVec`
 --> src/main.rs:6:17
  |
6 |     let fname = ByteVec::from_os_str_lossy(ostr);
  |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: required by `bstr::ext_vec::ByteVec::from_os_str_lossy`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0283`.
error: Could not compile `autorun_stats`.

To learn more, run the command again with --verbose.
[Finished running. Exit status: 101]

If I remove the ByteSlice import, then it errors that I can't call contains_str because ByteSlice needs to be in scope; if I remove that call, then everything is okay.

I tried a variety of different annotations in the assignment to fname but nothing I add seems to work.

I'm using rustc 1.36.0 and this is a 2018-edition project, if that matters.

ByteSlice::trim on ASCII whitespace is substantially slower than core::str::trim

ByteSlice::trim (and related) are not competitive with libstd's in the case that the whitespace is ASCII.

The difference is as much as 50%, and is something I noticed when moving some code to use bstr, as a dip in that code's benchmarks.

I'm not an expert, but my understanding is that ASCII whitespace is much more common than non-ASCII whitespace in pretty much all scripts, so it's probably a good idea to optimize for.


Here are two benchmarks that demonstrate the issue: https://gist.github.com/thomcc/d017dec2bf7fbfd017e4f34cfd4db6f8 — it's a gist as it's a bit too long to really be great as a code block. It also contains a diff you can apply to insert them directly into bstr's existing benchmark code (2nd file in the gist).

The first (trim/source-lines) measures the time to trim a bunch of lines of source code (specifically, every line in ext_slice.rs — chosen arbitrarily), and is close to my real use case, where I saw an issue using bstr.

The second (trim/large-ascii-padded) is completely artificial, and just trims a huge string starting and ending with tons of ascii whitespace (with only a single non-whitespace character between it all to ensure both trim_start and trim_end are measured). It's focused on the specific issue, so probably better as a benchmark, but it doesn't reflect a real use case.

The results here show that for the current benchmark (trim/tiny), std and bstr have roughly the same performance, but that std is substantially faster on the new benchmarks:

bstr/trim/tiny          time:   [50.634 ns 50.925 ns 51.261 ns]
                        thrpt:  [502.31 MiB/s 505.63 MiB/s 508.54 MiB/s]
std/trim/tiny           time:   [50.592 ns 50.743 ns 50.917 ns]
                        thrpt:  [505.71 MiB/s 507.45 MiB/s 508.96 MiB/s]

bstr/trim/source-lines  time:   [90.672 us 90.931 us 91.222 us]
                        thrpt:  [1.1964 GiB/s 1.2003 GiB/s 1.2037 GiB/s]
std/trim/source-lines   time:   [55.251 us 55.669 us 56.236 us]
                        thrpt:  [1.9408 GiB/s 1.9605 GiB/s 1.9754 GiB/s]

bstr/trim/large-ascii-padded
                        time:   [9.4068 us 9.4174 us 9.4304 us]
                        thrpt:  [414.32 MiB/s 414.89 MiB/s 415.36 MiB/s]
std/trim/large-ascii-padded
                        time:   [4.1390 us 4.1472 us 4.1559 us]
                        thrpt:  [940.15 MiB/s 942.12 MiB/s 943.99 MiB/s]

ByteSlice functions to skip past or until a set of bytes.

It's common to want to skip past/until a specific set of bytes. C++'s std::string::find_first_of/find_first_not_of are an example.

The current API has trim_start_with and trim_end_with, which can replace some uses of this, but they require unicode (#12) and can often be much slower than an implementation that leverages memchr when the set of bytes is small.

I had some helpers for this, and ended up disappointed that I couldn't really replace them (I could replace some uses with trim_start_with, but possibly more slowly, due both to extra UTF-8 decoding and to not being able to use memchr in the until case; both could be accelerated in the same manner, of course).

My implementations are here (skip_until/skip_while) https://gist.github.com/thomcc/a39c9bf5c7c50b0db1e5f1d4f92429a7 in case that's interesting or it's unclear what I mean.

Additionally, while I wouldn't have used them, presumably versions starting from the right, and versions along the lines of fields, would be helpful. (Actually, had a fields version of this existed, it would probably have replaced some of my uses of these functions.)

That said, this is getting to be a lot of functions 🙁 -- obviously this could be done with some pattern-esque API, but you've mentioned not being interested in a design like that, which I completely respect (and it makes the docs much clearer). Anyway, I don't think this case is that niche, but I'd understand a desire not to increase the number of functions too far.

find_byteset (and related) should return matches for an empty set

This code

use bstr::{B, ByteSlice};

fn main() {
    let haystack = B("abcxyz");
    println!("{:?}", haystack.find_byteset(""));
}

prints None, but it should print Some(0). Also, it seems like the empty set is contained in the empty haystack, so this

use bstr::{B, ByteSlice};

fn main() {
    let haystack = B("");
    println!("{:?}", haystack.find_byteset(""));
}

should also print Some(0) I think?

This behavior is consistent with substring searches for an empty needle.

Goals of this crate

(Continuation from https://twitter.com/burntsushi5/status/1182677304478175232)


I was wondering what the (implementation) goals of this crate are. I noticed this crate implements pretty much all of the UTF-8 algorithms itself, even those that core already exposes, such as from_utf8. And quite a few of the other algorithms are implemented in core as well, although they are not exposed as stable (e.g. Utf8Lossy).

So a few questions:

  • Why are things like from_utf8 re-implemented (as the validate function)? Is this implementation better, and if so, is the goal to merge it into core at some point? Or are the goals of core::str::from_utf8 and validate in this crate different in some way (e.g. optimizing for something different)?

  • Should we try (through an RFC) to get some of the str_internals of core exposed, because apparently they are useful outside of core, such as next_code_point? If they ever do get stabilized, should the bstr crate use that one instead of keeping its own implementation?

  • Should there be some feature that can be enabled/disabled to turn this crate into a more minimal 'just small wrappers around core' type of crate? Right now, I would avoid using bstr in embedded projects, because—even when only using BStr's Display or something—it adds tables/functions that are more or less duplicates of things that are often also pulled in through core, which can be quite a waste of space (which is scarce on embedded platforms).

The reason I'm asking these questions is that I want to contribute a bit to this crate to make it more useful for a few applications I'm using it for, but I'm not sure what direction this crate should grow in.


In any case, thanks for making this crate. It's very nice! ^^

Write escaped string into a buffer

Hi @BurntSushi,

I'm using bstr for turning a Vec<u8>-like structure into debug strings and error messages. Specifically, I'm working on a Ruby implementation. In Ruby String is a Vec<u8> with a default UTF-8 encoding with no guarantees that the bytes are actually valid UTF-8.

bstr is the means by which I interpret these byte vectors as UTF-8 the best I can.

The fmt::Debug implementation on &BStr is very close to what I'd like, but I cannot use it because it wraps the escaped string in quotes. I need control of the output since these strings are being put into error messages.

I've put together this function for writing the escaped representation to an arbitrary fmt::Write (cribbing heavily from the fmt::Debug impl on &BStr).

use std::fmt;

// `WriteError` is the caller's error type; the `?` on `write!` below
// requires it to convert from `fmt::Error`.
pub fn escape_unicode<T>(mut f: T, string: &[u8]) -> Result<(), WriteError>
where
    T: fmt::Write,
{
    let buf = bstr::B(string);
    for (start, end, ch) in buf.char_indices() {
        if ch == '\u{FFFD}' {
            for byte in buf[start..end].as_bytes() {
                write!(f, r"\x{:X}", byte)?;
            }
        } else {
            write!(f, "{}", ch.escape_debug())?;
        }
    }
    Ok(())
}

Here's an example usage:

let mut message = String::from("undefined group name reference: \"");
string::escape_unicode(&mut message, name)?;
message.push('"');
Err(Exception::from(IndexError::new(interp, message)))

I'm trying to generate a message like this:

$ ruby -e 'm = /(.)/.match("a"); m["abc-\xFF"]'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `[]': undefined group name reference: "abc-\xFF" (IndexError)

Is this patch something you would consider upstreaming?
