
encoding_rs_io's Introduction

encoding_rs_io

This crate provides streaming adapters for the encoding_rs crate. Adapters implement the standard library I/O traits and provide streaming transcoding support.

Documentation

https://docs.rs/encoding_rs_io

Usage

Add this to your Cargo.toml:

[dependencies]
encoding_rs_io = "0.1"

and this to your crate root:

extern crate encoding_rs_io;

Example

This example shows how to create a decoder that transcodes UTF-16LE (the source, indicated by a BOM) to UTF-8 (the destination).

extern crate encoding_rs;
extern crate encoding_rs_io;

use std::error::Error;
use std::io::Read;

use encoding_rs_io::DecodeReaderBytes;

fn main() {
    example().unwrap();
}

fn example() -> Result<(), Box<Error>> {
    let source_data = &b"\xFF\xFEf\x00o\x00o\x00b\x00a\x00r\x00"[..];
    // N.B. `source_data` can be any arbitrary io::Read implementation.
    let mut decoder = DecodeReaderBytes::new(source_data);

    let mut dest = String::new();
    // decoder implements the io::Read trait, so it can easily be plugged
    // into any consumer expecting an arbitrary reader.
    decoder.read_to_string(&mut dest)?;
    assert_eq!(dest, "foobar");
    Ok(())
}
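
If the source encoding is known ahead of time, or the input has no BOM, the encoding can also be set explicitly via DecodeReaderBytesBuilder. A minimal sketch along the same lines as the example above (the windows-1252 input bytes are illustrative):

extern crate encoding_rs;
extern crate encoding_rs_io;

use std::io::Read;

use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() {
    // "résumé" encoded as windows-1252 (é is 0xE9).
    let source_data = &b"r\xE9sum\xE9"[..];
    let mut decoder = DecodeReaderBytesBuilder::new()
        .encoding(Some(encoding_rs::WINDOWS_1252))
        .build(source_data);

    let mut dest = String::new();
    decoder.read_to_string(&mut dest).unwrap();
    assert_eq!(dest, "résumé");
}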

Future work

Currently, this crate only provides a way to get possibly valid UTF-8 from some source encoding. There are other transformations that may be useful that we could include in this crate. Namely:

  • An encoder that accepts an arbitrary std::io::Write implementation and takes valid UTF-8 and transcodes it to a selected destination encoding. This encoder would implement std::fmt::Write.
  • A decoder that accepts an arbitrary std::fmt::Write implementation and takes arbitrary bytes and transcodes them from a selected source encoding to valid UTF-8. This decoder would implement std::io::Write.
  • An encoder that accepts an arbitrary UnicodeRead implementation and takes valid UTF-8 and transcodes it to a selected destination encoding. This encoder would implement std::io::Read.
  • A decoder that accepts an arbitrary std::io::Read implementation and takes arbitrary bytes and transcodes them from a selected source encoding to valid UTF-8. This decoder would implement the UnicodeRead trait.

Where UnicodeRead is a hypothetical trait that does not yet exist. Its definition might look something like this:

trait UnicodeRead {
    fn read(&mut self, buf: &mut str) -> Result<usize>;
}

Interestingly, none of the above transformations corresponds to DecodeReaderBytes. Namely, DecodeReaderBytes most closely corresponds to the last option, but instead of guaranteeing valid UTF-8 by implementing a trait like UnicodeRead, it implements std::io::Read, which pushes UTF-8 handling onto the caller. However, it turns out that this particular use case is important for operations like search, which can often be written in a way that doesn't assume UTF-8 validity but still benefits from it.

It's not clear which of the above transformations are actually useful, but all of them could theoretically exist. There is more discussion on this topic in hsivonen/encoding_rs#8 (in particular, the above formulation was taken almost verbatim from Simon Sapin's comments there).

It is also perhaps worth stating that this crate very much intends to remain coupled to encoding_rs, which helps restrict the scope, but may be too biased toward Web-oriented encodings to solve grander encoding challenges. As such, it may very well be that this crate is actually a stepping stone to something with a larger scope. But first, we must learn.

License

This project is licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

encoding_rs_io's People

Contributors

burntsushi, ignatenkobrain, lesnyrumcajs


encoding_rs_io's Issues

[RFE] User-provided encoding detection function

As an example, the XML specification recommends a special encoding detection scheme in cases where the BOM doesn't exist: https://www.w3.org/TR/xml11/#sec-guessing

In the event that a single-byte, ASCII-compatible encoding is being used, you're supposed to inspect the XML declaration to determine which specific encoding to use.

My initial thought about implementing this: the code could just read a full buffer of data (instead of using BomPeeker) and pass a reference to that buffer directly to the encoding detection functions (Encoding::for_bom(&[u8]) and/or a user-provided one), adjusting self.pos if necessary to ignore the BOM.

That would also simplify the code, with the caveat that the user must ensure the buffer is large enough for the detection schemes (e.g., at least 3 bytes for BOM detection), but that feels like a reasonable restriction?

It could look something like this:

use encoding_rs::{Encoding, UTF_16BE, UTF_16LE};

pub fn xml_detect_encoding(bytes: &[u8]) -> Option<&'static Encoding> {
    match bytes {
        // Some BE encoding, for example, UTF-16 or ISO-10646-UCS-2.
        _ if bytes.starts_with(&[0x00, b'<', 0x00, b'?']) => Some(UTF_16BE),
        // Some LE encoding, for example, UTF-16 or ISO-10646-UCS-2.
        _ if bytes.starts_with(&[b'<', 0x00, b'?', 0x00]) => Some(UTF_16LE),
        // Some ASCII-compatible encoding.
        _ if bytes.starts_with(&[b'<', b'?', b'x', b'm']) => {
            unimplemented!(r#"parse the XML declaration's 'encoding' attribute, e.g. <?xml version="1.0" encoding="UTF-8" standalone="no"?>"#)
        }
        _ => None,
    }
}

// Usage with the proposed (not yet existing) `detect_encoding_with` API:
let f = File::open("inputdata.xml")?;
let mut rdr = DecodeReaderBytes::new(f);
rdr.detect_encoding_with(xml_detect_encoding)?;
assert_eq!(rdr.encoding(), UTF_16LE);

It could be a DecodeReaderBytesBuilder option instead of an explicit function call, but it feels like a slightly different category from the others.

DecodeReaderBytes should implement read_to_end

Currently, rg with --multiline, when operating on many files, can be 50x slower without --mmap than with --multiline --mmap.

More than 99% of CPU time is spent in ReadBuf::initialize_unfilled, which is called from default_read_buf, called from default_read_to_end, called from read_to_end here:

        if self.config.heap_limit.is_none() {
            let mut buf = self.multi_line_buffer.borrow_mut();
            buf.clear();
            let cap =
                file.metadata().map(|m| m.len() as usize + 1).unwrap_or(0);
            buf.reserve(cap);
            read_from.read_to_end(&mut *buf).map_err(S::Error::error_io)?;
            return Ok(());
        }

https://github.com/BurntSushi/ripgrep/blob/master/crates/searcher/src/searcher/mod.rs#L911-L919

If buf grows large, then the initialize_unfilled function will zero the entire spare capacity of the vector for every file, irrespective of the file's size, which in my case resulted in 300 GB of memory transfers for only 3 GB of data.

If DecodeReaderBytes implemented the read_to_end function, then it would be able to avoid initializing the entire buffer, only writing to the part of it that actually needs to be written.

(Alternatively, ripgrep could be changed to not call read_to_end, or to not reuse a single Vec for every file.)
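
For reference, a specialized read_to_end can avoid the zeroing entirely by reading into a small stack buffer and appending only the bytes actually produced. A minimal sketch, where Transcoder is a hypothetical stand-in and not the crate's actual type:

use std::io::{self, Read};

struct Transcoder<R> {
    inner: R,
}

impl<R: Read> Read for Transcoder<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }

    // Overriding the default implementation sidesteps
    // default_read_to_end, which initializes the destination's entire
    // spare capacity before every read.
    fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
        let start = buf.len();
        let mut chunk = [0u8; 8 * 1024];
        loop {
            match self.read(&mut chunk) {
                Ok(0) => return Ok(buf.len() - start),
                Ok(n) => buf.extend_from_slice(&chunk[..n]),
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
                Err(e) => return Err(e),
            }
        }
    }
}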

Gate TinyTranscoder behind a feature flag

A provided buffer that's smaller than 7 bytes is probably an extremely rare case. It would be nice if I could skip paying for the larger decoder struct and the branches in that case and just panic (or maybe statically assert?) if I do something silly.

UTF8 to Any Encoder implementation

Is there anything in progress (or any alternative) regarding the first point of the Future work section of the README?

An encoder that accepts an arbitrary std::io::Write implementation and takes valid UTF-8 and transcodes it to a selected destination encoding. This encoder would implement std::fmt::Write.

If not, would you be open to a PR implementing such an encoder?
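
For what it's worth, a rough sketch of such an encoder using encoding_rs's Encoder API might look like the following. EncodeWriter is a hypothetical name, and finalization (a last call to encode_from_utf8 with last set to true) is elided:

use std::fmt;
use std::io::Write;

use encoding_rs::{Encoder, Encoding};

// Wraps any io::Write and implements fmt::Write, transcoding valid
// UTF-8 input to the chosen destination encoding.
struct EncodeWriter<W> {
    encoder: Encoder,
    wtr: W,
    buf: [u8; 4096],
}

impl<W: Write> EncodeWriter<W> {
    fn new(encoding: &'static Encoding, wtr: W) -> EncodeWriter<W> {
        EncodeWriter { encoder: encoding.new_encoder(), wtr, buf: [0; 4096] }
    }
}

impl<W: Write> fmt::Write for EncodeWriter<W> {
    fn write_str(&mut self, mut s: &str) -> fmt::Result {
        while !s.is_empty() {
            // encode_from_utf8 reports how much of `s` it consumed and
            // how many encoded bytes it wrote into `buf`.
            let (_result, nread, nwritten, _replaced) =
                self.encoder.encode_from_utf8(s, &mut self.buf, false);
            self.wtr
                .write_all(&self.buf[..nwritten])
                .map_err(|_| fmt::Error)?;
            s = &s[nread..];
        }
        Ok(())
    }
}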

Always transcode Utf8

I'm using encoding_rs_io to make a stream of always-valid UTF-8, because invalid UTF-8 is not handled upstream. The way the options are laid out at present, there seems to be no way to force transcoding to occur if there is no BOM in the file. I think I found a way to do it by layering multiple DecodeReaderBytes over each other, but I'm unsure that it works in all cases and a little dismayed that it requires multiple layers instead of just an option to force transcoding.

Here's the code I have today:

use std::io::{Cursor, Read};

use encoding_rs::UTF_8;
use encoding_rs_io::DecodeReaderBytesBuilder;

pub fn new_utf8_reader(data: &[u8]) -> impl Read + '_ {
    let cursor = Cursor::new(data);
    // The first layer has utf8-passthrough, and the second no
    // passthrough but an explicit encoding. This unexpected chain was
    // concocted to handle the case where the file has no BOM and is
    // encoded with something other than UTF-8, or contains invalid
    // UTF-8. Basically, this forces transcoding.
    // When there is a non-UTF-8 BOM, the first layer will transcode to
    // UTF-8 (so will the second, redundantly). When there is no BOM or
    // a UTF-8 BOM, the second layer will transcode to UTF-8.
    let uncorrected = DecodeReaderBytesBuilder::new()
        .utf8_passthru(true)
        .build(cursor);

    DecodeReaderBytesBuilder::new()
        .encoding(Some(UTF_8))
        .strip_bom(true)
        .build(uncorrected)
}

Is this the best way to force transcoding to UTF-8 in the presence of unknown data (which may or may not contain a BOM, and may or may not be valid UTF-8), given the API today?

I don't think I'm the only one with this problem. It took some time to figure out an answer. Would it be worth it to do one of the following...

  1. Add this usage to the documentation as a recipe?
  2. Introduce a new 'force transcoding' option?
  3. Add a factory function that does this?

Implement BufRead

I might be missing something, but it seems like DecodeReaderBytes contains an internal buffer for transcoded bytes

    /// The internal buffer to store transcoded bytes before they are read by callers.
    buf: B,

as well as a position in that buffer where reads start, and an index where the transcoded bytes end

    /// The current position in `buf`. Subsequent reads start here.
    pos: usize,
    /// The number of transcoded bytes in `buf`. Subsequent reads end here.
    buflen: usize,

Shouldn't it be possible to trivially implement BufRead on top of that, without needing an external buffer (and hence, extra copying)?
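
Assuming fields like the ones quoted above, and a hypothetical internal helper that refills buf with freshly transcoded bytes when it's exhausted, the impl might look roughly like this:

use std::cmp;
use std::io::{self, BufRead, Read};

impl<R: Read, B: AsMut<[u8]>> BufRead for DecodeReaderBytes<R, B> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        if self.pos >= self.buflen {
            // Hypothetical helper: transcode the next chunk from the
            // underlying reader into `buf`, resetting `pos` and `buflen`.
            self.fill_transcode_buffer()?;
        }
        Ok(&self.buf.as_mut()[self.pos..self.buflen])
    }

    fn consume(&mut self, amt: usize) {
        self.pos = cmp::min(self.pos + amt, self.buflen);
    }
}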
