Code Monkey home page Code Monkey logo

Comments (39)

joshgoebel avatar joshgoebel commented on September 27, 2024

What should the correct behavior here be with regards to readLine then? It seems there is a conflict... if you read a single byte of a 4 byte UTF-8... what is readLine to do if you call it next? Your "position" in the buffer is now invalid for returning lines... the next 3 bytes are invalid...

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

One idea would be for readByte to have a mini-buffer of it's own... such that readByte always pulls a full UTF-8 character off the buffer and then subsequent calls returns the remaining bytes... but if sone were to alternate willy-nilly between readByte and readLine one could still lose data.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

readLine should technically returns a buffer no matter what encoding errors happens in there. Strictly speaking UTF-8 encoding is supported but not required to be in a String. If we are in the middle of an UTF-8 char operator[_] should handle it somehow (or bomb which would be saner than what we currently have).

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

If we are in the middle of an UTF-8 char operator[_] should handle it somehow (or bomb which would be saner than what we currently have).

You mean add the support to String itself? Right now I can't even see any easy way of truncating the string and cutting off a portion of a character... it seems all the indexing stuff is based on full characters.

I could definitely see the path to making readLine throw an error if readByte was in the "middle" of a character though but then it almost feels like you need some API to check the status of the stream/buffer...

from wren-cli.

ruby0x1 avatar ruby0x1 commented on September 27, 2024

If you mean indexing on String, indices are bytes. See the description here.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

To solve the original problem, the proper solution would be to add range support to StringByteSequence.[_] or do some proper buffering for efficiency.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

@ruby0x1 The doc is correct, but the non documented range implementation is not byte indexed....

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

If you mean indexing on String, indices are bytes.

Yes, right, I read that... but the behavior around "half-characters" is what's causing this bug.

__buffered = __buffered[lineSeparator + 1..-1]

Trying to advance the buffer by a single index seems to eat forward to the next whole UTF-8 character. So that results in bytes being dropped. I already have a fix if you assume that one doesn't use both readByte and readLine which seems problematic.

the proper solution would be to add range support to StringByteSequence.[_]

How would that help here? It seems (I suppose I'm assuming) that readLine expects to work with UTF-8 [strings] while readByte expects to work with bytes... The problem seems to be a desire to hold a cursor at TWO different places in the buffer concurrently.

// add some UTF-8 charcters to the buffer
Stdin.readByte() 
// we're now in the "middle" of a UTF-8 character
Stdin.readLine() 

What is readLine() expected to return here exactly? I was suggesting an error but then it would seem we need to add an API to detect this or advance to the next full-character so that user land could handle this, no?

If it could return the broken string at the exact correct position that might be one answer, but I'm not sure if that's possible or what that would look like.

Is Fiber.abort what one would use here to raise an error?

from wren-cli.

ruby0x1 avatar ruby0x1 commented on September 27, 2024

@mhermier on the same page, I documented it recently, but the fact it's not indexed in bytes should be documented too.

image

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

@ruby0x1 test it with unicode characters, results should be broken, and be based on UTF-8 indexes:

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

EDIT: Previous message contained a broken test.

The following patch should be more conform to the byte indexing specification. I can make a branch if required.

commit 5baaffacb4909c83ea9302c8feed582d101e2600
Author: Michel Hermier <[email protected]>
Date:   Sun Apr 25 21:58:31 2021 +0200

    wren/core: Fix `String` range subscript so that it conforms to bytes indexing requirement.

diff --git a/src/vm/wren_value.c b/src/vm/wren_value.c
index 2700c4def..323517034 100644
--- a/src/vm/wren_value.c
+++ b/src/vm/wren_value.c
@@ -724,26 +724,12 @@ Value wrenNewStringLength(WrenVM* vm, const char* text, size_t length)
 Value wrenNewStringFromRange(WrenVM* vm, ObjString* source, int start,
                              uint32_t count, int step)
 {
+  ObjString* result = allocateString(vm, count);
   uint8_t* from = (uint8_t*)source->value;
-  int length = 0;
-  for (uint32_t i = 0; i < count; i++)
-  {
-    length += wrenUtf8DecodeNumBytes(from[start + i * step]);
-  }
-
-  ObjString* result = allocateString(vm, length);
-  result->value[length] = '\0';
-
   uint8_t* to = (uint8_t*)result->value;
   for (uint32_t i = 0; i < count; i++)
   {
-    int index = start + i * step;
-    int codePoint = wrenUtf8Decode(from + index, source->length - index);
-
-    if (codePoint != -1)
-    {
-      to += wrenUtf8Encode(codePoint, to);
-    }
+    to[i] = from[start + i *step];
   }
 
   hashString(result);
diff --git a/test/core/string/subscript_range.wren b/test/core/string/subscript_range.wren
index f4d6cd298..6c8e4ddb9 100644
--- a/test/core/string/subscript_range.wren
+++ b/test/core/string/subscript_range.wren
@@ -42,12 +42,13 @@ System.print("abc"[3..-1] == "") // expect: true
 // Bytes:           11111
 //        012345678901234
 // Chars: sø mé ஃ  thî ng
+System.print("søméஃthîng".bytes.join(", ")) // expect: 115, 195, 184, 109, 195, 169, 224, 174, 131, 116, 104, 195, 174, 110, 103
 System.print("søméஃthîng"[0..3]) // expect: søm
 System.print("søméஃthîng"[3...10]) // expect: méஃt
 
 // Only includes sequences whose first byte is in the range.
-System.print("søméஃthîng"[2..6]) // expect: méஃ
-System.print("søméஃthîng"[2...6]) // expect: mé
-System.print("søméஃthîng"[2...7]) // expect: méஃ
+System.print("søméஃthîng"[2..6].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
+System.print("søméஃthîng"[2...6].bytes.join(", ")) // expect: 184, 109, 195, 169
+System.print("søméஃthîng"[2...7].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
 
 // TODO: Strings including invalid UTF-8.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

Personally I find the string tests easier to read than the byte value tests... from the tests though I can't tell that any behavior changed here? Looks like it's doing exactly what it was before?

from wren-cli.

ruby0x1 avatar ruby0x1 commented on September 27, 2024

It changed indexing with a range (substring) to use bytes instead of whole characters. A single character can be multiple bytes which was what happened before, it counts those, the change doesn't.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

It is completely different, since some produce invalid UTF-8 strings that makes the test suite explodes.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

Yeah I think I see now, it just was very confusing that the entire format of the tests also changed so you can't look at the tests diff at a glance to see the diff.

I assume previously that in this specific case the beginning bytes were "rounded up" and the end bytes were also to the next whole character (that's how I'm reading it).

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

So this patch would actually resolve the whole issue here I think... but PERHAPS introduce the problem where someone calling readLine would blow up because they retrieved invalid UTF-8... (I'm not sure what the behavior of indexOf is in this scenario) and that would be a problem for the user?

And can + on a String still operate if the original string is not valid UTF-8? IE, can we still append the buffer here?

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

Yes, and a little bit more than that, if there was legitimate invalid chars in the middle of the sequence they would have bin previously silently removed.

The change is breaking some existing code, since now we don't trim invalid character anymore, next valid character is no longer reached at index 1...

+ should be safe, it does some printf magic, so it does not really care about invalid bytes, since it blindly memcpy wren String content.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

However, when previously it was possible to revert utf-8 strings using negative range steps, it does only work with ascii-7 strings now...

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

if there was legitimate invalid chars in the middle of the sequence they would have bin previously silently removed.

I don't follow this and I'm not sure what a "legitimate invalid char" is. :)

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

Legitimate like, you use an invalid utf8 char on purpose to make a split between 2 strings, in a protocol.

Well while it fix somehow, the fact that we have bytes indexing in String is a not really logical to me, it would be more logical if this was accessible through StringByteSequence

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

the fact that we have bytes indexing in String is a not really logical to me,

Agree, it's more than a tad bit confusion.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

Rethinking at it, maybe we should provide the old behavior unchanged, and provide the new one via StringByteSequence operator [_]. Previous one was allowing to invert letters of the string, so even if indexing was odd it has some usage (even if some test where missing)

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

Argg we are doomed, StringByteSequence[_] returns a Num ... Have to sleep on this, to see if an idea pops up.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

and provide the new one via StringByteSequence operator [_]

Doesn't that already work? The problem this issue is raising is that StringByteSequence is being used for readBytes where-as the String (with it's indexing behavior) is being used to advance the cursor.

readByte itself can be fixed by buffering each character and only advancing after all the bytes in the character are read:

  static readByte() {
    return read_ {
      if (__char==null) {
        __offset = 0
        __char = __buffered[0]
        __max = __char.bytes.count
      }
      // Peel off the next byte.
      var byte = __char.bytes[__offset]
      __offset = __offset + 1

      if (__offset == __max) {
        __char = null
        __buffered = __buffered[1..-1]
      }

      return byte
    }
  }

But then you're left with specifying what the correct behavior of readLine should be... or readCodePoint as I suggested also.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

That does not work, StringByteSequence operator [_] Does not work with range because smart index returns the byte as Num.

Yup, this is a mess from the start, I don't know what to think. As this your code only works on valid UTF-8. For correctness and efficiency you should store __buffered.bytes and peel off bytes from it.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

StringByteSequence operator [_] Does not work with range because smart index returns the byte as Num.

I guess I'm not sure why we need it to? It would really help if we back-up and first specified the correct/desired behavior here otherwise we're just chasing our tails without end. How are readBytes and readLines intended to interact in this edge case? My assumptions:

  • readBytes always returns the next byte from the string, encoding is irrelevant
  • readLines returns native UTF-8 strings because that's our native String storage implementation

. For correctness and efficiency you should store __buffered.bytes and peel off bytes from it.

You're talking about the case where our input is just random gibberish? That this sort of relies on the input being UTF-8 at core...? I went with that because I feel like the underlying string implementation does that anyways, but I guess that's not 100% true if + is a memcpy...

SO... lets go back and figure out the correct behavior first. I get the impression the String semantics of Wren have been designed with intention (even if they are confusing) and that the correct solution here may not be to just break String, effectively turning it into a byte sequence. :-) So:

// first input some UTF-8 characters to STDIN the buffer
// then
Stdin.readByte() 
// we're now in the "middle" of a UTF-8 character
Stdin.readLine() 

What is readLine() expected to return here exactly? Or should it error? OTTOMH I feel like it should error and someone wanting to mix UTF-8 and bytes (building a protocol layer, etc) should do so very carefully. It seems more common that one would be reading most input via bytes or lines, rather than mixing.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

I may have to bow out of this one. I thought I could solve it by storing a list of strings/packets as they arrive, but you end up left with all sorts of edge cases... what if a line is spread across multiple packets, what if a codePoint is? You can lazy join the packets before every read, but how many packets? Worse case you might scan the WHOLE input stream looking for a line ending you never find. And compiling strings into bigger ones just makes it harder to advance the cursor since we have no great way to do that one byte at a time... (while also GCing the old data you've moved past)

The simplest thing I tried was just storing the stream as a List of bytes... but this is slow because of all the conversions and also the cost of removing items from the beginning of a list. You might say that doesn't matter for typing in the Repl - but if you wanted to do string processing on a huge input file you were piping in suddenly it would matter a lot.

The fastest (and simplest) way to do this might just be using a circular buffer on the C side... store the data as literal bytes in RAM, append as needed... never any need to copy, transform, convert to or from lists. "Truncating" the beginning of the buffer is instant as it's just updating the head to point to the new location in the buffer. The most expensive operation then would become readCodePoint which requires building UTF-8 characters/codes.

Thats my thought for now. Happy to discuss further with anyone who wanted to pick this up and run with it.


If we only care about solving the exact issue filed here you could accomplish that with:

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join() 

This would be far more correct behavior... at the cost of rebuilding the whole buffer every time a character is read...

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

I think we can reduce the problem by giving a max size to readLine(), so in code it should be something like:

readLine() { readLine(1024) }
readLine(maxSize) { ... }

That way you upper bound lookup and buffering.

Making a circular buffer, would also requires such upper bound as the size of the buffer. So API wise, it would not change anything as the naive implementation. So we can go the naive implementation for now to make it work, and if it really become a bottle neck, we could go to a dedicated ring buffer.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

Scoping readLine indeed helps reduce worse case on one of the edge cases. :-)

Making a circular buffer, would also requires such upper bound as the size of the buffer.

Of course, but the bigger win is in all the other things you'd get for free. Anyways I just burnt out on this one. So many other things I can better contribute to. Looking forward to seeing if this can truly be solved simply and if so how well that will hold up over time. :-)

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024
__buffered = __buffered[1..-1]

Ruby, out of curiosity just how performant was the original to begin with? Is it smart enough to do a memcpy of the original string when given a range? I think because of UTF-8 it has to walk the whole string character by character, yes?

If so then perhaps the "simple fix" posted above might not even be that terrible of a performance regression.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

The performance is poor, as it was only really used in the cli. So it is not really important to be efficient for now, since most of the effort is still on the VM, but yet we need something to be functional for the cli to show case.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

Well if it was already quite slow as-is then I'd recommend we just patch it as I suggested above and close out this issue. It'll indeed be fine enough either way for reading CLI input from the keyboard.

I can still make a PR if I can get a confirmation everyone is on board with that solution.

from wren-cli.

ruby0x1 avatar ruby0x1 commented on September 27, 2024

Slow is relative, it's probably not slow in the sense you're expecting. It's not the fastest it possibly could be but that doesn't mean performance is poor.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

Slow is relative, it's probably not slow in the sense you're expecting.

Actually, it is slow in the sense I'm expecting. 🙃 I wasn't suggesting the performance was inadequate. I was meaning/expecting "in an absolute sense" (compared to C optimized code) not whether it's "fast enough" practically speaking.

My point was only IF it was already super optimized on the C side that I'd be more hesitant to drop in a change that then made it 20x slower. BUT since that's not the case I'll again suggest the tiny patch of just converting to bytes, shifting the data and converting back to a string.

Keeping in mind the actual bug/issue here is that we're reading bytes, but shifting by UTF-8. I feel like I may have gone down a rabbit hole and expanded the scope of the discussion too much originally vs fixing the issue and perhaps creating follow-ups for the larger concerns.

The bad code:

var byte = __buffered.bytes[0] // This operates on bytes
__buffered = __buffered[1..-1] // This operates on UTF8 strings

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join() 

Do you have any specific thoughts on that as a fix?

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

Maybe we could add a String.bytesAt(index_or_range) or alike, that would return a new string but that would operate on bytes?
It would not solve everything, but would allow a fast byte array manipulations, and would have a low impact on the API.

from wren-cli.

joshgoebel avatar joshgoebel commented on September 27, 2024

I think that would certainly make some sense, though what to name it is a question. bytesAt to me doesn't make me think I'm getting back a string.

from wren-cli.

mhermier avatar mhermier commented on September 27, 2024

Well thing is that I don't find a more sensible name to it, and have nothing more to propose for now.

The other possibility would be to make it a StringByteSequence method, but it would cause a double allocation (String and the new StringByteSequence).

from wren-cli.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.