Comments (11)
I'm not sure if String#length is the right conditional here. Look at this for Binary and ASCII strings:
$ irb
[3.1.2] > s = "\u{1F600}"
=> "😀"
[3.1.2] > s.encoding
=> #<Encoding:UTF-8>
[3.1.2] > s.length
=> 1
[3.1.2] > a = s.force_encoding(Encoding::ASCII)
=> "\xF0\x9F\x98\x80"
[3.1.2] > b = s.b
=> "\xF0\x9F\x98\x80"
[3.1.2] > a.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > b.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > b == a
=> false
[3.1.2] > a == b
=> false
I think we'll have to dive through the MRI sources to find out what the extra conditional check is.
from artichoke.
That length comparison should be self.char_len(). len() is not encoding aware, it just looks at the underlying bytes.
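For anyone following along in Ruby, here is a rough illustration of the same distinction. It assumes that char_len() plays the role of String#length and len() the role of String#bytesize; that mapping is my reading of the comment, not something confirmed by the spinoso-string sources.

```ruby
s = "\u{1F600}"  # one character, four bytes in UTF-8
b = s.b          # a copy with the same bytes, tagged ASCII-8BIT

puts s.length    # characters under UTF-8
puts s.bytesize  # underlying bytes
puts b.length    # binary strings count one character per byte
puts s.bytes == b.bytes
```

So two strings can hold identical bytes and still disagree on their character count once the encoding is taken into account.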
from artichoke.
Let's add test cases for this in the string_test.rb functional tests.
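A minimal sketch of what such cases might look like. The check_eq helper is hypothetical (I have not checked how string_test.rb structures its checks), and the expected values are what MRI 3.1.2 reports for these strings.

```ruby
# Hypothetical helper; the real string_test.rb may organize its checks differently.
def check_eq(left, right, expected)
  unless (left == right) == expected
    raise "expected #{left.inspect} == #{right.inspect} to be #{expected}"
  end
  # String#== should give the same answer in both directions.
  unless (right == left) == expected
    raise "expected #{right.inspect} == #{left.inspect} to be #{expected}"
  end
  true
end

emoji = "\u{1F600}"

# Same bytes, different encodings, non-ASCII content.
check_eq(emoji, emoji.b, false)
check_eq(emoji.b.force_encoding(Encoding::US_ASCII), emoji.b, false)

# Same bytes, different encodings, ASCII-only content.
check_eq("A", "A".b, true)
check_eq("A".b.force_encoding(Encoding::US_ASCII), "A".b, true)
```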
from artichoke.
Suspect:
[3.1.2] > s = "\u{1F600}"
=> "😀"
[3.1.2] > s.force_encoding(Encoding::ASCII)
=> "\xF0\x9F\x98\x80"
[3.1.2] > t = s.b
=> "\xF0\x9F\x98\x80"
[3.1.2] > s == t
=> false
[3.1.2] > s.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > t.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > s.valid_encoding?
=> false
[3.1.2] > t.valid_encoding?
=> true
[3.1.2] > s = "A"
=> "A"
[3.1.2] > s.force_encoding(Encoding::ASCII)
=> "A"
[3.1.2] > t = s.b
=> "A"
[3.1.2] > s == t
=> true
[3.1.2] > s.valid_encoding?
=> true
[3.1.2] > t.valid_encoding?
=> true
e.g. \xF0 is not a valid US-ASCII char. Ruby is still doing its best to show the underlying bytes, of course. And since all bytes are valid in ASCII-8BIT, t.valid_encoding? is true.
from artichoke.
Hey!
I will be happy to try to solve this issue. Is there anything else I should know about it?
from artichoke.
Go for it @AI-Mozi
from artichoke.
@AI-Mozi In case it helps, I think this line in the Ruby docs is the crux of the issue:
Returns false if the two strings' encodings are not compatible
☝️ from here: https://ruby-doc.org/core-3.1.2/String.html#method-i-3D-3D
I spent some time with a colleague at work thinking about what this actually means, and I think I now have a grasp. Sorry in advance if my explanation is lacking and/or this becomes a thesis 😬.
In short, it comes down to how the characters are represented, rather than their binary values. This can hopefully be explained by these two code pages:
- https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
- https://en.wikipedia.org/wiki/ISO/IEC_8859-2#Code_page_layout
Here, you can see that \x30 is represented by 0 in both code pages. However, later on, \xA1 is represented by ¡ in 8859-1, whereas it is Ą in 8859-2. So although their binary contents are the same, how they are displayed to the user would be different.
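One way to see the two code pages disagree without relying on the terminal is to transcode the same byte to UTF-8 from each encoding. This is only an illustrative sketch; the variable names are mine.

```ruby
latin1 = "\xA1".b.force_encoding(Encoding::ISO_8859_1)
latin2 = "\xA1".b.force_encoding(Encoding::ISO_8859_2)

# Same byte, different characters once the code page is applied.
puts latin1.encode(Encoding::UTF_8)  # ¡ (U+00A1)
puts latin2.encode(Encoding::UTF_8)  # Ą (U+0104)

# \x30 maps to the same character in both code pages.
zero1 = "\x30".b.force_encoding(Encoding::ISO_8859_1)
zero2 = "\x30".b.force_encoding(Encoding::ISO_8859_2)
puts zero1.encode(Encoding::UTF_8) == zero2.encode(Encoding::UTF_8)
```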
Note, this is confusing in ruby, because it's reliant on whether your shell is set up to view these character sets or not (I believe this is what the limitation is, but not 100% sure - echo $LANG and you'll likely see UTF-8 for example). How this manifests is as follows:
[3.1.2] > s = "\x30".force_encoding(Encoding::ISO_8859_1)
=> "0"
[3.1.2] > t = "\x30".force_encoding(Encoding::ISO_8859_2)
=> "0"
[3.1.2] > s.encoding
=> #<Encoding:ISO-8859-1>
[3.1.2] > t.encoding
=> #<Encoding:ISO-8859-2>
[3.1.2] > s == t
=> true
[3.1.2] > u = "\xA1".force_encoding(Encoding::ISO_8859_1)
=> "\xA1"
[3.1.2] > v = "\xA1".force_encoding(Encoding::ISO_8859_2)
=> "\xA1"
[3.1.2] > u == v
=> false
Some analysis:
- \x30 outputs as a 0 since this symbol is encoded the same as what the shell supports (in UTF-8, \x30 is also represented by 0)
- \xA1 is output as hex, since that character is not the same in UTF-8 as it is in the encoding specified. e.g. My shell knows this text is not UTF-8, but also it doesn't know how to output it
- When ruby says: "Returns false if the two strings' encodings are not compatible" - I believe it's saying: "If all characters are represented in the same way in both encodings, then it's equal, otherwise it is not"
☝️ In saying all the above, Artichoke currently only supports Binary, ASCII, and UTF-8 strings. The good news is that the first 128 characters (0 indexed) are represented the same across all of these encodings. e.g. any two strings with identical bytes that only include the characters \x00 => \x7F should be equal, regardless of encoding. If the strings contain anything \x80 and above and their encodings differ, I'd expect them not to be equal. Proof:
[3.1.2] > s = "\x80"
=> "\x80"
[3.1.2] > t = s.b
=> "\x80"
[3.1.2] > u = s.dup.force_encoding(Encoding::ASCII)
=> "\x80"
[3.1.2] > s.encoding
=> #<Encoding:UTF-8>
[3.1.2] > t.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > u.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > s == t
=> false
[3.1.2] > t == u
=> false
[3.1.2] > s == u
=> false
Hopefully the above makes some sense. I'm not actually sure how/where I'd implement this, but I thought the info might help
from artichoke.
In saying this:
[3.1.2] > u = "\xD6".force_encoding(Encoding::ISO_8859_1)
=> "\xD6"
[3.1.2] > v = "\xD6".force_encoding(Encoding::ISO_8859_2)
=> "\xD6"
[3.1.2] > u == v
=> false
☝️ I'm not sure if this is "correct" from MRI. In both code pages, this would be Ö, so "in theory" they should be equal, since those two encodings are compatible for that character. I imagine MRI returns false because it would be massively hard to manage giant equality tables of "this code point looks like this here" etc. Although I do like the challenge of making a Rust library that can do this sort of equality 😬.
Note, I suspect MRI (and we could also use this same logic in Artichoke) uses these conditions for equality:
- internal bytes are the same
- Encoding is the same OR both strings only contain ascii chars
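To sanity check those two conditions, here is a small sketch that compares them against what String#== returns for the strings discussed in this thread. suspected_eq? is a hypothetical name, and this is a guess at the rule, not MRI's actual implementation.

```ruby
# Hypothetical predicate for the two conditions listed above; not MRI's code.
def suspected_eq?(a, b)
  # Condition 1: the internal bytes are the same.
  return false unless a.b == b.b
  # Condition 2: same encoding, or both strings contain only ASCII characters.
  a.encoding == b.encoding || (a.ascii_only? && b.ascii_only?)
end

def tagged(bytes, encoding)
  bytes.b.force_encoding(encoding)
end

cases = [
  [tagged("\x30", Encoding::ISO_8859_1), tagged("\x30", Encoding::ISO_8859_2)],
  [tagged("\xA1", Encoding::ISO_8859_1), tagged("\xA1", Encoding::ISO_8859_2)],
  [tagged("\xD6", Encoding::ISO_8859_1), tagged("\xD6", Encoding::ISO_8859_2)],
  ["\u{1F600}", "\u{1F600}".b],
  [tagged("A", Encoding::US_ASCII), "A".b],
]

cases.each do |a, b|
  puts "#{a.encoding} vs #{b.encoding}: sketch=#{suspected_eq?(a, b)} mri=#{a == b}"
end
```

If the sketch and MRI columns agree for every pair, the same two checks could plausibly back the equality implementation in Artichoke.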
from artichoke.
Hey! I've had a bit of a break but now I'd like to complete this task :)
Is spinoso-string the only place where I should add changes?
And could you please provide some examples of tests that would help me check if my changes are correct?
from artichoke.
hi @AI-Mozi. you'll want to modify the PartialEq implementation on EncodedString to also check for the left and right sides having the same char_len:
artichoke/spinoso-string/src/enc/mod.rs
Lines 60 to 74 in 2db5303
from artichoke.
And that's all? Just check if they have the same char_len?
from artichoke.