It seems that stupidedi does not support UTF-8 characters when reading out an EDI. I ge


Support for reading UTF-8 strings about stupidedi HOT 6 CLOSED

shioyama commented on May 22, 2024

Support for reading UTF-8 strings

from stupidedi.

Comments (6)

shioyama commented on May 22, 2024

I've just done this for now:

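# Stopgap: widen C_BYTES to every codepoint up to U+FFFF; the rescue modifier skips invalid ones (surrogates raise RangeError).
C_BYTES = (0..65535).inject(""){|string, c| (string << c) rescue nil; string}.freeze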

Not really a solution, but enough so I can continue testing.

kputnam commented on May 22, 2024

Unfortunately, X12 does not support UTF-8. The specification lists the set of allowed characters which is H_BASIC (a subset of C_BYTES). There is an extended set of characters, H_EXTENDED, which can be used if both trading partners agree.

If you're constructing an X12 document to send to someone, you may have to transcode your UTF-8 to the limited character set. I would be surprised if extending C_BYTES works, simply because it wasn't intended to. You might try generating a document, writing it out as X12, and then reading it back in (even just using edi-pp) to make sure it looks right.
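
To see whether a string would need transcoding at all, each character can be checked against the allowed sets. The following is a minimal sketch, not stupidedi code: the character lists are my assumption of the commonly cited X12 basic and extended sets (they are not copied from H_BASIC or H_EXTENDED), and `disallowed_characters` is a hypothetical helper.

```ruby
require "set"

# Assumed X12 basic set: uppercase letters, digits, space, and some punctuation.
# Verify these lists against your own copy of the specification.
BASIC_CHARS = Set.new(("A".."Z").to_a + ("0".."9").to_a + "!\"&'()*+,-./:;?= ".chars).freeze

# Assumed extended set: lowercase letters plus a few more symbols.
EXTENDED_CHARS = Set.new(("a".."z").to_a + '%~@[]_{}\\|<>#$'.chars).freeze

# Hypothetical helper: returns the distinct characters outside the allowed sets.
def disallowed_characters(text, allow_extended: false)
  allowed = allow_extended ? BASIC_CHARS | EXTENDED_CHARS : BASIC_CHARS
  text.each_char.reject { |ch| allowed.include?(ch) }.uniq
end

disallowed_characters("TOKYO 180-0004")               # nothing to report
disallowed_characters("Tōkyō", allow_extended: true)  # reports "ō"
```

An empty result suggests the string can go out as-is; anything else has to be transliterated or agreed on with the trading partner.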

kputnam commented on May 22, 2024

I'm trying to think through how it would work, but one thing that might cause trouble is Reader.is_control_character? returns true if a character isn't in H_BASIC or H_EXTENDED. So I think most UTF-8 would be treated as control characters.

From what I can work out, the consume_isa method in StreamReader looks for the start of the document, which is always ISA, and ignores control characters. That should be OK unless something like IあああSああA occurs before the X12 part of the file starts: it would be tokenized as ISA and taken for the beginning of an X12 document. Probably most people have files that are entirely X12, but I had files with an arbitrary header message before the ISA token (the spec doesn't forbid this), so consume_isa is written to skip that.

So to summarize, StreamReader figures out where the X12 starts in a stream of arbitrary characters. Because it would throw away the new UTF-8 characters (it thinks they are control characters), it might identify a sequence of characters as the start of the X12 when it isn't. That seems unlikely to actually happen unless your X12 files have random junk in between the ISA/IEA envelopes.
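
The false match described above is easy to reproduce outside stupidedi. This sketch does not call StreamReader or consume_isa; `drop_non_ascii` is a hypothetical stand-in for any reader that silently discards characters it does not recognize.

```ruby
# Hypothetical stand-in: keep ASCII characters, silently drop everything else.
def drop_non_ascii(text)
  text.each_char.select(&:ascii_only?).join
end

header = "IあああSああA"

drop_non_ascii(header)                     # => "ISA"
drop_non_ascii(header).start_with?("ISA")  # => true, a spurious start of interchange
```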

Next, TokenReader scans the stream of characters for segment identifiers like ST or GE, for specific characters or delimiters, or for entire parts of a segment, such as all of its elements. In most of these functions, when reading until a particular substring is matched, any "control characters" are thrown out as if they weren't present in the input. So I think this will probably cause all of the new UTF-8 characters to be discarded, since they are classified as control characters along with things like line endings.

If you notice that happening, then you might look at Reader.is_control_character? and change it so that everything it previously classified as control characters (e.g., \n, \t, \f, \v, and various single-byte characters) is still a control character, but the characters that you've added above 255.chr aren't. That might actually work!
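
As a rough illustration of that rule, here is a standalone sketch. It is not stupidedi's Reader.is_control_character?: `control_character?` is a hypothetical method, and `ALLOWED` merely stands in for H_BASIC plus H_EXTENDED by using printable ASCII.

```ruby
# Stand-in for H_BASIC + H_EXTENDED (assumption: printable ASCII only).
ALLOWED = (0x20..0x7E).map(&:chr).freeze

# Hypothetical classifier implementing the rule described above.
def control_character?(ch)
  return false if ch.ord > 255  # characters above 255.chr are kept
  !ALLOWED.include?(ch)         # single-byte characters keep the old rule
end

control_character?("\n")  # => true
control_character?("A")   # => false
control_character?("あ")  # => false
```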

shioyama commented on May 22, 2024

@kputnam Thanks very much for the detailed reply! If X12 does not support UTF-8, then I think that's enough to convince me to not use it. Actually our partner asked us if we could send it in non-UTF-8 characters so I think we'll have to do that.

Now just have to think of a way to map UTF-8 (Japanese) addresses into corresponding English addresses... which is a slightly different problem.

kputnam commented on May 22, 2024

Good luck!

shioyama commented on May 22, 2024

For anybody who encounters this problem, geocoder (http://www.rubygeocoder.com/) is your friend:

result = Geocoder.search("東京都武蔵野市吉祥寺本町二丁目5番10号 いちご吉祥寺ビル").first
result.address
=> "2 Chome-5-10 Kichijōji Honchō, Musashino-shi, Tōkyō-to 180-0004, Japan"
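
Note that the romanized address above still contains macrons (ō), which fall outside the basic and extended sets mentioned earlier in the thread. One possible follow-up step, sketched here with Ruby's built-in `String#unicode_normalize`, strips the diacritics and upcases the result. `to_basic_latin` is a hypothetical helper, and whether dropping macrons is acceptable is an assumption to confirm with your trading partner.

```ruby
# Hypothetical helper: reduce accented Latin text to uppercase ASCII letters.
def to_basic_latin(text)
  text.unicode_normalize(:nfd)  # split "ō" into "o" + combining macron
      .gsub(/\p{Mn}/, "")       # drop the combining marks
      .upcase
end

to_basic_latin("2 Chome-5-10 Kichijōji Honchō, Musashino-shi, Tōkyō-to 180-0004, Japan")
# => "2 CHOME-5-10 KICHIJOJI HONCHO, MUSASHINO-SHI, TOKYO-TO 180-0004, JAPAN"
```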
