
Comments (6)

shioyama commented on May 22, 2024

I've just done this for now:

# Surrogate codepoints (0xD800..0xDFFF) raise RangeError and are skipped.
C_BYTES = (0..65535).each_with_object(+"") { |c, s| begin; s << c; rescue RangeError; end }.freeze

Not really a solution, but enough so I can continue testing.

from stupidedi.

kputnam commented on May 22, 2024

Unfortunately, X12 does not support UTF-8. The specification lists the set of allowed characters which is H_BASIC (a subset of C_BYTES). There is an extended set of characters, H_EXTENDED, which can be used if both trading partners agree.

If you're constructing an X12 document to send to someone, you may have to transcode your UTF-8 to the limited character set. I would be surprised if extending C_BYTES works, just because it wasn't intended to. You might try generating a document, writing it out as X12, and then reading it back in (even just using edi-pp) to make sure it looks right.
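
As a rough illustration of the transcoding step, here is a minimal Ruby sketch. `BASIC_CHARS` is a hand-written approximation of the X12 basic character set, not stupidedi's `H_BASIC`, and `to_x12_basic` is a hypothetical helper name. Check the set against your implementation guide before relying on it.

```ruby
# Approximation of the X12 basic character set (an assumption, not
# stupidedi's H_BASIC): uppercase letters, digits, space, some punctuation.
BASIC_CHARS = (("A".."Z").to_a + ("0".."9").to_a +
               %w[! " & ' ( ) * + , - . / : ; ? =] + [" "]).freeze

# Hypothetical helper: strips diacritics, upcases, and reports whatever
# still falls outside the basic set instead of silently dropping it.
def to_x12_basic(text)
  # NFD splits "ō" into "o" plus a combining macron, which \p{Mn} removes.
  ascii = text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").upcase
  rejected = ascii.chars.reject { |ch| BASIC_CHARS.include?(ch) }.uniq
  [ascii, rejected]
end

converted, rejected = to_x12_basic("Kichijōji Honchō, Tōkyō-to")
puts converted
puts "outside basic set: #{rejected.inspect}" unless rejected.empty?
```

Romanized text with macrons survives this, but kanji and kana come back in the rejected list, so they still need a separate romanization step.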

kputnam commented on May 22, 2024

I'm trying to think through how it would work, but one thing that might cause trouble is that Reader.is_control_character? returns true if a character isn't in H_BASIC or H_EXTENDED. So I think most UTF-8 would be treated as control characters.

From what I can work out, the consume_isa method in StreamReader looks for the start of the document, which is always ISA, and ignores control characters. That should be OK unless something like IあああSああA occurs before the X12 part of the file starts, since this will be tokenized as ISA and treated as the beginning of an X12 document. Probably most people have files that are entirely X12, but I had files with an arbitrary header message before the ISA token (the spec doesn't forbid this), so consume_isa is written to skip that.

So to summarize, StreamReader figures out where the X12 starts in a stream of arbitrary characters. Because it would throw away the new UTF-8 characters (it thinks they are control characters), it might identify a sequence of characters as the start of the X12 when it isn't. That seems unlikely to actually happen, unless your X12 files have random junk in between the ISA/IEA envelopes.
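
To make the false-positive case concrete, here is a small self-contained Ruby sketch. It is not stupidedi's StreamReader: `ALLOWED` is a deliberately tiny stand-in character set and `isa_offset_ignoring_unknown` is a hypothetical name. It only mimics the idea of discarding unrecognized characters before looking for ISA.

```ruby
# Stand-in for the allowed character set (an assumption for illustration).
ALLOWED = (("A".."Z").to_a + ("0".."9").to_a + ["*", "~", " "]).freeze

# Hypothetical scanner: finds where "ISA" starts after discarding every
# character outside ALLOWED, the way ignored control characters would be.
# Returns the index of the "I" in the original input, or nil.
def isa_offset_ignoring_unknown(input)
  kept = input.each_char.with_index.select { |ch, _| ALLOWED.include?(ch) }
  text = kept.map(&:first).join
  pos = text.index("ISA")
  pos && kept[pos].last
end

p isa_offset_ignoring_unknown("IあああSああA*00*")  # => 0
p isa_offset_ignoring_unknown("あああ")              # => nil
```

The first input contains no literal "ISA", yet it is reported as starting an interchange at offset 0, because the kana between the letters are dropped before matching.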

Next, TokenReader scans the stream of characters for segment identifiers like ST or GE, for specific characters or delimiters, or for entire parts of a segment, such as all of its elements. In most of these functions, when reading until a particular substring is matched, any "control characters" are thrown out, as if they weren't even present in the input. So I think this will probably cause all of the new UTF-8 characters to be discarded, since they are classified as control characters along with things like line endings.

If you notice that happening, then you might look at Reader.is_control_character? and change it so that everything it previously classified as a control character (e.g., \n, \t, \f, \v, and various single-byte characters) is still a control character, but the characters that you've added above 255.chr aren't marked as control characters. That might actually work!
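
A sketch of what that change could look like, written as a standalone predicate. This is not stupidedi's code: `control_character?` and `PRINTABLE_ASCII` are hypothetical names, and printable ASCII is used as a stand-in for the real H_BASIC and H_EXTENDED tables.

```ruby
# Stand-in for the allowed single-byte characters (an assumption; the real
# H_BASIC / H_EXTENDED tables are not reproduced here).
PRINTABLE_ASCII = (0x20..0x7E).freeze

# Hypothetical predicate: codepoints above 255 are treated as data, while
# single-byte characters outside the stand-in set stay control characters.
def control_character?(ch)
  code = ch.ord
  return false if code > 255
  !PRINTABLE_ASCII.cover?(code)
end

# Line endings are still dropped, but the kanji are kept.
cleaned = "NM1*東京\r\n".each_char.reject { |ch| control_character?(ch) }.join
puts cleaned
```

Note that this only changes which characters get discarded. Whether the rest of the reader and writer handle multi-byte data correctly is a separate question that a round trip through edi-pp would help answer.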

shioyama commented on May 22, 2024

@kputnam Thanks very much for the detailed reply! If X12 does not support UTF-8, then I think that's enough to convince me to not use it. Actually our partner asked us if we could send it in non-UTF-8 characters so I think we'll have to do that.

Now just have to think of a way to map UTF-8 (Japanese) addresses into corresponding English addresses... which is a slightly different problem.

kputnam commented on May 22, 2024

Good luck!

shioyama commented on May 22, 2024

For anybody who encounters this problem, geocoder is your friend:

result = Geocoder.search("東京都武蔵野市吉祥寺本町二丁目5番10号 いちご吉祥寺ビル").first
result.address
=> "2 Chome-5-10 Kichijōji Honchō, Musashino-shi, Tōkyō-to 180-0004, Japan"
