
Comments (7)

clebergnu commented on May 21, 2024

This is also reproducible with in-tree input, for instance test/enc/enc001.txt. It means that make test fails on all Linux platforms I tried.

I meant to take a better look at this, and was tracking it (and have some extra info) here:

https://trello.com/c/EgR0JZaK/1180-pict-broken-with-utf-16-possibly-others

from pict.

clebergnu commented on May 21, 2024

An update here. I quickly debugged this, and found that reading lines from a Unicode-encoded file is broken:

diff --git a/cli/mparser.cpp b/cli/mparser.cpp
index 4e12245..46ba5f3 100644
--- a/cli/mparser.cpp
+++ b/cli/mparser.cpp
@@ -107,6 +107,7 @@ bool readLineFromFile( wifstream& file, wstring& line )
         if( file.eof()
          || c == L'\n'
          || c == 0 ) return( true );
+        // execution never gets here, no line is ever read
         line += c;
     }

And this seems to be where the execution loops forever. I'll follow up with a fix.


clebergnu commented on May 21, 2024

On Linux, I've traced this all the way to libstdc++. The file.get() call ultimately reaches the stream buffer's sbumpc(). There, the current buffer position (gptr()) is always equal to the end-of-buffer pointer (egptr()). A snippet from where this is checked:

      /**
       *  @brief  Getting the next character.
       *  @return  The next character, or eof.
       *
       *  If the input read position is available, returns that character
       *  and increments the read pointer, otherwise calls and returns
       *  @c uflow().
      */
      int_type
      sbumpc()
      {
	int_type __ret;
	if (__builtin_expect(this->gptr() < this->egptr(), true))
	  {
	    __ret = traits_type::to_int_type(*this->gptr());
	    this->gbump(1);
	  }
	else
	  __ret = this->uflow();
	return __ret;
      }

At this point, I imagine the libstdc++ code is not that naive and buggy, so I tend to believe that something else is required for the buffer to operate correctly. I've briefly looked at imbue and locales regarding the stream, but I really need a better grasp on the fundamentals here.
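
To make the gptr()/egptr() relationship concrete, here is a toy stream buffer (a sketch, not libstdc++'s basic_filebuf). When the get area is empty, sbumpc() falls through to uflow(), whose default implementation calls underflow() to refill the buffer. If the refill produces nothing, gptr() stays equal to egptr() and the caller sees eof. The class name CountingBuf and the 4-character chunk size are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <streambuf>
#include <string>
#include <utility>

// Toy buffer that exposes its data in chunks of up to 4 characters and
// counts how often the stream machinery asks for a refill.
class CountingBuf : public std::streambuf {
public:
    explicit CountingBuf(std::string data) : data_(std::move(data)) {}
    int refills() const { return refills_; }

protected:
    // Called (via the default uflow()) whenever gptr() == egptr().
    int_type underflow() override
    {
        ++refills_;
        if (pos_ >= data_.size()) return traits_type::eof();
        std::size_t n = std::min<std::size_t>(4, data_.size() - pos_);
        char* base = &data_[pos_];
        setg(base, base, base + n);   // new get area: [base, base + n)
        pos_ += n;
        return traits_type::to_int_type(*gptr());
    }

private:
    std::string data_;
    std::size_t pos_ = 0;
    int refills_ = 0;
};
```

Reading 8 characters from a CountingBuf holding "abcdefgh" triggers two refills; a ninth sbumpc() triggers a third refill that finds no data and returns eof.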


jaccz commented on May 21, 2024

Thanks for trying to get to the bottom of this. Really appreciate it.


clebergnu commented on May 21, 2024

Some news here: wifstreams do operate differently when a locale is set via imbue. With a simple hack such as this:

$ git diff
diff --git a/cli/mparser.cpp b/cli/mparser.cpp
index 4e12245..43bf28e 100644
--- a/cli/mparser.cpp
+++ b/cli/mparser.cpp
@@ -1,5 +1,6 @@
 #include <fstream>
 #include <sstream>
+#include <locale>
 #include "model.h"
 using namespace std;
 
@@ -436,6 +437,8 @@ bool CModelData::readModel( const wstring& filePath )
         return( false );
     }
 
+    locale loc(locale("en_US.UTF-8"));
+    file.imbue(loc);
     wstring line;
 
     // read definition of parameters

pict can then parse the model on the test/enc/enc003.txt file:

$ ./pict test/enc/enc003.txt 
A       B       C
a       1       x
a       3       y
c       1       z
b       1       y
c       2       x
b       2       z
a       2       y
b       3       x
a       3       z
c       3       y

But this "correct" operation depends on the content of the file being read (which in this case is UTF-8 with a BOM). Based on my experiments, at least on GNU/Linux, wifstream does not seem to be a good foundation for building a getEncodingType() function. ifstream (or even fopen(), open(), etc.), on the other hand, is immune to locales, BOMs, etc.

To summarize, it looks like this will need more than a quick hack.


jaccz commented on May 21, 2024

Thanks for looking into this, Cleber. I poked around a bit as well. The differences between platforms are annoying. If you ever wondered why readLineFromFile reads each character in a loop, it is because a long time ago getline() behaved differently on Windows and MacOS, and reading text char-by-char was the least common denominator that worked. I suppose getline isn't quite working these days either.

The reliance on BOMs is a partial solution at best. It might be time to start passing the input locale/encoding to PICT explicitly as a param:

pict.exe model.txt -l "en_US.UTF-8"

and use that for anything other than ANSI or files with the couple of already supported BOMs.


qykth-git commented on May 21, 2024

I think this issue is fixed by #60.
u8_rus.txt works well with #60.

