After run with UTF-8 encoded file with Russian characters (attached) it just doesn't n

High CPU load and hang of the PICT process after utf8 input on Linux (pict issue, 7 comments, CLOSED)

microsoft commented on May 21, 2024

High CPU load and hang of the PICT process after utf8 input on Linux

from pict.

Comments (7)

clebergnu commented on May 21, 2024

This is also reproducible with in-tree input, for instance test/enc/enc001.txt. That means make test fails on every Linux platform I tried.

I meant to take a better look at this, and was tracking it (and have some extra info) here:

https://trello.com/c/EgR0JZaK/1180-pict-broken-with-utf-16-possibly-others

clebergnu commented on May 21, 2024

An update here. I quickly debugged this and found that reading lines from a Unicode-encoded file is broken:

diff --git a/cli/mparser.cpp b/cli/mparser.cpp
index 4e12245..46ba5f3 100644
--- a/cli/mparser.cpp
+++ b/cli/mparser.cpp
@@ -107,6 +107,7 @@ bool readLineFromFile( wifstream& file, wstring& line )
         if( file.eof()
          || c == L'\n'
          || c == 0 ) return( true );
+        // execution never gets here, no line is ever read
         line += c;
     }

And this seems to be where the execution loops forever. I'll follow up with a fix.
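
For illustration only, here is a minimal sketch of a char-by-char reader that reports stream failure to its caller instead of returning true unconditionally. The name readLineChecked and its return convention are assumptions made for this example and are not PICT's actual code. It does not address why get() fails on these files; it only shows one way a failed read can end the caller's loop instead of spinning.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical variant of a char-by-char line reader (not PICT's code).
// Returns true if a line (possibly empty) was read, and false once the
// stream can deliver nothing more, so a caller's loop can terminate.
bool readLineChecked( std::wistream& in, std::wstring& line )
{
    line.clear();
    wchar_t c = 0;
    // get() evaluates to false on end of file and on read/conversion errors
    while( in.get( c ) )
    {
        if( c == L'\n' ) return true;
        line += c;
    }
    // get() failed: still report a final line that lacks a trailing newline
    return !line.empty();
}
```

A caller would loop with while( readLineChecked( file, line ) ) and inspect file.bad() or file.fail() afterwards to tell a clean end of file from an error.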

clebergnu commented on May 21, 2024

On Linux, I've traced this all the way to libstdc++. The file.get() call ultimately reaches the stream buffer's sbumpc(). There, the current buffer position (gptr()) is always equal to the end-of-buffer pointer (egptr()). A snippet from where this is checked:

      /**
       *  @brief  Getting the next character.
       *  @return  The next character, or eof.
       *
       *  If the input read position is available, returns that character
       *  and increments the read pointer, otherwise calls and returns
       *  @c uflow().
      */
      int_type
      sbumpc()
      {
	int_type __ret;
	if (__builtin_expect(this->gptr() < this->egptr(), true))
	  {
	    __ret = traits_type::to_int_type(*this->gptr());
	    this->gbump(1);
	  }
	else
	  __ret = this->uflow();
	return __ret;
      }

At this point, I imagine the libstdc++ code is not that naive and buggy, so I tend to believe that something else is required for the buffer to operate correctly. I've briefly looked at imbue and locales regarding the stream, but I really need a better grasp on the fundamentals here.

jaccz commented on May 21, 2024

Thanks for trying to get to the bottom of this. Really appreciate it.

clebergnu commented on May 21, 2024

Some news here: wifstreams do operate differently when a locale is set via imbue. With a simple hack such as this:

$ git diff
diff --git a/cli/mparser.cpp b/cli/mparser.cpp
index 4e12245..43bf28e 100644
--- a/cli/mparser.cpp
+++ b/cli/mparser.cpp
@@ -1,5 +1,6 @@
 #include <fstream>
 #include <sstream>
+#include <locale>
 #include "model.h"
 using namespace std;
 
@@ -436,6 +437,8 @@ bool CModelData::readModel( const wstring& filePath )
         return( false );
     }
 
+    locale loc(locale("en_US.UTF-8"));
+    file.imbue(loc);
     wstring line;
 
     // read definition of parameters

pict can then parse the model on the test/enc/enc003.txt file:

$ ./pict test/enc/enc003.txt 
A       B       C
a       1       x
a       3       y
c       1       z
b       1       y
c       2       x
b       2       z
a       2       y
b       3       x
a       3       z
c       3       y

But this "correct" operation depends on the content of the file being read (which in this case is UTF-8 with a BOM). Based on my experiments, at least on GNU/Linux, wifstream does not seem to be a good foundation for building a getEncodingType() function. ifstream (or even fopen(), open(), etc.), on the other hand, reads raw bytes and is unaffected by locales, BOMs, etc.
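
As a rough sketch of the byte-oriented approach, BOM detection can be done on a narrow stream before any wide-character decoding takes place. The Encoding enum and detectBom below are hypothetical names for this example; PICT's real getEncodingType() may look quite different, and this sketch recognizes only three BOMs.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical names for illustration; not taken from PICT.
enum class Encoding { Unknown, Utf8Bom, Utf16LE, Utf16BE };

// Inspects up to three leading bytes of a narrow (byte-oriented) stream and
// rewinds it. Files should be opened with std::ios::binary.
// Limitation: a UTF-32 LE BOM (FF FE 00 00) would be reported as Utf16LE.
Encoding detectBom( std::istream& in )
{
    char buf[3] = { 0, 0, 0 };
    in.read( buf, 3 );
    const std::streamsize n = in.gcount();
    in.clear();      // a short read sets eofbit/failbit; reset before seeking
    in.seekg( 0 );   // rewind so the caller can read from the start again

    const unsigned char b0 = static_cast<unsigned char>( buf[0] );
    const unsigned char b1 = static_cast<unsigned char>( buf[1] );
    const unsigned char b2 = static_cast<unsigned char>( buf[2] );

    if( n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF ) return Encoding::Utf8Bom;
    if( n >= 2 && b0 == 0xFF && b1 == 0xFE ) return Encoding::Utf16LE;
    if( n >= 2 && b0 == 0xFE && b1 == 0xFF ) return Encoding::Utf16BE;
    return Encoding::Unknown;
}
```

Because nothing here goes through a codecvt facet, the result does not depend on the process locale. A file without a BOM still comes back as Unknown, which is the gap discussed in the following comments.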

To summarize, it looks like this will need more than a quick hack.

jaccz commented on May 21, 2024

Thanks for looking into this, Cleber. I poked around a bit as well. The difference between platforms is annoying. If you ever wondered why readLineFromFile reads each character in a loop, it is because a long time ago getline() behaved differently on Windows and macOS, and reading text char-by-char was the least common denominator that worked. I suppose getline isn't quite working these days either.

The reliance on BOMs is a partial solution at best. It might be time to start passing the input locale/encoding to PICT explicitly as a param:

pict.exe model.txt -l "en_US.UTF-8"

and use that for anything other than ANSI or files with the couple of already supported BOMs.
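
A minimal sketch of how such an option might be handled follows. The -l flag is only the proposal above, and findLocaleArg and makeInputLocale are hypothetical helpers written for this example; none of this exists in PICT as far as this thread shows. The one library behavior relied on is that constructing std::locale from a name throws std::runtime_error when that name is not available on the system.

```cpp
#include <cstddef>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the value following "-l", or "" when absent.
std::string findLocaleArg( const std::vector<std::string>& args )
{
    for( std::size_t i = 0; i + 1 < args.size(); ++i )
    {
        if( args[i] == "-l" ) return args[i + 1];
    }
    return std::string();
}

// Hypothetical helper: builds the locale to imbue into the input stream.
// Falls back to the classic "C" locale and sets ok = false when the
// requested name is not installed on this system.
std::locale makeInputLocale( const std::string& name, bool& ok )
{
    ok = true;
    if( name.empty() ) return std::locale::classic();
    try
    {
        return std::locale( name.c_str() );
    }
    catch( const std::runtime_error& )
    {
        ok = false;
        return std::locale::classic();
    }
}
```

The resulting locale would be passed to file.imbue() before the first read, as in the earlier hack, and a false ok flag could be turned into a user-facing error instead of a silent fallback.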

qykth-git commented on May 21, 2024

I think this issue is fixed by #60.
u8_rus.txt works well with #60.
