Comments (19)

etrepum avatar etrepum commented on July 22, 2024

The reason this works on Mac but not on Linux is that you're using Python 2.x, which has two different configure-time options for unicode: wide (UCS4) and narrow (UCS2). Linux has pretty much always done UCS4 and everywhere else has done UCS2 (so that Py_UNICODE is the same type as wchar_t, mostly). This is historically a huge pain in the ass and it's fixed in Python 3, which has a far more correct unicode implementation. The behavior difference arises because on a UCS4 build simplejson must combine surrogates into a single code point, but on a UCS2 build it just blindly stuffs the UCS2 code units in there. It would be more correct if simplejson were to validate this even on UCS2 builds, but it simply doesn't because it's easier not to bother (Python 2.x doesn't go to much effort to validate unicode in general).
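
As a rough illustration of the wide/narrow distinction described above, the sketch below checks which kind of build is running and shows the arithmetic for combining a surrogate pair into one code point. The helper name `combine_surrogates` is made up for this example and is not part of simplejson.

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow (UCS2) Python 2 build and
# 0x10FFFF on a wide (UCS4) build and on Python 3.3+.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point.

    Hypothetical helper, for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 is written in JSON as the escape pair \ud83d\ude00.
code_point = combine_surrogates(0xD83D, 0xDE00)
```

A decoder on a wide build has to do this combination step, which is where an unpaired surrogate gets noticed; a narrow build can store the two code units as they are.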

I'll consider adding a flag to loosen up handling of unpaired surrogates. What behavior would you want? On some builds of Python, it's not possible to represent them as-is, so sane behavior would be something like replacing them with a placeholder or removing them entirely.
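
The two behaviors mentioned here (placeholder or removal) could look roughly like the following when applied to an already-decoded string. This is a minimal sketch that assumes Python 3.3+, where a `str` holds whole code points; `scrub_surrogates` is a hypothetical helper, not a simplejson function or flag.

```python
import re

# In Python 3.3+ a str holds whole code points, so any code point in
# the surrogate range U+D800..U+DFFF is by definition unpaired.
_LONE_SURROGATE = re.compile("[\ud800-\udfff]")


def scrub_surrogates(text, placeholder="\ufffd"):
    """Replace every lone surrogate in text with placeholder.

    Pass placeholder="" to remove them instead. Hypothetical helper.
    """
    return _LONE_SURROGATE.sub(placeholder, text)


replaced = scrub_surrogates("bad \ud800 char")
removed = scrub_surrogates("bad \ud800 char", placeholder="")
```

U+FFFD (the Unicode replacement character) is used as the default placeholder here only because it is the conventional choice; any other string would work.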

from simplejson.

d0ugal avatar d0ugal commented on July 22, 2024

You obviously have a much stronger understanding of this than I do, so it's hard for me to suggest what the best behaviour is. I'll give this some thought. On the one hand, being able to replace bad characters would be great for us and would certainly solve our use case. Incidentally, presumably I can remove them myself before parsing. Perhaps that should be the recommended way, and maybe this is simply a documentation bug and a note could be added for people who get stuck in this situation.

As for the cases where I had different results with and without the speedups: which side would you consider the bug? Is it a bug for not failing or a bug for failing? I'm assuming the inconsistency is a bug. I can get you all the samples of these on Monday if it helps (I'm avoiding my work machine until then).

etrepum avatar etrepum commented on July 22, 2024

I'm a bit skeptical that there's a difference with and without speedups, but I
haven't tried reproducing this yet. As much evidence as you can provide
would be appreciated.

d0ugal avatar d0ugal commented on July 22, 2024

It looks like my findings may be out of date. I have tested this with simplejson 3.1.0 - I can't find any that fail with speedups and work without. I was only able to reproduce this behaviour with simplejson 2.3.2 (which our server still runs.)

Any idea what happens to the unpaired surrogates on this version? Here is my quick test. It shows 3.1.0 with speedups and without speedups, and then 2.3.2 with and without (it only works in the final case, 2.3.2 without speedups).

(I'm going to try and trim down to the problem part of the file shortly.)

vagrant@precise64:/vagrant$ sudo pip install simplejson -U
Downloading/unpacking simplejson from http://pypi.python.org/packages/source/s/simplejson/simplejson-3.1.0.tar.gz#md5=71df0076d4a35d29bfea530cb8226c26
  Downloading simplejson-3.1.0.tar.gz (64kB): 64kB downloaded
  Running setup.py egg_info for package simplejson

Installing collected packages: simplejson
  Found existing installation: simplejson 3.0.0
    Uninstalling simplejson:
      Successfully uninstalled simplejson
  Running setup.py install for simplejson
    building 'simplejson._speedups' extension
    gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c simplejson/_speedups.c -o build/temp.linux-x86_64-2.7/simplejson/_speedups.o
    gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro build/temp.linux-x86_64-2.7/simplejson/_speedups.o -o build/lib.linux-x86_64-2.7/simplejson/_speedups.so

Successfully installed simplejson
Cleaning up...
vagrant@precise64:/vagrant$ pip freeze | grep simplejson
Warning: cannot find svn location for distribute==0.6.24dev-r0
simplejson==3.1.0
vagrant@precise64:/vagrant$ python -c 'import simplejson as json; print json.load(open("failing2.json")).keys()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 398, in load
    use_decimal=use_decimal, **kw)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 454, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 393, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Unpaired high surrogate: line 1 column 330665 (char 330664)
vagrant@precise64:/vagrant$ python -c 'import simplejson as json; json._toggle_speedups(False); print json.load(open("failing2.json")).keys()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 398, in load
    use_decimal=use_decimal, **kw)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 454, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 393, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
  File "/usr/local/lib/python2.7/dist-packages/simplejson/scanner.py", line 119, in scan_once
    return _scan_once(string, idx)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/scanner.py", line 90, in _scan_once
    _scan_once, object_hook, object_pairs_hook, memo)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 198, in JSONObject
    value, end = scan_once(s, end)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/scanner.py", line 90, in _scan_once
    _scan_once, object_hook, object_pairs_hook, memo)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 198, in JSONObject
    value, end = scan_once(s, end)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/scanner.py", line 92, in _scan_once
    return parse_array((string, idx + 1), _scan_once)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 254, in JSONArray
    value, end = scan_once(s, end)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/scanner.py", line 90, in _scan_once
    _scan_once, object_hook, object_pairs_hook, memo)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 198, in JSONObject
    value, end = scan_once(s, end)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/scanner.py", line 90, in _scan_once
    _scan_once, object_hook, object_pairs_hook, memo)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 198, in JSONObject
    value, end = scan_once(s, end)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/scanner.py", line 87, in _scan_once
    return parse_string(string, idx + 1, encoding, strict)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 127, in py_scanstring
    raise JSONDecodeError(msg, s, end)
simplejson.scanner.JSONDecodeError: Unpaired high surrogate: line 1 column 330659 (char 330658)
vagrant@precise64:/vagrant$ sudo pip install simplejson==2.3.2
Downloading/unpacking simplejson==2.3.2
  Downloading simplejson-2.3.2.tar.gz (50kB): 50kB downloaded
  Running setup.py egg_info for package simplejson

Installing collected packages: simplejson
  Found existing installation: simplejson 3.1.0
    Uninstalling simplejson:
      Successfully uninstalled simplejson
  Running setup.py install for simplejson
    building 'simplejson._speedups' extension
    gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c simplejson/_speedups.c -o build/temp.linux-x86_64-2.7/simplejson/_speedups.o
    gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro build/temp.linux-x86_64-2.7/simplejson/_speedups.o -o build/lib.linux-x86_64-2.7/simplejson/_speedups.so

Successfully installed simplejson
Cleaning up...
vagrant@precise64:/vagrant$ pip freeze | grep simplejson
Warning: cannot find svn location for distribute==0.6.24dev-r0
simplejson==2.3.2
vagrant@precise64:/vagrant$ python -c 'import simplejson as json; print json.load(open("failing2.json")).keys()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 357, in load
    use_decimal=use_decimal, **kw)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 413, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 402, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 418, in raw_decode
    obj, end = self.scan_once(s, idx)
simplejson.decoder.JSONDecodeError: Unpaired high surrogate: line 1 column 330664 (char 330664)
vagrant@precise64:/vagrant$ python -c 'import simplejson as json; json._toggle_speedups(False); print json.load(open("failing2.json")).keys()'
[u'hits', u'_shards', u'took', u'_scroll_id', u'timed_out']

d0ugal avatar d0ugal commented on July 22, 2024

Example of an inconsistent result, again from the old version 2.3.2, so it may just be that this has been fixed.

The JSON in question is:

{"a": "\uda90\uac14"}

Test results:

vagrant@precise64:/vagrant$ python -c 'import simplejson as json; print json.__version__; print json.loads("{\"a\": \"\uda90\uac14\"}").keys()';
2.3.2
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 413, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 402, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 418, in raw_decode
    obj, end = self.scan_once(s, idx)
simplejson.decoder.JSONDecodeError: Unpaired high surrogate: line 1 column 14 (char 14)

vagrant@precise64:/vagrant$ python -c 'import simplejson as json; json._toggle_speedups(False); print json.loads("{\"a\": \"\uda90\uac14\"}").keys()';
[u'a']
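
For reference, a quick check of why `\uda90\uac14` is reported as an unpaired high surrogate: the first escape falls in the high-surrogate range, but the second is an ordinary BMP code point rather than a low surrogate. The `classify` function is a hypothetical helper written for this illustration.

```python
def classify(code_unit):
    """Say which surrogate range, if any, a 16-bit code unit falls in."""
    if 0xD800 <= code_unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= code_unit <= 0xDFFF:
        return "low surrogate"
    return "not a surrogate"


first = classify(0xDA90)   # the \uda90 escape
second = classify(0xAC14)  # the \uac14 escape

# A high surrogate is only valid when a low surrogate follows it.
is_valid_pair = first == "high surrogate" and second == "low surrogate"
```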

etrepum avatar etrepum commented on July 22, 2024

If you go far enough in the past, anything is possible. I'm only really
interested in current behavior; there are no maintenance releases for old
versions.

d0ugal avatar d0ugal commented on July 22, 2024

Indeed - I wasn't suggesting that it should be fixed or changed etc. but simply trying to understand what was happening in this case. i.e. how was it able to work previously? What was the behaviour then? Sorry if I'm failing to be clear.

etrepum avatar etrepum commented on July 22, 2024

If there was different behavior, it was a bug. The behavior before was
probably just garbage in, garbage out. I'm hesitant to make that a
supported behavior because it will just cause problems somewhere else
(especially in Python 3).

d0ugal avatar d0ugal commented on July 22, 2024

Ok, thanks. I was just trying to understand how it "worked" before. I'm fairly sure this shouldn't be changed now. If anything, I think a documentation change would be best. Are you able to explain how to remove/replace the surrogates?

etrepum avatar etrepum commented on July 22, 2024

Since these unpaired surrogates occur as escape sequences it's not
straightforward to remove them without making changes to simplejson. The
only simple way I can think of is to catch the exception and run it again
with an edited document. Not ideal. I'd rather modify simplejson than try
and document something like that.
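
A rough sketch of the "catch the exception and run it again with an edited document" workaround, for anyone stuck in the meantime. It assumes the decoder raises a `ValueError` subclass on an unpaired surrogate; `scrub_surrogate_escapes` and `loads_lenient` are hypothetical names, and the standard `json` module is used only so the example is self-contained (recent versions of it do not raise for lone surrogates, so there the retry is never triggered).

```python
import json
import re

# Walk the escape sequences left to right so that an escaped backslash
# (\\) is consumed as a unit and never mistaken for the start of \uXXXX.
_ESCAPES = re.compile(
    r"\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}"  # valid pair
    r"|(\\u[dD][89a-fA-F][0-9a-fA-F]{2})"  # lone surrogate escape
    r"|\\.",  # any other escape sequence
    re.DOTALL,
)


def scrub_surrogate_escapes(raw, placeholder=r"\ufffd"):
    """Replace unpaired \\uXXXX surrogate escapes in raw JSON text."""
    def fix(match):
        # Only the second alternative captures group 1.
        return placeholder if match.group(1) else match.group(0)
    return _ESCAPES.sub(fix, raw)


def loads_lenient(raw, loads=json.loads):
    """Try a normal parse first; on failure retry with scrubbed text."""
    try:
        return loads(raw)
    except ValueError:
        return loads(scrub_surrogate_escapes(raw))


cleaned = scrub_surrogate_escapes(r'{"a": "\uda90\uac14"}')
```

This only handles surrogates written as `\uXXXX` escapes, and if the document is invalid for some other reason the second parse will raise again.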

etrepum avatar etrepum commented on July 22, 2024

I haven't forgotten about this, just haven't had time to do anything about it yet. It's on my list for 3.3.0.

d0ugal avatar d0ugal commented on July 22, 2024

Awesome - thanks. Let me know if I can help at all. I worked around our issue with horrific regexes to find IDs and then find the documents and repair them. 😄

ltrabuco avatar ltrabuco commented on July 22, 2024

@d0ugal Would you mind sharing how you repaired the documents?

d0ugal avatar d0ugal commented on July 22, 2024

It was fairly horrific and I don't imagine it's wise as a general approach. However, with the number of documents we have it works for now. Basically, we use a really old simplejson that is more lax about invalid Unicode; the few documents that still failed were updated to remove the invalid characters.

ltrabuco avatar ltrabuco commented on July 22, 2024

@d0ugal Thanks!

serhiy-storchaka avatar serhiy-storchaka commented on July 22, 2024

I think it would be good to have an "error" parameter in the decoder/encoder. error="strict" (the default) would correspond to the current behavior on wide builds, while error="surrogatepass" would correspond to the behavior on narrow builds and of non-strict parsers in other languages. Other error handlers are possible but less important.

I have opened an issue on the CPython bug tracker [1], and I'm going to prepare a patch for the standard json module and then port it to simplejson.

[1] http://bugs.python.org/issue17906
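
The handler names in this proposal mirror Python 3's codec error handlers, which can be tried directly on `str.encode` and `bytes.decode`. The sketch below shows only that existing codec behavior; a matching parameter on the JSON decoder/encoder is just the proposal under discussion here.

```python
lone = "\ud800"  # a str containing one unpaired high surrogate

# The default handler, errors="strict", refuses to encode it.
try:
    lone.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "surrogatepass" lets the surrogate through in both directions.
passed = lone.encode("utf-8", "surrogatepass")
round_trip = passed.decode("utf-8", "surrogatepass")
```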

etrepum avatar etrepum commented on July 22, 2024

Are you sure "error" is the best name for this? There are a number of other
options to control other kinds of errors. I know there's a parallel to
str/unicode but I think a more explicit name might be more appropriate
here.

d0ugal avatar d0ugal commented on July 22, 2024

I don't think "error" is the best name either. For the builtin unicode(), the "errors" arg works because at that point you are in the unicode problem space, so it's clear. Something like unicode_errors, while rather verbose, would be clearer. I can't think of a better alternative at the moment.

serhiy-storchaka avatar serhiy-storchaka commented on July 22, 2024

After investigating the problem more deeply, I see that a new parameter is not needed. RFC 4627 does not make exceptions for the range 0xD800-0xDFFF, and the decoder must accept lone surrogates, both escaped and unescaped. See the patch at http://bugs.python.org/issue17906.
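
As a small check of what "accept lone surrogates" means in practice, here is how the standard `json` module in a recent Python 3 treats the example from earlier in this thread. This is only an illustration with the standard library; simplejson versions may behave differently.

```python
import json

# The lone high surrogate is kept as-is instead of raising an error.
decoded = json.loads(r'{"a": "\uda90\uac14"}')
value = decoded["a"]  # two code points: U+DA90 followed by U+AC14

# A well-formed pair is still combined into a single code point.
paired = json.loads(r'"\ud83d\ude00"')
```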
