
jsonparse's Introduction

This is a streaming JSON parser. For a simpler, sax-based version see this gist: https://gist.github.com/1821394

The MIT License (MIT) Copyright (c) 2011-2012 Tim Caswell

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

jsonparse's People

Contributors

chrisdickinson, creationix, galniv, jlank, lbdremy, papandreou, raynos, rubenv, shimaore, zectbynmo


jsonparse's Issues

Enable parsing of UTF-8 characters

This is basically an issue-ification of the following TODO as found in the source code:

// TODO: Handle native utf8 characters, this code assumes ASCII input

Currently, jsonparse will turn non-ASCII UTF-8 chars into garbage.
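To see what that TODO means in practice, here is a minimal sketch (an illustration, not code from the parser, using the Cyrillic letter "я" as input): an ASCII-only path effectively treats each byte of a multi-byte UTF-8 sequence as its own character code, which mangles the text.

var bytes = Buffer.from('я', 'utf8');                  // <Buffer d1 8f>
// What an ASCII-only path effectively does: one character per byte.
var mangled = String.fromCharCode(bytes[0], bytes[1]);
console.log(mangled === 'я');                          // false: two garbage chars instead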

Invalid JSON (Invalid UTF-8 character at position 0 in state STRING1)

We're indirectly using jsonparse via JSONStream to stream in JSON data stored in Google Cloud Storage and we're intermittently seeing the following error:

Invalid JSON (Invalid UTF-8 character at position 0 in state STRING1)

99% of the time the data is parsed successfully, so I'm guessing it's related to where the chunks of data are split over HTTP. I believe it could be related to emoji characters or Japanese characters, as both exist in our JSON, but I'm struggling to pinpoint exactly where it's failing.

Is there perhaps a way to log more information re: the string value it failed on?
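One possible approach, relying on jsonparse internals (so treat the field names as assumptions): the parser reports failures through onError, which throws by default, and keeps the partial string data on this.string, so overriding onError lets you dump the value it failed on before rethrowing.

var Parser = require('jsonparse');
var parser = new Parser();
parser.onError = function (err) {
  // this.string holds the partial string data; this.key the current key
  console.error('failed near key', this.key, 'partial value:', this.string);
  throw err;
};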

RangeError for toString('utf8') on Node.js 8.6.0

I see the following error for Node.js 8.6.0:

jsonparse.js:94
    this.string += this.stringBuffer.toString('utf8');
                                     ^

RangeError: Invalid string length
    at Parser.proto.appendStringChar (jsonparse.js:94:38)
    at Parser.proto.write (jsonparse.js:197:34)

This does not happen with later versions of Node.js, it seems, but because of constraints I have to use this particular version.

Is there a workaround for this issue that I could use?

Why not take Strings as input?

Hello,
I saw that @dominictarr wrote about this being much slower than V8's JSON.parse(), but I thought that some improvements might be possible. Some thoughts:
Why don't you take strings as input? I think this should give you a huge speed improvement, because you don't have to call multiple methods per character; you can instead skip over strings until you hit a backslash or quote and then do str.slice(). See this pull request for isaacs' sax XML parser, which got a 169% speed increase just by adding some fast string-skipping code: isaacs/sax-js#25

If you want to continue accepting buffers, you could just inspect the last six bytes in order to determine where the last complete character ends: a character's first byte starts with bit 0 or bits 11, so seek back (at most 6 bytes) until you hit such a byte, then check whether the sequence is complete by inspecting that first byte.
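A sketch of that boundary check (an illustration, not code from the parser): continuation bytes match the bit pattern 10xxxxxx, so scanning backwards from the end of a chunk finds the last lead byte, and its high bits say how long the sequence must be.

// Returns the offset just past the last complete UTF-8 character in buf;
// any bytes after that offset belong with the next chunk.
function completeBoundary(buf) {
  var i = buf.length - 1;
  while (i >= 0 && (buf[i] & 0xC0) === 0x80) i--;  // step back over continuation bytes
  if (i < 0) return 0;                             // chunk is all continuation bytes
  var lead = buf[i];
  var need = lead < 0x80 ? 1                       // 0xxxxxxx: ASCII
           : lead >= 0xF0 ? 4                      // 11110xxx: 4-byte sequence
           : lead >= 0xE0 ? 3                      // 1110xxxx: 3-byte sequence
           : 2;                                    // 110xxxxx: 2-byte sequence
  return buf.length - i >= need ? buf.length : i;
}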

Allow parsing escaped surrogate pairs

Surrogate pairs are parsed as two 16-bit chars instead of one 32-bit char.
For example, this JSON contains two 32-bit emojis:
[ { "id" : "1", "message" : "\uD83D\uDE0B\uD83C\uDF70" } ]

We're using the Java FasterXML library on one side (FasterXML/jackson-core#223) and Node.js jsonparse on the other.

There is a diff with fix:
diff.txt
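For reference, the arithmetic involved is small; a sketch (not the code from the diff): a high surrogate in [0xD800, 0xDBFF] and a low surrogate in [0xDC00, 0xDFFF] combine into one code point above U+FFFF.

// Combine an escaped surrogate pair into a single code point (sketch;
// assumes the two inputs have already been validated as a hi/lo pair).
function combineSurrogates(hi, lo) {
  return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}
combineSurrogates(0xD83D, 0xDE0B).toString(16); // '1f60b', the first emoji above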

Fix deprecation warning

Please apply the following patch, which fixes a deprecation warning:

Subject: Fix deprecation warning for nodejs (>= 10)
From: Bastien Roucariès <[email protected]>

Fix debci

Forwarded: 

Index: jsonparse/jsonparse.js
===================================================================
--- jsonparse.orig/jsonparse.js
+++ jsonparse/jsonparse.js
@@ -56,7 +56,7 @@ function Parser() {
   this.value = undefined;
 
   this.string = undefined; // string data
-  this.stringBuffer = Buffer.alloc ? Buffer.alloc(STRING_BUFFER_SIZE) : new Buffer(STRING_BUFFER_SIZE);
+  this.stringBuffer = Buffer.alloc(STRING_BUFFER_SIZE);
   this.stringBufferOffset = 0;
   this.unicode = undefined; // unicode escapes
   this.highSurrogate = undefined;
@@ -67,7 +67,7 @@ function Parser() {
   this.state = VALUE;
   this.bytes_remaining = 0; // number of bytes remaining in multi byte utf8 char to read after split boundary
   this.bytes_in_sequence = 0; // bytes in multi byte utf8 char to read
-  this.temp_buffs = { "2": new Buffer(2), "3": new Buffer(3), "4": new Buffer(4) }; // for rebuilding chars split before boundary is reached
+  this.temp_buffs = { "2": Buffer.alloc(2), "3": Buffer.alloc(3), "4": Buffer.alloc(4) }; // for rebuilding chars split before boundary is reached
 
   // Stream offset
   this.offset = -1;
@@ -125,7 +125,7 @@ proto.appendStringBuf = function (buf, s
   this.stringBufferOffset += size;
 };
 proto.write = function (buffer) {
-  if (typeof buffer === "string") buffer = new Buffer(buffer);
+  if (typeof buffer === "string") buffer = Buffer.from(buffer);
   var n;
   for (var i = 0, l = buffer.length; i < l; i++) {
     if (this.tState === START){
@@ -221,16 +221,16 @@ proto.write = function (buffer) {
           var intVal = parseInt(this.unicode, 16);
           this.unicode = undefined;
           if (this.highSurrogate !== undefined && intVal >= 0xDC00 && intVal < (0xDFFF + 1)) { //<56320,57343> - lowSurrogate
-            this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate, intVal)));
+            this.appendStringBuf(Buffer.from(String.fromCharCode(this.highSurrogate, intVal)));
             this.highSurrogate = undefined;
           } else if (this.highSurrogate === undefined && intVal >= 0xD800 && intVal < (0xDBFF + 1)) { //<55296,56319> - highSurrogate
             this.highSurrogate = intVal;
           } else {
             if (this.highSurrogate !== undefined) {
-              this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate)));
+              this.appendStringBuf(Buffer.from(String.fromCharCode(this.highSurrogate)));
               this.highSurrogate = undefined;
             }
-            this.appendStringBuf(new Buffer(String.fromCharCode(intVal)));
+            this.appendStringBuf(Buffer.from(String.fromCharCode(intVal)));
           }
           this.tState = STRING1;
         }
Index: jsonparse/test/boundary.js
===================================================================
--- jsonparse.orig/test/boundary.js
+++ jsonparse/test/boundary.js
@@ -9,7 +9,7 @@ test('2 byte utf8 \'De\' character: д',
     t.equal(value, 'д');
   };
 
-  var de_buffer = new Buffer([0xd0, 0xb4]);
+  var de_buffer = Buffer.from([0xd0, 0xb4]);
 
   p.write('"');
   p.write(de_buffer);
@@ -25,7 +25,7 @@ test('3 byte utf8 \'Han\' character: 我
     t.equal(value, '我');
   };
 
-  var han_buffer = new Buffer([0xe6, 0x88, 0x91]);
+  var han_buffer = Buffer.from([0xe6, 0x88, 0x91]);
   p.write('"');
   p.write(han_buffer);
   p.write('"');
@@ -39,7 +39,7 @@ test('4 byte utf8 character (unicode sca
     t.equal(value, '𠜎');
   };
 
-  var Ux2070E_buffer = new Buffer([0xf0, 0xa0, 0x9c, 0x8e]);
+  var Ux2070E_buffer = Buffer.from([0xf0, 0xa0, 0x9c, 0x8e]);
   p.write('"');
   p.write(Ux2070E_buffer);
   p.write('"');
@@ -53,8 +53,8 @@ test('3 byte utf8 \'Han\' character chun
     t.equal(value, '我');
   };
 
-  var han_buffer_first = new Buffer([0xe6, 0x88]);
-  var han_buffer_second = new Buffer([0x91]);
+  var han_buffer_first = Buffer.from([0xe6, 0x88]);
+  var han_buffer_second = Buffer.from([0x91]);
   p.write('"');
   p.write(han_buffer_first);
   p.write(han_buffer_second);
@@ -69,8 +69,8 @@ test('4 byte utf8 character (unicode sca
     t.equal(value, '𠜎');
   };
 
-  var Ux2070E_buffer_first = new Buffer([0xf0, 0xa0]);
-  var Ux2070E_buffer_second = new Buffer([0x9c, 0x8e]);
+  var Ux2070E_buffer_first = Buffer.from([0xf0, 0xa0]);
+  var Ux2070E_buffer_second = Buffer.from([0x9c, 0x8e]);
   p.write('"');
   p.write(Ux2070E_buffer_first);
   p.write(Ux2070E_buffer_second);
@@ -85,7 +85,7 @@ var p = new Parser();
     t.equal(value, 'Aж文𠜱B');
   };
 
-  var eclectic_buffer = new Buffer([0x41, // A
+  var eclectic_buffer = Buffer.from([0x41, // A
                                     0xd0, 0xb6, // ж
                                     0xe6, 0x96, 0x87, // 文
                                     0xf0, 0xa0, 0x9c, 0xb1, // 𠜱

Streaming multi-byte UTF8 characters not being parsed correctly

When streaming data into jsonparse that consists of multi-byte UTF-8 characters, if a data chunk splits a multi-byte character, jsonparse does not properly reconcile the character between data events. I wrote a quick demo repo to show this behavior and started writing a blog post to explain the issue in more detail (not finished). In the meantime, check out the demo repo; it has the current implementation and the proposed patch working. For more context on this issue, see this thread with @mikeal discussing where the "proper" place to reconcile / parse multi-byte UTF-8 characters is. I already have a proposed fix written up for jsonparse with test cases, but wanted to open an issue first and get your feedback before I make a PR.

Thanks!
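A minimal reproduction, adapted from the boundary tests in the patch shown earlier (the three bytes 0xE6 0x88 0x91 encode 我):

var Parser = require('jsonparse');
var p = new Parser();
p.onValue = function (value) {
  console.log(value); // without the fix this prints mojibake instead of 我
};
p.write('"');
p.write(Buffer.from([0xe6, 0x88])); // first two bytes of 我
p.write(Buffer.from([0x91]));       // final byte arrives in the next chunk
p.write('"');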

how to parse selected values from json?

I have the following code:

request({url: 'https://myurl.com/stream?method=json'})
    .pipe(JSONStream.parse('*'))     
    .pipe(es.mapSync(function (data) {
      console.log(data);
      var var1 = JSON.stringify(data);
      io.emit('notification', var1);
    }))

which works perfectly for receiving ALL data from the JSON stream, or when I change

    .pipe(JSONStream.parse('*')) 

to

.pipe(JSONStream.parse('Name')) 

to get only the name.

However, what do I need to do in order to get

Name, Address, and ZIP from the JSON stream? I couldn't find the answer to this anywhere.

The JSON looks like this:

{"Date":"2015-03-16T13:00:12.860630336Z","Name":"Peter","Address":"Demostreet","ZIP":"1234"}

parsing json & the `Stream` interface

hi!

I'm looking for a practical streaming json parser.

basically, what I think would be incredibly useful would be a parser that you could pipe into from a raw stream:

   //(load all docs from local couchdb)

   request('http://localhost:5984/tests/_all_docs')
   .pipe(new StreamingJsonParser())
   .pipe(anotherStream)

  //(note, in 0.5.x pipe returns the dest pipe, so it is chainable)

now, i'd expect StreamingJsonParser to take a raw stream, and emit objects.

I think for this to actually be useful, the root of the JSON stream should be an array;
then the 'data' events are the members of the array.

emitting the members of the first array the parser sees would work for the cases that I have examined so far
(github, twitter, rackspace, and couchdb)

unfortunately, couchdb views do not actually have an array at the root, but instead it's like this:

{total_rows: 1000, rows: [
...
]}

which is why I am advocating emitting a stream of the members of the bottom-most array.

what do you think?
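For what it's worth, JSONStream (which is built on jsonparse) ended up addressing this with path patterns, so the couchdb shape above needs no special-casing of the root; a sketch:

var request = require('request');
var JSONStream = require('JSONStream');

request('http://localhost:5984/tests/_all_docs')
  .pipe(JSONStream.parse('rows.*'))   // 'data' events are the members of rows
  .pipe(anotherStream);               // anotherStream as in the snippet above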

Add a license

Could you add a LICENSE file (or license in the package.json)?

Thanks!

Some big numbers not converted to string

Some numbers which are larger than Number.MAX_SAFE_INTEGER can still be represented accurately as a regular JavaScript number.

In those cases jsonparse will return them as a number, rather than a string.

I'm not sure if that's intentional, but I thought it was worth flagging. I was expecting all numbers outside the safe double-precision integer range to be returned as strings. Here are a few examples where this is not happening:

144380449412828603 string 
144122580203659657 string
144250504882249760 number
144222334382612875 string
144353568153548541 string
144131338871386780 number
144274369105917272 string
144188125506805060 number

One potential issue that might arise from this is passing the output from jsonparse to BigInt:

Number('144188125506805060');         // 144188125506805060  👍 
BigInt('144188125506805060');         // 144188125506805060n 👍  
BigInt(Number('144188125506805060')); // 144188125506805056n 👎 

The cause of the issue (if it is indeed considered an issue) is this condition:

if ((text.match(/[0-9]+/) == text) && (result.toString() != text)) {

An additional check against Number.MAX_SAFE_INTEGER could suffice as a solution, though may not be backwards compatible.
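Concretely, the suggested check could look like this (a sketch of the reporter's idea, not a committed fix): keep the number only when it both round-trips and sits inside the safe-integer range.

if ((text.match(/[0-9]+/) == text) &&
    (result.toString() != text || result > Number.MAX_SAFE_INTEGER)) {
  this.onToken(STRING, text);   // too big to trust as a JS number
} else {
  this.onToken(NUMBER, result);
}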

new Buffer() constructor is deprecated

if (typeof buffer === "string") buffer = new Buffer(buffer);

this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate, intVal)));

this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate)));

this.appendStringBuf(new Buffer(String.fromCharCode(intVal)));

Use of the Buffer constructor is currently deprecated:

Switching away from it isn't possible without dropping Node.js ≤ 4.4.x and 5.0.0 - 5.9.x (where Buffer.alloc and Buffer.from don't exist); or maybe use a polyfill?

Or maybe you can just use the alloc helper to keep compatibility:

jsonparse/jsonparse.js

Lines 54 to 56 in b2d8bc6

function alloc(size) {
  return Buffer.alloc ? Buffer.alloc(size) : new Buffer(size);
}
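A matching shim for the Buffer.from call sites could follow the same pattern (a sketch; the Uint8Array.from comparison guards Node 4.0 - 4.4, where Buffer.from exists but is just the inherited Uint8Array.from and mishandles strings):

var hasFrom = Buffer.from && Buffer.from !== Uint8Array.from;
function from(value) {
  return hasFrom ? Buffer.from(value) : new Buffer(value);
}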

"buffer" and "i" on line 413 undefined?

Looks like buffer and i on line 413 are undefined.

jsonparse/jsonparse.js

Lines 409 to 422 in b2d8bc6

proto.numberReviver = function (text) {
  var result = Number(text);
  if (isNaN(result)) {
    return this.charError(buffer, i);
  }
  if ((text.match(/[0-9]+/) == text) && (result.toString() != text)) {
    // Long string of digits which is an ID string and not valid and/or safe JavaScript integer Number
    this.onToken(STRING, text);
  } else {
    this.onToken(NUMBER, result);
  }
}
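One possible fix, offered as a sketch rather than the maintainer's intent: since neither variable is in scope here, report the offending text through onError instead.

proto.numberReviver = function (text) {
  var result = Number(text);
  if (isNaN(result)) {
    // buffer and i are not in scope; include the failed text in the error
    return this.onError(new Error('Unexpected number ' + JSON.stringify(text)));
  }
  if ((text.match(/[0-9]+/) == text) && (result.toString() != text)) {
    this.onToken(STRING, text);
  } else {
    this.onToken(NUMBER, result);
  }
};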

Memory leak

I have some code that uses jsonparse (via JSONStream) to parse a file that is about 170MB. The heap keeps growing, and eventually near-continual GC grinds the process almost to a halt.

I thought at first the leak was caused by dominictarr/JSONStream, but I think that I've narrowed the leak down to jsonparse.

This code causes a leak that I don't think should happen.

var Parser = require('jsonparse');

var string = (new Array(10 * 1024 + 1)).join("x");

var parser = new Parser();
// parser.onValue = function(value) {
//   //console.log('received:', value);
// };

parser.write('[')
while (true) {
  parser.write('"' + string + '",')
}

It streams a never ending array of strings to jsonparse. It's silly, but it seemed to be a simple way to simulate parsing a large file and provoke the leak.

Running with the --trace_gc flag shows that the heap grows rapidly, GC is unable to reclaim much from the heap, and the heap is quickly exhausted.

I don't see why this code shouldn't be able to run indefinitely. Until it does, I'm probably not going to be able to process large files with jsonparse (which is a shame).
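For what it's worth, a workaround sketch modelled on what JSONStream does internally (an assumption about parser internals: when onValue fires for a top-level element, this.value is the root container and this.key its index): drop each element once it has been observed, so the root array cannot pin every string.

parser.onValue = function (value) {
  // delete top-level members after seeing them; the root array stays sparse
  // instead of retaining every 10KB string
  if (this.stack.length === 1) delete this.value[this.key];
};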

Status of this Library

Hey @creationix!

I was just wondering about the state of this package. With no README, docs, recent commits, or issue resolutions, but TONS of installs, I am unsure if it is safe to use. Is this module recommended? Or is there a newer streaming JSON parser around?

Not working with browserify buffers

I'm trying to use jsonparse (via @dominictarr's JSONStream) with browserify, and since buffer-browserify doesn't (and apparently can't) support buf[index], it doesn't work.

JSONStream already detects whether Buffer is available so maybe that could detect browserify buffers and choose not to use them, but maybe it would be better if jsonparse could deal with it itself?

What do y'all think?

Stuck in for loop for certain multi-byte utf8 characters in an open quote

When a string in the JSON stream includes the "registered trademark" character (http://www.fileformat.info/info/unicode/char/00ae/index.htm), Parser.write() gets stuck in the for loop at https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L96 indefinitely. I believe that this is because this.bytes_in_sequence remains 0 in the code block starting at https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L130

In the case of the "registered trademark" Unicode character, n = 174, so this.bytes_in_sequence remains 0 and i never increases at https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L142

Adding a line like:

        if ((n >= 128) && (n <= 193)) this.bytes_in_sequence = 1;

at around line 130 seemed to fix things for me.

I ran into this while using dominictarr/JSONStream.
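A minimal sketch that reproduces the hang (assuming any lone byte in the 128-193 range inside an open string triggers it, e.g. a Latin-1 encoded ®):

var Parser = require('jsonparse');
var p = new Parser();
p.write('"');
// 0xAE is ® in Latin-1; as a bare byte it is an invalid UTF-8 lead byte,
// so bytes_in_sequence stays 0 and write() never advances past it
p.write(Buffer.from([0xae])); // spins forever without the guard above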

Simpler implementation is possible?

I am probably missing something or failing on edge cases, but I was able to implement a simple JSON parser like so:

https://github.com/ORESoftware/tap-json-parser/blob/master/index.ts

The reason I started working on my own version was that JSONStream (which uses jsonparse) was failing when parsing stdout that contained non-JSON data interleaved with the JSON. So if JSON and non-JSON are mixed together, it appears to fail.

E.g.:

console.log(JSON.stringify({foo:"bar"}));
console.log('yolo rolo cholo');
console.log(JSON.stringify({zim:"zam"}));

the above should make JSONStream fail (and perhaps jsonparse too?).

So I made an attempt based on a super simple try/catch on each line of data, and it works.

Maybe you know why my implementation might fail in certain scenarios/edge cases. I am honestly hoping you can tell me why my implementation might be insufficient, so I can fix it.

thanks!
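For concreteness, the shape of the per-line approach described above (a sketch; handleRecord is a hypothetical consumer). Its obvious limit is that it only handles newline-delimited JSON: a document that spans several lines, which a true streaming parser copes with, will land in the catch block.

var readline = require('readline');

readline.createInterface({ input: process.stdin }).on('line', function (line) {
  try {
    handleRecord(JSON.parse(line)); // handleRecord is hypothetical
  } catch (e) {
    // not JSON: interleaved plain output, ignore or pass through
  }
});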
