
jsonparse's Introduction

This is a streaming JSON parser. For a simpler, sax-based version see this gist: https://gist.github.com/1821394

The MIT License (MIT) Copyright (c) 2011-2012 Tim Caswell

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

jsonparse's People

Contributors

chrisdickinson, creationix, galniv, jlank, lbdremy, papandreou, raynos, rubenv, shimaore, zectbynmo


jsonparse's Issues

Enable parsing of UTF-8 characters

This is basically an issue-ification of the following TODO as found in the source code:

// TODO: Handle native utf8 characters, this code assumes ASCII input

Currently, jsonparse will turn non-ASCII UTF-8 chars into garbage.
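To see what that TODO means in practice, here is a minimal sketch (an illustration, not code from the parser, using the Cyrillic letter "я" as input): an ASCII-only path effectively treats each byte of a multi-byte UTF-8 sequence as its own character code, which mangles the text.

var bytes = Buffer.from('я', 'utf8');                  // <Buffer d1 8f>
// What an ASCII-only path effectively does: one character per byte.
var mangled = String.fromCharCode(bytes[0], bytes[1]);
console.log(mangled === 'я');                          // false: two garbage chars instead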

Invalid JSON (Invalid UTF-8 character at position 0 in state STRING1)

We're indirectly using jsonparse via JSONStream to stream in JSON data stored in Google Cloud Storage and we're intermittently seeing the following error:

Invalid JSON (Invalid UTF-8 character at position 0 in state STRING1)

99% of the time the data is parsed successfully, so I'm guessing it's related to where the chunks of data are split over HTTP. I believe it could be related to emoji characters or Japanese characters, as both exist in our JSON, but I'm struggling to pinpoint exactly where it's failing.

Is there perhaps a way to log more information re: the string value it failed on?
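One possible approach, relying on jsonparse internals (so treat the field names as assumptions): the parser reports failures through onError, which throws by default, and keeps the partial string data on this.string, so overriding onError lets you dump the value it failed on before rethrowing.

var Parser = require('jsonparse');
var parser = new Parser();
parser.onError = function (err) {
  // this.string holds the partial string data; this.key the current key
  console.error('failed near key', this.key, 'partial value:', this.string);
  throw err;
};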

RangeError for toString('utf8') on Node.js 8.6.0

I see the following error for Node.js 8.6.0:

jsonparse.js:94
    this.string += this.stringBuffer.toString('utf8');
                                     ^

RangeError: Invalid string length
    at Parser.proto.appendStringChar (jsonparse.js:94:38)
    at Parser.proto.write (jsonparse.js:197:34)

This does not happen with later versions of Node.js, it seems, but because of constraints I have to use this particular version.

Is there a workaround for this issue that I could use?

Why not take Strings as input?

Hello,
I saw that @dominictarr wrote about this being much slower than V8's JSON.parse(), but I thought that some improvements might be possible. Some thoughts:
Why don't you take strings as input? I think this should give you a huge speed improvement, because you don't have to call multiple methods per character; you can instead skip over strings until you hit a backslash or quote and then do str.slice(). See this pull request for isaacs' sax XML parser, which got a 169% speed increase just by adding some fast string-skipping code: isaacs/sax-js#25

If you want to continue accepting buffers, you could just inspect the last six bytes in order to determine where the last complete character ends: a character's first byte starts with bit 0 or bits 11, so seek back (at most 6 bytes) until you hit such a byte, then check whether the sequence is complete by inspecting that first byte.
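A sketch of that boundary check (an illustration, not code from the parser): continuation bytes match the bit pattern 10xxxxxx, so scanning backwards from the end of a chunk finds the last lead byte, and its high bits say how long the sequence must be.

// Returns the offset just past the last complete UTF-8 character in buf;
// any bytes after that offset belong with the next chunk.
function completeBoundary(buf) {
  var i = buf.length - 1;
  while (i >= 0 && (buf[i] & 0xC0) === 0x80) i--;  // step back over continuation bytes
  if (i < 0) return 0;                             // chunk is all continuation bytes
  var lead = buf[i];
  var need = lead < 0x80 ? 1                       // 0xxxxxxx: ASCII
           : lead >= 0xF0 ? 4                      // 11110xxx: 4-byte sequence
           : lead >= 0xE0 ? 3                      // 1110xxxx: 3-byte sequence
           : 2;                                    // 110xxxxx: 2-byte sequence
  return buf.length - i >= need ? buf.length : i;
}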

Allow parsing escaped surrogate pairs

Surrogate pairs are parsed as two 16-bit chars instead of one 32-bit char.
For example, this JSON contains two 32-bit emojis:
[ { "id" : "1", "message" : "\uD83D\uDE0B\uD83C\uDF70" } ]

We're using the Java FasterXML library on one side (FasterXML/jackson-core#223) and Node.js jsonparse on the other.

There is a diff with fix:
diff.txt
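For reference, the arithmetic involved is small; a sketch (not the code from the diff): a high surrogate in [0xD800, 0xDBFF] and a low surrogate in [0xDC00, 0xDFFF] combine into one code point above U+FFFF.

// Combine an escaped surrogate pair into a single code point (sketch;
// assumes the two inputs have already been validated as a hi/lo pair).
function combineSurrogates(hi, lo) {
  return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}
combineSurrogates(0xD83D, 0xDE0B).toString(16); // '1f60b', the first emoji above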

Fix deprecation warning

Please apply the following patch, which fixes a deprecation warning:

Subject: Fix deprecation warning for nodejs (>= 10)
From: Bastien Roucariès <[email protected]>

Fix debci

Forwarded: 

Index: jsonparse/jsonparse.js
===================================================================
--- jsonparse.orig/jsonparse.js
+++ jsonparse/jsonparse.js
@@ -56,7 +56,7 @@ function Parser() {
   this.value = undefined;
 
   this.string = undefined; // string data
-  this.stringBuffer = Buffer.alloc ? Buffer.alloc(STRING_BUFFER_SIZE) : new Buffer(STRING_BUFFER_SIZE);
+  this.stringBuffer = Buffer.alloc(STRING_BUFFER_SIZE);
   this.stringBufferOffset = 0;
   this.unicode = undefined; // unicode escapes
   this.highSurrogate = undefined;
@@ -67,7 +67,7 @@ function Parser() {
   this.state = VALUE;
   this.bytes_remaining = 0; // number of bytes remaining in multi byte utf8 char to read after split boundary
   this.bytes_in_sequence = 0; // bytes in multi byte utf8 char to read
-  this.temp_buffs = { "2": new Buffer(2), "3": new Buffer(3), "4": new Buffer(4) }; // for rebuilding chars split before boundary is reached
+  this.temp_buffs = { "2": Buffer.alloc(2), "3": Buffer.alloc(3), "4": Buffer.alloc(4) }; // for rebuilding chars split before boundary is reached
 
   // Stream offset
   this.offset = -1;
@@ -125,7 +125,7 @@ proto.appendStringBuf = function (buf, s
   this.stringBufferOffset += size;
 };
 proto.write = function (buffer) {
-  if (typeof buffer === "string") buffer = new Buffer(buffer);
+  if (typeof buffer === "string") buffer = Buffer.from(buffer);
   var n;
   for (var i = 0, l = buffer.length; i < l; i++) {
     if (this.tState === START){
@@ -221,16 +221,16 @@ proto.write = function (buffer) {
           var intVal = parseInt(this.unicode, 16);
           this.unicode = undefined;
           if (this.highSurrogate !== undefined && intVal >= 0xDC00 && intVal < (0xDFFF + 1)) { //<56320,57343> - lowSurrogate
-            this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate, intVal)));
+            this.appendStringBuf(Buffer.from(String.fromCharCode(this.highSurrogate, intVal)));
             this.highSurrogate = undefined;
           } else if (this.highSurrogate === undefined && intVal >= 0xD800 && intVal < (0xDBFF + 1)) { //<55296,56319> - highSurrogate
             this.highSurrogate = intVal;
           } else {
             if (this.highSurrogate !== undefined) {
-              this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate)));
+              this.appendStringBuf(Buffer.from(String.fromCharCode(this.highSurrogate)));
               this.highSurrogate = undefined;
             }
-            this.appendStringBuf(new Buffer(String.fromCharCode(intVal)));
+            this.appendStringBuf(Buffer.from(String.fromCharCode(intVal)));
           }
           this.tState = STRING1;
         }
Index: jsonparse/test/boundary.js
===================================================================
--- jsonparse.orig/test/boundary.js
+++ jsonparse/test/boundary.js
@@ -9,7 +9,7 @@ test('2 byte utf8 \'De\' character: д',
     t.equal(value, 'д');
   };
 
-  var de_buffer = new Buffer([0xd0, 0xb4]);
+  var de_buffer = Buffer.from([0xd0, 0xb4]);
 
   p.write('"');
   p.write(de_buffer);
@@ -25,7 +25,7 @@ test('3 byte utf8 \'Han\' character: 我
     t.equal(value, '我');
   };
 
-  var han_buffer = new Buffer([0xe6, 0x88, 0x91]);
+  var han_buffer = Buffer.from([0xe6, 0x88, 0x91]);
   p.write('"');
   p.write(han_buffer);
   p.write('"');
@@ -39,7 +39,7 @@ test('4 byte utf8 character (unicode sca
     t.equal(value, '𠜎');
   };
 
-  var Ux2070E_buffer = new Buffer([0xf0, 0xa0, 0x9c, 0x8e]);
+  var Ux2070E_buffer = Buffer.from([0xf0, 0xa0, 0x9c, 0x8e]);
   p.write('"');
   p.write(Ux2070E_buffer);
   p.write('"');
@@ -53,8 +53,8 @@ test('3 byte utf8 \'Han\' character chun
     t.equal(value, '我');
   };
 
-  var han_buffer_first = new Buffer([0xe6, 0x88]);
-  var han_buffer_second = new Buffer([0x91]);
+  var han_buffer_first = Buffer.from([0xe6, 0x88]);
+  var han_buffer_second = Buffer.from([0x91]);
   p.write('"');
   p.write(han_buffer_first);
   p.write(han_buffer_second);
@@ -69,8 +69,8 @@ test('4 byte utf8 character (unicode sca
     t.equal(value, '𠜎');
   };
 
-  var Ux2070E_buffer_first = new Buffer([0xf0, 0xa0]);
-  var Ux2070E_buffer_second = new Buffer([0x9c, 0x8e]);
+  var Ux2070E_buffer_first = Buffer.from([0xf0, 0xa0]);
+  var Ux2070E_buffer_second = Buffer.from([0x9c, 0x8e]);
   p.write('"');
   p.write(Ux2070E_buffer_first);
   p.write(Ux2070E_buffer_second);
@@ -85,7 +85,7 @@ var p = new Parser();
     t.equal(value, 'Aж文𠜱B');
   };
 
-  var eclectic_buffer = new Buffer([0x41, // A
+  var eclectic_buffer = Buffer.from([0x41, // A
                                     0xd0, 0xb6, // ж
                                     0xe6, 0x96, 0x87, // 文
                                     0xf0, 0xa0, 0x9c, 0xb1, // 𠜱

Streaming multi-byte UTF8 characters not being parsed correctly

When streaming data into jsonparse that consists of multi-byte UTF-8 characters, if a data chunk splits a multi-byte character, jsonparse does not properly reconcile the character between data events. I wrote a quick demo repo to show this behavior and started writing a blog post to explain the issue in more detail (not finished). In the meantime, check out the demo repo; it has the current implementation and the proposed patch working. For more context on this issue, see this thread with @mikeal discussing where the "proper" place to reconcile / parse multi-byte UTF-8 characters is. I already have a proposed fix written up for jsonparse with test cases, but wanted to open an issue first and get your feedback before I make a PR.

Thanks!
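A minimal reproduction, adapted from the boundary tests in the patch shown earlier (the three bytes 0xE6 0x88 0x91 encode 我):

var Parser = require('jsonparse');
var p = new Parser();
p.onValue = function (value) {
  console.log(value); // without the fix this prints mojibake instead of 我
};
p.write('"');
p.write(Buffer.from([0xe6, 0x88])); // first two bytes of 我
p.write(Buffer.from([0x91]));       // final byte arrives in the next chunk
p.write('"');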

how to parse selected values from json?

I have the following code:

request({url: 'https://myurl.com/stream?method=json'})
    .pipe(JSONStream.parse('*'))     
    .pipe(es.mapSync(function (data) {
      console.log(data);
      var var1 = JSON.stringify(data);
      io.emit('notification', var1);
    }))

which works perfectly for receiving ALL data from the JSON stream, or when I change

    .pipe(JSONStream.parse('*')) 

to

.pipe(JSONStream.parse('Name')) 

to get only the name.

However, what do I need to do in order to get

Name, Address, and ZIP from the JSON stream? I couldn't find the answer to this anywhere.

The JSON looks like this:

{"Date":"2015-03-16T13:00:12.860630336Z","Name":"Peter","Address":"Demostreet","ZIP":"1234"}

parsing json & the `Stream` interface

hi!

I'm looking for a practical streaming json parser.

basically, what I think would be incredibly useful would be a parser that you could pipe into from a raw stream:

   //(load all docs from local couchdb)

   request('http://localhost:5984/tests/_all_docs')
   .pipe(new StreamingJsonParser())
   .pipe(anotherStream)

  //(note, in 0.5.x pipe returns the dest pipe, so it is chainable)

now, i'd expect StreamingJsonParser to take a raw stream, and emit objects.

I think for this to actually be useful, the root of the JSON stream should be an array;
then the 'data' events are the members of the array.

emitting the members of the first array the parser sees would work for the cases that I have examined so far
(github, twitter, rackspace, and couchdb)

unfortunately, couchdb views do not actually have an array at the root, but instead it's like this:

{total_rows: 1000, rows: [
...
]}

which is why I am advocating emitting a stream of the members of the bottom-most array.

what do you think?
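For what it's worth, JSONStream (which is built on jsonparse) ended up addressing this with path patterns, so the couchdb shape above needs no special-casing of the root; a sketch:

var request = require('request');
var JSONStream = require('JSONStream');

request('http://localhost:5984/tests/_all_docs')
  .pipe(JSONStream.parse('rows.*'))   // 'data' events are the members of rows
  .pipe(anotherStream);               // anotherStream as in the snippet above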

Add a license

Could you add a LICENSE file (or license in the package.json)?

Thanks!

Some big numbers not converted to string

Some numbers which are larger than Number.MAX_SAFE_INTEGER can still be represented accurately as a regular JavaScript number.

In those cases jsonparse will return them as a number, rather than a string.

I'm not sure if that's intentional, but I thought it was worth flagging. I was expecting all numbers outside the safe double-precision integer range to be returned as strings. Here are a few examples where this is not happening:

144380449412828603 string 
144122580203659657 string
144250504882249760 number
144222334382612875 string
144353568153548541 string
144131338871386780 number
144274369105917272 string
144188125506805060 number

One potential issue that might arise from this is passing the output from jsonparse to BigInt:

Number('144188125506805060');         // 144188125506805060  👍 
BigInt('144188125506805060');         // 144188125506805060n 👍  
BigInt(Number('144188125506805060')); // 144188125506805056n 👎 

The cause of the issue (if it is indeed considered an issue) is this condition:

if ((text.match(/[0-9]+/) == text) && (result.toString() != text)) {

An additional check against Number.MAX_SAFE_INTEGER could suffice as a solution, though may not be backwards compatible.
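Concretely, the suggested check could look like this (a sketch of the reporter's idea, not a committed fix): keep the number only when it both round-trips and sits inside the safe-integer range.

if ((text.match(/[0-9]+/) == text) &&
    (result.toString() != text || result > Number.MAX_SAFE_INTEGER)) {
  this.onToken(STRING, text);   // too big to trust as a JS number
} else {
  this.onToken(NUMBER, result);
}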

new Buffer() constructor is deprecated

if (typeof buffer === "string") buffer = new Buffer(buffer);

this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate, intVal)));

this.appendStringBuf(new Buffer(String.fromCharCode(this.highSurrogate)));

this.appendStringBuf(new Buffer(String.fromCharCode(intVal)));

Use of the Buffer constructor is currently deprecated:

Switching away from it isn't possible without dropping Node.js ≤ 4.4.x and 5.0.0 - 5.9.x (where Buffer.alloc and Buffer.from don't exist); or maybe use a polyfill?

Or maybe you can just use the alloc helper to keep compatibility:

jsonparse/jsonparse.js

Lines 54 to 56 in b2d8bc6

function alloc(size) {
  return Buffer.alloc ? Buffer.alloc(size) : new Buffer(size);
}
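A matching shim for the Buffer.from call sites could follow the same pattern (a sketch; the Uint8Array.from comparison guards Node 4.0 - 4.4, where Buffer.from exists but is just the inherited Uint8Array.from and mishandles strings):

var hasFrom = Buffer.from && Buffer.from !== Uint8Array.from;
function from(value) {
  return hasFrom ? Buffer.from(value) : new Buffer(value);
}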

"buffer" and "i" on line 413 undefined?

Looks like buffer and i on line 413 are undefined.

jsonparse/jsonparse.js

Lines 409 to 422 in b2d8bc6

proto.numberReviver = function (text) {
  var result = Number(text);
  if (isNaN(result)) {
    return this.charError(buffer, i);
  }
  if ((text.match(/[0-9]+/) == text) && (result.toString() != text)) {
    // Long string of digits which is an ID string and not valid and/or safe JavaScript integer Number
    this.onToken(STRING, text);
  } else {
    this.onToken(NUMBER, result);
  }
}
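One possible fix, offered as a sketch rather than the maintainer's intent: since neither variable is in scope here, report the offending text through onError instead.

proto.numberReviver = function (text) {
  var result = Number(text);
  if (isNaN(result)) {
    // buffer and i are not in scope; include the failed text in the error
    return this.onError(new Error('Unexpected number ' + JSON.stringify(text)));
  }
  if ((text.match(/[0-9]+/) == text) && (result.toString() != text)) {
    this.onToken(STRING, text);
  } else {
    this.onToken(NUMBER, result);
  }
};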

Memory leak

I have some code that uses jsonparse (via JSONStream) to parse a file that is about 170MB. The heap keeps growing, and eventually near-continual GC grinds the process almost to a halt.

I thought at first the leak was caused by dominictarr/JSONStream, but I think that I've narrowed the leak down to jsonparse.

This code causes a leak that I don't think should happen.

var Parser = require('jsonparse');

var string = (new Array(10 * 1024 + 1)).join("x");

var parser = new Parser();
// parser.onValue = function(value) {
//   //console.log('received:', value);
// };

parser.write('[')
while (true) {
  parser.write('"' + string + '",')
}

It streams a never ending array of strings to jsonparse. It's silly, but it seemed to be a simple way to simulate parsing a large file and provoke the leak.

Running with the --trace_gc flag shows that the heap grows rapidly, GC is unable to reclaim much from the heap, and the heap is quickly exhausted.

I don't see why this code shouldn't be able to run indefinitely. Until it does, I'm probably not going to be able to process large files with jsonparse (which is a shame).
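For what it's worth, a workaround sketch modelled on what JSONStream does internally (an assumption about parser internals: when onValue fires for a top-level element, this.value is the root container and this.key its index): drop each element once it has been observed, so the root array cannot pin every string.

parser.onValue = function (value) {
  // delete top-level members after seeing them; the root array stays sparse
  // instead of retaining every 10KB string
  if (this.stack.length === 1) delete this.value[this.key];
};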

Status of this Library

Hey @creationix!

I was just wondering about the state of this package. With no README, docs, recent commits, or issue resolutions, but TONS of installs, I am unsure if it is safe to use. Is this module recommended? Or is there a newer streaming JSON parser around?

Not working with browserify buffers

I'm trying to use jsonparse (via @dominictarr's JSONStream) with browserify, and since buffer-browserify doesn't (and apparently can't) support buf[index], it doesn't work.

JSONStream already detects whether Buffer is available so maybe that could detect browserify buffers and choose not to use them, but maybe it would be better if jsonparse could deal with it itself?

What do y'all think?

Stuck in for loop for certain multi-byte utf8 characters in an open quote

When a string in the JSON stream includes the "registered trademark" character (http://www.fileformat.info/info/unicode/char/00ae/index.htm), Parser.write() gets stuck in the for loop at https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L96 indefinitely. I believe that this is because this.bytes_in_sequence remains 0 in the code block starting at https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L130

In the case of the "registered trademark" Unicode character, n = 174, so this.bytes_in_sequence remains 0 and i never increases at https://github.com/creationix/jsonparse/blob/master/jsonparse.js#L142

Adding a line like:

        if ((n >= 128) && (n <= 193)) this.bytes_in_sequence = 1;

at around line 130 seemed to fix things for me.

I ran into this while using dominictarr/JSONStream.
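A minimal sketch that reproduces the hang (assuming any lone byte in the 128-193 range inside an open string triggers it, e.g. a Latin-1 encoded ®):

var Parser = require('jsonparse');
var p = new Parser();
p.write('"');
// 0xAE is ® in Latin-1; as a bare byte it is an invalid UTF-8 lead byte,
// so bytes_in_sequence stays 0 and write() never advances past it
p.write(Buffer.from([0xae])); // spins forever without the guard above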

Simpler implementation is possible?

I am probably missing something or failing on edge cases, but I was able to implement a simple JSON parser like so:

https://github.com/ORESoftware/tap-json-parser/blob/master/index.ts

The reason I started working on my own version was that JSONStream (which uses jsonparse) was failing when parsing stdout that contained non-JSON data interleaved with the JSON. So if JSON and non-JSON are mixed together, it appears to fail.

E.g.:

console.log(JSON.stringify({foo:"bar"}));
console.log('yolo rolo cholo');
console.log(JSON.stringify({zim:"zam"}));

the above should make JSONStream fail (and perhaps jsonparse too?).

So I made an attempt based on a super simple try/catch on each line of data, and it works.

Maybe you know why my implementation might fail in certain scenarios/edge cases. I am honestly hoping you can tell me why my implementation might be insufficient, so I can fix it.

thanks!
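For concreteness, the shape of the per-line approach described above (a sketch; handleRecord is a hypothetical consumer). Its obvious limit is that it only handles newline-delimited JSON: a document that spans several lines, which a true streaming parser copes with, will land in the catch block.

var readline = require('readline');

readline.createInterface({ input: process.stdin }).on('line', function (line) {
  try {
    handleRecord(JSON.parse(line)); // handleRecord is hypothetical
  } catch (e) {
    // not JSON: interleaved plain output, ignore or pass through
  }
});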
