klaemo / csv-stream
:page_with_curl: Streaming CSV Parser for Node. Small and made entirely out of streams.
License: Other
heya, I updated csv-spectrum so that you can just require() it now, example usage is here: https://github.com/maxogden/binary-csv/blob/master/test/test.js#L151
This would be nice for debugging in large files.
I guess this format would make the most sense:
parser.on('data', (line, number) => { ... })
Sample code:
var str = require('string-to-stream'), csv = require('csv-streamify');
str("COL0,COL1\ncol0,col1\n").pipe(csv({objectMode: true, columns: true})).on('data', function(chunk) {
console.log(chunk);
});
str("COL0\ncol0\n").pipe(csv({objectMode: true, columns: true})).on('data', function(chunk) {
console.log(chunk);
}).on('error', function(error) {
console.log(error);
});
Expected output:
{ COL0: 'col0', COL1: 'col1' }
{ COL0: 'col0' }
Actual output:
{ COL0: 'col0', COL1: 'col1' }
(maybe with an error?)
Something goes wrong when the input has only one column: the row is never emitted.
To see what happens to your code in Node.js 10, Greenkeeper has created a branch with changes to the following files:
.travis.yml
package.json
If you’re interested in upgrading this repo to Node.js 10, you can open a PR with these changes. Please note that this issue is just intended as a friendly reminder and the PR as a possible starting point for getting your code running on Node.js 10.
Greenkeeper has checked the engines key in any package.json file, the .nvmrc file, and the .travis.yml file, if present. engines was only updated if it defined a single version, not a range. .nvmrc was updated to Node.js 10. .travis.yml was only changed if there was a root-level node_js list that didn’t already include Node.js 10, such as node or lts/*; in that case, the new version was appended to the list. We didn’t touch job or matrix configurations because these tend to be quite specific and complex, and it’s difficult to infer what the intentions were.
For many simpler .travis.yml configurations, this PR should suffice as-is, but depending on what you’re doing it may require additional work or may not be applicable at all. We’re also aware that you may have good reasons not to update to Node.js 10, which is why this was sent as an issue and not a pull request. Feel free to delete it without comment, I’m a humble robot and won’t feel rejected 🤖
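For reference, the kind of root-level node_js change described above might look like this (a hypothetical .travis.yml, not this repo's actual file):

```yaml
language: node_js
node_js:
  - "8"
  - "10"   # appended by the upgrade branch
```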
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
I am using csv-streamify in my log parsing project: https://github.com/cboscolo/elb2loggly.
Parsing lines with escaped strings does not work properly. For example, using var csvToJson = csv({objectMode: true, delimiter: ' '});
to parse this line:
2016-06-01T14:09:00.418027Z anaconda-org "GET https://pypi.anaconda.org:443/username/simple/virtualenv/ HTTP/1.1" "pip/8.1.2 {\"openssl_version\":\"OpenSSL 1.0.2d 9 Jul 2015\"}" TLSv1.2
Results in this:
[ "2016-06-01T14:09:00.418027Z", "anaconda-org", "GET https://pypi.anaconda.org:443/username/simple/virtualenv/ HTTP/1.1", "pip/8.1.2 {\\openssl_version\\:\\OpenSSL", "1.0.2d", "9", "Jul", "2015\\}", "TLSv1.2" ]
Instead of this:
[ "2016-06-01T14:09:00.418027Z", "anaconda-org", "GET https://pypi.anaconda.org:443/username/simple/virtualenv/ HTTP/1.1", "pip/8.1.2 {"openssl_version":"OpenSSL 1.0.2d 9 Jul 2015"}", "TLSv1.2" ]
Have you considered adding support for escaped strings? If not, do you know of any other csv parsing npm modules that do?
Branch: build failing 🚨
Dependency: mocha
Current Version: 3.4.2
Type: devDependency
This version is covered by your current version range and after updating it in your project the build failed.
As mocha is “only” a devDependency of this project it might not break production or downstream projects, but “only” your build or test tools – preventing new deploys or publishes.
I recommend you give this issue a high priority. I’m sure you can resolve this 💪
--forbid-only and --forbid-pending flags. Use these in CI or hooks to ensure tests aren't accidentally being skipped! (@charlierudolph)
--napi-modules flag (@jupp0r)
The new version differs by 34 commits.
82d879f
Release v3.5.0
bf687ce
update mocha.js for v3.5.0
ec73c9a
update date for release of v3.5.0 in CHANGELOG [ci skip]
1ba2cfc
update CHANGELOG.md for v3.5.0 [ci skip]
065e14e
remove custom install script from travis (no longer needed)
4e87046
update karma-sauce-launcher URL for npm@5
6886ccc
increase timeout for slow-grepping test
2408d90
Make dependencies use older version of readable-stream to work around incompatibilities introduced by 2.3.0 on June 19th
68a1466
Try not clearing the env for debug in the integration test and see if that fixes Node 0.10 on AppVeyor; if need be, some other fix/workaround can be applied to handle whatever was up with debug without causing this issue
958fbb4
Update new tests to work in browser per test hierarchy reorganization
1df7c94
Merge pull request #2704 from seppevs/increase_test_coverage_of_mocha_js
1f270cd
Stop timing out (#2889)
27c7496
upgrade to [email protected]; closes #2859 (#2860)
50fc47d
fix CI; closes #2867 (#2868)
1b1377c
Add test for ignoreLeaks and fix descriptions
There are 34 commits in total.
See the full diff
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
When parsing the character stream for multi-character sequences, such as CR-LF or quote-quote, the parser will not detect those sequences when they are broken between chunks.
https://github.com/klaemo/csv-stream/blob/master/csv-streamify.js#L79
'\n' is hardcoded; a simple change to this.newline works great.
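To illustrate the chunk-boundary problem, here is a minimal line-splitter sketch (not csv-streamify's actual code; makeLineSplitter is a hypothetical helper) showing the kind of state that must be carried across write() calls so a CR-LF split between chunks is still detected as a single line break:

```javascript
// Carry a "pending CR" flag between chunks instead of only looking
// for multi-character sequences within one chunk.
function makeLineSplitter () {
  let pendingCR = false
  let buf = ''
  const lines = []
  return {
    write (chunk) {
      for (const c of chunk) {
        if (pendingCR) {
          pendingCR = false
          lines.push(buf)
          buf = ''
          if (c === '\n') continue // consume the LF of a split CR-LF
        }
        if (c === '\r') { pendingCR = true; continue }
        if (c === '\n') { lines.push(buf); buf = ''; continue }
        buf += c
      }
      return lines
    }
  }
}

// The CR and LF arrive in different chunks, yet only one line break is emitted:
const s = makeLineSplitter()
s.write('a,b\r')
console.log(s.write('\nc,d')) // → [ 'a,b' ]
```

A parser that only inspects the current chunk would either miss the break or emit a spurious empty line here.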
Hey I'm trying to parse something like this (this is what you get from twitter when you request your tweet archive):
"tweet_id","in_reply_to_status_id","in_reply_to_user_id","timestamp","source","text","retweeted_status_id","retweeted_status_user_id","retweeted_status_timestamp","expanded_urls"
"402889081560236033","","","2013-11-19 20:00:17 +0000","<a href=""http://bjb.io"" rel=""nofollow"">Friend Overview</a>","RT @luk: Decided to match my spending on unnecessary luxury services with donations. Today's pick: @ScriptEdnyc for my $25 uber ride yester…","402887184916566016","16569603","2013-11-19 19:52:44 +0000",""
"402888943315976192","402888586510340096","15116482","2013-11-19 19:59:44 +0000","<a href=""http://bjb.io"" rel=""nofollow"">Friend Overview</a>","@pamasaur @jennschiffer I can't think of a single thing that statement isn't true about.","","","",""
"402887420011892736","","","2013-11-19 19:53:40 +0000","<a href=""http://bjb.io"" rel=""nofollow"">Friend Overview</a>","RT @miketaylr: Further evidence that vendor-prefixed APIs are bad for the web: https://t.co/7qOQQcacXr","402887080319021056","16642746","2013-11-19 19:52:19 +0000","https://miketaylr.com/posts/2013/11/just-uppercase-everything.html,https://miketaylr.com/posts/2013/11/just-uppercase-everything.html"
"402884996119416832","","","2013-11-19 19:44:02 +0000","<a href=""http://bjb.io"" rel=""nofollow"">Friend Overview</a>","OH: “they don't even use npm which is sort of like sitting next to the world's greatest rollercoaster and refusing to ride it”","","","",""
and I'm not getting the right results.
Here's my example script
const csv = require('csv-streamify')
const fs = require('fs')
fs.createReadStream('tweets.limited.csv')
.pipe(csv({objectMode: true, columns: true}))
.on('data', function (data) { console.dir(data) })
.on('error', function (err) { console.error(err) })
.on('end', function () { console.log('finished') })
and here's my console output
○ node index.js
{ tweet_id: '402889081560236033',
in_reply_to_status_id: '","',
in_reply_to_user_id: '2013-11-19 20:00:17 +0000',
timestamp: '<a href="http://bjb.io" rel="nofollow">Friend Overview</a>',
source: 'RT @luk: Decided to match my spending on unnecessary luxury services with donations. Today\'s pick: @ScriptEdnyc for my $25 uber ride yester…',
text: '402887184916566016',
retweeted_status_id: '16569603',
retweeted_status_user_id: '2013-11-19 19:52:44 +0000',
retweeted_status_timestamp: '"\n402888943315976192,402888586510340096,15116482,2013-11-19 19:59:44 +0000,<a href="http://bjb.io" rel="nofollow">Friend Overview</a>,@pamasaur @jennschiffer I can\'t think of a single thing that statement isn\'t true about.,"',
expanded_urls: '","' }
finished
So the first thing to notice is that I'm not getting all the rows. The second is that in_reply_to_status_id is wrong (it should be blank), so every column after it is offset.
If there is a quote inside a table cell (e.g. \tSay "hello"!\t, if the cell separator is a tab), then csv-stream simply deletes the quote. Google Spreadsheet generates this kind of data if you copy a cell that contains a single line of text with a quote in it (in that case the cell itself isn’t quoted and inner quotes aren’t escaped).
IMO, it would be better if such quotes were not removed.
When dealing with very large files and using the callback, the "doc" object can grow very large, leading to out-of-memory errors. It would be useful to invoke the callback at intervals: for instance, after processing every n rows.
Branch: build failing 🚨
Dependency: mocha
Current Version: 3.3.0
Type: devDependency
This version is covered by your current version range and after updating it in your project the build failed.
As mocha is “only” a devDependency of this project it might not break production or downstream projects, but “only” your build or test tools – preventing new deploys or publishes.
I recommend you give this issue a high priority. I’m sure you can resolve this 💪
Mocha is now moving to a quicker release schedule: when non-breaking changes are merged, a release should happen that week.
This week's highlights:
allowUncaught added to the command line as --allow-uncaught (and bugfixed)
--no-warnings and --trace-warnings flags (@sonicdoe)
The new version differs by 9 commits.
7554b31
Add Changelog for v3.4.0
9f7f7ed
Add --trace-warnings flag
92561c8
Add --no-warnings flag
ceee976
lint test/integration/fixtures/simple-reporter.js
dcfc094
Revert "use semistandard directly"
93392dd
no special case for macOS running Karma locally
4d1d91d
--allow-uncaught cli option
fb1e083
fix allowUncaught in browser
4ed3fc5
Add license report and scan status
See the full diff
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
skipEmpty: false // defaults to false, emits everything
If true, empty lines are not emitted; for example, in the following it would skip a blank line. The main use case is files with a trailing blank line at the end.
id
1
2
3
4
The CSV spec (such as it is) specifies that a double quote can be encoded as \" or "". Unfortunately, csv-streamify doesn't support the "" variant.
Dialogue,"""Hello", said the man"
is valid but unsupported CSV.
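For reference, a minimal sketch (not csv-streamify's code; parseLine is a hypothetical helper) of how a character-level parser can support the "" form: inside a quoted field, a quote followed by another quote yields one literal quote instead of closing the field.

```javascript
// Parse a single CSV line with RFC 4180-style doubled-quote escaping.
function parseLine (line) {
  const fields = []
  let field = ''
  let quoted = false
  for (let i = 0; i < line.length; i++) {
    const c = line[i]
    if (quoted) {
      if (c === '"') {
        if (line[i + 1] === '"') { field += '"'; i++ } // "" → literal quote
        else quoted = false                            // closing quote
      } else field += c
    } else if (c === '"' && field === '') quoted = true // opening quote
    else if (c === ',') { fields.push(field); field = '' }
    else field += c
  }
  fields.push(field)
  return fields
}

console.log(parseLine('Dialogue,"""Hello"", said the man"'))
// → [ 'Dialogue', '"Hello", said the man' ]
```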
Couldn't figure this out, but switching packages to csvtojson seemed to resolve it for me.
const csvReader = require('csv-streamify');
const fs = require('fs');
const parser = csvReader({ columns: true });
const reader = fs.createReadStream(...).pipe(parser);
while (1) {
reader.pause()
await something()
reader.resume()
}
After around the 6th resume, it will consistently close the read stream for reasons that elude me.
Hello,
I was trying to process large csv file (~100MB+), and found that this library does not properly return all rows.
Use this dataset to reproduce: description.csv.gz
Code:
// streamify_bug.js
var inputFile = require('fs').createReadStream('./description.csv'),
csv = require("csv-streamify"),
parser = csv({
objectMode: true,
columns: true
});
parser.on('readable', printLine);
inputFile.pipe(parser);
function printLine() {
var line = parser.read();
console.log(parser.lineNo);
}
If I count the number of produced lines (node streamify_bug.js | wc -l) I get 993,319. But description.csv has 1,635,938 lines:
> wc -l description.csv
1635938 description.csv
I'm using csv-streamify version 1.0.0. npm version:
{ http_parser: '1.0',
node: '0.10.32',
v8: '3.14.5.9',
ares: '1.9.0-DEV',
uv: '0.10.28',
zlib: '1.2.3',
modules: '11',
openssl: '1.0.1i',
npm: '1.4.28' }
Create a simple but useful benchmark to measure the throughput.
(patches welcome!)
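A minimal harness sketch of what such a benchmark could look like (the names makeCsv, benchmark, and naiveParse are hypothetical; the naive split-based parser is only a stand-in for the parser under test):

```javascript
// Generate N identical rows of CSV in memory, so the benchmark measures
// parsing rather than disk I/O.
function makeCsv (rows, cols) {
  const line = Array.from({ length: cols }, (_, i) => 'field' + i).join(',')
  return Array.from({ length: rows }, () => line).join('\n') + '\n'
}

// Time a parse function over the input and report rows per second.
function benchmark (parse, input) {
  const start = process.hrtime.bigint()
  const rows = parse(input)
  const seconds = Number(process.hrtime.bigint() - start) / 1e9
  return { rows, rowsPerSec: rows / seconds }
}

// Stand-in parser: counts non-empty lines. Replace with the real parser.
const naiveParse = s => s.split('\n').filter(Boolean).length

const result = benchmark(naiveParse, makeCsv(100000, 10))
console.log(result.rows, 'rows,', Math.round(result.rowsPerSec), 'rows/s')
```

Running the same input through csv-streamify and through competing parsers would then give comparable rows-per-second figures.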
Hey I noticed that the codebase here has the latest version at 2.0.0 but npmjs.org only has version 1.0.0:
https://www.npmjs.com/package/csv-streamify
Any chance you could push the latest to npmjs.org? I'm running into a few issues I would like to fix.
While integrating csv-streamify I found that a strange character (or a wrongly encoded file) in the CSV resulted in a dirty column name.
example.csv
weird_char,ok_col,some_emoj🤙
1,1,1
The returned object included the exact same strings as keys.
{
'weird_char': 1,
ok_col: 1,
'some_emoj🤙': 1
}
My request is to add the ability to map or clean these column names to get a cleaner object, like
{
weird_char: 1,
ok_col: 1,
some_emoj: 1
}
One solution would be to add a regex replace in https://github.com/klaemo/csv-stream/blob/master/csv-streamify.js#L56, example
if (state.lineNo === 0) {
state._columns = state._line.map(col => col.replace(/[^a-zA-Z0-9_]/g,''))
state.lineNo += 1
reset()
return
}
A nicer alternative would be to allow passing a function as the value for columns:
const csv = require('csv-streamify')
const parser = csv({
columns: (cols) => cols.map(col => col.replace(/[^a-zA-Z0-9_]/g,'')),
objectMode: true,
})
As discussed on Twitter, I have values with leading or trailing whitespace which I would like to get rid of. My proposal is to provide a crop option which would trim the values before adding them to the array.
Then again, the library would then be manipulating values, which might not be the point of this library.
The parser seems kind of slow. After confirming this with a benchmark (#4), find ways to improve it.
(ideas and patches welcome!)
Hi,
Great library! Just wanted to let you know about an issue I faced. SFDC has a non-compliant CSV export (from its bulk export tool), producing rows like the "00T1p00002Pq1pREAR" row below. This causes your parser to stay in the _isQuoted state as it flows through potentially hundreds of records, until it encounters another quote that ends the state. For instance, if you save the sample below as a CSV file, the 4 rows after 00T1p00002Pq1pREAR don't get digested.
Our workaround was a local version of your csv-streamify with the following change; it's just sort of an "escape hatch". Using the existing csv-streamify, out of 82 million SFDC "task" records we only processed 76 million; with my change we processed 81,994,204 records thanks to the "recovery" code.
I know this doesn't work in all cases (e.g. a string field might just happen to match my regex, though in our use case that's very improbable), and perhaps you have a real fix for the issue, but I just figured I'd drop a line.
Thanks again for the great work!
/*
...
...
// newline
if (!state._isQuoted && (c === opts.newline || c === opts.newline[0])) {
state._newlineDetected = true
queue(c)
continue
}
if (opts.newlineRegexFailsafe && state._isQuoted && (c === opts.newline || c === opts.newline[0])) {
// find next delimiter.. then apply regex to see if we are at a newline
var buff = [];
var z=1;
var nxtChar;
while(
(!opts.newlineRegexFailsafe.maxReadAheadLength || z<opts.newlineRegexFailsafe.maxReadAheadLength) // restrict how many characters to read ahead.. performance optimization
&& i+z < data.length // don't read past end of input
&& (nxtChar = data.charAt(i+z))!=opts.delimiter) { // read up until the next encounter of delimiter
buff.push(nxtChar);
z++;
}
var succeedingCharacters = buff.join("");
if (succeedingCharacters.match(opts.newlineRegexFailsafe.regex)) {
emitLine(this)
continue
}
}
const csv = require('csv-streamify');
const input = 'subset.csv';
const fs = require('fs');
// If the parser is in the _isQuoted state: as a failsafe for malformed, multi-quoted fields, if I get to a newline
// that is followed by a pattern of 18 characters whose first 3 match a certain SFDC Id convention (e.g. 00T),
// we will consider this the new line. The record we emit at that point will be incomplete (it will
// not contain all of the columns, as they are caught up in the multi-quote column). It is left to the caller
// to check for the correct number of columns and dispose of, or deal with, this errant row.
const parser = csv({"newlineRegexFailsafe" : {"regex" : "^(00T|001|003|005|a21|801|006|00U|a25)[a-zA-Z0-9]{15}$", "maxReadAheadLength" : 20}});
//const parser = csv()// test with this one.. you will see it fails on the 00T1p00002Pq1pREAR record
// emits each line as a buffer or as a string representing an array of fields
var idx = 0;
parser.on('data', function (values) {
console.log(idx + ":" + values[0] + ":" + values.length);
//NOTE: this is a "Task", 79 columns..
//if values.length = 79, probably a good record (perhaps add some simple heuristic to verify a few of the expected contents of columns
//if values.length < 79, was a victim of the double-double-quote issue.. probably easiest to dispose (and log) of record than try to recover
//if values.length > 79, was a victim, but was a rare/unfortunate victim in that the chunk size of the parser straddled the maxReadAheadLength
//string and thus, it ran on into the next record. NOTE: At most we will lose 2 records to the double-double quote issue, as the code will
//read in the next chunk and continue to read through the next record (making values.length > 79) in this _isQuoted state
// until it again encounters the newlineRegexFailsafe regex on the subsequent record.
idx++;
});
//001, 003, 00U, 006, 801, a21, a25, 00T, 00500T1p00002Pq1pREAR
fs.createReadStream(input, {start: 0}).pipe(parser);
00T1p00002Pq1pOEAR,0032400000pPGSuAAO,0012400000poV9uAAE,E-mailed - Anord,2017-03-01,Completed,Normal,false,005U0000000OflxIAC,,Attempted Contact,false,0012400000poV9uAAE,true,2017-03-01T14:55:03.000+0000,005U0000000OflxIAC,2017-03-01T14:55:03.000+0000,005U0000000OflxIAC,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.348341659,,Non-Target,,,,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.348341659,,,,,
00T1p00002Pq1pREAR,0032400000pPGSuAAO,0012400000poV9uAAE,Roofbuilders,2018-03-07,Completed,Normal,false,0051p000008bcDxAAI,"""",Call,false,0012400000poV9uAAE,true,2018-03-07T17:50:08.000+0000,0051p000008bcDxAAI,2018-03-08T09:39:04.000+0000,0051p000008bcDxAAI,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Call,false,,,,,R.412305459,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Call,,true,2018-03-08T09:39:04.000+0000,0.0,,,,,,,,,,,,,,,,,R.412305459,,,,,
00T1p00002Pq1pSEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Left Message & E-mailed - Roofbuilders,2018-03-08,Completed,Normal,false,0051p000008bcDxAAI,#NIS,Attempted Contact,false,0012400000poV9uAAE,true,2018-03-08T09:39:03.000+0000,0051p000008bcDxAAI,2018-03-08T09:39:03.000+0000,0051p000008bcDxAAI,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.412377485,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.412377485,,,,,
00T1p00002Pq1pWEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Left Message - Attempted Contact,2017-07-28,Completed,Normal,false,005U0000004YtsvIAC,,Attempted Contact,false,0012400000poV9uAAE,true,2017-07-28T16:33:47.000+0000,005U0000004YtsvIAC,2017-07-28T16:33:47.000+0000,005U0000004YtsvIAC,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.374017937,,Non-Target,"ONS-Fairfax, VA - P&I-00422",00eU0000000eAbrIAE,Aerotek Chesapeake,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.374017937,,,,,
00T1p00002Pq1pXEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Candidate Summary/G2 Edited by Joseph Henry Breithaupt,2017-07-05,Completed,Normal,false,00524000003MqTYAA0,"really nice guy, jumpy between contracts, worked through people solutions from 2012-14, ennis flint left because he got his bachelors degree and got offered the position at alloy polymers, he is interested in getting more hands on with PLC work or a higher paying maintenance engineer role, he is working a split shift at alloy which he hates (comes in for the morning, leaves and comes back for the evening) he is currently making 28/hr but would be interested in 60k and up because he does get overtime, sending him job descriptions for foley and sabra",G2,false,0012400000poV9uAAE,true,2017-07-05T09:04:25.000+0000,00524000003MqTYAA0,2017-07-05T09:04:25.000+0000,00524000003MqTYAA0,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,G2,false,,,,,R.369709668,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,G2,,false,,0.0,,,,,,,,,,,,,,,,,R.369709668,,,,,
00T1p00002Pq1pYEAR,0032400000pPGSuAAO,0012400000poV9uAAE,TT,2017-07-06,Completed,Normal,false,00524000003MqTYAA0,"not the right experience for either sabra or foley, he is interested in staying in touch for other roles moving forward, sharp guy",Call,false,0012400000poV9uAAE,true,2017-07-06T11:07:09.000+0000,00524000003MqTYAA0,2017-07-06T11:07:09.000+0000,00524000003MqTYAA0,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Call,false,,,,,R.370010467,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Call,,true,2017-07-06T00:00:00.000+0000,0.0,,,,,,,,,,,,,,,,,R.370010467,,,,,
I am running a CSV conversion with:
// parser.js
var Transform = require('stream').Transform;
var csv = require('csv-streamify');
var JSONStream = require('JSONStream');
var csvToJson = csv({objectMode: true, delimiter: ';', inputEncoding: 'utf8'});
var parser = new Transform();
parser._transform = function(data, encoding, done) {
  this.push(data);
  done();
};
var jsonToStrings = JSONStream.stringify(false);
// Pipe the streams
process.stdin
  .pipe(csvToJson)
  .pipe(parser)
  .pipe(jsonToStrings)
  .pipe(process.stdout);
But I get:
events.js:72
  throw er; // Unhandled 'error' event
  ^
TypeError: Invalid non-string/buffer chunk
  at validChunk (_stream_writable.js:150:14)
  at Transform.Writable.write (_stream_writable.js:179:12)
  at write (_stream_readable.js:573:24)
  at flow (_stream_readable.js:582:7)
  at CSVStream.pipeOnReadable (_stream_readable.js:614:5)
  at CSVStream.EventEmitter.emit (events.js:92:17)
  at emitReadable_ (_stream_readable.js:408:10)
  at emitReadable (_stream_readable.js:404:5)
  at readableAddChunk (_stream_readable.js:165:9)
  at CSVStream.Readable.push (_stream_readable.js:127:10)