lukaswagner / csv-parser Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 2.0 860 KB

Quick, multi-threaded CSV parser with focus on handling huge files.

Home Page: https://csv.lwgnr.dev

License: MIT License

Pug 0.14% TypeScript 92.50% JavaScript 6.59% HTML 0.77%

csv js parser ts

csv-parser's People

Contributors

Stargazers

Watchers

Forkers

scheibel hpicgs

csv-parser's Issues

support date columns

Two approaches to storing dates:

Array of Date objects. This would allow access without additional overhead, at the cost of additional storage size. If I'm reading the memory debug tabs correctly, a Date is 60 bytes in Chrome and 88 bytes in Firefox.
Int64Array storing the time in unix epoch. This would require a slight get/set overhead due to the conversion to Date, but additional get/set methods could be provided for accessing the raw unix time at no additional cost. The memory usage would be way less (8 bytes). Additionally, this would simplify passing the data to WebGL.

Considering the pros and cons, using an Int64Array seems to make more sense.

add documentation for public interface

This should include parameter semantics and expected value ranges, as well as code that shows exemplary usage.
Ideally, a function self-indicates when it is useful to get used, and when it should not get used if it is ambiguous.

rebuildColumn should rebuild the underlying chunks as well

Refactor internal `loader.open()` method

Currently when opening a data source, the buffer or stream property of the internal loader instance is set to the data source and then the open() method is called - this seems kind of unintuitive.
I propose to pass the input data source as parameter of the open() method. This way, opening a data source depends on a single method (open()) only, rather than the open() method and X setters (buffer, stream), one for each internal data source.

row limiting

add an option limit the number of loaded rows

get/set on colums off by one

The find predicate is wrong:
const chunk = this._chunks.find((c) => c.offset < index && c.offset + c.length >= index);
When reading index 0, no chunk is found, since the first chunk has offset 0, which is equal to the index. Thus, the lower check has to be changed to <=. Accordingly, the upper check has to be changed to >, as offset + length is already outside the chunk.

column filtering and generating columns not properly supported

The current implementation does not allow filtering columns. On top of this, generated columns aren't working either, as the passed function can't be serialized for passing to a worker. To solve both issues, the following interface should be implemented:

type OriginalColumn = {
    /**
     * References a detected column.
     */
    index: number,
    /**
     * May override name.
     */
    name?: string,
    /**
     * May override data type.
     */
    type?: DataType,
    /**
     * Map values to same type. This may be used as a simpler alternative to
     * generated columns, e.g. for scaling or offsetting of values.
     */
    map?: (v: any) => any
}

type GeneratedColumn = {
    name: string,
    type: DataType,
    /**
     * In order to be passed to workers, this has to be a string.
     * The function must be of type (line: string[]) => any.
     */
    func: string,
    /**
     * Allow sharing a state between func invocations, as well as passing data
     * to func. Note that this has to be fully serializable.
     */
    state: Object // func.bind(state), can be used to pass state
}

type Column = OriginalColumn | GeneratedColumn;

end of chunk not handled correctly, results in additional lines

parsing a 1m-line csv results in 1m+1 lines being read when split up to two workers. the last value of the first chunk is set to 0. possibly connected: the second chunk's offset is set to NaN.

improve and simplify interface

While it's nice to have a completely promise-based interface, the current implemetation has usability problems.

Current State

The minimal usage is still quite complex:

const detectedColumns = await loader.open(id);
const [columns, dispatch] = loader.load({
    columns: detectedColumns.map(({ type }) => type),
    generatedColumns: [],
});
for await (const _ of dispatch());
return columns;

For users that just want their data, this is quite a lot of code. "Why do I need this loop? Why do I need to pass an empty array? What's with that redundant map call?"

But even for advanced users, the interface is unintuitive: The dispatchers done event will always be awaited before the columns are returned. Thus, the done event is redundant, as it will always be directly followed by the function returning.

Requirements

What events does the interface have to support? This is separated into the core functionality and the per-chunk updating:

[core] finished data: After all data is loaded, the columns should be emitted.
[chunks] columns initialized: Directly after the columns are created, they should be emitted.
[chunks] progress: Every time a chunk is parsed, an update callback should be invoked.

Proposed Interface

As the basic interface should be as simple as possible, only one promise should be returned. This promise should resolve when all data is parsed.

The two other event types should be handled using optional callbacks.

Also, the columns/generatedColumns options to load should be changed slightly:

columns should be of type ColumnHeader instead of DataType. This would allow renaming of columns, as well as a simpler interface (passing the detected column headers directly instead of mapping).
generatedColumns should be optional. This is only a quality-of-life improvement to avoid passing an empty array.

The minimal usage could look like this:

const detectedColumns = await loader.open(id);
const data = await loader.load({ columns: detectedColumns });

A more involved usage example, using custom columns and the chunk approach:

const detectedColumns = await loader.open(id);
// don't care about return value, it's already handled in init()
await loader.load({
    columns: detectedColumns
        .filter((column) => column.type === 'number')
        .map((column, i) => ({ name: `num_${i}`, type: column.type })),
    // callback function names are up to debate
    init: (columns) => renderer.initColumns(columns),
    update: (progress) => {
        console.log('progress:', progress);
        renderer.handleUpdate();
    }
});

newlines in quotes not supported

view interface is missing from ColorChunk and DateChunk

The accessor was added for number chunks, but it should be added for all chunks using typed arrays.

published version on npm does not contain the readme

Pass progress as percentage to onUpdate callback

Currently, the onUpdate callback is called with a progress parameter, which represents the number of parsed lines.
When implementing a GUI, I might want to display the loading state using a determinate progress bar/spinner including a percentage. It would be great, if the onUpdate callback would give me a progress object, such as

interface Progress {
    lines: number;
    percentage?: number;
}

where percentage is a number between 0-1 if it is known (undefined otherwise). This should work for all local file types, such as Blob or ArrayBuffer, since the size is already known when starting to parse. Remote streams should work as well, if the Content-Length header is set.
But actually, the percentage would depend on the number of parsed bytes rather than parsed lines, since the size of a buffer or the Content-Length header indicate the byte length. Therefore, the interface should probably be

interface Progress {
    bytes: number;
    lines: number;
    percentage?: number;
}

or maybe even

interface Progress {
    bytes: number;
    lines: number;
    percentage?: number;
    totalBytes?: number;
}

for completeness.

proposal for generic parser interface

Currently, the library supports CSV and TSV data, while external spreadsheet services (Google Sheets and Excel) deliver JSON data. In order to parse the JSON data several TransformStreams are used to decode the raw bytes to text, transform the JSON formats to CSV, and encode the text back to raw bytes. Finally, the transformed data are passed to the parser, which handles the data using the same CSV/TSV parser logic. This is obviously no sophisticated approach and could be improved.

Proposal

All logic that is related to parsing a specific format could be abstracted in a generic Parser interface. I think this would affect four functions that have no relation at the moment: parse, splitLine, splitLines, and maybe parseLine.

interface Parser {
    parse(chunks: ArrayBufferLike[], start: Position, end: Position): string[];
    parseLine(line: string[], types: DataType[]): unknown[];
    splitLine(line: string, delimiter: string): string[];
    splitLines(chunk: string, lines: string[], remainder = ''): string;
}

This would allow defining different parsers for different data sources: CsvParser implements Parser, JsonParser implements Parser, etc.
Additionally, we could expose the Parser interface, so users could define their own parser for tabular data with a custom format, e.g. parsing a table from a Markdown file. A possible usage could be:

import { Parser } from "@lukaswagner/csv-parser";

class MyCustomParser implements Parser {
   // implementations
}

// ...

await parser.open("[custom-file]", { parser: myCustomParser });

The custom parser passed could be an instance or a class. If the parser has no state, the methods could also be defined as static. But I'm open for opinions which API would be best.

Considerations

While this abstraction would enable using this library in more use cases, it would have more responsibilities than the name "CSV parser" suggests. It might be worth thinking about splitting the library into multiple packages, e.g. as main library data-parser and csv-parser, json-parser, etc. that could be integrated like plugins. Since we're already using a monorepo, introducing more packages should not be a big deal. A bigger problem would be the csv-parser library that currently contains the actual library and would become a plugin - so communicating that switching the library would be required. But maybe you have other thoughts/better ideas how to handle this case.

sparse loading

add an option to filter rows, e.g. load only every 100th row. would allow to get a quick overview over huge datasets.

Simplify internal handling of data sources

Currently, we map all different data source types to ArrayBuffer or ReadableStream. Instead of having logic for both interfaces, it might be sensible to map all data sources to ReadableStream directly. It should have no negative impact on the parsing performance, but would simplify the internal logic it takes to handle all data sources.

monitor ram usage

chrome only gives out 4gb of ram. the columns should report their size, and the loader should monitor the size of the currently loaded input chunks and the created output. based on this, the parsing has to be paused, or, if necessary, stopped.

Support for vector type

It would be nice to support some sort of compound types, such as vector/tuple, which could be built from multiple columns. A possible interface is proposed in this issue comment.

Implementing vector types includes

Adding a vector/tuple data type
Adding option to merge single number columns into a vector column in load() method
Adding option to split compound columns into single columns in load() method
Trying to auto-merge columns if sensible, e.g. merging three number columns with names x, y, and z

Use High Resolution Time API for perfMon helper

Currently, the perfMon helper relies on Date.now() to measure the performance of operations.
For higher accuracy and reliability it would be nice to use the Performance interface of the High Resolution Time API.
It would be possible to use performance.now() for the same logic with Date.now(), or to use performance.mark() and performance.measure() to get the delta.

SharedArrayBuffer seems to be unnecessary

From a quick look over the code, it should be possible to just use regular ArrayBuffers combined with postMessage's transfer parameter. This would remove the need for the specific headers.

Support `URL` object as data source

We already allow passing strings, which are URLs to remote data sources.
To be more flexible, we might also want to allow passing an URL object, which could be used for downloading remote data as well.

chunks based on buffers should provide access to the view member in addition to the buffer

Allow more options for Google/Excel data source

Currently, the parser always fetches the first table sheet the API provides and uses the whole range of data. It might also be limited to request data that starts in cell A1.
To be more flexible, it would be great if a user could optionally pass a sheet name or a range that should be fetched (and use the first sheet and the whole range as fallback).