lazycsv's Introduction

lazycsv

What is lazycsv?

lazycsv is a C++17, POSIX-compliant, single-header library for reading and parsing CSV files.
It is fast and lightweight and does not allocate any memory in the constructor or while parsing. It parses each row and cell on demand during iteration, which is why it is called lazy.

Quick usage

The latest version of the single header can be downloaded from include/lazycsv.hpp.

#include <lazycsv.hpp>

int main()
{
    lazycsv::parser parser{ "contacts.csv" };
    for (const auto row : parser)
    {
        const auto [id, name, phone] = row.cells(0, 1, 4); // indexes must be in ascending order
    }
}

Performance note

The parser does not keep the state of already parsed rows and cells, so iterating through them always incurs the parsing cost. The same is true of the cells() member function, so getting all needed cells in a single call is recommended.
If it is necessary to return to already parsed rows and cells, they can be stored in a container and used later without being parsed again (they are view objects and cheap to copy).

Features

The std::string_view objects returned by the raw() and trimed() member functions are valid as long as the parser object is alive:

std::vector<std::string_view> cities;
for (const auto row : parser)
{
    const auto [city, state] = row.cells(0, 1);
    cities.push_back(city.trimed());
}

Iterate through rows and cells:

for (const auto row : parser)
{
    for (const auto cell : row)
    {
    }
}

Get header row and iterate through its cells:

auto header = parser.header();
for (const auto cell : header)
{
}

Find column index by its name:

auto city_index = parser.index_of("city");

row and cell are view objects over the actual data in the parser object; they can be stored and used as long as the parser object is alive:

std::vector<lazycsv::parser<>::row> desired_rows;
std::vector<lazycsv::parser<>::cell> desired_cells;
for (const auto row : parser)
{
    const auto [city] = row.cells(6);
    desired_cells.push_back(city);

    if (city.trimed() == "Kashan")
        desired_rows.push_back(row);
}
static_assert(sizeof(lazycsv::parser<>::row) == 2 * sizeof(void*));  // i'm lightweight
static_assert(sizeof(lazycsv::parser<>::cell) == 2 * sizeof(void*)); // i'm lightweight too

The parser is customizable through its template parameters:

lazycsv::parser<
    lazycsv::mmap_source,           /* source type of csv data */
    lazycsv::has_header<true>,      /* first row is header or not */
    lazycsv::delimiter<','>,        /* column delimiter */
    lazycsv::quote_char<'"'>,       /* quote character */
    lazycsv::trim_chars<' ', '\t'>> /* trim characters of cells */
    my_parser{ "data.csv" };

By default the parser uses lazycsv::mmap_source as its source of data, but it can also be used with any other contiguous container type:

std::string csv_data{ "name,lastname,age\nPeter,Griffin,45\nchris,Griffin,14\n" };

lazycsv::parser<std::string_view> parser_a{ csv_data };
lazycsv::parser<std::string> parser_b{ csv_data };

lazycsv's People

Contributors

ashtum, otreblan

lazycsv's Issues

build tests fail with g++ 11.2.0 and clang++-13

Hi, I just tried using this project; sadly, the tests seem to be broken when building with g++ 11 or clang++ 13.

Expected result: no compiler error

Actual result:

/usr/bin/c++  -I.../build/_deps/lazycsv-src/include -isystem .../build/_deps/lazycsv-src/test/doctest -Wall -Wfatal-errors -Wextra -Wnon-virtual-dtor -pedantic -std=gnu++17 -o CMakeFiles/driver.dir/main.cpp.o -c .../build/_deps/lazycsv-src/test/main.cpp
In file included from .../build/_deps/lazycsv-src/test/main.cpp:4:
.../build/_deps/lazycsv-src/test/doctest/doctest.h:4032:47: error: size of array ‘altStackMem’ is not an integral constant-expression
 4032 |         static char             altStackMem[4 * SIGSTKSZ];
      |                                               ^
compilation terminated due to -Wfatal-errors.
clang++-13  -I.../build/_deps/lazycsv-src/include -isystem .../build/_deps/lazycsv-src/test/doctest -Wall -Wfatal-errors -Wextra -Wnon-virtual-dtor -pedantic -std=gnu++17 -o CMakeFiles/driver.dir/main.cpp.o -c .../build/_deps/lazycsv-src/test/main.cpp
In file included from .../build/_deps/lazycsv-src/test/main.cpp:4:
.../build/_deps/lazycsv-src/test/doctest/doctest.h:4032:33: fatal error: variable length array declaration not allowed at file scope
        static char             altStackMem[4 * SIGSTKSZ];
                                ^           ~~~~~~~~~~~~
1 error generated.

Compiler versions

g++ (Ubuntu 11.2.0-7ubuntu2) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Ubuntu clang version 13.0.0-2
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Steps to reproduce:

  • clone current version
  • try to compile tests with g++ 11 or clang++ 13

Lazycsv can forget the name of the last csv column

In Python, I generated a CSV file as follows:

import random
import csv

headercsv = ["x", "y", "z", "result"]

dataset = []
for i in range(1, 10000):
    x = round(random.uniform(-10, 10),3)
    y = round(random.uniform(-10, 10),3)
    z = round(random.uniform(-10, 10),3)

    result =  round(x * y * z, 3)

    dataset.append([x, y, z, result])

with open('data.csv', 'w', encoding='UTF8') as f:
    writer1 = csv.DictWriter(f, fieldnames=headercsv)
    writer1.writeheader()
    writer = csv.writer(f)
    writer.writerows(dataset)

The first lines of the resulting file, as shown by head -n 5 data.csv:

x,y,z,result
-1.925,5.298,-7.699,78.519
-0.437,2.224,-3.304,3.211
0.82,1.699,5.656,7.88
-6.65,5.465,-2.115,76.864

I then tried to parse the column names of the generated csv file "data.csv" using lazycsv:

#include <iostream>
#include "lazycsv.h"

int main() {
    lazycsv::parser<
            lazycsv::mmap_source,           /* source type of csv data */
            lazycsv::has_header<true>,      /* first row is header or not */
            lazycsv::delimiter<','>,        /* column delimiter */
            lazycsv::quote_char<'"'>,       /* quote character */
            lazycsv::trim_chars<' ', '\t'>> /* trim characters of cells */
    parser{"data.csv"};

    auto header = parser.header();

    for (const auto cell: header) {

        std::cout << cell.raw() << std::endl;
    }
    std::cout << "===========" << std::endl;

    for (const auto cell: header) {

        std::cout << cell.raw() << " vs" << " result" << std::endl;

        if (cell.raw() == "result") {
            std::cout << "catched result" << std::endl;
        }
    }


    return 0;
}

This code generates the following output:

x
y
z
result
===========
x vs result
y vs result
z vs result
 vs result

It is not possible to compare the last column via cell.raw() with another value ("catched result" is not printed).
Furthermore, the last line of the output shows that printing text composed of cell.raw() (for the last column) followed by other text does not work: the column name is missing from the output.

If I move the "result" column to a place other than the last column, I can capture it.
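
A likely explanation, though not one confirmed in this report, is the line ending: Python's csv writers use "\r\n" as their default line terminator, so the last header cell would be "result\r" rather than "result". That would make the comparison fail and let the terminal overwrite the printed name after the carriage return. A minimal sketch of a workaround, assuming only that raw() returns a std::string_view as the README states; strip_cr is a hypothetical helper, not part of lazycsv:

```cpp
#include <string_view>

// Hypothetical helper (not part of lazycsv): drop one trailing '\r', if present.
// Intended for views such as the one returned by cell.raw().
constexpr std::string_view strip_cr(std::string_view text)
{
    if (!text.empty() && text.back() == '\r')
        text.remove_suffix(1);
    return text;
}
```

With this, a check such as strip_cr(cell.raw()) == "result" would no longer depend on the line ending. Writing the file with lineterminator='\n' on the Python side should avoid the problem at the source; adding '\r' to trim_chars and comparing trimed() may also work, but that is an untested assumption.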

Problem with instantiating lazycsv::parser

When compiling a project that requires lazycsv I get the following error:

In instantiation of ‘class lazycsv::parser<lazycsv::mmap_source, lazycsv::has_header<false>, lazycsv::delimiter<','>, lazycsv::trim_chars<' ', '\011'> >’:

../subprojects/lazycsv/include/lazycsv.hpp:367:11: error: ‘value’ is not a member of ‘lazycsv::trim_chars<' ', '\011'>’
  367 |     using cell_iterator = detail::fw_iterator<cell, detail::chunk_cells<delimiter::value, quote_char::value>>;
      |           ^~~~~~~~~~~~~
At global scope:

I am not sure what to make of it so any help would be greatly appreciated.
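
Judging from the instantiation printed in the error, the template argument list goes from delimiter straight to trim_chars, so trim_chars appears to have landed in the position where the parser expects quote_char. The self-contained toy below only illustrates that mechanism. Its types are hypothetical stand-ins that mirror the names in the README, not lazycsv's real definitions:

```cpp
// Hypothetical stand-ins for the policy types, only to show the mechanism.
template<char C>
struct delimiter { static constexpr char value = C; };

template<char C>
struct quote_char { static constexpr char value = C; };

template<char... Cs>
struct trim_chars {}; // note: no `value` member

// Template parameters are positional, like the parser's in the README.
template<
    class Delimiter = delimiter<','>,
    class QuoteChar = quote_char<'"'>,
    class TrimChars = trim_chars<' ', '\t'>>
struct toy_parser
{
    static constexpr char delim = Delimiter::value;
    static constexpr char quote = QuoteChar::value; // ill-formed if QuoteChar is trim_chars<...>
};

// OK: every slot before the customised one is spelled out, in order.
using good_parser = toy_parser<delimiter<';'>, quote_char<'"'>, trim_chars<' ', '\t'>>;

// Analogous to the reported error: trim_chars ends up in the QuoteChar slot.
// using bad_parser = toy_parser<delimiter<';'>, trim_chars<' ', '\t'>>;
// constexpr char q = bad_parser::quote; // error: 'value' is not a member of trim_chars<...>
```

If that reading is right, adding lazycsv::quote_char<'"'> between the delimiter and trim_chars arguments, as in the README's customization example, should resolve the error.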

Casting from cell value to other types

I'm trying to save the values read from a CSV to local variables.
I managed to do that for string variables, but for int, double, etc. it seems tricky. Is there functionality in the library that already does this, without having to convert the value manually while reading?
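
Whether lazycsv ships a conversion helper is not stated here. Since raw() and trimed() return std::string_view, one option is a small wrapper around std::from_chars from the C++17 standard library. The sketch below is an assumption-laden example, and to_integer is a hypothetical name, not a lazycsv function:

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not part of lazycsv): parse a whole string_view as an integer.
// Returns std::nullopt if the text is empty, has trailing characters, or overflows.
template<typename Int>
std::optional<Int> to_integer(std::string_view text)
{
    Int value{};
    const char* first = text.data();
    const char* last = text.data() + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last)
        return std::nullopt;
    return value;
}

// Usage sketch with literal strings standing in for cell.trimed().
inline void to_integer_demo()
{
    assert(to_integer<int>("45") == 45);
    assert(!to_integer<int>("4x5")); // trailing characters are rejected
    assert(!to_integer<int>(""));    // empty cells are rejected
}
```

std::from_chars does not skip leading whitespace, so passing trimed() rather than raw() is probably the safer input. It also has floating-point overloads, but standard library support for them varies, which is why this sketch sticks to integers.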

Problems with commas and double quotes

I don't think this should be called an issue; it is more of an improvement request (or maybe there is already a way to do it that I do not know about). I'm reading a CSV file in the following way:

    for (const auto row : parser)
    {
        StopTimes_struct tmp_stop_times;
        const auto [trip_id, arrival, departure, stop_id, stop_sequence] = row.cells(0, 1, 2, 3, 4); // indexes must be in ascending order
        tmp_stop_times.trip_id = trip_id.raw();
        tmp_stop_times.stop_id = stop_id.raw();
        tmp_stop_times.departure_time = departure.raw();
        tmp_stop_times.arrival_time = arrival.raw();
    }

Without going into much detail, I have GTFS format files, i.e., transit data. The header of the CSV file is:
stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,location_type,parent_station,stop_timezone

Now, I have two kinds of problems, which are actually connected:
If the header and all the rows have double quotes around each field value, the strings resulting from the read have the "\"" character around the actual string. Is there a way, with lazycsv, to extract the string directly, without having to do it manually at each extraction?
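
Until it is clear whether the library can do this itself, the surrounding quotes can be removed from the returned view without copying. The sketch below assumes only that raw() and trimed() return std::string_view, as the README states; unquote is a hypothetical helper, not part of lazycsv:

```cpp
#include <string_view>

// Hypothetical helper (not part of lazycsv): remove one pair of surrounding
// quote characters, if both are present. The result is still a view into the
// same buffer, so it stays valid as long as the original view does.
constexpr std::string_view unquote(std::string_view text, char quote = '"')
{
    if (text.size() >= 2 && text.front() == quote && text.back() == quote)
    {
        text.remove_prefix(1);
        text.remove_suffix(1);
    }
    return text;
}
```

For example, unquote(stop_id.trimed()) would yield the bare identifier for a quoted field and leave an unquoted field unchanged.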

The other problem: a field may contain a comma. An example of such a row:
10018","","C.so Sempione, 83 prima di Via E. Filiberto","","45.4862832229375","9.15805393535531","","","","","",""

Thus, what should be the third field is split at the comma.

Is there some way to solve this issue?

Thank you

parser does not decode appearance of "escaped" quote_char within individual cells

The project allows for quoting of individual cell contents (via quote_char) but does not recognize embedded quote_char characters within quoted cell values. Ideally the project should at least support RFC 4180 encoding or, better, the ability to specify an escape character (e.g., quote_escape_char) to be used with literal appearances of quote_char in the quoted cell string. The latter would enable support for both the RFC 4180 encoding rules (quote_char = " and quote_escape_char = ") and the non-standard but often used backslash escaping found in various database table exports (quote_char = " and quote_escape_char = \).
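
As an illustration of the decoding step being requested, the sketch below collapses each escape-plus-quote pair into a single literal quote. It is a user-side example under stated assumptions, not lazycsv's behaviour: unescape_quotes is a hypothetical name, and the input is assumed to be the cell text with the outer quotes already removed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of lazycsv). `text` is the content between the
// outer quotes. An escape character immediately followed by the quote character
// is decoded as one literal quote; everything else is copied unchanged.
inline std::string unescape_quotes(std::string_view text, char quote = '"', char escape = '"')
{
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == escape && i + 1 < text.size() && text[i + 1] == quote)
            ++i; // skip the escape, keep the quote that follows it
        out.push_back(text[i]);
    }
    return out;
}

// Usage sketch: both conventions decode to the same text.
inline void unescape_quotes_demo()
{
    assert(unescape_quotes("say \"\"hi\"\"") == "say \"hi\"");               // RFC 4180 style
    assert(unescape_quotes("say \\\"hi\\\"", '"', '\\') == "say \"hi\"");    // backslash style
}
```

Unlike the views the parser hands out, this has to allocate a std::string, because the decoded text is shorter than the stored bytes.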

Windows compatibility

I noticed that, since the parsing is based on mmap functions, the library cannot work on non-Unix systems such as Windows. Is a Windows version of lazycsv planned that maintains the current parsing speed?

Thank you
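
The README notes that the parser also accepts other contiguous containers, such as std::string or std::string_view, as its source. A portable fallback could therefore load the file with standard C++ and hand the buffer to the parser. The sketch below covers only the loading half. read_file is a hypothetical helper, and whether the header itself compiles on Windows is not addressed here:

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of lazycsv): read a whole file into memory
// using only the standard library. Binary mode keeps the bytes unchanged.
inline std::string read_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

// Assumed usage, following the README's std::string_view source example:
//     std::string csv_data = read_file("data.csv");
//     lazycsv::parser<std::string_view> parser{ csv_data };
```

This trades the zero-allocation property of mmap_source for one up-front copy of the file, so it will likely be slower on large inputs.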
