
json2parquet's Issues

Specify schema

As with csv2parquet, specifying the target schema, rather than inferring it, can be useful.
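
For illustration, here is what specifying a target schema looks like with pyarrow standing in for the requested option (the file name and field names below are made up; this sketches the idea, not json2parquet's eventual interface):

import pyarrow as pa
import pyarrow.json as pj
import pyarrow.parquet as pq

# Hypothetical target schema: force the column types instead of inferring them.
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("name", pa.string()),
])

table = pj.read_json(
    "input.jsonl",
    parse_options=pj.ParseOptions(explicit_schema=schema,
                                  unexpected_field_behavior="ignore"),
)
pq.write_table(table, "output.parquet")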

setting `--max-read-records 0` reads zero records

Thank you for this great tool!

The documentation notes that setting --max-read-records to 0 will stop schema inference and set the type for all columns to strings:

From json2parquet/Readme.md, lines 32–33 at 27dfa6a:

--max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed

The behavior I'm seeing is that json2parquet will only read the number of records set by --max-read-records.

Is there another way to stop schema inference?

Some examples demonstrating the behavior:

$ cat foo.json
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}

$ ./json2parquet -V
json2parquet 0.6.0

$ ./json2parquet --max-read-records 0 foo.json foo.parquet -p
Schema:
{
  "fields": []
}

$ du -h --apparent-size foo.parquet
183     foo.parquet

$ ./json2parquet --max-read-records 2 foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
751     foo.parquet

$ ./json2parquet foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key3",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key4",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key5",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
1.6K    foo.parquet
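
Until the flag behaves as documented, one workaround for the foo.json example above is to read the file with pyarrow and an explicit all-string schema (a sketch, not json2parquet itself; the five key names are the ones shown above):

import pyarrow as pa
import pyarrow.json as pj
import pyarrow.parquet as pq

# Skip inference entirely: declare every column as a string up front.
schema = pa.schema([pa.field(f"key{i}", pa.string()) for i in range(1, 6)])
table = pj.read_json("foo.json",
                     parse_options=pj.ParseOptions(explicit_schema=schema))
pq.write_table(table, "foo.parquet")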

LZ4-compressed PQ files unreadable by Pandas and ClickHouse

Versions:

  • json2parquet 0.6.0
  • PyArrow 10.0.1
  • ClickHouse 22.13.1.1119
$ vi test.jsonl
{"area": 123, "geom": "", "centroid_x": -86.86346599122807, "centroid_y": 34.751296108771925, "h3_7": "872649315ffffff", "h3_8": "882649315dfffff", "h3_9": "892649315cbffff"}
$ json2parquet -c lz4 test.jsonl lz4.pq
$ ls -lth lz4.pq # 2.5K
$ hexdump -C lz4.pq | head; echo; hexdump -C lz4.pq | tail
00000000  50 41 52 31 15 00 15 1c  15 42 2c 15 02 15 00 15  |PAR1.....B,.....|
00000010  06 15 06 1c 58 08 7b 00  00 00 00 00 00 00 18 08  |....X.{.........|
00000020  7b 00 00 00 00 00 00 00  00 00 00 04 22 4d 18 44  |{..........."M.D|
00000030  40 5e 0e 00 00 80 02 00  00 00 02 01 7b 00 00 00  |@^..........{...|
00000040  00 00 00 00 00 00 00 00  e4 c0 1d d2 15 04 19 25  |...............%|
00000050  00 06 19 18 04 61 72 65  61 15 0a 16 02 16 6a 16  |.....area.....j.|
00000060  90 01 26 08 3c 58 08 7b  00 00 00 00 00 00 00 18  |..&.<X.{........|
00000070  08 7b 00 00 00 00 00 00  00 00 00 15 00 15 1c 15  |.{..............|
00000080  42 2c 15 02 15 00 15 06  15 06 1c 58 08 19 62 dc  |B,.........X..b.|
00000090  06 43 b7 55 c0 18 08 19  62 dc 06 43 b7 55 c0 00  |.C.U....b..C.U..|

00000970  41 42 51 41 45 41 41 4f  41 41 38 41 42 41 41 41  |ABQAEAAOAA8ABAAA|
00000980  41 41 67 41 45 41 41 41  41 42 67 41 41 41 41 67  |AAgAEAAAABgAAAAg|
00000990  41 41 41 41 41 41 41 42  41 68 77 41 41 41 41 49  |AAAAAAABAhwAAAAI|
000009a0  41 41 77 41 42 41 41 4c  41 41 67 41 41 41 42 41  |AAwABAALAAgAAABA|
000009b0  41 41 41 41 41 41 41 41  41 51 41 41 41 41 41 45  |AAAAAAAAAQAAAAAE|
000009c0  41 41 41 41 59 58 4a 6c  59 51 41 41 41 41 41 3d  |AAAAYXJlYQAAAAA=|
000009d0  00 18 19 70 61 72 71 75  65 74 2d 72 73 20 76 65  |...parquet-rs ve|
000009e0  72 73 69 6f 6e 20 32 33  2e 30 2e 30 00 fe 04 00  |rsion 23.0.0....|
000009f0  00 50 41 52 31                                    |.PAR1|
000009f5
$ ipython
In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('lz4.pq')

In [3]: pf
Out[3]: <pyarrow.parquet.core.ParquetFile at 0x10ca1cd90>

In [4]: pf.schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x10e74b280>
required group field_id=-1 arrow_schema {
  optional int64 field_id=-1 area;
  optional double field_id=-1 centroid_x;
  optional double field_id=-1 centroid_y;
  optional binary field_id=-1 geom (String);
  optional binary field_id=-1 h3_7 (String);
  optional binary field_id=-1 h3_8 (String);
  optional binary field_id=-1 h3_9 (String);
}

In [6]: pf.read()

# OSError: Corrupt Lz4 compressed data.
$ clickhouse client
CREATE TABLE pq_test (
    area Nullable(Int64),
    centroid_x Nullable(Float64),
    centroid_y Nullable(Float64),
    geom Nullable(String),
    h3_7 Nullable(String),
    h3_8 Nullable(String),
    h3_9 Nullable(String))
ENGINE = "Log";
$ clickhouse client \
    --query='INSERT INTO pq_test FORMAT Parquet' \
    < lz4.pq
Code: 33. DB::ParsingException: Error while reading Parquet data: IOError: Corrupt Lz4 compressed data.: While executing ParquetBlockInputFormat: data for INSERT was parsed from stdin: (in query: INSERT INTO pq_test FORMAT Parquet). (CANNOT_READ_ALL_DATA)
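
Parquet's original LZ4 codec was loosely specified (Hadoop-style framing vs. raw block format) and writers and readers have historically disagreed about it, which seems a plausible cause here; LZ4_RAW was added to the format later to resolve the ambiguity. The codec recorded in each column chunk can be checked from the footer alone, without decompressing any data, e.g. with a sketch like the one below; regenerating the file with another codec such as `-c snappy` is a possible workaround in the meantime.

import pyarrow.parquet as pq

# Opening the file and reading its metadata works even though pf.read() fails,
# because only the footer is touched here.
pf = pq.ParquetFile("lz4.pq")
rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression)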

read jsonl from stdin

It would be nice if stdin were supported:

unzstd --memory=2048MB --stdout RS_2008-02.jsonl.zst | json2parquet out.parq
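
In the meantime, a small pyarrow stopgap can read newline-delimited JSON from stdin and write Parquet; note it buffers the whole input in memory, and the script name in the usage line below is hypothetical:

import sys

import pyarrow.json as pj
import pyarrow.parquet as pq

# Read JSONL from stdin (pyarrow accepts a file-like object) and write Parquet.
table = pj.read_json(sys.stdin.buffer)
pq.write_table(table, "out.parq")

$ unzstd --memory=2048MB --stdout RS_2008-02.jsonl.zst | python jsonl_stdin_to_parquet.py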

Performance improvement for producing Snappy-compressed Parquet files?

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with json2parquet and ClickHouse. I found json2parquet to be 1.5x slower than ClickHouse when it came to converting the records into Snappy-compressed Parquet.

I converted the original GeoJSON into JSONL with three elements per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.

$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}

I then converted that file into Snappy-compressed Parquet with ClickHouse which took 32 seconds and produced a file 793 MB in size.

$ cat California.jsonl \
    | clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    > cali.snappy.pq

The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).

$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
        target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq

The above took 43.8 seconds to convert the JSONL into PQ and produced an 815 MB file containing 12 row groups.

In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('California.snappy.pq')

In [3]: pf.schema
Out[3]: 
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
  optional binary field_id=-1 capture_dates_range (String);
  optional binary field_id=-1 geom (String);
  optional int64 field_id=-1 release;
}

In [4]: pf.metadata
Out[4]: 
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
  created_by: parquet-rs version 29.0.0
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 12
  format_version: 1.0
  serialized_size: 7969

The ClickHouse-produced PQ file has 306 row groups.

In [1]: pf = pq.ParquetFile('cali.snappy.pq')

In [2]: pf.schema
Out[2]: 
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
  optional int64 field_id=-1 release;
  optional binary field_id=-1 capture_dates_range;
  optional binary field_id=-1 geom;
}

In [3]: pf.metadata
Out[3]: 
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389

I'm not sure if the row group sizes played into the performance delta.
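
For what it's worth, the row-group layouts of the two files can be compared directly from the Parquet metadata, along these lines:

import pyarrow.parquet as pq

# Compare how the rows are distributed across row groups in the two files.
for path in ("California.snappy.pq", "cali.snappy.pq"):
    md = pq.ParquetFile(path).metadata
    rows = [md.row_group(i).num_rows for i in range(md.num_row_groups)]
    print(f"{path}: {md.num_row_groups} row groups, "
          f"{min(rows)}-{max(rows)} rows per group")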

Is there anything I can do to my compilation settings to speed up json2parquet?

Reader stops after first batch

match reader.next() {

Thank you for writing these utilities! I was looking for precisely this capability.

I was consistently only getting 1024 records in my output until I realized the reader.next() function is only getting called once here. It needs to be wrapped in a loop to consume all the batches, right?
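
For comparison, here is the same pattern expressed with pyarrow, purely to illustrate draining every batch rather than stopping after the first; the actual fix belongs in the Rust reader loop:

import pyarrow.parquet as pq

# Keep pulling batches until the reader is exhausted instead of taking only one.
pf = pq.ParquetFile("California.snappy.pq")
total = 0
for batch in pf.iter_batches(batch_size=1024):
    total += batch.num_rows
print(total)  # all 11,542,912 rows, not just the first 1024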
