Comments (18)

mhuang74 commented on May 20, 2024

Thanks for all your help with tidying up the code @jqnatividad.

One more bug-fix PR for validate: PR #151.

Sure, this has been great for my Rust learning. Happy to pick up schema (#60) next.

mhuang74 commented on May 20, 2024

Some good ideas for reference:

theodi/shared#142

mhuang74 commented on May 20, 2024

@jqnatividad Please let me know if you have some thoughts on the primary use cases.

  • identify all schema/value errors in CSV and fix them
  • quickly determine if CSV data matches expected schema
  • generate schema from CSV

Here's a first pass at the docopt:

Validate CSV data with JSON Schema.

Usage:
    qsv validate [options] [<input>] <json-schema>

validate options:
    -c, --new-column <name>    Put error(s) in a new column instead.
    --fail-fast                Stops on first error.
    --generate-schema          Generate JSON Schema from data.


Common options:
    -h, --help                 Display this message
    -o, --output <file>        Write output to <file> instead of stdout.
    -n, --no-headers           When set, the first row will not be interpreted
                               as headers. Namely, it will be validated with the rest
                               of the rows. Otherwise, the first row will always
                               appear as the header row in the output.
    -d, --delimiter <arg>      The field delimiter for reading CSV data.
                               Must be a single character. (default: ,)
    -q, --quiet                Don't show progress bars.

jqnatividad commented on May 20, 2024

Hi @mhuang74,

Awesome! I'm keen to see validate implemented.

The primary use cases are the ones you listed, but I think we should break this up into three commands.

validate - 1) identify all schema/value errors and 2) determine if CSV data matches the expected schema, with two modes:

  1. Given a jsonschema file - validate using the schema. The schema can be a file or a URL where the jsonschema can be found.
    It produces three outputs (see the sketch after this comment):
  • A valid-data CSV file. All records passing validation (add a -valid.csv suffix automatically if the --valid option is not specified)
  • An invalid CSV file. All records failing validation (add an -invalid.csv suffix automatically if the --invalid option is not specified)
  • A validation report. A human-readable, but also machine-readable, file that can be used by another command to fix invalid rows (see fix-data below)

    csvlint.io is a good reference for this (and the other ODI ideas you pointed out - theodi/shared#142).

    It will also have a --fail-fast mode that skips these three outputs for this potentially expensive operation and just gives a non-zero exit code (perhaps the row number of the first invalid record?).
  2. Without a jsonschema file - it just validates that the CSV is valid per the RFC 4180 standard.
  • For both modes, a zero exit code means the CSV file is valid.

jsonschema or schema - generate schema from CSV. I was thinking we should leverage the output of stats and optionally frequency to determine the data types and the range/domain of each field.

fix-data (to be consistent with fix-lengths) - for the rows failing validation, validate produces an -invalid.csv with some error metadata columns. This command allows the corrections to be processed at scale (e.g. invalid values set to NULL/None/empty or to a given value for a field; invalid values set to a Default value or Default error value (per schema?)), with safe default "corrections" if corrections are not explicitly specified.

Thoughts?
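
To make the first mode concrete, here's a rough sketch of the row-by-row split described above, assuming the csv, serde_json, and jsonschema crates; the file names, the all-strings row mapping, and the jsonl report format are placeholders for illustration, not the final design:

// Rough sketch of the per-row split: one valid-data CSV, one invalid CSV, and a
// jsonl report of the errors per failing row. Placeholder names throughout.
use std::{error::Error, fs::File, io::Write};

use jsonschema::JSONSchema;
use serde_json::{Map, Value};

fn main() -> Result<(), Box<dyn Error>> {
    let schema: Value = serde_json::from_reader(File::open("schema.json")?)?;
    let compiled = JSONSchema::compile(&schema).expect("invalid JSON Schema");

    let mut rdr = csv::ReaderBuilder::new()
        .flexible(true) // ragged rows shouldn't abort the whole run
        .from_path("input.csv")?;
    let headers = rdr.headers()?.clone();

    let mut valid_wtr = csv::Writer::from_path("input-valid.csv")?;
    let mut invalid_wtr = csv::Writer::from_path("input-invalid.csv")?;
    let mut report = File::create("input-validation-report.jsonl")?;
    valid_wtr.write_record(&headers)?;
    invalid_wtr.write_record(&headers)?;

    for (i, record) in rdr.records().enumerate() {
        let record = record?;
        // Turn the row into a JSON object keyed by header name. Everything is a
        // string here; a real implementation would coerce types per the schema.
        let row = Value::Object(
            headers
                .iter()
                .zip(record.iter())
                .map(|(h, v)| (h.to_string(), Value::String(v.to_string())))
                .collect::<Map<String, Value>>(),
        );

        if compiled.is_valid(&row) {
            valid_wtr.write_record(&record)?;
        } else {
            invalid_wtr.write_record(&record)?;
            // validate() yields the individual errors for the report
            let errors: Vec<String> = compiled
                .validate(&row)
                .err()
                .into_iter()
                .flatten()
                .map(|e| e.to_string())
                .collect();
            writeln!(report, "{}", serde_json::json!({"row": i + 1, "errors": errors}))?;
        }
    }
    valid_wtr.flush()?;
    invalid_wtr.flush()?;
    Ok(())
}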

jqnatividad commented on May 20, 2024

As for the docopt format, it looks good, but we can tackle that once we agree on the general approach.

That said, I think this is where multi-threading can really make a big difference: since each row can be validated independently, we can break the validation up into chunks, so we should include the --jobs option.
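
A minimal sketch of that chunked approach, assuming the rayon and jsonschema crates, rows already parsed into serde_json values, and a compiled schema that can be shared across threads:

// Each row validates independently, so rayon can split the work across a
// thread pool sized the way a --jobs option would specify.
use jsonschema::JSONSchema;
use rayon::prelude::*;
use serde_json::{json, Value};

fn invalid_row_indexes(schema: &Value, rows: &[Value]) -> Vec<usize> {
    let compiled = JSONSchema::compile(schema).expect("invalid JSON Schema");
    rows.par_iter()
        .enumerate()
        .filter(|&(_, row)| !compiled.is_valid(row))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // a --jobs N option would translate to the thread-pool size
    let _ = rayon::ThreadPoolBuilder::new().num_threads(4).build_global();

    let schema = json!({"type": "object", "properties": {"age": {"type": "integer"}}});
    let rows = vec![json!({"age": 21}), json!({"age": "21"})];
    println!("invalid rows: {:?}", invalid_row_indexes(&schema, &rows));
}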

mhuang74 commented on May 20, 2024

@jqnatividad here's my POC for the validate command using the jsonschema-rs crate.

#145

Example run documented in Validate.md.

One problem that I need help with: the progress bar panics if the CSV file has an unequal number of columns (util.rs:117). I am not sure how to refactor this. Any thoughts?

$ cat invalid.csv
header1,header2
1,2,3
foo,bar
quux,baz,foobar
hello, world

$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(UnequalLengths { pos: Some(Position { byte: 16, line: 2, record: 1 }), expected_len: 2, len: 3 })', src/util.rs:117:53
stack backtrace:
   0: rust_begin_unwind
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/panicking.rs:107:14
   2: core::result::unwrap_failed
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1613:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1295:23
   4: qsv::util::count_rows
             at ./src/util.rs:117:19
   5: qsv::cmd::validate::run
             at ./src/cmd/validate.rs:88:28
   6: qsv::Command::run
             at ./src/main.rs:327:34
   7: qsv::main
             at ./src/main.rs:177:28
   8: core::ops::function::FnOnce::call_once
             at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/ops/function.rs:227:5

It works fine when the progress bar is quieted:

$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv --quiet
CSV error: record 1 (line: 2, byte: 16): found record with 3 fields, but the previous record has 2 fields

jqnatividad commented on May 20, 2024

@mhuang74 thanks for this!

The problem is that the progress bar does a count à la qsv count, and since the file is invalid, it panics because the record lengths don't match the header.

The fix is relatively simple: just set flexible to true when setting up the Config ReaderBuilder.
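
In plain csv-crate terms (a sketch of the idea, not the actual util.rs or Config code), the count stops panicking once the reader is flexible:

use std::error::Error;

// With flexible(true) the reader accepts records whose field count differs from
// the header's, so counting rows for the progress bar no longer panics on an
// invalid file.
fn count_rows(path: &str) -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new().flexible(true).from_path(path)?;
    let mut count = 0u64;
    for record in rdr.records() {
        record?; // real I/O or UTF-8 errors still propagate
        count += 1;
    }
    Ok(count)
}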

And while we're on the progress bar, you may want to use the new update_cache_info! macro.

I'll add comments to the PR for both changes...

mhuang74 commented on May 20, 2024

Thanks @jqnatividad. The above problem is resolved, and the simple CSV check integration tests are passing now.

I need to add another test to properly check for output files.

Any suggestions on how to check for output files?

jqnatividad commented on May 20, 2024

Great!

As for testing the output files, having a sample CSV and a corresponding, relatively complex jsonschema (the PublicToilets schema on csvlint seems interesting), with expected valid and invalid rows, should be good enough.
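
Outside qsv's own test harness, one bare-bones way to check the output files is to run the built binary against a small fixture and compare what it writes; here's a sketch using only std, where the validate argument order and the generated file names are assumptions:

// tests/validate_outputs.rs - bare-bones sketch using only std.
// CARGO_BIN_EXE_qsv is set by cargo when integration-testing a package that
// builds a `qsv` binary. The argument order and output-file naming are assumed.
use std::{fs, process::Command};

#[test]
fn validate_writes_valid_and_invalid_files() {
    let dir = std::env::temp_dir().join("qsv_validate_test");
    fs::create_dir_all(&dir).unwrap();

    let csv_path = dir.join("data.csv");
    let schema_path = dir.join("schema.json");
    fs::write(&csv_path, "name,age\nalice,30\nbob,not-a-number\n").unwrap();
    fs::write(
        &schema_path,
        r#"{"type":"object","properties":{"age":{"type":"integer"}}}"#,
    )
    .unwrap();

    let _status = Command::new(env!("CARGO_BIN_EXE_qsv"))
        .arg("validate") // hypothetical invocation; adjust to the real CLI
        .arg(&csv_path)
        .arg(&schema_path)
        .status()
        .expect("failed to run qsv");

    // Compare the generated file against the expected rows.
    let invalid = fs::read_to_string(dir.join("data.csv.invalid")).unwrap();
    assert!(invalid.contains("bob"), "the bad row should land in the invalid file");
}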

mhuang74 commented on May 20, 2024

@jqnatividad I need some help understanding the standards around JSON schemas, and whether the choice of the jsonschema crate is the right approach. It looks like it works for the dependent project cjval, which validates CityJSON files, but it doesn't work for the Public Toilets schema.

It looks like the Public Toilets schema conforms to JSON Table Schema rather than the JSON Schema standard that the jsonschema-rs crate assumes.

I attempted to run the Public Toilets schema through jsonschema-rs for validation, and it did not pick up and apply the constraints.

Specifically, the "keywordLocation" and "instanceLocation" are all blank.

What jsonschema-rs outputs with the example People schema from the JSON Schema site:

[src/cmd/validate.rs:143] &validation_result = Object({
    "valid": Bool(
        false,
    ),
    "errors": Array([
        Object({
            "keywordLocation": String(
                "/properties/age/type",
            ),
            "instanceLocation": String(
                "/age",
            ),
            "absoluteKeywordLocation": String(
                "https://example.com/properties/age/type",
            ),
            "error": String(
                "\"21\" is not of type \"integer\"",
            ),
        }),
    ]),
})

A snippet of what jsonschema-rs outputs with the Public Toilets schema:

[src/cmd/validate.rs:143] &validation_result = Object({
    "valid": Bool(
        true,
    ),
    "annotations": Array([
        Object({
            "keywordLocation": String(
                "",
            ),
            "instanceLocation": String(
                "",
            ),
            "annotations": Object({
                "fields": Array([
                    Object({
                        "name": String(
                            "ExtractDate",
                        ),
                        "title": String(
                            "Extract Date",
                        ),
                        "description": String(
                            "The date that the data was last extracted from its source database or manually updated",
                        ),
                        "constraints": Object({
                            "required": String(
                                "true",
                            ),
                            "type": String(
                                "http://www.w3.org/2001/XMLSchema#string",
                            ),
                            "pattern": String(
                                "(^([0-2]\\d|[3][0-1])\\/([0]\\d|[1][0-2])\\/[1-2]\\d{3}((\\s([0-1]\\d|[2][0-3]):[0-5]\\d(:[0-5]\\d)?)?)|)$",
                            ),
                        }),
                    }),
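
For context, that behavior follows from how JSON Schema treats unknown keywords: fields and constraints are not JSON Schema keywords, so the validator ignores them and every instance passes. A minimal illustration, assuming the jsonschema crate's top-level is_valid helper:

use serde_json::json;

fn main() {
    // Table Schema style: "fields"/"constraints" are not JSON Schema keywords,
    // so they are silently ignored and any instance validates.
    let table_schema = json!({
        "fields": [{ "name": "ExtractDate", "constraints": { "required": "true" } }]
    });
    // The same constraint expressed with real JSON Schema keywords.
    let json_schema = json!({
        "type": "object",
        "required": ["ExtractDate"],
        "properties": { "ExtractDate": { "type": "string", "pattern": "^\\d{2}/\\d{2}/\\d{4}" } }
    });
    let row = json!({ "ExtractDate": "not a date" });

    assert!(jsonschema::is_valid(&table_schema, &row));  // passes: nothing is enforced
    assert!(!jsonschema::is_valid(&json_schema, &row));  // fails the pattern as expected
}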


jqnatividad commented on May 20, 2024

@mhuang74 let's just create a test schema & a corresponding CSV we can use.

jsonschema seems to be the most mature library we can leverage so let's stick with it (not to mention the same author wrote the infers-jsonschema crate that we can leverage for the schema command).

Although JSON Table Schema is a specialized standard for tabular data (read CSVs), there seems to be no other crate we can leverage for the non-trivial validation task. But since JSON Table Schema is a subset of the IETF JSON Schema standard, it should be good enough.

What about creating a sample schema using the sample NYC 311 data that's used in qsv's benchmark script?

It has different data types and opportunities for exercising validation tasks that jsonschema can do; it's a good showcase example; and we can baseline validate's performance with the benchmark.

mhuang74 commented on May 20, 2024

Thanks @jqnatividad. They all require hand-crafting a schema file, so I stuck with the Public Toilets dataset.

The integration test uses real data and illustrates an invalid row showing up in the output.invalid file.

jqnatividad commented on May 20, 2024

Public toilet it is! :)

Lemme know when you want me to merge the PR. Convert it to draft in the meantime?

mhuang74 commented on May 20, 2024

Thanks @jqnatividad. I think it's ready for beta release. What do you think?

jqnatividad commented on May 20, 2024

Yep! Prepping for 0.3.0 release... :)

Did some minor tweaks though - #148

mhuang74 commented on May 20, 2024

Thanks for the fixups @jqnatividad. Good catch on the progress bar. And I just realized that named parameters are supported.

jqnatividad commented on May 20, 2024

@mhuang74, our MSRV is 1.58.1, so we might as well use the latest features.

And I tweaked validate further just now to create the error-report as a jsonl file - easier to digest for both humans and machines 😏 with the ability to convert it to csv too with the jsonl command - #149

jqnatividad commented on May 20, 2024

@mhuang74, 0.3.0 featuring validate is released, and we can now close this issue!

However, I had to limit cross-compilation for now, so binaries for certain platforms are not published (c54a2f2).

The good news is that I managed to trace the problem to jsonschema; the fix has been merged (Stranger6667/jsonschema-rs#336), and we'll be able to publish binaries for those platforms again in the next release.

Thanks heaps again for your contributions!!!

Are you up to taking on schema (#60)? 😉 It need not use stats and frequency for the first release; it can just use infers-jsonschema...
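
Not the infers-jsonschema API (whose interface isn't shown here), but as a hand-rolled illustration of the inference idea - sample the CSV, guess a type per column, and emit a minimal JSON Schema:

// Hand-rolled sketch of schema inference: sample rows, downgrade a column from
// "integer" to "string" as soon as a sampled value doesn't parse, and emit a
// minimal JSON Schema. A real implementation (stats/frequency or
// infers-jsonschema) would be far richer.
use std::error::Error;

use serde_json::{json, Map, Value};

fn infer_schema(path: &str, sample: usize) -> Result<Value, Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new().flexible(true).from_path(path)?;
    let headers = rdr.headers()?.clone();
    let mut integerish = vec![true; headers.len()];

    for record in rdr.records().take(sample) {
        let record = record?;
        for (flag, field) in integerish.iter_mut().zip(record.iter()) {
            if field.parse::<i64>().is_err() {
                *flag = false;
            }
        }
    }

    let properties: Map<String, Value> = headers
        .iter()
        .zip(integerish)
        .map(|(h, is_int)| {
            let ty = if is_int { "integer" } else { "string" };
            (h.to_string(), json!({ "type": ty }))
        })
        .collect();

    Ok(json!({
        "type": "object",
        "properties": properties,
        "required": headers.iter().collect::<Vec<_>>(),
    }))
}

fn main() -> Result<(), Box<dyn Error>> {
    println!("{}", serde_json::to_string_pretty(&infer_schema("input.csv", 1000)?)?);
    Ok(())
}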
