Comments (18)
Thanks for all your help with tidying up the code @jqnatividad.
One more bug fix PR for validate. PR #151
Sure, this has been great for my Rust learning. Happy to pick up schema (#60) next.
from qsv.
Some good ideas for reference:
from qsv.
@jqnatividad Please let me know if you have some thoughts on the primary usecases.
- identify all schema/value errors in CSV and fix them
- quickly determine if CSV data matches expected schema
- generate schema from CSV
Here's 1st pass at docopt
Validate CSV data with JSON Schema.
Usage:
qsv validate [options] [<input>] <json-schema>
fetch options:
-c, --new-column <name> Put error(s) in a new column instead.
--fail-fast Stops on first error.
--generate-schema Generate JSON Schema from data.
Common options:
-h, --help Display this message
-o, --output <file> Write output to <file> instead of stdout.
-n, --no-headers When set, the first row will not be interpreted
as headers. Namely, it will be sorted with the rest
of the rows. Otherwise, the first row will always
appear as the header row in the output.
-d, --delimiter <arg> The field delimiter for reading CSV data.
Must be a single character. (default: ,)
-q, --quiet Don't show progress bars.
from qsv.
Hi @mhuang74,
Awesome! I'm keen to see validate
implemented.
The primary usecases are the ones you listed, but I think we should break it up into three commands.
validate
- 1) identify all schema/value errors and 2) determine if CSV data matches expected schema, with two modes:
- given a jsonschema file - validate using the schema. The schema can be a file or a url where the jsonschema can be found.
It produces three outputs:
- A valid-data CSV file. All records passing validation (add a
-valid.csv
suffix automatically if--valid
option is not specified) - An invalid CSV file. All records failing validation (add an
-invalid.csv
suffix automatically if--invalid
option is not specified) - A validation report. A human-readable, but also machine-readable file that can be used by another command to fix invalid rows. See
fix-data
below)
csvlint.io is a good reference for this (and the other ODI ideas you pointed out - theodi/shared#142).
It will also have a--fail-fast
mode that doesn't produce these three outputs for this potentially expensive operation, but just gives a non-zero exit code (perhaps, the row of the first invalid record?)
- without a jsonschema - it just validates that a CSV is a valid per RFC4180 standard
- for both modes, zero exit code if CSV file is valid
jsonschema
or schema
- generate schema from CSV. I was thinking we should leverage the output of stats
and optionally frequency
to determine the data types and the range/domain of each field.
fix-data
(to be consistent with fix-lengths
) - for the rows failing validation, an -invalid.csv
with some error metadata columns is produced by validate
. This command allows the corrections to be be processed at scale (e.g. invalid values set to NULL/None/empty or a given value for a field; invalid values set to Default value or Default error value (per schema?)), with safe default "corrections" if corrections are not explicitly specified.
Thoughts?
from qsv.
As for the docopt format, it looks good, but we can tackle that once we agree on the general approach.
However, with that said, I think this is where multi-threading can really make a big difference, as we can break up the validation into chunks as each row can be validated independently, so we should include the --jobs
option.
from qsv.
@jqnatividad here's my POC for validate
command using jasonschema-rs crate.
Example run documented in Validate.md.
One problems that I need help with. Progress bar would panic if csv file has unequal number of columns (utils.rs:117). I am not sure how to refactor this. Any thoughts?
$ cat invalid.csv
header1,header2
1,2,3
foo,bar
quux,baz,foobar
hello, world
$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(UnequalLengths { pos: Some(Position { byte: 16, line: 2, record: 1 }), expected_len: 2, len: 3 })', src/util.rs:117:53
stack backtrace:
0: rust_begin_unwind
at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/std/src/panicking.rs:498:5
1: core::panicking::panic_fmt
at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/panicking.rs:107:14
2: core::result::unwrap_failed
at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1613:5
3: core::result::Result<T,E>::unwrap
at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/result.rs:1295:23
4: qsv::util::count_rows
at ./src/util.rs:117:19
5: qsv::cmd::validate::run
at ./src/cmd/validate.rs:88:28
6: qsv::Command::run
at ./src/main.rs:327:34
7: qsv::main
at ./src/main.rs:177:28
8: core::ops::function::FnOnce::call_once
at /rustc/02072b482a8b5357f7fb5e5637444ae30e423c40/library/core/src/ops/function.rs:227:5
works fine when progress bar is quieted
$ RUST_BACKTRACE=1 ./target/debug/qsv validate invalid.csv --quiet
CSV error: record 1 (line: 2, byte: 16): found record with 3 fields, but the previous record has 2 fields
from qsv.
@mhuang74 thanks for this!
The problem is the progress bar does a count ala qsv count
. And since the file is invalid, it panics since the header doesn't match.
The fix is relatively simple, just set flexible
to true setting up the Config ReaderBuilder.
And while we're on the progress bar, you may want to use the new update_cache_info!
macro.
I'll add comments to the PR for both changes...
from qsv.
thanks @jqnatividad. Above problem resolved. Simple csv check integration tests passing now.
I need to add another test to properly check for output files.
Any suggestions on how to check for output files?
from qsv.
Great!
As for testing the output files, having a sample csv and corresponding, relatively complex jsonschema (the PublicToilets schema on csvlint seems interesting and relatively complex), with expected valid and invalid rows should be good enough.
from qsv.
@jqnatividad Need some help understanding standards around json schema, and whether choice of jsonschema crate is the right approach. Looks like it is working for dependent project cjval validating CityJSON files. But it doesn't work for the Public Toilet schema.
Looks like the Public Toilets json schema conforms to JSON Table Schema rather than the JSONSchema which jsonschema-rs crate assumes.
I attempted to run the public toilet schema through jsonschema-rs for validation, and it did not pick up and apply the constraints.
Specifically, the "keywordLocation" and "instanceLocation" are all blank.
what jsonschema-rs outputs with example People schema from JSON Schema site
[src/cmd/validate.rs:143] &validation_result = Object({
"valid": Bool(
false,
),
"errors": Array([
Object({
"keywordLocation": String(
"/properties/age/type",
),
"instanceLocation": String(
"/age",
),
"absoluteKeywordLocation": String(
"https://example.com/properties/age/type",
),
"error": String(
"\"21\" is not of type \"integer\"",
),
}),
]),
})
snippet what jsonschema-rs outputs with Public Toilet schema
[src/cmd/validate.rs:143] &validation_result = Object({
"valid": Bool(
true,
),
"annotations": Array([
Object({
"keywordLocation": String(
"",
),
"instanceLocation": String(
"",
),
"annotations": Object({
"fields": Array([
Object({
"name": String(
"ExtractDate",
),
"title": String(
"Extract Date",
),
"description": String(
"The date that the data was last extracted from its source database or manually updated",
),
"constraints": Object({
"required": String(
"true",
),
"type": String(
"http://www.w3.org/2001/XMLSchema#string",
),
"pattern": String(
"(^([0-2]\\d|[3][0-1])\\/([0]\\d|[1][0-2])\\/[1-2]\\d{3}((\\s([0-1]\\d|[2][0-3]):[0-5]\\d(:[0-5]\\d)?)?)|)$",
),
}),
}),
from qsv.
@mhuang74 let's just create a test schema & a corresponding CSV we can use.
jsonschema seems to be the most mature library we can leverage so let's stick with it (not to mention the same author wrote the infers-jsonschema crate that we can leverage for the schema
command).
Although JSON Table Schema is a specialized standard for tabular data (read CSVs), there seems to be no other crate we can leverage for the non-trivial validation task. But since JSON Table Schema is a subset of the IETF JSON Schema standard, it should be good enough.
What about creating a sample schema using the sample NYC 311 data that's used in qsv's benchmark script?
It has different data types and opportunities for exercising validation tasks that jsonschema can do; it's a good showcase example; and we can baseline validate
's performance with the benchmark.
from qsv.
Thanks @jqnatividad. They all require hand crafting schema file, so I stuck with public toilet dataset.
Integration test uses real data and illustrates invalid row showing up in output.invalid file.
from qsv.
Public toilet it is! :)
Lemme know when you want me to merge the PR. Convert it to draft in the meantime?
from qsv.
Thanks @jqnatividad. I think it's ready for beta release. What do you think?
from qsv.
Yep! Prepping for 0.3.0 release... :)
Did some minor tweaks though - #148
from qsv.
Thanks for the fixups @jqnatividad . Good catch on progress bar. And just realized that named parameters are supported.
from qsv.
@mhuang74, our MSRV is 1.58.1, so might as well use the latest features.
And I tweaked validate
further just now to create the error-report as a jsonl file - easier to digest for both humans and machines 😏 with the ability to convert it to csv too with the jsonl
command - #149
from qsv.
@mhuang74 and 0.3.0 featuring validate
is released and we can now close this issue!
However, I had to limit cross-compilation for now, so binaries for certain platforms are not published (c54a2f2).
Good news is that I managed to trace the problem to jsonschema, and the fix has been merged (Stranger6667/jsonschema-rs#336), and we'll be able to publish binaries for those platforms again in the next release.
Thanks heaps again for your contributions!!!
Are you up to taking on schema
(#60)? 😉 It need not use stats
and frequency
for the first release, it can just use infers-jsonschema...
from qsv.
Related Issues (20)
- `metadata`: new command to extract and infer metadata from a dataset, given an optional spec file, and create a .metadata file
- `slice`: add a `--json` flag for JSON output HOT 4
- BUG Incorrect delimiter in qsv sniff HOT 7
- qsv validate - valid pass feature request HOT 2
- What would it take to get `foreach` working on Windows? HOT 4
- `stats` & `frequency`: add a `--json` flag for JSON output
- `foreach`: add `--dry-run` option HOT 1
- qsv count HOT 1
- Locating.installing qsv HOT 2
- sql windows functions HOT 7
- add `--no-headers` support to qsv cat rowskey HOT 2
- group by HOT 1
- `frequency`: add `--other` option HOT 5
- `luau`: additional helper functions
- `search`: add preview and JSON options HOT 1
- `search` & `searchset`: when a CSV is indexed, parallelize search
- partition file into files with n rows each (except for last file) HOT 7
- `stats` command writes output file even when `--output` is not set HOT 7
- `stats`: Max precision for float types HOT 3
- sqlp selects wrong data when multiple tables have the same named column HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from qsv.