jqnatividad / qsv
CSVs sliced, diced & analyzed.
License: The Unlicense
Hi @Yomguithereal!
Your fork has several commands that are quite useful that I'd like to pull into qsv.
https://github.com/Yomguithereal/xsv#readme
Would it be OK if I do so?
Potential segfault in `localtime_r` invocations

| Details | |
|---|---|
| Package | chrono |
| Version | 0.4.19 |
| URL | chronotope/chrono#499 |
| Date | 2020-11-10 |
Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.
No workarounds are known.
See advisory page for additional details.
As per BurntSushi/xsv#267 (comment)
failure is officially deprecated/unmaintained
| Details | |
|---|---|
| Status | unmaintained |
| Package | failure |
| Version | 0.1.8 |
| URL | rust-lang-deprecated/failure#347 |
| Date | 2020-05-02 |
The `failure` crate is officially end-of-life: it has been marked as deprecated by the former maintainer, who has announced that there will be no updates or maintenance work on it going forward.
The following are some suggested actively developed alternatives to switch to:
See advisory page for additional details.
Creating a new issue for QSV_DELIMITER implementation, as #47 was too generalized and not discretely actionable.
docopt is unmaintained (https://github.com/docopt/docopt.rs)
Use log4rs (https://crates.io/crates/log4rs) for logging, which is off by default but can be turned on with the QSV_LOGGING environment variable (#47).
Checks a CSV against a jsonschema file.
The jsonschema file can be located on the filesystem or at a URL.
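To make the idea concrete, here is a minimal sketch of what record-level validation could look like. This is an illustration only, not qsv's actual implementation: the `FieldRule` struct and `validate_record` function are hypothetical names, and a real `validate` command would use a full jsonschema library and a proper CSV parser.

```rust
// Hypothetical sketch: check each record of an already-parsed CSV against a
// few JSON-Schema-like constraints (required fields, numeric ranges).
#[derive(Debug)]
struct FieldRule {
    name: &'static str,
    required: bool,
    // For numeric fields: inclusive (min, max) range, if any.
    range: Option<(f64, f64)>,
}

/// Validate one record (already split into fields, in header order).
/// Returns human-readable violations; an empty Vec means the record is valid.
fn validate_record(rules: &[FieldRule], record: &[&str]) -> Vec<String> {
    let mut errors = Vec::new();
    for (rule, value) in rules.iter().zip(record) {
        if rule.required && value.trim().is_empty() {
            errors.push(format!("{}: required field is empty", rule.name));
            continue;
        }
        if let Some((min, max)) = rule.range {
            match value.trim().parse::<f64>() {
                Ok(n) if n >= min && n <= max => {}
                Ok(n) => errors.push(format!(
                    "{}: {} out of range [{}, {}]", rule.name, n, min, max
                )),
                Err(_) => errors.push(format!("{}: not a number: {:?}", rule.name, value)),
            }
        }
    }
    errors
}

fn main() {
    let rules = [
        FieldRule { name: "city", required: true, range: None },
        FieldRule { name: "population", required: true, range: Some((0.0, 2e9)) },
    ];
    assert!(validate_record(&rules, &["Boston", "675647"]).is_empty());
    // An empty required field plus an out-of-range number: two violations.
    let bad = validate_record(&rules, &["", "-5"]);
    assert_eq!(bad.len(), 2);
    println!("violations: {:?}", bad);
}
```

Collecting all violations per record, rather than stopping at the first, is what lets a pipeline report every data-quality issue in one pass.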
Perhaps, with ckanapi CLI in the mix.
Cross reference BurntSushi/xsv#283
We can use `qsv dedup` or the Unix command-line tools `sort` and `uniq` to remove duplicate rows in a plain-text table, but I find myself wanting to do something similar with duplicated columns.
For example, after doing `qsv join ...` there will be at least one pair of duplicated columns (the values used for the join).
I am hoping for something like a column-based version of the row-based `qsv dedup` command (see #26).
I suspect I could work around this via the `qsv transpose` command (see #3).
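A column-wise dedup could be sketched roughly as follows. This is a hypothetical illustration of the requested feature, not existing qsv code: it drops any column whose full contents, top to bottom, duplicate an earlier column, which is exactly what happens to the join keys after a join.

```rust
// Hypothetical sketch: given a table (header row included) as rows of
// strings, keep only the first occurrence of each identical column.
fn dedup_columns(rows: &[Vec<String>]) -> Vec<Vec<String>> {
    let ncols = rows.first().map_or(0, |r| r.len());
    // Keep column c only if no earlier column p is identical in every row.
    let keep: Vec<usize> = (0..ncols)
        .filter(|&c| !(0..c).any(|p| rows.iter().all(|r| r[p] == r[c])))
        .collect();
    rows.iter()
        .map(|r| keep.iter().map(|&c| r[c].clone()).collect())
        .collect()
}

fn main() {
    // Header + one row; column 2 duplicates column 0 (a join key).
    let table: Vec<Vec<String>> = vec![
        vec!["city", "pop", "city"],
        vec!["Boston", "675647", "Boston"],
    ]
    .into_iter()
    .map(|r| r.into_iter().map(String::from).collect())
    .collect();
    let deduped = dedup_columns(&table);
    assert_eq!(deduped[0], vec!["city", "pop"]);
    println!("{:?}", deduped);
}
```

Comparing whole columns (header and values) rather than headers alone avoids dropping columns that merely share a name but carry different data.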
For testing Linux and Mac builds
Originally posted by @tmtmtmtm in #81 (comment)
Note: implementation of each env var will be tracked by a separate issue and checked off here as they are done.
qsv bundles reverse-geocoder - a "lightweight" static, nearest-city geonames geocoder.
But for real, street-level geocoding, we need a configurable geocoder that can use the user's geocoder backend of choice.
For the initial implementation of a heavy-weight geocoder, we'll start in order of implementation:
Other geocoder backends in the backlog:
This geocoder will be its own qsv command - `geocode` - unlike the current lightweight one, which is just one of many `apply` operations.
Leverage this PR - BurntSushi/xsv#176
Is the qsv fork considered to be actively maintained, or is it in a "glacier" state like xsv? For example, there is already an additional feature pull request open against xsv.
Shall new feature requests be submitted to the xsv or the qsv repository?
To normalize data inside the CSV:
`fetch` will allow qsv to fetch HTML or data from web pages or services, to enrich a CSV (e.g. geocoding, wikidata api, etc.).
It will support authentication, concurrent requests, thresholds, etc.
Reminiscent of OpenRefine's fetch url... (https://docs.openrefine.org/manual/columnediting#add-column-by-fetching-urls), but optimized for the command line.
`stats` does a great job of not only computing descriptive statistics about a CSV, but also inferring each column's data type. `frequency` compiles a frequency table.
The `schema` command will use the output of `stats`, and optionally `frequency` (to specify the valid range of a field), to create a JSON Schema file that can be used with the `validate` command (#46) to validate a CSV against the generated schema.
With the combined addition of `schema` and `validate`, qsv can be used in a more bullet-proof automated data pipeline that can fail gracefully when there are data quality issues:
- `schema` to create a JSON Schema from a representative CSV file for a feed
- `validate` at the beginning of a data pipeline and fail gracefully when `validate` fails
- `sample` to validate against a sample
- `partition` the CSV to break down the pipeline into smaller jobs

Perhaps, we can pull the founding date of each country from Wikidata, and join it with the existing CSV files we use for the tour...
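As a rough sketch of the first step, here is one way inferred column types could be turned into a JSON Schema document. The output shape is an assumption for illustration, not necessarily what qsv's `schema` command will emit, and `infer_type`/`build_schema` are hypothetical names; real type inference would come from `stats`.

```rust
// Hypothetical sketch: infer the simplest possible type per column
// (integer if every sample parses as i64, else string) and emit a
// minimal JSON Schema object for the CSV.
fn infer_type(samples: &[&str]) -> &'static str {
    if samples.iter().all(|s| s.trim().parse::<i64>().is_ok()) {
        "integer"
    } else {
        "string"
    }
}

/// Build a minimal JSON Schema document from headers and sample rows.
fn build_schema(headers: &[&str], rows: &[Vec<&str>]) -> String {
    let props: Vec<String> = headers
        .iter()
        .enumerate()
        .map(|(i, h)| {
            let col: Vec<&str> = rows.iter().map(|r| r[i]).collect();
            format!("\"{}\": {{\"type\": \"{}\"}}", h, infer_type(&col))
        })
        .collect();
    format!("{{\"type\": \"object\", \"properties\": {{{}}}}}", props.join(", "))
}

fn main() {
    let headers = ["city", "population"];
    let rows = vec![vec!["Boston", "675647"], vec!["Austin", "961855"]];
    let schema = build_schema(&headers, &rows);
    assert!(schema.contains("\"population\": {\"type\": \"integer\"}"));
    println!("{}", schema);
}
```

A real implementation would use a JSON library rather than string formatting, and would fold in `frequency` output as `enum`/`minimum`/`maximum` constraints.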
but do so by leveraging the underlying csv writer
As it makes building qsv more difficult, with its various version and platform-architecture dependencies.
Lua should be more than good enough, as it has no external dependencies since it's meant to be embeddable, and you can even call Lua scripts.
Closes #55.
The `py` command requires Python to be installed to compile successfully.
Add some checks in `build.rs` to check if Python is installed before even attempting to build.
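One possible shape for that check is sketched below. This is an assumption about how it could look, not qsv's actual build script: probe for a Python interpreter on PATH and fail early with a clear message instead of a cryptic linker error later.

```rust
// Hypothetical build.rs sketch: detect a Python interpreter before
// compiling a Python-dependent feature.
use std::process::Command;

/// Returns true if any of the candidate interpreters runs successfully.
fn python_available() -> bool {
    ["python3", "python"].iter().any(|exe| {
        Command::new(exe)
            .arg("--version")
            .output()
            .map(|out| out.status.success())
            .unwrap_or(false)
    })
}

fn main() {
    if !python_available() {
        // In a real build.rs this would abort the build, e.g.:
        // panic!("the `py` feature requires a Python interpreter on PATH");
        eprintln!("warning: no Python interpreter found on PATH");
    }
}
```

Checking multiple interpreter names covers platforms where only `python3` (or only `python`) is installed.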
The current benchmark script has been improved, but it still lacks the rigor of a proper benchmark.
Investigate using hyperfine, and perhaps, we can even automate the benchmarks as part of the release process.
I haven't been following semver rigorously, as new features have been introduced in patch releases in the 0.16.x series.
After 0.16.4, I will stick to semver.
Starting from v0.15.0, before publishing 0.15.1
highlighting the new/modified qsv commands in action...
Instead of Travis and Appveyor, just use GitHub Actions.
Closes #18.
The existing shell-based script is OK, but it'd be better if we use Criterion (https://github.com/bheisler/criterion.rs), so we can use tools like critcmp (https://github.com/BurntSushi/critcmp) to compare benchmark runs.
https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md
Potential segfault in the time crate

| Details | |
|---|---|
| Package | time |
| Version | 0.1.43 |
| URL | time-rs/time#293 |
| Date | 2020-11-18 |
| Patched versions | >=0.2.23 |
| Unaffected versions | =0.2.0,=0.2.1,=0.2.2,=0.2.3,=0.2.4,=0.2.5,=0.2.6 |
Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.
The affected functions from time 0.2.7 through 0.2.22 are:
- `time::UtcOffset::local_offset_at`
- `time::UtcOffset::try_local_offset_at`
- `time::UtcOffset::current_local_offset`
- `time::UtcOffset::try_current_local_offset`
- `time::OffsetDateTime::now_local`
- `time::OffsetDateTime::try_now_local`

The affected functions in time 0.1 (all versions) are:
- `at`
- `at_utc`

Non-Unix targets (including Windows and wasm) are unaffected.
Pending a proper fix, the internal method that determines the local offset has been modified to always return `None` on the affected operating systems. This has the effect of returning an `Err` on the `try_*` methods and UTC on the non-`try_*` methods.
Users and library authors with time in their dependency tree should perform `cargo update`, which will pull in the updated, unaffected code.
Users of time 0.1 do not have a patch and should upgrade to an unaffected version: time 0.2.23 or greater, or the 0.3 series.
No workarounds are known.
See advisory page for additional details.
With the release of 0.16.1, all the major pending pull requests from xsv have been merged into qsv.
IMHO, we can now apply rustfmt to the whole project and standardize the code to rustfmt standards.
This will put us in a better position to directly accept PRs.
I currently use `xsv` installed from conda via the conda-forge community package collection:
https://anaconda.org/conda-forge/xsv
https://github.com/conda-forge/xsv-feedstock/tree/master/recipe
I would like to do the same for `qsv`, since I increasingly find myself wanting to use functionality only available in this more up-to-date fork (thank you!).
I am not familiar with Rust, but it ought to be straightforward to package `qsv` with an almost cut-and-paste copy of that recipe, so I am willing to attempt this, having previously contributed recipes for other packages to conda-forge.
Currently, the `sort` command loads the entire CSV file into memory, which will not work for extremely large files.
External sort is a way around this.
Already, there are several external sort crates that we can leverage:
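For reference, the technique those crates implement can be sketched with the standard library alone. This is illustrative only (a real implementation would use one of the crates above plus the csv crate): rows are read in memory-bounded chunks, each chunk is sorted and spilled to a temp file as a sorted "run", and the runs are then combined with a k-way merge.

```rust
// Hypothetical sketch of an external merge sort over lines.
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn external_sort(
    lines: impl Iterator<Item = String>,
    chunk_size: usize,
) -> std::io::Result<Vec<String>> {
    // 1. Spill sorted chunks ("runs") to temp files.
    let dir = std::env::temp_dir();
    let mut runs: Vec<std::path::PathBuf> = Vec::new();
    let mut chunk = Vec::with_capacity(chunk_size);
    let mut spill = |chunk: &mut Vec<String>,
                     runs: &mut Vec<std::path::PathBuf>|
     -> std::io::Result<()> {
        chunk.sort();
        let path = dir.join(format!("ext_run_{}_{}.txt", std::process::id(), runs.len()));
        let mut w = BufWriter::new(File::create(&path)?);
        for line in chunk.drain(..) {
            writeln!(w, "{}", line)?;
        }
        w.flush()?;
        runs.push(path);
        Ok(())
    };
    for line in lines {
        chunk.push(line);
        if chunk.len() == chunk_size {
            spill(&mut chunk, &mut runs)?;
        }
    }
    if !chunk.is_empty() {
        spill(&mut chunk, &mut runs)?;
    }
    // 2. K-way merge of the runs via a min-heap of (line, run index).
    let mut readers: Vec<_> = runs
        .iter()
        .map(|p| File::open(p).map(|f| BufReader::new(f).lines()))
        .collect::<Result<_, _>>()?;
    let mut heap = BinaryHeap::new();
    for (i, r) in readers.iter_mut().enumerate() {
        if let Some(line) = r.next() {
            heap.push(Reverse((line?, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((line, i))) = heap.pop() {
        out.push(line);
        if let Some(next) = readers[i].next() {
            heap.push(Reverse((next?, i)));
        }
    }
    for p in &runs {
        let _ = std::fs::remove_file(p);
    }
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let data = vec!["pear", "apple", "fig", "date", "cherry"]
        .into_iter()
        .map(String::from);
    let sorted = external_sort(data, 2)?;
    assert_eq!(sorted, vec!["apple", "cherry", "date", "fig", "pear"]);
    println!("{:?}", sorted);
    Ok(())
}
```

Only `chunk_size` rows plus one line per run are ever resident in memory, which is what makes this work on files far larger than RAM.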