jqnatividad / qsv
CSVs sliced, diced & analyzed.
License: The Unlicense
Hi @Yomguithereal!
Your fork has several commands that are quite useful that I'd like to pull into qsv.
https://github.com/Yomguithereal/xsv#readme
Would it be OK if I do so?
Potential segfault in `localtime_r` invocations

| Details | |
|---|---|
| Package | chrono |
| Version | 0.4.19 |
| URL | chronotope/chrono#499 |
| Date | 2020-11-10 |
Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.
No workarounds are known.
See advisory page for additional details.
As per BurntSushi/xsv#267 (comment)
failure is officially deprecated/unmaintained
| Details | |
|---|---|
| Status | unmaintained |
| Package | failure |
| Version | 0.1.8 |
| URL | rust-lang-deprecated/failure#347 |
| Date | 2020-05-02 |
The `failure` crate is officially end-of-life: it has been marked as deprecated by the former maintainer, who has announced that there will be no updates or maintenance work on it going forward.
The following are some suggested actively developed alternatives to switch to:
See advisory page for additional details.
Creating a new issue for QSV_DELIMITER implementation, as #47 was too generalized and not discretely actionable.
docopt is unmaintained (https://github.com/docopt/docopt.rs)
Use log4rs (https://crates.io/crates/log4rs) for logging, which is off by default but can be turned on with the QSV_LOGGING environment variable (#47).
Checks a CSV against a jsonschema file.
The jsonschema file can be located on the filesystem or at a URL.
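To make the idea concrete, here is a minimal sketch of what record-level validation could look like. This is an illustration only, not qsv's actual implementation: the `FieldRule` struct and `validate_record` function are hypothetical names, and a real `validate` command would use a full jsonschema library and a proper CSV parser.

```rust
// Hypothetical sketch: check each record of an already-parsed CSV against a
// few JSON-Schema-like constraints (required fields, numeric ranges).
#[derive(Debug)]
struct FieldRule {
    name: &'static str,
    required: bool,
    // For numeric fields: inclusive (min, max) range, if any.
    range: Option<(f64, f64)>,
}

/// Validate one record (already split into fields, in header order).
/// Returns human-readable violations; an empty Vec means the record is valid.
fn validate_record(rules: &[FieldRule], record: &[&str]) -> Vec<String> {
    let mut errors = Vec::new();
    for (rule, value) in rules.iter().zip(record) {
        if rule.required && value.trim().is_empty() {
            errors.push(format!("{}: required field is empty", rule.name));
            continue;
        }
        if let Some((min, max)) = rule.range {
            match value.trim().parse::<f64>() {
                Ok(n) if n >= min && n <= max => {}
                Ok(n) => errors.push(format!(
                    "{}: {} out of range [{}, {}]", rule.name, n, min, max
                )),
                Err(_) => errors.push(format!("{}: not a number: {:?}", rule.name, value)),
            }
        }
    }
    errors
}

fn main() {
    let rules = [
        FieldRule { name: "city", required: true, range: None },
        FieldRule { name: "population", required: true, range: Some((0.0, 2e9)) },
    ];
    assert!(validate_record(&rules, &["Boston", "675647"]).is_empty());
    // An empty required field plus an out-of-range number: two violations.
    let bad = validate_record(&rules, &["", "-5"]);
    assert_eq!(bad.len(), 2);
    println!("violations: {:?}", bad);
}
```

Collecting all violations per record, rather than stopping at the first, is what lets a pipeline report every data-quality issue in one pass.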
Perhaps, with ckanapi CLI in the mix.
Cross reference BurntSushi/xsv#283
We can use `qsv dedup` or the Unix command-line tools `sort` and `uniq` to remove duplicate rows in a plain-text table, but I find myself wanting to do something similar with duplicated columns.
For example, after doing `qsv join ...` there will be at least one pair of duplicated columns (the values used for the join).
I am hoping for something like a column-based version of the row-based `qsv dedup` command (see #26).
I suspect I could work around this via the `qsv transpose` command (see #3).
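A column-wise dedup could be sketched roughly as follows. This is a hypothetical illustration of the requested feature, not existing qsv code: it drops any column whose full contents, top to bottom, duplicate an earlier column, which is exactly what happens to the join keys after a join.

```rust
// Hypothetical sketch: given a table (header row included) as rows of
// strings, keep only the first occurrence of each identical column.
fn dedup_columns(rows: &[Vec<String>]) -> Vec<Vec<String>> {
    let ncols = rows.first().map_or(0, |r| r.len());
    // Keep column c only if no earlier column p is identical in every row.
    let keep: Vec<usize> = (0..ncols)
        .filter(|&c| !(0..c).any(|p| rows.iter().all(|r| r[p] == r[c])))
        .collect();
    rows.iter()
        .map(|r| keep.iter().map(|&c| r[c].clone()).collect())
        .collect()
}

fn main() {
    // Header + one row; column 2 duplicates column 0 (a join key).
    let table: Vec<Vec<String>> = vec![
        vec!["city", "pop", "city"],
        vec!["Boston", "675647", "Boston"],
    ]
    .into_iter()
    .map(|r| r.into_iter().map(String::from).collect())
    .collect();
    let deduped = dedup_columns(&table);
    assert_eq!(deduped[0], vec!["city", "pop"]);
    println!("{:?}", deduped);
}
```

Comparing whole columns (header and values) rather than headers alone avoids dropping columns that merely share a name but carry different data.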
For testing Linux and Mac builds
Originally posted by @tmtmtmtm in #81 (comment)
Note: implementation of each env var will be tracked by a separate issue and checked off here as they are done.
qsv bundles reverse-geocoder - a "lightweight" static, nearest-city geonames geocoder.
But for real, street-level geocoding, we need a configurable geocoder that can use the user's geocoder backend of choice.
For the initial implementation of a heavy-weight geocoder, we'll start in order of implementation:
Other geocoder backends in the backlog:
This geocoder will be its own qsv command - `geocode` - unlike the current lightweight one, which is just one of many `apply` operations.
Leverage this PR - BurntSushi/xsv#176
Is the qsv fork considered to be actively maintained, or is it in a "glacier" state like xsv? For example, there is already an additional feature pull request open against xsv.
Shall new feature requests be submitted to the xsv or the qsv repository?
To normalize data inside the CSV:
`fetch` will allow qsv to fetch HTML or data from web pages or services, to enrich a CSV (e.g. geocoding, wikidata api, etc.).
It will support authentication, concurrent requests, thresholds, etc.
Reminiscent of OpenRefine's fetch url... (https://docs.openrefine.org/manual/columnediting#add-column-by-fetching-urls), but optimized for the command line.
`stats` does a great job of not only computing descriptive statistics about a CSV, but also inferring each column's data type. `frequency` compiles a frequency table.
The `schema` command will use the output of `stats`, and optionally `frequency` (to specify the valid range of a field), to create a JSON Schema file that can be used with the `validate` command (#46) to validate a CSV against the generated schema.
With the combined addition of `schema` and `validate`, qsv can be used in a more bullet-proof automated data pipeline that can fail gracefully when there are data quality issues:
- `schema` to create a JSON Schema from a representative CSV file for a feed
- `validate` at the beginning of a data pipeline and fail gracefully when `validate` fails
- `sample` to validate against a sample
- `partition` the CSV to break down the pipeline into smaller jobs

Perhaps, we can pull the founding date of each country from Wikidata, and join it with the existing CSV files we use for the tour...
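As a rough sketch of the first step, here is one way inferred column types could be turned into a JSON Schema document. The output shape is an assumption for illustration, not necessarily what qsv's `schema` command will emit, and `infer_type`/`build_schema` are hypothetical names; real type inference would come from `stats`.

```rust
// Hypothetical sketch: infer the simplest possible type per column
// (integer if every sample parses as i64, else string) and emit a
// minimal JSON Schema object for the CSV.
fn infer_type(samples: &[&str]) -> &'static str {
    if samples.iter().all(|s| s.trim().parse::<i64>().is_ok()) {
        "integer"
    } else {
        "string"
    }
}

/// Build a minimal JSON Schema document from headers and sample rows.
fn build_schema(headers: &[&str], rows: &[Vec<&str>]) -> String {
    let props: Vec<String> = headers
        .iter()
        .enumerate()
        .map(|(i, h)| {
            let col: Vec<&str> = rows.iter().map(|r| r[i]).collect();
            format!("\"{}\": {{\"type\": \"{}\"}}", h, infer_type(&col))
        })
        .collect();
    format!("{{\"type\": \"object\", \"properties\": {{{}}}}}", props.join(", "))
}

fn main() {
    let headers = ["city", "population"];
    let rows = vec![vec!["Boston", "675647"], vec!["Austin", "961855"]];
    let schema = build_schema(&headers, &rows);
    assert!(schema.contains("\"population\": {\"type\": \"integer\"}"));
    println!("{}", schema);
}
```

A real implementation would use a JSON library rather than string formatting, and would fold in `frequency` output as `enum`/`minimum`/`maximum` constraints.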
but do so by leveraging the underlying csv writer
As it makes building qsv more difficult, with its various version and platform-architecture dependencies.
Lua should be more than good enough, as it has no external dependencies since it's meant to be embeddable, and you can even call Lua scripts.
Closes #55.
The `py` command requires Python to be installed to compile successfully.
Add some checks in `build.rs` to check if Python is installed before even attempting to build.
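One possible shape for that check is sketched below. This is an assumption about how it could look, not qsv's actual build script: probe for a Python interpreter on PATH and fail early with a clear message instead of a cryptic linker error later.

```rust
// Hypothetical build.rs sketch: detect a Python interpreter before
// compiling a Python-dependent feature.
use std::process::Command;

/// Returns true if any of the candidate interpreters runs successfully.
fn python_available() -> bool {
    ["python3", "python"].iter().any(|exe| {
        Command::new(exe)
            .arg("--version")
            .output()
            .map(|out| out.status.success())
            .unwrap_or(false)
    })
}

fn main() {
    if !python_available() {
        // In a real build.rs this would abort the build, e.g.:
        // panic!("the `py` feature requires a Python interpreter on PATH");
        eprintln!("warning: no Python interpreter found on PATH");
    }
}
```

Checking multiple interpreter names covers platforms where only `python3` (or only `python`) is installed.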
The current benchmark script has been improved, but it still lacks the rigor of a proper benchmark.
Investigate using hyperfine, and perhaps, we can even automate the benchmarks as part of the release process.
I haven't been following semver rigorously, as new features have been introduced in patch releases in the 0.16.x series.
After 0.16.4, I will stick to semver.
Starting from v0.15.0, before publishing 0.15.1
highlighting the new/modified qsv commands in action...
Instead of Travis and Appveyor, just use GitHub Actions.
Closes #18.
The existing shell-based script is OK, but it'd be better if we use Criterion (https://github.com/bheisler/criterion.rs), so we can use tools like critcmp (https://github.com/BurntSushi/critcmp) to compare benchmark runs.
https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md
Potential segfault in the time crate

| Details | |
|---|---|
| Package | time |
| Version | 0.1.43 |
| URL | time-rs/time#293 |
| Date | 2020-11-18 |
| Patched versions | >=0.2.23 |
| Unaffected versions | =0.2.0,=0.2.1,=0.2.2,=0.2.3,=0.2.4,=0.2.5,=0.2.6 |
Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.
The affected functions from time 0.2.7 through 0.2.22 are:
- `time::UtcOffset::local_offset_at`
- `time::UtcOffset::try_local_offset_at`
- `time::UtcOffset::current_local_offset`
- `time::UtcOffset::try_current_local_offset`
- `time::OffsetDateTime::now_local`
- `time::OffsetDateTime::try_now_local`

The affected functions in time 0.1 (all versions) are:
- `at`
- `at_utc`

Non-Unix targets (including Windows and wasm) are unaffected.
Pending a proper fix, the internal method that determines the local offset has been modified to always return `None` on the affected operating systems. This has the effect of returning an `Err` on the `try_*` methods and UTC on the non-`try_*` methods.
Users and library authors with time in their dependency tree should perform `cargo update`, which will pull in the updated, unaffected code.
Users of time 0.1 do not have a patch and should upgrade to an unaffected version: time 0.2.23 or greater, or the 0.3 series.
No workarounds are known.
See advisory page for additional details.
With the release of 0.16.1, all the major pending pull requests from xsv have been merged into qsv.
IMHO, we can now apply rustfmt to the whole project and standardize the code to rustfmt standards.
This will put us in a better position to directly accept PRs.
I currently use `xsv` installed from conda via the conda-forge community package collection:
https://anaconda.org/conda-forge/xsv
https://github.com/conda-forge/xsv-feedstock/tree/master/recipe
I would like to do the same for `qsv`, since I increasingly find myself wanting to use functionality only available in this more up-to-date fork (thank you!).
I am not familiar with Rust, but it ought to be straightforward to package `qsv` with an almost cut-and-paste copy of that recipe, so I am willing to attempt this, having previously contributed recipes for other packages to conda-forge.
Currently, the `sort` command loads the entire CSV file into memory, which will not work for extremely large files.
External sort is a way around this.
Already, there are several external sort crates that we can leverage:
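For reference, the technique those crates implement can be sketched with the standard library alone. This is illustrative only (a real implementation would use one of the crates above plus the csv crate): rows are read in memory-bounded chunks, each chunk is sorted and spilled to a temp file as a sorted "run", and the runs are then combined with a k-way merge.

```rust
// Hypothetical sketch of an external merge sort over lines.
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn external_sort(
    lines: impl Iterator<Item = String>,
    chunk_size: usize,
) -> std::io::Result<Vec<String>> {
    // 1. Spill sorted chunks ("runs") to temp files.
    let dir = std::env::temp_dir();
    let mut runs: Vec<std::path::PathBuf> = Vec::new();
    let mut chunk = Vec::with_capacity(chunk_size);
    let mut spill = |chunk: &mut Vec<String>,
                     runs: &mut Vec<std::path::PathBuf>|
     -> std::io::Result<()> {
        chunk.sort();
        let path = dir.join(format!("ext_run_{}_{}.txt", std::process::id(), runs.len()));
        let mut w = BufWriter::new(File::create(&path)?);
        for line in chunk.drain(..) {
            writeln!(w, "{}", line)?;
        }
        w.flush()?;
        runs.push(path);
        Ok(())
    };
    for line in lines {
        chunk.push(line);
        if chunk.len() == chunk_size {
            spill(&mut chunk, &mut runs)?;
        }
    }
    if !chunk.is_empty() {
        spill(&mut chunk, &mut runs)?;
    }
    // 2. K-way merge of the runs via a min-heap of (line, run index).
    let mut readers: Vec<_> = runs
        .iter()
        .map(|p| File::open(p).map(|f| BufReader::new(f).lines()))
        .collect::<Result<_, _>>()?;
    let mut heap = BinaryHeap::new();
    for (i, r) in readers.iter_mut().enumerate() {
        if let Some(line) = r.next() {
            heap.push(Reverse((line?, i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((line, i))) = heap.pop() {
        out.push(line);
        if let Some(next) = readers[i].next() {
            heap.push(Reverse((next?, i)));
        }
    }
    for p in &runs {
        let _ = std::fs::remove_file(p);
    }
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let data = vec!["pear", "apple", "fig", "date", "cherry"]
        .into_iter()
        .map(String::from);
    let sorted = external_sort(data, 2)?;
    assert_eq!(sorted, vec!["apple", "cherry", "date", "fig", "pear"]);
    println!("{:?}", sorted);
    Ok(())
}
```

Only `chunk_size` rows plus one line per run are ever resident in memory, which is what makes this work on files far larger than RAM.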