data-liberation-front / csvlint.rb

The gem behind http://csvlint.io

License: MIT License

Ruby 86.81% Gherkin 13.07% Dockerfile 0.12%

csvlint.rb's Issues

Use Traveling Ruby to package as a binary

https://github.com/phusion/traveling-ruby

This would mean CSVlint could be installed via a package manager (for example) and would run like any other *nix application.

Get total rows number about the CSV file that was validated

Hi,

Similar to the example below, I would love to be able to find out how many rows my CSV has without re-parsing the whole file with Ruby's CSV library.

# get some information about the CSV file that was validated
validator.encoding
validator.content_type
validator.extension
validator.rows   # proposed: total number of rows

Regards,

The RAW project has a validator running, if someone wants to inspect it for "inspiration":

https://github.com/densitydesign/raw/

Reported error positions are not massively useful

Presumably these are byte positions? It would be useful to have a row/col if possible.

When an encoding error is thrown the line content is put into the column field in the error object

I've made a fix for this bug. Please review & merge. Thanks!
#130

Write Schema & Field rspec tests

Currently schema_validation.feature and schema_spec.rb do not test the full range of errors and warnings listed in this gem's README.
There are some well-specified features in csvlint.rb; these will be incorporated into specs for the gem.

Use explicit CSV parsing options

Currently we're using the default options of read_line and the CSV parser. It is probably best to be explicit about those defaults.

Internally we should probably use CSV DDF, ready to accept external schemas.

Need to:

specify a delimiter to read_line (this is what CSV DDF calls lineterminator)
add an opts hash to CSV.parse_line

See https://github.com/theodi/datapackage.rb/blob/master/lib/datapackage/validator.rb#L203 for example of mapping CSV DDF to Ruby CSV options.
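As a rough illustration of the two steps above, here is a minimal sketch of mapping a CSV DDF-style dialect onto explicit Ruby CSV options. The helper name `dialect_to_csv_options` and the `DEFAULT_DIALECT` constant are hypothetical, and the default values are assumptions rather than what csvlint.rb actually uses.

```ruby
require "csv"

# Hypothetical defaults, keyed by CSV DDF dialect names (assumed values).
DEFAULT_DIALECT = {
  "delimiter"      => ",",
  "lineTerminator" => "\r\n",
  "quoteChar"      => '"'
}.freeze

# Translate a CSV DDF-style dialect hash into options for Ruby's CSV parser.
def dialect_to_csv_options(dialect = {})
  merged = DEFAULT_DIALECT.merge(dialect)
  {
    col_sep:    merged["delimiter"],
    row_sep:    merged["lineTerminator"],
    quote_char: merged["quoteChar"]
  }
end

opts = dialect_to_csv_options({ "delimiter" => ";" })
row  = CSV.parse_line(%(a;b;"c;d"\r\n), **opts)
```

Passing the options explicitly means the parser's behaviour no longer changes silently if the library defaults ever differ.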

Make sure we don't check schema column count and ragged row count together

The schema col count should be used if a schema is available, and the current simple ragged row check should not be performed.

Inconsistent values due to number format differences

This generates an inconsistent values warning:

https://raw.github.com/datasets/cpi-gb/master/data/cpi-uk-annual.csv

It's because some numbers include decimal points and some don't. The former get matched as alphanumeric (because of the full stop), the latter as numeric.
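One possible fix, sketched here under the assumption that type detection is regex based: let the numeric pattern accept an optional decimal part. `NUMERIC_PATTERN` and `value_type` are illustrative names, not csvlint.rb's own.

```ruby
# Assumed pattern: optional sign, digits, optional decimal part.
NUMERIC_PATTERN = /\A[-+]?\d+(\.\d+)?\z/

# Classify a single cell value as numeric or alphanumeric.
def value_type(value)
  value.match?(NUMERIC_PATTERN) ? :numeric : :alphanumeric
end
```

With this pattern, "12" and "12.5" land in the same bucket, so a column mixing the two would no longer look inconsistent.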

Eliminate some date and time formats (for speed)

Several date and time formats were blindly imported from Rails' Date/Time formats, but I'd expect several of those are never in a CSV (e.g. the 23 digit timestamp with nanoseconds). Eliminating the unused date formats will speed up #103 even further.

Can't pipe data to csvlint

It isn't currently possible to pipe data into csvlint on the command line. As a result, it can't be used in a workflow where an earlier process in the chain generates the CSV. For example: proc_that_generates_csv | csvlint

It should probably default to reading from stdin if no arguments are supplied, and have a separate 'usage' trigger (--help, -h).
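A minimal sketch of that behaviour, assuming a hypothetical helper `choose_input` (the real command-line script may be organised quite differently):

```ruby
require "stringio"

# Decide where to read CSV from: usage on --help/-h, stdin when no
# arguments are given, otherwise the first argument as a file path.
def choose_input(argv, stdin: $stdin)
  return :usage if (argv & ["--help", "-h"]).any?
  return stdin if argv.empty?

  File.open(argv.first)
end

# Simulate piped input with a StringIO standing in for $stdin.
piped = choose_input([], stdin: StringIO.new("a,b\n1,2\n"))
first_line = piped.gets
```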

Enumerations

Does csvlint support enumerations? I did not see an explicit enum type in the documentation. This could probably be simulated with a regex constraint but it would be nice to have native support for enumerable fields.
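For the regex workaround mentioned above, a small sketch of building a pattern from a list of allowed values. `enum_pattern` is a hypothetical helper, and it assumes the pattern constraint is evaluated as a Ruby regular expression, which the documentation does not confirm.

```ruby
# Build an anchored alternation from the allowed values, escaping any
# characters that would otherwise be treated as regex syntax.
def enum_pattern(values)
  alternatives = values.map { |v| Regexp.escape(v) }.join("|")
  "^(#{alternatives})$"
end

pattern = enum_pattern(%w[red green blue])
```

The resulting string could then be placed in a field's pattern constraint.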

Blank values shouldn't count as inconsistencies

Looking at this one:

http://csvlint.io/validation/5331acb86373767901030000

I think that the "inconsistent values" that are being highlighted as warnings are because some of the values are blank. Blank values shouldn't count as inconsistencies (ie only look at the non-blank values when working out what type of value the column contains).
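A sketch of the suggested behaviour: drop blank values before tallying the types seen in a column. The `column_types` helper and its two-way numeric/alphanumeric classification are simplifying assumptions for illustration.

```ruby
# Tally value types for one column, ignoring nil and whitespace-only cells.
def column_types(values)
  values
    .reject { |v| v.nil? || v.strip.empty? }
    .map { |v| v.match?(/\A[-+]?\d+(\.\d+)?\z/) ? :numeric : :alphanumeric }
    .tally
end
```

A column is then consistent when the resulting hash has a single key, regardless of how many cells were blank.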

Recover from some line ending problems

Currently schema validation for a line is by-passed if we get a :line_ending error triggered by a CSV::MalformedCSVError.

Ideally, if we know it is a line ending issue, we should log the error and then attempt to re-process the line with an alternative line ending, so that we get schema validation too.

xsd:int or xsd:integer?

In XML, it's a lot more usual to use xsd:integer than xsd:int, but we're using xsd:int rather than xsd:integer. Is this deliberate and if so why?

Validating a csv file with more headers than specified in the schema results in stack trace

I am supplying a pull request for a feature to fix this bug.

Allow validator to take in a CSV file encoding on initialization for validating a local csv file that isn't utf-8 encoded.

Currently, validating local CSV files that aren't encoded as UTF-8 always fails with an encoding error.
The validator currently sets the encoding of a file using io.charset, which doesn't exist for files that are read in locally.

Suggested implementation:
Use options={} parameter to take in a file encoding and merge it into @csv_options

@pezholio I'd like to implement this feature as I think it'll be helpful. I will also need it for my own use. Let me know if this makes sense in terms of the overall design of the gem.

Optimization: Stream CSV

Running CSVLint on large remote CSVs is slow, because it needs to download the file, then validate. It would be possible in Ruby to stream the response body, in which case it can be validated as it downloads, saving time and memory. Also, if :limit_lines is set, CSVLint can stop the download once those lines have been read.
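The core of this idea is turning arbitrary body chunks into complete lines and stopping early. A minimal sketch, assuming chunks arrive from something like `Net::HTTPResponse#read_body` called with a block; here they are just an array so the example is self-contained. `each_line_from_chunks` is a hypothetical name.

```ruby
# Yield complete lines as chunks arrive; stop once `limit` lines are seen.
def each_line_from_chunks(chunks, limit: nil)
  return enum_for(__method__, chunks, limit: limit) unless block_given?

  buffer = +""
  count = 0
  chunks.each do |chunk|
    buffer << chunk
    # Emit every complete line currently sitting in the buffer.
    while (idx = buffer.index("\n"))
      yield buffer.slice!(0..idx)
      count += 1
      return if limit && count >= limit
    end
  end
  # Emit a final line that has no trailing newline.
  yield buffer unless buffer.empty?
end

lines = each_line_from_chunks(["a,b\nc", ",d\ne,f\n"], limit: 2).to_a
```

Note that splitting on "\n" alone would break quoted fields containing newlines, so a real implementation would need to hand lines to the parser more carefully.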

Expose optional JSON table schema fields

JSON Table Schema defines the following optional fields:

type
title
description
format

If Csvlint::Schema could expose those, I can show them in the UI.

undefined method `[]' for nil:NilClass from fetch_error

The fetch_error method can trigger the above if it can't map the CSV error message to a known error type. The code should default to :unknown, but it isn't doing that; we just get an error.

I've not tracked down the specific errors that aren't being handled; we probably need to check over the Ruby source.
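The intended fallback could look something like the following sketch. The `ERROR_TYPES` table, its entries, and `error_type_for` are hypothetical and do not reproduce the gem's actual mapping.

```ruby
# Hypothetical table mapping parser message fragments to error types.
ERROR_TYPES = {
  /unclosed quoted field/i => :unclosed_quote,
  /illegal quoting/i       => :stray_quote
}.freeze

# Return the mapped error type, or :unknown when nothing matches,
# instead of indexing into a nil lookup result.
def error_type_for(message)
  match = ERROR_TYPES.find { |pattern, _type| pattern.match?(message) }
  match ? match.last : :unknown
end
```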

Allow CSV parsing options to be configured as a parameter

Extend #5 to allow user to provide defaults. E.g. by providing a Ruby hash that contains a dialect key that conforms to CSV DDF.

(This style would let datapackage.rb delegate CSV validation to csvlint.rb)

BUG: :title_row warns if the first row has fewer fields than the average row?

That's the behavior. It seems like unusual behavior to me. Shouldn't it check if any row has more fields than the first row?

Get gem continuously deploying

As per #92, we've published this on RubyGems now, but still need to get this continuously deploying through Travis.

Improve feedback on inconsistent values

E.g:

http://csvlint.io/validate?url=http%3A%2F%2Fdata.defra.gov.uk%2Fenv%2Faqfg14-13but-201105.csv

Error is reported multiple times, but there's no indication of the column or row that triggered the warning.

BUG: Improperly implements skipInitialSpace

specifies how to interpret whitespace which immediately follows a delimiter; if False, it means that whitespace immediately after a delimiter should be treated as part of the following field.

By this definition, the following CSV:

foo,bar, baz,   bzz

should parse as follows if skipInitialSpace is true:

["foo", "bar", "baz", "bzz"]

What csvlint's code does is set the delimiter to ", " (the original delimiter, in this case ",", followed by a space). That parses the CSV as:

["foo,bar", "baz", "  bzz"]

Which is clearly incorrect.
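One way to get the expected result, sketched as an assumption rather than the fix that was adopted: parse with the unmodified delimiter, then strip leading whitespace from each field. `parse_skipping_initial_space` is a hypothetical helper.

```ruby
require "csv"

# Parse with the plain delimiter, then drop whitespace that immediately
# follows a delimiter (i.e. leading whitespace in each field).
def parse_skipping_initial_space(line, delimiter: ",")
  fields = CSV.parse_line(line, col_sep: delimiter) || []
  fields.map { |field| field&.lstrip }
end

fields = parse_skipping_initial_space("foo,bar, baz,   bzz")
```

This simple version also strips leading whitespace inside quoted fields, and Ruby's parser may still reject a space before an opening quote, so it only covers the unquoted case described above.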

Improving encoding detection

Ideally we want to detect what encoding a file is actually using, as this might differ from what is advertised.

See features in theodi/shared#120

For example, take the NY honours spreadsheet:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv/preview

This is delivered with a Content-Type of text/csv, with no charset parameter. US-ASCII might be assumed to be the default, although I think UTF-8 is increasingly common and might be a reasonable default assumption.

However when trying to open the file, it apparently has invalid characters.

E.g.:

$ curl -v https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv >/tmp/test.csv
$ file -bi /tmp/test.csv
text/plain; charset=unknown-8bit

Using charlock_holmes we get a bit more information:

$ irb
> require "charlock_holmes"
> contents = File.read("/tmp/test.csv")
> detection = CharlockHolmes::EncodingDetector.detect(contents)
 => {:type=>:text, :encoding=>"ISO-8859-1", :confidence=>61, :language=>"en"}

Confidence level is relatively low, but guessed encoding seems reasonable.
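Where charlock_holmes is not available, a much cruder check is possible with the standard library alone: try candidate encodings in order and keep the first one in which the bytes are valid. This is a sketch, not a replacement for statistical detection; `guess_encoding` and the candidate list are assumptions.

```ruby
# Return the first candidate encoding in which the bytes are valid.
# ISO-8859-1 accepts any byte sequence, so it acts as a last resort.
def guess_encoding(bytes, candidates = [Encoding::UTF_8, Encoding::ISO_8859_1])
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end
```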

`format` not validating correctly

If my schema has, for example:

...
{
  "name": "Date",
  "type": "date",
  "description": "",
  "constraints": {
    "format": "yyyy-mm-dd",
    "required": true
  }
}
...

And I validate against a CSV that has an incorrect date pattern e.g. "23/09/1980", then I don't get any errors.

Also, http://dataprotocols.org/json-table-schema/#field-types says that every field that has type date should match ISO8601, unless format is specified.
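For reference, a strict check of this kind can be sketched with Ruby's `Date.strptime`. The translation of the schema's "yyyy-mm-dd" into the strptime format "%Y-%m-%d" is an assumption here, as is the helper name `valid_date?`.

```ruby
require "date"

# True when the value parses under the given strptime format.
def valid_date?(value, format = "%Y-%m-%d")
  Date.strptime(value, format)
  true
rescue ArgumentError # Date::Error is a subclass of ArgumentError
  false
end
```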

Syntax of `pattern` constraint

What's the syntax of the pattern constraint? Is this using Ruby regexps or something more platform neutral such as those specified in XML Schema?

BUG: Incorrect inconsistent_values error on numeric columns

If a numeric column has numbers with 8 digits (date_number), 14 digits (dateTime_number) or 23 digits (dateTime_nsec) and if those digits happen to be valid dates >10% of the time, then an inconsistent_values validation error is incorrectly reported.

Unless these date/time formats are extant (do we know any software that reads or writes CSV dates in this format?), they should be eliminated.

Note that this happily helps solve #105, and it is in fact a significant speed improvement after #103 (like 25% faster with #103).

Include the column value in error message when field validation fails

CSV on the web support

From @JeniT:

Specs are at:

http://w3c.github.io/csvw/syntax/
http://w3c.github.io/csvw/metadata/

The examples at:

http://w3c.github.io/csvw/csv2json/#examples

are particularly useful.

The tests are described at:

http://www.w3.org/2013/csvw/tests/

and are in Github at:

https://github.com/w3c/csvw/tree/gh-pages/tests

There’s a manifest in JSON at:

https://github.com/w3c/csvw/blob/gh-pages/tests/manifest.jsonld

which points to the manifest for the validation tests (which are the only ones I think we have to care about) which is at:

https://github.com/w3c/csvw/blob/gh-pages/tests/manifest-validation.jsonld

Creating something that maps that test manifest into whatever format tests are needed for csvlint would be good.

I suspect that you will want to distinguish between validation based on datapackage and validation based on CSV on the Web metadata: there are many similarities but they’re not identical.

Improve validation of URIs

UTF-8 BOM results in whitespace error

UTF-8 files with a Byte Order Mark have the BOM passed through to the content by default in Ruby, and the result is whitespace errors reported by csvlint.

Here's an example: http://csvlint.io/validation/543526f36373760fc6020000.

The BOM only needs to be filtered from the first line in the file, e.g.:

row.delete!("\xEF\xBB\xBF")
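A slightly more defensive variant, offered as a sketch: remove the BOM only when it appears at the very start of the first line, working at the byte level so the result does not depend on the string's encoding. `strip_bom` is a hypothetical helper. For local files, Ruby can also do this on open with the "r:bom|utf-8" mode.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b

# Strip a leading UTF-8 byte order mark, leaving other content untouched.
def strip_bom(line)
  bytes = line.b
  bytes = bytes.byteslice(3..-1) if bytes.start_with?(UTF8_BOM)
  bytes.force_encoding(line.encoding)
end
```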

Duplicate column names

Support zipped CSV?

Should we support zipped CSV files? Would make it possible to validate, e.g. the data at http://smtm.labs.theodi.org/download/

Two possible things to consider:

Zip files containing one or more CSV files, we could just take the first
Files served with a gzip content-encoding. Not sure if the current code will handle that or not.
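For the gzip case, a minimal sketch using the standard library: check for the gzip magic bytes and decompress only when they are present. `maybe_gunzip` is a hypothetical helper; multi-file zip archives would need a separate library and are not covered.

```ruby
require "zlib"

GZIP_MAGIC = "\x1F\x8B".b

# Decompress gzip data, passing anything else through unchanged.
def maybe_gunzip(data)
  data.b.start_with?(GZIP_MAGIC) ? Zlib.gunzip(data) : data
end
```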

Spec difference for type field?

This http://dataprotocols.org/json-table-schema/ suggests the type descriptor is at the root level of a field descriptor. This project's README file says it should be inside constraints and it uses XMLSchema URLs. Please could you give some info on why the difference?

Return code is always 0 (except when it isn't)

Regardless of whether the input is Valid or Invalid, csvlint returns 0. This makes it hard to use it in a script to determine whether the input was valid or not.

> csvlint a_bad.csv; echo $?
> 0

> csvlint unicorn.csv; echo $?
> 0

The usage message, triggered by running csvlint with no arguments, does return 1 as an exit code, but I think it is normal for usage/help to return 0.

> ls --help; echo $?
...
GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
General help using GNU software: <http://www.gnu.org/gethelp/>
For complete documentation, run: info coreutils 'ls invocation'
0

> ls -?; echo $?
ls: invalid option -- '?'
Try `ls --help' for more information.
2
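The convention being asked for can be summarised in a short sketch. `exit_status` is a hypothetical helper; the real script would pass its result to `exit`.

```ruby
# 0 for explicit help or valid input, 1 for invalid input.
def exit_status(valid:, help_requested: false)
  return 0 if help_requested

  valid ? 0 : 1
end
```

A shell script could then branch on the result, e.g. by testing `$?` after running the command.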

Include the failed constraints in error message when doing field validation

So then we can tell the user what actually failed, e.g. pattern, required, etc.

New lines in quoted fields are valid

Example files:

quoting issues, new-lines in quoted fields
https://raw.github.com/datasets/shell-oil-spills-niger-delta/master/data/data.csv

new-lines in quoted fields
https://raw.github.com/datasets/cofog/master/data/cofog.csv

These generate errors because there are newlines in some long comment fields. This should be valid. It's down to us manually parsing line by line.

make Schema.load_from_json_table fail less silently

Could Schema.load_from_json_table throw an error or print a warning if a schema file cannot be parsed correctly? It took a while to realise that my CSV file wasn't showing the errors I expected from schema validation because the schema wasn't loading.
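A sketch of the "fail loudly" option, assuming the schema arrives as JSON text. `SchemaLoadError` and `parse_schema!` are hypothetical names, not part of the gem.

```ruby
require "json"

# Hypothetical error class for schema loading failures.
class SchemaLoadError < StandardError; end

# Parse schema JSON, raising a descriptive error instead of returning nil.
def parse_schema!(text)
  JSON.parse(text)
rescue JSON::ParserError => e
  raise SchemaLoadError, "could not parse schema: #{e.message}"
end
```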

Improve error handling in Schema loading

Schema.from_json_table should be tolerant of a missing fields key. Currently it'll blow up if there isn't one.
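The tolerant behaviour amounts to treating a missing key as an empty list, as in this sketch (`field_definitions` is a hypothetical helper operating on the already-parsed JSON hash):

```ruby
# Return the schema's field definitions, or an empty list if absent.
def field_definitions(schema_json)
  schema_json["fields"] || []
end
```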

Wrongly reporting incorrect file extension

https://raw.github.com/datasets/cpi/master/data/cpi.csv

Generates two warnings:

Incorrect content type
Incorrect file extension

It definitely has the wrong content type, but the extension is correct.

validate called in initialize, not consistent with docs

The README suggests to construct a Validator and then call validator.validate. This is not necessary as validate is called in initialize. This results in double the number of recorded issues.

Suggest either fix the docs or remove call to validate in initialize?

Header error detection

A few suggestions for detecting errors in headers are in jekyll/jekyll#2761, from @paulfitz. Might be nice to integrate.

I think it'd definitely be reasonable to treat the following cases as errors:
Blank cells in the alleged header.
Repeated cells in the alleged header.
Numeric-looking cells (integer, float) in the alleged header (this one is a bit less reasonable than the first two, but would catch a lot more headerless CSV files).
Anything that tries to be much smarter than that, it'd be great to have a configuration switch to turn off for when predictability is important.

The draft CSV syntax states:

The first line of a CSV+ file must contain a comma-separated list of names of columns. This is known as the header line and forms the columns of the table. Column names must be unique and must not be blank

We should ensure header rows are properly handled by:

checking content type for indication of whether header is present
checking dialect for presence of header
adding an error if either of the above indicates that there is no header row
calling Schema.validate_header to check column names on the header
extend the above to check whether column names are unique and not blank
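The three header checks suggested earlier (blank cells, repeated cells, numeric-looking cells) can be sketched as follows. `header_problems` and the symbols it returns are illustrative assumptions, not existing csvlint.rb error types.

```ruby
# Return a list of problems found in an alleged header row.
def header_problems(header)
  problems = []
  problems << :blank_cell if header.any? { |c| c.nil? || c.strip.empty? }

  # Only non-blank names take part in the remaining checks.
  names = header.compact.map(&:strip).reject(&:empty?)
  problems << :duplicate_name if names.uniq.length != names.length
  problems << :numeric_cell if names.any? { |c| c.match?(/\A[-+]?\d+(\.\d+)?\z/) }
  problems
end
```

The numeric check is the most speculative of the three, so it is the natural candidate for a configuration switch.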

Add Command Line Support
