csvlint.rb's People

Contributors

adamc00, andrew, cobbr2, d-system, dependabot-preview[bot], dependabot[bot], domon, erikj, floppy, ftrotter, hoedic, jamesjefferies, jespertp-systematic, jezhiggins, jrottenberg, kotaro0522, ldodds, mseverini, nickzoic, ohbarye, petergoldstein, pezholio, quadrophobiac, rbmrclo, rddimon, rmalecky, sclinede, sivteck, sordina, youpy

csvlint.rb's Issues

Get the total number of rows in the CSV file that was validated

Hi,

Similar to the example below, I would love to be able to find out how many rows my CSV has without reparsing the whole file with Ruby's csv library.

# get some information about the CSV file that was validated
validator.encoding
validator.content_type
validator.extension
validator.rows   # proposed addition

Regards,
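
A minimal sketch of the idea, assuming a hypothetical count_rows helper built on Ruby's standard csv library (this is not csvlint's actual code): the count is collected during the same single pass that reads the rows, so nothing has to be parsed a second time.

```ruby
require "csv"
require "stringio"

# Hypothetical helper: counts rows in one streaming pass over an IO.
# The header line, if any, is counted like every other row.
def count_rows(io)
  rows = 0
  CSV.new(io).each { |_row| rows += 1 }
  rows
end

puts count_rows(StringIO.new("name,age\nalice,30\nbob,25\n"))
```

A validator that already iterates over each row could keep a counter like this and expose it through an accessor.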

Write Schema & Field rspec tests

Currently neither schema_validation.feature nor schema_spec.rb tests the full range of errors and warnings listed in this gem's README.
There are some well-specified features in csvlint.rb; these will be incorporated into specs for the gem.

Eliminate some date and time formats (for speed)

Several date and time formats were blindly imported from Rails' Date/Time formats, but I'd expect several of those are never in a CSV (e.g. the 23 digit timestamp with nanoseconds). Eliminating the unused date formats will speed up #103 even further.

Can't pipe data to csvlint

It isn't currently possible to pipe data into csvlint on the command line. As a result it can't be used to compose a workflow where one process in the chain generates CSV. For example: proc_that_generates_csv | csvlint

It should probably default to reading from stdin if no arguments are supplied, and have a separate 'usage' trigger (--help, -h)
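
A rough sketch of that behaviour, assuming a hypothetical input_source helper (the name and return values are illustrative, not part of csvlint):

```ruby
# Hypothetical helper: decide where the CLI should read CSV data from.
# Returns [:usage, nil], [:stdin, io] or [:file, path].
def input_source(argv, stdin: $stdin)
  return [:usage, nil] if argv.include?("--help") || argv.include?("-h")
  return [:stdin, stdin] if argv.empty?

  [:file, argv.first]
end

kind, _source = input_source(ARGV)
puts "reading from: #{kind}"
```

With this shape, proc_that_generates_csv | csvlint would fall into the :stdin branch, and usage would only be shown on request.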

Enumerations

Does csvlint support enumerations? I did not see an explicit enum type in the documentation. This could probably be simulated with a regex constraint but it would be nice to have native support for enumerable fields.
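
For reference, a sketch of the regex workaround, assuming the pattern constraint accepts a Ruby-style regular expression (enum_pattern is a hypothetical helper, not part of csvlint):

```ruby
# Build an anchored pattern that matches exactly one of the allowed values.
# Regexp.union escapes any regex metacharacters in the strings.
def enum_pattern(values)
  /\A(?:#{Regexp.union(values).source})\z/
end

colour = enum_pattern(%w[red green blue])
puts colour.match?("green")    # prints true
puts colour.match?("greenish") # prints false
```

The anchors matter: without them "greenish" would pass because it contains an allowed value.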

Recover from some line ending problems

Currently schema validation for a line is bypassed if we get a :line_ending error triggered by a CSV::MalformedCSVError.

Ideally, if we know it is a line ending issue, we should log the error and then attempt to re-process the line with an alternative line ending, so we get schema validation too.

Allow validator to take in a CSV file encoding on initialization for validating a local csv file that isn't utf-8 encoded.

Currently, validating local CSV files that aren't encoded as UTF-8 always fails with an encoding error.
The validator currently sets the encoding of a file using io.charset, which doesn't exist for files that are read in locally.

Suggested implementation:
Use options={} parameter to take in a file encoding and merge it into @csv_options

@pezholio I'd like to implement this feature as I think it'll be helpful, and I need it for my own use. Let me know if this makes sense in terms of the overall design of the gem.
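
To illustrate what a caller-supplied encoding would enable, here is a minimal sketch using only core Ruby. to_utf8 is a hypothetical helper, not an existing csvlint method:

```ruby
# Hypothetical helper: transcode raw bytes from a caller-supplied
# encoding to UTF-8 before handing them to the CSV parser.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

latin1 = "caf\xE9".b # "café" encoded as ISO-8859-1
puts to_utf8(latin1, "ISO-8859-1")
```

For files on disk, opening with a mode such as "r:ISO-8859-1:UTF-8" asks Ruby to do the same transcoding while reading.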

Optimization: Stream CSV

Running CSVLint on large remote CSVs is slow, because it needs to download the file, then validate. It would be possible in Ruby to stream the response body, in which case it can be validated as it downloads, saving time and memory. Also, if :limit_lines is set, CSVLint can stop the download once those lines have been read.
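
A sketch of the line-buffering part, assuming the HTTP client hands over the body as a sequence of chunks. each_streamed_line is a hypothetical helper, and it splits on bare newlines only, so it ignores newlines inside quoted fields:

```ruby
# Hypothetical helper: yields complete lines from an enumerable of body
# chunks and stops consuming chunks once `limit` lines have been yielded.
def each_streamed_line(chunks, limit: nil)
  buffer = +""
  seen = 0
  chunks.each do |chunk|
    buffer << chunk
    # Pull every complete line out of the buffer; keep any partial tail.
    while (line = buffer.slice!(/\A[^\n]*\n/))
      yield line
      seen += 1
      return if limit && seen >= limit
    end
  end
  yield buffer unless buffer.empty?
end

each_streamed_line(["a,b\n1,", "2\n3,4\n"], limit: 2) { |line| puts line }
```

Because the method returns as soon as the limit is reached, a lazily produced chunk source is never asked for the rest of the download.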

undefined method `[]' for nil:NilClass from fetch_error

The fetch_error method can trigger the above if it can't map the CSV error message to a known error type. The code should default to :unknown, but it isn't doing that; we just get an error.

I've not tracked down the specific errors that aren't being handled, probably need to check over the ruby source.
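
A sketch of the intended fallback. The pattern table and the error_type_for name are illustrative assumptions, not the actual contents of fetch_error:

```ruby
# Hypothetical mapping from fragments of parser error messages to error types.
ERROR_PATTERNS = {
  /unclosed quoted field/i => :unclosed_quote,
  /illegal quoting/i => :stray_quote,
  /line ending/i => :line_breaks
}.freeze

# Returns the mapped type, or :unknown when no pattern matches,
# instead of calling [] on a nil lookup result.
def error_type_for(message)
  match = ERROR_PATTERNS.find { |pattern, _type| pattern.match?(message) }
  match ? match.last : :unknown
end

puts error_type_for("Some brand new parser complaint")
```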

BUG: Improperly implements skipInitialSpace

The specification says that skipInitialSpace "specifies how to interpret whitespace which immediately follows a delimiter; if False, it means that whitespace immediately after a delimiter should be treated as part of the following field."

By this definition, the following CSV:

foo,bar, baz,   bzz

should parse as follows if skipInitialSpace is true:

["foo", "bar", "baz", "bzz"]

What csvlint's code does is set the delimiter to ", " (the original delimiter, in this case a comma, followed by a space). That parses the CSV as:

["foo,bar", "baz", "  bzz"]

Which is clearly incorrect.
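
One way to get the expected result, sketched with Ruby's standard csv library: keep the original delimiter and strip leading whitespace from each parsed field. parse_skip_initial_space is a hypothetical helper, and it does not cover quoted fields that are preceded by spaces.

```ruby
require "csv"

# Hypothetical helper: parse with the real delimiter, then drop any
# whitespace at the start of each field (nil fields are left alone).
def parse_skip_initial_space(line, delimiter = ",")
  CSV.parse_line(line, col_sep: delimiter).map { |field| field&.lstrip }
end

p parse_skip_initial_space("foo,bar, baz,   bzz")
# prints ["foo", "bar", "baz", "bzz"]
```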

Improving encoding detection

Ideally we want to detect what encoding a file is actually using, as this might differ from what is advertised.

See features in theodi/shared#120

For example, take the NY honours spreadsheet:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv/preview

This is delivered with a Content-Type of text/csv, with no charset parameter. US-ASCII might be assumed to be the default, although I think UTF-8 is increasingly common and might be a reasonable default assumption.

However when trying to open the file, it apparently has invalid characters.

E.g.:

$ curl -v https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv >/tmp/test.csv
$ file -bi /tmp/test.csv
text/plain; charset=unknown-8bit

Using charlock_holmes we get a bit more information:

$ irb
> require "charlock_holmes"
> contents = File.read("/tmp/test.csv")
> detection = CharlockHolmes::EncodingDetector.detect(contents)
 => {:type=>:text, :encoding=>"ISO-8859-1", :confidence=>61, :language=>"en"} 

Confidence level is relatively low, but guessed encoding seems reasonable.
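
Where charlock_holmes isn't available, a much cruder check is possible with core Ruby alone. This is only a sketch: guess_encoding is a hypothetical helper and the candidate list is an assumption, and it gives no confidence score.

```ruby
# Hypothetical helper: return the first candidate encoding under which
# the bytes form a valid string. ISO-8859-1 accepts any byte sequence,
# so it acts as the final fallback.
def guess_encoding(bytes, candidates = %w[US-ASCII UTF-8 ISO-8859-1])
  candidates.find { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

puts guess_encoding("caf\xE9".b) # prints ISO-8859-1
```

This only rules encodings out; a statistical detector is still needed to tell similar single-byte encodings apart.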

`format` not validating correctly

If my schema has, for example:

...
{
  "name": "Date",
  "type": "date",
  "description": "",
  "constraints": {
    "format": "yyyy-mm-dd",
    "required": true
  }
}
...

And I validate against a CSV that has an incorrect date pattern e.g. "23/09/1980", then I don't get any errors.

Also, http://dataprotocols.org/json-table-schema/#field-types says that every field that has type date should match ISO8601, unless format is specified.
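
A sketch of the check I would expect, using Ruby's standard date library. The mapping from the schema's "yyyy-mm-dd" to a strptime pattern is my assumption, and valid_date? is a hypothetical helper:

```ruby
require "date"

# Assumed mapping from schema-style format strings to strptime directives.
STRPTIME_FORMATS = { "yyyy-mm-dd" => "%Y-%m-%d" }.freeze

# Returns true when the value parses under the given format, false otherwise.
def valid_date?(value, format = "yyyy-mm-dd")
  Date.strptime(value, STRPTIME_FORMATS.fetch(format))
  true
rescue ArgumentError
  false
end

puts valid_date?("1980-09-23") # prints true
puts valid_date?("23/09/1980") # prints false
```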

Syntax of `pattern` constraint

What's the syntax of the pattern constraint? Is this using Ruby regexps or something more platform neutral such as those specified in XML Schema?

BUG: Incorrect inconsistent_values error on numeric columns

If a numeric column has numbers with 8 digits (date_number), 14 digits (dateTime_number) or 23 digits (dateTime_nsec) and if those digits happen to be valid dates >10% of the time, then an inconsistent_values validation error is incorrectly reported.

Unless these date/time formats are extant (do we know any software that reads or writes CSV dates in this format?), they should be eliminated.

Note that this happily helps solve #105, and it is in fact a significant speed improvement after #103 (like 25% faster with #103).

CSV on the web support

From @JeniT:

Specs are at:

http://w3c.github.io/csvw/syntax/
http://w3c.github.io/csvw/metadata/

The examples at:

http://w3c.github.io/csvw/csv2json/#examples

are particularly useful.

The tests are described at:

http://www.w3.org/2013/csvw/tests/

and are in Github at:

https://github.com/w3c/csvw/tree/gh-pages/tests

There’s a manifest in JSON at:

https://github.com/w3c/csvw/blob/gh-pages/tests/manifest.jsonld

which points to the manifest for the validation tests (which are the only ones I think we have to care about) which is at:

https://github.com/w3c/csvw/blob/gh-pages/tests/manifest-validation.jsonld

Creating something that maps that test manifest into whatever format tests are needed for csvlint would be good.

I suspect that you will want to distinguish between validation based on datapackage and validation based on CSV on the Web metadata: there are many similarities but they’re not identical.

Support zipped CSV?

Should we support zipped CSV files? Would make it possible to validate, e.g. the data at http://smtm.labs.theodi.org/download/

Two possible things to consider:

  • Zip files containing one or more CSV files, we could just take the first
  • Files served with a gzip content-encoding. Not sure if the current code will handle that or not.
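
For the gzip case, a minimal sketch using Ruby's standard zlib library. maybe_gunzip is a hypothetical helper; it checks the two-byte gzip magic number rather than trusting response headers:

```ruby
require "zlib"

GZIP_MAGIC = "\x1F\x8B".b.freeze

# Hypothetical helper: decompress the body if it looks like gzip data,
# otherwise return it untouched.
def maybe_gunzip(bytes)
  bytes.b.start_with?(GZIP_MAGIC) ? Zlib.gunzip(bytes) : bytes
end

compressed = Zlib.gzip("a,b\n1,2\n")
puts maybe_gunzip(compressed)
```

Zip archives are a different container format and would need a separate library, so they are not covered here.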

Return code is always 0 (except when it isn't)

Regardless of whether the input is Valid or Invalid, csvlint returns 0. This makes it hard to use it in a script to determine whether the input was valid or not.

> csvlint a_bad.csv; echo $?
> 0

> csvlint unicorn.csv; echo $?
> 0

Usage, triggered by running csvlint with no arguments, does return 1 as an exit code, but I think it is normal for usage/help to return 0.

> ls --help; echo $?
...
GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
General help using GNU software: <http://www.gnu.org/gethelp/>
For complete documentation, run: info coreutils 'ls invocation'
0

> ls -?; echo $?
ls: invalid option -- '?'
Try `ls --help' for more information.
2
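
One possible convention, sketched as a hypothetical exit_status helper (the outcome names and codes are assumptions, loosely following the ls behaviour above):

```ruby
# Hypothetical mapping from the outcome of a run to a process exit status.
def exit_status(outcome)
  case outcome
  when :valid, :help then 0
  when :invalid then 1
  else 2 # usage errors such as an unknown option
  end
end

puts exit_status(:invalid)
```

The CLI would then finish with something like exit(exit_status(outcome)), so scripts can branch on $?.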

make Schema.load_from_json_table fail less silently

Could Schema.load_from_json_table throw an error or print a warning if a schema file cannot be parsed correctly? It took me a while to realise that my CSV file wasn't showing the errors I expected from schema validation because the schema wasn't loading.
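
Something along these lines would have saved me the debugging time. This is a sketch with a hypothetical parse_schema! helper, not a patch against the real method:

```ruby
require "json"

# Hypothetical helper: parse schema JSON and fail loudly, naming the
# source, instead of silently carrying on without a schema.
def parse_schema!(source, text)
  JSON.parse(text)
rescue JSON::ParserError => e
  raise ArgumentError, "could not parse schema #{source}: #{e.message}"
end

begin
  parse_schema!("schema.json", '{"fields": [')
rescue ArgumentError => e
  warn e.message
end
```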

validate called in initialize, not consistent with docs

The README suggests constructing a Validator and then calling validator.validate. This is not necessary, as validate is already called in initialize, and following the README results in double the number of recorded issues.

Suggest either fixing the docs or removing the call to validate in initialize?

Header error detection

A few suggestions for detecting errors in headers are in jekyll/jekyll#2761, from @paulfitz. Might be nice to integrate.

I think it'd definitely be reasonable to treat the following cases as errors:

  • Blank cells in the alleged header.
  • Repeated cells in the alleged header.
  • Numeric-looking cells (integer, float) in the alleged header (this one is a bit less reasonable than the first two, but would catch a lot more headerless CSV files).

Anything that tries to be much smarter than that, it'd be great to have a configuration switch to turn off for when predictability is important.
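
The three cases listed above could be checked roughly like this. header_problems and the symbol names are hypothetical, and the numeric test is a deliberately simple pattern:

```ruby
# Hypothetical helper: report which of the three suspicious conditions
# apply to the cells of an alleged header row.
def header_problems(cells)
  problems = []
  problems << :blank_cell if cells.any? { |c| c.nil? || c.strip.empty? }

  non_blank = cells.compact.map(&:strip).reject(&:empty?)
  problems << :repeated_cell if non_blank.uniq.length != non_blank.length
  problems << :numeric_cell if non_blank.any? { |c| c.match?(/\A[-+]?\d+(\.\d+)?\z/) }
  problems
end

p header_problems(["id", "", "id", "3.14"])
# prints [:blank_cell, :repeated_cell, :numeric_cell]
```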

Inconsistent column bases

Sometimes 0-based, sometimes 1-based. Failing test in the inconsistent-column-base branch. Not sure where this is happening...

Publish on rubygems.org

Given you already have a gemspec, is there any reason you haven't published it as a gem on rubygems.org? Thanks.

Ensure header rows are properly handled and validated

We should ensure that header rows are properly handled:

The presence of a header might be indicated in HTTP response, or in the CSV dialect which can now include a header key to indicate if there is a header (defaulting to true)

The draft CSV syntax states:

The first line of a CSV+ file must contain a comma-separated list of names of columns. This is known as the header line and forms the columns of the table. Column names must be unique and must not be blank

We should ensure header rows are properly handled by:

  • checking content type for indication of whether header is present
  • checking dialect for presence of header
  • adding an error if either of the above indicates that there is no header row
  • calling Schema.validate_header to check column names on the header
  • extend the above to check whether column names are unique and not blank
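
For the content type check, RFC 4180 defines an optional header parameter on text/csv with the values present and absent. A sketch of reading it, using a hypothetical header_declared? helper:

```ruby
# Hypothetical helper: returns true for header=present, false for
# header=absent, and nil when the content type does not say either way.
def header_declared?(content_type)
  value = content_type.to_s[/;\s*header\s*=\s*"?(present|absent)"?/i, 1]
  value && value.downcase == "present"
end

p header_declared?("text/csv; charset=utf-8; header=absent") # prints false
```

The nil case lets the caller fall back to the dialect's header key and its default of true.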
