
csvw's Introduction

CSV on the Web Repository

The Past

This repository was originally used by W3C's CSV on the Web Working Group. The group has defined and published 7 documents, namely:

  1. CSV on the Web: A Primer
  2. Model for Tabular Data and Metadata on the Web
  3. Metadata Vocabulary for Tabular Data
  4. Generating JSON from Tabular Data on the Web
  5. Generating RDF from Tabular Data on the Web
  6. Embedding Tabular Metadata in HTML
  7. Use Cases and Requirements

The group was chaired by Jeni Tennison and Dan Brickley. The W3C staff contact was Ivan Herman. The group completed its work in February 2016 and was officially closed in March 2016.

The repository includes a final release, representing the status of the repository when the Working Group was closed.

The Present

After the closure of the Working Group, the activity around CSV (and, more generally, tabular) data on the Web has been taken up by the CSV on the Web Community Group. All discussions on possible new features, implementation experiences, possible issues, etc., are conducted in that community group, which is open to everyone. Although the Community Group is not allowed to make changes to the official documents, this repository is used for issue management, new documents, experimentation, etc. If you are interested in this area, please join the Community Group!

csvw's People

Contributors

6a6d74, afs, ajtucker, andimou, danbri, darobin, davideceolin, dret, edsu, ericstephan, gkellogg, iherman, jeremytandy, prayagverma, rossjones, waingram, yakovsh


csvw's Issues

Locating additional metadata when originally starting from a metadata document

What should processors do if they have been passed a metadata file and have located a CSV file from that metadata file? Should they still check for metadata files related specifically to the CSV file itself? For example, say that an application has been pointed at a metadata file at http://example.org/metadata.json which references http://example.org/toilets.csv, but there is also a metadata file at http://example.org/toilets.csv-metadata.json. If the processor had originally been pointed to http://example.org/toilets.csv then it would have located the file-specific metadata at http://example.org/toilets.csv-metadata.json, but coming via http://example.org/metadata.json means that the file-specific metadata is skipped.
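For illustration only, the metadata document in this scenario might look something like the sketch below; the property names ("url", "schema", "columns") and the column names are assumptions for the example, not the group's agreed vocabulary.

{
  "url": "http://example.org/toilets.csv",
  "schema": {
    "columns": [
      { "name": "location" },
      { "name": "opening_hours" }
    ]
  }
}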

CSV Dialect Description

Do we want a way to describe the "dialect" of the CSV in the metadata document (e.g. the separator is ';')?

There is an existing spec at http://dataprotocols.org/csv-dialect/ which we could reuse:

{
  "delimiter": ",",
  "doubleQuote": true,
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "skipInitialSpace": true,
  "header": true
}
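For the semicolon-separated case mentioned above, a dialect description reusing the same keys might look like this (values are illustrative):

{
  "delimiter": ";",
  "quoteChar": "\"",
  "header": true
}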

What to do with the conversion if no column name is given?

The current metadata document says that the schema property is optional and, even if it is set, the column, row, and cell properties are all optional. What this means is that the names of the columns may not be available. If so, the current RDF/JSON conversion algorithms fail. A possibility is to define (in the metadata document) a default fallback for column naming if nothing provides them. (Or we could require the presence of schema/columns.)

The discussion on a teleconference on the 1st of October put forward the approach of letting the conversion fail altogether; this is now reflected in the conversion documents.

Is row by row processing sufficient?

In the Processing Model of the Generating RDF from Tabular Data on the Web doc, there is an issue raised stating:

"""
Independently processed rows - is this always the case?
"""

There are examples (see Use Case #24 - Expressing a hierarchy within occupational listings) where "blank" fields imply "ditto" with respect to the field above (or the last time that field was not blank). At first glance this seems pretty trivial, yet the example in the use case uses a multi-level hierarchy, and sometimes "blank" means "empty" (null) rather than "ditto". As such, the arbitrary processing required to "guess the behaviour applied to blank cells" is somewhat challenging.
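As a simplified illustration of the pattern (invented data, not the actual use case listing), blanks in the grouping columns mean "same as above", while a blank at the deepest level simply means there is no value:

Major group,Sub-group,Occupation
Managers,,
,Chief executives,
,,Chief executive officer
,,Managing director
,Production managers,
,,Production manager (mining)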

As such, I recommend that we don't try to process this mode of behaviour during the transformation. If people have CSV data with "blanks that mean ditto", they need to fill in the blanks first.

Given that, I suggest that we stick with the model that processes each row independently and does not require us to maintain state from row to row.

What line endings should be expected?

Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype must always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" must represent a line break. Use of CR and LF outside of line break sequences is also forbidden."

Should we be defining application/csv instead, to prevent having to adhere to this rule, or should we stick to the CRLF rule?

Open and/or Closed validation requirement

R-CsvValidation

David Booth suggests:

"It sounds like the R-CsvValidation requirement may need to be split into two separate validation requirements:

R-CsvOpenValidation: Does the data in the CSV conform to the metadata, ignoring inapplicable metadata? For example, is every column in the CSV described by some metadata?

R-CsvClosedValidation: Does the metadata describe anything that does NOT appear in the CSV?

I suppose if the metadata had a notion of optional columns then both of these cases could be covered at once."

However, at this point the use cases only appear to relate to the closed validation case. Do we need another use case to support open validation?

What should be generated for a value with a datatype in the case of JSON?

There are some alternatives.

  1. either the datatype value is ignored, i.e., the original (string) value is used as the value
  2. like before, but an extra @context entry is generated denoting the datatype (see the sketch after this list; this may not work, because cell-level metadata may set a specific datatype for a single cell, whereas a @context entry would be valid for all values of a given key)
  3. a JSON object is generated of the form (borrowed from, and compatible with, JSON-LD):
"predicate" : {
  "value" :    -- the original value of the cell as a string --
  "datatype" : -- the datatype set for that specific cell --
}
  4. one can also imagine a combination of alternatives 2 and 3: set a generic value for a column as part of a @context structure, and then store the data as an object only if there is a deviation from the @context.
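A minimal JSON-LD-style sketch of alternative 2, using a hypothetical column key "temperature"; note that the coercion in the @context applies to every value of that key, which is exactly the limitation noted above:

{
  "@context": {
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "temperature": { "@type": "xsd:decimal" }
  },
  "temperature": "23.5"
}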

Con for alternatives 2 and 3: they make the @context structure an integral part of the output almost every time; for a non-RDF, i.e., non-JSON-LD usage that seems to be overkill.

Con for alternative 3: the generated JSON increases in size, which may be an issue for large tables.

Personally, I would propose to go with alternative 3.

Should the "@type" information be reflected in the generated JSON/RDF?

The current conversion follows the metadata specification insofar as, for example, the "@type": "Row" is set (in the JSON version) optionally, depending on whether the original metadata contains that @type or not. On the other hand, the values ("Row", "Column") are fixed strings (or URIs in the RDF version), so it may be simpler to add that information unconditionally to the output.

Should there be limitations on the syntax of column names?

What syntactic limitations should there be on column names to make them most useful when used as the basis of conversion into other formats, bearing in mind that different target languages such as JSON, RDF, and XML have different syntactic limitations and common naming conventions?

How to interpret fixed string type values ("Table", "Row",...)

It is not clear how to interpret these values (at the moment they are "@type": "Table", "@type": "Column", and "@type": "Row", to keep the JSON/RDF versions in sync). I see several possibilities:

  • Assign URIs of the form csv:Table, etc., and use these as the values in both RDF and JSON.
  • Use a URI like the above for RDF, and use a string for JSON; for the latter, rely on a possible @context to keep the two versions in sync.

At the moment the documents follow the second alternative.
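A possible @context along the lines of the second alternative might look like the sketch below (the csvw namespace URI is illustrative, not a decision of the group); with such a context, the plain string "Table" in the JSON output expands to a URI in the RDF view:

{
  "@context": {
    "csvw": "http://www.w3.org/ns/csvw#",
    "Table": "csvw:Table",
    "Column": "csvw:Column",
    "Row": "csvw:Row"
  },
  "@type": "Table"
}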

Lack of example for canonical mapping use case

Use Case #17 - Canonical mapping of CSV

This use case lacks a concrete example to further illustrate the concerns. Whilst the concerns outlined above are relevant for a wide range of open data publication, it would be useful to include an example dataset here, which might subsequently be used in test suites relating to the motivating requirements of this use case.

It may be appropriate to merge this use case with UC-IntelligentlyPreviewingCSVFiles as that is concerned with using an interim JSON encoding to support intelligent preview of unannotated tabular data.

Using JSON-LD for the metadata document

We are aiming for the JSON format to be interpretable as JSON-LD, but without any requirement to include a context within the JSON itself (to save people from having to include boilerplate). We invite comments on the utility of this approach: is it useful for CSV metadata to be interpretable as JSON-LD? Is it helpful to be able to map it to RDF? Would it be better to rename some of the JSON-LD keywords, such as @id and @type?
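For example, a metadata file along the lines of the sketch below (property names and values are purely illustrative) could be read either as plain JSON or, with a context supplied by the processor, as JSON-LD, which is where the @id and @type keywords come from:

{
  "@id": "http://example.org/toilets.csv",
  "@type": "Table",
  "schema": {
    "columns": [
      { "name": "location", "title": "Location" }
    ]
  }
}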

Handling multiple titles

Copying @gkellogg's comment from #53 into this new issue:

The extracted header rows are different from the name of the column. I guess this is most closely related to the title rather than the name metadata. However, it does come into play for an uninformed-mapping, where there is no metadata, and presumably the column header is used to construct the default name.

If we have a CSV with two header rows, such as the Use Case #12:

5
methane molecule (in angstroms)
C        0.000000        0.000000        0.000000
H        0.000000        0.000000        1.089000
H        1.026719        0.000000       -0.363000
H       -0.513360       -0.889165       -0.363000
H       -0.513360        0.889165       -0.363000

I think that this shows four columns where the first column is the only one with a header cell, whose header consists of the two values ["5", "methane molecule (in angstroms)"].

The metadata document states the following:

If the column already has a title annotation (because a header row has been included in the original CSV file) then a validator must issue a warning if the existing title annotation is not the same as any of the possible column titles.

Is this one title or two? How should this correspond to title information from the metadata?

(This seems to be off-topic for this issue, but may need to be tracked someplace, anyway).
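One way the two header values could surface is as an array-valued title in the column metadata, along the lines of this sketch (the column name is hypothetical, and this is not a recorded decision):

{
  "columns": [
    { "name": "atom", "title": ["5", "methane molecule (in angstroms)"] }
  ]
}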

Marking tables as right-to-left in the media type

An alternative approach is for the CSV to be parsed into a table model in which the columns are numbered in reverse, for tables which are either marked as or detected to be right-to-left tables. For example, we could introduce a bidi=rtl or similar media type parameter, and use this to determine whether the first column in the table generated from the CSV is the text before the first comma in each line or the text after the last comma in the line.

Controlled vocabulary for table-direction

The values of table-direction should be a defined controlled vocabulary in JSON-LD, so that the values map on to URIs in the RDF version rather than strings. We invite comment on how to configure the JSON-LD context to enable these values to be interpreted in this way.
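One way to do this in JSON-LD is to coerce the values of table-direction to vocabulary terms, roughly as in the sketch below (the property and namespace URIs are illustrative, not a proposal of the group):

{
  "@context": {
    "csvw": "http://www.w3.org/ns/csvw#",
    "table-direction": { "@id": "csvw:tableDirection", "@type": "@vocab" },
    "rtl": "csvw:rtl",
    "ltr": "csvw:ltr"
  },
  "table-direction": "rtl"
}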

Mismatch between CSV files and tables

A CSV file might not be the same as the table that it contains. For example, a given CSV file might contain two tables (in different regions of the CSV file), or might contain a table that isn't positioned at the top left of the CSV file. We invite comment about whether we should assume that pre-processing is used to extract tables where there isn't a 1:1 correspondence between CSV file and table, or not.

language setting in columns, rows, etc...

At the moment it is possible to set "@language" in the context section. However, wouldn't it be necessary to be able to set "language" at the column, row, and cell level, too? A predominantly English table may have a column (or a row) in French...
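A sketch of what a column-level override might look like (the "language" property at column level is what this issue proposes, not something the current documents define; the column names are invented):

{
  "@context": { "@language": "en" },
  "schema": {
    "columns": [
      { "name": "label" },
      { "name": "libelle", "language": "fr" }
    ]
  }
}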

Default charset for CSV files

RFC 4180 defines the default charset as US-ASCII because that was (at the time RFC 4180 was written) the default charset for all text/* media types. This has been superseded by RFC 6657. Section 3 of RFC 6657 states "new subtypes of the "text" media type should not define a default "charset" value. If there is a strong reason to do so despite this advice, they should use the "UTF-8" [RFC3629] charset as the default."

Do we have a strong reason to specify a default charset? Should IETF be defining application/csv instead, to avoid doing unrecommended things with a text/* media type?

[TDW] Fixing up file and streaming

Sec 3.3 (edd6183 : 2014-02-26)

To fix up CSV files in which different lines contain different numbers of values,
additional empty values should be added to the end of lines such that they all contain
the same number of values as the line with the most values.

Knowing which line has the most values requires looking through the whole file. A fix that depends on this means that you cannot output any lines before finding the longest row; it prevents streaming.

A line longer than the header line has cells with no column names, so again you would have to generate the extra columns before doing anything else.

Suggestion:

  • pad short lines to the length of the header row with empty fields
  • have the suggested processing API provide an "extra" return for lines found to be longer than the header row

geopoint definition in metadata vocabulary needs more consideration

The datatypes section within the Metadata Vocabulary for Tabular Data defines geopoint:

"""
a comma-separated longitude and latitude (i.e. values that, after stripping leading and trailing whitespace, are in the format longitude\s*,\s*latitude)
"""

GeoJSON defines "positions" as the basic element of geometry and uses a JSON array to hold the individual coordinate values.

It's important to specify the coordinate reference system (CRS) within which the lon/lat are provided; a default CRS is typically WGS 84.

Normally, this is defined using EPSG 4326 with the coordinate tuples as [lat, lon]

Alternatively, one may also choose to include altitude (from the geoid surface) using EPSG 4979 which defines coordinate tuples as [lat, lon, alt].

GeoJSON also allows the CRS to be user defined with the crs object.
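For comparison, a point in (2008-era) GeoJSON with an explicit named crs object; the coordinates are illustrative and, per GeoJSON, given in [longitude, latitude] order:

{
  "type": "Point",
  "coordinates": [ -1.5491, 53.8008 ],
  "crs": {
    "type": "name",
    "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" }
  }
}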

My concern is that the specification over-trivializes the geometry constructs, and more consideration needs to be given to incorporating best practice from existing geo data specifications.

type vs datatype in column metadata

Issue 1

  • type vs datatype in fields/columns schema
    • type is used in the Tabular Data Format but conflicts with the Dublin Core type
"columns": [
  {
    "name": "A",
    "type": "string"
  },
  ...
]
vs

"columns": [
  {
    "name": "A",
    "datatype": "string"
  }
]

Issue 2

  • datatype vs dataType
    • camelCase is standard for JSON
    • But that could be confusing, and arguably "datatype" is one word

How should class-level qualified properties be transformed to RDF?

Apart from a few cases ("title" and "language" at this moment), most of the generic, top-level metadata properties are supposed to be qualified, meaning they are of the form prefix:reference. How should these be transformed into proper URIs for RDF? There seem to be two possibilities:

  • require the presence of a @context in the metadata that defines a URI for the prefix, and use that
  • as above, but also fall back on a set of predefined prefixes in case the prefix definition is missing.

For the second case, the predefined prefixes for the RDFa initial context[1] might be reused here.

[1] http://www.w3.org/2011/rdfa-context/rdfa-1.1
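A sketch of the first alternative, where the metadata itself supplies the prefix mapping (the dc prefix and the property used are only an example; http://purl.org/dc/terms/ is the Dublin Core terms namespace):

{
  "@context": {
    "dc": "http://purl.org/dc/terms/"
  },
  "dc:title": "Public toilets"
}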

Organization of the rows and cells in the metadata document

At the moment, row- and cell-level metadata are organized as follows:

 "schema":
    "rows": [{
        "row": "index of row"
   ],[
       ...
  ]},

This structure seems to be very inefficient for implementations if the row-level metadata array is very large and there is also a large number of rows in the data. Wouldn't it be more efficient to use something like:

"schema":
   "rows": {
     "index1" : { ... }
     "index2" : { ... }
   }

Where does row numbering begin?

At the moment (e.g., in the RDF/JSON conversions) there is a notion of a row number. It is not clear whether the header row (if present) should count or not, or whether row numbers begin at the first data row (with number 1).

There should be a "Fragment Model" section in the Data Model document

This issue probably overlaps with #9. Generally speaking, the data model document should have a section talking about the fragment model (or about the fact that the data model does not provide a fragment model, and that such a model is left as an exercise for later standardization). For RFC 4180, RFC 7111 defined a fragment model, and it went through a couple of iterations before it was finished. It would be good if CSV+ provided a fragment model, or explicitly said that it didn't (and then maybe even why).
