
csvw's Introduction

CSV on the Web Repository

The Past

This repository was originally used by W3C's CSV on the Web Working Group. The group has defined and published 7 documents, namely:

  1. CSV on the Web: A Primer
  2. Model for Tabular Data and Metadata on the Web
  3. Metadata Vocabulary for Tabular Data
  4. Generating JSON from Tabular Data on the Web
  5. Generating RDF from Tabular Data on the Web
  6. Embedding Tabular Metadata in HTML
  7. Use Cases and Requirements

The group was chaired by Jeni Tennison and Dan Brickley. The W3C staff contact was Ivan Herman. The group completed its work in February 2016 and was officially closed in March 2016.

The repository includes a final release, representing the status of the repository when the Working Group was closed.

The Present

After the closure of the Working Group, the activity around CSV (and, more generally, tabular) data on the Web has been taken up by the CSV on the Web Community Group. All discussions on possible new features, implementation experiences, possible issues, etc., are conducted in that community group, which is open to everyone. Although the Community Group is not allowed to make changes to the official documents, this repository is used for issue management, new documents, experimentation, etc. If you are interested in this area, please join the Community Group!

csvw's People

Contributors

6a6d74, afs, ajtucker, andimou, danbri, darobin, davideceolin, dret, edsu, ericstephan, gkellogg, iherman, jeremytandy, prayagverma, rossjones, waingram, yakovsh


csvw's Issues

Locating additional metadata when originally starting from a metadata document

What should processors do if they have been passed a metadata file and have located a CSV file from that metadata file? Should they still check for metadata files related specifically to the CSV file itself? For example, say that an application has been pointed at a metadata file at http://example.org/metadata.json which references http://example.org/toilets.csv, but there is also a metadata file at http://example.org/toilets.csv-metadata.json. If the processor had originally been pointed to http://example.org/toilets.csv then it would have located the file-specific metadata at http://example.org/toilets.csv-metadata.json, but coming via http://example.org/metadata.json means that the file-specific metadata is skipped.
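For illustration only, the metadata document in this scenario might look something like the sketch below; the property names ("url", "schema", "columns") and the column names are assumptions for the example, not the group's agreed vocabulary.

{
  "url": "http://example.org/toilets.csv",
  "schema": {
    "columns": [
      { "name": "location" },
      { "name": "opening_hours" }
    ]
  }
}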

CSV Dialect Description

Do we want a way to describe the "dialect" of the CSV in the metadata document (e.g. the separator is ';')?

There is an existing spec at http://dataprotocols.org/csv-dialect/ which we could reuse:

{
  "delimiter": ",",
  "doubleQuote": true,
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "skipInitialSpace": true,
  "header": true
}
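For the semicolon-separated case mentioned above, a dialect description reusing the same keys might look like this (values are illustrative):

{
  "delimiter": ";",
  "quoteChar": "\"",
  "header": true
}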

What to do with the conversion if no column name is given?

The current metadata document says that the schema property is optional and, even if it is set, the column, row, and cell properties are all optional. What this means is that the names of the columns may not be available. If so, the current RDF/JSON conversion algorithms fail. A possibility is to define (in the metadata document) a default fallback for column naming if nothing provides them. (Or we could require the presence of schema/columns.)

The discussion on a teleconference on the 1st of October put forward the approach of letting the conversion fail altogether; this is now reflected in the conversion documents.

Is row by row processing sufficient?

In the Processing Model of the Generating RDF from Tabular Data on the Web doc, there is an issue raised stating:

"""
Independently processed rows - is this always the case?
"""

There are examples (see Use Case #24 - Expressing a hierarchy within occupational listings) where "blank" fields imply "ditto" with respect to the field above (or the last time that field was not blank). At first glance this seems pretty trivial, yet the example in the use case uses a multi-level hierarchy, and sometimes "blank" means "empty" (null) rather than "ditto". As such, the arbitrary processing required to "guess the behaviour applied to blank cells" is somewhat challenging.
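As a simplified illustration of the pattern (invented data, not the actual use case listing), blanks in the grouping columns mean "same as above", while a blank at the deepest level simply means there is no value:

Major group,Sub-group,Occupation
Managers,,
,Chief executives,
,,Chief executive officer
,,Managing director
,Production managers,
,,Production manager (mining)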

As such, I recommend that we don't try to process this mode of behaviour during the transformation. If people have CSV data with "blanks that mean ditto", they need to fill in the blanks first.

Given that, I suggest that we stick with the model that processes each row independently and does not require us to maintain state from row to row.

What line endings should be expected?

Section 4.1.1 of RFC2046 specifies that "The canonical form of any MIME "text" subtype must always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" must represent a line break. Use of CR and LF outside of line break sequences is also forbidden."

Should we be defining application/csv instead, to prevent having to adhere to this rule, or should we stick to the CRLF rule?

Open and/or Closed validation requirement

R-CsvValidation

David Booth suggests:

"It sounds like the R-CsvValidation requirement may need to be split into two separate validation requirements:

R-CsvOpenValidation: Does the data in the CSV conform to the metadata, ignoring inapplicable metadata? For example, is every column in the CSV described by some metadata?

R-CsvClosedValidation: Does the metadata describe anything that does NOT appear in the CSV?

I suppose if the metadata had a notion of optional columns then both of these cases could be covered at once."

However, at this point the use cases only appear to relate to the closed validation case. Do we need another use case to support open validation?

What should be generated for a value with a datatype in the case of JSON?

There are some alternatives.

  1. either the datatype value is ignored, i.e., the original (string) value is used as the value
  2. like before, but an extra @context entry is generated denoting the datatype (see the sketch after this list; this may not work, because cell-level metadata may set a specific datatype for a single cell, whereas a @context entry would be valid for all values of a given key)
  3. a JSON object is generated of the form (borrowed from, and compatible with, JSON-LD):
"predicate" : {
  "value" :    -- the original value of the cell as a string --
  "datatype" : -- the datatype set for that specific cell --
}
  4. one can also imagine a combination of alternatives 2 and 3: set a generic value for a column as part of a @context structure, and then store the data as an object only if there is a deviation from the @context.
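A minimal JSON-LD-style sketch of alternative 2, using a hypothetical column key "temperature"; note that the coercion in the @context applies to every value of that key, which is exactly the limitation noted above:

{
  "@context": {
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "temperature": { "@type": "xsd:decimal" }
  },
  "temperature": "23.5"
}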

Con for alternatives 2 and 3: they make the @context structure an integral part of the output almost every time; for a non-RDF, i.e., non-JSON-LD usage that seems to be overkill.

Con for alternative 3: the generated JSON increases in size, which may be an issue for large tables.

Personally, I would propose to go with alternative 3.

Should the "@type" information be reflected in the generated JSON/RDF?

The current conversion follows the metadata specification insofar as, for example, the "@type": "Row" is set (in the JSON version) optionally, depending on whether the original metadata contains that @type or not. On the other hand, the values ("Row", "Column") are fixed strings (or URIs in the RDF version), so it may be simpler to add that information unconditionally to the output.

Should there be limitations on the syntax of column names?

What syntactic limitations should there be on column names to make them most useful when used as the basis of conversion into other formats, bearing in mind that different target languages such as JSON, RDF, and XML have different syntactic limitations and common naming conventions?

How to interpret fixed string type values ("Table", "Row",...)

It is not clear how to interpret these values (at the moment they are "@type": "Table", "@type": "Column", and "@type": "Row", to keep the JSON/RDF versions in sync). I see several possibilities:

  • Assign URIs of the form csv:Table, etc., and use these as the values in both RDF and JSON.
  • Use a URI like the above for RDF, and use a string for JSON; for the latter, rely on a possible @context to keep the two versions in sync.

At the moment the documents follow the second alternative.
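A possible @context along the lines of the second alternative might look like the sketch below (the csvw namespace URI is illustrative, not a decision of the group); with such a context, the plain string "Table" in the JSON output expands to a URI in the RDF view:

{
  "@context": {
    "csvw": "http://www.w3.org/ns/csvw#",
    "Table": "csvw:Table",
    "Column": "csvw:Column",
    "Row": "csvw:Row"
  },
  "@type": "Table"
}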

Lack of example for canonical mapping use case

Use Case #17 - Canonical mapping of CSV

This use case lacks a concrete example to further illustrate the concerns. Whilst the concerns outlined above are relevant for a wide range of open data publication, it would be useful to include an example dataset here, which might subsequently be used in test suites relating to the motivating requirements of this use case.

It may be appropriate to merge this use case with UC-IntelligentlyPreviewingCSVFiles as that is concerned with using an interim JSON encoding to support intelligent preview of unannotated tabular data.

Using JSON-LD for the metadata document

We are aiming for the JSON format to be interpretable as JSON-LD, but without any requirement to include a context within the JSON itself (to save people from having to include boilerplate). We invite comments on the utility of this approach: is it useful for CSV metadata to be interpretable as JSON-LD? Is it helpful to be able to map it to RDF? Would it be better to rename some of the JSON-LD keywords, such as @id and @type?
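For example, a metadata file along the lines of the sketch below (property names and values are purely illustrative) could be read either as plain JSON or, with a context supplied by the processor, as JSON-LD, which is where the @id and @type keywords come from:

{
  "@id": "http://example.org/toilets.csv",
  "@type": "Table",
  "schema": {
    "columns": [
      { "name": "location", "title": "Location" }
    ]
  }
}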

Handling multiple titles

Copying @gkellogg's comment from #53 into this new issue:

The extracted header rows are different from the name of the column. I guess this is most closely related to the title rather than the name metadata. However, it does come into play for an uninformed-mapping, where there is no metadata, and presumably the column header is used to construct the default name.

If we have a CSV with two header rows, such as the Use Case #12:

5
methane molecule (in angstroms)
C        0.000000        0.000000        0.000000
H        0.000000        0.000000        1.089000
H        1.026719        0.000000       -0.363000
H       -0.513360       -0.889165       -0.363000
H       -0.513360        0.889165       -0.363000

I think that this shows four columns where the first column is the only one with a header cell, whose header consists of the two values ["5", "methane molecule (in angstroms)"].

The metadata document states the following:

If the column already has a title annotation (because a header row has been included in the original CSV file) then a validator must issue a warning if the existing title annotation is not the same as any of the possible column titles.

Is this one title or two? How should this correspond to title information from the metadata?

(This seems to be off-topic for this issue, but may need to be tracked someplace, anyway).
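One way the two header values could surface is as an array-valued title in the column metadata, along the lines of this sketch (the column name is hypothetical, and this is not a recorded decision):

{
  "columns": [
    { "name": "atom", "title": ["5", "methane molecule (in angstroms)"] }
  ]
}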

Marking tables as right-to-left in the media type

An alternative approach is for the CSV to be parsed into a table model in which the columns are numbered in reverse, for tables which are either marked as or detected to be right-to-left tables. For example, we could introduce a bidi=rtl or similar media type parameter, and use this to determine whether the first column in the table generated from the CSV is the text before the first comma in each line or the text after the last comma in the line.

Controlled vocabulary for table-direction

The values of table-direction should be a defined controlled vocabulary in JSON-LD, so that the values map on to URIs in the RDF version rather than strings. We invite comment on how to configure the JSON-LD context to enable these values to be interpreted in this way.
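One way to do this in JSON-LD is to coerce the values of table-direction to vocabulary terms, roughly as in the sketch below (the property and namespace URIs are illustrative, not a proposal of the group):

{
  "@context": {
    "csvw": "http://www.w3.org/ns/csvw#",
    "table-direction": { "@id": "csvw:tableDirection", "@type": "@vocab" },
    "rtl": "csvw:rtl",
    "ltr": "csvw:ltr"
  },
  "table-direction": "rtl"
}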

Mismatch between CSV files and tables

A CSV file might not be the same as the table that it contains. For example, a given CSV file might contain two tables (in different regions of the CSV file), or might contain a table that isn't positioned at the top left of the CSV file. We invite comment about whether we should assume that pre-processing is used to extract tables where there isn't a 1:1 correspondence between CSV file and table, or not.

language setting in columns, rows, etc...

At the moment it is possible to set "@language" in the context section. However, wouldn't it be necessary to be able to set "language" at the column, row, and cell level, too? A predominantly English table may have a column (or a row) in French...
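A sketch of what a column-level override might look like (the "language" property at column level is what this issue proposes, not something the current documents define; the column names are invented):

{
  "@context": { "@language": "en" },
  "schema": {
    "columns": [
      { "name": "label" },
      { "name": "libelle", "language": "fr" }
    ]
  }
}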

Default charset for CSV files

RFC 4180 defines the default charset as US-ASCII because that was (at the time RFC 4180 was written) the default charset for all text/* media types. This has been superseded by RFC 6657. Section 3 of RFC 6657 states "new subtypes of the "text" media type should not define a default "charset" value. If there is a strong reason to do so despite this advice, they should use the "UTF-8" [RFC3629] charset as the default."

Do we have a strong reason to specify a default charset? Should IETF be defining application/csv instead, to avoid doing unrecommended things with a text/* media type?

[TDW] Fixing up file and streaming

Sec 3.3 (edd6183 : 2014-02-26)

To fix up CSV files in which different lines contain different numbers of values,
additional empty values should be added to the end of lines such that they all contain
the same number of values as the line with the most values.

Knowing which line has the most values requires looking through the whole file. A fix that depends on this means that you cannot output any lines before finding the longest row; it prevents streaming.

A line longer than the header line has cells with no column names, so again you would have to generate the extra columns before doing anything else.

Suggestion:

  • pad short lines to the length of the header row with empty fields
  • have the suggested processing API provide an "extra" return for lines found to be longer than the header row

geopoint definition in metadata vocabulary needs more consideration

The datatypes section within the Metadata Vocabulary for Tabular Data defines geopoint:

"""
a comma-separated longitude and latitude (i.e. values that, after stripping leading and trailing whitespace, are in the format longitude\s*,\s*latitude)
"""

GeoJSON defines "positions" as the basic element of geometry and uses a JSON array to hold the individual coordinate values.

It's important to specify the coordinate reference system (CRS) within which the lon/lat are provided; a default CRS is typically WGS 84.

Normally, this is defined using EPSG 4326 with the coordinate tuples as [lat, lon]

Alternatively, one may also choose to include altitude (from the geoid surface) using EPSG 4979 which defines coordinate tuples as [lat, lon, alt].

GeoJSON also allows the CRS to be user defined with the crs object.
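For comparison, a point in (2008-era) GeoJSON with an explicit named crs object; the coordinates are illustrative and, per GeoJSON, given in [longitude, latitude] order:

{
  "type": "Point",
  "coordinates": [ -1.5491, 53.8008 ],
  "crs": {
    "type": "name",
    "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" }
  }
}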

My concern is that the specification over-trivializes the geometry constructs, and more consideration needs to be given to incorporating best practice from existing geo data specifications.

type vs datatype in column metadata

Issue 1

  • type vs datatype in fields/columns schema
    • type is used in the Tabular Data Format but conflicts with the Dublin Core type
"columns": [
  {
    "name": "A",
    "type": "string"
  },
  ...
]
vs

"columns": [
  {
    "name": "A",
    "datatype": "string"
  }
]

Issue 2

  • datatype vs dataType
    • camelCase is standard for JSON
    • But that could be confusing, and arguably "datatype" is one word

How should class-level qualified properties be transformed to RDF?

Apart from a few cases ("title" and "language" at this moment), most of the generic, top-level metadata properties are supposed to be qualified, meaning they are of the form prefix:reference. How should these be transformed into proper URIs for RDF? There seem to be two possibilities:

  • require the presence of a @context in the metadata that defines a URI for the prefix, and use that
  • as above, but also fall back on a set of predefined prefixes in case the prefix definition is missing.

For the second case, the predefined prefixes for the RDFa initial context[1] might be reused here.

[1] http://www.w3.org/2011/rdfa-context/rdfa-1.1
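A sketch of the first alternative, where the metadata itself supplies the prefix mapping (the dc prefix and the property used are only an example; http://purl.org/dc/terms/ is the Dublin Core terms namespace):

{
  "@context": {
    "dc": "http://purl.org/dc/terms/"
  },
  "dc:title": "Public toilets"
}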

Organization of the rows and cells in the metadata document

At the moment, row- and cell-level metadata are organized as follows:

 "schema":
    "rows": [{
        "row": "index of row"
   ],[
       ...
  ]},

This structure seems to be very inefficient for implementations if the row-level metadata array is very large and there is also a large number of rows in the data. Wouldn't it be more efficient to use something like:

"schema":
   "rows": {
     "index1" : { ... }
     "index2" : { ... }
   }

Where does row numbering begin?

At the moment (e.g., in the RDF/JSON conversions) there is a notion of a row number. It is not clear whether the header row (if present) should count or not, or whether row numbers begin at the first data row (with number 1).

There should be a "Fragment Model" section in the Data Model document

This issue probably overlaps with #9. Generally speaking, the data model document should have a section talking about the fragment model (or about the fact that the data model does not provide a fragment model, and that such a model is left as an exercise for later standardization). For RFC 4180, RFC 7111 defined a fragment model, and it went through a couple of iterations before it was finished. It would be good if CSV+ provided a fragment model, or explicitly said that it didn't (and then maybe even why).
