
table2qb's Introduction

table2qb tesseract animation

Build Status

Build Statistical Linked-Data with CSV-on-the-Web

Create statistical linked-data by deriving CSV-on-the-Web annotations for your data tables using the RDF Data Cube Vocabulary.

Build up a knowledge graph from spreadsheets without advanced programming skills or RDF modelling knowledge.

Simply prepare CSV inputs according to the templates and table2qb will output standards-compliant CSVW or RDF.

Once you're happy with the results you can adjust the configuration to tailor the URI patterns to your heart's content.

Turn Data Tables into Data Cubes

Table2qb expects three types of CSV tables as input:

  • observations: a 'tidy data' table with one statistic per row (what the standard calls an observation)
  • components: another table defining the columns used to describe observations (what the standard calls component properties such as dimensions, measures, and attributes)
  • codelists: a further set of tables that enumerate and describe the values used in cells of the observation table (what the standard calls codes, grouped into codelists)

For example, the ONS says that:

In mid-2019, the population of the UK reached an estimated 66.8 million

This is a single observation value (66.8 million) with two dimensions (date and place), which respectively take the code values mid-2019 and UK, a single measure (population estimate), and implicitly an attribute for the unit (people).
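As a 'tidy data' observations table, that statement might be a single row along these lines (a purely illustrative sketch - the actual column headings must match your columns configuration):

```csv
Date,Area,Measure Type,Unit,Value
mid-2019,UK,Population Estimate,People,66800000
```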

The regional-trade example goes into more depth. The colour-coded spreadsheet should help illustrate how the three types of table come together to describe a cube.

Each of these inputs is processed by its own pipeline, which outputs CSVW - i.e. a processed version of the CSV table along with a JSON metadata annotation that describes the translation into RDF. Optionally, you can ask table2qb to perform the translation itself, outputting RDF directly that can be loaded into a graph database and queried with SPARQL.

Table2qb also relies on a fourth CSV table for configuration:

  • columns: this describes how the observations table should be interpreted - i.e. which components and codelists should be used for each column in the observation tables

This configuration is designed to be shared by multiple data cubes across a data collection (so that you can re-use e.g. a "Year" column without having to configure it anew each time), encouraging harmonisation and alignment of identifiers.
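For illustration only, an entry in the columns configuration might look something like the following sketch, using the headings discussed in the issues below (title, name, component_attachment, property_template, value_template, datatype) - consult the templates for the exact format:

```csv
title,name,component_attachment,property_template,value_template,datatype
Year,year,qb:dimension,http://purl.org/linked-data/sdmx/2009/dimension#refPeriod,http://reference.data.gov.uk/id/year/{year},string
```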

Ultimately table2qb provides a foundation to help you build a collection of interoperable statistical linked open data.

Install table2qb

Github release

Download the release from https://github.com/Swirrl/table2qb/releases.

Currently the latest is 0.3.0.

Once downloaded, unzip the archive. The main 'table2qb' executable is in the directory ./target/table2qb-0.3.0. You can add this directory to your PATH environment variable, or just run it with the full file path on your system.
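For example, on Linux or macOS you might add a line like the following to your shell profile (adjust the path to wherever you unzipped the release):

export PATH="$PATH:/path/to/target/table2qb-0.3.0"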

Clojure CLI

The Clojure CLI tools provide the clojure and clj command-line programs for running Clojure programs. To run table2qb through the clojure command, first install the Clojure CLI tools, then create a file deps.edn containing the following:

deps.edn

{:deps {swirrl/table2qb {:git/url "https://github.com/Swirrl/table2qb.git"
                         :sha "8c4b22778db0c160b06f2f3b0b3df064d8f8452b"}
        org.apache.logging.log4j/log4j-api {:mvn/version "2.19.0"}
        org.apache.logging.log4j/log4j-core {:mvn/version "2.19.0"}
        org.apache.logging.log4j/log4j-slf4j-impl {:mvn/version "2.19.0"}}
 :aliases
 {:table2qb
  {:main-opts ["-m" "table2qb.main"]}}}

You can then run table2qb using

clojure -A:table2qb
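Note that newer versions of the Clojure CLI prefer -M for invoking :main-opts (running the alias with -A may print a deprecation warning), so depending on your version you may instead need:

clojure -M:table2qb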

More details about the Clojure CLI and the format of the deps.edn file can be found on the Clojure website.

Running table2qb

Table2qb is written in Clojure and uses tools.deps. It is recommended you use JDK 17 or later.

table2qb can be run via the clojure CLI tools through the :cli alias:

clojure -M:cli list

To get help on the available commands, type clojure -M:cli help.

To see the available pipelines (described in more detail below), type clojure -M:cli list.

To see the required command structure for one of the pipelines (for example the cube-pipeline), type clojure -M:cli describe cube-pipeline

How to run table2qb

See using table2qb for documentation on how to generate RDF data cubes using table2qb.

Example

The ./examples/employment directory provides an example of creating a data cube from scratch with table2qb.

License

Copyright © 2018 Swirrl IT Ltd.

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

Acknowledgements

The development of table2qb was funded by Swirrl, by the UK Office for National Statistics and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 693849 (the OpenGovIntelligence project).

table2qb's People

Contributors

ajtucker, alasdairgray, billswirrl, julijahansen, lkitching, rickmoynihan, ricswirrl, robsteranium


table2qb's Issues

Remove example-specific incidental config (derive from columns.csv)

Several functions hard-coded column headers to make the example work. These ought to be replaced with headers looked-up from the column configuration:

  • is-dimension?, is-attribute?, values and measures for determining column role
  • standardise-measure, slugise-columns, and replace-symbols for doing column-specific cleaning magic (hopefully #18 can resolve this)
  • observation-template for the ordering of component slugs in the URI

Option to serialise csvw to file

With #27 we're now passing the json metadata to csv2rdf with an in-memory map (rather than writing it out to disk). This is designed to make the whole process more memory efficient when called via grafter-server/ pmd.

Ideally we would also support the original use case - so that people could use table2qb standalone, writing the csvw output (i.e. JSON and CSV files). This would allow users to pick a different csv2rdf implementation, or to target csvw itself.

Configure column order

The order of columns in observations input data affects the order in which slugs are inserted into URIs. This means that 2 spreadsheets having different column orders could lead to 2 different URIs for the same observation. We need to make sure that the column order is consistent to avoid partial uploads (e.g. initial insert then subsequent update) creating duplicates.

In the past we have stored the initial order of the upload using the qb:order property of the component spec (this was then later retrieved to enable the re-ordering of subsequent uploads).

In this case we could define this by configuration - e.g. the order the columns are specified in the columns.csv. Once we store the configuration in the database (#21) this would need to be recorded as an explicit column order property.

This configuration would be used to reorder the columns before passing to csv2rdf. We could also create e.g. <compspec1> qb:order 1 triples.
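A minimal Clojure sketch of that reordering step (hypothetical names; configured-order would come from columns.csv or, later, the stored configuration):

```clojure
;; Sort observation headers into the canonical order defined by the
;; configuration, so URI slugs are generated in the same sequence
;; regardless of the input spreadsheet's column order.
(defn reorder-headers
  "Returns headers sorted by their position in configured-order; headers
  not mentioned in the configuration sort to the end."
  [configured-order headers]
  (let [position (zipmap configured-order (range))]
    (sort-by #(get position % Long/MAX_VALUE) headers)))

;; (reorder-headers ["Year" "Area" "Measure Type" "Unit" "Value"]
;;                  ["Area" "Value" "Year"])
;; => ("Year" "Area" "Value")
```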

Generating PMD Features

This could be:

  • a grafter pipeline that derives these from the contents of the database (e.g. with params for types etc)
  • a table2qb pipeline: this could probably just provide an additional metadata-json for existing codelist csv

Specify csvw metadata in turtle to begin?

The json-ld standard is relatively new as is the support for it in rdf4j. Turtle should be equivalent. We know grafter and our other tools have good support for turtle.

  1. We would need to support json-ld eventually if we wanted to offer a clojure implementation of csv2rdf.

  2. We would at least need to read json-ld if we wanted to use the csv2rdf test cases (these would probably save us some effort since they provide a ready-made spec for TDD etc).

URI Templating

One thing I picked up reading the w3c csvw stuff is URI Templating: https://tools.ietf.org/html/rfc6570. I think this could be useful to us in a couple of ways...

Most obviously for grafter... there's a clojure implementation here we might adopt https://github.com/mwkuster/uritemplate-clj. e.g.

(uritemplate-clj.core/uritemplate "http://pmd.com/data/area{/gss}/period{/date}{/measure}{/unit}"
                                  {"gss" "E010000001"
                                   "date" "2017"
                                   "measure" "count"
                                   "unit" "people"})
#=> "http://pmd.com/data/area/E010000001/period/2017/count/people"

We could simply use it like a labelled add-path-segments or even extend further because the template would know which fields it needs to build a URI so it would a) not need to be told which fields to use and b) be able to validate and catch missing-slug bugs. Maybe we could even do compile-time checks?

Another possibility is for making URIs shorter. If we publish the template e.g. as part of a vocabulary, then we wouldn't necessarily have to explain the segments within the URI. The above case would become http://pmd.com/data/{/gss}{/date}{/measure}{/unit}.

Plan for open-sourcing table2qb

  1. Allow application to set configuration (as csv or SPARQL endpoint) #20 & #21
  2. How to allow application to register transformations?
  3. How to link component->property-uri and columns.property_template (user needs to know URI convention, can we make this explicit?)
  4. Resolve performance issues #24, #26.
  5. Split core into namespaces?
  6. Review regional-trade example (perhaps use third-party codelist in place of the slugged services scheme)
  7. Write a getting started doc

BASEing URI's

Was thinking about better prefix strategies and using RDF’s @base earlier… I mentioned this to @ricroberts, and we discovered you can use BASE to prefix relative to path segments!

e.g. if we changed our naming strategy we could relativise names to a dataset and write more beautiful Turtle/URIs. The sample below is valid and parses in RDF4j perfectly:

@base <http://statistics.gov.scot/data/reconvictions> .
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix : <http://blah/> .

<> qb:structure </dsd> .

</dsd> a qb:DataStructureDefinition ;
    qb:component </comp-spec/refPeriod> ;
    qb:component </comp-spec/refArea> .

</comp-spec/refPeriod> :blah :blah . 
</comp-spec/refArea> :blah :blah .

</obs/cca113731b4867901a8d722973c3a9dc39562b8a> qb:dataSet <> .
</obs/f99028cacf3551e9863ce1b8a8d6c5c803efcad0> qb:dataSet <> .
</obs/136f1d181aec3d64e89dfe87a71697bf7369795e> qb:dataSet <> .

From RDF 1.1 Concepts and Abstract Syntax:

Relative IRIs: Some concrete RDF syntaxes permit relative IRIs as a convenient shorthand that allows authoring of documents independently from their final publishing location. Relative IRIs must be resolved against a base IRI to make them absolute. Therefore, the RDF graph serialized in such syntaxes is well-defined only if a base IRI can be established [RFC3986].

Implementations might have slightly different behaviours regarding BASE precedence. e.g. stardog/sparql lets you override a default base from the store by setting it in your construct query.

However, when reading a file with RDF4j, the base from the file is used rather than the one from your code... i.e. the precedence is the reverse of SPARQL's. This might be a bug in RDF4j that they're willing to change, though.

Another super useful option is the idea that you can set BASE to nil to preserve the relative paths. JSONLD lets you do this in the spec itself; but RDF4j fails to read such graphs :-( . So this might be a bug or feature or outside the general spec of RDF behaviour; but worth investigating whether we can get this fixed/changed.

Some initial comments on the readme

General

  • We should look at CSV on the web as a way to provide the metadata. This is already JSON LD, so we don't need to invent our own
  • I like the idea of a strict mode, and an optional pre-cleansing step to get it into that format.

Deriving URIs

  • I think that CSV on the web might have mostly solved the problems we have here.
  • Deriving code list URIs: I think I'd prefer it if the vocabs needed to already exist, and the stats creation would fail if we used something that mapped to a non-existent concept etc.
  • Can we have a separate vocab pipeline, which would let us do hierarchical codelists (e.g. for ages), which we can't do (so easily) if we make them on the fly.

Import vs update with / without PMD

  • I'd like to focus on the business as usual case (the 'with pmd' case in your notes), where we assume a dataset already exists, rather than making datasets as a side effect of something.
  • For an initial import, creating a load of PMD-datasets with enough metadata to show up in PMD, could be tackled by just doing an initial pass on the data, which makes a load of empty PMD datasets.
  • The only real constraints at the moment around with-pmd operation are that:
    • i) each grafter run produces a single graph of dataset contents which goes in the data graph. (We currently also create vocabs as a side effect too but I'd prefer not to do that, as mentioned above).
    • ii) pmd's admin panel controls the pmd-specific metadata (in the metadata graph).
  • Note that it's fine for the graft to emit triples about the dataset into the data graph (as we already do for creating DSDs and the <dataset> a qb:DataSet triple).

Dataset URIs

  • This line confused me a bit:

in order to namespace entity URIs we require a slug (which we have so far found by querying the dataset name).

  • For the append-to dataset operation, we shouldn't need to use a dataset name to derive a slug. For new datasets, we can certainly derive the slug (and therefore URI) from a name. But not the other way round, since you're allowed to change a dataset's name after creating it.
  • At the moment, when running an append-pipeline, the pipeline gets passed the URI when it gets run, so we shouldn't need to look up the name.

Distinguish top concept in hierarchy

We currently create skos:topConceptOf and skos:hasTopConcept relations between every code and the codelist. We should only create those relations when the parent notation is null (i.e. only those codes without parents are at the top of the hierarchy).

As I understand it, csvw does not provide a way to control the output of one cell based upon the value of another. The conditional processing based upon cell values requirement (which may have supported our use case) was deferred.

I'd be delighted to know if there is a way to do this within csvw/ csv2rdf!

One possible workaround is to create a new column that only contains a non-null value when the row is a top concept.
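A minimal Clojure sketch of that workaround, assuming codelist rows are read as maps keyed by header name (add-top-concept-column is a hypothetical helper, not part of table2qb):

```clojure
(require '[clojure.string :as str])

(defn add-top-concept-column
  "Adds a \"Top Concept Of\" cell containing the codelist URI when the row
  has no parent notation, and an empty string otherwise, so the csvw
  metadata can emit skos:hasTopConcept only for parentless codes."
  [codelist-uri rows]
  (map (fn [row]
         (assoc row "Top Concept Of"
                (if (str/blank? (get row "Parent Notation"))
                  codelist-uri
                  "")))
       rows))
```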

components.csv

Based on the input csv I would define the configuration as follows.

Label,Description,Component Type,Codelist
Geography,Geographical reference Area,Dimension,
Quarter,Reference Period,Dimension,
Gender,The state of being male or female,Dimension,http://statistics.gov.scot/def/concept-scheme/gender
Measure Type,What is being measured,Dimension, 
### however I do not see how to indicate that this is the qb:measureType
Unit,Unit of Measurement,Attribute,
Value,Measured quantity,Measure,

In the example configuration.csv only Gender has been included.

How does the system know that geography and quarter are also dimensions?
What should one do if the local URI standards are different from the URI pattern used here, e.g. having def# or ending with #id?
What should one do if one wants to use the SDMX dimension IDs (refArea, refPeriod) instead of local ones?

Then the second line describes the values of the Measure Type dimension.
Does the conversion expect a fixed header 'Measure Type'?
What if your cube structure doesn't need a measure type?

Does the conversion expect a fixed header 'Value'? How does it know for the moment that this column contains the measured value?

Validate codelist pipeline inputs

At the moment you get an unhelpful Index Out of Bounds error if you provide an input with the wrong column headings.

We should provide some validation of the codelist input:

  • require only the Label field - throw an exception if it is missing
  • permit Notation, Parent Notation, Sort Priority and Description
  • if anything else is found, throw an exception reporting the column name (see the sketch below)
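A minimal Clojure sketch of that validation (hypothetical names, not the actual implementation):

```clojure
(require '[clojure.set :as set])

(def required-columns #{"Label"})
(def optional-columns #{"Notation" "Parent Notation" "Sort Priority" "Description"})

(defn validate-codelist-headers!
  "Throws an informative exception if the header row is missing a required
  column or contains an unrecognised one."
  [headers]
  (let [header-set (set headers)
        missing    (set/difference required-columns header-set)
        unknown    (set/difference header-set required-columns optional-columns)]
    (when (seq missing)
      (throw (ex-info (str "Missing required column(s): " missing) {:missing missing})))
    (when (seq unknown)
      (throw (ex-info (str "Unrecognised column(s): " unknown) {:unknown unknown})))))
```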

Add example for pipeline parameters

The describe task generates an example invocation for executing the pipeline; the example values for pipeline parameters are currently based on the parameter names. Instead, add an example value to each parameter definition and use that in the generated invocation.

Keeping columns configuration in the database

We currently have the configuration in columns.csv. It should be stored in the database and extracted by a query. This would allow users to define permissible columns in a table of observations by uploading the components.

It would also be useful to be able to run table2qb as a standalone service - i.e. we should also allow the configuration to be specified by the csv as a fallback (either defaults taken from the packaged resource or e.g. as another command line option).

Data Model

We could probably re-use the csvw vocabulary. This would mean creating a dataset of csvw:Column resources each having a csvw:title, csvw:name, csvw:propertyUrl, csvw:valueUrl and csvw:datatype.

We could extract the "component_attachment" attribute (the predicate applied to a qb:ComponentSpec to identify the qb:ComponentProperty) which isn't in csvw with a query like:

?column                                          # e.g. :date_column
  csvw:propertyUrl/a ?property_type .            # e.g. qb:DimensionProperty

?component_attachment                            # e.g. qb:dimension
  rdfs:subPropertyOf qb:componentProperty;
  rdfs:range ?property_type .

Validation

When loading the columns configuration: we would need to ensure that columns are unambiguously identified by their csvw:title (i.e. only one column for a given title, although a given column could have more than one title).
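For the title check, a minimal sketch (assuming each configured column is represented as a map with a :titles collection - a hypothetical shape, not the actual data model):

```clojure
(defn duplicate-titles
  "Returns any csvw:title used by more than one configured column."
  [columns]
  (->> columns
       (mapcat :titles)
       frequencies
       (keep (fn [[title n]] (when (> n 1) title)))))
```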

When loading observations (i.e. when using the column configuration): we could permit columns to re-use csvw:name values across a database (e.g. two columns titled "Double" and "Integer" which both yield a column named "value" but apply different datatypes) but, since the name is used as a reference within URI templates, only one column of each name would be permitted within the context of a given csv file/ csvw:tableSchema.

Upload

We ought to generate the csvw:Column when the component itself is created (by users) with a seed supplied by us for external properties (e.g. sdmx-dim:refPeriod). We could add the above csvw fields to the components pipeline, with defaults as follows:

That would mean adding another metadata file e.g. columns.json that would point at components.csv.

Wiring-up table2qb, csv2rdf.clj and grafter

The following provides a specification of what we'll need from this integration.

Retrieve the configuration from the database

We're currently loading the configuration from a csv file when the project is initialised. Instead we ought to be able to load it from SPARQL when the transformation pipeline is called (#21).

This would allow users to modify how the transformation pipeline was configured. You would first upload your new column configurations (with a suitable pipeline), then upload a spreadsheet of observations using the columns you'd just configured. This will be critical for handling e.g. new dimensions given that, unlike past pmd transformation pipelines, we're no longer creating vocabularies at the same time as observations.

Specifically this requires:

  • runtime parameterisation of e.g. name->component (via dependency injection, atom/redef, or configuration monad)
  • a columns/ configuration model+pipeline (i.e. to upload columns.csv and create e.g. csvw:Column - see issue #21)
  • a SPARQL query to extract the above (and functions to transform this into the lookups etc table2qb expects)

Similarly we will ultimately want to look up codelists in order to a) look up URIs for codes from labels, and (potentially) b) validate the inputs so we can fail early (before generating RDF).

Orchestrate the csv2rdf calls

table2qb will output multiple csv and json files. The integration will need to orchestrate this as per these calls - i.e. having 3 grafter pipelines:

  1. components pipeline
  2. codelists pipeline
  3. cube pipeline (note that this consists of 6 JSON files and only 2 CSVs)

In each case we would need to pass the uploaded input CSV to table2qb, then take the resulting JSON and CSV and pass those to csv2rdf.

Optionally we could persist the intermediate outputs, potentially hosting them for remote retrieval. Indeed, the csv2rdf standard seems to require that the JSON metadata refer to the CSV file with a url property and that implementations retrieve their inputs from this - further, this leads to outputs like e.g. <row> csvw:url <file://input.csv#row=1> in standard mode (i.e. the row URIs extend these URLs). See #11 for more discussion.

The practical consequence of hosting the intermediate files is that we could use them for a) debugging (such that the validation report could link to the cell/ column/ row that violated a given rule) and b) as a pre-made tabular serialisation (although this might only be a subset of the observation in a cube).

Note that we might want to ignore this optional persistence until we've implemented csv2rdf standard mode (since tracking cell inputs isn't part of the minimal mode we're targeting at this point).

At the moment, table2qb groups together multiple files by writing/ reading from the same directory. If we're dealing with pipelines that use file-inputs we could a) submit a tar archive (which has the benefit of being a single request) or b) maintain state of partial requests (since the "job" would consist of several requests).

We'd need to be able to run this as a standalone grafter pipeline and within the context of grafter-server.

Validation

It makes sense to distinguish two opportunities for validation:

  • tabular - can refer to tabular features (cells/ rows/ columns) and parsing issues (inferring types from string inputs etc)
  • graph - can refer to structural issues in the data, particularly from the overall context of the whole database

We would want to run tabular validations during the table2qb phase. We could use csvw's tabular metadata specification for describing the validation. Since the definition of "valid" will depend upon the column configuration, we would need to generate such schema based upon what's in the database. Other validations (e.g. all cells are populated) would be context-free. Some examples of criteria:

  • Are all of the columns recognised (i.e. in the column configuration)?
  • (for multi-measure cubes) is at least one qb:MeasureProperty column provided?
  • (for measures-dimension cubes) is the qb:MeasureType property (and no qb:MeasurePropertys) provided?
  • Are cells provided for all values (where a value is either a column without a component-attachment or one representing a measure)? We might allow exceptions where a data marker is provided (in another column).
  • Can all values be parsed correctly? We might also require that they have the correct datatype (rdfs:range of the measure).
  • Are cells provided for all components?
  • If the code is specified as a label, can the URI be found (see #18)?
  • Do the columns (or more specifically the component specs therefrom) match a pre-existing DSD (for partial uploads)?

We would want to run graph validations after the csv2rdf phase. We could use grafter-extra pmd/ qb validators (SPARQL ask), or something like the shapes report we've built for nhs. In any case, the validator should be able to query the live contents of the database.

  • Are the qb integrity constraints met?
  • Are pmd's requirements met?
  • Is all of the pre-requisite reference data loaded (i.e. are all values of components themselves subjects with labels)?
  • Are all codes members of codelists?
  • Given a codelist, does the cube provide an observation for each code (warning, not error)?
  • Do the new observations fit in the old cube (i.e. do the component-properties and order (since this determines observation URIs) match)?
  • Do any of the uploaded observations clobber old ones (and if so, do they have matching values)?
  • See draft nhs rdf-shapes for more examples...

Notice that there is some overlap - specifically in looking-up codes and ensuring partial-uploads have a common DSD (and thus observation-uri structure). It might make more sense to do this in the context of the graph (where you can point at the codelists/ DSD that are already loaded) but failing-fast (i.e at the table stage) may be preferable (and we've been able to return understandable errors at this stage in the past).

How to specify that cell's value should be treated as an object-node rather than a literal?

How should I specify that a cell's value e.g. "http://purl.org/linked-data/sdmx/2009/measure#obsValue" should be treated as an object-node rather than a literal?

If you give "valueUrl": "{colname}" then you get <file://path/to/filename.csv/http%3A%2F%2Fpurl.org%2Flinked-data%2Fsdmx%2F2009%2Fmeasure%23obsValue>. This is a node, but the URI has been percent-encoded and appended to a base.

If we give "datatype": "anyURI" in the columns spec, this leads to e.g. "http://purl.org/linked-data/sdmx/2009/measure#obsValue"^^xsd:anyURI, i.e. a typed literal.

I only seem to be able to do this by first creating a string value instead e.g. "obsValue" and specifying a template with "valueUrl": "http://purl.org/linked-data/sdmx/2009/measure#{colname}" to get <http://purl.org/linked-data/sdmx/2009/measure#obsValue> in the output.
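For reference, that workaround corresponds to a column definition roughly like the following (a sketch, not the exact metadata table2qb emits), with a cell value of "obsValue":

```json
{
  "titles": "colname",
  "name": "colname",
  "valueUrl": "http://purl.org/linked-data/sdmx/2009/measure#{colname}"
}
```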

But what if you've already got a URI in your csv cells? What about a CURIE?

EDN as well as JSON?

Rick has suggested that we might want to consider supporting edn alongside, if not instead of, json because edn allows comments.

Remove columns.csv resource

Remove resources/columns.csv. Some of the tests depend on this file so either re-work them to construct a new configuration inline or make it a test resource.

Add command-line interface

Allow pipelines to be run from the command line. Different pipelines take different arguments so we need a command to describe each pipeline as well as to run it. The suggested interface is:

  • table2qb list Lists all pipelines
  • table2qb describe [pipeline name] Describes the pipeline and its arguments
  • table2qb run [pipeline name] args Runs the pipeline with the specified arguments

An example usage with the current pipelines might be:

table2qb list

codelist-pipeline
components-pipeline
cube-pipeline

table2qb describe cube-pipeline

Outputs a datacube for a dataset from an input file
Arguments:
  --input-csv     The CSV file containing observation data
  --dataset-name  Name of the dataset
  --dataset-slug  URI slug for the dataset
  --output-file   File to write the generated RDF to

The --output-file parameter will be common to all pipelines.

table2qb run cube-pipeline --input-csv input.csv --dataset-name 'Test Dataset' --dataset-slug test-dataset --output-file out.ttl

Allow data to be pulled via HTTP GET

Currently, CSV observation data is submitted to table2qb online pipelines via HTTP POST with multipart/form-data.
It would be good to have the option to pass a URL to table2qb and have the CSV observation data fetched by HTTP GET.
This would allow:

  • provenance information to be properly linked back to the source - more RESTful as we're putting all our resources on the web;
  • easier async hand-off between a client "uploader" and the table2qb "processor";
  • faster failure as all the data need not be uploaded before a problem is found;
  • easier clients :) (I'm currently having to chunk the uploads)
  • potentially easier server-side as streaming/back-pressure could be simplified.

URI vs Slug vs String

The new loading architecture suggests to me a possible solution to the stringy-identifiers problem:

  • if the column configuration identifies the datatype as string:
    • if the column configuration includes a value-template, then cell value should be treated as ready for that template (i.e. already slugged or a code without spaces)
    • else if the column configuration points at a component property that specifies a qb:codeList then use the value in that cell to look up a code by its label (see the sketch after this list)
    • else treat the value as a string literal (e.g. observation label)
  • else if the datatype is number:
    • parse the string as a number (as per csvw)
  • else if the datatype is xsd:anyURI
    • parse the string as a URI (as per csvw), I think this would also allow curies if the csv2rdf process already recognises the prefix
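A rough Clojure sketch of those rules, assuming the column configuration is a map with :datatype, :value_template and :codelist keys (hypothetical names; lookup-code-by-label is a stub for whatever codelist lookup we end up with):

```clojure
(defn lookup-code-by-label
  "Stub: the real system would query the component's qb:codeList for a
  code whose label matches the cell value."
  [codelist-uri label]
  {:codelist codelist-uri :label label})

(defn interpret-cell
  [{:keys [datatype value_template codelist]} cell]
  (case datatype
    "string" (cond
               value_template {:slug cell}                          ;; already slugged / template-ready
               codelist       (lookup-code-by-label codelist cell)  ;; look the code up by its label
               :else          {:literal cell})                      ;; plain string literal
    "number" {:literal (Double/parseDouble cell)}
    "anyURI" {:uri (java.net.URI. cell)}))
```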

Given that multiple column names could theoretically be used to populate the same component, this gives us quite a bit of flexibility. For example, for specifying a reference period you might provide three columns all having sdmx-dim:refPeriod in their property_template configuration, but each with a different value_template:

  • "Year": "http://reference.data.gov.uk/id/year/{year}" accepting values like "2018"
  • "Government Year":"http://reference.data.gov.uk/id/government-year/{government-year}" accepting values like "2017-2018"
  • "Reference Period": "http://reference.data.gov.uk/id/{reference_period}" (a generic fallback) accepting values like "government-quarter/2009-2010/Q1"

Thus the uploader would indicate which kind of date they'd provided by using the appropriate column heading. They could then provide more than one kind of interval by either splitting the upload or by providing multiple (non-overlapping) columns in the table (i.e. some rows having Years, others Government Years but none both).

One implication of this is that we wouldn't be slugging anything in the observations input csv! We would still need to do so as part of the codelist pipeline (i.e. to create a skos:notation where none was provided).

Output CSVW in standard mode within pipelines

Pipelines currently invoke csv2rdf in standard mode which includes statements about the table structure in the output RDF - specify minimal mode instead. In future, the output should be configurable.

Ensure column interpreted correctly in codelist-pipeline

My assumption was that csv2rdf would identify columns on the basis of the csvw:title field in the column schema. It seems like they're being interpreted on the basis of column order. We had to change the flow-directions.csv slightly to make sure it worked.

If the assumption is correct then we ought to patch csv2rdf to find columns by title not position.

If the assumption is mistaken then:

  • we'll need to alter table2qb to ensure that the intermediate code csv files are written in the same order as the metadata;
  • I'm not sure what the purpose of csvw:title is!
  • the spec might be suffering from a problem since json-ld doesn't guarantee the order of vectors (thus the RDF translation of a metadata schema could break)

Applying CSVW transformations to parameters

Lifting this issue out of slack, for more general discussion/clarification. It may or may not be a problem.

Is CSVW perhaps best understood as a method for describing things after the fact, and therefore almost purely declarative? It’s not really about processing, so there’s no notion of process or any real notion of inputs/outputs or parameterisable arguments etc.

It seems like it might be best thought of as a markup format for Google to understand CSV files rather than a method to transform data. Obviously at some level they're the same thing, but one specific area where this distinction matters is that processes can be reused with different inputs.

Yet CSVW appears to have no notion of how you might parameterise a CSVW transformation process.

How do "process ideas" such as parameters/arguments/environments/contexts apply here?

Generated metadata contains an invalid common property key

Running demo.sh causes the following warnings to be logged:

At path ["tableSchema" "columns" 9]: Invalid common property key 'value'

This is emitted by csv2rdf due to an invalid property key in the metadata document. Remove or replace this key with a valid value.

Create stub for CLI

Create and package a table2qb shell script for running the CLI which wraps the java -jar path/to/jar invocation.

Remove demo namespace

Remove table2qb.demo namespace and extract the example into a shell script with the corresponding CLI invocations.

Tidy-up intermediate files

With #25 we're no longer deleting the temporary files that are used to record the intermediate representation that table2qb passes to csv2rdf.

Since these are required until the complete sequence of output quads has been created (and table2qb doesn't control that), we don't really know when it's safe to delete these files.

Support measure column(s)

We currently take the measures in the values of a measure type column. This is consistent with the measures-dimension approach.

We could also look to support the multi-measure approach within table2qb. That would mean providing one or more columns like e.g.

{
  :title "Observation Value"
  :component_attachment qb:measure
  :property_template sdmx-dim:obsValue
}

In the single-measure case (as above but only one such column) we could decide to create a qb:measureType dimension anyway (with only one value) although allowing the user to specify would make this easier to configure.

Configure logging for dev/uberjar

Configure a logging framework for use during development and in the uberjar (if only to remove the SLF4J warnings on startup). Consider using the logging framework for outputting from the CLI.

Include non-observation data in the json-ld?

The csvw rdf-cube example includes the DSD in the metadata.json as json-ld. The immediate output attaches the DSD to the identifier given in the @id key - at this stage it's just a csvw:Table. It doesn't become a qb:DataSet until after the qb-normalisation post processing step. Indeed setting a @type key seems to cause the RDF::Tabular implementation, at least, to output nothing.

Rather than putting the DSD etc into json, then requiring a post-processing step, perhaps we should just create this RDF directly as an output of the preparation step.

Create release for legacy version

Create a release for the 'legacy' version of table2qb which contains an embedded columns.csv specifying the column configuration for ONS.

Setup travis build

Create travis build for running tests, building uberjar and creating releases.

Comments...

test issue for now...

table2qb/example.md

Lines 10 to 30 in 30e4ae5

```
## 1. Load metadata
### 1.1 Create Component
Load the sdmx/ qb vocabularies (which include a definition of `sdmx-dimension:refPeriod`)
Create the "count" measure-property; either via a UI or a pipeline with something like the following csv upload:

    Measure
    count

This would create a property (e.g. `eg:count`), label, and class (for the `rdfs:range` of the property) in a measures ontology.
### 1.2 Create Codelists
Load a codelist for years. Perhaps something like:
```

Remove explicit requires between components

Remove explicit requires between components and use ig/load-namespaces to ensure the components in the config map are loaded instead, e.g. main.clj shouldn’t need to :require [table2qb.core].

Remove private dependencies

Some dependencies currently only exist in the private swirrl repositories. Deploy any required versions to a public repository and update dependencies. Remove the private repository configuration from the build.
