zazuko / xrm

A friendly language for mappings to RDF

License: MIT License

Xtend 29.01% Java 70.86% Shell 0.13%

csvw dsl kg knowledge-graph r2rml rdf rml xtext

xrm's Introduction

Welcome to the Expressive RDF Mapper (XRM) repository

This is the main code repository for the XRM project.

The XRM editor allows you to write data mappings to RDF in a friendly domain-specific language (DSL) and generates output in the R2RML, RML, CARML and CSV on the Web formats.

Installation and usage

For further information, including installation and usage instructions of XRM, see the user-oriented expressive-rdf-mapper repository.

xrm's Issues

Proposals don't properly respect prefix ending with delimiter

Proposals for prefix ex. also contain candidates beginning with exPLZ.

I had already fixed that in the initial prototype, but apparently forgot to copy the respective code when I recreated the repository in the zazuko organisation.

https://github.com/mchlrch/experimental-rmdsl/tree/master/com.zazuko.experimental.rmdsl.parent/com.zazuko.experimental.rmdsl.ui

./src/com/zazuko/experimental/rmdsl/ui/RdfMappingUiModule.xtend
./src/com/zazuko/experimental/rmdsl/ui/contentassist/RdfMappingPrefixMatcher.xtend

Interaction with #12
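
The intended matching behaviour can be sketched as follows. This is a minimal illustration in Python, not the code of RdfMappingPrefixMatcher.xtend; the function name `is_candidate_match` and its parameters are hypothetical. It assumes that once the typed text contains the delimiter, the qualifier in front of it has to match exactly and only the local name is matched by prefix.

```python
def is_candidate_match(candidate: str, typed: str, delimiter: str = ".") -> bool:
    """Return True if `candidate` should be proposed for the text `typed`.

    Hypothetical helper for illustration; not part of the XRM code base.
    """
    if delimiter in typed:
        # The user already typed a complete qualifier such as "ex.":
        # the qualifier must match exactly, the local name by prefix.
        qualifier, _, local = typed.rpartition(delimiter)
        cand_qualifier, sep, cand_local = candidate.rpartition(delimiter)
        return (
            sep == delimiter
            and cand_qualifier.lower() == qualifier.lower()
            and cand_local.lower().startswith(local.lower())
        )
    # No delimiter typed yet: plain case-insensitive prefix match.
    return candidate.lower().startswith(typed.lower())
```

With this rule, typing `ex.` still proposes `ex.plz`, but no longer proposes candidates from a vocabulary whose prefix merely starts with `ex`, such as `exPLZ.foo`.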

DatatypeDefinitions should be named

DatatypeDefinitions should be named, same as vocabularies

ex.plz from GDE_PLZ_PA with datatype xsd.int

~~Maybe datatypes should just be declared inside vocabularies as well, instead of having separate DatatypeDefinitions~~ Keeping them separate for now for selective vocabulary treatment

Typo in grammar: referencables

https://github.com/zazuko/rdf-mapping-dsl/blob/master/com.zazuko.rdfmapping.dsl.parent/com.zazuko.rdfmapping.dsl/src/com/zazuko/rdfmapping/dsl/RdfMapping.xtext#L50

The variable should be renamed to referenceables

Eclipse Update Site

The "New Xtext Project" wizard created a couple of modules for providing an update site. Need to have a look and check what's missing.

com.zazuko.rdfmapping.dsl.feature
com.zazuko.rdfmapping.dsl.repository

Also, we need to document installation instructions for users.

Another thing to consider is that we will likely have breaking changes in the DSL sooner or later. How can we enable the user to choose when to update? Do we need to maintain multiple versions on the update site?

Sometimes output files don't get re-generated on save

For example, when changes are made in files that contain only logical-source definitions, such as changing the value of the Referenceable "stop" in runtime-EclipseXtext/airport-mapping/airport-sources.xrm

From stop "stop":

logical-source airport {
	type csv
	source "http://www.example.com/Airport.csv"
	
	referenceables
		id
		stop "stop"
		latitude
		longitude null "N/A"
		ownership "airport öwnership" null "X"
}

To stop "flop":

logical-source airport {
	type csv
	source "http://www.example.com/Airport.csv"
	
	referenceables
		id
		stop "flop"
		latitude
		longitude null "N/A"
		ownership "airport öwnership" null "X"
}

This could be caused by how the incremental builder works.

Updatesite: Include integration-tests in Gitlab CI

Building the updatesite in gitlab CI.

Current situation: run only mvn package. Runs only unit-tests, no integration-tests

Goal: run mvn integration-test

Xtext integration-tests launch an OSGi runtime (an Eclipse) and execute the project's tests in that runtime (for details, see tycho-surefire:test).

Getting this to work requires a suitable Docker image that includes xvfb (an in-memory display server) or similar.

Some pointers:

Create CSV on the Web JSON output

Include an additional generator into the existing codebase that generates CSV on the Web JSON output.

To start, the generator should filter Mappings on map.source.type.name == 'csv', so CSV on the Web output only includes mappings from CSV sources, even if there are other mappings and sources present in the mapping project. I'm assuming here that we have source-types { csv referenceFormulation "ql:CSV" } defined in the mapping project. Later on, we will probably extend the DSL and adjust the filter criteria to make this more explicit than string matching.

Samples for the output: https://github.com/zazuko/blv-tierseuchen-ld/tree/master/metadata

I suggest the following approach:

First only add a generator, without extending the DSL. Generate as much as we can based on the information that the DSL can currently describe.
Identify the gaps and figure out where the additionally necessary information fits best into the DSL. Some of the extensions will probably also be useful for the other output formats.
Extend the DSL as necessary
Extend the generators

Support carml XML namespace extensions

That's pretty vital in carml; in my experience, one does not get any results without properly adding it to the mapping.

https://github.com/carml/carml#xml-namespace-extension

csvw: Validation of dialect usage against source type

Emit validation warning if dialect is assigned in sources with a type other than CSV. In SourceGroup and LogicalSource.

Something like Dialect declaration has no effect. Not a CSV source.
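
A minimal sketch of such a check, written in Python for illustration only. The `Source` class and the function name are hypothetical, and the sketch assumes the source type has already been resolved for each LogicalSource or SourceGroup.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

WARNING_MESSAGE = "Dialect declaration has no effect. Not a CSV source."


@dataclass
class Source:
    """Hypothetical stand-in for a LogicalSource or SourceGroup."""
    name: str
    type_name: Optional[str]       # resolved source type, e.g. "csv" or "xml"
    dialect: Optional[str] = None  # name of the assigned dialect, if any


def dialect_warnings(sources: List[Source]) -> List[Tuple[str, str]]:
    """Return (source name, message) for every dialect on a non-CSV source."""
    return [
        (s.name, WARNING_MESSAGE)
        for s in sources
        if s.dialect is not None and s.type_name != "csv"
    ]
```

Sources without a dialect and CSV sources with a dialect produce no warning.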

Include template proposals

https://www.eclipse.org/Xtext/documentation/310_eclipse_support.html#templates
https://kthoms.wordpress.com/2012/05/22/xtext-content-assist-filtering-keyword-proposals/

source-group with logical-source
logical-source
language-tags (de en fr it)
datatypes
vocabulary
map

Prettify generated csvw output

The generated code looks odd because the separator ',' is placed at the start of the next line instead of at the end of the line.
(source https://zulip.zazuko.com/#narrow/stream/34-rdf-workbench/topic/rdf-mapping-dsl/near/58668)

"dialect": {
                "delimiter": ","
                ,"commentPrefix": "#"
                ,"doubleQuote": true
                ,"encoding": "utf-8"
                ,"header": true
                ,"headerRowCount": "1"
                ,"lineTerminators": "\r\n"
                ,"quoteChar": "\""
                ,"skipBlankRows": false
                ,"skipInitialSpace": false
                ,"trim": false
            }
// ...
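
One way to get the separator to the end of the line is to render each entry without a separator and join the entries afterwards. The following Python sketch only illustrates that idea; the actual generator is written as Xtend templates, and `render_dialect` is a hypothetical name.

```python
import json


def render_dialect(entries: dict, indent: str = "    ") -> str:
    """Render a "dialect" object with the ',' at the end of each line."""
    # Render every entry on its own line, without any separator ...
    lines = [f"{indent}{json.dumps(key)}: {json.dumps(value)}"
             for key, value in entries.items()]
    # ... then join them, so the comma trails the line it belongs to
    # and the last entry gets no comma at all.
    return '"dialect": {\n' + ",\n".join(lines) + "\n}"


print(render_dialect({"delimiter": ",", "commentPrefix": "#", "header": True}))
```

Joining after rendering also avoids special-casing the first or last entry inside the template.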

Source types as Enum Rules

Represent source types as Xtext Enum Rules, instead of user definition in the model

https://www.eclipse.org/Xtext/documentation/301_grammarlanguage.html

[csv] Code-Assist for columns

Allow for autocomplete with correct columns from CSV header.

For this, we can build a standalone CLI tool that reads the CSV header, generates the appropriate EMF object model for a LogicalSource and then serializes it into DSL text.

xrm-cli -extract-sources -csv mytable1.csv mytable2.csv > foobar-sources.xrm
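
The core of such an extraction could look roughly like the sketch below. It is written in Python for illustration and emits DSL text directly instead of going through an EMF model. The function names are hypothetical, and the rule for deriving an identifier from a header that is not a valid identifier (replace every other character with `_`) is an assumption, not the project's actual rule.

```python
import csv
import io
import re

# Assumed shape of a valid identifier; the real grammar may differ.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def to_identifier(header: str) -> str:
    """Derive an identifier from an arbitrary header (assumed rule)."""
    ident = re.sub(r"[^A-Za-z0-9_]", "_", header)
    if not ident or ident[0].isdigit():
        ident = "_" + ident
    return ident


def logical_source_from_csv(name: str, source: str, csv_text: str) -> str:
    """Build a logical-source block from the first row of a CSV document."""
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = [f"logical-source {name} {{", "\ttype csv",
             f'\tsource "{source}"', "", "\treferenceables"]
    for column in header:
        if IDENTIFIER.fullmatch(column):
            lines.append(f"\t\t{column}")
        else:
            # Keep the original header as the quoted value.
            # Quotes inside the header are not escaped in this sketch.
            lines.append(f'\t\t{to_identifier(column)} "{column}"')
    lines.append("}")
    return "\n".join(lines)


print(logical_source_from_csv(
    "airport", "http://www.example.com/Airport.csv",
    "id,stop,airport öwnership\n1,2,3\n"))
```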

Use shapes as a blueprint for mapping

If one or many shapes could be declared for a map, then proposals for RdfClass/RdfProperty/Datatype inside the map could be based on the content of those shapes. E.g. proposals would only contain properties that are defined in one of the shapes.

map AirportMapping from airport {
	fit AirportShape

	subject template "http://airport.example.com/{0}" with id;

	types
		transit:Stop

	properties
		transit:route from stop with datatype xsd:int;
		...
}

Housekeeping LINE_END ruleCalls

Proposal from @nnamtug

Proposal: Grammar changes, do housekeeping

Due to issue 12 (Syntax involving prefixes with ':' instead of '.'), existing DSL files must be migrated. This is a good opportunity to clean up the grammar. Terminal rules such as 'LINE_END' (which was ';' before issue 10 (formatter)) should not be declared as optional:

make LINE_END ruleCall mandatory

in section 'types', remove the unneeded LINE_END, since it only appears on the last line and is therefore awkward. Example:
types
	bdb:Sector
	skos:Concept;

make LINE_END ruleCall mandatory
in section 'types', remove the unneeded LINE_END

Autocomplete to rdf-vocabularies based ontologies

Support ontologies from rdf-vocabularies out-of-the-box.

One approach to get this working is to build up the object-model for the vocabularies and then serialize [1] the object-model to rdf-mapping (.xrm) files. The serialized files should end up in a dedicated project rdf-mapping-dsl-vocabularies, publicly shared on GitHub. Mapping users can then clone that project and add it as a dependency to their mapping project.

This approach doesn't require deep integration and no service calls are required at mapping-time. Mapping users can also easily curate their own set of vocabularies, by copy/pasting vocabularies to their own project. It would also allow us to deal with incompatible grammar versions if needed, by having a branch per major version in rdf-mapping-dsl-vocabularies.

1: https://mleduc.xyz/xtext/emf/2018/02/28/xtext-serialization.html

support carml extensions

carml has some useful proprietary features that we would like to expose in the DSL:

carml:Stream
carml:multiReference

Additionally to the existing standard RML output, this adds another RML output flavor.

Handle duplicate refs in SubjectTypeMapping

Proposal from @nnamtug

Proposal: Handling of duplicated refs
Wherever it is possible to refer to the same element more than once, think about:

not completing already referenced elements

if applicable: validate duplicated references.
this applies for 'types' and 'properties' section. example:

This makes sense for SubjectTypeMapping, as duplicate assignment makes no sense (and has no effect in the end). So, for type:

not propose/complete already referenced elements
validate duplicate refs with WARNING

Properties on the other hand can be referenced multiple times. For example in permit-mapping.xrm we have multiple assignments for the different languages:

properties
	schema:name from AGENCY_NAME_DE with language-tag de;
	schema:name from AGENCY_NAME_FR with language-tag fr;
	schema:name from AGENCY_NAME_EN with language-tag en;
	schema:name from AGENCY_NAME_IT with language-tag it;
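
For the 'types' section, the two rules could be sketched like this. The Python below is only an illustration of the logic; the function names are hypothetical and the real implementation would live in the Xtext validator and proposal provider.

```python
from typing import List, Tuple


def duplicate_type_warnings(types: List[str]) -> List[Tuple[int, str]]:
    """Return (index, message) for every repeated type reference."""
    seen = set()
    warnings = []
    for index, type_ref in enumerate(types):
        if type_ref in seen:
            # Only the repeated occurrences are flagged, not the first one.
            warnings.append((index, f"Duplicate type reference: {type_ref}"))
        seen.add(type_ref)
    return warnings


def propose_types(available: List[str], referenced: List[str]) -> List[str]:
    """Proposals exclude types that the mapping already references."""
    already = set(referenced)
    return [type_ref for type_ref in available if type_ref not in already]
```

Properties are deliberately not covered by these checks, because the same property may legitimately be assigned several times.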

CSV: Headers with spaces and umlauts do not work

It looks like the source specification for CSV files is quite strict: I can neither use quotes for headers with spaces nor reference headers that use umlauts. So Länder will not be accepted and neither will "Something Withspace".

In reality both will happen, so both should be supported in the DSL.

Implement formatting rules

https://www.eclipse.org/Xtext/documentation/303_runtime_concepts.html#formatting

Syntax involving prefixes with ':' instead of '.'

Currently, the default qualified-name mechanism of Xtext is used, with . as the delimiter.

To look as expected for people with RDF experience, the syntax involving prefixes should be like types skos:Concept instead of types skos.Concept.

Adding support for SQL Queries

At the moment, R2RML output can only refer to a database table name. We would like to add support for the sqlQuery property.

Support for using TemplateValuedTerm for literals (rr:termType)

By default, rr:template generates IRIs [1]. If a literal is to be created instead, then the term type [2] has to be set.

1: https://www.w3.org/TR/r2rml/#from-template
2: https://www.w3.org/TR/r2rml/#termtype

DSL: Currently, the DSL doesn't support choosing the termType.

Generator: Currently, the generator doesn't write rr:termType, so the result relies on the default behavior described in the spec [2].

Example R2RML output for producing a literal based on a template:

rr:predicateObjectMap [
	rr:predicate skos:notation ;
	rr:objectMap [
		rr:template "{CODE}" ;
		rr:termType rr:Literal;
	];
]

Template re-use, templates as top-level elements

Avoid duplication of IRI templates when multiple mappings involve the same resource.

Modify the grammar to allow LinkedResourceTerm for subjectIriMapping as well.
Current grammar: subjectIriMapping=TemplateValuedTerm
This also needs a validation rule to detect circular dependencies between mappings

Introduce templates as top-level elements that can be referenced from within mappings for re-use. Using inline templates is still possible, pulling-out templates for re-use is optional.

Example for template definition and re-use:

output rml

template airportIri "http://airport.example.com/{0}"

map AirportMapping from airport {
	subject template airportIri with id;

	properties
		wgs84_pos:lat from latitude;
		wgs84_pos:long from longitude;
}

map AirportOwnership from airportowners {
	subject template airportIri with id;

	properties
		ex:owner from ownership;
}


map AirlineAtAirport from airlineairport {
	subject template "http://airline.example.com/{0}" with id;

	properties
		ex:airportServed template airportIri with airportId;
}

This obsoletes the LinkedResourceTerm, that can be removed from the grammar.

Harmonize grammar style

Remove the curly braces for referenceables, inside LogicalSource:

referenceables
	id "id"
	stop "stop"
	latitude "latitude"
	longitude "longitude"

Keywords cannot be used within a vocabulary definition

I've imported the whole schema.org namespace as vocabulary {} and got some errors, because a number of its properties have the same name as DSL keywords, so the parser stumbles over them inside the vocabulary.

encoding
map
query

Support for creating JSON-LD context

While RML supports mapping from JSON as well by using JSON selectors we mostly use JSON-LD context for that. The big benefit is that a JSON-LD parser is all we need and that's pretty common by now.

Example, given this input:

{
   "version":"1.0",
   "timestamp":1548925887242,
   "eventType":"INGESTION_LOAD",
   "source":"EE_DE",
   "sourceView":"Party",
   "record":{
      "dateOfBirth":"29APR1940:00:00:00",
      "naturalPersonId":"f0096f7e-a423-11e0-8142-530f6a1c02e3",
      "householdRole":"Vorstand",
      "firstName":"Ursula",
      "id":"f0095fd4-a423-11e0-8142-530f6a1c02e3",
      "maritalStatus":"unbekannt",
      "flagDeceased":"N",
      "personName":"Krumpl",
      "gender":"weiblich",
      "householdId":"f0096f4c-a423-11e0-8142-530sdfsdf02e3"
   }
}

With the following JSON-LD context we would already get useful RDF out of it:

{
   "@context":{
      "@vocab":"http://ontologies.example.org/core/",
      "@base":"http://data.example.org/id/party/",
      "dateOfBirth":"http://schema.org/birthDate",
      "personName":"hasLastName",
      "firstName":"hasFirstName",
      "gender":"definesGenderOf",
      "naturalPersonId":"@id"
   }
... data 
}

Note that the only thing missing in this example is the classes. They sit at a strange position in JSON-LD: the type would be part of the ... data part:

  "@type": "NaturalPerson"

[rdb] DB metadata extractor

Inspecting DB metadata in order to reduce the work for the user to define SourceGroup/LogicalSource manually.

Must: CLI - because of firewall issues in corporate networks, it's not always possible to connect directly from workbench on developer machine
Should: Eclipse Integration as alternative interface to the CLI

Approach: Read out DB metadata, build AST model and serialize to the DSL.

Goal: A standalone CLI tool that reads out DB metadata, generates the appropriate EMF object model for a SourceGroup and then serializes into DSL text.

xrm-cli -extract-sources -rdb db.properties > foobar-sources.xrm

For db.properties, we re-use Stardog's properties for mappings https://www.stardog.com/docs/#_available_properties:

jdbc.*
sql.schemas
default.mapping.include.tables
default.mapping.exclude.tables

For serializing to the DSL, it has to be considered that naming rules for the identifiers are strict. See Handling invalid identifiers for how this is handled.

Validate that prefix declarations are unique

code assist has issues

I noticed a couple of issues with code assist ATM.

Invoking code-assist in empty .xrm file leads to keywords proposed that shouldn't be shown, for example the attributes of CSV dialect, like commentPrefix:

Invoking code-assist in the PredicateObjectMapping, after the property shows properties as proposals:

Validate that templates are satisfied

Add validation that checks whether the correct number of referenceables is declared.

subject template "http://city.example.com/{0}/{1}/{2}" with continent;
should lead to validation WARNING
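
The check can be sketched as a comparison between the placeholder indices in the template and the number of referenceables after `with`. This Python sketch is an illustration only; the function name is hypothetical and it assumes placeholders always have the form `{n}` with a zero-based index.

```python
import re
from typing import List


def unsatisfied_placeholders(template: str, referenceables: List[str]) -> List[int]:
    """Return the placeholder indices that have no matching referenceable."""
    indices = {int(number) for number in re.findall(r"\{(\d+)\}", template)}
    return sorted(index for index in indices if index >= len(referenceables))


missing = unsatisfied_placeholders(
    "http://city.example.com/{0}/{1}/{2}", ["continent"])
print(missing)  # → [1, 2]
```

A non-empty result would translate into the validation WARNING on the template.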

csvw: only propose dialect for CSV sources (code-assist)

Code assist should only propose dialect if type of source is CSV. In LogicalSource and SourceGroup.

(x) marks the spot where code assist is invoked:

logical-source EMPLOYEE {
	type csv
	source "EMP"
	(x)

Model samples for development

In this repo ~/git/rdf-mapping-dsl/runtime-EclipseXtext

copy samples from rdf-mapping-dsl-user
samples from seco-bdb-data
elcom
...

Validation rules for LogicalSource#foo

Some values can be defined on either LogicalSource or SourceGroup level.

ModelAccess: sourceResolved(), typeResolved()

ERROR if not defined
WARNING if shadowed
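
A sketch of the resolution and the two validation outcomes, in Python for illustration. The function name is hypothetical, and the sketch assumes that "shadowed" means a value defined on the LogicalSource while the enclosing SourceGroup defines the same value.

```python
from typing import List, Optional, Tuple

Issue = Tuple[str, str]  # (severity, message)


def resolve(name: str,
            on_logical_source: Optional[str],
            on_source_group: Optional[str]) -> Tuple[Optional[str], List[Issue]]:
    """Resolve a value hierarchically and collect validation issues."""
    if on_logical_source is None and on_source_group is None:
        return None, [("ERROR", f"{name} is not defined")]
    if on_logical_source is not None and on_source_group is not None:
        # The LogicalSource value wins and shadows the group value.
        return on_logical_source, [
            ("WARNING", f"{name} on the source group is shadowed")]
    value = on_logical_source if on_logical_source is not None else on_source_group
    return value, []
```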

csvw: Support null property

Support null property as optional attribute of Referenceable (inside LogicalSource)

For null property see http://w3c.github.io/csvw/primer/#mixed-datatypes

Sample DSL snippet:

logical-source airport {
	type csv
	source "http://www.example.com/Airport.csv"
	
	referenceables
		id
		ownership "airport ownership" null "X"
		ownership null "X"
}

To clarify:

Is there an equivalent of null property in RML or R2RML? If not, then content-assist and validation should vary depending on the source type

csvw: Allow assignment of dialect on SourceGroup level

Currently, dialect can only be assigned on LogicalSource. It should be possible to assign dialect also on SourceGroup level. If multiple CSV files in a project share the same dialect, it would make sense to put them in a group.

Resolution of dialect should work hierarchically, the same way as for type and source:
https://github.com/zazuko/rdf-mapping-dsl/blob/master/com.zazuko.rdfmapping.dsl.parent/com.zazuko.rdfmapping.dsl/src/com/zazuko/rdfmapping/dsl/generator/ModelAccess.xtend#L19

xpath support

XPath functions for where it's needed (split-equivalent)
XPath completion when possible (create an index of all paths)

Goal: A standalone CLI tool that reads out XML paths, generates the appropriate EMF object model for a LogicalSource and then serializes into DSL text.

xrm-cli -extract-sources -xml foobar.xml > foobar-sources.xrm

Support rml:iterator

XML and JSON mappings usually require an rml:iterator to define the repetition pattern.

The DSL should allow optional definition of iterator in the logical-source:

logical-source airport {
	type xml
	source "http://www.example.com/Airport.xml"
	iterator "/Airports/Airport"
	
	referenceables
		...
}

Validating the generated mapping files

We don't have automated tests yet. We only have the generated R2RML and RML for the sample mappings in version control, where we would notice if something in the generator accidentally changes/breaks.

We should verify:

that the output is valid turtle
that the output is valid R2RML / RML / ...

Handle ID incompatible vocabulary content

Add optional value to RdfProperty, RdfClass, Datatype

Same as already done with Referenceable

Provide quickfix for missing classes and properties

For example with the following mapping (based on mapping-examples/airport-mapping)

map AirportMapping from airport {
	subject template "http://airport.example.com/{0}" with id
	
	types transit.Stop
	
	properties
	ex.bla
}

We get a linking error in the mapping definition: couldn't resolve reference to RdfProperty 'ex.bla'
Because neither vocabulary ex nor property bla is defined.

Ideally, we could provide a quickfix

... that would add a new (property|class) to an already existing vocabulary (eg. for transit.Foo)
... that would add a new vocabulary and (property|class) otherwise (eg. for ex.bla)

csvw: deterministic order of tableSchema and columns in generated output

To be able to effectively diff versions of generated CSVW JSON, the tableSchemas and columns need to be written in a deterministic order.

Validation on duplicate mapping names

At this moment it is possible to have duplicated mapping names:

map BuildingMapping from parceldata.t_parcelscommplumbing {
	subject template "http://data.monroein.firegraph.store/data/fd/objects/{0}/building/{1}" with pin_18 commdetail;
	types oeg.PhysicalObject bot.Building
	
	properties
		prop.showerUnit from showerunit
}

map BuildingMapping from parceldata.t_parcelcommwalls {
	subject template "http://data.monroein.firegraph.store/data/fd/objects/{0}/building/{1}" with pin_18 commdetail;
	types oeg.PhysicalObject bot.Building
	
	properties
		prop.levels from floornumbe
}

This causes weird errors when trying to execute the mapping. We could solve this by adding a validation rule that checks for duplicate names.
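
The rule itself amounts to counting names. A minimal Python sketch of that logic, with a hypothetical function name, could look like this:

```python
from collections import Counter
from typing import List


def duplicate_mapping_names(names: List[str]) -> List[str]:
    """Return every mapping name that is declared more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


print(duplicate_mapping_names(
    ["BuildingMapping", "BuildingMapping", "AirportMapping"]))
# → ['BuildingMapping']
```

In the validator, each declaration carrying one of the returned names would be marked.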

Explicit choice of the output format in the DSL

Currently, the three output formats r2rml, rml and csvw are generated opportunistically. The user cannot choose the desired output.

There is interdependence of the output format with the source and the mapping, for example:

csvw output requires csv source
rr:termType or rr:parentTriplesMap in r2rml or rml output require a corresponding specification in the mapping, which is irrelevant for csvw output

The DSL user currently has no guidance and needs to rely on (nonexistent) documentation to handle the variability. Instead, the DSL should support the user with code-assist and validation that are aware of the output format.

Solution proposal:

introduce a header to choose the output format: eg. output r2rml
generate only the chosen output
make code-assist aware/dependent on the output-format
emit validation errors on unsupported combinations: eg. output csvw from mappings on xml sources
emit validation warnings on superfluous mapping information

output r2rml

map Profession from easygov_professions {
	...
	
	properties		
		skos.notation template "{0}" with CODE as Literal;
	...
}
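
The validation of unsupported combinations could be driven by a small table of requirements per output format. The Python sketch below is an illustration under assumptions: the names are hypothetical, and it encodes only the one rule stated above (csvw output requires a csv source).

```python
from typing import Dict, List, Tuple

# Only the rule stated in the proposal is encoded here.
REQUIRED_SOURCE_TYPE: Dict[str, str] = {"csvw": "csv"}


def output_format_errors(output: str,
                         mappings: List[Tuple[str, str]]) -> List[str]:
    """Return an error for every mapping whose source type does not fit.

    `mappings` is a list of (mapping name, source type) pairs.
    """
    required = REQUIRED_SOURCE_TYPE.get(output)
    if required is None:
        return []  # no source-type restriction known for this output
    return [
        f"output {output} is not supported for mapping {name} "
        f"on {source_type} sources"
        for name, source_type in mappings
        if source_type != required
    ]
```

Warnings about superfluous mapping information could follow the same pattern with a second table.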

Simplify generator templates by introducing a context

Preprocessing the model before generation and building up a generator context makes it possible to simplify the generator templates, especially in places where Xtend functions are currently used to figure out the context ad hoc (for example the conditional in https://github.com/zazuko/rdf-mapping-dsl/blob/master/com.zazuko.rdfmapping.dsl.parent/com.zazuko.rdfmapping.dsl/src/com/zazuko/rdfmapping/dsl/generator/CsvwDialectGenerator.xtend#L82).

Fix outline provider

See Outline View

Hide information below the Mapping:

Show source-types instead of unnamed for the SourceTypesDefinition:

Show language-tags instead of unnamed for the LanguageTagDefinition.
Hide the Prefix:

Update documentation (csvw, ...)

Update documentation and mapping-examples in the rdf-mapping-dsl-user repository. Prepare the content on branch upcoming

Include descriptions from vocabularies in code-assist

Provide "tooltip" for properties and classes in the code-assist, show description from vocabularies.

See Label Provider

This also requires expanding the grammar in order to include the descriptions in the vocabularies.

Parent triple map support

To create nested structures, parent triples maps need to be added to the DSL. Example:

ex:Object rdf:type class:Object ;
prop:lotSize [ 	schema:value 11050.07095540000;
			schema:unitCode 'FTK' ];

Support for R2RML views (conditional mappings)

Supporting R2RML views would help in dealing with conditional mappings, cases where the mapping rules depend on conditions in the source data (eg. if a flag is set or not, etc.)

Would be interesting to try rr:sqlQuery in SD virtual graphs