xrm's Introduction

Welcome to the Expressive RDF Mapper (XRM) repository

This is the main code repository for the XRM project.

The XRM editor allows you to write data mappings to RDF in a friendly domain-specific language (DSL) and generates output in the R2RML, RML, CARML and CSV on the Web formats.

Installation and usage

For further information, including installation and usage instructions of XRM, see the user-oriented expressive-rdf-mapper repository.

xrm's People

Contributors

ktk, mchlrch, nicky508, nnamtug, tpluscode

xrm's Issues

Proposals don't properly respect prefix ending with delimiter

Proposals for prefix ex. also contain candidates beginning with exPLZ.

I had already fixed that in the initial prototype, but apparently forgot to copy the respective code when I recreated the repository in the zazuko organisation.

https://github.com/mchlrch/experimental-rmdsl/tree/master/com.zazuko.experimental.rmdsl.parent/com.zazuko.experimental.rmdsl.ui

./src/com/zazuko/experimental/rmdsl/ui/RdfMappingUiModule.xtend
./src/com/zazuko/experimental/rmdsl/ui/contentassist/RdfMappingPrefixMatcher.xtend

Interaction with #12
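
The intended matching behaviour can be sketched as follows. This is a minimal illustration in Python, not the code from RdfMappingPrefixMatcher.xtend; the function name `matches_prefix` and the case-insensitive comparison are assumptions made for the example. Once the typed prefix contains the delimiter, the qualifier has to match exactly and only the last segment is prefix-matched.

```python
def matches_prefix(prefix: str, candidate: str, delimiter: str = ".") -> bool:
    """Decide whether `candidate` should be proposed for the typed `prefix`.

    Hypothetical helper: the comparison is case-insensitive by assumption.
    """
    if delimiter not in prefix:
        # No qualifier typed yet: plain prefix match on the whole name.
        return candidate.lower().startswith(prefix.lower())

    # Qualifier is complete: it must match exactly, so "ex." never
    # proposes names that live under "exPLZ".
    qualifier, _, name_prefix = prefix.rpartition(delimiter)
    cand_qualifier, _, cand_name = candidate.rpartition(delimiter)
    return (
        cand_qualifier.lower() == qualifier.lower()
        and cand_name.lower().startswith(name_prefix.lower())
    )
```

With this rule, typing `ex` still proposes both `ex.plz` and `exPLZ.code`, while typing `ex.` narrows the proposals to the `ex` vocabulary only.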

DatatypeDefinitions should be named

DatatypeDefinitions should be named, same as vocabularies

ex.plz from GDE_PLZ_PA with datatype xsd.int

Maybe datatypes should just be declared inside vocabularies as well, instead of having separate DatatypeDefinitions. Keeping them separate for now, for selective vocabulary treatment.

Eclipse Update Site

The "New Xtext Project" wizard created a couple of modules for providing an update site. Need to have a look and check what's missing.

com.zazuko.rdfmapping.dsl.feature
com.zazuko.rdfmapping.dsl.repository

Also, we need to document installation instructions for users.

Another thing to consider is that we will likely have breaking changes in the DSL sooner or later. How can we enable the user to choose when to update? Do we need to maintain multiple versions on the update site?

Sometimes output files don't get re-generated on save

This happens, for example, when changes are made in files that contain only logical-source definitions, such as changing the value of the Referenceable "stop" in runtime-EclipseXtext/airport-mapping/airport-sources.xrm

From stop "stop":

logical-source airport {
	type csv
	source "http://www.example.com/Airport.csv"
	
	referenceables
		id
		stop "stop"
		latitude
		longitude null "N/A"
		ownership "airport öwnership" null "X"
}

To stop "flop":

logical-source airport {
	type csv
	source "http://www.example.com/Airport.csv"
	
	referenceables
		id
		stop "flop"
		latitude
		longitude null "N/A"
		ownership "airport öwnership" null "X"
}

This could be caused by how the incremental builder works.

Updatesite: Include integration-tests in Gitlab CI

Building the updatesite in gitlab CI.

Current situation: only mvn package is run, which executes the unit tests but no integration tests.

Goal: run mvn integration-test

Xtext integration-tests launch an OSGi runtime (an Eclipse) and execute the project's tests in that runtime (for details, see tycho-surefire:test).

To get this working requires an appropriate docker image that includes xvfb (an in-memory display server) or similar.

Some pointers:

Create CSV on the Web JSON output

Include an additional generator into the existing codebase that generates CSV on the Web JSON output.

To start, the generator should filter Mappings on map.source.type.name == 'csv', so CSV on the Web output only includes mappings from CSV sources, even if there are other mappings and sources present in the mapping project. I'm assuming here that we have source-types { csv referenceFormulation "ql:CSV" } defined in the mapping project. Later on, we will probably extend the DSL and adjust the filter criteria to make this more explicit than string matching.
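
A minimal sketch of that filter, written in Python for illustration only. The classes `SourceType`, `LogicalSource` and `Mapping` are simplified stand-ins invented for this example, not the EMF model classes of the project; only the attribute path source.type.name is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class SourceType:
    name: str


@dataclass
class LogicalSource:
    name: str
    type: SourceType


@dataclass
class Mapping:
    name: str
    source: LogicalSource


def csvw_mappings(mappings: list[Mapping]) -> list[Mapping]:
    """Keep only mappings whose source type is named 'csv' (plain string match)."""
    return [m for m in mappings if m.source.type.name == "csv"]
```

Mappings on any other source type are silently skipped, so a mixed project still yields CSV on the Web output for its CSV sources.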

Samples for the output: https://github.com/zazuko/blv-tierseuchen-ld/tree/master/metadata

I suggest the following approach:

  1. First only add a generator, without extending the DSL. Generate as much as we can based on the information that the DSL can currently describe.
  2. Identify the gaps and figure out where the additionally necessary information fits best into the DSL. Some of the extensions will probably also be useful for the other output formats.
  3. Extend the DSL as necessary
  4. Extend the generators

[csv] Code-Assist for columns

Allow for autocomplete with correct columns from CSV header.

For this, we can build a standalone CLI tool that reads the CSV header, generates the appropriate EMF object model for a LogicalSource and then serializes it into DSL text.

xrm-cli -extract-sources -csv mytable1.csv mytable2.csv > foobar-sources.xrm
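
The header-to-DSL step could look roughly like the Python sketch below. It skips the EMF object model and writes the DSL text directly, so it only illustrates the idea, not the proposed xrm-cli. The function names and the identifier rule (letters, digits and underscore, not starting with a digit) are assumptions; name collisions after sanitizing and quotes inside header names are not handled.

```python
import csv
import io
import re


def to_identifier(header: str) -> str:
    """Turn a CSV header into an identifier (assumed rule: [A-Za-z0-9_] only)."""
    ident = re.sub(r"[^A-Za-z0-9_]", "_", header)
    if not ident or ident[0].isdigit():
        ident = "_" + ident
    return ident


def logical_source_from_csv(name: str, source: str, csv_text: str) -> str:
    """Read the first CSV row as header and emit a logical-source block."""
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = [
        f"logical-source {name} {{",
        "\ttype csv",
        f'\tsource "{source}"',
        "",
        "\treferenceables",
    ]
    for column in header:
        ident = to_identifier(column)
        if ident == column:
            lines.append(f"\t\t{ident}")
        else:
            # Keep the original header as the quoted reference value.
            lines.append(f'\t\t{ident} "{column}"')
    lines.append("}")
    return "\n".join(lines)
```

Headers that are already valid identifiers are emitted bare; anything else gets a sanitized identifier followed by the original header in quotes.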

Use shapes as a blueprint for mapping

If one or many shapes could be declared for a map, then proposals for RdfClass/RdfProperty/Datatype inside the map could be based on the content of those shapes. E.g. proposals would only contain properties that are defined in one of the shapes.

map AirportMapping from airport {
	fit AirportShape

	subject template "http://airport.example.com/{0}" with id;

	types
		transit:Stop

	properties
		transit:route from stop with datatype xsd:int;
		...
}

Housekeeping LINE_END ruleCalls

Proposal from @nnamtug

Proposal: Grammar changes, do housekeeping

Due to issue 12 (Syntax involving prefixes with ':' instead of '.'), existing DSL files must be migrated. This is a good opportunity to clean up the grammar. Terminal rule calls such as 'LINE_END' (which was ';' before issue 10 (formatter)) should not be declared as optional:

  • make LINE_END ruleCall mandatory
  • on section 'types' remove unneeded LINE_END, since it only appears on the last line and is therefore awkward. example:
    types
    bdb:Sector
    skos:Concept;

Autocomplete to rdf-vocabularies based ontologies

Support ontologies from rdf-vocabularies out-of-the-box.

One approach to get this working is to build up the object model for the vocabularies and then serialize [1] the object model to rdf-mapping (.xrm) files. The serialized files should end up in a dedicated project rdf-mapping-dsl-vocabularies publicly shared on GitHub. Mapping users can then clone that project and add it as a dependency to their mapping project.

This approach doesn't require deep integration and no service calls are required at mapping-time. Mapping users can also easily curate their own set of vocabularies, by copy/pasting vocabularies to their own project. It would also allow us to deal with incompatible grammar versions if needed, by having a branch per major version in rdf-mapping-dsl-vocabularies.

1: https://mleduc.xyz/xtext/emf/2018/02/28/xtext-serialization.html

support carml extensions

carml has some useful proprietary features that we would like to expose in the DSL:

  • carml:Stream
  • carml:multiReference

In addition to the existing standard RML output, this adds another RML output flavor.

Handle duplicate refs in SubjectTypeMapping

Proposal from @nnamtug

Proposal: Handling of duplicated refs
Wherever it is possible to refer to the same element more than once, think about:

  • not completing already referenced elements
  • if applicable: validate duplicated references.
    this applies for 'types' and 'properties' section. example:

This makes sense for SubjectTypeMapping, as duplicate assignment makes no sense (and has no effect in the end). So, for type:

  • not propose/complete already referenced elements
  • validate duplicate refs with WARNING
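
Both rules for the types section can be sketched in a few lines of Python. This is an illustration only; the function names and the message text are made up for the example and do not come from the Xtext validator.

```python
from collections import Counter


def duplicate_type_warnings(types: list[str]) -> list[str]:
    """Return one WARNING per type that is referenced more than once."""
    counts = Counter(types)
    return [
        f"WARNING: type '{name}' is referenced {count} times"
        for name, count in counts.items()
        if count > 1
    ]


def type_proposals(all_types: list[str], already_referenced: list[str]) -> list[str]:
    """Propose only types that are not referenced in the mapping yet."""
    referenced = set(already_referenced)
    return [name for name in all_types if name not in referenced]
```

The check is deliberately limited to types; applying it to properties would flag legitimate repeated assignments.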

Properties on the other hand can be referenced multiple times. For example in permit-mapping.xrm we have multiple assignments for the different languages:

properties
	schema:name from AGENCY_NAME_DE with language-tag de;
	schema:name from AGENCY_NAME_FR with language-tag fr;
	schema:name from AGENCY_NAME_EN with language-tag en;
	schema:name from AGENCY_NAME_IT with language-tag it;

CSV: Headers with spaces and umlauts do not work

It looks like the source specification for CSV files is quite strict: I can neither use quotes for headers with spaces nor reference headers that use umlauts. So Länder will not be accepted and neither is "Something Withspace".

In reality both will happen, so they should be supported in the DSL.

Syntax involving prefixes with ':' instead of '.'

Currently, the default qualified-name mechanism of Xtext is used, with . as the delimiter.

To look as expected for people with RDF experience, the syntax involving prefixes should be like types skos:Concept instead of types skos.Concept.

Support for using TemplateValuedTerm for literals (rr:termType)

By default, rr:template generates IRIs [1]. If a literal is to be created instead, then the term type [2] has to be set.

1: https://www.w3.org/TR/r2rml/#from-template
2: https://www.w3.org/TR/r2rml/#termtype

DSL: Currently, the DSL doesn't support choosing the termType.

Generator: Currently, the generator doesn't write rr:termType, so the result relies on the default behavior described in the spec [2].

Example R2RML output for producing a literal based on a template:

rr:predicateObjectMap [
	rr:predicate skos:notation ;
	rr:objectMap [
		rr:template "{CODE}" ;
		rr:termType rr:Literal;
	];
]

Template re-use, templates as top-level elements

Avoid duplication of IRI templates when multiple mappings involve the same resource.

Modify the grammar to allow LinkedResourceTerm for subjectIriMapping as well.
Current grammar: subjectIriMapping=TemplateValuedTerm
This also needs a validation rule to detect circular dependencies between mappings

Introduce templates as top-level elements that can be referenced from within mappings for re-use. Using inline templates is still possible, pulling-out templates for re-use is optional.

Example for template definition and re-use:

output rml

template airportIri "http://airport.example.com/{0}"

map AirportMapping from airport {
	subject template airportIri with id;

	properties
		wgs84_pos:lat from latitude;
		wgs84_pos:long from longitude;
}

map AirportOwnership from airportowners {
	subject template airportIri with id;

	properties
		ex:owner from ownership;
}


map AirlineAtAirport from airlineairport {
	subject template "http://airline.example.com/{0}" with id;

	properties
		ex:airportServed template airportIri with airportId;
}

This obsoletes the LinkedResourceTerm, that can be removed from the grammar.

Harmonize grammar style

Remove the curly braces for referenceables, inside LogicalSource:

referenceables
	id "id"
	stop "stop"
	latitude "latitude"
	longitude "longitude"

Support for creating JSON-LD context

While RML also supports mapping from JSON by using JSON selectors, we mostly use a JSON-LD context for that. The big benefit is that a JSON-LD parser is all we need, and that's pretty common by now.

Example, given this input:

{
   "version":"1.0",
   "timestamp":1548925887242,
   "eventType":"INGESTION_LOAD",
   "source":"EE_DE",
   "sourceView":"Party",
   "record":{
      "dateOfBirth":"29APR1940:00:00:00",
      "naturalPersonId":"f0096f7e-a423-11e0-8142-530f6a1c02e3",
      "householdRole":"Vorstand",
      "firstName":"Ursula",
      "id":"f0095fd4-a423-11e0-8142-530f6a1c02e3",
      "maritalStatus":"unbekannt",
      "flagDeceased":"N",
      "personName":"Krumpl",
      "gender":"weiblich",
      "householdId":"f0096f4c-a423-11e0-8142-530sdfsdf02e3"
   }
}

With the following JSON-LD context we would already get useful RDF out of it:

{
   "@context":{
      "@vocab":"http://ontologies.example.org/core/",
      "@base":"http://data.example.org/id/party/",
      "dateOfBirth":"http://schema.org/birthDate",
      "personName":"hasLastName",
      "firstName":"hasFirstName",
      "gender":"definesGenderOf",
      "naturalPersonId":"@id"
   }
... data 
}

Note that the only thing missing in this example is the class. Classes seem to sit at a strange position in JSON-LD: the type would be part of the ... data part:

  "@type": "NaturalPerson"

[rdb] DB metadata extractor

Inspecting DB metadata in order to reduce the work for the user to define SourceGroup/LogicalSource manually.

  • Must: CLI - because of firewall issues in corporate networks, it's not always possible to connect directly from workbench on developer machine
  • Should: Eclipse Integration as alternative interface to the CLI

Approach: Read out DB metadata, build AST model and serialize to the DSL.

Goal: A standalone CLI tool that reads out DB metadata, generates the appropriate EMF object model for a SourceGroup and then serializes into DSL text.

xrm-cli -extract-sources -rdb db.properties > foobar-sources.xrm

For db.properties, we re-use Stardog's properties for mappings https://www.stardog.com/docs/#_available_properties:

jdbc.*
sql.schemas
default.mapping.include.tables
default.mapping.exclude.tables

When serializing to the DSL, it has to be considered that the naming rules for identifiers are strict. See Handling invalid identifiers for how this is handled.

code assist has issues

I noticed a couple of issues with code assist at the moment.

Invoking code-assist in an empty .xrm file proposes keywords that shouldn't be shown, for example the attributes of the CSV dialect, like commentPrefix.

Invoking code-assist in the PredicateObjectMapping, after the property, shows properties as proposals.

Validate that templates are satisfied

Add validation that checks whether the correct number of referenceables is declared.

subject template "http://city.example.com/{0}/{1}/{2}" with continent;
should lead to validation WARNING
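
A minimal Python sketch of such a check, assuming that placeholders are always of the form {0}, {1}, ... and that the number of expected referenceables is the highest index plus one. The function name and message wording are invented for the example.

```python
import re

PLACEHOLDER = re.compile(r"\{(\d+)\}")


def template_warnings(template: str, referenceables: list[str]) -> list[str]:
    """Warn when the number of referenceables does not match the template."""
    indexes = {int(index) for index in PLACEHOLDER.findall(template)}
    expected = max(indexes) + 1 if indexes else 0
    if expected != len(referenceables):
        return [
            f"WARNING: template expects {expected} referenceable(s), "
            f"but {len(referenceables)} declared"
        ]
    return []
```

For the example above, the template uses {0}, {1} and {2} while only continent is supplied, so exactly one warning is produced.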

Model samples for development

In this repo ~/git/rdf-mapping-dsl/runtime-EclipseXtext

  • copy samples from rdf-mapping-dsl-user
  • samples from seco-bdb-data
  • elcom
  • ...

Validation rules for LogicalSource#foo

Some values can be defined on either LogicalSource or SourceGroup level.

ModelAccess: sourceResolved(), typeResolved()

  • ERROR if not defined
  • WARNING if shadowed

csvw: Support null property

Support null property as optional attribute of Referenceable (inside LogicalSource)

For null property see http://w3c.github.io/csvw/primer/#mixed-datatypes

Sample DSL snippet:

logical-source airport {
	type csv
	source "http://www.example.com/Airport.csv"
	
	referenceables
		id
		ownership "airport ownership" null "X"
		ownership null "X"
}

To clarify:

  • Is there an equivalent of null property in RML or R2RML? If not, then content-assist and validation should vary depending on the source type

csvw: Allow assignment of dialect on SourceGroup level

Currently, dialect can only be assigned on LogicalSource. It should be possible to assign dialect also on SourceGroup level. If multiple CSV files in a project share the same dialect, it would make sense to put them in a group.

Resolution of dialect should work hierarchically, the same way as for type and source:
https://github.com/zazuko/rdf-mapping-dsl/blob/master/com.zazuko.rdfmapping.dsl.parent/com.zazuko.rdfmapping.dsl/src/com/zazuko/rdfmapping/dsl/generator/ModelAccess.xtend#L19

xpath support

  • XPath functions for where it's needed (split-equivalent)
  • XPath completion when possible (create an index of all paths)

Goal: A standalone CLI tool that reads out XML paths, generates the appropriate EMF object model for a LogicalSource and then serializes into DSL text.

xrm-cli -extract-sources -xml foobar.xml > foobar-sources.xrm

Support rml:iterator

XML and JSON mappings usually require an rml:iterator to define the repetition pattern.

The DSL should allow optional definition of iterator in the logical-source:

logical-source airport {
	type xml
	source "http://www.example.com/Airport.xml"
	iterator "/Airports/Airport"
	
	referenceables
		...
}

Validating the generated mapping files

We don't have automated tests yet. We only have the generated R2RML and RML for the sample mappings in version control, where we would notice if something in the generator accidentally changes/breaks.

We should verify:

  • that the output is valid turtle
  • that the output is valid R2RML / RML / ...

Provide quickfix for missing classes and properties

For example with the following mapping (based on mapping-examples/airport-mapping)

map AirportMapping from airport {
	subject template "http://airport.example.com/{0}" with id
	
	types transit.Stop
	
	properties
	ex.bla
}

We get a linking error in the mapping definition: couldn't resolve reference to RdfProperty 'ex.bla'.
This is because neither vocabulary ex nor property bla is defined.

Ideally, we could provide a quickfix

  • ... that would add a new (property|class) to an already existing vocabulary (eg. for transit.Foo)
  • ... that would add a new vocabulary and (property|class) otherwise (eg. for ex.bla)

Validation on duplicate mapping names

At this moment it is possible to have duplicated mapping names:

map BuildingMapping from parceldata.t_parcelscommplumbing {
	subject template "http://data.monroein.firegraph.store/data/fd/objects/{0}/building/{1}" with pin_18 commdetail;
	types oeg.PhysicalObject bot.Building
	
	properties
		prop.showerUnit from showerunit
}

map BuildingMapping from parceldata.t_parcelcommwalls {
	subject template "http://data.monroein.firegraph.store/data/fd/objects/{0}/building/{1}" with pin_18 commdetail;
	types oeg.PhysicalObject bot.Building
	
	properties
		prop.levels from floornumbe
}

This causes weird errors when trying to execute the mapping. We could solve this by adding a validation rule that checks for duplicate names.
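
The core of such a rule is a single pass over the mapping names. A minimal Python sketch follows; the function name is hypothetical, and collecting the names across all files of a mapping project is left out.

```python
def duplicate_mapping_names(names: list[str]) -> list[str]:
    """Return each mapping name that occurs more than once, in first-seen order."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```

For the example above, the two BuildingMapping declarations would be reported once, and the validator could then mark every occurrence.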

Explicit choice of the output format in the DSL

Currently, the three output formats r2rml, rml and csvw are generated opportunistically. The user cannot choose the desired output.

There is interdependence of the output format with the source and the mapping, for example:

  • csvw output requires csv source
  • rr:termType or rr:parentTriplesMap in r2rml or rml output require a respective specification in the mapping, but are irrelevant for csvw output

The DSL user currently has no guidance and needs to rely on (nonexistent) documentation to handle the variability. Instead, the DSL should support the user with code-assist and validation that are aware of the output format.

Solution proposal:

  • introduce a header to choose the output format: eg. output r2rml
  • generate only the chosen output
  • make code-assist aware/dependent on the output-format
  • emit validation errors on unsupported combinations: eg. output csvw from mappings on xml sources
  • emit validation warnings on superfluous mapping information

output r2rml

map Profession from easygov_professions {
	...
	
	properties		
		skos.notation template "{0}" with CODE as Literal;
	...
}
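
The validation of unsupported combinations could be sketched as below. This Python example only encodes the one constraint stated above (csvw output requires csv sources); the table `SUPPORTED_SOURCE_TYPES`, the function name and the message text are assumptions for illustration.

```python
# Output formats that restrict the allowed source types.
# Only the csvw rule is taken from the text; other formats are unrestricted here.
SUPPORTED_SOURCE_TYPES = {"csvw": {"csv"}}


def validate_output(output_format: str, mappings: list[tuple[str, str]]) -> list[str]:
    """Check (mapping name, source type) pairs against the chosen output format."""
    allowed = SUPPORTED_SOURCE_TYPES.get(output_format)
    if allowed is None:
        return []
    return [
        f"ERROR: output {output_format} does not support mapping "
        f"'{name}' on {source_type} source"
        for name, source_type in mappings
        if source_type not in allowed
    ]
```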

Fix outline provider

See Outline View

Hide information below the Mapping.

Show source-types instead of unnamed for the SourceTypesDefinition.

Show language-tags instead of unnamed for the LanguageTagDefinition.
Hide the Prefix.

Update documentation (csvw, ...)

Update documentation and mapping-examples in the rdf-mapping-dsl-user repository. Prepare the content on branch upcoming.

  • explicit source-types declaration by the user became unnecessary #42
  • csvw: add separate chapter with specifics for CSVW output
    • null property: #39
    • optional dialect specification (see also #10 (comment))
    • ...
  • ConstantValueTerm: airport-mapping.xrm#L15
  • rr:sqlQuery (R2RML views) #29
  • rr:parentTriplesMap #31
  • Add link to the CHANGELOG in the README
  • Mandatory LINE_ENDS in map #56
    • RdfPrefixedName with : #12
  • (some) XML mapping support - issues
  • formatter #10
  • quickfix for missing classes and properties #1 🐙
  • rr:termType #30
  • specify output format #48
  • carml extensions #15 #34
  • DatatypeDefinitions inside vocabulary #66 also fix examples to use xsd:integer
  • template re-use replaces LinkedResourceTerm #2

Parent triple map support

To create nested structures, parent triples maps need to be added to the DSL. Example:

ex:Object rdf:type class:Object ;
prop:lotSize [ 	schema:value 11050.07095540000;
			schema:unitCode 'FTK' ];
