ehri-rest's Introduction

The EHRI Data Backend

A business layer and JAX-RS resource classes for managing EHRI data.

Integrates with the Neo4j graph database via a server plugin.

The raison d'être of the EHRI web service backend is to make the job of the front-end easier by performing the following functions:

  • serialising and deserialising domain-specific object graphs
  • handling cascade-delete scenarios for objects that are dependent on one another
  • calculating and enforcing access control and action-based permissions on both individual items and item-classes in two hierarchical dimensions: user/group roles, and parent-child scopes
  • maintaining an audit log of all data-mutating actions, with support for idempotent updates
  • providing a GraphQL API
  • providing an OAI-PMH repository

For documentation (a work-in-progress, but better than nothing), see the project docs.

For getting up and running quickly, Docker is the recommended approach. A local server listening on port 7474, with an administrative user account "mike", can be started with the following command:

sudo docker run --publish 7474:7474 --env ADMIN_USER=mike -it ehri/ehri-rest

ehri-rest's Issues

Clean up the ACL manager class

It's quite scary: a nightmare of hard-to-reason-about, imperative gunk. At least a fair number of tests should help preserve the existing behaviour.

Clean up code for release

I think it would be nice to be clear on and consistent in:

  • javadoc for at least each class
  • javadoc for each package?
  • consistent references to authors
  • include a licence statement in each file

There are files that start with a comment "to change this template, ...". That should go, of course :)

What do you think?

Creating multiple access points in parallel fails...

Inserting a new action at the head of the event queue consistently causes a relationship not found error when the current head-to-event link is deleted:

 Relationship 1638 not found
Caused by: org.neo4j.graphdb.NotFoundException: Relationship 1638 not found
    at org.neo4j.kernel.impl.core.NodeManager.getRelationshipForProxy(NodeManager.java:675)
    at org.neo4j.kernel.InternalAbstractGraphDatabase$7.lookupRelationship(InternalAbstractGraphDatabase.java:783)
    at org.neo4j.kernel.impl.core.RelationshipProxy.delete(RelationshipProxy.java:62)
    at com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph.removeEdge(Neo4jGraph.java:490)
    at com.tinkerpop.frames.FramedGraph.removeEdge(FramedGraph.java:357)
    at eu.ehri.project.persistence.ActionManager.replaceAtHead(ActionManager.java:429)
    at eu.ehri.project.persistence.ActionManager.logEvent(ActionManager.java:317)
    at eu.ehri.project.persistence.ActionManager.logEvent(ActionManager.java:363)
    at eu.ehri.project.views.DescriptionViews.create(DescriptionViews.java:58)
    at eu.ehri.extension.DescriptionResource.createAccessPoint(DescriptionResource.java:139)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

This doesn't appear to be an artefact of Blueprints TX handling, since even when a TX is explicitly started it still occurs.

Make references to SomeClass<T> parametrised or remove <T>

Eclipse warns about references to raw types because javac warns about them, and javac warns about them because references to raw types cause unchecked conversions or unchecked assignments. At compile time it's impossible to say whether items added to, for instance, a raw List are of the type they are cast to when they come out.
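
A minimal illustration of the problem (not project code):

    import java.util.ArrayList;
    import java.util.List;

    public class RawTypeDemo {
        @SuppressWarnings({"unchecked", "rawtypes"})
        public static void main(String[] args) {
            List raw = new ArrayList();     // raw type: javac emits a rawtypes warning here
            raw.add("a string");
            raw.add(42);                    // nothing stops mixing element types

            List<String> strings = raw;     // unchecked assignment: compiles with a warning...
            String s = strings.get(1);      // ...but fails at runtime with a ClassCastException
            System.out.println(s);
        }
    }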

The three types that generate the most warnings are:

  • AbstractImporter
  • List
  • GraphReindexer

I'm not sure what the function of the type parameter in AbstractImporter is. Maybe we should remove it?

Getting rid of raw type Lists is harder, because of the way we construct graph surrogates that are fed to the Importers.

Store date precision with the dates on import

For indexing and searching, we convert partial dates (i.e. YYYY-MM or just YYYY) to full dates. We need to allow just a year or year+month (done) and store the precision.

EHRI/ehri-frontend#128 will allow free text (e.g. "Spring 1943") for display, but it still needs a date to some precision. Both must be supported: display text and a machine-actionable format.
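
A minimal sketch of the expansion plus a stored precision flag, using java.time (the precision markers here are hypothetical, not an existing property format):

    import java.time.LocalDate;
    import java.time.Year;
    import java.time.YearMonth;

    public class DatePrecision {
        public static void main(String[] args) {
            store("1943");       // year precision
            store("1943-05");    // month precision
            store("1943-05-17"); // day precision
        }

        // Expand a partial date string to a full date, remembering how precise the input was.
        static void store(String input) {
            LocalDate full;
            String precision; // hypothetical property value, e.g. "yyyy", "yyyy-MM", "yyyy-MM-dd"
            switch (input.length()) {
                case 4:
                    full = Year.parse(input).atDay(1);
                    precision = "yyyy";
                    break;
                case 7:
                    full = YearMonth.parse(input).atDay(1);
                    precision = "yyyy-MM";
                    break;
                default:
                    full = LocalDate.parse(input);
                    precision = "yyyy-MM-dd";
            }
            System.out.println(full + " (precision: " + precision + ")");
        }
    }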

Importers do not import hierarchies with correct IDs...

Importing hierarchical EAD assigns child nodes the wrong identifiers.

Identifiers should be hierarchical, i.e. if repository r1 in country us has doc c1, which in turn has child doc c1-2, the graph ID for c1-2 should be us-r1-c1-c1-2. The EAD importer does not respect this at the moment.

The main difficulty implementing this is that when parsing EAD the child nodes are "finished" before the parent. We should be able to work around this using a proxy permission scope node to handle cases where the parent node does not yet exist.
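
A rough sketch of the intended ID construction, assuming graph IDs are simply the scope identifiers joined with "-" from the top of the permission-scope chain down (an illustration, not the importer's actual code):

    import java.util.Arrays;
    import java.util.List;

    public class HierarchicalIds {
        // Join the local identifiers of each ancestor scope plus the item itself.
        static String graphId(List<String> scopeIdentifiers, String localIdentifier) {
            return String.join("-", String.join("-", scopeIdentifiers), localIdentifier);
        }

        public static void main(String[] args) {
            // country "us" -> repository "r1" -> doc "c1" -> child doc "c1-2"
            System.out.println(graphId(Arrays.asList("us", "r1", "c1"), "c1-2")); // us-r1-c1-c1-2
        }
    }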

Write a thorough check task for the event structure...

The event/action schema is the most complex thing in the graph, and we want to be able to validate it thoroughly to ensure that nothing untoward has occurred. This will mean traversing all events from the root and ensuring that subjects and actioners are valid. We might also want to rename the relationship label between events and their predecessors, because it's currently not descriptive enough.
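
A very rough sketch of what such a check could look like against the Blueprints API; the relationship labels used here are placeholders, not necessarily the ones in the actual schema:

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Vertex;
    import java.util.Iterator;

    public class EventChainChecker {
        // Walk the event chain from the root, checking each event has at least
        // one subject and one actioner before moving to its predecessor.
        static void check(Vertex rootEvent) {
            Vertex current = rootEvent;
            while (current != null) {
                if (!current.getVertices(Direction.OUT, "hasSubject").iterator().hasNext()) {
                    System.err.println("Event " + current.getId() + " has no subject");
                }
                if (!current.getVertices(Direction.OUT, "hasActioner").iterator().hasNext()) {
                    System.err.println("Event " + current.getId() + " has no actioner");
                }
                // Placeholder label for the event -> predecessor link mentioned above.
                Iterator<Vertex> prev = current.getVertices(Direction.OUT, "previousEvent").iterator();
                current = prev.hasNext() ? prev.next() : null;
            }
        }
    }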

Language name to language code translation is locale dependent

http://ehridev.dans.knaw.nl/jenkins/job/ehri-rest/lastBuild/ehri-project$ehri-importers/testReport/eu.ehri.project.importers.util/HelpersTest/testIso639DashTwoCode/ fails because Jenkins runs on a server with a Dutch locale. The conversion depends on the server's default locale, so on that server only "Engels" would be converted.
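
The underlying issue can be reproduced with nothing but the JDK: a name-to-code map built from Locale display names is keyed by whatever the JVM's default locale calls each language (the actual code in Helpers may differ in detail, but the effect is the same):

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class LanguageLookupDemo {
        public static void main(String[] args) {
            // Build a display-name -> ISO 639-2 code map the naive way.
            Map<String, String> nameToCode = new HashMap<>();
            for (String code : Locale.getISOLanguages()) {
                Locale locale = new Locale(code);
                // getDisplayLanguage() with no argument uses the JVM's default locale,
                // so on a Dutch server the key for "en" is "Engels", not "English".
                nameToCode.put(locale.getDisplayLanguage(), locale.getISO3Language());
            }
            System.out.println(nameToCode.get("English")); // null on a Dutch-locale JVM
            System.out.println(nameToCode.get("Engels"));  // "eng" on a Dutch-locale JVM
        }
    }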

Do we want to only translate English names to codes, or from all languages? Let me look at the references other than in the test...

Creating units within units should require update perms for the parent item

Currently it doesn't check this. Case in point:

  • doc unit create perm for a country allows creating doc units in any of the repositories (this is correct)
  • it also allows creating doc units inside other doc units within that country (this is dubious.)

Creating a child item should be seen as a modification.

Add command for creating arbitrary relationships...

This should have some semantics that differ from Neo4j's built in tools:

  • uses our index to allow referring to items by name
  • doesn't (by default) allow creating duplicate relationships
  • allows creating "single" relationships such as doc heldBy repo

Figure out how to do merge importing...

At present, Description nodes have a dependent relationship to their parent item. This means that if you createOrUpdate an item bundle (as we do when importing from EAD) all descriptions that are not present in the input bundle will be deleted. This is an unavoidable consequence of having to update a full subtree (or at least I can't figure out how to do it otherwise.) The corollary is that descriptions are deleted when an item is deleted, like an SQL CASCADE.

This will obviously cause problems in the case where additional descriptions are created manually via the web interface, on items that were harvested. Those manually created descriptions will be zapped if/when we re-harvest, because they don't belong in the source data.

The way I think we should handle this (without putting egregious hacks in the core persistence code) is as follows:

  • agree on a property convention to distinguish manually-created descriptions from automatically harvested ones, e.g. "manual=true"
  • check if an item already exists on import
  • if it is already there, serialize the existing node as a bundle and, using the pre-agreed discriminator for manually-created descriptions, copy them to the new import bundle
  • update the bundle

Because the existing descriptions won't have changed, they won't actually be "updated", since the persistence code checks whether any changes have actually occurred. However, they also won't be deleted.
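
A self-contained sketch of the proposed merge step; the Bundle class and field names below are stand-ins for the real persistence API, not its actual types:

    import java.util.ArrayList;
    import java.util.List;

    public class MergeImportSketch {
        // Stand-in for the real Bundle class.
        static class Bundle {
            String id;
            List<Bundle> descriptions = new ArrayList<>();
            boolean manual; // the pre-agreed discriminator, e.g. a "manual=true" property
        }

        // Copy manually-created descriptions from the existing item into the freshly
        // imported bundle before createOrUpdate, so they are neither changed nor deleted.
        static Bundle merge(Bundle incoming, Bundle existing) {
            if (existing != null) {
                for (Bundle desc : existing.descriptions) {
                    if (desc.manual) {
                        incoming.descriptions.add(desc);
                    }
                }
            }
            return incoming;
        }
    }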

What do you think @bencomp and @lindareijnhoudt ?

Create a test to check annotations and links are untouched on description update

I couldn't find existing tests for this.

Question: will links and annotations on units' descriptions remain untouched when a new version of a description is imported?

Test:

  1. import two EAD files
  2. create a link between units' descriptions from the EAD files
  3. import changed versions of the EAD files
  4. assert that the link still exists and works and that its content is unchanged
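
A skeleton of such a test might look like this (JUnit 4; the fixture helpers and resource names are made up):

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    public class LinkPersistenceOnReimportTest /* extends the usual importer test fixture */ {

        @Test
        public void linksSurviveDescriptionReimport() throws Exception {
            // 1. import two EAD files (resource names are hypothetical)
            importEad("single-ead-a-v1.xml");
            importEad("single-ead-b-v1.xml");

            // 2. create a link between descriptions from the two files
            String linkId = createLink("unit-a", "unit-b", "relatedTo");

            // 3. import changed versions of the same EAD files
            importEad("single-ead-a-v2.xml");
            importEad("single-ead-b-v2.xml");

            // 4. the link should still exist, still resolve, and be unchanged
            assertTrue(linkExists(linkId));
            assertEquals("relatedTo", linkType(linkId));
        }

        // The helpers below stand in for whatever fixture methods the real test would use.
        private void importEad(String resource) { /* ... */ }
        private String createLink(String sourceId, String targetId, String type) { return "link-1"; }
        private boolean linkExists(String id) { return true; }
        private String linkType(String id) { return "relatedTo"; }
    }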

However, what should happen when the units in v1 of the EAD files are not in v2 of the files? Are descriptions removed too? Or did we ignore this case?

Add more provenance information to imported data

The maintenanceEvent type already exists (and is used by repositories) for this purpose, so we can add maintenanceEvents for things like:

  • data from USHMM
  • converted by EHRI

This will then be combined (on the frontend) with the online history, which includes data such as:

  • imported by Ben
  • annotated by Mike

On the backend side we need to capture pre-import history with maintenanceEvents.

Restructure importers/import managers...

The importer and import manager classes have a fair few warts, mostly going back to their earliest genesis as an EAD-specific hack:

  • CsvImportManager extends XmlImportManager
  • code duplication between CsvImportManager and SaxImportManager
  • the ImportManager interface doesn't do much, and isn't really used
  • the importFile, importFiles signatures and overrides are pretty much just as confusing as when I first wrote them (sorry)

Now that we have a more generic import system it would be nice to tidy this up a bit before looking at things like issue #10.

I am happy to file this under "things to do when bored of hacking CSS", but don't want to tread on your toes @bencomp and @lindareijnhoudt. Thoughts?

Scoped owner permissions => wrongness

This is a classic nasty bug affecting how permissions were displayed but not how they were calculated.

Article 1:

  • in the crud view we assign owner perms to newly created items
  • we do so with the view's scope, and the permission granted thus has that scope
  • this makes no sense, because owner perms should not have a scope (what would that mean? You're only the owner in some cases? Nada, it's wrong.)

Article 2:

  • when we calculate permissions, we assume that if the user has a grant that has a scope shared by the item, its target is a content type (not an item)
  • if this isn't the case (see Article 1) we tell the client that all items in that scope have the given permission

Luckily, because there's so much redundancy in the AclManager at the moment, the read bug didn't manifest itself as a write bug (i.e. in this case we didn't let people do stuff they shouldn't have been able to do).

Find cause of sporadically failing permissions-related REST client tests...

Some tests in PermissionsRestClientTest fail sporadically with java.net.SocketException: Socket Closed. There is no apparent reason why these tests should consistently (if irregularly) fail in this way.

Possibly investigate the way the JSON structures are marshalled/unmarshalled, since this is the only real difference between this resource and others that also send JSON back and forth.

EAD handler should handle mixed content in `<p>`

Many CHIs use either <p>content</p> or <p><emph>content</emph></p> in elements like <bioghist> and <scopecontent>. Some, however, use mixed content, such as

<p>Content <emph>more content</emph> and then some more.</p>

If we mark both /p/ and /p/emph/ as 'include', we end up with a list with the content of <emph> first. If we only mark /p/ as 'include', the content of <emph> is lost.

The parser should support this mixed content and transform elements to what we want when it encounters them. (The ugly solution is to convert mixed content to Markdown equivalents before parsing, which is what I'll do for now.)
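
One way to do this at the SAX level is to buffer the whole content of a <p> and wrap <emph> text inline as it streams past. A minimal standalone sketch (not the project's actual handler) that converts <emph> to Markdown-style emphasis:

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Collects the full mixed content of a <p>, keeping <emph> text in place
    // by wrapping it in Markdown-style emphasis markers.
    public class MixedContentHandler extends DefaultHandler {
        private StringBuilder buffer;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("p".equals(qName)) {
                buffer = new StringBuilder();
            } else if ("emph".equals(qName) && buffer != null) {
                buffer.append('*');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (buffer != null) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("emph".equals(qName) && buffer != null) {
                buffer.append('*');
            } else if ("p".equals(qName) && buffer != null) {
                // Yields "Content *more content* and then some more." for the example above.
                System.out.println(buffer.toString().trim());
                buffer = null;
            }
        }
    }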

Fix metadata handling

Currently, in the TP2.0 tradition, metadata on nodes (such as relationship count caches) is distinguished by an underscore prefix on the key. As of 712468e this no longer gets zapped when items are routinely updated, but the fix is a nasty one. Ultimately it'd be better to maintain metadata like this on a separate node with, say, a hasMeta relationship. This would naturally persist through bundle updates and be a neater solution, albeit perhaps slightly slower when serializing large trees.
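
A rough Blueprints sketch of that alternative; the hasMeta label and the property name are just the names floated above, not an existing part of the schema:

    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;

    public class MetadataNodeSketch {
        // Keep relationship-count caches etc. on a separate node linked by hasMeta,
        // instead of underscore-prefixed properties on the item itself.
        static Vertex attachMetadata(Graph graph, Vertex item, long childCount) {
            Vertex meta = graph.addVertex(null);
            meta.setProperty("childCount", childCount);
            graph.addEdge(null, item, meta, "hasMeta");
            return meta;
        }
    }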

Allow unicode in slugs...

If people are going to use titles for identifiers (which sadly they will), we need to provide Unicode-friendly slugs, otherwise slugs for titles in non-Latin scripts will end up empty.
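
For reference, a small standard-library sketch of a slug function that keeps non-Latin letters instead of stripping them (not the project's implementation):

    import java.text.Normalizer;
    import java.util.Locale;

    public class UnicodeSlugs {
        // Replace anything that is not a letter or digit (in any script) with a hyphen.
        static String slugify(String input) {
            String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
            return normalized.toLowerCase(Locale.ROOT)
                    .replaceAll("[^\\p{L}\\p{N}]+", "-")
                    .replaceAll("^-+|-+$", "");
        }

        public static void main(String[] args) {
            System.out.println(slugify("Архив документов"));  // "архив-документов", not ""
            System.out.println(slugify("Jüdisches Museum"));  // "jüdisches-museum"
        }
    }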

Overloaded relationship for ActionManager hasEvent...

The "hasEvent" relationship label is currently overloaded, with the SystemEvent node being pointed to from both the action and event link nodes. This should be fixed to prevent confusion and ease debugging the complex event relationship stuff.

Weird importer test bugs...

When run as a suite with all the other tests (via mvn test) all the importer tests pass. However, when run independently, using, say:

 mvn test -Dtest=eu.ehri.project.importers.IcaAtomEadSingleEad -DfailIfNoTests=false

there are failures with the number of nodes created on import. On inspection, it seems that the problem lies with one more relation node being created than should be the case: for the IcaAtomEadSingleEad test there are five relations created instead of four, two of them having the name Test Name. The erroneous additional relation has a corporateBodyAccess type instead of createAccess.

Clearly there is something wrong with the test isolation here!

Find a better way of testing the REST API...

Using the Jersey client is horribly verbose and repetitive. Ideally we'd use a nice DSL to remove the boilerplate, preferably one without too many additional dependencies (like Groovy, etc.).

Inconsistent behaviour with Xerces XML parser.

If Xerces' SAX parser is on the classpath, it'll be used by the various stream importers. Unfortunately there are a few implementation differences:

  1. localName for attributes doesn't work when namespace support is disabled
  2. whitespace around entities seems to be getting lost

While 1 is easy to fix (use qName instead), 2 is trickier.
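
The fix for 1 amounts to a fallback in attribute handling; a simplified sketch using the standard SAX Attributes interface:

    import org.xml.sax.Attributes;

    public class AttributeNames {
        // With namespace support disabled, some parsers return "" for getLocalName(),
        // so fall back to the qualified name.
        static String attributeName(Attributes attributes, int index) {
            String localName = attributes.getLocalName(index);
            return (localName == null || localName.isEmpty())
                    ? attributes.getQName(index)
                    : localName;
        }
    }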

Change AccessDenied error to ItemNotFound

This is a bit of an open question. If a user tries to access something they can't read due to access restrictions, the current API throws an AccessDenied error. This is honest about what actually happened, but it allows clients to reason about the existence of inaccessible material.

Throwing an ItemNotFound error instead would match the behaviour we'd get if we used the ACL graph wrapper, and would arguably be more consistent overall.
