ehri-rest's Introduction

The EHRI Data Backend

A business layer and JAX-RS resource classes for managing EHRI data.

Integrates with the Neo4j graph database via a server plugin.

The raison d'être of the EHRI web service backend is to make the job of the front-end easier by performing the following functions:

  • serialising and deserialising domain-specific object graphs
  • handling cascade-delete scenarios for objects that are dependent on one another
  • calculating and enforcing access control and action-based permissions on both individual items and item-classes in two hierarchical dimensions: user/group roles, and parent-child scopes
  • maintaining an audit log of all data-mutating actions, with support for idempotent updates
  • providing a GraphQL API
  • providing an OAI-PMH repository

For documentation (a work-in-progress, but better than nothing), see the project docs.

For getting up and running quickly, Docker is the recommended approach. A local server listening on port 7474, with an administrative user account "mike", can be started with the following command:

sudo docker run --publish 7474:7474 --env ADMIN_USER=mike -it ehri/ehri-rest

ehri-rest's Issues

Clean up the ACL manager class

It's quite scary: a nightmare of hard-to-reason-about, imperative gunk. At least a fair number of tests should help preserve the existing behaviour.

Clean up code for release

I think it would be nice to be clear on and consistent in:

  • javadoc for at least each class
  • javadoc for each package?
  • consistent references to authors
  • include a licence statement in each file

There are files that start with a comment "to change this template, ...". That should go, of course :)

What do you think?

Creating multiple access points in parallel fails...

Inserting a new action at the head of the event queue consistently causes a relationship not found error when the current head-to-event link is deleted:

 Relationship 1638 not found
Caused by: org.neo4j.graphdb.NotFoundException: Relationship 1638 not found
    at org.neo4j.kernel.impl.core.NodeManager.getRelationshipForProxy(NodeManager.java:675)
    at org.neo4j.kernel.InternalAbstractGraphDatabase$7.lookupRelationship(InternalAbstractGraphDatabase.java:783)
    at org.neo4j.kernel.impl.core.RelationshipProxy.delete(RelationshipProxy.java:62)
    at com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph.removeEdge(Neo4jGraph.java:490)
    at com.tinkerpop.frames.FramedGraph.removeEdge(FramedGraph.java:357)
    at eu.ehri.project.persistence.ActionManager.replaceAtHead(ActionManager.java:429)
    at eu.ehri.project.persistence.ActionManager.logEvent(ActionManager.java:317)
    at eu.ehri.project.persistence.ActionManager.logEvent(ActionManager.java:363)
    at eu.ehri.project.views.DescriptionViews.create(DescriptionViews.java:58)
    at eu.ehri.extension.DescriptionResource.createAccessPoint(DescriptionResource.java:139)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

This doesn't appear to be an artefact of Blueprints TX handling, since even when a TX is explicitly started it still occurs.

Make references to SomeClass<T> parametrised or remove <T>

Eclipse warns about references to raw types because javac warns about them, and javac warns about them because references to raw types cause unchecked conversions or unchecked assignments. At compile time it's impossible to say whether items added to, for instance, a raw List are of the type they are cast to when they come out.
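
A minimal illustration of the problem (not project code):

    import java.util.ArrayList;
    import java.util.List;

    public class RawTypeDemo {
        @SuppressWarnings({"unchecked", "rawtypes"})
        public static void main(String[] args) {
            List raw = new ArrayList();     // raw type: javac emits a rawtypes warning here
            raw.add("a string");
            raw.add(42);                    // nothing stops mixing element types

            List<String> strings = raw;     // unchecked assignment: compiles with a warning...
            String s = strings.get(1);      // ...but fails at runtime with a ClassCastException
            System.out.println(s);
        }
    }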

The three types that generate the most warnings are:

  • AbstractImporter
  • List
  • GraphReindexer

I'm not sure what the function of the type parameter in AbstractImporter is. Maybe we should remove it?

Getting rid of raw type Lists is harder, because of the way we construct graph surrogates that are fed to the Importers.

Store date precision with the dates on import

For indexing and searching, we convert partial dates (i.e. YYYY-MM or just YYYY) to full dates. We need to allow just a year or year+month (done) and store the precision.

EHRI/ehri-frontend#128 will allow free text (e.g. "Spring 1943") for display, but it still needs a date to some precision. Both must be supported: display text and a machine-actionable format.
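
A minimal sketch of the expansion plus a stored precision flag, using java.time (the precision markers here are hypothetical, not an existing property format):

    import java.time.LocalDate;
    import java.time.Year;
    import java.time.YearMonth;

    public class DatePrecision {
        public static void main(String[] args) {
            store("1943");       // year precision
            store("1943-05");    // month precision
            store("1943-05-17"); // day precision
        }

        // Expand a partial date string to a full date, remembering how precise the input was.
        static void store(String input) {
            LocalDate full;
            String precision; // hypothetical property value, e.g. "yyyy", "yyyy-MM", "yyyy-MM-dd"
            switch (input.length()) {
                case 4:
                    full = Year.parse(input).atDay(1);
                    precision = "yyyy";
                    break;
                case 7:
                    full = YearMonth.parse(input).atDay(1);
                    precision = "yyyy-MM";
                    break;
                default:
                    full = LocalDate.parse(input);
                    precision = "yyyy-MM-dd";
            }
            System.out.println(full + " (precision: " + precision + ")");
        }
    }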

Importers do not import hierarchies with correct IDs...

Importing hierarchical EAD assigns child nodes the wrong identifiers.

Identifiers should be hierarchical, i.e. if repository r1 in country us has doc c1, which in turn has child doc c1-2, the graph ID for c1-2 should be us-r1-c1-c1-2. The EAD importer does not respect this at the moment.

The main difficulty implementing this is that when parsing EAD the child nodes are "finished" before the parent. We should be able to work around this using a proxy permission scope node to handle cases where the parent node does not yet exist.
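
A rough sketch of the intended ID construction, assuming graph IDs are simply the scope identifiers joined with "-" from the top of the permission-scope chain down (an illustration, not the importer's actual code):

    import java.util.Arrays;
    import java.util.List;

    public class HierarchicalIds {
        // Join the local identifiers of each ancestor scope plus the item itself.
        static String graphId(List<String> scopeIdentifiers, String localIdentifier) {
            return String.join("-", String.join("-", scopeIdentifiers), localIdentifier);
        }

        public static void main(String[] args) {
            // country "us" -> repository "r1" -> doc "c1" -> child doc "c1-2"
            System.out.println(graphId(Arrays.asList("us", "r1", "c1"), "c1-2")); // us-r1-c1-c1-2
        }
    }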

Write a thorough check task for the event structure...

The event/action schema is the most complex thing in the graph, and we want to be able to validate it thoroughly to ensure that nothing untoward has occurred. This will mean traversing all events from the root and ensuring that subjects and actioners are valid. We might also want to rename the relationship label between events and their predecessors, because it's currently not descriptive enough.
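
A very rough sketch of what such a check could look like against the Blueprints API; the relationship labels used here are placeholders, not necessarily the ones in the actual schema:

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Vertex;
    import java.util.Iterator;

    public class EventChainChecker {
        // Walk the event chain from the root, checking each event has at least
        // one subject and one actioner before moving to its predecessor.
        static void check(Vertex rootEvent) {
            Vertex current = rootEvent;
            while (current != null) {
                if (!current.getVertices(Direction.OUT, "hasSubject").iterator().hasNext()) {
                    System.err.println("Event " + current.getId() + " has no subject");
                }
                if (!current.getVertices(Direction.OUT, "hasActioner").iterator().hasNext()) {
                    System.err.println("Event " + current.getId() + " has no actioner");
                }
                // Placeholder label for the event -> predecessor link mentioned above.
                Iterator<Vertex> prev = current.getVertices(Direction.OUT, "previousEvent").iterator();
                current = prev.hasNext() ? prev.next() : null;
            }
        }
    }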

Language name to language code translation is locale dependent

http://ehridev.dans.knaw.nl/jenkins/job/ehri-rest/lastBuild/ehri-project$ehri-importers/testReport/eu.ehri.project.importers.util/HelpersTest/testIso639DashTwoCode/ fails because Jenkins runs on a server with a Dutch locale. The conversion depends on the server's default locale, so on that server only "Engels" would be converted.
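
The underlying issue can be reproduced with nothing but the JDK: a name-to-code map built from Locale display names is keyed by whatever the JVM's default locale calls each language (the actual code in Helpers may differ in detail, but the effect is the same):

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class LanguageLookupDemo {
        public static void main(String[] args) {
            // Build a display-name -> ISO 639-2 code map the naive way.
            Map<String, String> nameToCode = new HashMap<>();
            for (String code : Locale.getISOLanguages()) {
                Locale locale = new Locale(code);
                // getDisplayLanguage() with no argument uses the JVM's default locale,
                // so on a Dutch server the key for "en" is "Engels", not "English".
                nameToCode.put(locale.getDisplayLanguage(), locale.getISO3Language());
            }
            System.out.println(nameToCode.get("English")); // null on a Dutch-locale JVM
            System.out.println(nameToCode.get("Engels"));  // "eng" on a Dutch-locale JVM
        }
    }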

Do we want to only translate English names to codes, or from all languages? Let me look at the references other than in the test...

Creating units within units should require update perms for the parent item

Currently it doesn't check this. Case in point:

  • doc unit create perm for a country allows creating doc units in any of the repositories (this is correct)
  • it also allows creating doc units inside other doc units within that country (this is dubious.)

Creating a child item should be seen as a modification.

Add command for creating arbitrary relationships...

This should have some semantics that differ from Neo4j's built in tools:

  • uses our index to allow referring to items by name
  • doesn't (by default) allow creating duplicate relationships
  • allows creating "single" relationships such as doc heldBy repo

Figure out how to do merge importing...

At present, Description nodes have a dependent relationship to their parent item. This means that if you createOrUpdate an item bundle (as we do when importing from EAD) all descriptions that are not present in the input bundle will be deleted. This is an unavoidable consequence of having to update a full subtree (or at least I can't figure out how to do it otherwise.) The corollary is that descriptions are deleted when an item is deleted, like an SQL CASCADE.

This will obviously cause problems in the case where additional descriptions are created manually via the web interface, on items that were harvested. Those manually created descriptions will be zapped if/when we re-harvest, because they don't belong in the source data.

The way I think we should handle this (without putting egregious hacks in the core persistence code) is as follows:

  • agree on a property convention to distinguish manually-created descriptions from automatically harvested ones, e.g. "manual=true"
  • check if an item already exists on import
  • if it is already there, serialize the existing node as a bundle and, using the pre-agreed discriminator for manually-created descriptions, copy them to the new import bundle
  • update the bundle

Because the existing descriptions won't have changed, they won't actually be "updated", since the persistence code checks whether any changes have actually occurred. However, they also won't be deleted.
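
A self-contained sketch of the proposed merge step; the Bundle class and field names below are stand-ins for the real persistence API, not its actual types:

    import java.util.ArrayList;
    import java.util.List;

    public class MergeImportSketch {
        // Stand-in for the real Bundle class.
        static class Bundle {
            String id;
            List<Bundle> descriptions = new ArrayList<>();
            boolean manual; // the pre-agreed discriminator, e.g. a "manual=true" property
        }

        // Copy manually-created descriptions from the existing item into the freshly
        // imported bundle before createOrUpdate, so they are neither changed nor deleted.
        static Bundle merge(Bundle incoming, Bundle existing) {
            if (existing != null) {
                for (Bundle desc : existing.descriptions) {
                    if (desc.manual) {
                        incoming.descriptions.add(desc);
                    }
                }
            }
            return incoming;
        }
    }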

What do you think @bencomp and @lindareijnhoudt ?

Create a test to check annotations and links are untouched on description update

I couldn't find existing tests for this.

Question: will links and annotations on units' descriptions remain untouched when a new version of a description is imported?

Test:

  1. import two EAD files
  2. create a link between units' descriptions from the EAD files
  3. import changed versions of the EAD files
  4. assert that the link still exists and works and that its content is unchanged
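
A skeleton of such a test might look like this (JUnit 4; the fixture helpers and resource names are made up):

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    public class LinkPersistenceOnReimportTest /* extends the usual importer test fixture */ {

        @Test
        public void linksSurviveDescriptionReimport() throws Exception {
            // 1. import two EAD files (resource names are hypothetical)
            importEad("single-ead-a-v1.xml");
            importEad("single-ead-b-v1.xml");

            // 2. create a link between descriptions from the two files
            String linkId = createLink("unit-a", "unit-b", "relatedTo");

            // 3. import changed versions of the same EAD files
            importEad("single-ead-a-v2.xml");
            importEad("single-ead-b-v2.xml");

            // 4. the link should still exist, still resolve, and be unchanged
            assertTrue(linkExists(linkId));
            assertEquals("relatedTo", linkType(linkId));
        }

        // The helpers below stand in for whatever fixture methods the real test would use.
        private void importEad(String resource) { /* ... */ }
        private String createLink(String sourceId, String targetId, String type) { return "link-1"; }
        private boolean linkExists(String id) { return true; }
        private String linkType(String id) { return "relatedTo"; }
    }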

However, what should happen when the units in v1 of the EAD files are not in v2 of the files? Are descriptions removed too? Or did we ignore this case?

Add more provenance information to imported data

The maintenanceEvent type already exists (and is used by repositories) for this purpose, so we can add maintenanceEvents for things like:

  • data from USHMM
  • converted by EHRI

This will then be combined (on the frontend) with the online history, which includes data such as:

  • imported by Ben
  • annotated by Mike

On the backend side we need to capture pre-import history with maintenanceEvents.

Restructure importers/import managers...

The importer and import manager classes have a fair few warts, mostly going back to their earliest genesis as an EAD-specific hack:

  • CsvImportManager extends XmlImportManager
  • code duplication between CsvImportManager and SaxImportManager
  • the ImportManager interface doesn't do much, and isn't really used
  • the importFile, importFiles signatures and overrides are pretty much just as confusing as when I first wrote them (sorry)

Now that we have a more generic import system it would be nice to tidy this up a bit before looking at things like issue #10.

I am happy to file this under "things to do when bored of hacking CSS", but don't want to tread on your toes @bencomp and @lindareijnhoudt. Thoughts?

Scoped owner permissions => wrongness

This is a classic nasty bug affecting how permissions were displayed but not how they were calculated.

Article 1:

  • in the crud view we assign owner perms to newly created items
  • we do so with the view's scope, and the permission granted thus has that scope
  • this makes no sense, because owner perms should not have a scope (what would that mean? You're only the owner in some cases? Nada, it's wrong.)

Article 2:

  • when we calculate permissions, we assume that if the user has a grant that has a scope shared by the item, its target is a content type (not an item)
  • if this isn't the case (see Article 1) we tell the client that all items in that scope have the given permission

Luckily, because there's so much redundancy in the AclManager at the moment, the read bug didn't manifest itself as a write bug (i.e. in this case we didn't let people do stuff they shouldn't have been able to do).

Find cause of sporadically failing permissions-related REST client tests...

Some tests in PermissionsRestClientTest fail sporadically with java.net.SocketException: Socket Closed. There is no apparent reason why these tests should consistently (if irregularly) fail in this way.

Possibly investigate the way the JSON structures are marshalled/unmarshalled, since this is the only real difference between this resource and others that also send JSON back and forth.

EAD handler should handle mixed content in `<p>`

Many CHIs use either <p>content</p> or <p><emph>content</emph></p> in elements like <bioghist> and <scopecontent>. Some, however, use mixed content, such as

<p>Content <emph>more content</emph> and then some more.</p>

If we mark both /p/ and /p/emph/ as 'include', we end up with a list with the content of <emph> first. If we only mark /p/ as 'include', the content of <emph> is lost.

The parser should support this mixed content and transform elements to what we want when it encounters them. (The ugly solution is to convert mixed content to Markdown equivalents before parsing, which is what I'll do for now.)
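
One way to do this at the SAX level is to buffer the whole content of a <p> and wrap <emph> text inline as it streams past. A minimal standalone sketch (not the project's actual handler) that converts <emph> to Markdown-style emphasis:

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Collects the full mixed content of a <p>, keeping <emph> text in place
    // by wrapping it in Markdown-style emphasis markers.
    public class MixedContentHandler extends DefaultHandler {
        private StringBuilder buffer;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("p".equals(qName)) {
                buffer = new StringBuilder();
            } else if ("emph".equals(qName) && buffer != null) {
                buffer.append('*');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (buffer != null) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("emph".equals(qName) && buffer != null) {
                buffer.append('*');
            } else if ("p".equals(qName) && buffer != null) {
                // Yields "Content *more content* and then some more." for the example above.
                System.out.println(buffer.toString().trim());
                buffer = null;
            }
        }
    }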

Fix metadata handling

Currently, in the TP2.0 tradition, metadata on nodes (such as relationship count caches) is distinguished by an underscore prefix on the key. As of 712468e this no longer gets zapped when items are routinely updated, but the fix is a nasty one. Ultimately it'd be better to maintain metadata like this on a separate node with, say, a hasMeta relationship. This would naturally persist through bundle updates and be a neater solution, albeit perhaps slightly slower when serializing large trees.
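
A rough Blueprints sketch of that alternative; the hasMeta label and the property name are just the names floated above, not an existing part of the schema:

    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;

    public class MetadataNodeSketch {
        // Keep relationship-count caches etc. on a separate node linked by hasMeta,
        // instead of underscore-prefixed properties on the item itself.
        static Vertex attachMetadata(Graph graph, Vertex item, long childCount) {
            Vertex meta = graph.addVertex(null);
            meta.setProperty("childCount", childCount);
            graph.addEdge(null, item, meta, "hasMeta");
            return meta;
        }
    }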

Allow unicode in slugs...

If people are going to use titles for identifiers (which sadly they will), we need to provide Unicode-friendly slugs, otherwise slugs for titles in non-Latin scripts will end up empty.
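
For reference, a small standard-library sketch of a slug function that keeps non-Latin letters instead of stripping them (not the project's implementation):

    import java.text.Normalizer;
    import java.util.Locale;

    public class UnicodeSlugs {
        // Replace anything that is not a letter or digit (in any script) with a hyphen.
        static String slugify(String input) {
            String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC);
            return normalized.toLowerCase(Locale.ROOT)
                    .replaceAll("[^\\p{L}\\p{N}]+", "-")
                    .replaceAll("^-+|-+$", "");
        }

        public static void main(String[] args) {
            System.out.println(slugify("Архив документов"));  // "архив-документов", not ""
            System.out.println(slugify("Jüdisches Museum"));  // "jüdisches-museum"
        }
    }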

Overloaded relationship for ActionManager hasEvent...

The "hasEvent" relationship label is currently overloaded, with the SystemEvent node being pointed to from both the action and event link nodes. This should be fixed to prevent confusion and ease debugging the complex event relationship stuff.

Weird importer test bugs...

When run as a suite with all the other tests (via mvn test) all the importer tests pass. However, when run independently, using, say:

 mvn test -Dtest=eu.ehri.project.importers.IcaAtomEadSingleEad -DfailIfNoTests=false

there are failures with the number of nodes created on import. On inspection, it seems that the problem lies with one more relation node being created than should be the case: for the IcaAtomEadSingleEad test there are five relations created instead of four, two of them having the name Test Name. The erroneous additional relation has a corporateBodyAccess type instead of createAccess.

Clearly there is something wrong with the test isolation here!

Find a better way of testing the REST API...

Using the Jersey client is horribly verbose and repetitive. Ideally we'd use a nice DSL to remove the boilerplate, preferably one without too many additional dependencies (like Groovy, etc.).

Inconsistent behaviour with Xerces XML parser.

If Xerces' SAX parser is on the classpath, it'll be used by the various stream importers. Unfortunately there are a few implementation differences:

  1. localName for attributes doesn't work when namespace support is disabled
  2. whitespace around entities seems to be getting lost

While 1 is easy to fix (use qName instead), 2 is trickier.
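
The fix for 1 amounts to a fallback in attribute handling; a simplified sketch using the standard SAX Attributes interface:

    import org.xml.sax.Attributes;

    public class AttributeNames {
        // With namespace support disabled, some parsers return "" for getLocalName(),
        // so fall back to the qualified name.
        static String attributeName(Attributes attributes, int index) {
            String localName = attributes.getLocalName(index);
            return (localName == null || localName.isEmpty())
                    ? attributes.getQName(index)
                    : localName;
        }
    }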

Change AccessDenied error to ItemNotFound

This is a bit of an open question. If a user tries to access something they can't read due to access restrictions, the current API throws an AccessDenied error. This is honest about what actually happened, but it allows clients to reason about the existence of inaccessible material.

Throwing an ItemNotFound error instead would match the behaviour we'd get if we used the ACL graph wrapper, and would arguably be more consistent overall.
