ehri / ehri-rest
Web service and business logic for managing EHRI collection metadata.
License: European Union Public License 1.2
This replaces/augments the ability to specify the handler/importer classes.
Inserting a new action at the head of the event queue consistently causes a relationship not found error when the current head-to-event link is deleted:
Relationship 1638 not found
Caused by: org.neo4j.graphdb.NotFoundException: Relationship 1638 not found
at org.neo4j.kernel.impl.core.NodeManager.getRelationshipForProxy(NodeManager.java:675)
at org.neo4j.kernel.InternalAbstractGraphDatabase$7.lookupRelationship(InternalAbstractGraphDatabase.java:783)
at org.neo4j.kernel.impl.core.RelationshipProxy.delete(RelationshipProxy.java:62)
at com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph.removeEdge(Neo4jGraph.java:490)
at com.tinkerpop.frames.FramedGraph.removeEdge(FramedGraph.java:357)
at eu.ehri.project.persistence.ActionManager.replaceAtHead(ActionManager.java:429)
at eu.ehri.project.persistence.ActionManager.logEvent(ActionManager.java:317)
at eu.ehri.project.persistence.ActionManager.logEvent(ActionManager.java:363)
at eu.ehri.project.views.DescriptionViews.create(DescriptionViews.java:58)
at eu.ehri.extension.DescriptionResource.createAccessPoint(DescriptionResource.java:139)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
This doesn't appear to be an artefact of Blueprints TX handling, since even when a TX is explicitly started it still occurs.
Some tests in the PermissionsRestClientTest fail sporadically with java.net.SocketException: Socket Closed. There is no apparent reason why these tests in particular should keep failing (if irregularly).
Possibly investigate the way the JSON structures are marshalled/unmarshalled, since this is the only real difference between this resource and others that also send JSON back and forth.
Found in ITS's description of their fonds R 2.
The EAD has:
<p>1. First header</p>
Paragraph text and <lb/>, then
<p>2. Second 'header'</p>
more text, and headers.
It comes out as shown on the acceptance server: as ordered lists of 1 item each, thus showing 1. title, text, 1. title, text, etc.
I'm not sure whether this is a front-end bug or backend bug.
... can be given in the same fashion as the accessors.
This should result in less index fragmentation and allow us to get the millisecond creation time of an item.
Datomic has a nice Peer.squuid function but unfortunately it's not open source.
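To make the idea concrete, here is a minimal sketch of a time-ordered identifier. It is not Datomic's squuid algorithm and not what ehri-rest does; the class name TimeOrderedIds and the bit layout (48 bits of epoch milliseconds followed by random bits) are assumptions chosen for illustration, and the result does not carry a valid RFC 4122 version field.

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical helper: IDs that sort roughly by creation time.
final class TimeOrderedIds {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Upper 48 bits of the most significant long hold the creation time in
    // milliseconds; the remaining 16 + 64 bits are random.
    static UUID create(long epochMillis) {
        long msb = (epochMillis << 16) | (RANDOM.nextLong() & 0xFFFFL);
        long lsb = RANDOM.nextLong();
        return new UUID(msb, lsb);
    }

    // Recover the millisecond creation time from an ID made by create().
    static long millisOf(UUID id) {
        return id.getMostSignificantBits() >>> 16;
    }
}
```

Because the timestamp occupies the leading bits, IDs created later compare greater than earlier ones, which is what keeps index inserts close together.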
@lindareijnhoudt Thinking of adding an externalRef type. The only mandatory field will be a URI, but we can optionally store and cache other data, like a display name and/or geocodes.
Primary suspect: frames (and the use of JavaHandler Impls). Perhaps try seeing if upgrading to Frames 2.4.1 fixes this.
This should have some semantics that differ from Neo4j's built in tools:
I couldn't find existing tests for this.
Question: will links and annotations on -units- descriptions remain untouched when a new version of a description is imported?
Test:
However, what should happen when the units in v1 of the EAD files are not in v2 of the files? Are descriptions removed too? Or did we ignore this case?
Defaults to false.
When importing XML data containing Something & Something the spaces are stripped, resulting in "Something&Something".
Seems to relate to this Xerces bug:
The owner of the repository is the only one who can transfer ownership to an organisation.
Eclipse warns about references to raw types because javac warns about them, and javac warns about them because references to raw types cause unchecked conversions or unchecked assignments. At compile time it's impossible to say whether items added to, for instance, a raw List are of the type that they are cast to when they come out of the object.
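A small, self-contained illustration of why the warning matters (the class RawTypeDemo is made up for this note and is not taken from the importers):

```java
import java.util.ArrayList;
import java.util.List;

final class RawTypeDemo {
    // Compiles (with warnings suppressed here) but fails at run time.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String firstFromRaw() {
        List raw = new ArrayList();   // raw type: javac cannot check what goes in
        raw.add(42);                  // unchecked call
        List<String> names = raw;     // unchecked conversion
        return names.get(0);          // ClassCastException: Integer is not a String
    }

    // With a type argument the mistake is caught by the compiler instead.
    static String firstFromTyped() {
        List<String> names = new ArrayList<>();
        // names.add(42);             // would not compile
        names.add("fonds");
        return names.get(0);
    }
}
```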
The three types that generate the most warnings are:
I'm not sure what the function of the type parameter in AbstractImporter is. Maybe we should remove it?
Getting rid of raw List types is harder, because of the way we construct graph surrogates that are fed to the Importers.
Somehow, there's a link to a null actioner. Need to investigate how the DB got into this state.
When run as a suite with all the other tests (via mvn test) all the importer tests pass. However, when run independently, using, say:
mvn test -Dtest=eu.ehri.project.importers.IcaAtomEadSingleEad -DfailIfNoTests=false
there are failures with the number of nodes created on import. On inspection, it seems that the problem lies with one more relation node being created than should be the case, i.e. for the IcaAtomEadSingleEad test there are five relations created instead of four, two of them having the name Test Name. The erroneous additional relation has a corporateBodyAccess type instead of createAccess.
Clearly there is something wrong with the test isolation here!
For indexing and searching, we convert any dates other than full dates (i.e. YYYY-MM or YYYY) to full dates. We need to allow just years or year+month (done) and store the precision.
EHRI/ehri-frontend#128 will allow free text (e.g. "Spring 1943") for display, but it still needs a date to some precision. This must also be supported - display text and machine-actionable formats.
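One way to keep both the full date and its precision, sketched with java.time. The class PartialDate, the Precision enum, and the choice to pad to the first day of the period are assumptions for illustration only, not the existing model.

```java
import java.time.LocalDate;
import java.time.Year;
import java.time.YearMonth;

// Hypothetical value type: a full date plus how much of it was actually given.
final class PartialDate {
    enum Precision { YEAR, MONTH, DAY }

    final LocalDate date;
    final Precision precision;

    private PartialDate(LocalDate date, Precision precision) {
        this.date = date;
        this.precision = precision;
    }

    // Accepts YYYY, YYYY-MM or YYYY-MM-DD and pads to the start of the period.
    static PartialDate parse(String text) {
        switch (text.length()) {
            case 4:
                return new PartialDate(Year.parse(text).atDay(1), Precision.YEAR);
            case 7:
                return new PartialDate(YearMonth.parse(text).atDay(1), Precision.MONTH);
            case 10:
                return new PartialDate(LocalDate.parse(text), Precision.DAY);
            default:
                throw new IllegalArgumentException("Unsupported date format: " + text);
        }
    }
}
```

The padded date can be indexed as before, while the stored precision tells the display layer how much of it to show.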
The maintenanceEvent type already exists (and is used by repositories) for this purpose, so we can add maintenanceEvents for things like:
This will then be combined (on the frontend) with the online history, which includes data such as:
Many CHIs use either <p>content</p> or <p><emph>content</emph></p> in elements like <bioghist> and <scopecontent>. Some, however, use mixed content, such as
<p>Content <emph>more content</emph> and then some more.</p>
If we mark both /p/ and /p/emph/ as 'include', we end up with a list with the content of <emph> first. If we only mark /p/ as 'include', the content of <emph> is lost.
The parser should support this mixed content and transform elements to what we want when it encounters them. (The ugly solution is to convert mixed content to Markdown equivalents before parsing, which is what I'll do for now.)
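As a rough sketch of what handling mixed content in the parser could look like, the handler below keeps text in document order and turns <emph> into Markdown emphasis as it is encountered. The class MixedContentFlattener is hypothetical and uses only the JDK's SAX API; it is not the existing importer code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: flattens <p> mixed content to one Markdown string.
final class MixedContentFlattener extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("emph".equals(qName)) text.append('*');   // open emphasis
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("emph".equals(qName)) text.append('*');   // close emphasis
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);               // text arrives in source order
    }

    static String flatten(String xml) {
        MixedContentFlattener handler = new MixedContentFlattener();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.text.toString();
    }
}
```

Because everything is appended to a single buffer, the content of <emph> stays where it appeared instead of being listed first or dropped.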
At present, Description nodes have a dependent relationship to their parent item. This means that if you createOrUpdate an item bundle (as we do when importing from EAD) all descriptions that are not present in the input bundle will be deleted. This is an unavoidable consequence of having to update a full subtree (or at least I can't figure out how to do it otherwise). The corollary is that descriptions are deleted when an item is deleted, like an SQL CASCADE.
This will obviously cause problems in the case where additional descriptions are created manually via the web interface, on items that were harvested. Those manually created descriptions will be zapped if/when we re-harvest, because they don't belong in the source data.
The way I think we should handle this (without putting egregious hacks in the core persistence code) is as follows:
Because the existing descriptions won't have changed they won't actually be "updated", since the persistence code checks if any changes have actually occurred. However they also won't be deleted.
What do you think @bencomp and @lindareijnhoudt ?
If Xerces' SaxParser is on the classpath it'll be used by the various stream importers. Unfortunately there are a few implementation differences:
localName for attributes doesn't work when namespace support is disabled
While 1 is easy to fix (use qName instead), 2 is trickier.
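The qName fix for the first difference might look something like this. The helper AttributeNames is made up for illustration; it relies on the documented SAX behaviour that getLocalName returns an empty string when namespace processing is not being performed.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: attribute name that works with or without namespaces.
final class AttributeNames {
    static String nameOf(Attributes attributes, int index) {
        String local = attributes.getLocalName(index);
        // Without namespace support the local name is empty, so use the qName.
        return (local == null || local.isEmpty()) ? attributes.getQName(index) : local;
    }

    public static void main(String[] args) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "", "level", "CDATA", "fonds"); // no local name set
        System.out.println(nameOf(atts, 0));
    }
}
```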
Should be a content type (they're not currently) with a web service interface. The behaviour of VCs will probably evolve as we figure out how they'll work in practice.
cc: @lindareijnhoudt
... this would be better than doing it ad-hoc on the client.
Big question: to what extent can this play nicely with access control restrictions?
http://ehridev.dans.knaw.nl/jenkins/job/ehri-rest/lastBuild/ehri-project$ehri-importers/testReport/eu.ehri.project.importers.util/HelpersTest/testIso639DashTwoCode/ fails, because Jenkins runs on a server with Dutch locale. The conversion depends on the server's locale, so on the server only "Engels" would be converted.
Do we want to only translate English names to codes, or from all languages? Let me look at the references other than in the test...
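If the answer is "English names only", the lookup can be made independent of the server's locale by always asking for the display name in Locale.ENGLISH. The sketch below is an assumption about how that could be done with the JDK alone; LanguageCodes is a hypothetical class, and the three-letter codes returned by the JDK may not be the exact ISO 639-2 variant the project expects.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical lookup from English language names to three-letter codes.
final class LanguageCodes {
    private static final Map<String, String> BY_ENGLISH_NAME = new HashMap<>();

    static {
        for (String code : Locale.getISOLanguages()) {
            Locale language = Locale.forLanguageTag(code);
            try {
                // Always use the English display name, never the default locale's.
                BY_ENGLISH_NAME.putIfAbsent(
                        language.getDisplayLanguage(Locale.ENGLISH).toLowerCase(Locale.ROOT),
                        language.getISO3Language());
            } catch (MissingResourceException e) {
                // No three-letter code known for this language: skip it.
            }
        }
    }

    // Returns the code for an English language name, or null if unknown.
    static String threeLetterCode(String englishName) {
        return BY_ENGLISH_NAME.get(englishName.toLowerCase(Locale.ROOT));
    }
}
```

With this approach the result for "English" is the same whether the JVM runs with a Dutch or an English default locale, and "Engels" is simply not recognised.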
Should be easy: add a command to the Fabfile that does git tag on the deployed code, but only when the deploy worked.
This is a classic nasty bug affecting how permissions were displayed but not how they were calculated.
Article 1:
Article 2:
Luckily, because there's so much redundancy in the AclManager at the moment, the read bug didn't manifest itself as a write bug (i.e. in this case we didn't let people do stuff they shouldn't have been able to do).
Need to fix this stat...
Only anecdotal evidence for this thus far.
Importing hierarchical EAD assigns child nodes the wrong identifiers.
Identifiers should be hierarchical, i.e. if repository r1 in country us has doc c1, which in turn has child doc c1-2, the graph ID for c1-2 should be us-r1-c1-c1-2. The EAD importer is not respecting this at the moment.
The main difficulty implementing this is that when parsing EAD the child nodes are "finished" before the parent. We should be able to work around this using a proxy permission scope node to handle cases where the parent node does not yet exist.
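The expected ID scheme can be written down as a tiny function. This is only a sketch of the rule stated above, with a hypothetical class name and signature; it does not address the harder ordering problem of children finishing before their parents.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a graph ID from the chain of enclosing scopes.
final class HierarchicalIds {
    // scopeIds runs from the outermost scope (country) to the direct parent.
    static String graphId(List<String> scopeIds, String localId) {
        StringBuilder id = new StringBuilder();
        for (String scope : scopeIds) {
            id.append(scope).append('-');
        }
        return id.append(localId).toString();
    }

    public static void main(String[] args) {
        // Country us, repository r1, parent doc c1, child doc c1-2.
        System.out.println(graphId(Arrays.asList("us", "r1", "c1"), "c1-2"));
    }
}
```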
Using Jersey client is horribly verbose and repetitive. Preferably we'd use a nice DSL to remove the boilerplate. Preferably one without too many additional dependencies (like Groovy, etc).
If people are going to use titles for identifiers (which sadly they will) we need to provide unicode-friendly slugs, otherwise slugs for titles in non-latin character sets will end up empty.
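A minimal sketch of a slug function that keeps letters and digits from any script instead of stripping everything outside ASCII. The class Slugs and the exact rules (lower-casing, hyphen as separator) are assumptions, not the current implementation.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical slug helper that preserves non-Latin letters.
final class Slugs {
    static String slugify(String title) {
        String normalized = Normalizer.normalize(title, Normalizer.Form.NFC)
                .toLowerCase(Locale.ROOT);
        return normalized
                // Collapse every run that is not a letter, mark or digit into one hyphen.
                .replaceAll("[^\\p{L}\\p{M}\\p{N}]+", "-")
                // Trim hyphens left over at either end.
                .replaceAll("^-+|-+$", "");
    }
}
```

With this rule a title written entirely in Cyrillic or Hebrew keeps its letters, so the slug is no longer empty, while punctuation and whitespace still collapse to single hyphens.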
The withCache call is not preserving other serialization parameters.
It's quite scary and a nightmare of hard-to-reason-about, imperative gunk. At least a fair number of tests should help to maintain the existing behaviour.
The importer and import manager classes have a fair few warts, mostly going back to their earliest genesis as an EAD-specific hack:
CsvImportManager extends XmlImportManager
CsvImportManager and SaxImportManager
ImportManager interface doesn't do much, and isn't really used
importFile, importFiles signatures and overrides are pretty much just as confusing as when I first wrote them (sorry)
Now that we have a more generic import system it would be nice to tidy this up a bit before looking at things like issue #10.
I am happy to file this under "things to do when bored of hacking CSS", but don't want to tread on your toes @bencomp and @lindareijnhoudt. Thoughts?
Using provenance is too general. Process info is thought to be useful/interesting for the user.
The "hasEvent" relationship label is currently overloaded, with the SystemEvent node being pointed to from both the action and event link nodes. This should be fixed to prevent confusion and ease debugging the complex event relationship stuff.
Currently it doesn't check this. Case in point:
create perm for a country allows creating doc units in any of the repositories (this is correct)
Creating a child item should be seen as a modification.
This is a bit of an open question. If a user tries to access something they can't read due to access restrictions, the current API throws an AccessDenied error. This is upfront about what actually happened, but it allows reasoning about the existence of inaccessible material.
Throwing an ItemNotFound error would be more consistent with the behaviour that we'd get if we use the acl graph wrapper, and arguably more consistent overall.
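The alternative behaviour can be sketched as follows. Everything here is hypothetical (the HiddenItems class, its nested ItemNotFound exception, and the map/set stand-ins for the graph and the ACL check); the point is only that a missing item and an unreadable item become indistinguishable to the caller.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical lookup that does not reveal whether a hidden item exists.
final class HiddenItems {
    static final class ItemNotFound extends RuntimeException {
        ItemNotFound(String id) {
            super("Item not found: " + id);
        }
    }

    // items stands in for the graph, readable for the user's ACL result.
    static String fetch(Map<String, String> items, Set<String> readable, String id) {
        if (!items.containsKey(id) || !readable.contains(id)) {
            // Same error for "does not exist" and "exists but not readable".
            throw new ItemNotFound(id);
        }
        return items.get(id);
    }
}
```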
Since 76e132e for EHRI-CMDLINE and last one
Compress a stream of, say, 10 user item modification events down to one with appropriate metadata.
Almost by definition, we should only concat events when they share:
Write some markdown documenting use of the API via Curl.
The item at the end of the action chain is an event link, and not a subject.
/userProfile/{userId}/watching
/userProfile/{userId}/watching?id={id}
/userProfile/{userId}/watching?id={id}
This should more easily handle invalidation.
I think it would be nice to be clear on and consistent in:
There are files that start with a comment "to change this template, ...". That should go, of course :)
What do you think?
Stripping all non-ascii chars ain't good enough.
Currently, in the TP2.0 tradition, metadata on nodes (such as relationship count caches) is distinguished by an underscore prefix on the key. As of 712468e this no longer gets zapped when items are routinely updated, but the fix is a nasty one. Ultimately it'd be better to maintain metadata like this on a separate node with, say, a hasMeta relationship. This would naturally persist through bundle updates and be a neater solution, albeit perhaps slightly slower when serializing large trees.
The maintenanceEvent type already exists (and is used by repositories) for this purpose, so we can add maintenanceEvents for things like:
This will then be combined (on the frontend) with the online history, which includes data such as:
On the backend side we need to capture pre-import history with maintenanceEvents.
This will allow for more verbose log messages.
The event/action schema is the most complex thing in the graph and we want to be able to validate it thoroughly to ensure that nothing untoward has occurred. This will mean traversing all events from the root and ensuring that subjects and actioners are valid. We might also want to rename the relationship label between events and their predecessors because it's currently not descriptive enough.
At some point <p> nodes must be getting put into a map, because the order they come out doesn't seem to reflect the source data!