The CN index processor component
Create the mapping from schema.org Dataset descriptions to the DataONE Solr index.
This mapping will first be documented in this repo, for development usage, and then added to the DataONE API documentation (i.e. https://github.com/DataONEorg/api-documentation/tree/master/source/design) so that users understand how SO:Dataset descriptions are indexed.
TODO:
Solr 'date' field types contain date and time information.
For fields that contain dates in the index, time of day info is ignored, even though it may be available in the source metadata. However, the value inserted into these Solr fields uses the full dateTime format, i.e. "YYYY-MM-DDThh:mm:ssZ".
For example, for beginDate for pid doi:10.18739/A2599Z17N (EML 2.1.1), time info is included in the metadata, but not captured by the indexer:
beginDate from Solr:
beginDate from the source EML:
<temporalCoverage>
<rangeOfDates>
<beginDate>
<calendarDate>2017-06-09</calendarDate>
<time>20:48:59</time>
</beginDate>
...
The <time>
component is ignored by the indexer, even though a time portion is still inserted into the Solr field (as zeros).
As this field is described as 'The starting date of the temporal range...', is capturing just the date sufficient?
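If capturing the time is deemed worthwhile, a minimal sketch (in Python, with a hypothetical helper name) of combining the EML <calendarDate> and optional <time> values into the full Solr dateTime format could look like this:

```python
from datetime import datetime

def to_solr_datetime(calendar_date, time_str=None):
    """Combine an EML <calendarDate> and optional <time> into the full
    Solr dateTime format, YYYY-MM-DDThh:mm:ssZ. When <time> is absent,
    the time portion defaults to zeros, matching the current indexer
    output. (Hypothetical helper, not the actual indexer code.)"""
    time_part = time_str if time_str else "00:00:00"
    combined = f"{calendar_date}T{time_part}Z"
    # Validate that the combined value parses as a dateTime.
    datetime.strptime(combined, "%Y-%m-%dT%H:%M:%SZ")
    return combined

# Values from the doi:10.18739/A2599Z17N example above:
print(to_solr_datetime("2017-06-09", "20:48:59"))  # 2017-06-09T20:48:59Z
print(to_solr_datetime("2017-06-09"))              # 2017-06-09T00:00:00Z
```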
Verify that the parsing performed by the JsonLdSubprocessor handles the guidance outlined in the v1.2.0 release:
One potential item:
RW reported some objects missing solr indexes today.
We took a look and found that the metadata objects in those packages don't have Solr indexes but the data objects do. In the index_queue table, one metadata object has multiple index tasks with the status In Process. There are no tasks with the status New for the metadata object. In the index_processor log, we found the statement of starting the index process for the metadata object, but could not find the statement of ending the index process.
It is obvious that some silent failures happened and the index tasks' status was never reset to failure.
I also noticed the objects have series ids.
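As a sketch of the cleanup, a hypothetical local copy of the index_queue schema (the real store and its column names may differ) can show how stuck tasks could be found and reset for retry:

```python
import sqlite3

# Minimal sketch with a hypothetical schema; the real index_queue
# table lives elsewhere and its columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_queue (pid TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO index_queue VALUES (?, ?)",
    [("meta1", "In Process"), ("meta1", "In Process"), ("data1", "Complete")],
)

# Find objects stuck with only 'In Process' tasks and no 'New' task,
# then reset those tasks to 'failed' so they can be retried.
stuck = conn.execute(
    """SELECT DISTINCT pid FROM index_queue
       WHERE status = 'In Process'
         AND pid NOT IN (SELECT pid FROM index_queue WHERE status = 'New')"""
).fetchall()
conn.execute(
    "UPDATE index_queue SET status = 'failed' WHERE status = 'In Process'"
)
print(stuck)  # [('meta1',)]
```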
In late 2017, Axiom pointed out what ended up being a bug in the XPath we use to populate the origin
field in our ISOTC211 (ISO19115) indexing component. See https://redmine.dataone.org/issues/8165. The gist of the problem is that we were pulling any ResponsibleParty from the document and treating them as creators/authors which was too broad.
A fix was applied at some point, and I can even see in Redmine 8504 that it was still present in 2018, but it was reverted at some point and the current value in the 2.3 branch is the original XPath.
I'm really not sure what happened but I think we should:
The fixed XPath was:
//gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_ResponsibleParty[gmd:role/gmd:CI_RoleCode/text() = "owner" or gmd:role/gmd:CI_RoleCode/text() = "originator" or gmd:role/gmd:CI_RoleCode/text() = "principalInvestigator" or gmd:role/gmd:CI_RoleCode/text() = "author"]/gmd:individualName/*/text() | //gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_ResponsibleParty[(gmd:role/gmd:CI_RoleCode/text() = "owner" or gmd:role/gmd:CI_RoleCode/text() = "originator" or gmd:role/gmd:CI_RoleCode/text() ="principalInvestigator" or gmd:role/gmd:CI_RoleCode/text() = "author") and (not(gmd:individualName) or gmd:individualName[@gco:nilReason = "missing"])]/gmd:organisationName/*/text()
The SPARQL queries that extract schema.org bounding box coordinates from a geo.box
property only handle single spaces, but need to handle any combination of commas and spaces. Here are some possibilities for box:
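A tokenizer for those separator variants, sketched here in Python (the coordinate names are illustrative, and the real fix belongs in the SPARQL queries), might look like:

```python
import re

def parse_box(box):
    """Split a schema.org geo box string into four floats, tolerating
    any mix of commas and whitespace as separators. Sketch of the
    tokenizing the SPARQL queries would need to replicate; the
    coordinate names below are illustrative only."""
    parts = re.split(r"[,\s]+", box.strip())
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    south, west, north, east = map(float, parts)
    return south, west, north, east

# All of these variants should yield the same coordinates:
for box in ["39.3280 120.1633 40.445 123.7878",
            "39.3280,120.1633,40.445,123.7878",
            "39.3280, 120.1633, 40.445, 123.7878"]:
    print(parse_box(box))
```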
Currently the index processor populates the 'resourceMap' field for each aggregated object in a resource map based on the cito:documents
relationship, putting the resource map id in that field.
This approach doesn't work for metadata-only packages, where there are no objects that have the 'cito:documents' field.
Client software such as the dataone
R package adds a self-referring 'cito:documents' relationship for the metadata object, so that the link from the Solr doc for the metadata object back to the resource map is made.
An approach that we should consider is to use the 'ore:isAggregatedBy' relationship to identify all the package members that should have the resourceMap
field populated, because for metadata-only packages, this relationship is present, as the ORE spec requires.
Things to consider with this change:
See issue NCEAS/metacat#1488 for a complete explanation.
Update the Spring application context files that use the portal and collection schema v1.1.0.
This includes replacing the v1.0.0 schema version with v1.1.0 and adding collection XML elements from the
new version, including <filterGroup>
For changes to metacat that use v1.1.0 schema, see NCEAS/metacat#1499
For info on the v1.1.0 schemas, see https://github.com/DataONEorg/collections-portals-schemas/releases/tag/1.1.0
Our indexer couldn't process the content of the schema.org objects from the Hakai IYS Catalog member node. The error message is:
cn-index-processor-daemon.log.7:[ WARN] 2022-05-05 18:14:15,377 (SolrIndexService:processObject:241) The subprocessor org.dataone.cn.indexer.parser.JsonLdSubprocessor can't process the id sha256:7bf3f2000c610da060004d517032b45d1681b1c88bdbf60ecc290649ceb1d203 since The Processor cannot find the either prefix of https://schema.org/ or http://schema.org/ in the expanded json-ld object.. However, the index still can be achieved without this part of information provided by the processor.
We noticed today that our EML index processing rules in application-context-eml-base.xml
only capture attribute
elements under a dataTable
element (e.g., //dataTable/attributeList/attribute
). However, EML supports attributes under all entity types, not just dataTable
. This is probably confusing to users and something I think we should address.
I suggest we change the relevant XPath selectors in application-context-eml-base.xml
to cover all entity types. I can't think of any downsides to the change, other than having to reindex content. I think all users will want to be able to search for attributes on other entity types.
@datadavev @csjx: Can I get a +1 from you? @mbjones already put in a vote for this change in our salmantics call.
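To illustrate the difference, a toy EML-like fragment (namespaces omitted for brevity) shows how many attribute elements each selector style would capture:

```python
import xml.etree.ElementTree as ET

# Toy fragment with attributes under two different entity types.
eml = """<eml>
  <dataset>
    <dataTable>
      <attributeList><attribute><attributeName>temp</attributeName></attribute></attributeList>
    </dataTable>
    <otherEntity>
      <attributeList><attribute><attributeName>depth</attributeName></attribute></attributeList>
    </otherEntity>
  </dataset>
</eml>"""
root = ET.fromstring(eml)

# Current rule (dataTable only) vs. proposed rule (any entity type).
current = root.findall(".//dataTable/attributeList/attribute")
proposed = root.findall(".//attributeList/attribute")
print(len(current), len(proposed))  # 1 2
```

The otherEntity attribute is invisible to the current selector, which is exactly the gap described above.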
The build that is triggered by checkins is generating errors:
Failed tests:
JsonLdSubprocessorTest.testInsertSchemaOrg:121->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteTwoOverlappedDataPackage:315->verifyFirstOverlapDataPackageIndexed:517->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testArchiveDataPackage:201->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testArchiveScienceMetadataInPackage:170->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteDataPackageWithDuplicatedRelationship:249->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteDataPackagesWithComplicatedRelation:284->verifyComplicatedDataPackageIndexed:601->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteSingleDocFromIndex:114->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDataPackageWithArchivedDoc:340->verifyDataPackageNo1271:403->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testArchiveDataInPackage:139->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteDataPackage:228->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
Some schema.org JSON-LD Dataset descriptions may include a list of values for the Dataset description. For example (truncated for brevity):
"description": [
"The relationship between CO2 flow from soil and soil CO2 concentration was ... ",
"<div class=\"o-metadata__file-usage-entry\"><h4 class=\"o-heading__level3-file-title\">field_data_flow_concentration</h4><div class=\"o-metadata__file-description\">Table describes values of soil CO2 concentration ..."
],
The indexer should:
- Insert "@container":"@list" into the context, as for identifier and creator.
- Index description as an ordered list.
- If concatenation raises issues, then defer concatenation to a later release and use the first value from the list. In this case, create a new issue documenting the need to support concatenation.
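The list handling with its first-value fallback could be sketched as follows (the helper name is hypothetical):

```python
def index_description(description):
    """Sketch of the proposed handling: when description is a list,
    either concatenate the values in order (preferred) or, if
    concatenation is deferred, the caller could instead take only
    description[0]. Hypothetical helper, not the actual indexer code."""
    if isinstance(description, list):
        return " ".join(description)  # preferred: ordered concatenation
    return description

desc = [
    "The relationship between CO2 flow from soil and soil CO2 concentration was ...",
    "Table describes values of soil CO2 concentration ...",
]
print(index_description(desc))
print(index_description("a single description"))
```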
For certain documents, the parsing for schema.org documents is not stripping the datatype off of the 'abstract' field.
See the "Abstract" at https://search-sandbox.test.dataone.org/view/urn%3Auuid%3A4ad54da7-d5c0-4497-91c7-4c004f8a5be2, which has the string "^^https://schema.org/HTML" appended to the end
The source json-ld document has:
"description": {
"@type": "HTML",
"@value": "<p>Winter ecology of larval kril..."
},
So the SPARQL query that extracts "description" -> "abstract" needs to strip off the type for this field, for example:
SELECT ( str(?description) as ?abstract )
instead of
SELECT ( ?description as ?abstract )
This is being done for some values/queries already, but should probably be done for all values.
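What str() does can be illustrated in plain Python over an N-Triples-style literal rendering; the helper below is only an emulation for illustration, not the actual fix (which belongs in the SPARQL queries):

```python
import re

def strip_datatype(literal):
    """Mimic SPARQL str(): drop a trailing ^^<datatype> (or bare
    ^^datatype) marker from an N-Triples-style literal rendering.
    Emulation only; the real fix is str() in the SPARQL SELECT."""
    return re.sub(r"\^\^<?[^<>\s]+>?$", "", literal)

raw = '"Winter ecology of larval krill..."^^https://schema.org/HTML'
print(strip_datatype(raw))  # "Winter ecology of larval krill..."
```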
This issue was transferred from the metacat repo, as the desired solution is to change the SPARQL queries in this repo to address this problem, and not have metacat update documents to use the 'https://schema.org' namespace.
When manually uploading a schema.org document with the JSON-LD context set to
"@context": {
"@vocab": "http://schema.org/"
},
none of the SO:Dataset fields are indexed to Solr.
The reason for this is that when metacat-index serializes the document to RDF/XML, all SO predicates are serialized as that context, for example:
<https://dataone.org/datasets/doi%3A10.18739%2FA2JQ0SW4G> <http://schema.org/datePublished> "2021-01-01T00:00:00Z" .
The SPARQL queries that are used to extract info from the document all use the 'https://schema.org' namespace.
Do we need to support both "http://schema.org" and "https://schema.org"? It looks like the transition from http to https may linger for a long time, e.g. https://schema.org/docs/faq.html#19
Note that the slender node implementation converts harvested documents from "http://schema.org" to "https://schema.org"
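That conversion could be sketched as follows. This is a blunt recursive rewrite for illustration; a real implementation would target IRIs only, since replacing inside arbitrary string values could corrupt literal text:

```python
def normalize_schema_org(obj):
    """Recursively rewrite http://schema.org/ IRIs to
    https://schema.org/ in an expanded JSON-LD structure, so both
    namespaces hit the same SPARQL queries. Illustrative sketch only."""
    if isinstance(obj, dict):
        return {k.replace("http://schema.org/", "https://schema.org/"):
                normalize_schema_org(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_schema_org(v) for v in obj]
    if isinstance(obj, str):
        return obj.replace("http://schema.org/", "https://schema.org/")
    return obj

doc = {"@type": ["http://schema.org/Dataset"],
       "http://schema.org/datePublished": [{"@value": "2021-01-01T00:00:00Z"}]}
print(normalize_schema_org(doc))
```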
If we do support both, then which of the following should be used to implement:
Here are the test docs indexing result:
EML Semantic Annotations are represented in EML using a structure like
<annotation>
<propertyURI label="some label">http://example.com/some_uri</propertyURI>
<valueURI label="some other label">http://example.com/some_other_uri</valueURI>
</annotation>
To provide search for the above metadata, we extract and parse the character data from the valueURI
element as an IRI, query the OntologyModelService for any parent classes for the IRI, and smush all these terms together in the sem_annotation
field. We kept the indexing rules narrowly-focused as a start because we were planning on using EML Semantic Annotations narrowly to start with. It's catching on within our teams and also within external teams and the use is outstripping the implementation.
Over on NCEAS/metacatui#1807, I'm breaking apart the popover widgets we show on dataset landing pages that contain EML Semantic Annotations into two separate popovers: One for the propertyURI
and one for the valueURI
. A key part of that widget is a link that searches for other datasets annotated with the term you're viewing. Because propertyURI
s aren't being expanded and stored in the search index, searches for datasets annotated with a specific propertyURI
don't work.
I propose we expand what we store in the sem_annotation
field to cover the valueURI
and propertyURI
and of course any expanded terms (superclasses for valueURI
and superproperties for propertyURI
). I could see us developing a more structured indexing approach for EML Semantic Annotations but I don't think we need it at this point so I'm opting for the small change.
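The proposed expansion could be sketched as below; the expand callback stands in for the OntologyModelService lookup and is hypothetical, as is the toy parent table:

```python
def sem_annotation_terms(annotations, expand):
    """Collect the proposed sem_annotation field contents: each
    valueURI and propertyURI plus their expansions (superclasses for
    valueURI, superproperties for propertyURI). `expand` stands in
    for the OntologyModelService and is hypothetical."""
    terms = set()
    for prop_uri, value_uri in annotations:
        for uri in (prop_uri, value_uri):
            terms.add(uri)
            terms.update(expand(uri))
    return terms

# Toy expansion table standing in for the ontology service:
parents = {"http://example.com/some_uri": ["http://example.com/parent_prop"],
           "http://example.com/some_other_uri": ["http://example.com/parent_class"]}
expand = lambda uri: parents.get(uri, [])
terms = sem_annotation_terms(
    [("http://example.com/some_uri", "http://example.com/some_other_uri")], expand)
print(sorted(terms))
```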
This change will require re-indexing the ~200-300 EML docs with semantic annotations in them. The number might grow before re-indexing is complete.
v1.0.0 of the MOSAIC ontology is out and we're already annotating with it. It should be added to d1_cn_index_processor and affected content should be reindexed once deployed. We have a crop of new ontologies we should add indexing support for:
Tasks
q=sem_annotation:*
In order to index SO:Dataset documents, a formatId for these documents needs to be added.
The Media Type for JSON-LD documents is application/ld+json
related issue: #3
The slender node processing for schema.org documents inserts a property ("creator") in the @context section of a harvested document to allow the 'creator' properties to be processed correctly as a list. Here is an excerpt from a properly prepared
document:
{
"@context": {
"@vocab": "https://schema.org/",
"creator":{
"@container":"@list",
"@id":"https://schema.org/creator"
}
},
...
During some manual testing, I inadvertently uploaded documents that don't have this fixed-up @context, and saw
that these Solr fields don't get populated as a result: "author", "origin".
@mbjones @datadavev @taojing2002
Given this behaviour, should metacat fix up schema.org documents to contain this section if it hasn't been
included, which could be the case if documents are added directly via R client -> metacat and not via a
slender node?
Note that the "creator" fixup is needed so that RDF/XML serialization of the original json-ld document and SPARQL
query processing can extract creators correctly, as the first creator in a list is extracted as the 'author' field.
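The fixup itself is small; a sketch of what metacat could apply on ingest (assuming a dict-valued @context) is:

```python
import json

def fixup_creator_context(doc):
    """Add the '@container':'@list' creator entry to a schema.org
    JSON-LD document's @context if it is missing, so that creator
    order survives RDF/XML serialization and the first creator can
    be extracted as 'author'. Sketch of the fixup the slender node
    already performs, not actual metacat code."""
    ctx = doc.setdefault("@context", {})
    if isinstance(ctx, dict) and "creator" not in ctx:
        ctx["creator"] = {"@container": "@list",
                          "@id": "https://schema.org/creator"}
    return doc

doc = {"@context": {"@vocab": "https://schema.org/"},
       "creator": [{"name": "A"}, {"name": "B"}]}
print(json.dumps(fixup_creator_context(doc)["@context"], indent=2))
```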
I'll keep this issue slim since it's a pair with NCEAS/metacat#926. The indexing code doesn't support a facility like EML references. As it's a core part of the EML schema, we should support this. Right now, when references are used in fields like creator, the corresponding index field is empty.
Indexing rules (Spring beans) need to be added to application-context-schema-org.xml
to populate the geohash_*
and text
fields. The existing classes should be reused to accomplish this; e.g., for EML these classes are used:
<bean id="eml.geohashRoot" class="org.dataone.cn.indexer.parser.utility.RootElement" p:name="geohashRoot"
<bean id="eml.text" class="org.dataone.cn.indexer.parser.FullTextSolrField">
<bean id="eml.fullText" class="org.dataone.cn.indexer.parser.AggregateSolrField" >
(see main/resources/application-context-eml-base.xml for details of EML bean definitions)
DataONE CN indexing will support indexing of schema.org records that contain schema:Dataset descriptions as recommended in the Google Search Guide for Datasets. Additional recommendations are included from the ESIP Federation schema.org cluster in their "Science on schema.org" Dataset guide.
Documents will be harvested to a special DataONE SlenderNode from participating repositories. The DataSet descriptions are harvested from repository dataset landing pages, by extracting the JSON-LD text from an HTML <script> element.
Amend the indexer rule for populating serviceEndpoint
to add an entry for the contentUrl
if present in a distribution
entry of type DataDownload
in schema.org/Dataset
JSON-LD metadata.
For example:
"distribution": {
"@type": "DataDownload",
"contentUrl": "http://datadryad.org/api/v2/datasets/doi%253A10.5061%252Fdryad.5qb78/download",
"encodingFormat": "application/zip"
},
The URL http://datadryad.org/api/v2/datasets/doi%253A10.5061%252Fdryad.5qb78/download
should be added to the list of values for serviceEndpoint
.
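The extraction logic can be sketched as follows (the helper name is hypothetical, and both single-object and list-valued distribution entries are handled):

```python
def content_urls(dataset):
    """Collect contentUrl values from DataDownload distribution
    entries of a schema.org Dataset dict, for addition to the
    serviceEndpoint field. Hypothetical helper for illustration."""
    dist = dataset.get("distribution", [])
    if isinstance(dist, dict):  # single entry rather than a list
        dist = [dist]
    return [d["contentUrl"] for d in dist
            if d.get("@type") == "DataDownload" and "contentUrl" in d]

# The Dryad example from above:
dataset = {"distribution": {
    "@type": "DataDownload",
    "contentUrl": "http://datadryad.org/api/v2/datasets/doi%253A10.5061%252Fdryad.5qb78/download",
    "encodingFormat": "application/zip"}}
print(content_urls(dataset))
```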