The CN index processor component
Create the mapping from schema.org Dataset descriptions to the DataONE Solr index.
This mapping will first be documented in this repo, for development usage, and then added to the DataONE API documentation (i.e. https://github.com/DataONEorg/api-documentation/tree/master/source/design) so that users understand how SO:Dataset descriptions are indexed.
TODO:
Solr 'date' field types contain date and time information.
For fields that contain dates in the index, time of day info is ignored, even though it may be available in the source metadata. However, the value inserted into these Solr fields uses the full dateTime format, i.e. "YYYY-MM-DDThh:mm:ssZ".
For example, for beginDate for pid doi:10.18739/A2599Z17N (EML 2.1.1), time info is included in the metadata, but not captured by the indexer:
beginDate from Solr:
beginDate from the source EML:
<temporalCoverage>
<rangeOfDates>
<beginDate>
<calendarDate>2017-06-09</calendarDate>
<time>20:48:59</time>
</beginDate>
...
The <time>
component is ignored by the indexer, even though a time portion is still inserted into the Solr field (as zeros).
As this field is described as 'The starting date of the temporal range...', is capturing just the date sufficient?
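If capturing the time is deemed worthwhile, a minimal sketch (in Python, with a hypothetical helper name) of combining the EML <calendarDate> and optional <time> values into the full Solr dateTime format could look like this:

```python
from datetime import datetime

def to_solr_datetime(calendar_date, time_str=None):
    """Combine an EML <calendarDate> and optional <time> into the full
    Solr dateTime format, YYYY-MM-DDThh:mm:ssZ. When <time> is absent,
    the time portion defaults to zeros, matching the current indexer
    output. (Hypothetical helper, not the actual indexer code.)"""
    time_part = time_str if time_str else "00:00:00"
    combined = f"{calendar_date}T{time_part}Z"
    # Validate that the combined value parses as a dateTime.
    datetime.strptime(combined, "%Y-%m-%dT%H:%M:%SZ")
    return combined

# Values from the doi:10.18739/A2599Z17N example above:
print(to_solr_datetime("2017-06-09", "20:48:59"))  # 2017-06-09T20:48:59Z
print(to_solr_datetime("2017-06-09"))              # 2017-06-09T00:00:00Z
```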
Verify that the parsing performed by the JsonLdSubprocessor handles the guidance outlined in the v1.2.0 release:
One potential item:
RW reported some objects missing solr indexes today.
We took a look and found that the metadata objects in those packages don't have Solr indexes but the data objects do. In the index_queue table, one metadata object has multiple index tasks with the status In Process. There are no tasks with the status New for the metadata object. In the index_processor log, we found the statement of starting the index process for the metadata object, but could not find the statement of ending the index process.
It is obvious that some silent failures happened and the index tasks' status was never reset to failure.
I also noticed the objects have series ids.
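As a sketch of the cleanup, a hypothetical local copy of the index_queue schema (the real store and its column names may differ) can show how stuck tasks could be found and reset for retry:

```python
import sqlite3

# Minimal sketch with a hypothetical schema; the real index_queue
# table lives elsewhere and its columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_queue (pid TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO index_queue VALUES (?, ?)",
    [("meta1", "In Process"), ("meta1", "In Process"), ("data1", "Complete")],
)

# Find objects stuck with only 'In Process' tasks and no 'New' task,
# then reset those tasks to 'failed' so they can be retried.
stuck = conn.execute(
    """SELECT DISTINCT pid FROM index_queue
       WHERE status = 'In Process'
         AND pid NOT IN (SELECT pid FROM index_queue WHERE status = 'New')"""
).fetchall()
conn.execute(
    "UPDATE index_queue SET status = 'failed' WHERE status = 'In Process'"
)
print(stuck)  # [('meta1',)]
```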
In late 2017, Axiom pointed out what ended up being a bug in the XPath we use to populate the origin
field in our ISOTC211 (ISO19115) indexing component. See https://redmine.dataone.org/issues/8165. The gist of the problem is that we were pulling any ResponsibleParty from the document and treating them as creators/authors which was too broad.
A fix was applied at some point, and I can even see in Redmine 8504 that it was still present in 2018, but it was reverted at some point and the current value in the 2.3 branch is the original XPath.
I'm really not sure what happened but I think we should:
The fixed XPath was:
//gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_ResponsibleParty[gmd:role/gmd:CI_RoleCode/text() = "owner" or gmd:role/gmd:CI_RoleCode/text() = "originator" or gmd:role/gmd:CI_RoleCode/text() = "principalInvestigator" or gmd:role/gmd:CI_RoleCode/text() = "author"]/gmd:individualName/*/text() | //gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:citedResponsibleParty/gmd:CI_ResponsibleParty[(gmd:role/gmd:CI_RoleCode/text() = "owner" or gmd:role/gmd:CI_RoleCode/text() = "originator" or gmd:role/gmd:CI_RoleCode/text() ="principalInvestigator" or gmd:role/gmd:CI_RoleCode/text() = "author") and (not(gmd:individualName) or gmd:individualName[@gco:nilReason = "missing"])]/gmd:organisationName/*/text()
The SPARQL queries that extract schema.org bounding box coordinates from a geo.box
property only handle single spaces, but need to handle any combination of commas and spaces. Here are some possibilities for box:
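A tokenizer for those separator variants, sketched here in Python (the coordinate names are illustrative, and the real fix belongs in the SPARQL queries), might look like:

```python
import re

def parse_box(box):
    """Split a schema.org geo box string into four floats, tolerating
    any mix of commas and whitespace as separators. Sketch of the
    tokenizing the SPARQL queries would need to replicate; the
    coordinate names below are illustrative only."""
    parts = re.split(r"[,\s]+", box.strip())
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    south, west, north, east = map(float, parts)
    return south, west, north, east

# All of these variants should yield the same coordinates:
for box in ["39.3280 120.1633 40.445 123.7878",
            "39.3280,120.1633,40.445,123.7878",
            "39.3280, 120.1633, 40.445, 123.7878"]:
    print(parse_box(box))
```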
Currently the index processor populates the 'resourceMap' field for each aggregated object in a resource map based on the cito:documents
relationship, putting the resource map id in that field.
This approach doesn't work for metadata-only packages, where there are no objects that have the 'cito:documents' field.
Client software such as the dataone
R package adds a self-referring 'cito:documents' relationship for the metadata object, so that the link from the Solr doc for the metadata object back to the resource map is made.
An approach that we should consider is to use the 'ore:isAggregatedBy' relationship to identify all the package members that should have the resourceMap
field populated, because for metadata-only packages, this relationship is present, as the ORE spec requires.
Things to consider with this change:
See issue NCEAS/metacat#1488 for a complete explanation.
Update the Spring application context files that use the portal and collection schema v1.1.0.
This includes replacing the v1.0.0 schema version with v1.1.0 and adding collection XML elements from the
new version, including <filterGroup>
For changes to metacat that use v1.1.0 schema, see NCEAS/metacat#1499
For info on the v1.1.0 schemas, see https://github.com/DataONEorg/collections-portals-schemas/releases/tag/1.1.0
Our indexer couldn't process the content of the schema.org objects from the Hakai IYS Catalog member node. The error message is:
cn-index-processor-daemon.log.7:[ WARN] 2022-05-05 18:14:15,377 (SolrIndexService:processObject:241) The subprocessor org.dataone.cn.indexer.parser.JsonLdSubprocessor can't process the id sha256:7bf3f2000c610da060004d517032b45d1681b1c88bdbf60ecc290649ceb1d203 since The Processor cannot find the either prefix of https://schema.org/ or http://schema.org/ in the expanded json-ld object.. However, the index still can be achieved without this part of information provided by the processor.
We noticed today that our EML index processing rules in application-context-eml-base.xml
only capture attribute
elements under a dataTable
element (e.g., //dataTable/attributeList/attribute
). However, EML supports attributes under all entity types, not just dataTable
. This is probably confusing to users and something I think we should address.
I suggest we change the relevant XPath selectors in application-context-eml-base.xml
to cover all entity types. I can't think of any downsides to the change, other than having to reindex content. I think all users will want to be able to search for attributes on other entity types.
@datadavev @csjx: Can I get a +1 from you? @mbjones already put in a vote for this change in our salmantics call.
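To illustrate the difference, a toy EML-like fragment (namespaces omitted for brevity) shows how many attribute elements each selector style would capture:

```python
import xml.etree.ElementTree as ET

# Toy fragment with attributes under two different entity types.
eml = """<eml>
  <dataset>
    <dataTable>
      <attributeList><attribute><attributeName>temp</attributeName></attribute></attributeList>
    </dataTable>
    <otherEntity>
      <attributeList><attribute><attributeName>depth</attributeName></attribute></attributeList>
    </otherEntity>
  </dataset>
</eml>"""
root = ET.fromstring(eml)

# Current rule (dataTable only) vs. proposed rule (any entity type).
current = root.findall(".//dataTable/attributeList/attribute")
proposed = root.findall(".//attributeList/attribute")
print(len(current), len(proposed))  # 1 2
```

The otherEntity attribute is invisible to the current selector, which is exactly the gap described above.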
The build that is triggered by checkins is generating errors:
Failed tests:
JsonLdSubprocessorTest.testInsertSchemaOrg:121->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteTwoOverlappedDataPackage:315->verifyFirstOverlapDataPackageIndexed:517->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testArchiveDataPackage:201->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testArchiveScienceMetadataInPackage:170->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteDataPackageWithDuplicatedRelationship:249->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteDataPackagesWithComplicatedRelation:284->verifyComplicatedDataPackageIndexed:601->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteSingleDocFromIndex:114->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDataPackageWithArchivedDoc:340->verifyDataPackageNo1271:403->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testArchiveDataInPackage:139->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
SolrIndexDeleteTest.testDeleteDataPackage:228->verifyTestDataPackageIndexed:718->DataONESolrJettyTestBase.assertPresentInSolrIndex:93->Assert.assertFalse:79->Assert.assertFalse:68->Assert.assertTrue:43->Assert.fail:92 null
Some schema.org JSON-LD Dataset descriptions may include a list of values for the Dataset description. For example (truncated for brevity):
"description": [
"The relationship between CO2 flow from soil and soil CO2 concentration was ... ",
"<div class=\"o-metadata__file-usage-entry\"><h4 class=\"o-heading__level3-file-title\">field_data_flow_concentration</h4><div class=\"o-metadata__file-description\">Table describes values of soil CO2 concentration ..."
],
The indexer should:
- Insert "@container":"@list" into the context, as for identifier and creator.
- Index description as an ordered list.
- If concatenation raises issues, then defer concatenation to a later release and use the first value from the list. In this case, create a new issue documenting the need to support concatenation.
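The list handling with its first-value fallback could be sketched as follows (the helper name is hypothetical):

```python
def index_description(description):
    """Sketch of the proposed handling: when description is a list,
    either concatenate the values in order (preferred) or, if
    concatenation is deferred, the caller could instead take only
    description[0]. Hypothetical helper, not the actual indexer code."""
    if isinstance(description, list):
        return " ".join(description)  # preferred: ordered concatenation
    return description

desc = [
    "The relationship between CO2 flow from soil and soil CO2 concentration was ...",
    "Table describes values of soil CO2 concentration ...",
]
print(index_description(desc))
print(index_description("a single description"))
```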
For certain documents, the parsing for schema.org documents is not stripping the datatype off of the 'abstract' field.
See the "Abstract" at https://search-sandbox.test.dataone.org/view/urn%3Auuid%3A4ad54da7-d5c0-4497-91c7-4c004f8a5be2, which has the string "^^https://schema.org/HTML" appended to the end
The source json-ld document has:
"description": {
"@type": "HTML",
"@value": "<p>Winter ecology of larval kril..."
},
So the SPARQL query that extracts "description" -> "abstract" needs to strip off the type for this field, for example:
SELECT ( str(?description) as ?abstract )
instead of
SELECT ( ?description as ?abstract )
This is being done for some values/queries already, but should probably be done for all values.
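What str() does can be illustrated in plain Python over an N-Triples-style literal rendering; the helper below is only an emulation for illustration, not the actual fix (which belongs in the SPARQL queries):

```python
import re

def strip_datatype(literal):
    """Mimic SPARQL str(): drop a trailing ^^<datatype> (or bare
    ^^datatype) marker from an N-Triples-style literal rendering.
    Emulation only; the real fix is str() in the SPARQL SELECT."""
    return re.sub(r"\^\^<?[^<>\s]+>?$", "", literal)

raw = '"Winter ecology of larval krill..."^^https://schema.org/HTML'
print(strip_datatype(raw))  # "Winter ecology of larval krill..."
```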
This issue was transferred from the metacat repo, as the desired solution is to change the SPARQL queries in this repo to address this problem, and not have metacat update documents to use the 'https://schema.org' namespace.
When manually uploading a schema.org document with the JSON-LD context set to
"@context": {
"@vocab": "http://schema.org/"
},
none of the SO:Dataset fields are indexed to Solr.
The reason for this is that when metacat-index serializes the document to RDF/XML, all SO predicates are serialized as that context, for example:
<https://dataone.org/datasets/doi%3A10.18739%2FA2JQ0SW4G> <http://schema.org/datePublished> "2021-01-01T00:00:00Z" .
The SPARQL queries that are used to extract info from the document all use the 'https://schema.org' namespace.
Do we need to support both "http://schema.org" and "https://schema.org"? It looks like the transition from http to https may linger for a long time, e.g. https://schema.org/docs/faq.html#19
Note that the slender node implementation converts harvested documents from "http://schema.org" to "https://schema.org"
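That conversion could be sketched as follows. This is a blunt recursive rewrite for illustration; a real implementation would target IRIs only, since replacing inside arbitrary string values could corrupt literal text:

```python
def normalize_schema_org(obj):
    """Recursively rewrite http://schema.org/ IRIs to
    https://schema.org/ in an expanded JSON-LD structure, so both
    namespaces hit the same SPARQL queries. Illustrative sketch only."""
    if isinstance(obj, dict):
        return {k.replace("http://schema.org/", "https://schema.org/"):
                normalize_schema_org(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_schema_org(v) for v in obj]
    if isinstance(obj, str):
        return obj.replace("http://schema.org/", "https://schema.org/")
    return obj

doc = {"@type": ["http://schema.org/Dataset"],
       "http://schema.org/datePublished": [{"@value": "2021-01-01T00:00:00Z"}]}
print(normalize_schema_org(doc))
```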
If we do support both, then which of the following should be used to implement:
Here are the test docs indexing result:
EML Semantic Annotations are represented in EML using a structure like
<annotation>
<propertyURI label="some label">http://example.com/some_uri</propertyURI>
<valueURI label="some other label">http://example.com/some_other_uri</valueURI>
</annotation>
To provide search for the above metadata, we extract and parse the character data from the valueURI
element as an IRI, query the OntologyModelService for any parent classes for the IRI, and smush all these terms together in the sem_annotation
field. We kept the indexing rules narrowly-focused as a start because we were planning on using EML Semantic Annotations narrowly to start with. It's catching on within our teams and also within external teams and the use is outstripping the implementation.
Over on NCEAS/metacatui#1807, I'm breaking apart the popover widgets we show on dataset landing pages that contain EML Semantic Annotations into two separate popovers: One for the propertyURI
and one for the valueURI
. A key part of that widget is a link that searches for other datasets annotated with the term you're viewing. Because propertyURI
s aren't being expanded and stored in the search index, searches for datasets annotated with a specific propertyURI
don't work.
I propose we expand what we store in the sem_annotation
field to cover the valueURI
and propertyURI
and of course any expanded terms (superclasses for valueURI
and superproperties for propertyURI
). I could see us developing a more structured indexing approach for EML Semantic Annotations but I don't think we need it at this point so I'm opting for the small change.
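The proposed expansion could be sketched as below; the expand callback stands in for the OntologyModelService lookup and is hypothetical, as is the toy parent table:

```python
def sem_annotation_terms(annotations, expand):
    """Collect the proposed sem_annotation field contents: each
    valueURI and propertyURI plus their expansions (superclasses for
    valueURI, superproperties for propertyURI). `expand` stands in
    for the OntologyModelService and is hypothetical."""
    terms = set()
    for prop_uri, value_uri in annotations:
        for uri in (prop_uri, value_uri):
            terms.add(uri)
            terms.update(expand(uri))
    return terms

# Toy expansion table standing in for the ontology service:
parents = {"http://example.com/some_uri": ["http://example.com/parent_prop"],
           "http://example.com/some_other_uri": ["http://example.com/parent_class"]}
expand = lambda uri: parents.get(uri, [])
terms = sem_annotation_terms(
    [("http://example.com/some_uri", "http://example.com/some_other_uri")], expand)
print(sorted(terms))
```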
This change will require re-indexing the ~200-300 EML docs with semantic annotations in them. The number might grow before re-indexing is complete.
v1.0.0 of the MOSAIC ontology is out and we're already annotating with it. It should be added to d1_cn_index_processor and affected content should be reindexed once deployed. We have a crop of new ontologies we should add indexing support for:
Tasks
q=sem_annotation:*
In order to index SO:Dataset documents, a formatId for these documents needs to be added.
The Media Type for JSON-LD documents is application/ld+json
related issue: #3
The slender node processing for schema.org documents inserts a property ("creator") in the @context section of a harvested document to allow the 'creator' properties to be processed correctly as a list. Here is an excerpt from a properly prepared
document:
{
"@context": {
"@vocab": "https://schema.org/",
"creator":{
"@container":"@list",
"@id":"https://schema.org/creator"
}
},
...
During some manual testing, I inadvertently uploaded documents that don't have this fixed-up @context, and saw
that these Solr fields don't get populated as a result: "author", "origin".
@mbjones @datadavev @taojing2002
Given this behaviour, should metacat fix up schema.org documents to contain this section if it hasn't been
included, which could be the case if documents are added directly via R client -> metacat and not via a
slender node?
Note that the "creator" fixup is needed so that RDF/XML serialization of the original json-ld document and SPARQL
query processing can extract creators correctly, as the first creator in a list is extracted as the 'author' field.
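The fixup itself is small; a sketch of what metacat could apply on ingest (assuming a dict-valued @context) is:

```python
import json

def fixup_creator_context(doc):
    """Add the '@container':'@list' creator entry to a schema.org
    JSON-LD document's @context if it is missing, so that creator
    order survives RDF/XML serialization and the first creator can
    be extracted as 'author'. Sketch of the fixup the slender node
    already performs, not actual metacat code."""
    ctx = doc.setdefault("@context", {})
    if isinstance(ctx, dict) and "creator" not in ctx:
        ctx["creator"] = {"@container": "@list",
                          "@id": "https://schema.org/creator"}
    return doc

doc = {"@context": {"@vocab": "https://schema.org/"},
       "creator": [{"name": "A"}, {"name": "B"}]}
print(json.dumps(fixup_creator_context(doc)["@context"], indent=2))
```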
I'll keep this issue slim since it's a pair with NCEAS/metacat#926. The indexing code doesn't support a facility like EML references. As it's a core part of the EML schema, we should support this. Right now, when references are used in fields like creator, the corresponding index field is empty.
Indexing rules (Spring beans) need to be added to application-context-schema-org.xml
to populate the geohash_*
and text
fields. The existing classes should be reused to accomplish this; e.g., for EML these classes are used:
<bean id="eml.geohashRoot" class="org.dataone.cn.indexer.parser.utility.RootElement" p:name="geohashRoot"
<bean id="eml.text" class="org.dataone.cn.indexer.parser.FullTextSolrField">
<bean id="eml.fullText" class="org.dataone.cn.indexer.parser.AggregateSolrField" >
(see main/resources/application-context-eml-base.xml for details of EML bean definitions)
DataONE CN indexing will support indexing of schema.org records that contain schema:Dataset descriptions as recommended in the Google Search Guide for Datasets. Additional recommendations are included from the ESIP Federation schema.org cluster in their "Science on schema.org" Dataset guide.
Documents will be harvested to a special DataONE SlenderNode from participating repositories. The DataSet descriptions are harvested from repository dataset landing pages, by extracting the JSON-LD text from an HTML <script> element.
Amend the indexer rule for populating serviceEndpoint
to add an entry for the contentUrl
if present in a distribution
entry of type DataDownload
in schema.org/Dataset
JSON-LD metadata.
For example:
"distribution": {
"@type": "DataDownload",
"contentUrl": "http://datadryad.org/api/v2/datasets/doi%253A10.5061%252Fdryad.5qb78/download",
"encodingFormat": "application/zip"
},
The URL http://datadryad.org/api/v2/datasets/doi%253A10.5061%252Fdryad.5qb78/download
should be added to the list of values for serviceEndpoint
.
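The extraction logic can be sketched as follows (the helper name is hypothetical, and both single-object and list-valued distribution entries are handled):

```python
def content_urls(dataset):
    """Collect contentUrl values from DataDownload distribution
    entries of a schema.org Dataset dict, for addition to the
    serviceEndpoint field. Hypothetical helper for illustration."""
    dist = dataset.get("distribution", [])
    if isinstance(dist, dict):  # single entry rather than a list
        dist = [dist]
    return [d["contentUrl"] for d in dist
            if d.get("@type") == "DataDownload" and "contentUrl" in d]

# The Dryad example from above:
dataset = {"distribution": {
    "@type": "DataDownload",
    "contentUrl": "http://datadryad.org/api/v2/datasets/doi%253A10.5061%252Fdryad.5qb78/download",
    "encodingFormat": "application/zip"}}
print(content_urls(dataset))
```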