Comments (5)
Since our 'text' field is supposed to store the "full text of the metadata record", I'd vote for the first. If the size of what we're tossing in Solr for JSON-LD/SOSO docs is a problem, we also have that problem for EML and ISO docs so I'd say it's not really a problem here.
from d1_cn_index_processor.
For the Solr 'text' field that is derived from an SO document, there are two different approaches the indexer can employ to extract the required values from the document to populate the Solr field:
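- Use a SPARQL query that returns all string values from the SO document (where the RDF triple object is a literal value). This can be accomplished with the query:

      # Prefix IRIs are assumed here; SO must match the namespace used by the indexed document.
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
      PREFIX SO: <https://schema.org/>
      SELECT DISTINCT (str(?string) as ?text)
      WHERE {
        {
          ?a ?b ?string .
          filter(datatype(?string) = xsd:string) .
        }
        UNION
        {
          ?a ?b ?string .
          filter(datatype(?string) = SO:HTML) .
        }
      }
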
  - The downside of this approach is that every string in the document is retrieved, including strings that do not populate any Solr field.
  - This is the approach used for EML document indexing.
- Concatenate the values of an explicit list of fields that we already retrieve from the document. For example, we would concatenate the values for 'title', 'abstract', 'keywords', etc., using the queries that are already defined for these fields.
  - The downside of this approach is that the list of fields may need to be updated in the future, e.g. for 'license' when it is added to the index.
Note that these two approaches produce different values for the Solr 'text' field.
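
To make the first approach concrete, here is a rough sketch in Python using only the standard library. It is not the indexer's implementation: it walks a JSON-LD document as plain JSON instead of running SPARQL over RDF, so it only approximates the "every string literal" behaviour and cannot tell SO:HTML values from other strings. The sample document and the names `DOC`, `collect_strings`, and `build_text` are made up for illustration.

```python
import json

# Hypothetical, heavily trimmed schema.org Dataset document (not from this thread).
DOC = json.loads("""
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Dataset",
  "@id": "https://example.org/dataset/1",
  "name": "Larval krill studies",
  "keywords": ["krill", "biota"],
  "variableMeasured": [
    {"@type": "PropertyValue", "name": "lat", "unitText": "decimal degrees"}
  ],
  "version": 1
}
""")

# JSON-LD keywords whose values are context or IRIs, not string literals.
JSONLD_KEYWORDS = {"@context", "@id", "@type"}


def collect_strings(node):
    """Yield every string value under node, depth first, in document order."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from collect_strings(item)
    elif isinstance(node, dict):
        for key, value in node.items():
            if key not in JSONLD_KEYWORDS:
                yield from collect_strings(value)
    # Numbers, booleans and null are skipped, like non-string datatypes in the query.


def build_text(doc):
    """Join the distinct strings in first-seen order (loosely mirrors SELECT DISTINCT)."""
    distinct = dict.fromkeys(collect_strings(doc))
    return " ".join(distinct)


text = build_text(DOC)
print(text)  # Larval krill studies krill biota lat decimal degrees
```

Note how 'lat' and 'decimal degrees' end up in the result even though no dedicated Solr field may hold them, which is the downside described for this approach.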
@mbjones @datadavev @taojing2002 which method should be implemented?
from d1_cn_index_processor.
The approaches described above will return these values for the Solr "text" field:
- When using a single SPARQL query that returns all strings from the document (word count: 3680):
https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R https://www.example-data-repository.org
Example Data Repository yrday_local http://lod.example-data-repository.org/id/dataset-parameter/20879
local day and decimal time, as 326.5 for the 326th day of the year, or November 22 at 1200 hours (noon)
latitude, in decimal degrees, North is positive, negative denotes South time_sample
http://lod.example-data-repository.org/id/dataset-parameter/20863 minutes Number of minutes between collection and sampling for pigment content;
decline of pigment content with time was used to calculate time to clear the gut of pigment.
text/tab-separated-values 2010-02-03 https://www.example-data-repository.org/dataset/3300/data/larval-krill.tsv
Spatial Reference System http://www.wikidata.org/entity/Q161779 http://www.opengis.net/def/crs/OGC/1.3/CRS84
lat http://lod.example-data-repository.org/id/dataset-parameter/20874 decimal degrees https://www.example-data-repository.org/dataset/3300
Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106, LMG0205 in the Southern Ocean from 2001-2002 (SOGLOBEC project)
Hand-held plankton net Manual Biota Sampler oceans krill biota larval krill pigments Quetin, L., Ross, R. (2010) Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106,
LMG0205 in the Southern Ocean from 2001-2002 (SOGLOBEC project). Example Data Repository. Version 1. doi:10.1234/1234567890 [access date]
2001-08-06/2002-09-09 1 https://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1234/1234567890 year
http://lod.example-data-repository.org/id/dataset-parameter/20861 calendar year month_local
http://lod.example-data-repository.org/id/dataset-parameter/20877 cruiseid http://lod.example-data-repository.org/id/dataset-parameter/20860
text sample_id http://lod.example-data-repository.org/id/dataset-parameter/20862 day_local
http://lod.example-data-repository.org/id/dataset-parameter/20876 stage_id http://lod.example-data-repository.org/id/dataset-parameter/20865
NSF Antarctic Sciences NSF ANT pigment_content http://lod.example-data-repository.org/id/dataset-parameter/20864 micrograms
total chl/grams wet weight https://registry.identifiers.org/registry/doi doi:10.1234/1234567890 http://doi.org/abcd
https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R wet_weight http://lod.example-data-repository.org/id/dataset-parameter/20866
mg lon http://lod.example-data-repository.org/id/dataset-parameter/20875 time_local http://lod.example-data-repository.org/id/dataset-parameter/20878
https://www.example-data-repository.org/person/51160 Dr Robin Ross month, local time https://www.example-data-repository.org/person/51159
Dr Langdon Quetin -68.4817 -75.8183 -65.08 -68.5033 well-known text (WKT) representation of geometry http://www.wikidata.org/entity/Q4018860
POLYGON ((-75.8183 -68.4817, -68.5033 -68.4817, -68.5033 -65.08, -75.8183 -65.08, -75.8183 -68.4817)) pigment content cruise identification year of experiment day of month,
local time longitude, in decimal degrees, East is positive, negative denotes West time of day, local time, using 2400 clock format sample identification:
WBC=whole body clearance expt.; WBF=whole body fluorescence on collection stage development index of larvae in sample
(furcilia = F1-6 = 1-6, juvenile = J=7) Dr Roberta Marinelli https://orcid.org/0000-0001-7775-xxxx average wet weight/larvae in sample
ANT-9909933 https://www.example-data-repository.org/award/55102 http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=9909933
Winter ecology of larval krill: quantifying their interaction with the pack ice habitat
- When using string values from only defined Solr fields (word count: 869):
NSF Antarctic Sciences https://example.org/executions/execution-42 biota 2002-09-09T00:00:00.000Z Dr Langdon Quetin
https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R Dr Robin Ross 1 Winter ecology of larval
krill: quantifying their interaction with the pack ice habitat. larval krill pigments https://www.example-data-repository.org/dataset/3300/data/larval-krill.tsv
Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106, LMG0205 in the Southern Ocean from 2001-2002
(SOGLOBEC project) 2010-02-03T00:00:00.000Z 2001-08-06T00:00:00.000Z https://doi.org/10.xxxx/Dataset-1
https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R http://purl.dataone.org/provone/2015/01/15/ontology#Data
lon https://somerepository.org/datasets/10.xxxx/Dataset-101 https://example.org/executions/execution-101
The second technique includes these fields:
- abstract, title, label, awardNumber, awardTitle, author, pubDate, authorGivenName, authorLastName, funderIdentifier, funderName, origin, hasPart, keywords, investigator, prov_hasDerivations, prov_instanceOfClass, prov_usedByExecution, prov_usedByProgram, prov_wasDerivedFrom, prov_generatedByExecution, prov_generatedByProgram, namedLocation, beginDate, endDate, parameter, edition, serviceEndpoint
It does not include these fields:
- eastBoundCoord, westBoundCoord, southBoundCoord, northBoundCoord, geohash*
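
As an illustration of the second technique, the sketch below concatenates values from an explicit list of fields and refuses the coordinate and geohash fields even if they are added to that list by mistake. It is a standalone Python sketch under stated assumptions, not the indexer's code: `TEXT_FIELDS` is only a subset of the list above, and `EXCLUDED_PATTERNS`, `text_from_fields`, and the sample values are hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the included field names listed above.
TEXT_FIELDS = ["title", "abstract", "keywords", "author", "funderName"]

# Glob patterns for the fields that are left out of 'text' (assumed spelling).
EXCLUDED_PATTERNS = ["*BoundCoord", "geohash*"]


def is_excluded(field):
    """True if the field name matches one of the excluded patterns."""
    return any(fnmatchcase(field, pattern) for pattern in EXCLUDED_PATTERNS)


def text_from_fields(index_fields, text_fields=TEXT_FIELDS):
    """Concatenate the values of the selected fields, in list order."""
    parts = []
    for name in text_fields:
        if is_excluded(name):
            continue  # never copy coordinates or geohashes into 'text'
        value = index_fields.get(name)
        if value is None:
            continue  # field not populated for this document
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return " ".join(parts)


# Made-up index values for one document.
solr_doc = {
    "title": "Larval krill studies",
    "keywords": ["krill", "biota"],
    "author": "Dr Langdon Quetin",
    "northBoundCoord": -65.08,
    "geohash_1": "4",
}

text = text_from_fields(solr_doc)
print(text)  # Larval krill studies krill biota Dr Langdon Quetin
```

A field that is added to the index later, such as 'license', only reaches the result once its name is appended to `TEXT_FIELDS`, which is the maintenance cost noted for this approach.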
from d1_cn_index_processor.
I'd vote for the first too, and agree with Bryce's reasoning.
from d1_cn_index_processor.
Indexing of the 'text' field for schema.org documents added in commit 1d4bda6
from d1_cn_index_processor.