
Comments (5)

amoeba commented on August 14, 2024

Since our text field is supposed to store the "full text of the metadata record", I'd vote for the first. If the size of what we're tossing in Solr for JSON-LD/SOSO docs is a problem, we also have that problem for EML and ISO docs so I'd say it's not really a problem here.

from d1_cn_index_processor.

gothub commented on August 14, 2024

For the Solr 'text' field that is derived from an SO document, there are two different approaches the indexer can employ to extract the required values from the document to populate the Solr field:

  1. Use a SPARQL query that returns all string values from the SO document (every RDF triple whose object is a string literal). This can be accomplished with the query:
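# Prefix declarations added so the query parses on its own; the SO namespace IRI is an assumption.
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX SO: <http://schema.org/>
SELECT DISTINCT (str(?string) as ?text)
WHERE {
    {
        ?a ?b ?string .
        filter(datatype(?string) = xsd:string) .
    }
    UNION
    {
        ?a ?b ?string .
        filter(datatype(?string) = SO:HTML) .
    }
}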
  • the downside of this approach is that every string in the document is retrieved, including values that are not used for any other Solr field.
  • this is the approach used for EML document indexing.
  2. Concatenate the values of a specific list of fields that we already retrieve explicitly from the document. For example, we would concatenate the values for 'title', 'abstract', 'keywords', etc., using the queries that are already defined for these fields.
  • the downside of this approach is that the list of items to retrieve may need to be updated in the future, e.g. for 'license' when it is added to the index.

Note that these two approaches produce different values for the Solr 'text' field.
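To make the difference concrete, here is a minimal Python sketch that contrasts the two approaches on a tiny record. It is not the indexer's implementation: it walks plain JSON instead of running SPARQL over RDF, the sample document is invented, and the `SELECTED` field names are hypothetical stand-ins for the fields that already have their own queries.

```python
import json

# Hypothetical, heavily trimmed schema.org record (not taken from this thread).
DOC = json.loads("""
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Larval krill studies",
  "description": "Winter ecology of larval krill",
  "keywords": ["krill", "biota"],
  "variableMeasured": [
    {"@type": "PropertyValue", "name": "lat", "unitText": "decimal degrees"}
  ]
}
""")


def all_strings(node, found=None):
    """Approach 1 (approximation): collect every distinct string value."""
    found = [] if found is None else found
    if isinstance(node, dict):
        for key, value in node.items():
            if not key.startswith("@"):  # skip JSON-LD keywords such as @context
                all_strings(value, found)
    elif isinstance(node, list):
        for item in node:
            all_strings(item, found)
    elif isinstance(node, str) and node not in found:
        found.append(node)  # numbers and booleans are ignored
    return found


# Assumed stand-ins for fields such as title, abstract, and keywords.
SELECTED = ("name", "description", "keywords")


def selected_strings(doc, fields=SELECTED):
    """Approach 2: concatenate only the values of an explicit field list."""
    out = []
    for field in fields:
        value = doc.get(field, [])
        for item in value if isinstance(value, list) else [value]:
            if isinstance(item, str) and item not in out:
                out.append(item)
    return out


text_all = " ".join(all_strings(DOC))
text_selected = " ".join(selected_strings(DOC))
```

In this sketch `text_all` picks up the variable name and unit ("lat", "decimal degrees") while `text_selected` does not, which is the kind of gap between the two approaches that the question is about.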

@mbjones @datadavev @taojing2002 which method should be implemented?


gothub commented on August 14, 2024

The approaches described above will return these values for the Solr "text" field:

  1. When using a single SPARQL query that returns all strings from the document (word count: 3680):
https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R https://www.example-data-repository.org
Example Data Repository yrday_local http://lod.example-data-repository.org/id/dataset-parameter/20879
local day and decimal time, as 326.5 for the 326th day of the year, or November 22 at 1200 hours (noon)
latitude, in decimal degrees, North is positive, negative denotes South time_sample
http://lod.example-data-repository.org/id/dataset-parameter/20863 minutes Number of minutes between collection and sampling for pigment content;
decline of pigment content with time was used to calculate time to clear the gut of pigment.
text/tab-separated-values 2010-02-03 https://www.example-data-repository.org/dataset/3300/data/larval-krill.tsv
Spatial Reference System http://www.wikidata.org/entity/Q161779 http://www.opengis.net/def/crs/OGC/1.3/CRS84
lat http://lod.example-data-repository.org/id/dataset-parameter/20874 decimal degrees https://www.example-data-repository.org/dataset/3300
Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106, LMG0205 in the Southern Ocean from 2001-2002 (SOGLOBEC project)
Hand-held plankton net Manual Biota Sampler oceans krill biota larval krill pigments Quetin, L., Ross, R. (2010) Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106,
LMG0205 in the Southern Ocean from 2001-2002 (SOGLOBEC project). Example Data Repository. Version 1. doi:10.1234/1234567890 [access date]
2001-08-06/2002-09-09 1 https://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1234/1234567890 year
http://lod.example-data-repository.org/id/dataset-parameter/20861 calendar year month_local
http://lod.example-data-repository.org/id/dataset-parameter/20877 cruiseid http://lod.example-data-repository.org/id/dataset-parameter/20860
text sample_id http://lod.example-data-repository.org/id/dataset-parameter/20862 day_local
http://lod.example-data-repository.org/id/dataset-parameter/20876 stage_id http://lod.example-data-repository.org/id/dataset-parameter/20865
NSF Antarctic Sciences NSF ANT pigment_content http://lod.example-data-repository.org/id/dataset-parameter/20864 micrograms
total chl/grams wet weight https://registry.identifiers.org/registry/doi doi:10.1234/1234567890 http://doi.org/abcd
https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R wet_weight http://lod.example-data-repository.org/id/dataset-parameter/20866
mg lon http://lod.example-data-repository.org/id/dataset-parameter/20875 time_local http://lod.example-data-repository.org/id/dataset-parameter/20878
https://www.example-data-repository.org/person/51160 Dr Robin Ross month, local time https://www.example-data-repository.org/person/51159
Dr Langdon Quetin -68.4817 -75.8183 -65.08 -68.5033 well-known text (WKT) representation of geometry http://www.wikidata.org/entity/Q4018860
POLYGON ((-75.8183 -68.4817, -68.5033 -68.4817, -68.5033 -65.08, -75.8183 -65.08, -75.8183 -68.4817)) pigment content cruise identification year of experiment day of month,
local time longitude, in decimal degrees, East is positive, negative denotes West time of day, local time, using 2400 clock format sample identification:
WBC=whole body clearance expt.; WBF=whole body fluorescence on collection stage development index of larvae in sample
(furcilia = F1-6 = 1-6,  juvenile = J=7) Dr Roberta Marinelli https://orcid.org/0000-0001-7775-xxxx average wet weight/larvae in sample
ANT-9909933 https://www.example-data-repository.org/award/55102 http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=9909933
Winter ecology of larval krill: quantifying their interaction with the pack ice habitat
  2. When using string values from only defined Solr fields (word count: 869):
NSF Antarctic Sciences https://example.org/executions/execution-42 biota 2002-09-09T00:00:00.000Z Dr Langdon Quetin
https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R Dr Robin Ross 1 Winter ecology of larval
krill: quantifying their interaction with the pack ice habitat. larval krill pigments https://www.example-data-repository.org/dataset/3300/data/larval-krill.tsv
Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106, LMG0205 in the Southern Ocean from 2001-2002
(SOGLOBEC project) 2010-02-03T00:00:00.000Z 2001-08-06T00:00:00.000Z https://doi.org/10.xxxx/Dataset-1
https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R http://purl.dataone.org/provone/2015/01/15/ontology#Data
lon https://somerepository.org/datasets/10.xxxx/Dataset-101 https://example.org/executions/execution-101

The second technique includes these fields:

  • abstract, title, label, awardNumber, awardTitle, author, pubDate, authorGivenName, authorLastName, funderIdentifier, funderName, origin, hasPart, keywords, investigator, prov_hasDerivations, prov_instanceOfClass, prov_usedByExecution, prov_usedByProgram, prov_wasDerivedFrom, prov_generatedByExecution, prov_generatedByProgram, namedLocation, beginDate, endDate, parameter, edition, serviceEndpoint

It does not include fields:

  • eastBoundCoord, westBoundCoord, southBoundCoord, northBoundCoord, geohash*
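As a rough illustration of how the second technique could assemble the 'text' value from fields that have already been extracted, the Python sketch below filters a field map against the include list above. The `build_text` helper, the shape of the `record` dict, and the sample values are assumptions made for illustration; the excluded coordinate and geohash fields are dropped simply because they are absent from the list.

```python
# Field names copied from the include list above.
INCLUDED = (
    "abstract", "title", "label", "awardNumber", "awardTitle", "author",
    "pubDate", "authorGivenName", "authorLastName", "funderIdentifier",
    "funderName", "origin", "hasPart", "keywords", "investigator",
    "prov_hasDerivations", "prov_instanceOfClass", "prov_usedByExecution",
    "prov_usedByProgram", "prov_wasDerivedFrom", "prov_generatedByExecution",
    "prov_generatedByProgram", "namedLocation", "beginDate", "endDate",
    "parameter", "edition", "serviceEndpoint",
)


def build_text(fields, included=INCLUDED):
    """Join the values of included fields; `fields` maps a Solr field name
    to a single value or a list of values (hypothetical shape)."""
    parts = []
    for name in included:
        if name not in fields:
            continue
        value = fields[name]
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return " ".join(parts)


# Illustrative record; geohash_1 stands in for any field matching geohash*.
record = {
    "title": "Larval krill studies",
    "keywords": ["oceans", "krill"],
    "northBoundCoord": -65.08,
    "geohash_1": "4",
}
text = build_text(record)
```

Adding a field such as 'license' later would mean extending `INCLUDED`, which is the maintenance cost noted for this technique.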


mbjones commented on August 14, 2024

I'd vote for the first too, and agree with Bryce's reasoning.


gothub commented on August 14, 2024

Indexing of text field for schema.org documents added in commit 1d4bda6

