
oai's Introduction

oai: General Purpose 'OAI-PMH' Services Client

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

oai is an R client for working with OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) services. OAI-PMH is a protocol developed by the Open Archives Initiative (https://en.wikipedia.org/wiki/Open_Archives_Initiative) that transfers metadata as XML over HTTP.

OAI-PMH Info:

oai is built on xml2 and httr. In addition, we give back data.frames whenever possible to make data comprehension, manipulation, and visualization easier. We also have functions to fetch a large directory of OAI-PMH services - it isn't exhaustive, but it does contain a lot.

Instead of paging with, e.g., page and per_page parameters, OAI-PMH uses resumptionTokens, optionally with an expiration date. These tokens can be used to continue on to the next chunk of data if the first request did not reach the end. OAI-PMH services often limit each request to 50 records, though the limit varies by provider. The API of this package is such that we while-loop for you internally until all records have been retrieved. We may in the future expose, e.g., a limit parameter so you can say how many records you want, but we haven't done this yet.
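For illustration, here is a minimal sketch of what that internal loop does, using httr and xml2 directly against the DataCite endpoint used in the examples below (the dates and metadataPrefix are just example values):

library(httr)
library(xml2)

base <- "https://oai.datacite.org/oai"
res <- GET(base, query = list(verb = "ListIdentifiers",
                              metadataPrefix = "oai_dc",
                              from = "2018-05-01", until = "2018-05-02"))
doc <- xml_ns_strip(read_xml(content(res, "text", encoding = "UTF-8")))
token <- xml_text(xml_find_first(doc, "//resumptionToken"))
# if a non-empty token came back, request the next chunk with the token alone
if (!is.na(token) && nzchar(token)) {
  res2 <- GET(base, query = list(verb = "ListIdentifiers", resumptionToken = token))
}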

Install

Install from CRAN

install.packages("oai")

Development version

devtools::install_github("ropensci/oai")
library("oai")

Identify

id("http://oai.datacite.org/oai")
#>   repositoryName                      baseURL protocolVersion
#> 1       DataCite https://oai.datacite.org/oai             2.0
#>             adminEmail    earliestDatestamp deletedRecord          granularity
#> 1 [email protected] 2011-01-01T00:00:00Z    persistent YYYY-MM-DDThh:mm:ssZ
#>   compression compression.1                                    description
#> 1        gzip       deflate oaioai.datacite.org:oai:oai.datacite.org:12425

ListIdentifiers

list_identifiers(from = '2018-05-01T', until = '2018-06-01T')
#> # A tibble: 75 × 5
#>    identifier                           datestamp        setSpec setSp…¹ setSp…²
#>    <chr>                                <chr>            <chr>   <chr>   <chr>  
#>  1 4b64d1f2-31c2-40c9-80aa-bb7ddb424684 2018-05-30T13:5… instal… datase… countr…
#>  2 884378d6-d591-4760-bb70-7b4851784d96 2018-05-29T19:1… instal… datase… countr…
#>  3 18799ce9-1a66-40fc-ad18-5ac54cd3417b 2018-05-14T12:1… instal… datase… countr…
#>  4 7e91aacb-c994-41ee-a7b7-bd23c02cd5bf 2018-05-21T10:5… instal… datase… countr…
#>  5 f83746ee-4cf2-4e60-a720-dd508b559794 2018-05-08T09:4… instal… datase… countr…
#>  6 a3533a61-6f88-443e-89ae-37611ea88267 2018-05-08T13:5… instal… datase… countr…
#>  7 ba9b66a3-2d11-4193-922e-ace4d5909239 2018-05-05T23:5… instal… datase… countr…
#>  8 78b696d9-8f0d-41ab-9c23-1c3547da411d 2018-05-05T23:0… instal… datase… countr…
#>  9 c791b255-a184-4600-b828-ef9d4092a212 2018-05-05T14:2… instal… datase… countr…
#> 10 b929ccda-03b1-4166-9e5b-34588339d61d 2018-05-09T02:5… instal… datase… countr…
#> # … with 65 more rows, and abbreviated variable names ¹​setSpec.1, ²​setSpec.2

Count Identifiers

count_identifiers()
#>                            url   count
#> 1 http://export.arxiv.org/oai2 2158148

ListRecords

list_records(from = '2018-05-01T', until = '2018-05-15T')
#> # A tibble: 41 × 26
#>    identi…¹ dates…² setSpec setSp…³ setSp…⁴ title publi…⁵ ident…⁶ subject source
#>    <chr>    <chr>   <chr>   <chr>   <chr>   <chr> <chr>   <chr>   <chr>   <chr> 
#>  1 18799ce… 2018-0… instal… datase… countr… Bird… Sokoin… https:… Occurr… ""    
#>  2 f83746e… 2018-0… instal… datase… countr… NDFF… Dutch … https:… Metada… "http…
#>  3 a3533a6… 2018-0… instal… datase… countr… EDP … EDP - … https:… Occurr… ""    
#>  4 ba9b66a… 2018-0… instal… datase… countr… Ende… Sokoin… https:… Occurr… ""    
#>  5 78b696d… 2018-0… instal… datase… countr… Ende… Sokoin… https:… Occurr… ""    
#>  6 c791b25… 2018-0… instal… datase… countr… Ende… Sokoin… https:… Occurr… ""    
#>  7 b929ccd… 2018-0… instal… datase… countr… List… Sokoin… https:… Occurr… ""    
#>  8 da285c2… 2018-0… instal… datase… countr… Moni… Corpor… https:… seguim… ""    
#>  9 8737287… 2018-0… instal… datase… countr… Moni… Corpor… https:… seguim… ""    
#> 10 ed7d4c2… 2018-0… instal… datase… countr… Samo… Minist… https:… Occurr… ""    
#> # … with 31 more rows, 16 more variables: description <chr>,
#> #   description.1 <chr>, type <chr>, creator <chr>, date <chr>, language <chr>,
#> #   coverage <chr>, coverage.1 <chr>, format <chr>, source.1 <chr>,
#> #   subject.1 <chr>, creator.1 <chr>, coverage.2 <chr>, description.2 <chr>,
#> #   creator.2 <chr>, subject.2 <chr>, and abbreviated variable names
#> #   ¹​identifier, ²​datestamp, ³​setSpec.1, ⁴​setSpec.2, ⁵​publisher, ⁶​identifier.1

GetRecords

ids <- c("87832186-00ea-44dd-a6bf-c2896c4d09b4", "d981c07d-bc43-40a2-be1f-e786e25106ac")
get_records(ids)
#> $`87832186-00ea-44dd-a6bf-c2896c4d09b4`
#> $`87832186-00ea-44dd-a6bf-c2896c4d09b4`$header
#> # A tibble: 1 × 3
#>   identifier                           datestamp            setSpec             
#>   <chr>                                <chr>                <chr>               
#> 1 87832186-00ea-44dd-a6bf-c2896c4d09b4 2018-06-29T12:08:17Z installation:729a73…
#> 
#> $`87832186-00ea-44dd-a6bf-c2896c4d09b4`$metadata
#> # A tibble: 0 × 0
#> 
#> 
#> $`d981c07d-bc43-40a2-be1f-e786e25106ac`
#> $`d981c07d-bc43-40a2-be1f-e786e25106ac`$header
#> # A tibble: 1 × 3
#>   identifier                           datestamp            setSpec             
#>   <chr>                                <chr>                <chr>               
#> 1 d981c07d-bc43-40a2-be1f-e786e25106ac 2021-09-28T13:58:57Z installation:804b8d…
#> 
#> $`d981c07d-bc43-40a2-be1f-e786e25106ac`$metadata
#> # A tibble: 1 × 12
#>   title       publi…¹ ident…² subject source descr…³ type  creator date  langu…⁴
#>   <chr>       <chr>   <chr>   <chr>   <chr>  <chr>   <chr> <chr>   <chr> <chr>  
#> 1 Peces de l… Instit… https:… Occurr… http:… Caract… Data… Fernan… 2021… es     
#> # … with 2 more variables: coverage <chr>, format <chr>, and abbreviated
#> #   variable names ¹​publisher, ²​identifier, ³​description, ⁴​language

List MetadataFormats

list_metadataformats(id = "87832186-00ea-44dd-a6bf-c2896c4d09b4")
#> $`87832186-00ea-44dd-a6bf-c2896c4d09b4`
#>   metadataPrefix                                                   schema
#> 1         oai_dc           http://www.openarchives.org/OAI/2.0/oai_dc.xsd
#> 2            eml http://rs.gbif.org/schema/eml-gbif-profile/1.0.2/eml.xsd
#>                             metadataNamespace
#> 1 http://www.openarchives.org/OAI/2.0/oai_dc/
#> 2          eml://ecoinformatics.org/eml-2.1.1

List Sets

list_sets("http://api.gbif.org/v1/oai-pmh/registry")
#> # A tibble: 621 × 2
#>    setSpec                     setName         
#>    <chr>                       <chr>           
#>  1 dataset_type                per dataset type
#>  2 dataset_type:OCCURRENCE     occurrence      
#>  3 dataset_type:CHECKLIST      checklist       
#>  4 dataset_type:METADATA       metadata        
#>  5 dataset_type:SAMPLING_EVENT sampling_event  
#>  6 country                     per country     
#>  7 country:AD                  Andorra         
#>  8 country:AM                  Armenia         
#>  9 country:AO                  Angola          
#> 10 country:AQ                  Antarctica      
#> # … with 611 more rows

Examples of other OAI providers

Biodiversity Heritage Library

Identify

id("http://www.biodiversitylibrary.org/oai")
#>                                 repositoryName
#> 1 Biodiversity Heritage Library OAI Repository
#>                                   baseURL protocolVersion
#> 1 https://www.biodiversitylibrary.org/oai             2.0
#>                    adminEmail earliestDatestamp deletedRecord granularity
#> 1 [email protected]        2006-01-01            no  YYYY-MM-DD
#>                                                        description
#> 1 oaibiodiversitylibrary.org:oai:biodiversitylibrary.org:item/1000

Get records

get_records(c("oai:biodiversitylibrary.org:item/7", "oai:biodiversitylibrary.org:item/9"),
            url = "http://www.biodiversitylibrary.org/oai")
#> $`oai:biodiversitylibrary.org:item/7`
#> $`oai:biodiversitylibrary.org:item/7`$header
#> # A tibble: 1 × 3
#>   identifier                         datestamp            setSpec
#>   <chr>                              <chr>                <chr>  
#> 1 oai:biodiversitylibrary.org:item/7 2016-01-26T06:05:19Z item   
#> 
#> $`oai:biodiversitylibrary.org:item/7`$metadata
#> # A tibble: 1 × 11
#>   title    creator subject descr…¹ publi…² contr…³ type  ident…⁴ langu…⁵ relat…⁶
#>   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr>   <chr>   <chr>  
#> 1 Die Mus… Fleisc… Bogor;… pt.5:v… Leiden… Missou… text… https:… Dutch   https:…
#> # … with 1 more variable: rights <chr>, and abbreviated variable names
#> #   ¹​description, ²​publisher, ³​contributor, ⁴​identifier, ⁵​language, ⁶​relation
#> 
#> 
#> $`oai:biodiversitylibrary.org:item/9`
#> $`oai:biodiversitylibrary.org:item/9`$header
#> # A tibble: 1 × 3
#>   identifier                         datestamp            setSpec
#>   <chr>                              <chr>                <chr>  
#> 1 oai:biodiversitylibrary.org:item/9 2016-01-26T06:05:19Z item   
#> 
#> $`oai:biodiversitylibrary.org:item/9`$metadata
#> # A tibble: 1 × 11
#>   title    creator subject descr…¹ publi…² contr…³ type  ident…⁴ langu…⁵ relat…⁶
#>   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr>   <chr>   <chr>  
#> 1 Die Mus… Fleisc… Bogor;… pt.5:v… Leiden… Missou… text… https:… Dutch   https:…
#> # … with 1 more variable: rights <chr>, and abbreviated variable names
#> #   ¹​description, ²​publisher, ³​contributor, ⁴​identifier, ⁵​language, ⁶​relation

Acknowledgements

Michał Bojanowski thanks National Science Centre for support through grant 2012/07/D/HS6/01971.

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for oai in R with citation(package = 'oai')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

oai's People

Contributors

jimhester, karthik, maelle, mbojan, salim-b, sckott


oai's Issues

Help getting started with dspace repositories

Hi,

many thanks for this package; it looks just like what I'm after.

Can someone help me get started? I'm trying to use oai to interact with DSpace repositories, but neither of the two that I tried worked. What am I missing?

> id("https://dash.harvard.edu/oai")
Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  Opening and ending tag mismatch: meta line 6 and head [76]

and

> id("https://www.repository.cam.ac.uk/oai")
Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  Opening and ending tag mismatch: meta line 6 and head [76]

Thanks, Stephen

Speed up some tests

Looked at the test timings; it seems there are two we need to speed up. I'll work on list_identifiers. @mbojan - can you make the dumpers tests faster?

                           .id user.self sys.self elapsed user.child sys.child
10            test-providers.R     0.028    0.001   0.042      0.000     0.000
8          test-list_records.R     0.088    0.002   0.606      0.000     0.000
3           test-get_records.R     0.092    0.003   0.787      0.000     0.000
7  test-list_metadataformats.R     0.100    0.004   0.992      0.000     0.000
5                    test-id.R     0.090    0.004   2.443      0.000     0.000
4         test-handle_errors.R     0.079    0.010   3.266      0.000     0.000
1     test-count_identifiers.R     0.265    0.020   6.917      7.717     0.097
9             test-list_sets.R     1.012    0.018   7.045      0.000     0.000
6      test-list_identifiers.R    19.338    0.181 111.121      0.000     0.000
2               test-dumpers.R    19.518    0.680 259.208      0.000     0.000

Flow control is ignored -- causing unrecoverable errors

Try:

out <- oai::list_records(url = "http://export.arxiv.org/oai2",
                         prefix = "arXiv",
                         from = "2022-04-01")

This results in:

Service Unavailable (HTTP 503)

No results are returned, even though partial results were collected. So there is no graceful way to resume.

There are three issues here that make the overall interface non-robust:

  1. there is no mechanism for setting a delay between subsequent requests, so ...
  2. the arXiv server eventually issues a 503 flow-control directive, which the client treats as a permanent failure rather than an instruction to delay for the specified time and retry, causing the client to stop(), which ...
  3. aborts the call rather than returning partial results, so no resumption is possible.

A possible workaround could be to write an external wrapper that divides the "from" - "until" interval into small chunks and uses purrr wrappers to schedule each chunk and retry (a rough sketch follows). This is inelegant. A cleaner solution might be to handle the OAI flow control explicitly inside while_oai(), and to at least return partial results and a resumption token on error.
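For what it's worth, here is a rough sketch of that chunking workaround. None of this is part of oai; the dates and retry settings are arbitrary example values, and note that insistently() retries on any error, including OAI errors such as noRecordsMatch:

library(oai)
library(purrr)

days <- seq(as.Date("2022-04-01"), as.Date("2022-04-07"), by = "day")

harvest_day <- function(day) {
  oai::list_records(url = "http://export.arxiv.org/oai2",
                    prefix = "arXiv",
                    from = format(day), until = format(day + 1))
}

# retry each daily chunk with exponential backoff; return NULL if it still fails
safe_harvest <- possibly(
  insistently(harvest_day, rate = rate_backoff(pause_base = 10, max_times = 5)),
  otherwise = NULL
)

chunks <- map(days, safe_harvest)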

Problem with testing count_identifiers

Running count_identifiers with more than one URL triggers a warning because is.url gets a vector and returns a vector, which is later used in if(), and if requires a scalar.

This also happens in one of the tests.

It would probably be enough to use all() somewhere; a minimal sketch of the fix follows.
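Something like this, assuming the internal is.url() helper stays vectorised (a sketch of the idea, not the actual fix):

# collapse the logical vector to a scalar before it reaches if()
if (!all(is.url(url))) stop("one or more URLs are not valid", call. = FALSE)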

Handling http errors

Extracted from #17.

Internet-related errors, like timeouts, are thrown by httr::GET, which is used in several places, but most importantly in while_oai, the core of the list_* functions.

  • Things like timeouts could be addressed by properly pre-configuring httr::GET (passing additional arguments through httr::config); see the example below.
  • Perhaps it should be possible to add an optional sleep time between requests in while_oai.
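For example (assuming the extra arguments are passed through to httr::GET, as the config = httr::verbose() examples elsewhere in these issues suggest), a per-request timeout could look like:

oai::list_records(from = "2018-05-01T", until = "2018-05-02T",
                  config = httr::timeout(60))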

Certain "subjetcs" are missing in list_records

We are using zenodo.org as an OAI-compatible repository.
Zenodo encodes certain "subjects" in this format in the XML:

<subject subjectScheme="url">http://id.agrisemantics.org/gacs/C8154</subject>

Subjects of this type do not appear in the data.frame if I call list_records, like this:

get_records(c("oai:zenodo.org:159890"), url = "https://zenodo.org/oai2d", prefix = "oai_datacite")

I need to choose the "oai_datacite" prefix; otherwise the server does not return the records at all.

So with prefix = "oai_datacite" I can see that they are returned when using the "raw" option, like this:

get_records(c("oai:zenodo.org:159890"), url = "https://zenodo.org/oai2d", prefix = "oai_datacite", as = "raw")

But the later parsing omits them somehow. I looked at the code and debugged it, but could not find where exactly they get lost.

change maintainer on cran

@mbojan when you get a chance, could you submit a new version to CRAN changing the maintainer to you? Not urgent, but any correspondence from CRAN will go to me right now, so it'd be good to have it go to you instead. Thanks!

Another OAI repo problem?

Travis trips with

test_check("oai")
── 1. Error: list_identifiers - set (@test-list_identifiers.R#31)  ─────────────
OAI-PMH errors: noRecordsMatch: The combination of the values of the from, until, set, and metadataPrefix arguments results in an empty list.
1: list_identifiers(from = "2011-06-01T", until = "2012-11-01T", set = "CDL.OSU") at testthat/test-list_identifiers.R:31
2: while_oai(url, args, token, as, ...)
3: handle_errors(parsed)

Calling `list_records` with known resumption `token` makes a bad URL

See the request made by

list_records(token="foo", config=httr::verbose())

The request should contain the parameter resumptionToken (not token) and no other parameters (without metadataPrefix). So it should be:

r <- httr::GET("http://oai.datacite.org/oai?verb=ListRecords&resumptionToken=foo",
     config=httr::verbose())
httr::content(r, "text")   

which properly returns the badResumptionToken error code for resumptionToken=foo

Perhaps list_records, when given a token, should also compose the URL without any other arguments (whatever the user supplied), or error immediately. For example:

list_records(from="2015-09-01T", until="2015-09-01T", 
    token="foo", config=httr::verbose())

should skip not only the prefix (which gets a default), but also from, until, and so on.

Handling XML errors

Extracted from #17

Malformed XML errors - it may happen that the downloaded XML is malformed and cannot be parsed by xml2::read_xml, for example because it contains characters that are illegal in XML or that are not escaped.

If the XML is malformed (3 in #17) it could be written somewhere without parsing. That would require some fallback mechanism for getting the resumptionToken without parsing the whole XML with read_xml. One option is just to write a regular expression (a rough sketch follows). Perhaps another option is related to r-lib/xml2#10 if it gets implemented.
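A rough sketch of the regular-expression option, assuming the unparseable response has been written to a file such as dump.xml (the filename is hypothetical):

txt <- paste(readLines("dump.xml", warn = FALSE), collapse = "\n")
# pull the token out without parsing the document as XML
m <- regmatches(txt, regexpr("<resumptionToken[^>]*>[^<]*</resumptionToken>", txt))
token <- if (length(m)) gsub("<[^>]+>", "", m) else NA_character_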

function 'check_as' is missing

Hi Scott,
the last commit added a call to check_as in oai/R/id.R (line 24 in 39102ff):

check_as(as)

but the function is missing.

Maybe something like this?

# check as ----------------------------------------
check_as <- function(as) {
  if (!as %in% c('parsed', 'raw')) {
    stop(sprintf("'%s' not in acceptable set: parsed, raw", as),
         call. = FALSE)
  }
}

👋 Patrick

appveyor builds broken

It's been like this for a while now, because AppVeyor won't install Rtools if there is no /src dir - but it seems to work fine for some packages with no /src dir, which is weird.

Limiting results

Noticed something while making the fix in #18: we probably need to give the user the power to determine how much data they get back.

Our while loop will just keep going, getting more data if a resumptionToken is available.

We may not be able to expose a parameter that does this exactly (like limit = 10 returning exactly 10 results), but at least it could be something like:

  • limit = "all" (all results)
  • limit = "one" (do one HTTP request, then stop, so even if get a resumptionToken, stop anyway)
  • other options?

Additional attribution information

Hi @sckott, I have the following question: would you have anything against including information about the package authors in the oai-package.R file? That would allow me to insert there the grant citation that supported my contributions to this package. I imagine an "Authors" section with a bullet for each author, mine being something like "Michal Bojanowski, supported by NCN grant 2012/07/D/HS6/01971". This could also be in the README. Or perhaps you'll have a different suggestion.

A bit more background: I used the oai package to collect data for research purposes for my project http://recon.icm.edu.pl. I am now writing the final report and listing the various contributions made, and I wanted to include my work on extending oai (dumpers etc.) for data-collection purposes. This will have no copyright/licensing consequences whatsoever.

What do you think?

Recovering from errors

This is a rather general, wide-ranging issue that will have to be split.

At this moment harvesting functions do not have any recovery protocols. This is rather painful with larger requests to OAI-PMH because there is no good way of splitting a request into chunks apart from resumptionToken.

In general there are the following types of errors:

  1. OAI-PMH errors - currently handled by handle_errors
  2. Internet-related errors, like timeouts. They are thrown by httr::GET, which is used in several places, but most importantly in while_oai, the core of the list_* functions.
  3. Malformed XML errors - it may happen that the downloaded XML is malformed and cannot be parsed by xml2::read_xml, for example because it contains characters that are illegal in XML or that are not escaped.

It would be useful to come up with a way to recover from such errors. For example:

  • If the XML is malformed (3) it could be written somewhere without parsing. That would require some fallback mechanism for getting the resumptionToken without parsing the whole XML with read_xml. One option is just to write a regular expression. Perhaps another option is related to r-lib/xml2#10 if it gets implemented.
  • Timeouts could be addressed by properly pre-configuring GET, but also by adding an optional sleep time between requests in while_oai.
  • resumptionToken usually comes with an expirationDate, so in order not to overload the server, harvesting could wait some time (no longer than the expirationDate) before issuing the next request with the token.
  • ...

read local xml files

It would be nice if you could read the local XML files that were downloaded using dump_raw_to_txt or another harvester. I imagine a function list_local_records(startfile = "/path/to/file", prefix = "oai_dc") which reads all the local dump files by following the resumptionToken in the startfile - similar to list_records, but for local files.

xpath or children

Is there a reason why while_oai and other functions use nested xml_children calls to extract various elements from the result (like resumptionToken, errors, etc.)? I think the OAI-PMH specification does not specify the order of the tags, so different services might return them in a different order. Using XPath seems simpler.

One trouble with XPath is handling namespaces. This can be dodged with local-name(), as in http://stackoverflow.com/questions/16717211/getting-elements-with-default-namespace-no-namespace-prefix-using-xpath and the example below.
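For example, against a tiny made-up response (just to show the XPath; this is not output from a real service):

library(xml2)

doc <- read_xml('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken expirationDate="2018-06-01T00:00:00Z">abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>')

xml_text(xml_find_first(doc, ".//*[local-name()='resumptionToken']"))
#> [1] "abc123"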

Allowing for as="xml" and as="xml_list"

Currently, results returned by the list_* functions can be "raw" XML, or XML parsed to a "list" or to a "df". It could rather easily additionally return:

  • as="xml" - parsed XML as a xml_document object (in while_oai use xml_orig)
  • as="xml_list" - a verb-related subtree of XML, e.g. for list_records the tag <ListRecords> (in while_oai use xml object).

The XML is already parsed, so we get those for free. They are useful in dumper functions - no need to parse the raw XML again.

list_identifiers throws error when no results found

from @mbojan

the test for list_identifiers throws an error when there are no results. This just happened for me when running R CMD check - there were no data yet for today. For now I modified the test to look up updates from yesterday.

Newbie questions / ResumptionTokens

I'm sorry to ask these questions, since they might have obvious answers, but I couldn't find anyone's code using this library (which is how I normally learn how to use a package). So here goes:

I'm trying to parse the Dublin Core from this OAI-PMH request URL, and I would like to have it as a tibble (including the results from the resumptionTokens). Is there a way to do this with this client?

Handling big requests

Thanks for this pkg!

I need to harvest and process rather large requests, and merging all results of a single request into one R object is a no-no (not enough RAM). For that purpose I forked the package (https://github.com/mbojan/oai) and modified while_oai to call a dumper function that might, e.g., write results to a DB page by page (per resumptionToken).

Perhaps you are interested in such an enhancement? Suggestions welcome. I'm still working on it.

No 'as = "raw"' for id

Many of the other functions allow callers to specify a format, but id() doesn't. This would be very useful for accessing the extended XML defined by many software packages.
