Comments (5)

miku commented on May 29, 2024

@zazi, thanks for the bug report. Could reproduce. The DNB endpoint is in general relatively broken. I believe I saw this error before:

Your request matches to many records (>100000). The result size is 353017. Please try to restrict the request-period.

$ curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-05T23:59:59Z&verb=ListRecords"
<html><head><title>Error</title></head><body>Your request matches to many records (&amp;gt;100000). The result size is 353017. Please try to restrict the request-period.</body></html>

It is really odd that even a daily slice (using the -daily flag) is too much. If, in theory, all records had the same timestamp, there would be no way at all to retrieve them in a windowed fashion, which in turn means that the endpoint is not fully OAI compliant.

Next thing I would try would be:

$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository

We wrote oaicrawl for the zvdd.de OAI endpoint, because it calls itself OAI despite being broken. oaicrawl is a much blunter tool: it fetches all identifiers (ListIdentifiers) and requests the records one by one (GetRecord). Let's see what happens with DNB:

$ oaicrawl -verbose -f MARC21-xml https://services.dnb.de/oai/repository
FATA[2018-07-30T14:15:52+02:00] expected element type <OAI-PMH> but have <html> 
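
For readers unfamiliar with the one-by-one strategy, here is a minimal Python sketch of the two request types involved. It is not oaicrawl's actual code; the function names are made up for illustration, and only the OAI-PMH verbs, parameters and XML namespace come from the protocol itself. The root-element check mirrors the kind of failure shown above, where an HTML error page comes back instead of an OAI-PMH document.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NAMESPACE = "http://www.openarchives.org/OAI/2.0/"
OAI_NS = {"oai": OAI_NAMESPACE}


def list_identifiers_url(base_url, metadata_prefix):
    """Build a ListIdentifiers request URL (hypothetical helper)."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def get_record_url(base_url, identifier, metadata_prefix):
    """Build a GetRecord request URL for one identifier (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Extract record identifiers from a ListIdentifiers response body.

    Raises ValueError if the root element is not <OAI-PMH>, for example
    when the server answers with an HTML error page.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}OAI-PMH" % OAI_NAMESPACE:
        raise ValueError("expected <OAI-PMH> root element, got <%s>" % root.tag)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
```

A crawler would fetch the ListIdentifiers URL, pass the body to `parse_identifiers`, and then request `get_record_url(...)` for each identifier; following resumption tokens is left out of this sketch.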

Digging into it a bit more:

<title>Error</title>Your request matches to many records (&gt;100000). The result size is 13413063. Please try to restrict the request-period.

Now, let me rant on a bit. Why does OAI have so-called "resumption tokens" at all? DataCite, BASE (Bielefeld) and other huge repositories work just fine when paging through the data (tens of millions of records) for days. This is a DNB problem, and it would be best if they used their own resources to solve it.

from metha.

miku commented on May 29, 2024

> I cannot define the concrete set over there, or?

Yes, oaicrawl was more of a one-shot for a particular endpoint and has a minimal feature set.

> Thanks a lot for your feedback, I'll forward it to DNB somehow.

I can try to do the same.

> Does this sound like a solution for you @miku ?

Yes, sure, this is an option. This is also a limitation of metha that I would like to get rid of one day (it was not essential for the use cases so far, so it is not implemented): it has only monthly and daily slices, not arbitrary precision.
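
To illustrate what "arbitrary precision" slicing could look like, here is a small Python sketch. It is not how metha is implemented; `windows` and `fmt` are hypothetical names. It assumes inclusive from/until bounds at second granularity, matching the 00:00:00 to 23:59:59 daily window in the curl call earlier in this thread.

```python
from datetime import datetime, timedelta, timezone

ONE_SECOND = timedelta(seconds=1)


def windows(start, end, step):
    """Yield inclusive (from, until) pairs covering [start, end].

    Each window spans at most `step`; consecutive windows do not overlap,
    because the next window starts one second after the previous one ends.
    """
    if step < ONE_SECOND:
        raise ValueError("step must be at least one second")
    cursor = start
    while cursor <= end:
        upper = min(cursor + step - ONE_SECOND, end)
        yield cursor, upper
        cursor = upper + ONE_SECOND


def fmt(ts):
    """Format a UTC datetime in the second-granularity form used by OAI-PMH."""
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")
```

For example, calling `windows` with the start and end of 2008-04-05 (UTC) and `timedelta(hours=1)` produces 24 hourly windows whose formatted bounds can be passed as the from and until parameters.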


zazi commented on May 29, 2024

Thanks a lot, @miku, for your very fast reply. I was also about to try oaicrawl for this, but then I thought that it might be a bit too much to fetch this rather large authorities set one by one from DNB, so I skipped this approach. Furthermore, as far as I understood the arguments of oaicrawl, I cannot define the concrete set over there, or?
Thanks a lot for your feedback, I'll forward it to DNB somehow.
For our concrete use case it might even be enough to get the data excerpt from "Sächsische Bibliographie" via SRU. Then I "only" need to be able to define the appropriate CQL query (which is a bit beyond my knowledge so far).


zazi commented on May 29, 2024

While writing the draft of an answer to DNB and reading their OAI docs again, I came to a possible solution:
Since the request returns a 413, which is a standard HTTP status code from RFC 7231, one can make use of this information and reduce the standard interval from daily to, e.g., hourly for such cases (which requires setting both parameters, from and until, in the request).

Does this sound like a solution for you @miku ?

PS: the DNB OAI documentation also says "Depending on the OAI repository these can be either defined to the day (YYYY-MM-DD) or to the second (YYYY-MM-DDThh:mm:ssZ)", so working with hourly slices might be possible.

$ curl -vL "https://services.dnb.de/oai/repository?from=2008-04-05T13:00:00Z&until=2008-04-05T14:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&verb=ListRecords"

delivers at least some results (incl. a resumption token)


zazi commented on May 29, 2024

OK, we've sent a request to DNB asking whether they can increase the result size limit. On the other hand, we would appreciate it if you could implement the proposed fallback: when a 413 is returned, temporarily decrease the interval to hourly (and then go back to daily).
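
A rough Python sketch of the proposed fallback, to make the idea concrete. It is not metha code. It assumes the endpoint answers an oversized window with HTTP 413, as observed above. The `fetch` callable is hypothetical: it is expected to perform one ListRecords request for an inclusive (from, until) window and return the status code and body. Unlike the curl example above, the hourly windows here end at hh:59:59, so adjacent windows do not share a boundary second.

```python
from datetime import datetime, timedelta, timezone

HTTP_PAYLOAD_TOO_LARGE = 413
ONE_SECOND = timedelta(seconds=1)


def hourly_windows(day_start):
    """Yield 24 inclusive (from, until) pairs covering the day at day_start."""
    for hour in range(24):
        lower = day_start + timedelta(hours=hour)
        yield lower, lower + timedelta(hours=1) - ONE_SECOND


def harvest_day(fetch, day_start):
    """Harvest one day, falling back to hourly windows on HTTP 413.

    `fetch(lower, upper)` is a hypothetical callable returning a
    (status_code, body) tuple. Other error statuses are not handled here.
    """
    day_end = day_start + timedelta(days=1) - ONE_SECOND
    status, body = fetch(day_start, day_end)
    if status != HTTP_PAYLOAD_TOO_LARGE:
        return [body]

    # The daily window was rejected: retry the same day in hourly slices.
    bodies = []
    for lower, upper in hourly_windows(day_start):
        status, body = fetch(lower, upper)
        if status == HTTP_PAYLOAD_TOO_LARGE:
            raise RuntimeError(
                "window %s to %s is still too large"
                % (lower.isoformat(), upper.isoformat())
            )
        bodies.append(body)
    return bodies
```

Because every day starts with a daily request, the harvester automatically "goes back to daily" for the next day; only days that trigger a 413 pay the cost of 24 extra requests.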

