
oai-harvest's Introduction

OAI-PMH Harvest



Description

A harvester to collect records from an OAI-PMH enabled provider.

The harvester can be used to carry out one-time harvesting of all records from a particular OAI-PMH provider by giving its base URL. It can also be used for selective harvesting, e.g. to harvest only records updated after, or before, specified dates.
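Under the hood the protocol work is delegated to pyoai (listed under Requirements below). As a rough illustration only, and not a description of oai-harvest's internals, a date-restricted harvest via pyoai's client API looks something like this (the endpoint URL is a placeholder):

    # Illustrative sketch, not oai-harvest code: a date-restricted
    # ListRecords request using pyoai, the client library this tool
    # depends on. The endpoint URL is a placeholder.
    import datetime

    from oaipmh.client import Client
    from oaipmh.metadata import MetadataRegistry, oai_dc_reader

    registry = MetadataRegistry()
    registry.registerReader("oai_dc", oai_dc_reader)
    client = Client("http://example.com/oai", registry)

    # Only records added/modified on or after 1 January 2013, as oai_dc.
    for header, metadata, about in client.listRecords(
        metadataPrefix="oai_dc",
        from_=datetime.datetime(2013, 1, 1),
    ):
        print(header.identifier(), header.datestamp())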

To assist in regular harvesting from one or more OAI-PMH providers, there's a provider registry. It is possible to associate a short, memorable name for a provider with its base URL, a destination directory for harvested records, and the format (metadataPrefix) in which records should be harvested. The registry will also record the date and time of the most recent harvest, and automatically add this to subsequent requests in order to avoid repeatedly harvesting unmodified records.

This could be used in conjunction with a scheduler (e.g. cron) to maintain a reasonably up-to-date copy of the records in one or more providers. Examples of how to accomplish these tasks are available below.

Latest Version


The latest stable release version is available in the Python Packages Index:

https://pypi.python.org/pypi/oaiharvest

Source code is under version control and available from:

http://github.com/bloomonkey/oai-harvest

Documentation

All executable commands are self-documenting, i.e. you can get help on how to use them with the -h or --help option.

At this time the only additional documentation that exists can be found in this README file!

Requirements / Dependencies

  • Python >= 2.7 or Python 3.x
  • pyoai
  • lxml
  • sqlite3

Installation

Users

pip install oaiharvest

Developers

I recommend that you use virtualenv to isolate your development environment from system Python and any packages that may be installed there.

  1. In GitHub, fork the repository

  2. Clone your fork:

    git clone git@github.com:<username>/oai-harvest.git
    
  3. Set up a development virtualenv using tox:

    pip install tox
    tox -e dev
    
  4. Activate development virtualenv:

    *nix:

    source env/bin/activate
    

    Windows:

    env\Scripts\activate
    

Bugs, Feature requests etc.

Bug reports and feature requests can be submitted to the GitHub issue tracker: http://github.com/bloomonkey/oai-harvest/issues

If you'd like to contribute code, patches etc. please email the author, or submit a pull request on GitHub.

Copyright And Licensing

Copyright (c) University of Liverpool, 2013-2014

This project is licensed under the terms of the 3-Clause BSD License.

Examples

Harvesting records from an OAI-PMH provider URL

All records

oai-harvest http://example.com/oai

Records modified since a certain date

oai-harvest --from 2013-01-01 http://example.com/oai

Records from a named set

oai-harvest --set "some:set" http://example.com/oai

Limit the number of records to harvest

oai-harvest --limit 50 http://example.com/oai

Get help on all available options

oai-harvest --help

OAI-PMH Provider Registry

Add a provider

oai-reg add provider1 http://example.com/oai/1

If you don't supply --metadataPrefix and --directory options, you will be interactively prompted to supply alternatives, or accept the defaults.

Remove an existing provider

oai-reg rm provider1 [provider2]

List existing providers

oai-reg list

Harvesting from OAI-PMH providers in the registry

Harvest from one or more providers in the registry using the short names that they were registered with:

oai-harvest provider1 [provider2]

By default, this will harvest all records modified since the last harvest from each provider. You can override this behavior using the --from and --until options.

Harvest from all providers in the registry:

oai-harvest all

Scheduling Regular Harvesting

In order to maintain a reasonably up-to-date copy of all the records held by those providers, one could configure a scheduler to periodically harvest from all registered providers. For example, to tell cron to harvest from all providers at 2am every day, one might add the following to a crontab:

0 2 * * * oai-harvest all

oai-harvest's People

Contributors

atomotic, bloomonkey, danmichaelo, jgonggrijp, mamico, mhoffman, ritwikgupta, ulikoehler


oai-harvest's Issues

Attribute error while harvesting

When running the command

oai-harvest -p arXiv http://export.arxiv.org/oai2

I get the error

INFO     Harvesting from http://export.arxiv.org/oai2
ERROR    'Record' object has no attribute 'identifier'
Traceback (most recent call last):
  File "/home/epatters/miniconda3/lib/python3.7/site-packages/oaiharvest/harvest.py", line 181, in main
    **kwargs
  File "/home/epatters/miniconda3/lib/python3.7/site-packages/oaiharvest/harvesters/directory_harvester.py", line 65, in harvest
    record.identifier, metadataPrefix
AttributeError: 'Record' object has no attribute 'identifier'

after downloading roughly 100K records. I've tried several times, each with the same result. I am running Python 3.7 on a Linux system.

harvest exits on parse errors

If the harvester hits a parse error on the metadata payload, it quits with an exception, and without a simple way to restart after the bad record there is no way to continue downloading the rest of the resource from that site. This is typically not an issue with oai_dc metadata, but I'm seeing it frequently with oai_ead feeds produced by ArchivesSpace.
I have found that patching the parse method from oaipmh.client to create an XMLParser with the recover=True option enables a more forgiving parser that doesn't crash. (Typical errors I've seen are unescaped ampersands and unpaired quotes around attribute values.)
I'm still exploring other modifications: rather than silently working around the errors, I would probably like to log them, and perhaps also save the raw data so it can be run through a validator to report problems upstream.
Just wondering if you have any thoughts about the best way to deal with this issue, both for my needs and for what sort of solution you might consider accepting upstream.
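A minimal sketch of that workaround, assuming (as in current pyoai) that the parse method lives on oaipmh.client.BaseClient, takes the raw response body, and returns an lxml element; the exact method body differs between pyoai versions, so treat this as illustrative rather than a drop-in fix:

    # Sketch of the recover=True monkey-patch described above; assumes
    # oaipmh.client.BaseClient.parse(self, xml) takes the raw response
    # body and returns an lxml element. Illustrative only.
    from lxml import etree
    from oaipmh import client as oaipmh_client

    _recovering_parser = etree.XMLParser(recover=True)

    def _forgiving_parse(self, xml):
        # Parse bytes so lxml can honour any encoding declaration itself.
        if not isinstance(xml, bytes):
            xml = xml.encode("utf-8")
        return etree.XML(xml, parser=_recovering_parser)

    oaipmh_client.BaseClient.parse = _forgiving_parse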

Incremental harvesting?

Dear @bloomonkey, I'd like to harvest a big set incrementally. My impression is that oai-harvest does not support this scenario. If I run oai-harvest -s SET -l 3 provider twice, it will simply download the same records twice.

Is there a simple way in which I could modify oai-harvest in order to support this? I'd be happy to submit a pull request. Alternatively, are you aware of a tool that already supports this functionality? Thanks in advance.

Registry directory ignored on `harvest all`

First reported by @sdm7g in #27

Also note that when using a directory setting in the registry, and doing a harvest "all" , all of the files from all providers go into the directory from the first provider chosen. Not sure that anything can be done about that other than documenting that is the case, or perhaps it would be less confusing just to take directory out of the registry. Since the directory setting is per invocation rather than per provider, it would make more sense to restrict it to the command line and not have registry values that are ignored.

Can't load records prior to 2007 from arXiv

Hello,

Thanks for the great tool!
I'm having trouble downloading metadata from arXiv with the following command:
oai-harvest --from "1994-01-01" --limit 50 http://export.arxiv.org/oai2
or without the --from option:
oai-harvest --limit 50 http://export.arxiv.org/oai2

The problem is that the earliest items I get are from 2007, but I can find items back to 1994 on the website. Any idea what could cause this issue?

best regards,
Immanuel

less verbose output

Hi, oai-harvest is great but has a very verbose output (see below).

How do I get silent output?

Thanks,

--Martin

DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1003.0006.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1212.5387.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1405.6957.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1407.7261.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1501.00129.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1506.08655.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1508.06785.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1511.01375.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1512.01056.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1601.01001.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1601.07738.arXiv.xml
DEBUG    Writing to file /home/martin/martin-no-backup/arxiv-classifier/data/oai:arXiv.org:1607.02928.arXiv.xml

ORE. Any plans to pull files?

Is there any way to pull files along with ORE metadata? If not, is there any interest in, or a plan for, developing such a feature?

metadata output not exactly in utf8 encoding...

Metadata output seems to be in ASCII, with other Unicode characters encoded as numerical character entities. That is legal for the default UTF-8 encoding, since ASCII is a subset, but it is not what I, and I think most people, want or expect.
(This may be the same issue reported as #32. This was also reported to me by Columbia.edu, and I was able to reproduce it on both my and their OAI feeds.)

I initially tried adding encoding="UTF-8" to the etree.tostring call in metadata.py; this worked under Python 3.x but failed under Python 2.x.

Adding encoding="unicode" appears to be the correct fix, and it seems to work under both Python 2.x and Python 3.x.

Under Python 2.x, encoding="UTF-8" returns a <type 'str'> that contains encoded Unicode characters, which may then give an error when coercing to <type 'unicode'>; encoding="unicode" returns a <type 'unicode'>.

See: https://github.com/sdm7g/oai-harvest/blob/fix-pyoai/oaiharvest/metadata.py#L51-L53
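For anyone unfamiliar with the distinction, the snippet below (illustrative only, not taken from metadata.py) shows how lxml serialises non-ASCII text for each encoding argument:

    # Illustrative only: lxml's etree.tostring output for the different
    # encoding arguments discussed above.
    from lxml import etree

    el = etree.fromstring(u"<title>M\u00fcller</title>")

    etree.tostring(el)
    # b'<title>M&#252;ller</title>' -- default: ASCII with character references

    etree.tostring(el, encoding="unicode")
    # '<title>Müller</title>' -- a text (unicode) string, characters intact

    etree.tostring(el, encoding="UTF-8")
    # b"<?xml version='1.0' encoding='UTF-8'?>\n<title>M\xc3\xbcller</title>"
    # -- UTF-8 bytes; under Python 2 this is a byte str that still needs
    #    decoding before it can be treated as text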

Harvester Timed Out

The harvester timed out at about 677,989 out of 1,500,000 items while trying to harvest all of arXiv.org. Is there a way to pick the harvest back up where it timed out, instead of starting at the beginning?

BadVerbError: Value of the verb argument ...

I get exactly the same failure and BadVerbError as in #19:

  File "/Users/kilian/Library/Python/3.11/lib/python/site-packages/oaipmh/common.py", line 121, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/Users/kilian/Library/Python/3.11/lib/python/site-packages/oaipmh/client.py", line 74, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))
  File "/Users/kilian/Library/Python/3.11/lib/python/site-packages/oaipmh/client.py", line 308, in makeRequestErrorHandling
    raise getattr(error, code[0].upper() + code[1:] + 'Error')(msg)
oaipmh.error.BadVerbError: Value of the verb argument is not a legal OAI-PMH verb, the verb argument is missing, or the verb argument is repeated.

Seven years after #19, I am on a Mac, and oai-harvest is installed in a Python 3.11 environment.

Harvesting submissions from arXiv for the month of March 2019 returns earlier dates

Using the command oai-harvest -f 2019-03-01 http://export.arxiv.org/oai2, I harvested data for the month of March. When I took a look at the data, the last revised submission date was earlier than 2019-03-01 for a number of the documents. Can a record or document be modified without it making a new submission history?

So if I were to use

-f YYYY-MM-DD, --from YYYY-MM-DD
    harvest only records added/modified after this date.

could I get records "modified" in some way after my input date but with a last revised submission date earlier than my input date?

I can clarify further if this isn't making much sense, but I am hoping to harvest records by submission/modification date, month by month, for all of arXiv without duplicates. The command I used gave me documents last submitted outside of the month of March.

Thanks, hope to hear back,
Brody

Continue harvest from last good resumptionToken

Is there a way to continue with the harvest from a failure, with the last good resumption token?

When you list the registry you can see the URL for the next harvest. Can we override this, or start the harvest with something like:

oai-harvest my_repository --resumptionToken MYLASTGOODRT

I know that most resumption tokens expire quickly, but in my case they are persistent. The parameters in the RT are easily hackable, like mp_marcxml.set_art.c_000812323
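For what it's worth, oai-harvest itself has no --resumptionToken option, but the protocol does allow picking up from a known token. A rough sketch using pyoai's low-level request method (the endpoint and token are placeholders, and the result is the raw XML response, so paging and parsing are still up to the caller):

    # Rough sketch only: issue a ListRecords request from a known
    # resumption token via pyoai's low-level makeRequest. Endpoint and
    # token are placeholders; the return value is raw response XML.
    from oaipmh.client import Client
    from oaipmh.metadata import MetadataRegistry

    client = Client("http://example.com/oai", MetadataRegistry())
    raw_xml = client.makeRequest(verb="ListRecords",
                                 resumptionToken="MYLASTGOODRT")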

Python 3 Compatibility

Hello!

I would like to use oai-harvest in a Python 3 environment. Is there any interest from somebody else in porting it? Has this been considered before?

Many thanks.

cannot harvest https://doaj.org/oai.article

Hi,

Am I doing something wrong?

~/.local/bin/oai-harvest 'https://www.doaj.org/oai.article'
INFO Harvesting from https://www.doaj.org/oai.article
ERROR Value of the verb argument is not a legal OAI-PMH verb, the verb argument is missing, or the verb argument is repeated.
Traceback (most recent call last):
  File "/www/liinwww/.local/lib/python2.7/site-packages/oaiharvest/harvest.py", line 305, in main
    **kwargs
  File "/www/liinwww/.local/lib/python2.7/site-packages/oaiharvest/harvest.py", line 138, in harvest
    **kwargs):
  File "/www/liinwww/.local/lib/python2.7/site-packages/oaiharvest/harvest.py", line 83, in _listRecords
    client.identify()
  File "/www/liinwww/.local/lib/python2.7/site-packages/oaipmh/common.py", line 126, in method
    return obj(self, **kw)
  File "/www/liinwww/.local/lib/python2.7/site-packages/oaipmh/common.py", line 121, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/www/liinwww/.local/lib/python2.7/site-packages/oaipmh/client.py", line 74, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))
  File "/www/liinwww/.local/lib/python2.7/site-packages/oaipmh/client.py", line 308, in makeRequestErrorHandling
    raise getattr(error, code[0].upper() + code[1:] + 'Error')(msg)
BadVerbError: Value of the verb argument is not a legal OAI-PMH verb, the verb argument is missing, or the verb argument is repeated.

Polling the repository manually using e.g.
https://doaj.org/oai.article?verb=ListRecords&metadataPrefix=oai_dc
works perfectly in browser...

The installation is on Ubuntu 16.04 with 'pip install oaiharvest'.

I would appreciate any hints you can provide me with,

Paul

'from oaiharvest import harvest' Error

from oaiharvest import harvest does not work.
from oaiharvest.harvest import * or import oaiharvest.harvest does work.

Python 3.6.5 (default, Apr 25 2018, 14:26:36) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from oaiharvest import harvest
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'OAI-PMH Harvester'
>>> from oaiharvest.harvest import *
>>> import oaiharvest.harvest

If I change the module __name__ to be the same as __package__ : "oaiharvest"
then it works.

Alternate workaround is import oaiharvest.harvest as harvest

high CPU usage

Hi, I am currently running this module to crawl 3.67 M articles from doaj.org, which, obviously, takes a while, especially since the site seems to really limit the bandwidth for performance reasons (it ran over the weekend and is now, after ~65h, only about 40% done).

However, I notice constant CPU usage from the crawler task at about 30%. Any idea why? If anything I'd expect network and disk to be affected, but they're not; in fact they do almost nothing.

Not compiling

... because sys_platform is not defined, potentially due to an outdated markers.py that seems to be downloaded when using the suggested install command.

Collecting oaiharvest from git+http://github.com/bloomonkey/oai-harvest.git#egg=oaiharvest
  Cloning http://github.com/bloomonkey/oai-harvest.git to c:\users\appdata\local\temp\pip-build-v3glyz\oaiharvest
  Complete output from command python setup.py egg_info:
    Downloading http://pypi.python.org/packages/source/d/distribute/distribute-0.6.34.tar.gz
    Extracting in c:\users\appdata\local\temp\tmpa3onvj
    Now working in c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34
    Building a Distribute egg in c:\users\appdata\local\temp\pip-build-v3glyz\oaiharvest
    Traceback (most recent call last):
      File "setup.py", line 248, in <module>
        scripts = scripts,
      File "c:\python27\lib\distutils\core.py", line 111, in setup
        _setup_distribution = dist = klass(attrs)
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\setuptools\dist.py", line 225, in __init__
        _Distribution.__init__(self,attrs)
      File "c:\python27\lib\distutils\dist.py", line 287, in __init__
        self.finalize_options()
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\setuptools\dist.py", line 257, in finalize_options
        ep.require(installer=self.fetch_build_egg)
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\pkg_resources.py", line 2025, in require
        working_set.resolve(self.dist.requires(self.extras),env,installer))
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\pkg_resources.py", line 2235, in requires
        dm = self._dep_map
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\pkg_resources.py", line 2464, in _dep_map
        self.__dep_map = self._compute_dependencies()
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\pkg_resources.py", line 2497, in _compute_dependencies
        common = frozenset(reqs_for_extra(None))
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\pkg_resources.py", line 2494, in reqs_for_extra
        if req.marker_fn(override={'extra':extra}):
      File "c:\users\appdata\local\temp\tmpa3onvj\distribute-0.6.34\_markerlib\markers.py", line 109, in marker_fn
        return eval(compiled_marker, environment)
      File "", line 1, in
    NameError: name 'sys_platform' is not defined
    c:\users\appdata\local\temp\pip-build-v3glyz\oaiharvest\distribute-0.6.34-py2.7.egg
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "c:\users\appdata\local\temp\pip-build-v3glyz\oaiharvest\setup.py", line 9, in <module>
        distribute_setup.use_setuptools()
      File "distribute_setup.py", line 152, in use_setuptools
        return _do_download(version, download_base, to_dir, download_delay)
      File "distribute_setup.py", line 132, in _do_download
        _build_egg(egg, tarball, to_dir)
      File "distribute_setup.py", line 123, in _build_egg
        raise IOError('Could not build the egg.')
    IOError: Could not build the egg.


Command "python setup.py egg_info" failed with error code 1 in c:\users\appdata\local\temp\pip-build-v3glyz\oaiharvest\

ensure directory exists log message (and directory setting in registry on harvest all)

The log message at https://github.com/bloomonkey/oai-harvest/blob/develop/oaiharvest/harvest.py#L211,

logger.debug("Creating target directory {0}".format(self._dir))

would make more sense as

logger.debug("Creating target directory {0}".format(os.path.dirname))

It is logging the top-level directory and not the directory it is actually creating.
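A small sketch of the suggested change, using hypothetical names (ensure_parent_dir, target_path) rather than the actual method in harvest.py:

    # Hypothetical sketch of the suggested fix: log the directory that is
    # actually about to be created, not the configured top-level directory.
    # Names here are illustrative, not oai-harvest's own.
    import logging
    import os

    logger = logging.getLogger(__name__)

    def ensure_parent_dir(target_path):
        parent = os.path.dirname(target_path)
        if parent and not os.path.isdir(parent):
            logger.debug("Creating target directory {0}".format(parent))
            os.makedirs(parent)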

Also note that when using a directory setting in the registry, and doing a harvest "all" , all of the files from all providers go into the directory from the first provider chosen. Not sure that anything can be done about that other than documenting that is the case, or perhaps it would be less confusing just to take directory out of the registry. Since the directory setting is per invocation rather than per provider, it would make more sense to restrict it to the command line and not have registry values that are ignored.

OAI-PMH Dspace

Hello, can I use this package to download the files in a DSpace repository?

Specify sets in provider registry

Hello and thank you for your very helpful program! I would like to ask if it would be possible to store not only the providers in the provider registry database, but also individual sets?

I would be interested in harvesting only specific sets from different providers on a regular basis, and storing the metadata files for each set in a separate folder, e.g.

provider 1, set 1a, folder_1, date_lastHarvest_1a
provider 1, set 1b, folder_2, date_lastHarvest_1b
provider 2, set 2a, folder_3, date_lastHarvest_2a
provider 3, set 3a, folder_4, date_lastHarvest_3a
provider 3, set 3b, folder_5, date_lastHarvest_3b

If you could tell me if this is already possible (and I just overlooked it) or if you see the chance that this feature could be implemented, you would help me a lot.

(Regarding the provider setup in the database, I noticed one more small thing in the documentation: It is described that there is an optional --directory argument when adding a provider in the database. This option only works for me if I rename the argument --directory into --dir.)

Thank you very much in advance and kind regards.

logfile conflict - no harvest.log

There seems to be a conflict between logging config at https://github.com/bloomonkey/oai-harvest/blob/develop/oaiharvest/harvest.py#L500-L505 and https://github.com/bloomonkey/oai-harvest/blob/develop/oaiharvest/registry.py#L340-L346 .

Harvest logs are always written to registry.log. There is no harvest.log.
Not much of an issue, but caused a little confusion when looking for logs.

(I am going to run harvest from a cron job, so I'm going to want to redirect both the logs and the registry into another directory anyway.)
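One plausible explanation (an assumption about the cause, not a reading of the actual modules) is that Python's logging.basicConfig only configures the root logger the first time it is called, so whichever module sets up its file handler first wins:

    # Assumption about the cause, not a reading of oaiharvest's code:
    # logging.basicConfig() is a no-op once the root logger already has
    # handlers, so the second call below is silently ignored and every
    # record keeps going to the first configured file.
    import logging

    logging.basicConfig(filename="registry.log", level=logging.DEBUG)
    logging.basicConfig(filename="harvest.log", level=logging.DEBUG)  # ignored

    logging.getLogger("oaiharvest.harvest").debug("written to registry.log")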

Encoding issue

Good morning John,

I am using your project to fetch OAI-PMH data and I have encountered this problem: it manages to pull about 200k entries and then systematically falls over on a single one with this error:

ERROR 'ascii' codec can't encode character u'\xfc' in position 10: ordinal not in range(128)

I don't see any options for dealing with encoding issues. Filtering the data out would be counterproductive, but I would happily have the script skip these delinquent entries if they can't be transliterated.

Do you have any ideas?

Thank you for your time and your project,
Phil
