
ckanext-harvest's Introduction

ckanext-harvest - Remote harvesting extension


This extension provides a common harvesting framework for CKAN extensions and adds a command line interface (CLI) and a web user interface (WUI) to CKAN for managing harvest sources and jobs.

Installation

This extension requires CKAN v2.0 or later, both on the CKAN it is installed into and on the CKAN instances it harvests. In practice, you are unlikely to encounter a CKAN running a version lower than 2.0.

  1. The harvest extension can use two different backends. Choose whichever suits your needs, but note that Redis has been found to be more stable and reliable, so it is the recommended one:
    • Redis (recommended): To install it, run:

      sudo apt-get update
      sudo apt-get install redis-server

      In your CKAN configuration file, add to the [app:main] section:

      ckan.harvest.mq.type = redis
    • RabbitMQ: To install it, run:

      sudo apt-get update
      sudo apt-get install rabbitmq-server

      In your CKAN configuration file, add to the [app:main] section:

      ckan.harvest.mq.type = amqp
  2. Activate your CKAN virtual environment, for example:

    $ . /usr/lib/ckan/default/bin/activate
  3. Install the ckanext-harvest Python package into your virtual environment:

    (pyenv) $ pip install -e git+https://github.com/ckan/ckanext-harvest.git#egg=ckanext-harvest
  4. Install the Python modules required by the extension (adjusting the path according to where ckanext-harvest was installed in the previous step):

    (pyenv) $ cd /usr/lib/ckan/default/src/ckanext-harvest/
    (pyenv) $ pip install -r requirements.txt
  5. Make sure the CKAN configuration ini file contains the harvest main plugin, as well as the harvester for CKAN instances if you need it (included with the extension):

    ckan.plugins = harvest ckan_harvester
  6. If you haven't already done so in step 1, define the backend that you are using with the ckan.harvest.mq.type option in the [app:main] section (it defaults to amqp):

    ckan.harvest.mq.type = redis

There are a number of configuration options available for the backends. These don't need to be modified at all if you are using the default Redis or RabbitMQ install (step 1). However, you may wish to add them, with custom values, to the [app:main] section of the CKAN config file (see the example after the list below). The list below shows the available options and their default values:

  • Redis:
    • ckan.harvest.mq.hostname (localhost)
    • ckan.harvest.mq.port (6379)
    • ckan.harvest.mq.redis_db (0)
    • ckan.harvest.mq.password (None)
  • RabbitMQ:
    • ckan.harvest.mq.user_id (guest)
    • ckan.harvest.mq.password (guest)
    • ckan.harvest.mq.hostname (localhost)
    • ckan.harvest.mq.port (5672)
    • ckan.harvest.mq.virtual_host (/)
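
For example, to point the harvester at a Redis server running on a different host, you could add something like this to the [app:main] section (the hostname and password below are illustrative):

ckan.harvest.mq.type = redis
ckan.harvest.mq.hostname = redis.internal.example.com
ckan.harvest.mq.port = 6379
ckan.harvest.mq.password = changeme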

Note: it is safe to use the same backend server (either Redis or RabbitMQ) for different CKAN instances, as long as they have different site IDs. The ckan.site_id config option (or its default) will be used to namespace the relevant things:

  • On RabbitMQ it will be used to name the queues used, eg ckan.harvest.site1.gather and ckan.harvest.site1.fetch.
  • On Redis, it will namespace the keys used, so only the relevant instance gets them, eg site1:harvest_job_id, site1:harvest_object_id:804f114a-8f68-4e7c-b124-3eb00f66202f
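
For example, two CKAN instances sharing one Redis or RabbitMQ server just need distinct site IDs in their respective [app:main] sections:

# instance 1 ini file
ckan.site_id = site1

# instance 2 ini file
ckan.site_id = site2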

Configuration

Run the following command to create the necessary tables in the database (ensuring the pyenv is activated):

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini db upgrade -p harvest

Finally, restart CKAN to have the changes take effect:

sudo service apache2 restart

After installation, the harvest source listing should be available under /harvest, eg:

http://localhost/harvest

Database logger configuration (optional)

  1. Logging to the database is disabled by default. If you want your CKAN harvest logs to be exposed to the CKAN API, you need to configure the logger with the following configuration parameter:

    ckan.harvest.log_scope = 0
    • -1 - Do not log in the database - DEFAULT
    • 0 - Log everything
    • 1 - model, logic.action, logic.validators, harvesters
    • 2 - model, logic.action, logic.validators
    • 3 - model, logic.action
    • 4 - logic.action
    • 5 - model
    • 6 - plugin
    • 7 - harvesters
  2. Set the time frame (in days) for the clean-up mechanism with the following config parameter (in the [app:main] section):

    ckan.harvest.log_timeframe = 10

    If no value is present the default is 30 days.

  3. Set the log level for the database logger:

    ckan.harvest.log_level = info

    If no log level is set the default is debug.
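
Putting the three logger options together, a typical [app:main] block using the example values above would look like:

ckan.harvest.log_scope = 0
ckan.harvest.log_timeframe = 10
ckan.harvest.log_level = info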

API Usage

You can access CKAN harvest logs via the API:

$ curl {ckan_url}/api/3/action/harvest_log_list

Replace {ckan_url} with the URL of your CKAN instance.

Allowed parameters are:

  • level (filter log records by level)
  • limit (used for pagination)
  • offset (used for pagination)

e.g. Fetch all logs with log level INFO:

$ curl {ckan_url}/api/3/action/harvest_log_list?level=info

{
  "help": "http://127.0.0.1:5000/api/3/action/help_show?name=harvest_log_list",
  "success": true,
  "result": [
    {"content": "Sent job aa987717-2316-4e47-b0f2-cbddfb4c4dfc to the gather queue", "level": "INFO", "created": "2016-06-03 10:59:40.961657"},
    {"content": "Sent job aa987717-2316-4e47-b0f2-cbddfb4c4dfc to the gather queue", "level": "INFO", "created": "2016-06-03 10:59:40.951548"}
  ]
}

Dataset name generation configuration (optional)

If the dataset name is created based on the title, duplicate names may occur. To avoid this, a suffix is appended to the name if it already exists.

You can configure the default behaviour in your production.ini:

ckanext.harvest.default_dataset_name_append = number-sequence

or

ckanext.harvest.default_dataset_name_append = random-hex

If you don't specify this setting, the default will be number-sequence.
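
For example, if three harvested datasets would otherwise all be named my-dataset, the resulting names would look something like this (the exact suffixes, in particular the hex values, are illustrative):

my-dataset, my-dataset-1, my-dataset-2          (number-sequence)
my-dataset, my-dataset-3f2a, my-dataset-b81c    (random-hex)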

Send error mails when harvesting fails (optional)

If you want to send an email when a Harvest Job fails, you can set the following configuration option in the ini file:

ckan.harvest.status_mail.errored = True

If you want to send an email whenever a harvest job finishes (whether or not it failed), you can set the following configuration option in the ini file:

ckan.harvest.status_mail.all = True

With either option enabled, all CKAN users who are declared as sysadmins will receive the error emails at their configured email address. If the harvest source of the failing harvest job belongs to an organization, the error mail will also be sent to the organization members with the admin role, provided their email address is configured.

If you don't specify these settings, the default will be False.

Set a timeout for a harvest job (optional)

If you want to set a timeout for harvest jobs, you can add this configuration option to the ini file:

ckan.harvest.timeout = 1440

The timeout value is in minutes, so 1440 represents 24 hours. Any jobs which are timed out will create an error message for the user to see.

If you don't specify this setting, the default will be False and there will be no timeout on harvest jobs. This timeout value is compared to the completion time of the last object in the job.

Avoid overwriting certain fields (optional)

If you want to prevent some fields from being overwritten by harvesting, you can add a list of fields that should not be overwritten to not_overwrite_fields in the ini file. This is useful if you want to add additional fields to the harvested datasets, or if you want to alter them after they have been harvested. For example, to retain changes made by users to the fields description and tags:

ckan.harvest.not_overwrite_fields = description tags

Command line interface

The following operations can be run from the command line as described underneath:

harvester source {name} {url} {type} [{title}] [{active}] [{owner_org}] [{frequency}] [{config}]
  - create new harvest source

harvester source {source-id/name}
  - shows a harvest source

harvester rmsource {source-id/name}
  - remove (deactivate) a harvester source, whilst leaving any related
    datasets, jobs and objects

harvester clearsource {source-id/name}
  - clears all datasets, jobs and objects related to a harvest source,
    but keeps the source itself

harvester clearsource-history [{source-id}] [-k]
  - If no source id is given the history for all harvest sources (maximum is 1000)
    will be cleared.
    Clears all jobs and objects related to a harvest source, but keeps the source
    itself. The datasets imported from the harvest source will **NOT** be deleted!!!
    If a source id is given, it only clears the history of the harvest source with
    the given source id.

    To keep the currently active jobs use the -k option.

harvester sources [all]
  - lists harvest sources
    If 'all' is defined, it also shows the Inactive sources

harvester job {source-id/name}
  - create new harvest job

harvester jobs
  - lists harvest jobs

harvester job-abort {source-id/name}
  - marks a job as "Aborted" so that the source can be restarted afresh.
    It ensures that the job's harvest objects status are also marked
    finished. You should ensure that neither the job nor its objects are
    currently in the gather/fetch queues.

harvester run
  - starts any harvest jobs that have been created by putting them onto
    the gather queue. Also checks running jobs - if finished it
    changes their status to Finished.

harvester run-test {source-id/name}
  - runs a harvest - for testing only.
    This does all the stages of the harvest (creates job, gather, fetch,
    import) without involving the web UI or the queue backends. This is
    useful for testing a harvester without having to fire up
    gather/fetch_consumer processes, as is done in production.

harvester run-test {source-id/name} force-import=guid1,guid2...
  - In order to force an import of particular datasets, useful to
    target a dataset for dev purposes or when forcing imports on other environments.

harvester gather-consumer
  - starts the consumer for the gathering queue

harvester fetch-consumer
  - starts the consumer for the fetching queue

harvester purge-queues
  - removes all jobs from fetch and gather queue
    WARNING: if using Redis, this command purges all data in the current
    Redis database

harvester clean-harvest-log
  - Clean-up mechanism for the harvest log table.
    You can configure the time frame through the configuration
    parameter 'ckan.harvest.log_timeframe'. The default time frame is 30 days

harvester [-j] [-o] [--segments={segments}] import [{source-id}]
  - perform the import stage with the last fetched objects, for a certain
    source or a single harvest object. Please note that no objects will
    be fetched from the remote server. It will only affect the objects
    already present in the database.

    To import a particular harvest source, specify its id as an argument.
    To import a particular harvest object use the -o option.
    To import a particular package use the -p option.

    You will need to specify the -j flag in cases where the datasets are
    not yet created (e.g. first harvest, or all previous harvests have
    failed)

    The --segments flag allows you to define a string containing hex digits that represent
    which of the 16 harvest object segments to import, e.g. 15af will run segments 1, 5, a and f

harvester job-all
  - create new harvest jobs for all active sources.

harvester reindex
  - reindexes the harvest source datasets

The commands should be run with the pyenv activated and refer to your CKAN configuration file:

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini harvester --help

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini harvester sources

Note that on CKAN >= 2.9 all commands with an underscore in their name changed. They now use a hyphen instead of an underscore (e.g. gather_consumer changed to gather-consumer).

Authorization

Harvest sources behave exactly the same as datasets (they are actually internally implemented as a dataset type). That means they can be searched and faceted, and that the same authorization rules can be applied to them. The default authorization settings are based on organizations.

Have a look at the Authorization documentation on CKAN core to see how to configure your instance depending on your needs.

The CKAN harvester

The plugin includes a harvester for remote CKAN instances. To use it, you need to add the ckan_harvester plugin to your options file:

ckan.plugins = harvest ckan_harvester

After adding it, a 'CKAN' option should appear in the 'New harvest source' form.

The CKAN harvester supports a number of configuration options to control its behaviour. Those need to be defined as a JSON object in the configuration form field. The currently supported configuration options are:

  • api_version: You can force the harvester to use either version 1 or 2 of the CKAN API. Default is 2.
  • default_tags: A list of tags that will be added to all harvested datasets. Tags don't need to previously exist. This field takes a list of tag dicts (see example), which allows you to optionally specify a vocabulary.
  • default_groups: A list of group IDs or names to which the harvested datasets will be added. The groups must exist.
  • default_extras: A dictionary of key/value pairs that will be added to the extras of the harvested datasets. You can use the following replacement strings, which will be replaced before creating or updating the datasets:
    • {dataset_id}
    • {harvest_source_id}
    • {harvest_source_url} # Will be stripped of trailing forward slashes (/)
    • {harvest_source_title}
    • {harvest_job_id}
    • {harvest_object_id}
  • override_extras: Assign default extras even if they already exist in the remote dataset. Default is False (only non-existing extras are added).
  • user: User who will run the harvesting process. Please note that this user needs permission to create packages and, if default groups were defined, permission to assign packages to those groups.
  • api_key: If the remote CKAN instance has restricted access to the API, you can provide a CKAN API key, which will be sent in any request.
  • read_only: Create harvested packages in read-only mode. Only the user who performed the harvest (the one defined in the previous setting or the 'harvest' sysadmin) will be able to edit and administer the packages created from this harvesting source. Logged-in users and visitors will only be able to read them.
  • force_all: By default, after the first harvest, the harvester will gather only the packages modified on the remote site since the last harvest. Setting this property to true forces the harvester to gather all remote packages regardless of their modification date. Default is False.
  • remote_groups: By default, remote groups are ignored. Setting this property enables the harvester to import the remote groups. There are two alternatives: setting it to 'only_local' will import only those groups whose name/id is already present in the local CKAN, while setting it to 'create' will attempt to create the groups by copying the details from the remote CKAN.
  • remote_orgs: By default, remote organizations are ignored. Setting this property enables the harvester to import remote organizations. There are two alternatives: setting it to 'only_local' will import only those organizations whose id is already present in the local CKAN, while setting it to 'create' will attempt to create the organizations by copying the details from the remote CKAN.
  • clean_tags: By default, tags are not stripped of accent characters, spaces and capital letters for display. If this option is set to True, accent characters will be replaced by their ASCII equivalents, capital letters replaced by lower-case ones, and spaces replaced with dashes. Setting this option to False has the same effect as leaving it unset.
  • organizations_filter_include: This configuration option allows you to specify a list of remote organization names (e.g. "arkansas-gov" is the name of the organization http://catalog.data.gov/organization/arkansas-gov). If this property has a value, then only datasets that are in one of these organizations will be harvested; all other datasets will be skipped. Only one of organizations_filter_include or organizations_filter_exclude should be configured.
  • organizations_filter_exclude: This configuration option allows you to specify a list of remote organization names (e.g. "arkansas-gov" is the name of the organization http://catalog.data.gov/organization/arkansas-gov). If this property is set, then all datasets from the remote source will be harvested unless they belong to one of the organizations in this option. Only one of organizations_filter_exclude or organizations_filter_include should be configured.
  • groups_filter_include: Exactly the same as organizations_filter_include but for groups.
  • groups_filter_exclude: Exactly the same as organizations_filter_exclude but for groups.

Here is an example of a configuration object (the one that must be entered in the configuration field):

{
 "api_version": 1,
 "default_tags": [{"name": "geo"}, {"name": "namibia"}],
 "default_groups": ["science", "spend-data"],
 "default_extras": {"encoding":"utf8", "harvest_url": "{harvest_source_url}/dataset/{dataset_id}"},
 "override_extras": true,
 "organizations_filter_include": [],
 "organizations_filter_exclude": ["remote-organization"],
 "user":"harverster-user",
 "api_key":"<REMOTE_API_KEY>",
 "read_only": true,
 "remote_groups": "only_local",
 "remote_orgs": "create"
}

Plugins can extend the default CKAN harvester and implement modify_package_dict in order to modify the dataset dict generated by the harvester just before it is created or updated. For instance, they might want to add or delete certain fields, or fire additional tasks based on the metadata fields.

Plugins will get the dataset dict including any processing described above (eg with the correct groups assigned, replacement strings applied, etc). They will also be passed the harvest object, which contains the original, unmodified dataset dict in its content property.

This is a simple example:

from ckanext.harvest.harvesters.ckanharvester import CKANHarvester

class MySiteCKANHarvester(CKANHarvester):

    def modify_package_dict(self, package_dict, harvest_object):

        # Set a default custom field

        package_dict['remote_harvest'] = True

        # Add tags
        package_dict['tags'].append({'name': 'sdi'})
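        # The original, unmodified remote dataset dict is also available as a
        # JSON string in harvest_object.content, should you need to inspect it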

        return package_dict

Remember to register your custom harvester plugin in your extension's setup.py file, and load the plugin in the configuration file afterwards:

# setup.py

entry_points='''
    [ckan.plugins]
    my_site=ckanext.my_site.plugin:MySitePlugin
    my_site_ckan_harvester=ckanext.my_site.harvesters:MySiteCKANHarvester
'''


# ini file
ckan.plugins = ... my_site my_site_ckan_harvester

The harvesting interface

Extensions can implement the harvester interface to perform harvesting operations. The harvesting process takes place in three stages:

  1. The gather stage compiles all the resource identifiers that need to be fetched in the next stage (e.g. in a CSW server, it will perform a GetRecords operation).
  2. The fetch stage gets the contents of the remote objects and stores them in the database (e.g. in a CSW server, it will perform n GetRecordById operations).
  3. The import stage performs any necessary actions on the fetched resource (generally creating a CKAN package, but it can be anything the extension needs).

Plugins willing to implement the harvesting interface must provide the following methods:

from ckan.plugins.core import SingletonPlugin, implements
from ckanext.harvest.interfaces import IHarvester

class MyHarvester(SingletonPlugin):
    '''
    A Test Harvester
    '''
    implements(IHarvester)

    def info(self):
        '''
        Harvesting implementations must provide this method, which will return
        a dictionary containing different descriptors of the harvester. The
        returned dictionary should contain:

        * name: machine-readable name. This will be the value stored in the
          database, and the one used by ckanext-harvest to call the appropriate
          harvester.
        * title: human-readable name. This will appear in the form's select box
          in the WUI.
        * description: a small description of what the harvester does. This
          will appear on the form as a guidance to the user.

        A complete example may be::

            {
                'name': 'csw',
                'title': 'CSW Server',
                'description': 'A server that implements the OGC Catalog '
                               'Service for the Web (CSW) standard'
            }

        :returns: A dictionary with the harvester descriptors
        '''

    def validate_config(self, config):
        '''
        [optional]

        Harvesters can provide this method to validate the configuration
        entered in the form. It should return a single string, which will be
        stored in the database. Exceptions raised will be shown in the form's
        error messages.

        :param config: Config string coming from the form
        :returns: A string with the validated configuration options
        '''

    def get_original_url(self, harvest_object_id):
        '''
        [optional]

        This optional but highly recommended method allows harvesters to
        return the URL of the original remote document, given a HarvestObject
        id. Note that from the harvest object you have access to its guid as
        well as the object source, which has the URL.
        This URL will be used on error reports to help publishers link to the
        original document that has the errors. If this method is not provided
        or no URL is returned, only a link to the local copy of the remote
        document will be shown.

        Examples:
            * For a CKAN record: http://{ckan-instance}/api/rest/{guid}
            * For a WAF record: http://{waf-root}/{file-name}
            * For a CSW record: http://{csw-server}/?Request=GetElementById&Id={guid}&...

        :param harvest_object_id: HarvestObject id
        :returns: A string with the URL to the original document
        '''

    def gather_stage(self, harvest_job):
        '''
        The gather stage will receive a HarvestJob object and will be
        responsible for:
            - gathering all the necessary objects to fetch in a later
              stage (e.g. for a CSW server, perform a GetRecords request)
            - creating the necessary HarvestObjects in the database, specifying
              the guid and a reference to their job. The HarvestObjects need a
              reference date with the last modified date for the resource; this
              may need to be set in a different stage depending on the type of
              source.
            - creating and storing any suitable HarvestGatherErrors that may
              occur.
            - returning a list with all the ids of the created HarvestObjects.
            - to abort the harvest, create a HarvestGatherError and raise an
              exception. Any created HarvestObjects will be deleted.

        :param harvest_job: HarvestJob object
        :returns: A list of HarvestObject ids
        '''

    def fetch_stage(self, harvest_object):
        '''
        The fetch stage will receive a HarvestObject object and will be
        responsible for:
            - getting the contents of the remote object (e.g. for a CSW
              server, perform a GetRecordById request).
            - saving the content in the provided HarvestObject.
            - creating and storing any suitable HarvestObjectErrors that may
              occur.
            - returning True if everything is ok (ie the object should now be
              imported), "unchanged" if the object didn't need harvesting after
              all (ie no error, but don't continue to the import stage) or
              False if there were errors.

        :param harvest_object: HarvestObject object
        :returns: True if successful, 'unchanged' if nothing to import after
                  all, False if not successful
        '''

    def import_stage(self, harvest_object):
        '''
        The import stage will receive a HarvestObject object and will be
        responsible for:
            - performing any necessary action with the fetched object (e.g.
              create, update or delete a CKAN package).
              Note: if this stage creates or updates a package, a reference
              to the package should be added to the HarvestObject.
            - setting the HarvestObject.package (if there is one)
            - setting the HarvestObject.current for this harvest:
               - True if successfully created/updated
               - False if successfully deleted
            - setting HarvestObject.current to False for previous harvest
              objects of this harvest source if the action was successful.
            - creating and storing any suitable HarvestObjectErrors that may
              occur.
            - creating the HarvestObject - Package relation (if necessary)
            - returning True if the action was done, "unchanged" if the object
              didn't need harvesting after all or False if there were errors.

        NB You can run this stage repeatedly using the 'harvester import'
        command.

        :param harvest_object: HarvestObject object
        :returns: True if the action was done, "unchanged" if the object didn't
                  need harvesting after all or False if there were errors.
        '''

See the CKAN harvester for an example of how to implement the harvesting interface:

  • ckanext-harvest/ckanext/harvest/harvesters/ckanharvester.py

You can also find other examples of custom harvesters in extensions such as ckanext-spatial and ckanext-dcat.

Running the harvest jobs

There are two ways to run a harvest:

  1. harvester run-test for the command-line, suitable for testing
  2. harvester run used by the Web UI and scheduled runs

harvester run-test

You can run a harvester simply by using the run-test command. This is handy for running a harvest with one command in the console and seeing all the output in-line. It runs the gather, fetch and import stages all in the same process. Before using the run-test command, make sure you have pip-installed dev-requirements.txt in the ckanext-harvest source directory.
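
For example:

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini harvester run-test {source-id/name}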

This is useful for developing a harvester because you can insert break-points in your harvester, and rerun a harvest without having to restart the gather_consumer and fetch_consumer processes each time. In addition, because it doesn't use the queue backends it doesn't interfere with harvests of other sources that may be going on in the background.

However, when running this way, exceptions raised by gather_stage, fetch_stage or import_stage are not caught, whereas with harvester run they are handled slightly differently, as the stages are called by queue.py. So when testing this aspect it's best to use harvester run.

harvester run

When a harvest job is started by a user in the Web UI, or by a scheduled harvest, the harvest is started by the harvester run command. This is the normal method in production systems and scales well.

In this case, the harvesting extension uses two different queues: one that handles the gathering and another one that handles the fetching and importing. To start the consumers run the following command (make sure you have your python environment activated):

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini harvester gather-consumer

On another terminal, run the following command:

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini harvester fetch-consumer

Finally, on a third console, run the following command to start any pending harvesting jobs:

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini harvester run

The run command not only starts any pending harvest jobs, but also flags those that are finished, allowing new jobs to be created on that particular source and refreshing the source statistics. This means that you will need to run this command before being able to create a new job on a source that was already being harvested. (On a production site you will typically have a cron job that runs the command regularly; see the next section.)

Occasionally you may find a harvest job in a "limbo state", where the job has run with errors but the harvester run command will not mark it as finished, and therefore you cannot run another job. This is due to a particular harvester not handling errors correctly, e.g. during development. In this circumstance, ensure that the gather and fetch consumers are running and have nothing more to consume, and then run the abort command with the name or id of the harvest source:

(pyenv) $ ckan --config=/etc/ckan/default/ckan.ini harvester job-abort {source-id/name}

Setting up the harvesters on a production server

The previous approach works fine during development or debugging, but it is not recommended for production servers. There are several possible ways of setting up the harvesters, which will depend on your particular infrastructure and needs. The bottom line is that the gather and fetch processes should be kept running somehow, and the run command should then be run periodically to start any pending jobs.

The following approach is the one generally used on CKAN deployments, and it will probably suit most of the users. It uses Supervisor, a tool to monitor processes, and a cron job to run the harvest jobs, and it assumes that you have already installed and configured the harvesting extension (See Installation if not).

Note: It is recommended to run the harvest process from a non-root user (generally the one you are running CKAN with). Replace the user ckan in the following steps with the one you are using.

  1. Install Supervisor:

    sudo apt-get update
    sudo apt-get install supervisor

    You can check if it is running with this command:

    ps aux | grep supervisord

    You should see a line similar to this one:

    root      9224  0.0  0.3  56420 12204 ?        Ss   15:52   0:00 /usr/bin/python /usr/bin/supervisord
  2. Supervisor needs to have programs added to its configuration, which describe the tasks that need to be monitored. These configuration files are stored in /etc/supervisor/conf.d.

    Create a file named /etc/supervisor/conf.d/ckan_harvesting.conf, and copy the following contents:

    On CKAN >= 2.9:

    ; ===============================
    ; ckan harvester
    ; ===============================
    
    [program:ckan_gather_consumer]
    
    command=/usr/lib/ckan/default/bin/ckan --config=/etc/ckan/default/ckan.ini harvester gather-consumer
    
    ; user that owns virtual environment.
    user=ckan
    
    numprocs=1
    stdout_logfile=/var/log/ckan/std/gather_consumer.log
    stderr_logfile=/var/log/ckan/std/gather_consumer.log
    autostart=true
    autorestart=true
    startsecs=10
    
    [program:ckan_fetch_consumer]
    
    command=/usr/lib/ckan/default/bin/ckan --config=/etc/ckan/default/ckan.ini harvester fetch-consumer
    
    ; user that owns virtual environment.
    user=ckan
    
    numprocs=1
    stdout_logfile=/var/log/ckan/std/fetch_consumer.log
    stderr_logfile=/var/log/ckan/std/fetch_consumer.log
    autostart=true
    autorestart=true
    startsecs=10

    There are a number of things that you will need to replace with your specific installation settings (the example above shows paths from a CKAN instance installed via Debian packages):

    • command: The absolute path to the ckan command located in the Python virtual environment and the absolute path to the config ini file.
    • user: The unix user you are running CKAN with.
    • stdout_logfile and stderr_logfile: All output coming from the harvest consumers will be written to these files. Ensure that the necessary permissions are set up.

    The rest of the configuration options are pretty self-explanatory. Refer to the Supervisor documentation to learn more about these and other available options.

  3. Start the supervisor tasks with the following commands:

    sudo supervisorctl reread
    sudo supervisorctl add ckan_gather_consumer
    sudo supervisorctl add ckan_fetch_consumer
    sudo supervisorctl start ckan_gather_consumer
    sudo supervisorctl start ckan_fetch_consumer

    To check that the processes are running, you can run:

    sudo supervisorctl status
    
    ckan_fetch_consumer              RUNNING    pid 6983, uptime 0:22:06
    ckan_gather_consumer             RUNNING    pid 6968, uptime 0:22:45

    Some problems you may encounter when starting the processes:

    • ckan_gather_consumer: ERROR (no such process)

      Double-check your supervisor configuration file and stop and restart the supervisor daemon:

      sudo service supervisor stop; sudo service supervisor start
    • ckan_gather_consumer: ERROR (abnormal termination)

      Something prevented the command from running properly. Have a look at the log file that you defined in the stdout_logfile section to see what happened. Common errors include:

      socket.error: [Errno 111] Connection refused

      RabbitMQ is not running. Start it with:

        sudo service rabbitmq-server start
  4. Once we have the two consumers running and monitored, we just need to create a cron job that runs the harvester run command periodically. To do so, edit the cron table with the following command (it may ask you to choose an editor):

    sudo crontab -e -u ckan

    Note that we are running this command as the same user we configured the processes to be run with (ckan in our example).

    Paste this line into your crontab, again replacing the paths to the ckan command and the ini file with yours:

    # m   h  dom mon dow  command
    */15 *   *   *   *    /usr/lib/ckan/default/bin/ckan -c /etc/ckan/default/ckan.ini harvester run

    This particular example will check for pending jobs every fifteen minutes. You can of course modify this periodicity; Wikipedia has a good overview of the crontab syntax.

  5. In order to set up the clean-up mechanism for the harvest log, one more cron job needs to be scheduled:

    sudo crontab -e -u ckan

    Paste this line into your crontab, again replacing the path to the ckan command and the ini file with yours:

    # m h dom mon dow command

    0 5 * * * /usr/lib/ckan/default/bin/ckan -c /etc/ckan/default/ckan.ini harvester clean-harvest-log

    This particular example will perform the clean-up each day at 5 AM. You can tweak the schedule according to your needs.

Extensible actions

Recipients on harvest jobs notifications

harvest_get_notifications_recipients: you can chain this action from another extension to change the recipients of harvest job notifications.
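
For example, here is a minimal sketch of chaining this action from another extension, assuming CKAN >= 2.7 (which provides toolkit.chained_action). The plugin name, the recipient dict keys and the extra email address are illustrative assumptions, not part of ckanext-harvest:

import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit


@toolkit.chained_action
def harvest_get_notifications_recipients(original_action, context, data_dict):
    # Start from the recipients computed by ckanext-harvest
    recipients = original_action(context, data_dict)
    # Add one extra, hard-coded recipient (illustrative)
    recipients.append({'name': 'Data team', 'email': 'data-team@example.com'})
    return recipients


class MyNotificationsPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IActions)

    def get_actions(self):
        return {
            'harvest_get_notifications_recipients':
                harvest_get_notifications_recipients,
        }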

Tests

You can run the tests like this:

cd ckanext-harvest
pytest --ckan-ini=test.ini ckanext/harvest/tests

Here are some common errors and solutions:

  • (OperationalError) no such table: harvest_object_error u'delete from "harvest_object_error": the database has got into a bad state. Run the tests again, this time with the --reset-db parameter.
  • (ProgrammingError) relation "harvest_object_extra" does not exist: the database has got into a bad state. Run the tests again, this time without the --reset-db parameter. Alternatively, you may have forgotten the --ckan-ini parameter.
  • (OperationalError) near "SET": syntax error: you are testing with SQLite as the database, but the CKAN harvester needs PostgreSQL. Specify test-core.ini instead of test.ini.

Harvest API

ckanext-harvest exposes several API endpoints, in the format /api/action/<endpoint>.

  • /api/action/harvest_source_list

This endpoint will return all the harvest sources in CKAN, with a default limit of 100 items. The limit can be set to a bespoke value in the CKAN config under ckan.harvest.harvest_source_limit.

An optional query parameter organization_id can be used to narrow down the results to only the harvest sources created by a certain organization, by supplying its organization id: /api/action/harvest_source_list?organization_id=<some-org-id>
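
For example (the organization id below is hypothetical):

$ curl {ckan_url}/api/3/action/harvest_source_list?organization_id=38a85f0e-803d-4f85-8078-a5968addc173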

Releases

To create a new release, follow these steps:

  • Determine new release number based on the rules of semantic versioning
  • Update the CHANGELOG, especially the link for the "Unreleased" section
  • Update the version number in setup.py
  • Create a new release on GitHub and add the CHANGELOG of this release as release notes

Community

Contributing

For contributing to ckanext-harvest or its documentation, follow the guidelines described in CONTRIBUTING.

License

This extension is open source and licensed under the GNU Affero General Public License (AGPL) v3.0. Its full text may be found at:

http://www.fsf.org/licensing/licenses/agpl-3.0.html

ckanext-harvest's People

Contributors

amercader, avdata99, bellisk, bonnland, brucebolt, etj, frafra, fuhuxia, icmurray, jbrown-xentity, jin-sun-tts, joetsoi, johnmartin, kentsanggds, kindly, metaodi, nickumia-reisys, pdekraker-epa, pdelboca, polarp, pudo, raphaelstolt, seitenbau-govdata, smotornyuk, stefina, tino097, tobes, tomecirun, wwaites, zharktas


ckanext-harvest's Issues

New Harvest Source form cleanup

Since upgrading Bootstrap and the new IA changes that are now both in release-v2.0 of CKAN core, the harvest source form has become a little broken.


Final clean up before 2.0

Including:

  • Old auth profile stuff
  • Old routes
  • Templates (will keep them in the source for the time being)
  • Controller

Improve Job Errors reporting

For the last job of a particular source, show a summary of the most common errors, as well as a list of all documents with their errors.

Harvest Source template tweaks

There are a few minor things that are broken in them at the moment. These fixes should end up in release-v2.0:

  • From within an org's harvest source pages there should be a way to add a harvest source to that org
  • From within an org's harvest source pages we should link to the admin page, not the edit page, for a harvest source
  • Like dataset pages within orgs, harvest source pages should inherit the breadcrumbs

Improve gather stage error handling

The way queue.py handles the gather stage is very inconvenient, as it captures all exceptions that may happen in the harvester's gather_stage, preventing all debugging and leaving the job in a half-finished state.

Similarly to the refactoring of the fetch stage, there should be no exceptions caught, as the harvesters themselves should be robust enough (if exceptions are happening, we want to see them). If the gather stage returns anything that is not a list of ids, the process stops. If the harvest object ids list is empty, the process stops. gather_finished is always set, allowing the "run" command to flag the job as Finished (we want the "run" command to do it so the harvest source status is reindexed).

Also, the debug messages with thousands of harvest object ids are not very useful.

At a later stage, we could investigate implementing retry times as in the fetch stage.
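
As a sketch of that contract from a harvester's point of view, using the existing _save_gather_error helper from HarvesterBase (the remote-listing helper below is hypothetical, and a real harvester would also implement info() and the other stages):

from ckanext.harvest.harvesters.base import HarvesterBase
from ckanext.harvest.model import HarvestObject


class RobustHarvester(HarvesterBase):

    def gather_stage(self, harvest_job):
        try:
            guids = self._list_remote_guids(harvest_job.source.url)  # hypothetical
        except Exception as e:
            # Record the problem; returning something that is not a list of
            # ids stops the job as described above
            self._save_gather_error('Unable to list remote records: %s' % e,
                                    harvest_job)
            return None

        # One HarvestObject per remote record, referencing this job
        ids = []
        for guid in guids:
            obj = HarvestObject(guid=guid, job=harvest_job)
            obj.save()
            ids.append(obj.id)
        return ids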

Default package name to its current one, if the user hasn't passed any

When we're creating a new package, we create its name based on its title, so the user doesn't have to care about sending a valid name. But when updating, we require a name. As we require it to be in a certain format, it's hard for the user to guarantee that they're building a correct name and keeping it in sync with CKAN, if we ever change how we build names.

Harvesting extras from a CKAN 1.8 site into a CKAN 2.0 site fails (SQLAlchemy error: can't adapt type dict)

If you try to harvest a dataset from a CKAN 1 site and the dataset has an extra with a non-string value (e.g. the type of the extra's value is dict or list etc.), CKAN just returns a 500 Server Error to the API client. In the CKAN logs, you get this error from SQLAlchemy:

ProgrammingError: (ProgrammingError) can't adapt type 'dict' 'INSERT INTO package_extra...

In CKAN 1.8 it was possible to post non-string extras such as lists, dicts, etc. Undocumented, but possible.

In CKAN 2.0, this was changed so that extras must be strings.

(The docs for both 2.0 and 1.8 say that extras should be strings.)

The change is in ckan/model/package_extra.py. In CKAN 2.0 the value of a package extra in the database has type UnicodeText: https://github.com/okfn/ckan/blob/master/ckan/model/package_extra.py#L22

That's why SQLAlchemy crashes when you try to post something like a dict.

In CKAN 1.8, this database column had type JsonType, so posting dicts etc would work: https://github.com/okfn/ckan/blob/release-v1.8/ckan/model/package_extra.py#L22

This is the commit that made the change, in CKAN 2.0: ckan/ckan@fc3bd3d

So this means that the CKAN harvester is broken, because CKAN crashes when trying to harvest datasets from a CKAN 1 site into a CKAN 2 site if those datasets have non-string extras. I'm not sure what the fix should be; the harvester could simply remove any non-string extras from the dataset, or it could try to convert them to strings using JSON.
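
A minimal sketch of the JSON-conversion approach suggested above, assuming package_dict is the harvested dataset dict with its extras as a list of {'key': ..., 'value': ...} dicts:

import json

for extra in package_dict.get('extras', []):
    # CKAN 2.x requires extra values to be strings, so serialize anything else
    if not isinstance(extra['value'], str):
        extra['value'] = json.dumps(extra['value'])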

CKAN harvester bugged with relative links

I tried the harvester on the CKAN instance http://dati.toscana.it but it retrieves useless data for datasets like this one: http://dati.toscana.it/api/rest/dataset/arte-e-cultura/resource/8912d987-7f6b-48a0-b376-4613bb8e7905

As you can see the csv file is local on CKAN and has a relative path.
"url": "/it/storage/f/2012-07-26T160139/intoscana-arte-e-cultura.csv",
The harvester just copies the relative path but doesn't copy the actual file, so the relative link won't work.

save harvest object summaries to harvest job once a job is completed.

On large instances such as pdeu, viewing all the harvest jobs for a source is painful, so painful that I'm disabling the summaries apart from on the view_job pages. Since these summaries will never change (unless they are deleted), it might be worth doing a bit of denormalization and saving the job statistics to the harvest_job table, or providing a way of disabling the summaries without resorting to your own extension.

Missing dependency pyparsing

On branch release-v2.0 I had to add pyparsing==1.5.7 to pip-requirements.txt in order to get the initdb command to run.

Flag anonymous auth functions as such

Starting from 2.2 you need to explicitly flag auth functions that allow anonymous access with the p.toolkit.auth_allow_anonymous_access decorator. We'll need to keep backwards compatibility by only adding it on CKAN >= 2.2

Unable to add harvest source on command-line

Currently it's impossible to add a new harvest source using the CLI.

The README describes the usage as follows:

harvester source {url} {type} [{config}] [{active}] [{user-id}] [{publisher-id}] [{frequency}]

If I just specify the URL and the type like that:

paster --plugin=ckanext-harvest harvester source http://localhost ckan -c development.ini

I get a ckan.logic.ValidationError:

2013-08-05 20:37:04,366 INFO  [ckanext.harvest.logic.action.create] Creating harvest source: {'user_id': u'', 'url': u'http://localhost', 'type': u'ckan', 'frequency': 'MANUAL', 'publisher_id': u'', 'active': True, 'config': None}
An error occurred:
{'name': ['Missing value'], 'title': ['Missing value'], 'source_type': ['Missing value']}
Traceback (most recent call last):
  File "/home/vagrant/pyenv/bin/paster", line 9, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "/home/vagrant/pyenv/local/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/home/vagrant/pyenv/local/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "/home/vagrant/pyenv/local/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "/vagrant/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 104, in command
    self.create_harvest_source()
  File "/vagrant/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 217, in create_harvest_source
    raise e
ckan.logic.ValidationError: {'Name': 'Missing value', 'Source type': 'Missing value', 'Title': 'Missing value'}

Delete harvest sources on 2.0

  • Update UI
  • Use the package delete action, syncing internally the source object

Bulk deletion of the source datasets will need to target 2.1

Memory leaks when viewing a harvester with 10000+ datasets

When clicking view on a harvester with 10000+ datasets in it, I get a memory leak, using CentOS 6.3 as the Apache WSGI runtime and the latest git versions of ckanext-harvest and CKAN. To reproduce it, I navigate to '/harvest' and click view on the harvester which has a lot of datasets. The system is a KVM virtual machine.

It returns no data, but in dmesg it shows up as:

hrtimer: interrupt took 31998821 ns
BUG: soft lockup - CPU#0 stuck for 67s! [java:1566]
Modules linked in: ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mod i2c_piix4 i2c_core virtio_balloon e1000 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix [last unloaded: scsi_wait_scan]
CPU 0 
Modules linked in: ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mod i2c_piix4 i2c_core virtio_balloon e1000 snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix [last unloaded: scsi_wait_scan]

Pid: 1566, comm: java Not tainted 2.6.32-279.el6.x86_64 #1 Bochs Bochs
RIP: 0010:[<ffffffff81500126>]  [<ffffffff81500126>] _spin_lock+0x26/0x30
RSP: 0018:ffff880037763918  EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffff880037763918 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8800bc9d55c0 RDI: ffff8800b8f10998
RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000134a0 R11: 0000000000000000 R12: ffffea0000e41ae0
R13: 80000000412c4067 R14: ffffffff811497a4 R15: ffff880037763908
FS:  00007fb9a492f700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000ffd24ea8 CR3: 0000000037154000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process java (pid: 1566, threadinfo ffff880037762000, task ffff8800373faae0)
Stack:
 ffff880037763948 ffffffff811480e3 0000000000000000 ffff8800b8f10998
<d> ffffea0000e41aa8 0000000000000000 ffff8800377639e8 ffffffff811682a8
<d> ffff880000000001 ffff8800b8f10998 0000000000000000 0000880000000000
Call Trace:
 [<ffffffff811480e3>] ? page_lock_anon_vma+0x53/0x70
 [<ffffffff811682a8>] ? migrate_pages+0x3a8/0x4b0
 [<ffffffff8115d8f0>] ? compaction_alloc+0x0/0x3e0
 [<ffffffff8115e1e7>] ? compact_zone+0x517/0x820
 [<ffffffff8115e771>] ? compact_zone_order+0xa1/0xe0
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff8115e8cc>] ? try_to_compact_pages+0x11c/0x190
 [<ffffffff81127415>] ? __alloc_pages_nodemask+0x5f5/0x940
 [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8115c2da>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff81176635>] ? do_huge_pmd_anonymous_page+0x145/0x380
 [<ffffffff8113fe7a>] ? handle_mm_fault+0x25a/0x2b0
 [<ffffffff810a68c2>] ? do_futex+0x682/0xb00
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500625>] ? page_fault+0x25/0x30
Code: e3 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f 1f 44 00 00 83 3f 00 <75> f4 eb df c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 f0 81 
Call Trace:
 [<ffffffff811480e3>] ? page_lock_anon_vma+0x53/0x70
 [<ffffffff811682a8>] ? migrate_pages+0x3a8/0x4b0
 [<ffffffff8115d8f0>] ? compaction_alloc+0x0/0x3e0
 [<ffffffff8115e1e7>] ? compact_zone+0x517/0x820
 [<ffffffff8115e771>] ? compact_zone_order+0xa1/0xe0
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff8115e8cc>] ? try_to_compact_pages+0x11c/0x190
 [<ffffffff81127415>] ? __alloc_pages_nodemask+0x5f5/0x940
 [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8115c2da>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff81176635>] ? do_huge_pmd_anonymous_page+0x145/0x380
 [<ffffffff8113fe7a>] ? handle_mm_fault+0x25a/0x2b0
 [<ffffffff810a68c2>] ? do_futex+0x682/0xb00
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500625>] ? page_fault+0x25/0x30
(Similar soft-lockup traces follow for CPU#5, CPU#1 and CPU#2; the rest of the dump is truncated.)
Process java (pid: 1565, threadinfo ffff880037058000, task ffff8800372eaaa0)
Stack:
 ffff8800370598c8 ffffffff811480e3 000000000000fb88 ffff8800b8f10998
<d> ffffea0000ee8618 ffffea0000ee8618 ffff880037059928 ffffffff81149871
<d> ffffc9000060b000 ffffea0000ee8618 ffffea0000ee85e0 ffff8800370599a8
Call Trace:
 [<ffffffff811480e3>] ? page_lock_anon_vma+0x53/0x70
 [<ffffffff81149871>] ? try_to_unmap_anon+0x21/0x140
 [<ffffffff8114a1e5>] ? try_to_unmap+0x55/0x70
 [<ffffffff81168139>] ? migrate_pages+0x239/0x4b0
 [<ffffffff8115d8f0>] ? compaction_alloc+0x0/0x3e0
 [<ffffffff8115e1e7>] ? compact_zone+0x517/0x820
 [<ffffffff8115e771>] ? compact_zone_order+0xa1/0xe0
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff8115e8cc>] ? try_to_compact_pages+0x11c/0x190
 [<ffffffff81127415>] ? __alloc_pages_nodemask+0x5f5/0x940
 [<ffffffff810a3b49>] ? futex_wait_queue_me+0xb9/0xf0
 [<ffffffff8115c2da>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff81176635>] ? do_huge_pmd_anonymous_page+0x145/0x380
 [<ffffffff8113fe7a>] ? handle_mm_fault+0x25a/0x2b0
 [<ffffffff810a6340>] ? do_futex+0x100/0xb00
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
 [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500625>] ? page_fault+0x25/0x30
Code: 00 00 00 01 74 05 e8 e2 e3 d7 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 
Call Trace:
 [<ffffffff811480e3>] ? page_lock_anon_vma+0x53/0x70
 [<ffffffff81149871>] ? try_to_unmap_anon+0x21/0x140
 [<ffffffff8114a1e5>] ? try_to_unmap+0x55/0x70
 [<ffffffff81168139>] ? migrate_pages+0x239/0x4b0
 [<ffffffff8115d8f0>] ? compaction_alloc+0x0/0x3e0
 [<ffffffff8115e1e7>] ? compact_zone+0x517/0x820
 [<ffffffff8115e771>] ? compact_zone_order+0xa1/0xe0
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff8115e8cc>] ? try_to_compact_pages+0x11c/0x190
 [<ffffffff81127415>] ? __alloc_pages_nodemask+0x5f5/0x940
 [<ffffffff810a3b49>] ? futex_wait_queue_me+0xb9/0xf0
 [<ffffffff8115c2da>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff81176635>] ? do_huge_pmd_anonymous_page+0x145/0x380
 [<ffffffff8113fe7a>] ? handle_mm_fault+0x25a/0x2b0
 [<ffffffff810a6340>] ? do_futex+0x100/0xb00
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
 [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500625>] ? page_fault+0x25/0x30
epmd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
epmd cpuset=/ mems_allowed=0
Pid: 1321, comm: epmd Not tainted 2.6.32-279.el6.x86_64 #1
Call Trace:
 [<ffffffff810c4971>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff811170e0>] ? dump_header+0x90/0x1b0
 [<ffffffff812146fc>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81117562>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff811174a1>] ? select_bad_process+0xe1/0x120
 [<ffffffff811179a0>] ? out_of_memory+0x220/0x3c0
 [<ffffffff811276be>] ? __alloc_pages_nodemask+0x89e/0x940
 [<ffffffff8115c1da>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff811144e7>] ? __page_cache_alloc+0x87/0x90
 [<ffffffff8118fec0>] ? pollwake+0x0/0x60
 [<ffffffff8112a10b>] ? __do_page_cache_readahead+0xdb/0x210
 [<ffffffff8112a261>] ? ra_submit+0x21/0x30
 [<ffffffff81115813>] ? filemap_fault+0x4c3/0x500
 [<ffffffff8113ec14>] ? __do_fault+0x54/0x510
 [<ffffffff81127b2f>] ? free_hot_page+0x2f/0x60
 [<ffffffff8113f1c7>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff81010ba0>] ? copy_user_generic+0x0/0x20
 [<ffffffff81010bae>] ? copy_user_generic+0xe/0x20
 [<ffffffff8118fbe9>] ? set_fd_set+0x49/0x60
 [<ffffffff811910bc>] ? core_sys_select+0x1ec/0x2c0
 [<ffffffff8113fe04>] ? handle_mm_fault+0x1e4/0x2b0
 [<ffffffff81044479>] ? __do_page_fault+0x139/0x480
 [<ffffffff8103876c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff81038779>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff8109cd39>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff8118fb18>] ? poll_select_copy_remaining+0xf8/0x150
 [<ffffffff8150326e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff81500625>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
CPU    4: hi:    0, btch:   1 usd:   0
CPU    5: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  33
CPU    1: hi:  186, btch:  31 usd:   8
CPU    2: hi:  186, btch:  31 usd: 150
CPU    3: hi:  186, btch:  31 usd:  28
CPU    4: hi:  186, btch:  31 usd:  31
CPU    5: hi:  186, btch:  31 usd:  50
active_anon:539479 inactive_anon:144774 isolated_anon:0
 active_file:5 inactive_file:209 isolated_file:32
 unevictable:0 dirty:3 writeback:0 unstable:0
 free:14247 slab_reclaimable:2867 slab_unreclaimable:13627
 mapped:8838 shmem:8817 pagetables:5063 bounce:0
Node 0 DMA free:12184kB min:224kB low:280kB high:336kB active_anon:1248kB inactive_anon:2048kB active_file:20kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15344kB mlocked:0kB dirty:0kB writeback:0kB mapped:28kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:4kB kernel_stack:0kB pagetables:28kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2990 2990 2990
Node 0 DMA32 free:44804kB min:44828kB low:56032kB high:67240kB active_anon:2156668kB inactive_anon:577048kB active_file:0kB inactive_file:836kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:3062308kB mlocked:0kB dirty:12kB writeback:0kB mapped:35324kB shmem:35268kB slab_reclaimable:11456kB slab_unreclaimable:54504kB kernel_stack:2104kB pagetables:20224kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 4*4kB 4*8kB 3*16kB 2*32kB 4*64kB 0*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 1*4096kB = 12192kB
Node 0 DMA32: 1303*4kB 771*8kB 421*16kB 228*32kB 113*64kB 43*128kB 16*256kB 3*512kB 1*1024kB 0*2048kB 0*4096kB = 44804kB
9131 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
780284 pages RAM
47313 pages reserved
68602 pages shared
705483 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  454]     0   454     2832      272   0     -17         -1000 udevd
[ 1064]     0  1064     2278      125   5       0             0 dhclient
[ 1108]     0  1108     6908       68   2     -17         -1000 auditd
[ 1133]     0  1133    62270      145   5       0             0 rsyslogd
[ 1145]    81  1145     7910       74   4       0             0 dbus-daemon
[ 1157]     0  1157    47287      225   1       0             0 cupsd
[ 1190]     0  1190    16016      168   3     -17         -1000 sshd
[ 1217]    26  1217    53409      793   0     -17         -1000 postmaster
[ 1296]     0  1296    19667      216   0       0             0 master
[ 1301]    89  1301    19730      212   1       0             0 qmgr
[ 1321]   498  1321     2705       38   5       0             0 epmd
[ 1334]     0  1334    27039       50   0       0             0 sh
[ 1336]     0  1336    27039       52   2       0             0 rabbitmq-server
[ 1343]     0  1343    36335       96   0       0             0 su
[ 1347]   498  1347   231690     7677   0       0             0 beam.smp
[ 1400]    26  1400    44162      260   1       0             0 postmaster
[ 1478]   498  1478     1012       21   4       0             0 cpu_sup
[ 1483]    26  1483    53442     8661   4       0             0 postmaster
[ 1484]    26  1484    53409      292   0       0             0 postmaster
[ 1485]    26  1485    53469      303   1       0             0 postmaster
[ 1486]    26  1486    44194      285   5       0             0 postmaster
[ 1487]   498  1487     2696       28   3       0             0 inet_gethost
[ 1488]   498  1488     4278       44   0       0             0 inet_gethost
[ 1555]    91  1555   973792    94613   3       0             0 java
[ 1574]     0  1574    45903      533   1       0             0 httpd
[ 1580]    48  1580   202983   120866   3       0             0 httpd
[ 1581]    48  1581    97567    17050   3       0             0 httpd
[ 1582]    48  1582   116178    33462   1       0             0 httpd
[ 1583]    48  1583    93866    13378   4       0             0 httpd
[ 1584]    48  1584   170141    89649   3       0             0 httpd
[ 1585]    48  1585    93866    13378   5       0             0 httpd
[ 1586]    48  1586   116177    33462   3       0             0 httpd
[ 1587]    48  1587   164341    83616   5       0             0 httpd
[ 1590]     0  1590    29301      157   4       0             0 crond
[ 1606]     0  1606     5362       46   3       0             0 atd
[ 1614]     0  1614    48903     2182   3       0             0 supervisord
[ 1617]   500  1617    83751    12720   4       0             0 paster
[ 1618]   500  1618   165335    93624   0       0             0 paster
[ 1631]     0  1631     1014       24   3       0             0 mingetty
[ 1633]     0  1633     1014       24   5       0             0 mingetty
[ 1635]     0  1635     3096      507   1     -17         -1000 udevd
[ 1636]     0  1636     3096      507   3     -17         -1000 udevd
[ 1637]     0  1637     1014       24   0       0             0 mingetty
[ 1639]     0  1639     1014       24   0       0             0 mingetty
[ 1641]     0  1641     1014       23   4       0             0 mingetty
[ 1662]    26  1662    55455    10540   4       0             0 postmaster
[ 1663]    26  1663    53918     1106   1       0             0 postmaster
[ 1675]    26  1675    54443     9639   2       0             0 postmaster
[ 1679]    48  1679    93866    13378   3       0             0 httpd
[ 1698]    48  1698    93865    13378   2       0             0 httpd
[ 1699]    48  1699    97567    17050   2       0             0 httpd
[ 1708]    26  1708    53931     1063   5       0             0 postmaster
[ 1709]    26  1709    56860    10570   2       0             0 postmaster
[ 1710]    26  1710    56867    10050   1       0             0 postmaster
[ 1711]    26  1711    53930     1023   3       0             0 postmaster
[ 1712]    26  1712    54438     9048   3       0             0 postmaster
[ 1729]    26  1729    56149    10427   5       0             0 postmaster
[ 1730]    26  1730    53930     1024   0       0             0 postmaster
[ 1731]    26  1731    54006     1756   3       0             0 postmaster
[ 1732]    26  1732    54005     1692   2       0             0 postmaster
[ 1733]    26  1733    53930     1025   5       0             0 postmaster
[ 3310]    89  3310    19687      210   0       0             0 pickup
[ 3492]     0  3492     1014       23   4       0             0 mingetty
[ 3493]     0  3493    24453      241   1       0             0 sshd
[ 3497]     0  3497    27074       94   0       0             0 bash
[ 3515]     0  3515    27255      468   4       0             0 watch
[ 3522]     0  3522     2272       24   1       0             0 sh
Out of memory: Kill process 1580 (httpd) score 165 or sacrifice child
Killed process 1580, UID 48, (httpd) total-vm:811932kB, anon-rss:483420kB, file-rss:44kB

Add last harvested time to harvest source and dataset pages

Currently, to see when a dataset was last harvested, you have to go to the dataset's harvest source page and then click on Admin.

Suggest adding this info to the main harvest source page at least, and if possible also to the dataset page.

make "delete harvest source" non-blocking

In our current default install with the harvest extension, deleting a large harvest source may take a long time. This will cause nginx to time out, and even without the timeout a user may get confused, click back, and attempt to delete the harvest source again.

We could make it non-blocking by adding another queue for deletion.
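
As a rough illustration, here is a sketch of that approach with the Redis backend (ckan.harvest.mq.type = redis): the web request only pushes a message and returns immediately, and a separate consumer does the slow work. The queue key and payload format are hypothetical; the extension does not ship a deletion queue.

    # A minimal sketch of a hypothetical deletion queue (not part of
    # ckanext-harvest); the key name and payload format are assumptions.
    import json
    import redis

    conn = redis.StrictRedis(host='localhost', port=6379, db=0)
    QUEUE_KEY = 'site1:harvest_source_delete'  # namespaced by ckan.site_id

    def enqueue_source_deletion(source_id):
        # Called from the web request: O(1), so nginx never times out.
        conn.rpush(QUEUE_KEY, json.dumps({'harvest_source_id': source_id}))

    def deletion_consumer():
        # Runs alongside the gather/fetch consumers; blocks until work arrives.
        while True:
            _key, body = conn.blpop(QUEUE_KEY)
            source_id = json.loads(body)['harvest_source_id']
            # ... perform the actual (slow) deletion of the source's jobs,
            # objects and datasets here ...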

403 error from gather consumer interface

I get the following error messages from the gather consumer interface:

2014-01-08 17:44:44,141 DEBUG [ckanext.harvest.queue] Received harvest job id: a5449e60-0996-49c3-a305-8b4034647cc6
2014-01-08 17:44:44,142 DEBUG [ckanext.harvest.queue] pika connection using {'retry_delay': 2.0, 'frame_max': 10000, 'channel_max': 0, 'locale': 'en_US', 'socket_timeout': 0.25, 'ssl': False, 'host': 'localhost', 'ssl_options': {}, 'virtual_host': '/', 'heartbeat': 0, 'credentials': <pika.credentials.PlainCredentials object at 0x3c33b10>, 'backpressure_detection': False, 'port': 5672, 'connection_attempts': 1}
2014-01-08 17:44:44,695 DEBUG [ckanext.harvest.harvesters.ckanharvester] In CKANHarvester gather_stage ({CKAN_WEBSITE_I_WANT_TO_HARVEST_FROM})
2014-01-08 17:44:44,695 DEBUG [ckanext.harvest.harvesters.ckanharvester] Using config: {u'read_only': True, u'default_tags': [u'POPULATION', u'ACS'], u'remote_groups': u'only_local', u'default_groups': [u'testgroup'], u'user': u'harvest', u'api_key': u'3a3c9e64-45f1-40d5-ab04-c9ddc8157885', u'override_extras': True, u'api_version': 2}
2014-01-08 17:44:44,746 ERROR [ckanext.harvest.harvesters.base] Unable to get content for URL: http://{CKAN_WEBSITE_I_WANT_TO_HARVEST_FROM}/api/2/rest/package: HTTP Error 403: Forbidden
2014-01-08 17:44:44,750 ERROR [ckanext.harvest.queue] Gather stage failed

My harvest configuration is:

{
   "read_only":true,
   "default_tags":[
      "POPULATION",
      "ACS"
   ],
   "remote_groups":"only_local",
   "default_groups":[
      "testgroup"
   ],
   "user":"harvest",
   "api_key":"3a3c9e64-45f1-40d5-ab04-c9ddc8157885",
   "override_extras":true,
   "api_version":2
}

Has anyone run into this same issue? Any hints or suggestions, please?

Thanks
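
One way to narrow a 403 like this down is to replay the failing request outside the harvester. Here is a minimal sketch, assuming the requests library and the same placeholder host as above; the CKAN harvester sends the API key in the Authorization header:

    import requests

    # Replace the placeholder with the real remote host before running.
    remote = 'http://{CKAN_WEBSITE_I_WANT_TO_HARVEST_FROM}'
    api_key = '3a3c9e64-45f1-40d5-ab04-c9ddc8157885'

    # Mirror the harvester's request: the API key goes in the
    # Authorization header.
    resp = requests.get(remote + '/api/2/rest/package',
                        headers={'Authorization': api_key})
    print(resp.status_code)  # a 403 here means the remote site rejects
                             # this key, not a bug in the gather consumer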

500 errors when trying to access some harvested records through http://foo.org/harvest/object/UID

We are working on a large deployment of CKAN and we are seeing 500 errors when trying to access some harvested records (~30% of all harvested records). Our deployment runs on CentOS 6.4.

Here's the error for the 500s:
An error occurred: ['module' object has no attribute 'ParseError']

The error traces back to exactly this line: https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/controllers/view.py#L118

The exception is triggered here: https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/controllers/view.py#L104

The problem is etree and Python versioning.

The ckanext-harvest code uses xml.etree.ElementTree as etree: https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/controllers/view.py#L2

xml.etree.ElementTree.ParseError is only available on Python 2.7+. It is not in the Python 2.6 that ships with CentOS 6.x.

Hence this error.

So the code here: https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/controllers/view.py#L104 is throwing an exception when trying to parse JSON content (via etree.fromstring).

Python 2.6's etree throws xml.parsers.expat.ExpatError.

Python 2.7's etree throws xml.etree.ElementTree.ParseError (which is aliased to etree.ParseError in the code).
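
A minimal compatibility shim along those lines (a sketch of one possible fix, not necessarily the patch that was applied):

    import xml.etree.ElementTree as etree

    try:
        XmlParseError = etree.ParseError  # Python 2.7+
    except AttributeError:
        # Python 2.6: fromstring() raises ExpatError instead
        from xml.parsers.expat import ExpatError as XmlParseError

    def try_parse_xml(content):
        """Return the parsed tree, or None if content is not XML
        (e.g. a JSON harvest object)."""
        try:
            return etree.fromstring(content)
        except XmlParseError:
            return None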

Allow bulk deletion of all datasets when deleting a source

Note: this targets 2.1

This needs some thought as there are some issues:

  • How to efficiently get all the ids of the datasets that need to be deleted, via a) search or b) the db. a) implies that we need to query for the full dict on potentially lots of datasets when we only need the id; b) would mean going through the harvest object table, as in https://github.com/okfn/ckanext-harvest/blob/release-v2.0/ckanext/harvest/logic/action/get.py#L90 (see the sketch after this list)
  • The bulk_update_* functions in CKAN core require an organization
  • We may also need a bulk_update_activate function
  • On the frontend, the confirmation message should add a checkbox for deleting all datasets. The current implementation of the confirm-action module does not allow this, so we either extend it or ship a custom one in the harvest extension (probably easier)
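
For option b), the lookup can stay id-only along these lines (a sketch; the model names follow ckanext-harvest, but treat the exact filters as assumptions):

    from ckanext.harvest.model import HarvestObject

    def harvested_package_ids(session, source_id):
        # Fetch only the ids, never the full package dicts.
        rows = session.query(HarvestObject.package_id) \
            .filter(HarvestObject.harvest_source_id == source_id) \
            .filter(HarvestObject.current == True) \
            .distinct()
        return [row[0] for row in rows]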

/cc @kindly

Allow harvesting of tags without stripping accents and capital letters

Currently, harvesting of tags automatically strips out any accented characters or capital letters. This works fine for English-language sources but looks strange in German or French; see the list of Schlagworte at http://opendata.admin.ch/de/dataset.

This is hardcoded in the function _create_or_update_package in ckanext/harvest/harvesters/base.py. It would be nice to be able to set a flag somewhere to determine whether or not stripping accents and capitals is desired behaviour.
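
A sketch of what such a flag could look like; the option name clean_tags is hypothetical, and munge_tag is the CKAN helper that does the stripping today:

    from ckan.lib.munge import munge_tag

    def normalise_tags(tags, config):
        # 'clean_tags' is a hypothetical option; defaulting to True keeps
        # the current behaviour.
        if config.get('clean_tags', True):
            return [munge_tag(t) for t in tags]  # lowercases, strips accents
        return list(tags)  # keep e.g. 'Bevölkerung' or 'Santé' as-is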

Add source clear command

Sometimes it is useful to clear a source and start all over again. This command would remove all jobs, errors, objects and packages for a particular source.
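
A sketch of the shape such a command could take, keeping the source row itself; the model names follow ckanext-harvest, but treat the exact cascade as an assumption:

    from ckanext.harvest.model import HarvestJob, HarvestObject

    def clear_source(session, source_id):
        # Remove everything the source produced, but keep the source itself.
        # (A real implementation would also delete the gather/object errors
        # and purge the harvested packages before committing.)
        session.query(HarvestObject) \
            .filter(HarvestObject.harvest_source_id == source_id) \
            .delete(synchronize_session=False)
        session.query(HarvestJob) \
            .filter(HarvestJob.source_id == source_id) \
            .delete(synchronize_session=False)
        session.commit()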

TemplateNotFound when trying to get RDF version of a package with a non-default package type

For example, ckanext-harvest creates a package for each harvest source, and the "type" field of these packages is set to "harvest". If you visit the RDF version of the package's page, e.g. http://publicdata.eu/dataset/ckan-italia.rdf, you get:

File '/usr/lib/ckan/src/ckan/ckan/controllers/package.py', line 362 in read
  return render(template, loader_class=loader)
...
TemplateNotFound: Template "source/read.rdf" not found
