ckanext-xloader's Introduction

XLoader - ckanext-xloader

Loads CSV (and similar) data into CKAN's DataStore. Designed as a replacement for DataPusher, it offers ten times the speed and more robustness (hence the name, derived from "Express Loader").

OpenGov Inc. has sponsored this development, with the aim of benefitting open data infrastructure worldwide.

Key differences from DataPusher

Speed of loading

DataPusher - parses CSV rows, converts to detected column types, converts the data to a JSON string, calls datastore_create for each batch of rows, which reformats the data into an INSERT statement string, which is passed to PostgreSQL.

XLoader - pipes the CSV file directly into PostgreSQL using COPY.

In tests, XLoader is over ten times faster than DataPusher.
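
The COPY approach can be sketched in Python. This is a minimal illustration, not XLoader's actual code: the function names and parameters are hypothetical, and `cursor` is assumed to be a psycopg2 cursor, whose `copy_expert(sql, file)` method streams a file object to PostgreSQL.

```python
def quote_identifier(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    column_list = ", ".join(quote_identifier(c) for c in columns)
    return "COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)".format(
        quote_identifier(table), column_list
    )


def copy_csv_to_table(cursor, table, columns, csv_file):
    """Stream csv_file into table with one statement, no per-row INSERTs.

    Assumption: cursor is a psycopg2 cursor (copy_expert is a psycopg2 method).
    """
    cursor.copy_expert(build_copy_sql(table, columns), csv_file)
```

The whole file travels to the database in one statement, which is where the saving over batched datastore_create calls comes from.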

Robustness

DataPusher - one cause of failure was when casting cells to a guessed type. The type of a column was decided by looking at the values of only the first few rows. So if a column is mainly numeric or dates, but a string (like "N/A") comes later on, then this will cause the load to error at that point, leaving it half-loaded into DataStore.

XLoader - loads all the cells as text, before allowing the admin to convert columns to the types they want (using the Data Dictionary feature). In future it could do automatic detection and conversion.
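
The failure mode is easy to reproduce in a few lines of Python. This is a simplified sketch, not DataPusher's real type detection: `guess_type` and `cast_column` are hypothetical helpers, and the sample size of three rows is an arbitrary assumption.

```python
def guess_type(values):
    """Guess 'numeric' if every sampled value parses as a float, else 'text'."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "text"
    return "numeric"


def cast_column(values, column_type):
    """Cast every cell to the given type; raises ValueError on a bad cell."""
    if column_type == "numeric":
        return [float(v) for v in values]
    return list(values)


column = ["1", "2", "3", "N/A", "5"]

# Guessing from only the first rows misses the later "N/A".
sampled_type = guess_type(column[:3])

try:
    cast_column(column, sampled_type)
    load_failed = False
except ValueError:
    load_failed = True  # a real load would stop here, half-loaded

# Loading everything as text cannot fail on a cell's content.
as_text = cast_column(column, "text")
```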

Simpler queueing tech

DataPusher - job queue is done by ckan-service-provider, which is bespoke, complicated and stores jobs in its own database (SQLite by default).

XLoader - job queue is done by RQ, which is simpler, is backed by Redis, allows access to the CKAN model and is CKAN's default queue technology. You can also debug jobs easily using pdb. Job results are stored in SQLite by default; for production, specify CKAN's database in the config and they are held there instead.

(The other obvious candidate is Celery, but we don't need its heavyweight architecture and its jobs are not debuggable with pdb.)

Separate web server

DataPusher - has the complication that the queue jobs are done by a separate (Flask) web app, apart from CKAN. This was the design because the job requires intensive processing to convert every line of the data into JSON. However, it means more complicated code, as info needs to be passed between the services in HTTP requests, and more for the user to set up and manage: another app config, another Apache config, separate log files.

XLoader - the job runs in a worker process, in the same app as CKAN, so can access the CKAN config, db and logging directly and avoids many HTTP calls. This simplification makes sense because the xloader job doesn't need to do much processing - mainly it is streaming the CSV file from disk into PostgreSQL.

It is still entirely possible to run the XLoader worker on a separate server, if that is desired. The worker needs the following:

  • A copy of CKAN installed in the same Python virtualenv (but not running).
  • A copy of the CKAN config file.
  • Access to the Redis instance that the running CKAN app uses to store jobs.
  • Access to the database.

You can then run it via ckan jobs worker as below.

Caveat - column types

Note: With XLoader, all columns are stored in DataStore's database as 'text' type (whereas DataPusher did some rudimentary type guessing - see 'Robustness' above). However, once a resource is xloaded, an admin can use the resource's Data Dictionary tab to change these types to numeric or timestamp and re-load the file. When migrating from DataPusher to XLoader you can preserve the types of existing resources by using the migrate_types command.

There is scope to add functionality for automatically guessing column types - offers to contribute this are welcome.

Requirements

Compatibility with core CKAN versions:

CKAN version   Compatibility
2.7            no longer supported (last supported version: 0.12.2)
2.8            no longer supported (last supported version: 0.12.2)
2.9            yes (Python 3) (last supported version for Python 2.7: 0.12.2)
2.10           yes

Installation

To install XLoader:

  1. Activate your CKAN virtual environment, for example:

    . /usr/lib/ckan/default/bin/activate
  2. Install the ckanext-xloader Python package into your virtual environment:

    pip install ckanext-xloader
  3. Install dependencies:

    pip install -r https://raw.githubusercontent.com/ckan/ckanext-xloader/master/requirements.txt
    pip install -U requests[security]
  4. Add xloader to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/ckan.ini).

    You should also remove datapusher if it is in the list, to avoid them both trying to load resources into the DataStore.

    Ensure datastore is also listed, to enable CKAN DataStore.

  5. Starting with CKAN 2.10, you will need to set an API token to be able to execute jobs against the server:

    ckanext.xloader.api_token = <your-CKAN-generated-API-Token>
  6. If it is a production server, you'll want to store jobs info in a more robust database than the default SQLite file. XLoader can use the main CKAN PostgreSQL database if you add this line to the config, using the same value as you have for sqlalchemy.url:

    ckanext.xloader.jobs_db.uri = postgresql://ckan_default:pass@localhost/ckan_default

    (This step can be skipped when just developing or testing.)

  7. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu:

    sudo service apache2 reload
  8. Run the worker:

    ckan -c /etc/ckan/default/ckan.ini jobs worker
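
The plugin settings from step 4 can be sanity-checked with a short script. This is a hypothetical helper, not part of XLoader; it assumes the settings live in the [app:main] section of the config file, as in a standard CKAN install.

```python
import configparser


def check_xloader_config(ini_text):
    """Return a list of problems found in a CKAN config, empty if none."""
    # interpolation=None so values such as %(here)s are read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    main = parser["app:main"]
    plugins = main.get("ckan.plugins", "").split()

    problems = []
    if "xloader" not in plugins:
        problems.append("xloader is not in ckan.plugins")
    if "datastore" not in plugins:
        problems.append("datastore is not in ckan.plugins")
    if "datapusher" in plugins:
        problems.append("datapusher is still in ckan.plugins")
    return problems
```

For a real file, pass in its contents, for example `check_xloader_config(open("/etc/ckan/default/ckan.ini").read())`.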

Config settings

Configuration:

See the extension's config_declaration.yaml file.

This plugin also supports the ckan.download_proxy setting, to use a proxy server when downloading files. This setting is shared with other plugins that download resource files, such as ckanext-archiver. For example:

ckan.download_proxy = http://my-proxy:1234/

You may also wish to configure the database to use your preferred date input style on COPY. For example, to make [PostgreSQL](https://www.postgresql.org/docs/current/runtime-config-client.html#RUNTIME-CONFIG-CLIENT-FORMAT) expect European (day-first) dates, you could add to postgresql.conf:

datestyle=ISO,DMY

Developer installation

To install XLoader for development, activate your CKAN virtualenv and, in the parent directory of your local ckan repo, run:

git clone https://github.com/ckan/ckanext-xloader.git
cd ckanext-xloader
pip install -e .
pip install -r requirements.txt
pip install -r dev-requirements.txt

Upgrading from DataPusher

To upgrade from DataPusher to XLoader:

  1. Install XLoader as above, including running the xloader worker.
  2. (Optional) For existing datasets that have been datapushed to datastore, freeze the column types (in the data dictionaries), so that XLoader doesn't change them back to string on next xload:

    ckan -c /etc/ckan/default/ckan.ini migrate_types
  3. If you've not already, change the enabled plugin in your config - on the ckan.plugins line replace datapusher with xloader.
  4. (Optional) You can disable the direct loading and continue to just use tabulator - for more about this, see the docs on the config option ckanext.xloader.use_type_guessing.
  5. Stop the datapusher worker:

    sudo a2dissite datapusher
  6. Restart CKAN:

    sudo service apache2 reload
    sudo service nginx reload

Command-line interface

You can submit single or multiple resources to be xloaded using the command-line interface.

For example:

ckan -c /etc/ckan/default/ckan.ini xloader submit <dataset-name>

For debugging you can try xloading it synchronously (which does the load directly, rather than asking the worker to do it) with the -s option:

ckan -c /etc/ckan/default/ckan.ini xloader submit <dataset-name> -s

See the status of jobs:

ckan -c /etc/ckan/default/ckan.ini xloader status

Submit all datasets' resources to the DataStore:

ckan -c /etc/ckan/default/ckan.ini xloader submit all

Re-submit all the resources already in the DataStore (Ignores any resources that have not been stored in DataStore e.g. because they are not tabular):

ckan -c /etc/ckan/default/ckan.ini xloader submit all-existing

Full list of XLoader CLI commands:

ckan -c /etc/ckan/default/ckan.ini xloader --help

Jobs and workers

Main docs for managing jobs: <https://docs.ckan.org/en/latest/maintaining/background-tasks.html#managing-background-jobs>

Main docs for running and managing workers are here: https://docs.ckan.org/en/latest/maintaining/background-tasks.html#running-background-jobs

Useful commands:

Clear (delete) all outstanding jobs:

ckan -c /etc/ckan/default/ckan.ini jobs clear [QUEUES]

If having trouble with the worker process, restarting it can help:

sudo supervisorctl restart ckan-worker:*

Troubleshooting

KeyError: "Action 'datastore_search' not found"

You need to enable the datastore plugin in your CKAN config. See 'Installation' section above to do this and restart the worker.

ProgrammingError: (ProgrammingError) relation "_table_metadata" does not exist

Your DataStore permissions have not been set-up - see: <https://docs.ckan.org/en/latest/maintaining/datastore.html#set-permissions>

Running the Tests

The first time, your test datastore database needs the trigger applied:

sudo -u postgres psql datastore_test -f full_text_function.sql

To run the tests, do:

pytest --ckan-ini=test.ini ckanext/xloader/tests

Releasing a New Version of XLoader

XLoader is available on PyPI as https://pypi.org/project/ckanext-xloader.

To publish a new version to PyPI follow these steps:

  1. Update the version number in the setup.py file. See PEP 440 for how to choose version numbers.
  2. Update the CHANGELOG.
  3. Make sure you have the latest version of necessary packages:

    pip install --upgrade setuptools wheel twine
  4. Create source and binary distributions of the new version:

    python setup.py sdist bdist_wheel && twine check dist/*

    Fix any errors you get.

  5. Upload the source and binary distributions to PyPI:

    twine upload dist/*
  6. Commit any outstanding changes:

    git commit -a
    git push
  7. Tag the new release of the project on GitHub with the version number from the setup.py file. For example if the version number in setup.py is 0.0.1 then do:

    git tag 0.0.1
    git push --tags

ckanext-xloader's People

Contributors

amercader, bellisk, dependabot-preview[bot], duttonw, jvickery-tbs, kowh-ai, muhammed-ajmal, mutantsan, pdelboca, rabiasajjad, rjruizes, shashigharti, smotornyuk, stefina, tgurr, thrawnca, tino097, tomecirun, wardi

ckanext-xloader's Issues

Deferred indexing - speed-up

Disable/remove all indexes while loading - it is much quicker to just recalculate them after loading.

stats with ckanext-shift

  • 100k rows of Boston 311 dataset
    241s - datastore_create (indexes created), load
    102s - datastore_create (indexes created), indexes deleted, load, indexes created
    This is measured as part of a test, including about 15s of test set-up time.
    Speed-up: (241-102)/(241-15)=61% quicker

61% is well worth it!

stats with datapusher

  • complete Boston 311 dataset
    2338s Standard DataPusher (indexed during)
    2105s DataPusher modified for post-indexing
    Speed-up: (2338-2105)/2338=10% quicker

i.e. Datapusher would benefit from a 10% speed boost by switching to post-indexing. However it is probably not worth it due to the downside of the extra complication.
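
The two percentages above can be reproduced with a small helper. The function name is illustrative; the timings are the ones quoted in this issue, and the 15s overhead is the test set-up time mentioned for the first measurement.

```python
def speed_up(before_seconds, after_seconds, overhead_seconds=0):
    """Fraction of load time saved, with fixed overhead removed from the baseline."""
    return (before_seconds - after_seconds) / (before_seconds - overhead_seconds)


# 100k rows, indexes dropped and recreated around the load
deferred_gain = speed_up(241, 102, overhead_seconds=15)

# Complete dataset, DataPusher modified for post-indexing
datapusher_gain = speed_up(2338, 2105)

print("{:.0%} and {:.0%}".format(deferred_gain, datapusher_gain))
```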

Table jobs permission denied/doesn't seem to exist

Hello!

I'm encountering an error when xloader tries to write to the jobs table.

The error I'm getting is:

ProgrammingError: (ProgrammingError) permission denied for relation jobs
 'UPDATE jobs SET status=%(status)s, error=%(error)s, finished_timestamp=%(finished_timestamp)s WHERE jobs.job_id = %(job_id_1)s' {'status': 'error', 'finished_timestamp': datetime.datetime(2018, 2, 15, 19, 33, 52, 708255), 'job_id_1': u'a855444c-0ed2-4898-8dbe-67f9866fdc04', 'error': u'{"message": "  File \\"/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py\\", line 435, in do_execute\\n    cursor.execute(statement, parameters)\\nProgrammingError(\'(ProgrammingError) permission denied for relation jobs\\\\n\',)"}'}

Here is a gist of the full traceback

I went to check the permissions of the jobs table in the DB I specified in ckanext.xloader.jobs_db.uri, but the jobs table did not exist. It also didn't exist in my ckan or datastore db.

I see that it's supposed to be created by this function https://github.com/davidread/ckanext-xloader/blob/master/ckanext/xloader/db.py#L396. Going through the code it seems that the jobs should be created on the fly perhaps since there is no command to manually init the db. But can't seem to find any explicit errors in the logs of where xloader attempted to create the table or any failures from doing so. I'm wondering what could be happening as I went through the README and a fresh install a few times and can't seem to find where I might've missed something.

I also found it a little strange that SQLAlchemy is throwing a "permission denied" instead of a "does not exist" error, so maybe the jobs table does exist somewhere.

Error when running populate_full_text_trigger

We're trying to set up XLoader on CKAN 2.8.3, but it gets an error whenever it runs. Adding logging statements reveals that the error message is:

(psycopg2.ProgrammingError) record "new" has no field "_full_text"
CONTEXT: SQL statement "SELECT NEW._full_text IS NOT NULL"
PL/pgSQL function populate_full_text_trigger() line 3 at IF
[SQL: \'INSERT INTO jobs (job_id, job_type, status, requested_timestamp, sent_data, result_url, api_key) VALUES (%(job_id)s, %(job_type)s, %(status)s, %(requested_timestamp)s, %(sent_data)s, %(result_url)s, %(api_key)s)\']

The populate_full_text_trigger function definitely exists, with the appropriate content. We've pointed ckanext.xloader.jobs_db.uri at the datastore, with the credentials of the database owner.

Any idea why this is happening?

Test private datasets

Check this is all functioning properly - there was some related datapusher code commented.

Incomplete Header causes file not to be loaded by xloader

When I try to upload a file to xloader that has more "content-columns" than "header-columns", the process fails with the error "extra data after last expected column". This is a minimal example that causes said error:

subject,verb,object,
this,is,content,
that,is,content,but,it,has,more,content
also,is,content,

Is this a known or intended limitation of xloader?
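
Rows like the third one in the example can be spotted before submitting a file. The following is a minimal sketch using Python's standard csv module; `find_overlong_rows` is a hypothetical helper, not something xloader provides.

```python
import csv
import io


def find_overlong_rows(csv_text):
    """Return (line_number, cell_count) for each row wider than the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) > len(header)
    ]


sample = (
    "subject,verb,object,\n"
    "this,is,content,\n"
    "that,is,content,but,it,has,more,content\n"
    "also,is,content,\n"
)
overlong = find_overlong_rows(sample)  # the third line has 8 cells, the header 4
```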

Provide an overview of the currently pending jobs

We just deployed xloader on a production server. And since we have harvesters running that fill up the pipeline and users that are manually triggering an upload it would be very nice to have some kind of overview how many jobs are currently pending and which are being processed.

MD5 hash on the resource

I just found out that xloader creates MD5 hashes for all files to check if a file is already in the DataStore. Those hashes are then saved on the resource.

This is perfectly fine, but we already generated a hash for each resource using SHA1 for our own use case (update a custom field on the dataset if a resource has changed). Together with xloader, this meant that every file was suddenly seen as changed, since the MD5 hash did not match our SHA1 hash.

For now we solved this by simply switching to MD5 (opendatazurich/ckanext-stadtzh-harvest#35). But maybe it might be worth having a config option to specify the algorithm used for the hashing or simply providing an interface, so plugins can actually change the way the hash is calculated.

At the very least, the current implementation needs a note in the README to explain what's going on.

WDYT?
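
To make the suggestion concrete, a configurable hash could look roughly like this. It is only a sketch of the idea: `file_hash` and its `algorithm` parameter are hypothetical, and no such option exists in xloader as described here.

```python
import hashlib


def file_hash(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)  # e.g. "md5" or "sha1"
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A plugin that already stores SHA1 hashes would then call `file_hash(path, algorithm="sha1")` and keep its existing values valid.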

Unicode characters in column names break the loader

I tried to import a file with unicode characters in the header (in my case they were German umlauts). This leads to some ugly UnicodeDecode errors during the loading, e.g.

[ckanext.xloader.jobs] xloader error: 'ascii' codec can't encode character u'\xfc' in position 3: ordinal not in range(128)

Initially I could pin this issue down to some log messages that were mixing strings with unicode (e.g. logger.info('Fields: {}'.format(fields))), which fails if fields contains unicode characters. But upon fixing this, I realized that the current implementation of DataStore in CKAN core couldn't handle unicode column names either. While it might be possible to fix this at some point, I'm not sure if it's generally a good idea to have unicode characters in column names.

I'll submit a PR later, where we could discuss the implementation of my solution.
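
One possible workaround, shown only as an illustration and not necessarily the approach taken in the PR, is to reduce column names to ASCII before loading. `ascii_column_name` is a hypothetical helper built on the standard unicodedata module.

```python
import unicodedata


def ascii_column_name(name):
    """Strip accents from a column name and drop any remaining non-ASCII.

    Characters with no ASCII decomposition are removed entirely, so distinct
    headers could collide; a real implementation would need to handle that.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```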

Submit fails for dataset(s) when an upload for a single resource fails

When submitting all datasets to be xloadered and a resource can't be processed, the whole process is disrupted with an exception.

For context: We experienced this when a resource failed due to an encoding problem in its URL.
paster xloader submit dataset -c /etc/ckan/default/production.ini

We would still expect the command to keep processing the remaining resources (and furthermore all datasets, when using the submit_all-command).
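
The expected behaviour could be sketched as a loop that records failures and carries on. This is not xloader's code: `submit_all` is a hypothetical function, and `submit` stands in for whatever call submits a single resource.

```python
def submit_all(resource_ids, submit):
    """Submit every resource, collecting failures instead of stopping."""
    failures = {}
    for resource_id in resource_ids:
        try:
            submit(resource_id)
        except Exception as error:  # keep going; report at the end
            failures[resource_id] = str(error)
    return failures
```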

Import existing column types into data dictionary?

We have a CKAN system that has been using the DataPusher, and we want to migrate to XLoader to improve performance and reduce administration overheads. However, we have thousands of resources parsed by messytables, and we want to preserve the type information (to format the data better for client applications), which XLoader doesn't by default.

To avoid updating each resource manually, we're looking at writing a paster command to inspect the existing columns and add Data Dictionary overrides for all of the non-text fields for when we move to XLoader.

Are there any obvious pitfalls with this approach?

Error during load: Could not create the database table: validator/converter not found: u'scheming_multiple_choice_output'

I've just installed XLoader on a clean install. While trying to push a CSV resource to the DataStore, I get the following error message:
Error: File "/usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/validation.py", line 319, in get_validator_or_converter raise SchemingException('validator/converter not found: %r' % name) SchemingException("validator/converter not found: u'scheming_multiple_choice_output'",)

The log which resides under says at one point:
Error during load: Could not create the database table: validator/converter not found: u'scheming_multiple_choice_output'

Any idea on how to fix this?

[Errno 28] No space left on device

Hello.

What is the maximum value for the "ckanext.xloader.max_content_length" parameter? According to the plugin's documentation, the default value is 1GB. Is this the maximum value too?

KeyError: 'resource_id'

On current master branch, when uploading a resource to datastore I get the following error:

  File "/vagrant_data/ckan/default/src/ckanext-xloader/ckanext/xloader/jobs.py", line 378, in set_resource_metadata
    model.Resource.id == update_dict['resource_id']
KeyError: 'resource_id'

It appears that the set_resource_metadata function in ckanext/xloader/jobs.py overwrites the update_dict dict:

    update_dict = {'datastore_active': update_dict.get('datastore_active', True),
                   'datastore_contains_all_records_of_source_file': update_dict.get('datastore_contains_all_records_of_source_file', True)}

...so it no longer contains resource_id (see commit: 2a49a51#diff-c548135301733293716337251017abb7)

However it tries to use update_dict['resource_id'] in the query below:

https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/jobs.py#L378
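
One way to avoid the KeyError, sketched here under the assumption that the caller always passes resource_id, is to read the id before the dict is rebuilt. `build_resource_update` is a hypothetical helper, not the fix that was actually applied.

```python
def build_resource_update(update_dict):
    """Split update_dict into the resource id and the fields to set."""
    resource_id = update_dict["resource_id"]  # read before the dict is replaced
    fields = {
        "datastore_active": update_dict.get("datastore_active", True),
        "datastore_contains_all_records_of_source_file": update_dict.get(
            "datastore_contains_all_records_of_source_file", True
        ),
    }
    return resource_id, fields
```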

Store jobs in main ckan database

rather than a personal ckanext-shift one. The amount of data is relatively small, so not a huge burden. And it saves set-up and something else to manage.

Could usefully delete jobs, so only the latest job for a resource is kept. (since we don't display them anyway)

Jobs are queued but not processed

We are experiencing that our resources are not loaded into the datastore anymore. It still works fine on our integration-environment.

Executing the xloader-command to list its status like so:
paster xloader status -c /var/www/ckan/default/development.ini

gives me a list of jobs dated from about a month ago until today. I have not yet been able to find out how this happened, but was wondering if there is a way to "flush the queue" somehow, as reloading a resource into the datastore does not work either.

Have you experienced something similar? Maybe you can point me towards a possible way to clean up the queue or how to debug this?

populate_full_text_trigger error when doing 'datastore set-permissions'

From CKAN 2.8+ when trying to initialize the datastore, and xloader is installed, you get this error:

(default) ubuntu@ubuntu-xenial:/vagrant/src/ckan$ paster --plugin=ckan datastore set-permissions
/usr/lib/ckan/default/local/lib/python2.7/site-packages/webassets/loaders.py:162: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  obj = self.yaml.load(f) or {}
Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 11, in <module>
    sys.exit(run())
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 140, in invoke
    runner = command(command_name)
  File "/vagrant/src/ckan/ckan/lib/cli.py", line 294, in __call__
    obj={})
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/vagrant/src/ckan/ckanext/datastore/commands.py", line 29, in set_permissions
    load_config(config or ctx.obj['config'])
  File "/vagrant/src/ckan/ckan/lib/cli.py", line 241, in load_config
    load_environment(conf.global_conf, conf.local_conf)
  File "/vagrant/src/ckan/ckan/config/environment.py", line 118, in load_environment
    p.load_all()
  File "/vagrant/src/ckan/ckan/plugins/core.py", line 140, in load_all
    load(*plugins)
  File "/vagrant/src/ckan/ckan/plugins/core.py", line 168, in load
    plugins_update()
  File "/vagrant/src/ckan/ckan/plugins/core.py", line 122, in plugins_update
    environment.update_config()
  File "/vagrant/src/ckan/ckan/config/environment.py", line 305, in update_config
    plugin.configure(config)
  File "/vagrant/src/ckanext-xloader/ckanext/xloader/plugin.py", line 80, in configure
    raise Exception('populate_full_text_trigger is not defined. See '
Exception: populate_full_text_trigger is not defined. See ckanext-xloader's README.rst for more details.

The workaround is to remove ckanext-xloader from the CKAN config ckan.plugins line when running this command.

Unable to see DataStore option on web

After configuring xloader and removing datapusher from the config file, the "upload to DataStore" option in the CKAN resource manage section has disappeared.

This is an issue because a large file upload (e.g. a 5GB file using cloudstorage) takes time and xloader fails on its first attempt to get the file, so there must be a way to re-trigger it from the web once the upload is done.

Kindly guide.

xLoader re-submits all resources when adding a new one to a dataset

How to Reproduce

  • Using a plain vanilla CKAN 2.8.3 instance with xloader 0.4.0 installed (I have used plain okfn/docker-ckan, but this was also detected in a custom implementation for our customer)
  • Create a Dataset
  • Add a new Resource
  • Add another Resource
  • Check logs

Details

In these logs I had a dataset with a Resource called mini-csv.csv, then I edited the dataset to add a new one called mini-csv-2.csv. As the logs show, the instance was running and I created the new resource, but both resources were submitted:

ckan-dev_1    | 2019-10-13 13:14:49,387 INFO  [rq.worker] *** Listening on ckan:default:default...
ckan-dev_1    | 2019-10-13 13:15:27,562 INFO  [ckan.lib.base]  /dataset/edit/testing-xloader render time 0.209 seconds
ckan-dev_1    | 2019-10-13 13:15:29,448 INFO  [ckan.lib.base]  /dataset/resources/testing-xloader render time 0.176 seconds
ckan-dev_1    | 2019-10-13 13:15:32,241 INFO  [ckan.lib.base]  /dataset/new_resource/testing-xloader render time 0.134 seconds
ckan-dev_1    | 2019-10-13 13:15:45,179 DEBUG [ckanext.xloader.plugin] Submitting resource 8371b91c-b59e-46f9-b6da-f4abf9e37ec9 to be xloadered
ckan-dev_1    | 2019-10-13 13:15:45,214 INFO  [ckanext.xloader.action] Added background job bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f to queue "default"
ckan-dev_1    | 2019-10-13 13:15:45,214 DEBUG [ckanext.xloader.action] Enqueued xloader job=bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f res_id=8371b91c-b59e-46f9-b6da-f4abf9e37ec9
ckan-dev_1    | 2019-10-13 13:15:45,215 INFO  [rq.worker] ckan:default:default: ckanext.xloader.jobs.xloader_data_into_datastore({'result_url': u'http://ckan:5000/api/3/action/xloader_hook', 'api_key': u'3b32cbad-c8cc-4798-b927-d5828374f0a2', 'job_type': 'xloader_to_datastore', 'metadata': {'original_url': u'http://ckan:5000/dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9/download/mini-csv-2.csv', 'ckan_url': u'http://ckan:5000', 'resource_id': u'8371b91c-b59e-46f9-b6da-f4abf9e37ec9', 'set_url_type': False, 'task_created': '2019-10-13 13:15:45.205091', 'ignore_hash': False}}) (bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f)
ckan-dev_1    | 2019-10-13 13:15:45,216 INFO  [ckan.lib.jobs] Worker rq:worker:6da86188ebe4.72 starts job bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f from queue "default"
ckan-dev_1    | 2019-10-13 13:15:45,221 DEBUG [ckanext.xloader.plugin] Submitting resource 77e90203-3cd7-4b46-a30d-779d2c4659f5 to be xloadered
ckan-dev_1    | 2019-10-13 13:15:45,258 INFO  [ckanext.xloader.action] Added background job 05803921-a69c-4b42-a430-89f1444a0290 to queue "default"
ckan-dev_1    | 2019-10-13 13:15:45,258 DEBUG [ckanext.xloader.action] Enqueued xloader job=05803921-a69c-4b42-a430-89f1444a0290 res_id=77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | 2019-10-13 13:15:45,428 INFO  [ckan.lib.base]  /dataset/new_resource/testing-xloader render time 0.341 seconds

Below, the logs show how xloader executed the jobs for both:

ckan-dev_1    | Express Load starting: /dataset/testing-xloader/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9
ckan-dev_1    | 2019-10-13 13:15:45,700 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Express Load starting: /dataset/testing-xloader/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9
ckan-dev_1    | Fetching from: http://ckan:5000/dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9/download/mini-csv-2.csv
ckan-dev_1    | 2019-10-13 13:15:45,703 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Fetching from: http://ckan:5000/dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9/download/mini-csv-2.csv
ckan-dev_1    | 2019-10-13 13:15:45,792 INFO  [ckan.lib.base]  /dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9/download/mini-csv-2.csv render time 0.037 seconds
ckan-dev_1    | Downloaded ok - 51.0 bytes
ckan-dev_1    | 2019-10-13 13:15:45,801 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Downloaded ok - 51.0 bytes
ckan-dev_1    | File hash: b5ae401bd035a5be32a120c675592683
ckan-dev_1    | 2019-10-13 13:15:45,805 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] File hash: b5ae401bd035a5be32a120c675592683
ckan-dev_1    | Loading CSV
ckan-dev_1    | 2019-10-13 13:15:45,830 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Loading CSV
ckan-dev_1    | Ensuring character coding is UTF8
ckan-dev_1    | 2019-10-13 13:15:45,837 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Ensuring character coding is UTF8
ckan-dev_1    | Fields: [{'type': 'text', 'id': 'column_1'}, {'type': 'text', 'id': 'column_2'}]
ckan-dev_1    | 2019-10-13 13:15:45,857 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Fields: [{'type': 'text', 'id': 'column_1'}, {'type': 'text', 'id': 'column_2'}]
ckan-dev_1    | 2019-10-13 13:15:45,868 INFO  [ckan.lib.base]  /dataset/testing-xloader render time 0.420 seconds
ckan-dev_1    | 2019-10-13 13:15:45,910 DEBUG [ckanext.datastore.logic.action] Setting datastore_active=True on resource 8371b91c-b59e-46f9-b6da-f4abf9e37ec9
ckan-dev_1    | Copying to database...
ckan-dev_1    | 2019-10-13 13:15:46,069 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Copying to database...
ckan-dev_1    | ...copying done
ckan-dev_1    | 2019-10-13 13:15:46,082 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] ...copying done
ckan-dev_1    | Creating search index...
ckan-dev_1    | 2019-10-13 13:15:46,101 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Creating search index...
ckan-dev_1    | ...search index created
ckan-dev_1    | 2019-10-13 13:15:46,113 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] ...search index created
ckan-dev_1    | Calculating record count (running ANALYZE on the table)
ckan-dev_1    | 2019-10-13 13:15:46,122 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Calculating record count (running ANALYZE on the table)
ckan-dev_1    | Setting resource.datastore_active = True
ckan-dev_1    | 2019-10-13 13:15:46,126 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Setting resource.datastore_active = True
ckan-dev_1    | Setting resource.datastore_contains_all_records_of_source_file = True
ckan-dev_1    | 2019-10-13 13:15:46,142 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Setting resource.datastore_contains_all_records_of_source_file = True
ckan-dev_1    | Data now available to users: /dataset/testing-xloader/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9
ckan-dev_1    | 2019-10-13 13:15:46,374 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Data now available to users: /dataset/testing-xloader/resource/8371b91c-b59e-46f9-b6da-f4abf9e37ec9
ckan-dev_1    | Creating column indexes (a speed optimization for queries)...
ckan-dev_1    | 2019-10-13 13:15:46,399 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Creating column indexes (a speed optimization for queries)...
ckan-dev_1    | ...column indexes created.
ckan-dev_1    | 2019-10-13 13:15:46,439 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] ...column indexes created.
ckan-dev_1    | Express Load completed
ckan-dev_1    | 2019-10-13 13:15:46,451 INFO  [bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f] Express Load completed
ckan-dev_1    | 2019-10-13 13:15:46,560 INFO  [rq.worker] ckan:default:default: Job OK (bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f)
ckan-dev_1    | 2019-10-13 13:15:46,560 INFO  [rq.worker] Result is kept for 500 seconds
ckan-dev_1    | 2019-10-13 13:15:46,565 INFO  [ckan.lib.jobs] Worker rq:worker:6da86188ebe4.72 has finished job bfd457e6-61f4-43d3-bcdf-4ea1dc31b54f from queue "default"
ckan-dev_1    | 2019-10-13 13:15:46,566 INFO  [rq.worker]
ckan-dev_1    | 2019-10-13 13:15:46,566 INFO  [rq.worker] *** Listening on ckan:default:default...
ckan-dev_1    | 2019-10-13 13:15:46,568 INFO  [rq.worker] ckan:default:default: ckanext.xloader.jobs.xloader_data_into_datastore({'result_url': u'http://ckan:5000/api/3/action/xloader_hook', 'api_key': u'3b32cbad-c8cc-4798-b927-d5828374f0a2', 'job_type': 'xloader_to_datastore', 'metadata': {'original_url': u'http://ckan:5000/dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5/download/mini-csv.csv', 'ckan_url': u'http://ckan:5000', 'resource_id': u'77e90203-3cd7-4b46-a30d-779d2c4659f5', 'set_url_type': False, 'task_created': '2019-10-13 13:15:45.248526', 'ignore_hash': False}}) (05803921-a69c-4b42-a430-89f1444a0290)
ckan-dev_1    | 2019-10-13 13:15:46,569 INFO  [ckan.lib.jobs] Worker rq:worker:6da86188ebe4.72 starts job 05803921-a69c-4b42-a430-89f1444a0290 from queue "default"
ckan-dev_1    | Express Load starting: /dataset/testing-xloader/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | 2019-10-13 13:15:47,062 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Express Load starting: /dataset/testing-xloader/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | Fetching from: http://ckan:5000/dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5/download/mini-csv.csv
ckan-dev_1    | 2019-10-13 13:15:47,088 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Fetching from: http://ckan:5000/dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5/download/mini-csv.csv
ckan-dev_1    | 2019-10-13 13:15:47,133 INFO  [ckan.lib.base]  /dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5/download/mini-csv.csv render time 0.027 seconds
ckan-dev_1    | Downloaded ok - 40.0 bytes
ckan-dev_1    | 2019-10-13 13:15:47,135 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Downloaded ok - 40.0 bytes
ckan-dev_1    | File hash: a4c2cdeaefdb2c659151f0a64706ed5a
ckan-dev_1    | 2019-10-13 13:15:47,160 INFO  [05803921-a69c-4b42-a430-89f1444a0290] File hash: a4c2cdeaefdb2c659151f0a64706ed5a
ckan-dev_1    | Loading CSV
ckan-dev_1    | 2019-10-13 13:15:47,172 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Loading CSV
ckan-dev_1    | Ensuring character coding is UTF8
ckan-dev_1    | 2019-10-13 13:15:47,195 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Ensuring character coding is UTF8
ckan-dev_1    | Deleting "77e90203-3cd7-4b46-a30d-779d2c4659f5" from DataStore.
ckan-dev_1    | 2019-10-13 13:15:47,258 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Deleting "77e90203-3cd7-4b46-a30d-779d2c4659f5" from DataStore.
ckan-dev_1    | 2019-10-13 13:15:47,274 DEBUG [ckanext.datastore.logic.action] Setting datastore_active=False on resource 77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | Fields: [{'type': 'text', 'id': 'column_1'}, {'type': 'text', 'id': 'column_2'}]
ckan-dev_1    | 2019-10-13 13:15:47,410 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Fields: [{'type': 'text', 'id': 'column_1'}, {'type': 'text', 'id': 'column_2'}]
ckan-dev_1    | 2019-10-13 13:15:47,459 DEBUG [ckanext.datastore.logic.action] Setting datastore_active=True on resource 77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | Copying to database...
ckan-dev_1    | 2019-10-13 13:15:47,621 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Copying to database...
ckan-dev_1    | ...copying done
ckan-dev_1    | 2019-10-13 13:15:47,639 INFO  [05803921-a69c-4b42-a430-89f1444a0290] ...copying done
ckan-dev_1    | Creating search index...
ckan-dev_1    | 2019-10-13 13:15:47,649 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Creating search index...
ckan-dev_1    | ...search index created
ckan-dev_1    | 2019-10-13 13:15:47,662 INFO  [05803921-a69c-4b42-a430-89f1444a0290] ...search index created
ckan-dev_1    | Calculating record count (running ANALYZE on the table)
ckan-dev_1    | 2019-10-13 13:15:47,677 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Calculating record count (running ANALYZE on the table)
ckan-dev_1    | Setting resource.datastore_active = True
ckan-dev_1    | 2019-10-13 13:15:47,683 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Setting resource.datastore_active = True
ckan-dev_1    | Setting resource.datastore_contains_all_records_of_source_file = True
ckan-dev_1    | 2019-10-13 13:15:47,704 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Setting resource.datastore_contains_all_records_of_source_file = True
ckan-dev_1    | Data now available to users: /dataset/testing-xloader/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | 2019-10-13 13:15:47,905 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Data now available to users: /dataset/testing-xloader/resource/77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | Creating column indexes (a speed optimization for queries)...
ckan-dev_1    | 2019-10-13 13:15:47,916 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Creating column indexes (a speed optimization for queries)...
ckan-dev_1    | ...column indexes created.
ckan-dev_1    | 2019-10-13 13:15:47,955 INFO  [05803921-a69c-4b42-a430-89f1444a0290] ...column indexes created.
ckan-dev_1    | Express Load completed
ckan-dev_1    | 2019-10-13 13:15:47,965 INFO  [05803921-a69c-4b42-a430-89f1444a0290] Express Load completed
ckan-dev_1    | 2019-10-13 13:15:48,027 INFO  [rq.worker] ckan:default:default: Job OK (05803921-a69c-4b42-a430-89f1444a0290)
ckan-dev_1    | 2019-10-13 13:15:48,028 INFO  [rq.worker] Result is kept for 500 seconds
ckan-dev_1    | 2019-10-13 13:15:48,032 INFO  [ckan.lib.jobs] Worker rq:worker:6da86188ebe4.72 has finished job 05803921-a69c-4b42-a430-89f1444a0290 from queue "default"
ckan-dev_1    | 2019-10-13 13:15:48,033 INFO  [rq.worker]
ckan-dev_1    | 2019-10-13 13:15:48,033 INFO  [rq.worker] *** Listening on ckan:default:default...

If I repeat the process and add a new file called mini-csv-3.csv, it happens again: after new_resource is rendered, all three resources are submitted to xloader:

ckan-dev_1    | 2019-10-13 13:15:48,033 INFO  [rq.worker] *** Listening on ckan:default:default...
ckan-dev_1    | 2019-10-13 13:32:03,759 INFO  [ckan.lib.base]  /dataset/edit/testing-xloader render time 0.202 seconds
ckan-dev_1    | 2019-10-13 13:32:12,751 INFO  [ckan.lib.base]  /dataset/resources/testing-xloader render time 0.173 seconds
ckan-dev_1    | 2019-10-13 13:32:19,004 INFO  [ckan.lib.base]  /dataset/new_resource/testing-xloader render time 0.106 seconds
ckan-dev_1    | 2019-10-13 13:32:28,162 DEBUG [ckanext.xloader.plugin] Submitting resource 8037364f-eabc-444f-a2a2-13dc00fb4ba3 to be xloadered
ckan-dev_1    | 2019-10-13 13:32:28,197 INFO  [ckanext.xloader.action] Added background job 8ec95761-7eb5-4e05-acf6-396108094f88 to queue "default"
ckan-dev_1    | 2019-10-13 13:32:28,198 DEBUG [ckanext.xloader.action] Enqueued xloader job=8ec95761-7eb5-4e05-acf6-396108094f88 res_id=8037364f-eabc-444f-a2a2-13dc00fb4ba3
ckan-dev_1    | 2019-10-13 13:32:28,198 INFO  [rq.worker] ckan:default:default: ckanext.xloader.jobs.xloader_data_into_datastore({'result_url': u'http://ckan:5000/api/3/action/xloader_hook', 'api_key': u'3b32cbad-c8cc-4798-b927-d5828374f0a2', 'job_type': 'xloader_to_datastore', 'metadata': {'original_url': u'http://ckan:5000/dataset/bdc78b01-c24d-4416-a9ff-a6a1cc748e59/resource/8037364f-eabc-444f-a2a2-13dc00fb4ba3/download/mini-csv-3.csv', 'ckan_url': u'http://ckan:5000', 'resource_id': u'8037364f-eabc-444f-a2a2-13dc00fb4ba3', 'set_url_type': False, 'task_created': '2019-10-13 13:32:28.189102', 'ignore_hash': False}}) (8ec95761-7eb5-4e05-acf6-396108094f88)
ckan-dev_1    | 2019-10-13 13:32:28,199 INFO  [ckan.lib.jobs] Worker rq:worker:6da86188ebe4.72 starts job 8ec95761-7eb5-4e05-acf6-396108094f88 from queue "default"
ckan-dev_1    | 2019-10-13 13:32:28,204 DEBUG [ckanext.xloader.plugin] Submitting resource 77e90203-3cd7-4b46-a30d-779d2c4659f5 to be xloadered
ckan-dev_1    | 2019-10-13 13:32:28,245 INFO  [ckanext.xloader.action] Added background job a16ff460-695f-4f4a-8551-68d3974f6995 to queue "default"
ckan-dev_1    | 2019-10-13 13:32:28,245 DEBUG [ckanext.xloader.action] Enqueued xloader job=a16ff460-695f-4f4a-8551-68d3974f6995 res_id=77e90203-3cd7-4b46-a30d-779d2c4659f5
ckan-dev_1    | 2019-10-13 13:32:28,256 DEBUG [ckanext.xloader.plugin] Submitting resource 8371b91c-b59e-46f9-b6da-f4abf9e37ec9 to be xloadered
ckan-dev_1    | 2019-10-13 13:32:28,307 INFO  [ckanext.xloader.action] Added background job 4bf98711-d2ef-4e6f-8045-134580a0bcc2 to queue "default"
ckan-dev_1    | 2019-10-13 13:32:28,308 DEBUG [ckanext.xloader.action] Enqueued xloader job=4bf98711-d2ef-4e6f-8045-134580a0bcc2 res_id=8371b91c-b59e-46f9-b6da-f4abf9e37ec9
ckan-dev_1    | 2019-10-13 13:32:28,483 INFO  [ckan.lib.base]  /dataset/new_resource/testing-xloader render time 0.427 seconds
ckan-dev_1    | 2019-10-13 13:32:28,529 WARNI [ckan.lib.maintain] Function _resource_preview() in module ckan.controllers.package has been deprecated and will be removed in a later release of ckan. Resource preview is deprecated. Please use the new resource views
ckan-dev_1    | 2019-10-13 13:32:28,533 WARNI [ckan.lib.maintain] Function _resource_preview() in module ckan.controllers.package has been deprecated and will be removed in a later release of ckan. Resource preview is deprecated. Please use the new resource views
ckan-dev_1    | 2019-10-13 13:32:28,538 WARNI [ckan.lib.maintain] Function _resource_preview() in module ckan.controllers.package has been deprecated and will be removed in a later release of ckan. Resource preview is deprecated. Please use the new resource views

Some debug insights

I don't know exactly why this is happening, but here are some insights from my debugging.

  • Each time a new resource is added to the dataset, the method lib.dictization.resource_dict_save is executed for each resource.
  • Inside that method, this condition is met because, when a resource is uploaded, the __init__ method of the ResourceUpload object sets the last_modified field.
  • So for each uploaded resource, obj.url_changed = True will be set each time another resource is edited in or added to the package.
  • This causes the notify() method of xloader to be called, since it implements IResourceUrlChanged.
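
The behaviour described above suggests a narrower check: submit a resource only when it is new or its URL differs from the previously saved one. The following is a minimal sketch of that idea in plain Python, not CKAN's or xloader's actual code; the function name `resources_to_submit` and the simplified dict shape are assumptions made for illustration.

```python
def resources_to_submit(old_resources, new_resources):
    """Return ids of resources that are new or whose URL really changed.

    Both arguments are lists of dicts with 'id' and 'url' keys, a
    hypothetical, simplified stand-in for CKAN resource dicts.
    """
    old_urls = {res['id']: res['url'] for res in old_resources}
    to_submit = []
    for res in new_resources:
        # A resource missing from the old state is new; otherwise compare URLs.
        if res['id'] not in old_urls or old_urls[res['id']] != res['url']:
            to_submit.append(res['id'])
    return to_submit


# Two existing uploads, then a third file is added to the same dataset.
before = [
    {'id': 'a', 'url': 'mini-csv.csv'},
    {'id': 'b', 'url': 'mini-csv-2.csv'},
]
after = before + [{'id': 'c', 'url': 'mini-csv-3.csv'}]
submitted = resources_to_submit(before, after)
```

With this check only the newly added resource would be submitted, rather than all three as in the log above.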

CKAN bad response. Status code: 404 Not Found.

I just installed ckanext-xloader as a replacement for datapusher. When I tried to push resources to the datastore, I got the following error message: CKAN bad response. Status code: 404 Not Found. At: http://localhost:8080/api/3/action/resource_show. status=404.

I would add that my ckan.site_url is set to http://localhost:8080 and I also have a ckan.root_path set to /ckan/.

I was expecting the datapusher to try its thing at http://localhost:8080/ckan/api/3/action/ but it seems as if it is not taking this into consideration.

What can I do?

Edit
Seems as if updating [...]/ckanext-xloader/ckanext/xloader/action.py L72/L73 to
site_url = config['ckan.site_url'] + config['ckan.root_path'] instead of just site_url = config['ckan.site_url'] does the trick.

Would it be possible to integrate the root_path in upcoming releases in a cleaner way?
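
As a rough illustration of what a cleaner version of the workaround above might look like, the sketch below joins ckan.site_url and ckan.root_path while tolerating a missing root_path and stray slashes. The helper name `site_url_with_root_path` is hypothetical, and the handling of a `{{LANG}}` placeholder is an assumption about how root_path may be configured; this is not the code used by xloader.

```python
def site_url_with_root_path(config):
    """Join ckan.site_url and ckan.root_path into one base URL.

    `config` is any dict-like object holding the CKAN settings.
    """
    site_url = config['ckan.site_url'].rstrip('/')
    root_path = config.get('ckan.root_path') or ''
    # Assumption: root_path may carry a {{LANG}} placeholder that is
    # not wanted when building API URLs.
    root_path = root_path.replace('{{LANG}}', '')
    # Drop empty segments so leading, trailing and doubled slashes vanish.
    parts = [part for part in root_path.split('/') if part]
    if parts:
        return site_url + '/' + '/'.join(parts)
    return site_url


config = {
    'ckan.site_url': 'http://localhost:8080',
    'ckan.root_path': '/ckan/',
}
api_url = site_url_with_root_path(config) + '/api/3/action/resource_show'
```

With the settings from this report, `api_url` would point under /ckan/ instead of at the bare site root.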

Failure when running with numpy >= 1.16

We're experimenting with XLoader, and the initial results are very promising, but it fails when running with the default (latest) numpy version, 1.16.5. Looks like numpy/numpy#14012

mod_wsgi (pid=18160): Exception occurred processing WSGI script '/etc/ckan/default/apache.wsgi'.
Traceback (most recent call last):
  File "/etc/ckan/default/apache.wsgi", line 11, in <module>
    application = loadapp('config:%s' % config_filepath)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 247, in loadapp
    return loadobj(APP, uri, name=name, **kw)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 272, in loadobj
    return context.create()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 710, in create
    return self.object_type.invoke(self)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/deploy/loadwsgi.py", line 146, in invoke
    return fix_call(context.object, context.global_conf, **context.local_conf)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/paste/deploy/util.py", line 55, in fix_call
    val = callable(*args, **kw)
  File "/usr/lib/ckan/default/src/ckan/ckan/config/middleware/__init__.py", line 55, in make_app
    load_environment(conf, app_conf)
  File "/usr/lib/ckan/default/src/ckan/ckan/config/environment.py", line 116, in load_environment
    p.load_all()
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 140, in load_all
    load(*plugins)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 154, in load
    service = _get_service(plugin)
  File "/usr/lib/ckan/default/src/ckan/ckan/plugins/core.py", line 256, in _get_service
    return plugin.load()(name=plugin_name)
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2443, in load
    return self.resolve()
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2449, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/lib/ckan/default/src/ckanext-xloader/ckanext/xloader/plugin.py", line 6, in <module>
    from ckanext.xloader import action, auth
  File "/usr/lib/ckan/default/src/ckanext-xloader/ckanext/xloader/action.py", line 16, in <module>
    import jobs
  File "/usr/lib/ckan/default/src/ckanext-xloader/ckanext/xloader/jobs.py", line 23, in <module>
    import loader
  File "/usr/lib/ckan/default/src/ckanext-xloader/ckanext/xloader/loader.py", line 9, in <module>
    import messytables
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/messytables/__init__.py", line 22, in <module>
    from messytables.pdf import PDFTableSet, PDFRowSet
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/messytables/pdf.py", line 6, in <module>
    from pdftables import get_tables
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/pdftables/__init__.py", line 1, in <module>
    from pdftables import *
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/pdftables/pdftables.py", line 36, in <module>
    import numpy # TODO: remove this dependency
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/numpy/__init__.py", line 142, in <module>
    from . import core
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/numpy/core/__init__.py", line 40, in <module>
    from . import multiarray
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/numpy/core/multiarray.py", line 13, in <module>
    from . import overrides
  File "/usr/lib/ckan/default/lib/python2.7/site-packages/numpy/core/overrides.py", line 46, in <module>
    """)
RuntimeError: implement_array_function method already has a docstring

Perhaps the requirements.txt should explicitly set numpy==1.15.4 to work around this? I can confirm that manually installing 1.15.4 fixed the problem for us.

Speed comparison with DataPusher

Summary

Express Loader loads the data 11.4 times faster than DataPusher.

Test conditions:

  • Load of Boston 311 dataset (1033882 rows, 475MB)
  • Run locally on a MacBook Pro (i7, 2013 model)

stats with ckanext-xloader

12s - retrieve the file (over HTTP) from local FileStore
23s - convert to UTF8
21s - copy CSV file into PostgreSQL table (one COPY command)
160s - create search index

Total: 206 seconds

At this point the full data is made available to the user.

Afterwards the column indexes are generated which simply speed up common queries - this takes a further 1262s. However we exclude this from the load time, as it is merely an optimization.

stats with datapusher

12s - retrieve the file (over HTTP) from local FileStore
2338s - convert to UTF8 and then to JSON, set up postgres indexes to be generated during load, load JSON into table (4000 INSERT statements).

Total: 2350s
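
The headline figure in the summary follows from the two reported totals. A small check, using only the totals as stated above (the variable names are illustrative):

```python
# Totals as reported above, in seconds.
xloader_total_s = 206
datapusher_total_s = 2350

# float() keeps the division exact on both Python 2 and Python 3.
speedup = datapusher_total_s / float(xloader_total_s)
print("%.1f" % speedup)  # prints 11.4
```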

TSVs loaded with "datapusher" mode

When loading a TSV, xloader uses the slower load_table, datapusher style mode, not the faster load_csv mode.

Can't the PostgreSQL COPY command handle TSVs too?
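
For context, PostgreSQL's COPY accepts a DELIMITER option in CSV format, so a tab-separated file could in principle go through the same fast path. The sketch below only builds the SQL statement; the function name `build_copy_sql` and the table and column names are hypothetical, and this is not how xloader's load_csv is actually written.

```python
def build_copy_sql(table, columns, delimiter=','):
    """Build a COPY ... FROM STDIN statement for a delimited text file."""
    def quote_ident(name):
        # Double any embedded double quotes, per SQL identifier rules.
        return '"' + name.replace('"', '""') + '"'

    if delimiter == '\t':
        # A tab is written as an escape string constant in SQL.
        delimiter_sql = "E'\\t'"
    else:
        delimiter_sql = "'" + delimiter.replace("'", "''") + "'"

    column_list = ', '.join(quote_ident(col) for col in columns)
    return (
        'COPY {} ({}) FROM STDIN '
        'WITH (FORMAT csv, HEADER true, DELIMITER {})'
    ).format(quote_ident(table), column_list, delimiter_sql)


tsv_sql = build_copy_sql('my_table', ['column_1', 'column_2'], delimiter='\t')
```

Whether this is enough in practice would depend on the TSV files themselves, since their quoting conventions may differ from what CSV mode expects.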

CKAN 2.9+ gives resource_revision_table error during job

When running against CKAN latest master (2.9a), I see this in tests:

  File "/vagrant/src/ckanext-xloader/ckanext/xloader/jobs.py", line 390, in set_resource_metadata
    if hasattr(model, 'resource_revision_table'):
AttributeError: 'module' object has no attribute 'resource_revision_table'

Enable Travis CI build

Is there a reason why Travis is not enabled for this repository? I think it's very helpful to have the tests executed on Travis CI.

xloader ignores datasets uploaded via datastore API

Using CKAN 2.8.2 on Ubuntu, I am trying to create a data flow that uses ckanapi to define the datasets and resource metadata, and to upload the dataset resources. Small datasets (50k-100k rows) seem to work fine, but larger sets (500k+ rows) do not seem to complete via the API. They can be uploaded and complete if the web UI is used; they took a few hours with datapusher but now take just a few minutes with xloader.

I'm using api/action/datastore_create to upload the dataset resources. The xloader worker skips over the resources uploaded via the datastore API, saying "Ignoring resource - url_type=datastore - dump files are managed with the Datastore API".

What API do I use to upload resources so that xloader will do the work? Or is there a different recommended API/method for larger resources?

Thanks

Unresponsive resource pages during xloading

There has been a report that while xloading a large CSV (100MB+) into Postgres, users find that the resource pages time out, i.e. won't load. Supposedly this occurs on all resource pages, but I've only reproduced it for resources which are large, e.g. 4 million rows.

Tests show that the bottleneck is CPU use. The two Postgres processes involved each get 50% of the CPU, so it's not as if xloader is hogging more than its fair share. Profiling shows the problem is that previewing a large resource runs the SQL command count(*) four times, and because there is no indexing this is very expensive for a table with millions of rows.

The site in question has solved the issue by switching to a pair of replicated Postgres instances - the master instance for writes and the slave instance for reads. This required PR#2562, so CKAN 2.8.x or higher, and configuring ckan.datastore.write_url and ckan.datastore.read_url to point at the different database instances.

I'm leaving this open for anyone else to report whether or not this is an issue for them with larger datasets (100MB+ / 1M+ rows).

Why is datapusher url referenced in xloader controller action?

See https://github.com/davidread/ckanext-xloader/blob/master/ckanext/xloader/action.py#L325

I've been testing an xloader install, and when I go to view the DataStore tab for a resource in the web UI I get a stack trace...

File '/usr/lib/ckan/default/src/ckanext-xloader/ckanext/xloader/controllers.py', line 38 in resource_data
  None, {'resource_id': resource_id}
File '/usr/lib/ckan/default/src/ckan/ckan/logic/__init__.py', line 457 in wrapped
  result = _action(context, data_dict, **kw)
File '/usr/lib/ckan/default/src/ckanext-xloader/ckanext/xloader/action.py', line 328 in xloader_status
  {'configuration': ['ckan.datapusher.url not in config file']})
ValidationError: {'configuration': ['ckan.datapusher.url not in config file']}

If this module replaces datapusher altogether, why is this config option being referenced? It's causing a regression elsewhere in the application.

Reporting as a bug.

Cannot see the upload status

Shift only allows users to view the data after the upload is completed. So when we upload a large file, such as the 311 service requests file on data.boston.gov, which is over 500 MB, we need to wait a long time before we can see any data. This may cause problems if there's no log that indicates the upload status.
