
ckanext-archiver's Introduction

CKAN: The Open Source Data Portal Software

Badges: License | Documentation | Support on StackOverflow | Build Status | Coverage Status | Chat on Gitter

CKAN is the world’s leading open-source data portal platform. CKAN makes it easy to publish, share and work with data. It's a data management system that provides a powerful platform for cataloging, storing and accessing datasets with a rich front-end, full API (for both data and catalog), visualization tools and more. Read more at ckan.org.

Installation

See the CKAN Documentation for installation instructions.

Support

If you need help with CKAN or want to ask a question, use either the ckan-dev mailing list, the CKAN chat on Gitter, or the CKAN tag on Stack Overflow (try searching the Stack Overflow and ckan-dev archives for an answer to your question first).

If you've found a bug in CKAN, open a new issue on CKAN's GitHub Issues (try searching first to see if there's already an issue for your bug).

If you find a potential security vulnerability please email [email protected], rather than creating a public issue on GitHub.

Contributing to CKAN

For contributing to CKAN or its documentation, see CONTRIBUTING.

Mailing List

Subscribe to the ckan-dev mailing list to receive news about upcoming releases and future plans as well as questions and discussions about CKAN development, deployment, etc.

Community Chat

If you want to talk about CKAN development say hi to the CKAN developers and members of the CKAN community on the public CKAN chat on Gitter. Gitter is free and open-source; you can sign in with your GitHub, GitLab, or Twitter account.

The logs for the old #ckan IRC channel (2014 to 2018) can be found here: https://github.com/ckan/irc-logs.

Wiki

If you've figured out how to do something with CKAN and want to document it for others, make a new page on the CKAN wiki and tell us about it on the ckan-dev mailing list or on Gitter.

Copying and License

This material is copyright (c) 2006-2023 Open Knowledge Foundation and contributors.

It is open and licensed under the GNU Affero General Public License (AGPL) v3.0 whose full text may be found at:

http://www.fsf.org/licensing/licenses/agpl-3.0.html

ckanext-archiver's People

Contributors

amercader, avdata99, brew, bzar, drmalex07, duttonw, icmurray, johnglover, kindly, krzysztofmadejski, morty, nigelbabu, rossjones, rufuspollock, thenets, thrawnca, threeaims, volpino, zharktas


ckanext-archiver's Issues

Simplify "do we need to archive?" logic with IResourceUrlChange

The IResourceUrlChange interface would provide a much simpler and more efficient mechanism to check whether we need to run the archiver for a changed resource, rather than querying the activity history (which is entirely ineffective anyway for private datasets).

This would also neatly resolve #77
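
A minimal sketch of what that could look like (assuming CKAN's IResourceUrlChange interface, whose notify(resource) hook fires only when a resource URL changes; the name of the queueing helper in ckanext.archiver.lib is assumed):

import ckan.plugins as p
from ckanext.archiver import lib

class ArchiverUrlChangePlugin(p.SingletonPlugin):
    p.implements(p.IResourceUrlChange)

    def notify(self, resource):
        # CKAN calls this only for resources whose URL actually changed,
        # so no activity-history query is needed to decide whether to archive.
        lib.create_archiver_resource_task(resource, 'priority')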

Support for old ckan releases breaks compatibility with CKAN 2.3+

The current implementation claims to support CKAN 2.1 and above. To determine which CKAN version is running, it checks for the presence of ResourceGroup:

https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/commands.py#L181

However, this check is not applied everywhere, so some paster commands (e.g. update, migrate-archive-dirs) fail on newer CKAN versions:

https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/commands.py#L200
https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/commands.py#L389

In ckanext-qa we find the same problem in commands.py:

https://github.com/ckan/ckanext-qa/blob/master/ckanext/qa/commands.py#L155

I would also advise using the same check in all places. In some places the following check is used:

if p.toolkit.check_ckan_version(max_version='2.2.99')

whereas in others this one is used:

if hasattr(model, 'ResourceGroup')
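
One way to unify this (a sketch only; the helper name is made up) would be a single function shared by both extensions:

import ckan.plugins as p

def is_legacy_ckan():
    """Return True on CKAN 2.2.x or older, i.e. versions that still have ResourceGroup."""
    return p.toolkit.check_ckan_version(max_version='2.2.99')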

I can make a pull request regarding this if I find the time

AttributeError: 'module' object has no attribute 'DomainObjectOperation'

ERROR [ckan.model.modification] 'module' object has no attribute 'DomainObjectOperation'
Traceback (most recent call last):
File "/root/pyenv/src/ckan/ckan/model/modification.py", line 68, in notify
observer.notify(entity, operation)
File "/root/pyenv/src/ckanext-archiver/ckanext/archiver/plugin.py", line 29, in notify
if operation == model.DomainObjectOperation.new:
AttributeError: 'module' object has no attribute 'DomainObjectOperation'
Error - <type 'exceptions.AttributeError'>: 'module' object has no attribute 'DomainObjectOperation'

Archiver task cannot handle non-standard resource IDs

tasks.py retrieves package details using the package_show action, and expects the retrieved resource IDs to be UUIDs (it validates them using is_id). However, it is possible for the IDs to have other formats, which makes the archival task fail.

Compatibility with CKAN 2.8.0

Hello,

Is it possible to use the archiver with CKAN 2.8.0? Starting CKAN gives me the error
"ImportError: No module named celery_app"

How to start the two paster commands for celery in the background

Hi,

I am trying to install this CKAN extension on CentOS 7.6. The command lines that start the celery processes run them in the foreground. Is there an option to start these processes as background daemons, and to use them with the monit yum package?

Regards

Error after adding a package/dataset - CKAN 2.6.2

I got this error when trying to add a new package:

2017-06-21 10:36:32,052 INFO  [ckan.lib.base]  /api/i18n/en render time 0.008 seconds
/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py:79: SAWarning: Usage of the 'related attribute set' operation is not currently supported within the execution stage of the flush process. Results may not be consistent.  Consider using alternative event listeners or connection-level operations instead.
  sess._flush_warning("related attribute set")
2017-06-21 10:36:39,449 ERROR [ckan.model.modification] (ProgrammingError) relation "archival" does not exist
LINE 2: FROM archival JOIN resource ON archival.resource_id = resour...
             ^
 'SELECT archival.id AS archival_id, archival.package_id AS archival_package_id, archival.resource_id AS archival_resource_id, archival.resource_timestamp AS archival_resource_timestamp, archival.status_id AS archival_status_id, archival.is_broken AS archival_is_broken, archival.reason AS archival_reason, archival.url_redirected_to AS archival_url_redirected_to, archival.cache_filepath AS archival_cache_filepath, archival.cache_url AS archival_cache_url, archival.size AS archival_size, archival.mimetype AS archival_mimetype, archival.hash AS archival_hash, archival.etag AS archival_etag, archival.last_modified AS archival_last_modified, archival.first_failure AS archival_first_failure, archival.last_success AS archival_last_success, archival.failure_count AS archival_failure_count, archival.created AS archival_created, archival.updated AS archival_updated \nFROM archival JOIN resource ON archival.resource_id = resource.id \nWHERE archival.package_id = %(package_id_1)s AND resource.state = %(state_1)s' {'package_id_1': u'cc687cd5-3f7e-4011-8142-80d20965227f', 'state_1': 'active'}
Traceback (most recent call last):
  File "/usr/lib/ckan/default/src/ckan/ckan/model/modification.py", line 88, in notify
    observer.notify(entity, operation)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/search/__init__.py", line 130, in notify
    {'id': entity.id}),
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/__init__.py", line 431, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/action/get.py", line 1011, in package_show
    item.after_show(context, package_dict)
  File "/usr/lib/ckan/default/src/ckanext-archiver/ckanext/archiver/plugin.py", line 184, in after_show
    archivals = Archival.get_for_package(pkg_dict['id'])
  File "/usr/lib/ckan/default/src/ckanext-archiver/ckanext/archiver/model.py", line 143, in get_for_package
    .filter(model.Resource.state=='active') \
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2293, in all
    return list(self)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2405, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2420, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 727, in execute
    return meth(self, multiparams, params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 824, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context
    context)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 189, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context
    context)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
    cursor.execute(statement, parameters)
ProgrammingError: (ProgrammingError) relation "archival" does not exist
LINE 2: FROM archival JOIN resource ON archival.resource_id = resour...
             ^
 'SELECT archival.id AS archival_id, archival.package_id AS archival_package_id, archival.resource_id AS archival_resource_id, archival.resource_timestamp AS archival_resource_timestamp, archival.status_id AS archival_status_id, archival.is_broken AS archival_is_broken, archival.reason AS archival_reason, archival.url_redirected_to AS archival_url_redirected_to, archival.cache_filepath AS archival_cache_filepath, archival.cache_url AS archival_cache_url, archival.size AS archival_size, archival.mimetype AS archival_mimetype, archival.hash AS archival_hash, archival.etag AS archival_etag, archival.last_modified AS archival_last_modified, archival.first_failure AS archival_first_failure, archival.last_success AS archival_last_success, archival.failure_count AS archival_failure_count, archival.created AS archival_created, archival.updated AS archival_updated \nFROM archival JOIN resource ON archival.resource_id = resource.id \nWHERE archival.package_id = %(package_id_1)s AND resource.state = %(state_1)s' {'package_id_1': u'cc687cd5-3f7e-4011-8142-80d20965227f', 'state_1': 'active'}
2017-06-21 10:36:39,464 DEBUG [ckanext.archiver.plugin] Notified of package event: 123123123 new
2017-06-21 10:36:39,472 DEBUG [ckanext.archiver.plugin] New package - will archive
2017-06-21 10:36:39,473 DEBUG [ckanext.archiver.plugin] Creating archiver task: 123123123
2017-06-21 10:36:39,604 DEBUG [ckanext.archiver.lib] Archival of package put into celery queue priority: 123123123
Debug at: http://localhost:5000/_debug/view/1498041399

It looks like some incompatibility with the 2.6.x version after a migration.

Maintainer needed!

I've not been involved in ckanext-archiver for a while. There are a few outstanding issues - nothing big - some queries about celery. I wonder if anyone who is actually using this CKAN extension is willing to take on a small responsibility for it: have a look at it and tend to it going forward?

@Zharktas @thenets @KrzysztofMadejski

AttributeError: 'update_package' object has no attribute 'get_logger'

I am getting the following error in the archiver celery "priority" queue log for all datasets:

[2017-03-15 09:21:44,266: ERROR/MainProcess] Task archiver.update_package[fotografia-aerea-6236-rollo-50-camara-rc-30-proyecto-guayaquil_2005-6-escala-1-10000-color/6258] raised unexpected: AttributeError("'update_package' object has no attribute 'get_logger'",)
Traceback (most recent call last):
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 238, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/app/trace.py", line 416, in protected_call
return self.run(*args, **kwargs)
File "/usr/lib/ckan/default/src/ckanext-archiver/ckanext/archiver/tasks.py", line 116, in update_package
log = update_package.get_logger()
File "/usr/lib/ckan/default/lib/python2.7/site-packages/celery/local.py", line 143, in getattr
return getattr(self._get_current_object(), name)
AttributeError: 'update_package' object has no attribute 'get_logger'

Error with the download: 'ascii' codec can't encode character

Hello,

I'm running a Spanish CKAN instance, version 2.8.3, on Ubuntu 16.04. The datasets are harvested from other Spanish CKAN platforms, and the dataset information and titles contain non-ASCII characters. ckanext-archiver gives me this error for all harvested documents:

 Link is broken
- Error with the download: 'ascii' codec can't encode character u'\xf3' in position 39: ordinal not in range(128)
This resource has failed 76 times in a row since it first failed: Noviembre 21, 2019
We do not have a past record of it working since the first check: Noviembre 21, 2019
Link checked: Enero 26, 2020

No cached copy available

here is the archiver priority log:

2020-01-26 06:33:36,903 INFO  [rq.worker] ckan:default:priority: ckanext.archiver.tasks.update_package('/etc/ckan/default/production.ini', u'5bafb6c1-a2c0-455c-85fb-dc28bd7a987e') (fc0c08f$
2020-01-26 06:33:36,904 INFO  [ckan.lib.jobs] Worker rq:worker:opendata.4792 starts job fc0c08f9-3146-4e0a-8f28-0ed99302b1b5 from queue "priority"
2020-01-26 06:33:37,238 DEBUG [ckanext.harvest.model] Harvest tables already exist
2020-01-26 06:33:37,738 DEBUG [ckanext.harvest.model] Harvest tables already exist
2020-01-26 06:33:37,876 INFO  [ckanext.archiver.tasks] Starting update_package task: package_id=u'5bafb6c1-a2c0-455c-85fb-dc28bd7a987e' queue=bulk
2020-01-26 06:33:38,270 DEBUG [ckanext.harvest.model] Harvest tables already exist
2020-01-26 06:33:38,451 INFO  [ckanext.archiver.tasks] Attempting to download resource: http://geoserver.villanuevadelaserena.es:80/geoserver/LG3_WS_MapPublish_public/ows?service=WMS&reque$
2020-01-26 06:33:38,455 INFO  [ckanext.archiver.tasks] GET error: Download error - DownloadException("Error with the download: 'ascii' codec can't encode character u'\\xf3' in position 39:$
2020-01-26 06:33:38,456 INFO  [ckanext.archiver.tasks] API <function wms_1_3_request at 0x7fca26daecf8> error: DownloadException("Error with the download: 'ascii' codec can't encode charac$
2020-01-26 06:33:38,458 INFO  [ckanext.archiver.tasks] API <function wms_1_1_1_request at 0x7fca26daed70> error: DownloadException("Error with the download: 'ascii' codec can't encode char$
2020-01-26 06:33:38,459 INFO  [ckanext.archiver.tasks] API <function wfs_request at 0x7fca26daede8> error: DownloadException("Error with the download: 'ascii' codec can't encode character $
2020-01-26 06:33:38,462 INFO  [ckanext.archiver.tasks] Archival from before: <Archival Broken /dataset/vias-verdes/resource/bd78deb5-5ada-4154-90a5-f473a45ca9b3 75 failures>
2020-01-26 06:33:38,466 INFO  [ckanext.archiver.tasks] First_archival=False Previous_broken=True Failure_count=75
2020-01-26 06:33:38,466 INFO  [ckanext.archiver.tasks] Archival saved: <Archival Broken /dataset/vias-verdes/resource/bd78deb5-5ada-4154-90a5-f473a45ca9b3 76 failures>
2020-01-26 06:33:38,755 DEBUG [ckanext.harvest.model] Harvest tables already exist
2020-01-26 06:33:38,933 INFO  [ckanext.archiver.tasks] Attempting to download resource: http://geoserver.villanuevadelaserena.es/geoserver/wfs/ows?service=WFS&version=1.0.0&request=GetFeat$
2020-01-26 06:33:38,937 INFO  [ckanext.archiver.tasks] GET error: Download error - DownloadException("Error with the download: 'ascii' codec can't encode character u'\\xf3' in position 39:$
2020-01-26 06:33:38,938 INFO  [ckanext.archiver.tasks] API <function wms_1_3_request at 0x7fca26daecf8> error: DownloadException("Error with the download: 'ascii' codec can't encode charac$
2020-01-26 06:33:38,940 INFO  [ckanext.archiver.tasks] API <function wms_1_1_1_request at 0x7fca26daed70> error: DownloadException("Error with the download: 'ascii' codec can't encode char$
2020-01-26 06:33:38,941 INFO  [ckanext.archiver.tasks] API <function wfs_request at 0x7fca26daede8> error: DownloadException("Error with the download: 'ascii' codec can't encode character $
2020-01-26 06:33:38,943 INFO  [ckanext.archiver.tasks] Archival from before: <Archival Broken /dataset/vias-verdes/resource/4fecc22a-896c-49f4-a44c-5ba43605cda3 75 failures>
2020-01-26 06:33:38,947 INFO  [ckanext.archiver.tasks] First_archival=False Previous_broken=True Failure_count=75
2020-01-26 06:33:38,947 INFO  [ckanext.archiver.tasks] Archival saved: <Archival Broken /dataset/vias-verdes/resource/4fecc22a-896c-49f4-a44c-5ba43605cda3 76 failures>
2020-01-26 06:33:39,318 DEBUG [ckanext.harvest.model] Harvest tables already exist

Use CKAN_INI in celeryd2 run all

The paster --plugin=ckanext-archiver celeryd2 run all script could read the CKAN_INI environment variable instead of requiring the config file to be specified by hand.
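
A sketch of what that could look like inside the command (variable and function names are hypothetical):

import os
import sys

def resolve_config_path(cli_config_arg=None):
    """Prefer the -c argument; otherwise fall back to the CKAN_INI environment variable."""
    config_path = cli_config_arg or os.environ.get('CKAN_INI')
    if not config_path:
        sys.exit('Supply a CKAN config with -c or set CKAN_INI')
    return config_path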

Document download proxy config option

This plugin supports the ckan.download_proxy option, but the README file doesn't mention it.

Configuring a secure proxy server for file downloads is important in any environment with privileged network access, such as running on an Amazon EC2 instance. Without a filter, and with a plugin that displays resource contents to the end user, anyone capable of creating a resource can point it at a private IP address and have CKAN display the potentially sensitive contents of that URL. Thus, the README file for this plugin should mention the importance of setting up a filtering proxy.
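
For the README, something along these lines could be suggested (a sketch; the host and port are placeholders pointing at a locally run filtering proxy such as the Squid setup below):

# in the [app:main] section of the CKAN config file
ckan.download_proxy = http://127.0.0.1:3128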

An example of an appropriate filter configuration is at https://feeding.cloud.geek.nz/posts/restricting-outgoing-webapp-requests-using-squid-proxy/, which gives a Squid config block (assuming Squid listens on port 3128):

acl to_localnet dst 0.0.0.1-0.255.255.255 # RFC 1122 "this" network (LAN)
acl to_localnet dst 10.0.0.0/8            # RFC 1918 local private network (LAN)
acl to_localnet dst 100.64.0.0/10         # RFC 6598 shared address space (CGN)
acl to_localnet dst 169.254.0.0/16        # RFC 3927 link-local (directly plugged) machines
acl to_localnet dst 172.16.0.0/12         # RFC 1918 local private network (LAN)
acl to_localnet dst 192.168.0.0/16        # RFC 1918 local private network (LAN)
acl to_localnet dst fc00::/7              # RFC 4193 local private network range
acl to_localnet dst fe80::/10             # RFC 4291 link-local (directly plugged) machines

acl SSL_ports port 443
acl Safe_ports port 80
acl Safe_ports port 443
acl CONNECT method CONNECT

http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access deny manager
http_access deny to_localhost
http_access deny to_localnet
http_access allow localhost
http_access deny all

http_port 127.0.0.1:3128

Unnecessary tasks created when uploaded resources are modified

The logic for determining whether a resource URL has changed (and therefore needs archiving) doesn't properly handle uploaded files, because the 'new' resource is the plain filename while the 'old' one has the full URL. Editing and saving an uploaded resource without making any further changes will result in an unnecessary archiver task, with a log message similar to:

DEBUG [ckanext.archiver.plugin] Resource url changed - will archive. id=3fbf pos=0 url="https://example.com/dataset/5bc484a7-8773-4301-b925-c7ba7ca5878c/resource/3fbf6fcb-fd65-4c73-b82d-21f53810c788/download/example.pdf"->"example.pdf"
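
A possible guard (a sketch only; the helper and its call site are hypothetical) would be to compare just the filename when the resource is an upload:

import os
try:
    from urllib.parse import urlparse
except ImportError:  # Python 2
    from urlparse import urlparse

def url_effectively_changed(old_url, new_url, url_type):
    """Treat an uploaded file as unchanged when only the URL form differs."""
    if url_type == 'upload':
        # Uploads may appear as a bare filename or as the full download URL,
        # so compare only the final path component.
        return (os.path.basename(urlparse(old_url).path)
                != os.path.basename(urlparse(new_url).path))
    return old_url != new_url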

Cached versions of datasets don't appear in the resources list

CKAN: 2.6.1

I've set up archiver for production with supervisor and I'm running into 2 issues.

The first is that the 'cached' indicator is not present on the dataset page after running paster archiver update-test and then verifying that the cached datasets are in fact copied to their respective directories.

Secondly, supervisor doesn't seem to automatically process the datasets that are on the queue. update-test forces that backup successfully and I can test that it's working.

Are there any obvious supervisor setup steps I might be missing or that anyone has run into?
Are there any more robust examples of the config options that need to be in .ini?
The documentation is clear on the steps, but I don't understand how strict the configuration options are - whether the specific ports/names are important, or just defaults that can be copied straight in after installing redis-server.

Race condition when resource URL is changed

Because the notification ultimately comes from the SQLAlchemy before_commit hook, it is possible for the Celery task to read the old details out of the database.

Fixing this in the extension would require sending all of the information to the task rather than having the task look it up. But that would mean that if the task takes some time to be scheduled, the information might be out of date.

A better solution might be to fix CKAN so that it fires the notification on the after_commit hook.

ERROR [ckanext.archiver.tasks] Error occurred during archiving package: 'NoneType' object has no attribute 'ugettext'

When I use the extension in the terminal I get an error...

The backup/archive completes just fine, and CKAN adds all the info on the resource page (Link is ok, Link checked, Download cached copy, Size, Cached on).

After the archiving completes I get this and the process does not exit:

2018-07-19 11:59:34,908 INFO  [ckanext.archiver.tasks] Notifying package as 1 items were archived

2018-07-19 11:59:34,941 ERROR [ckanext.archiver.tasks] Error occurred during archiving package: 'NoneType' object has no attribute 'ugettext'
Package: 857b7063-f48a-46dc-9777-ed2f55742de5

[2018-07-19 11:59:34,950: ERROR/MainProcess] Task archiver.update_package[another-test/1738] raised unexpected: AttributeError("'NoneType' object has no attribute 'ugettext'",)
Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/hit/ckan/lib/default/src/ckanext-archiver/ckanext/archiver/tasks.py", line 150, in update_package
    _update_package(ckan_ini_filepath, package_id, queue, log)
  File "/home/hit/ckan/lib/default/src/ckanext-archiver/ckanext/archiver/tasks.py", line 186, in _update_package
    _update_search_index(package_id, log)
  File "/home/hit/ckan/lib/default/src/ckanext-archiver/ckanext/archiver/tasks.py", line 200, in _update_search_index
    package = toolkit.get_action('package_show')(context_, {'id': package_id})
  File "/home/hit/ckan/lib/default/src/ckan/ckan/logic/__init__.py", line 431, in wrapped
    result = _action(context, data_dict, **kw)
  File "/home/hit/ckan/lib/default/src/ckan/ckan/logic/action/get.py", line 976, in package_show
    package_dict = model_dictize.package_dictize(pkg, context)
  File "/home/hit/ckan/lib/default/src/ckan/ckan/lib/dictization/model_dictize.py", line 301, in package_dictize
    result_dict['license_title'] = pkg.license.title.split('::')[-1]
  File "/home/hit/ckan/lib/default/src/ckan/ckan/model/license.py", line 50, in __getattr__
    return self._data[name]
  File "/home/hit/ckan/lib/default/src/ckan/ckan/model/license.py", line 201, in __getitem__
    value = getattr(self, key)
  File "/home/hit/ckan/lib/default/src/ckan/ckan/model/license.py", line 272, in title
    return _("Creative Commons Attribution")
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/pylons/i18n/translation.py", line 106, in ugettext
    return pylons.translator.ugettext(value)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/registry.py", line 137, in __getattr__
    return getattr(self._current_obj(), attr)
AttributeError: 'NoneType' object has no attribute 'ugettext'

Don't archive uploaded resources

My latest archiver (5668c53 plus the #33 PR) archives resources that were uploaded to the server. This is a waste of space, as these files are already stored on the server.

Proposed solution:

  • Check that files from the ckan.site_url domain are not archived (see the sketch after the log below).
  • Do whatever postprocessing will allow ckanext-qa and others to run successfully. I guess that means writing an archival record pointing to the local resource under ckan.storage_path, plus the remote URL when the resource is served by CKAN.

[2017-03-15 11:07:56,236: INFO/MainProcess] Got task from broker: archiver.update_package[localhost/ecaa]
[2017-03-15 11:07:56,590: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Starting update_package task: package_id=u'514dcbb8-fabe-40cb-927a-3e408c431dfa' queue=priority
[2017-03-15 11:07:56,749: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Attempting to download resource: http://localhost/dataset/514dcbb8-fabe-40cb-927a-3e408c431dfa/resource/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/download/b08a4d7b01a04852b914e7904a73b1b8.png
[2017-03-15 11:07:56,756: INFO/PoolWorker-1] Starting new HTTP connection (1): localhost
[2017-03-15 11:07:58,842: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: GET started successfully. Content headers: {'transfer-encoding': 'chunked', 'accept-ranges': 'bytes', 'server': 'Apache/2.4.7 (Ubuntu)', 'last-modified': 'Wed, 15 Mar 2017 10:07:56 GMT', 'content-range': 'bytes 0-22897/22898', 'etag': '"1489572476.31-22898"', 'pragma': 'no-cache', 'cache-control': 'no-cache', 'date': 'Wed, 15 Mar 2017 10:07:56 GMT', 'content-type': 'image/png'}
[2017-03-15 11:07:58,842: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Downloading the body
[2017-03-15 11:07:58,842: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Saving resource
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Resource saved. Length: 22898 File: /tmp/tmpz8ExfA
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_resource[localhost/ecaa]: Resource downloaded: id=6be642a7-0e53-498c-b8ef-a8ca5e5062e6 url='http://localhost/dataset/514dcbb8-fabe-40cb-927a-3e408c431dfa/resource/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/download/b08a4d7b01a04852b914e7904a73b1b8.png' cache_filename=/tmp/tmpz8ExfA length=22898 hash=df011021f5d8135f55b8377160ba5503dbd6c316
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Attempting to archive resource
[2017-03-15 11:07:58,843: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Going to do chmod: /home/ckan/data/archiver/6b/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/b08a4d7b01a04852b914e7904a73b1b8.png
[2017-03-15 11:07:58,844: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Archived resource as: /home/ckan/data/archiver/6b/6be642a7-0e53-498c-b8ef-a8ca5e5062e6/b08a4d7b01a04852b914e7904a73b1b8.png
[2017-03-15 11:07:58,850: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Archival saved: <Archival Downloaded OK /dataset/localhost/resource/6be642a7-0e53-498c-b8ef-a8ca5e5062e6 >
[2017-03-15 11:07:58,857: INFO/PoolWorker-1] archiver.update_package[localhost/ecaa]: Notifying package as 1 items were archived
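
For the first bullet in the proposed solution, a sketch (hypothetical helper; assumes the resource URL and the ckan.site_url setting are available to the caller) might be:

try:
    from urllib.parse import urlparse
except ImportError:  # Python 2
    from urlparse import urlparse

def is_served_by_this_ckan(resource_url, site_url):
    """True when the resource URL points at this CKAN's own domain (ckan.site_url)."""
    return urlparse(resource_url).netloc == urlparse(site_url).netloc

# e.g. skip archiving when is_served_by_this_ckan(resource['url'], config['ckan.site_url'])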

SSL failures, but it works in a browser

e.g. https://services.historicengland.org.uk/NMRDataDownload/default.aspx loads in Chrome with a green padlock, yet it gives me an SSL error in ckanext-archiver/requests:

>>> import requests
>>> requests.get('https://services.historicengland.org.uk/NMRDataDownload/default.aspx')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/co/ckan/local/lib/python2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/home/co/ckan/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/co/ckan/local/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/co/ckan/local/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/home/co/ckan/local/lib/python2.7/site-packages/requests/adapters.py", line 431, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [Errno bad handshake] [('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')]

Versions:

$ pip freeze | egrep -i 'requests\=|certifi\='
certifi==2016.2.28
requests==2.7.0
$ python -c "import ssl; print ssl.OPENSSL_VERSION"
OpenSSL 1.0.1 14 Mar 2012

AttributeError: 'CKANFlask' object has no attribute 'app_context'

Hello guys,

When attempting to migrate to version 2.x the following error appears when opening any CKAN page:

Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/httpserver.py", line 1068, in process_request_in_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 649, in __init__
    self.handle()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/httpserver.py", line 442, in handle
    BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/httpserver.py", line 437, in handle_one_request
    self.wsgi_execute()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/httpserver.py", line 287, in wsgi_execute
    self.wsgi_start_response)
  File "/usr/lib/ckan/default/src/ckan/ckan/config/middleware/__init__.py", line 134, in __call__
    with flask_app.app_context():
AttributeError: 'CKANFlask' object has no attribute 'app_context'

Anyone know what's going on? Possible solutions?

-- I'm using CKAN 2.6.2

No handling for encoded URLs

I have run into this issue here: https://danepubliczne.gov.pl/dataset/informacja-kwartalna-o-stanie-finansow-publicznych/resource/86454cff-556a-4162-aa65-433158c133f4

Basically, the provider has linked the external resource as: http://www.mf.gov.pl/documents/764034/1002163/Informacja+kwartalna++III+kwarta%C5%82+2016+r.. To make it clearer, let's assume the filename is kwarta%C5%82+2016.

This file is saved to disk as is, i.e. kwarta%C5%82+2016.
It is then served by Apache with the percent signs escaped (kwarta%25C5%2582+2016), while CKAN links the archived version using the original URL (kwarta%C5%82+2016). That leads to a 404 error on the archived link.

I think we should decode any incoming URLs (as below) or strip all encoded characters. What do you think?

    # ckanext/archiver/tasks.py:556
    try:
        file_name = parsed_url.path.split('/')[-1] or 'resource'
        file_name = urllib.unquote(file_name) # DECODING ADDED HERE
        file_name = file_name.strip()  # trailing spaces cause problems
        file_name = file_name.encode('ascii', 'ignore')  # e.g. u'\xa3' signs

Large File leak in tasks._save_resource

Here: https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/tasks.py#L734

def _save_resource(resource, response, max_file_size, chunk_size=1024*16):
    """
    Write the response content to disk.
    Returns a tuple:
        (file length: int, content hash: string, saved file path: string)
    """
    resource_hash = hashlib.sha1()
    length = 0

    fd, tmp_resource_file_path = tempfile.mkstemp()

    with open(tmp_resource_file_path, 'wb') as fp:
        for chunk in response.iter_content(chunk_size=chunk_size,
                                           decode_unicode=False):
            fp.write(chunk)
            length += len(chunk)
            resource_hash.update(chunk)

            if length >= max_file_size:
                raise ChooseNotToDownload(
                    _("Content-length %s exceeds maximum allowed value %s") %
                    (length, max_file_size))

    os.close(fd)

    content_hash = unicode(resource_hash.hexdigest())
    return length, content_hash, tmp_resource_file_path

If the file is too large, it raises an error but there is not enough information in the exception to clean up the file.

Unfortunately, this means that "too large" resources will accumulate in the /tmp directory over time.
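
One way to avoid the leak (a sketch only, keeping the names from the snippet above; ChooseNotToDownload and _ are the exception class and translation helper already used in tasks.py) is to delete the temporary file whenever an exception escapes the write loop:

import hashlib
import os
import tempfile

def _save_resource(resource, response, max_file_size, chunk_size=1024 * 16):
    """
    Write the response content to disk, removing the temp file on failure.
    Returns a tuple:
        (file length: int, content hash: string, saved file path: string)
    """
    resource_hash = hashlib.sha1()
    length = 0

    fd, tmp_resource_file_path = tempfile.mkstemp()
    try:
        with open(tmp_resource_file_path, 'wb') as fp:
            for chunk in response.iter_content(chunk_size=chunk_size,
                                               decode_unicode=False):
                fp.write(chunk)
                length += len(chunk)
                resource_hash.update(chunk)

                if length >= max_file_size:
                    raise ChooseNotToDownload(
                        _("Content-length %s exceeds maximum allowed value %s") %
                        (length, max_file_size))
    except Exception:
        # Delete the partially written file so oversized or failed downloads
        # do not accumulate in /tmp, then let the caller handle the error.
        os.remove(tmp_resource_file_path)
        raise
    finally:
        os.close(fd)

    content_hash = unicode(resource_hash.hexdigest())
    return length, content_hash, tmp_resource_file_path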

Cannot route message for exchange 'bulk': Table empty or key no longer exists.

Hi everyone,

I installed this extension in my CKAN instance and I can't make it work through my entire catalog. The queuing process (paster --plugin=ckanext-archiver archiver update --queue=bulk -c <path to CKAN config>) stops with the following message at some point during execution (usually after it has processed fewer than 500 datasets; 347 last time):

kombu.exceptions.InconsistencyError:
Cannot route message for exchange 'bulk': Table empty or key no longer exists.
Probably the key ('_kombu.binding.bulk') has been removed from the Redis database.

I can't find anything related on the Internet and I'm a little bit lost here. I'm running:

  • ckan 2.5.2
  • redis 2.10.1
  • celery 3.1.25

Thank you in advance.

resource_cache question

Kind of a newbie question, forgive me... but I installed the extension, and when I try to archive a dataset I get an [Errno 2] No such file or directory error.
Could you please elaborate on the setup process?
Do I manually create the resource_cache folder? Do I put it under Apache (/var/www/html),
under nginx (/usr/share/nginx/html), or somewhere in /usr/lib/ckan/default?

Do I need to run chown-chmod first?
Thanks.

Improve/start versioning

Define 2.0.0 as a first version and start creating new versions after fixes and improvements. We have been using this version number for 4 years.

Tasks

  • Start using GitHub tags for each version.
  • Start a changelog file with info about CKAN version covered + bugs + fixes

Kombu==2.1.3 calls non-existing msgpack method

Source install of ckan 2.3a, ckanext-spatial, ckanext-archiver and ckanext-qa on Ubuntu 14.04 here.

I just noticed (and posted to ckan-dev) that the kombu version (2.1.3, as per the plugin requirements) tries to import msgpack (which is not a requirement of the involved packages and therefore needs to be installed manually) and throws an Internal Server Error after creating a resource, when ckanext-qa sends a job to celeryd using a kombu-serialised msgpack message (if I understood that correctly).

A hackaround was to pip install u-msgpack-python into CKAN's virtualenv, change line 314 of the installed kombu serializer at /var/lib/ckan/default/lib/python2.7/site-packages/kombu/serialization.py
from registry.register('msgpack', msgpack.packs, msgpack.unpacks, to registry.register('msgpack', msgpack.packb, msgpack.unpackb, and restart the web server.

Using the latest celery (presumably pulling in a more recent kombu) in requirements.txt wrecked other things but might be the way to a more permanent fix.
No pull request submitted as the problem isn't with ckanext-archiver code.

Double Encoding of URLs

Issue Description:

Currently, the CKAN database stores resource URLs in percent-encoded form. When the archiver extension attempts to download a file using these URLs, there's an unintended double percent-encoding that can lead to download errors.

The problematic code responsible for this issue is located in the tidy_url() function within the tasks.py file:

# ckanext/archiver/tasks.py:653

# Find out if it has unicode characters, and if it does, quote them
# so we are left with an ASCII string
try:
    url = url.decode('ascii')
except Exception:
    parts = list(urlparse(url))
    parts[2] = quote(parts[2].encode('utf-8'))
    url = urlunparse(parts)
url = str(url)

In Python 3 the URL is already a str, so url.decode('ascii') raises an AttributeError; the except block therefore always runs and quote() re-encodes the already percent-encoded path, producing the double encoding. The entire try-except block for ASCII handling can be safely removed in Python 3 environments.

Furthermore, it's crucial to prioritize upgrading the archiver extension to Python 3 as soon as possible to eliminate other py2 issues and ensure compatibility.
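
A minimal demonstration of the effect on Python 3 (the URL below is a made-up, already percent-encoded example):

from urllib.parse import quote, urlparse, urlunparse

url = "http://example.org/files/kwarta%C5%82.csv"  # already percent-encoded

# str has no .decode() in Python 3, so the except branch always runs:
parts = list(urlparse(url))
parts[2] = quote(parts[2].encode("utf-8"))  # re-quotes the existing '%' signs
print(urlunparse(parts))
# -> http://example.org/files/kwarta%25C5%2582.csv  (double-encoded, likely a 404)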

Moving to new background job system introduced in CKAN 2.7

CKAN 2.7 has introduced a new background job system.
Before version 2.7 (starting from 1.5), CKAN offered a different background job system built around Celery. That system is still available but deprecated and will be removed in future versions of CKAN.

Background task (2.7): http://docs.ckan.org/en/latest/maintaining/background-tasks.html
Migration guide: http://docs.ckan.org/en/latest/maintaining/background-tasks.html#background-jobs-migration

This relates to ckan/ckanext-qa#52 (Upgrade Celery to 3.x), as the new jobs are based on RQ (Redis). It seems that, instead of upgrading Celery, it would be wise to move forward and use the new CKAN job system. @davidread, any thoughts on that?

settings from global CKAN configuration files not taken into account

I noticed that settings from the main CKAN configuration file (e.g. a development.ini file passed with -c to paster commands) are not taken into account. I'm not sure this is true for all options, but it is at least for ckanext-archiver.archive_dir. This seems to be because these options' values are read from the default_settings.py file, which, if imported before pylons.config gets initialized, does not pick up any values from the CKAN configuration file. For instance, the tasks.py module imports this default_settings module while the configuration is still empty; the configuration is populated once the load_config function is called, but that is too late, since load_config runs after default_settings has already been imported.
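
A possible workaround (a sketch; the function name and fallback value are made up) is to read the option lazily at call time, after the CKAN config has been loaded, rather than copying it from default_settings at import time:

from pylons import config  # already initialised by the time a task actually runs

def archive_dir():
    # Looked up when needed, so the value set in the CKAN config file
    # (ckanext-archiver.archive_dir) is picked up correctly.
    return config.get('ckanext-archiver.archive_dir') or '/tmp/archive'  # fallback for illustration only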

archiver doesn't cache resources

I'm running CKAN 2.8 on Apache 2. I have installed the archiver following the instructions and configured job workers to run the queues in the background.

paster --plugin=ckanext-archiver celeryd2 run priority -c production.ini
paster --plugin=ckanext-archiver celeryd2 run bulk -c production.ini

when I run paster --plugin=ckanext-archiver archiver update --queue=priority -c <path to CKAN config> it shows me this result:

2019-06-24 06:55:13,826 INFO  [ckanext.archiver.commands] Queuing dataset zonas-para-perros (1 resources)
2019-06-24 06:55:13,828 INFO  [ckan.lib.jobs] Added background job 25715cff-21c7-4a54-a32d-7fd5b24aae70 to queue "bulk"
2019-06-24 06:55:13,828 DEBUG [ckanext.archiver.lib] Archival of package put into celery queue bulk: zonas-para-perros
2019-06-24 06:55:13,932 INFO  [ckanext.archiver.commands] Queuing dataset zonas-verdes (1 resources)
2019-06-24 06:55:13,934 INFO  [ckan.lib.jobs] Added background job 846fcaf5-8754-4b03-9df8-48cfdff8295a to queue "bulk"
2019-06-24 06:55:13,934 DEBUG [ckanext.archiver.lib] Archival of package put into celery queue bulk: zonas-verdes
2019-06-24 06:55:14,046 INFO  [ckanext.archiver.commands] Completed queueing

but when I view the archival information by running paster --plugin=ckanext-archiver archiver view --config=production.ini, it shows that 0 resources are archived:

2019-06-24 06:39:13,511 INFO  [ckanext.geonetwork.harvesters.geonetwork] GeoNetwork harvester: extending ISODocument with TimeInstant
2019-06-24 06:39:13,511 INFO  [ckanext.geonetwork.harvesters.geonetwork] GeoNetwork harvester: adding old GML URI
2019-06-24 06:39:13,511 INFO  [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-begin
2019-06-24 06:39:13,512 INFO  [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-begin
2019-06-24 06:39:13,512 INFO  [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-end
2019-06-24 06:39:13,512 INFO  [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-end
2019-06-24 06:39:13,512 INFO  [ckanext.geonetwork.harvesters.geonetwork] Added old URI for gml to temporal-extent-instant
2019-06-24 06:39:13,936 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2019-06-24 06:39:13,951 DEBUG [ckanext.harvest.model] Harvest tables already exist
2019-06-24 06:39:13,978 DEBUG [ckanext.spatial.plugin] Setting up the spatial model
2019-06-24 06:39:13,999 DEBUG [ckanext.spatial.model.package_extent] Spatial tables defined in memory
2019-06-24 06:39:14,005 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
2019-06-24 06:39:14,433 DEBUG [ckanext.harvest.model] Harvest tables already exist
2019-06-24 06:39:14,460 DEBUG [ckanext.spatial.plugin] Setting up the spatial model
2019-06-24 06:39:14,465 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
Resources: 2029 total
Archived resources: 0 total
                    0 with cache_url
Latest archival: (no)

I need to archive the resources and make information appear on the resource page.

Celery on production guidelines - broken links; How to setup two queues with supervisor?

In README.md these links are broken:

I've checked the following links:

but none of them specifies how to set up two queues with supervisor. @davidread, do you have any working example of such a supervisor config file?

BTW: Warning: in 2.7 a new background job system has been introduced. As such, the archiver probably won't work with newer CKAN. See #42.

Resource URL when CKAN is in subpath

Hi,

We've installed CKAN 2.5 and we're trying this extension. Our CKAN site is in a subpath.

When we run the Archiver update command, the resource's path changes (the subpath gets removed). As a result, the Archiver detects this resource as a broken link.

Our configuration:

ckan.site_url = http://ourcatalog.org/data
ckanext-archiver.archive_dir = /var/www/html/cache-datasets
ckanext-archiver.cache_url_root = http://ourcatalog.org/data/resource_cache
ckanext-archiver.max_content_length = 50000000
ckanext-archiver.user_agent_string = "Our Catalog (CKAN)"

Any ideas?

Thank you!
