cernopendata / cernopendata-client
CERN Open Data command-line client
Home Page: http://cernopendata-client.readthedocs.io/
License: GNU General Public License v3.0
This works:
$ cernopendata-client get-record --recid 4910
This does not:
$ cernopendata-client get-record --doi '10.7483/OPENDATA.LHCB.N75T.TJPE'
Traceback (most recent call last):
File "/home/simko/.virtualenvs/cernopendata/bin/cernopendata-client", line 8, in <module>
sys.exit(cernopendata_client())
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/click/core.py", line 1135, in invoke
sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/click/core.py", line 641, in make_context
self.parse_args(ctx, args)
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/click/core.py", line 940, in parse_args
value, args = param.handle_parse_result(ctx, opts, args)
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/click/core.py", line 1476, in handle_parse_result
value = invoke_param_callback(
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/click/core.py", line 96, in invoke_param_callback
return callback(ctx, param, value)
File "/home/simko/.virtualenvs/cernopendata/lib/python3.8/site-packages/cernopendata_client/cli.py", line 31, in ensure_positive_int
if value < 0:
TypeError: '<' not supported between instances of 'NoneType' and 'int'
To be fixed.
Goal: Implement get-record
CLI function to fetch some wanted record metadata.
Inputs: --recid or --doi or --title (useful for CMS datasets). One of them should be required.
Outputs: Full JSON of the bibliographic record.
Optionally, if some CLI switch is used, output only that JSON subtree.
Examples:
$ cernopendata-client get-record --recid 14
$ cernopendata-client get-record --doi 10.7483/OPENDATA.ATLAS.AHKR.A3TA
$ cernopendata-client get-record --recid 14 --output-fields title,date_created
$ cernopendata-client get-record --title '/Mu/Run2010B-v1/RAW' --output-fields recid
Exit status: 0 if OK, 1 if more than one record was matched. (Should not happen.)
Currently, we output only file locations when users use the get-file-locations command:
$ cernopendata-client get-file-locations --recid 1 | head -3
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/00E16FBB-9071-E011-83D3-003048673F12.root
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0248915F-EE71-E011-8894-0025902009E8.root
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0268F635-B671-E011-9090-002481E14E00.root
This does not inform users about the size or the checksum of what is going to be downloaded.
This information could be useful to estimate the download times and/or to plug this into automated scripts the users may have.
In order to improve the situation, we can introduce a new option --verbose (or perhaps two new options --include-size and --include-checksum?) that would also print out the file size and the checksum information.
Example:
$ cernopendata-client get-file-locations --recid 1 --verbose | head -3
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/00E16FBB-9071-E011-83D3-003048673F12.root 123234234234 adler32:aaaaa
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0248915F-EE71-E011-8894-0025902009E8.root 234234324 adler32:bbbbb
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0268F635-B671-E011-9090-002481E14E00.root 67676767676 adler32:ccccc
I.e. the output would be a space-separated URI SIZE CHECKSUM triple.
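A minimal sketch of how such lines could be produced, assuming each file entry is available as a dict with uri, size and checksum keys. The helper name format_file_line, the entry shape and the sample URI are made up for illustration and are not taken from the client's code.

```python
def format_file_line(entry, verbose=False):
    """Return one output line for a file entry.

    `entry` is assumed (hypothetically) to be a dict with
    "uri", "size" and "checksum" keys.
    """
    if not verbose:
        return entry["uri"]
    # Space-separated URI SIZE CHECKSUM fields on a single line.
    return "{} {} {}".format(entry["uri"], entry["size"], entry["checksum"])


# Placeholder entry, not a real record file.
entry = {
    "uri": "http://opendata.cern.ch/eos/opendata/cms/example.root",
    "size": 1234,
    "checksum": "adler32:ff63668a",
}
print(format_file_line(entry))
print(format_file_line(entry, verbose=True))
```

Keeping the three fields space-separated means scripts can split each line with standard tools such as awk or cut.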
CC @jbenito3
Looking for records by title sometimes does not work due to returning more than one record:
$ cernopendata-client get-record --title '/BTau/Run2010B-Apr21ReReco-v1/AOD'
More than one record fit this title. This should not happen.
See two records: http://opendata.cern.ch/search?page=1&size=20&q=title:%22%2FBTau%2FRun2010B-Apr21ReReco-v1%2FAOD%22
However, there is precisely one dataset record matching the given title exactly, record ID 1, which should be returned...
We could look for an "exact" string match instead of the "substring" match, as it were, to fix this problem. Note that fixing this issue may require amending Elasticsearch settings on the server side.
Originally posted by @tiborsimko in #19 (comment)
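Until the server-side search is exact, one possible client-side workaround is to filter the returned hits for an exact title match. The sketch below assumes the hits are a list of record dicts with a metadata/title entry, as in the /api/records output; the function name pick_exact_title and the second sample hit are invented for illustration.

```python
def pick_exact_title(hits, title):
    """Return the single hit whose title equals `title` exactly.

    `hits` is assumed to be a list of dicts carrying
    hit["metadata"]["title"]; anything else is out of scope here.
    """
    exact = [hit for hit in hits if hit["metadata"]["title"] == title]
    if len(exact) != 1:
        raise ValueError(
            "Expected exactly one record titled {!r}, found {}".format(
                title, len(exact)
            )
        )
    return exact[0]


# Invented sample data: the second title only contains the first one.
hits = [
    {"id": 1, "metadata": {"title": "/BTau/Run2010B-Apr21ReReco-v1/AOD"}},
    {"id": 2, "metadata": {"title": "/BTau/Run2010B-Apr21ReReco-v1/AOD extra"}},
]
print(pick_exact_title(hits, "/BTau/Run2010B-Apr21ReReco-v1/AOD")["id"])
```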
Following up on what we did in @reanahub, adopt the black code formatter.
Amend codebase.
Amend run-tests.sh to check for black compliance.
Remove isort testing.
$ cernopendata-client download-files --recid 3005 --verify --protocol https
==> Downloading file 1 of 1
-> File: ./3005/0d0714743f0204ed3c0144941e6ce248.configFile.py
Traceback (most recent call last):
File "/home/simko/.virtualenvs/cernopendata-client/bin/cernopendata-client", line 8, in <module>
sys.exit(cernopendata_client())
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/cli.py", line 291, in download_files
download_single_file(path=path, file_location=file_location, protocol=protocol)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/downloader.py", line 68, in download_single_file
c.perform()
pycurl.error: (1, 'Protocol "root" not supported or disabled in libcurl')
Discussion on MM:
This is for big dataset records, and we always have a record with both TXT and JSON files. (In the past we had a mixture, so that some records did not yet have a JSON file, but nowadays all do.)
The codebase should be ready to assume that the JSON file might be missing, but actually we don't have any "live" test records... We would have to use mock and mock the REST API reply to have a test case. Which might be a good thing to do to catch all those various non-live corner cases that the client should be ready to deal nicely with...
I'm trying to understand how the search from the title is supposed to work.
Is there a way to also specify (part of) the additional_title key? If not, the search seems to often just throw "More than one record fit this title. This should not happen."
This is what I'm trying:
from cernopendata_client import searcher
SERVER_HTTP_URI = "http://opendata.cern.ch"
# Check if record with the given recid exists
searcher.verify_recid(server=SERVER_HTTP_URI, recid=1)
metadata_from_recid = searcher.get_record_as_json(server=SERVER_HTTP_URI, recid=1)
metadata_from_doi = searcher.get_record_as_json(server=SERVER_HTTP_URI, doi=metadata_from_recid["metadata"]["doi"])
metadata_from_title = searcher.get_record_as_json(server=SERVER_HTTP_URI, title=metadata_from_recid["metadata"]["title"])
print(metadata_from_recid == metadata_from_doi == metadata_from_title)
Following up #79, it would be good to do the following:
1) Mention in https://cernopendata-client.readthedocs.io/en/latest/installation.html that people can specify two installation options, pycurl and xrootd, with a brief description of what each one means.
2) Currently, we actually use the xroot name in setup.py, but I think we should use the xrootd name because this is the official protocol name (XRootD).
"xroot": ["xrootdpyfs>=0.2"],
3) Amend --protocol root to use --protocol xrootd, so that we say "xrootd" everywhere, both in the code and the documentation.
Environment: Python3.8, pip 19.2.3
Running pip install cernopendata-client gives:
ERROR: Command errored out with exit status 1:
command: /home/avivace/.local/share/virtualenvs/bagit-create-tuzAGD-s/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-he1mpu_d/pycurl/setup.py'"'"'; __file__='"'"'/tmp/pip-install-he1mpu_d/pycurl/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-ej5txkkk
cwd: /tmp/pip-install-he1mpu_d/pycurl/
Complete output (22 lines):
Traceback (most recent call last):
File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 236, in configure_unix
p = subprocess.Popen((self.curl_config(), '--version'),
File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'curl-config'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 988, in <module>
ext = get_extension(sys.argv, split_extension_source=split_extension_source)
File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 649, in get_extension
ext_config = ExtensionConfiguration(argv)
File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 101, in __init__
self.configure()
File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 241, in configure_unix
raise ConfigurationError(msg)
__main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory: 'curl-config'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Maybe a paragraph in the README could also explain how to manually install this, not using pip?
In addition to the http protocol implemented in #22, we should add support for the xrootd protocol, which provides more bandwidth.
Check whether the client system contains xrootd commands (xrdcp) or Python libraries (xrootd).
If yes, enable support for --protocol root.
If not, print a help message explaining what people have to do to install them.
This is not an MVP functionality for the first public release; it can be added later.
The current test suite only has one example for testing version import, see tests/test_version.py.
We should start introducing tests as the feature set grows, see #19 (comment)
The tests to introduce are of diverse nature:
"true unit tests" for helper functions such as validate_recid() and validate_server();
"integration tests" or "regression tests" for CLI commands such as cernopendata-client get-file-locations --title '/DoubleElectron/Run2012B-v1/RAW'
For inspiration, see the reana-client or reana-dev test suite.
RTFD is activated: https://cernopendata-client.readthedocs.io/en/latest/index.html
Amend structure.
Amend README.
Move old help elsewhere.
Test issue for github mm integration. :)
This works:
$ cernopendata-client get-file-locations --recid 282 --protocol xrootd | head -3
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/00000/2A227E10-C949-E311-B033-003048FEAF50.root
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/10000/0002E1AB-5A40-E311-9B07-0025901AF6E6.root
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/10000/02C84E11-6040-E311-8B9A-C86000151BEC.root
This does not:
$ cernopendata-client get-file-locations --recid 282 --protocol xrootd --verbose | head -3
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/file-indexes/CMS_Run2011B_Photon_AOD_12Oct2013-v1_00000_file_index.json 269 sha1:5800af77c12d31bb76ef138d0b68cd6901facd9a
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/file-indexes/CMS_Run2011B_Photon_AOD_12Oct2013-v1_10000_file_index.json 18615 sha1:5d91605740704cfba546e4766c67c8733d8de0c5
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/file-indexes/CMS_Run2011B_Photon_AOD_12Oct2013-v1_20000_file_index.json 266736 sha1:70410acbcc711592115a14866e7485739622d188
In other words, the --expand/--no-expand command-line option is not respected when one uses --verbose.
Current behaviour:
$ cernopendata-client get-record --recid fixme
ERROR: The record id number you supplied is not valid.
404 Client Error: NOT FOUND for url: http://opendata.cern.ch/record/fixme
Expected behaviour:
Just fail after validation (recid should be integer) and do not contact remote server.
P.S. Ditto for --recid values such as -10 or 1.23.
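A sketch of the kind of local validation meant here, written without click so that it stays self-contained. It is not the project's actual validate_recid(); the messages and the choice of ValueError are assumptions. A CLI callback would typically translate the exception into a usage error before any request is sent to the server.

```python
def validate_recid(value):
    """Return `value` as a positive integer or raise ValueError.

    Rejects None as well as inputs such as "fixme", "-10" and "1.23",
    so no request needs to be sent for obviously invalid record IDs.
    """
    if value is None:
        raise ValueError("Please provide a record ID.")
    try:
        recid = int(str(value))
    except ValueError:
        raise ValueError("Record ID must be an integer, got {!r}.".format(value))
    if recid <= 0:
        raise ValueError("Record ID must be positive, got {}.".format(recid))
    return recid


print(validate_recid("4910"))
```

Handling None explicitly also avoids comparing None with an integer, which is what produces the TypeError tracebacks seen when an option is omitted.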
Later improvement: instead of having commands written twice, we could define shell functions like:
check_black () {
black --check .
}
so that we can simply call check_black in appropriate places here...
In this way the commands to run will be defined in a single place (inside the shell function).
Originally posted by @tiborsimko in #36 (comment)
Current behaviour:
$ cernopendata-client get-record --title fixme | head -10
Record with given title does not exist.
{
"created": "2019-07-18T04:51:25.556890+00:00",
"id": 1,
"links": {
"bucket": "http://opendata.cern.ch/api/files/5266a82b-96f1-43a8-874e-b51c1c87d43c",
"self": "http://opendata.cern.ch/api/records/1"
},
Expected behaviour:
Should just fail and not fall back to recid=1 record and its JSON.
One complexity for the download-files command is that some records, such as recid 1, have only index files listed. These index files contain locations to actual data files. Other records, such as recid 5500, have actual data files directly attached.
This difference exists because of large experimental AOD/AODSIM datasets which can consist of 10,000 files, and it was not possible to store these in Invenio 3 JSON at reasonable performance, see cernopendata/opendata.cern.ch#1562
This nuance exists already for the get-file-locations command, where it was solved in this way: the command returns the list of actual data file locations, unless the option --no-expand is specified (which would return the index files only). Compare:
$ cernopendata-client get-file-locations --recid 1 --protocol http | wc -l
2916
$ cernopendata-client get-file-locations --recid 1 --protocol http --no-expand | wc -l
12
The goal of this issue is to make sure the download-files command behaves the same:
it should support the --no-expand option so that records such as 1 will also work (similarly to get-file-locations)
If a user wishes to download files belonging to a record, the current technique is to list file locations:
$ cernopendata-client get-file-locations --recid 5500 --protocol http
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/BuildFile.xml
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/HiggsDemoAnalyzer.cc
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/List_indexfile.txt
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall.cc
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall_lvl3.cc
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3MC.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3data.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4MC.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4data.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.pdf
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.png
and then launch wget or curl commands to download them.
The goal of this issue is to simplify this process by introducing a new command download-files that would do this for the user.
Possible options:
$ cernopendata-client download-files --recid 5500 --protocol http --parallel-processes 2
This would launch two parallel downloading processes, using a suitable Python library, to download the files into current directory.
P.S.: MVP is simply to download files; resuming interrupted downloads will be part of another issue, but it is good to think about this functionality upfront.
P.S. An option --target-directory could be introduced which would recreate the directory structure known from the original record. This will be important for AOD files which have subdirectory structure such as this one. So the corresponding subdirectories would have to be created in the target directory.
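One way the subdirectory structure could be preserved is to map each file location to a path relative to the common /eos/opendata/ prefix. This is only a sketch: the helper name local_path_for and the choice of prefix are assumptions, and the caller would still have to create the parent directories (for example with os.makedirs) before downloading.

```python
import posixpath

# Assumed common prefix of all file locations on EOSPUBLIC.
EOS_PREFIX = "/eos/opendata/"


def local_path_for(uri, target_directory):
    """Map a file URI to a path under `target_directory`, keeping subdirectories."""
    # Keep only the part after the EOS prefix, e.g. "cms/Run2010B/.../0000/x.root".
    _, found, relative = uri.partition(EOS_PREFIX)
    if not found:
        raise ValueError("Unexpected file location: {}".format(uri))
    return posixpath.join(target_directory, relative)


uri = (
    "http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/"
    "Apr21ReReco-v1/0000/00E16FBB-9071-E011-83D3-003048673F12.root"
)
print(local_path_for(uri, "downloads"))
```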
Current behaviour
This is OK:
$ cernopendata-client get-file-locations --recid 5500
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/BuildFile.xml
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/HiggsDemoAnalyzer.cc
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/List_indexfile.txt
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall.cc
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall_lvl3.cc
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3MC.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3data.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4MC.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4data.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.pdf
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.png
This is not:
$ cernopendata-client get-file-locations --recid 14
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0000_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0000_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0001_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0001_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0002_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0002_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0003_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0003_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0004_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0004_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0005_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0005_file_index.txt
Expected behaviour
The location to data files should be returned, i.e. if there are index files in the output then these should be "unwound", i.e. the client should read their content and return what's inside them, as it were.
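The "unwinding" could look roughly like the sketch below. It assumes that each JSON index file is a list of objects with a uri key and that the TXT index files merely duplicate the JSON ones; both are assumptions about the index format. The fetch argument is a caller-supplied function returning the text behind a location, so the sketch stays independent of any HTTP or XRootD library.

```python
import json


def expand_index(index_text):
    """Return the data file URIs listed in one JSON index file.

    Assumption: the index is a JSON list of objects with a "uri" key.
    """
    return [entry["uri"] for entry in json.loads(index_text)]


def expand_locations(locations, fetch):
    """Replace *_file_index.json entries by the files they list.

    `fetch(location)` must return the text content of an index file.
    TXT index files are skipped (assumed to duplicate the JSON ones).
    """
    expanded = []
    for location in locations:
        if location.endswith("_file_index.json"):
            expanded.extend(expand_index(fetch(location)))
        elif location.endswith("_file_index.txt"):
            continue
        else:
            expanded.append(location)
    return expanded
```

Records such as 5500, whose files are attached directly, pass through unchanged because none of their locations match the index file suffixes.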
Introduce a new command list-directory that would take an EOSPUBLIC path and would output files belonging to this directory and its subdirectories.
Example:
$ cernopendata-client list-directory /eos/opendata/cms/validated-runs/Commissioning10
root://eospublic.cern.ch//eos/opendata/cms/validated-runs/Commissioning10/Commissioning10-May19ReReco_7TeV.json
root://eospublic.cern.ch//eos/opendata/cms/validated-runs/Commissioning10/Commissioning10-May19ReReco_900GeV.json
Beware of several situations:
the path not starting with /eos/opendata/... should be refused as not valid
the path could give many hits, e.g. /eos/opendata/cms would want to list millions of files, so we have to stop it.
The implementation could use xrootdpyfs and a snippet like:
from xrootdpyfs import XRootDPyFS
# List the entries of one EOSPUBLIC directory.
fs = XRootDPyFS("root://eospublic.cern.ch//eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/")
files = fs.listdir()
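The two situations to beware of could be guarded by a small path check before anything is listed. In this sketch the minimum depth of 4 path components is a hypothetical threshold chosen so that /eos/opendata/cms is refused; a real implementation might instead cap the number of listed entries. The function name validate_directory is also an assumption.

```python
import posixpath

ALLOWED_PREFIX = "/eos/opendata/"
MIN_DEPTH = 4  # hypothetical: refuses /eos/opendata/cms, allows deeper paths


def validate_directory(path):
    """Return the normalised path or raise ValueError if it must be refused."""
    normalised = posixpath.normpath(path)
    # Appending "/" lets "/eos/opendata" itself pass the prefix check,
    # so that it is rejected by the depth check below instead.
    if not (normalised + "/").startswith(ALLOWED_PREFIX):
        raise ValueError("Path must start with {}".format(ALLOWED_PREFIX))
    depth = len([part for part in normalised.split("/") if part])
    if depth < MIN_DEPTH:
        raise ValueError("Path is too broad to list: {}".format(normalised))
    return normalised


print(validate_directory("/eos/opendata/cms/validated-runs/Commissioning10"))
```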
/opendata_analysis_query.py contains old code that is not used anymore.
Check functionality and port the necessary parts to the new structure.
Remove the file and its documentation afterwards.
Stemming from #8, we should also fix the behaviour of --output-fields. This currently allows selecting only metadata, created, ... and other top-level fields from the /api/records output. This is not very useful.
While the power users can do things like:
$ cernopendata-client get-metadata --recid 1 | jq -S '.metadata.title'
"/BTau/Run2010B-Apr21ReReco-v1/AOD"
$ cernopendata-client get-metadata --recid 1 | jq -S '.metadata.system_details.global_tag'
"FT_R_42_V10A::All"
it would be useful to amend --output-fields to allow searching inside the metadata section, so that users could write instead:
$ cernopendata-client get-metadata --recid 1 --output-fields title
/BTau/Run2010B-Apr21ReReco-v1/AOD
$ cernopendata-client get-metadata --recid 1 --output-fields system_details.global_tag
FT_R_42_V10A::All
IOW, --output-fields should be able to print out a wanted JSON path value directly.
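A minimal sketch of such a dotted-path lookup, assuming the record JSON has the metadata section shown in the jq examples. The helper name get_field is made up for illustration.

```python
def get_field(record, path):
    """Return the value at a dotted `path` inside record["metadata"]."""
    value = record["metadata"]
    for key in path.split("."):
        # Stop early with a clear error if the path leaves the JSON tree.
        if not isinstance(value, dict) or key not in value:
            raise KeyError("Field {!r} not found in metadata".format(path))
        value = value[key]
    return value


record = {
    "metadata": {
        "title": "/BTau/Run2010B-Apr21ReReco-v1/AOD",
        "system_details": {"global_tag": "FT_R_42_V10A::All"},
    }
}
print(get_field(record, "title"))
print(get_field(record, "system_details.global_tag"))
```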
Instead of having print() and click.echo() hard-coded throughout the code base, let's introduce a new printer.py module that can take care of displaying info/error messages in a unified manner.
Example: display_message() function in reana-dev.
We could then print success in green, progress in yellow, errors in red, etc.
The code base should then be amended to use the new display_message() function instead of printing or echoing.
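As a rough illustration of the idea (not the reana-dev implementation, whose signature may differ), a display_message() helper could be built on plain ANSI colour codes. The msg_type names and the split into format_message() and display_message() are assumptions.

```python
import sys

# ANSI colour codes: success in green, progress in yellow, errors in red.
COLOURS = {"success": "\033[32m", "progress": "\033[33m", "error": "\033[31m"}
RESET = "\033[0m"


def format_message(msg, msg_type="info"):
    """Return `msg` wrapped in the colour of its type (plain for other types)."""
    colour = COLOURS.get(msg_type)
    return "{}{}{}".format(colour, msg, RESET) if colour else msg


def display_message(msg, msg_type="info"):
    """Print `msg`; errors go to stderr, everything else to stdout."""
    stream = sys.stderr if msg_type == "error" else sys.stdout
    print(format_message(msg, msg_type), file=stream)


display_message("==> Success!", msg_type="success")
```

Separating formatting from printing keeps the colouring logic easy to unit-test without capturing output.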
$ cernopendata-client get-metadata --doi 10.7483/OPENDATA.CMS.A342.9982
...
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Currently, the files are being verified at the end of a download:
$ cernopendata-client download-files --recid 5500 --filter-range 1-3 --verify
==> Downloading file 1 of 3
-> File: ./5500/BuildFile.xml
-> Progress: 0/0 kiB (100%)
==> Downloading file 2 of 3
-> File: ./5500/HiggsDemoAnalyzer.cc
-> Progress: 81/81 kiB (100%)
==> Downloading file 3 of 3
-> File: ./5500/List_indexfile.txt
-> Progress: 1/1 kiB (100%)
==> Verifying downloaded files for record 5500.
==> Verifying file BuildFile.xml...
-> expected size 305, found 305
-> expected checksum adler32:ff63668a, found adler32:ff63668a
==> Verifying file HiggsDemoAnalyzer.cc...
-> expected size 83761, found 83761
-> expected checksum adler32:f205f068, found adler32:f205f068
==> Verifying file List_indexfile.txt...
-> expected size 1669, found 1669
-> expected checksum adler32:46a907fc, found adler32:46a907fc
==> Success!
It would be good to improve this so that each file is verified immediately after it is downloaded.
This is so that the user can spot troubles early, not only after downloading all the files.
Originally posted by @tiborsimko in #58 (comment)
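A per-file check of this kind can be sketched with the standard library alone, using zlib.adler32 and the "adler32:xxxxxxxx" notation seen in the output above. The function names are assumptions; the download loop would call verify_file() right after each file is written and stop or retry on the first mismatch.

```python
import os
import zlib


def adler32_of(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file in "adler32:xxxxxxxx" form."""
    value = 1  # adler32 starting value
    with open(path, "rb") as stream:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "adler32:{:08x}".format(value & 0xFFFFFFFF)


def verify_file(path, expected_size, expected_checksum):
    """Return True if the file on disk matches the expected size and checksum."""
    if os.path.getsize(path) != expected_size:
        return False
    return adler32_of(path) == expected_checksum
```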
Currently the cernopendata-client is querying the opendata.cern.ch instance, and this is hard-coded:
$ rg opendata.cern.ch
cernopendata_client/cli.py
24:SEARCH_URL = "http://opendata.cern.ch/api/records/"
103: "root://eospublic.cern.ch/", "http://opendata.cern.ch"
117: f.replace("root://eospublic.cern.ch/", "http://opendata.cern.ch")
It would be good to allow querying the opendata-qa.cern.ch or opendata-dev.cern.ch instances, both for testing purposes and when working with open data releases that take a long time to "mature" on the DEV or QA instances before making it to the PROD instance.
We could solve it by adding a new command-line option (--server <url>) to each CLI command, where users could pass a wanted server instance, for example:
$ cernopendata-client get-file-locations --recid 15007 --protocol http --server http://opendata-dev.cern.ch
If the option is not used, the client would connect by default to http://opendata.cern.ch.
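The hard-coded replacement shown in the rg output could be parameterised along these lines. This is a sketch only: the function name to_http_location and the constant names are assumptions, while the default server is the one named above.

```python
EOS_ROOT = "root://eospublic.cern.ch/"
DEFAULT_SERVER = "http://opendata.cern.ch"


def to_http_location(location, server=DEFAULT_SERVER):
    """Rewrite an XRootD file location so that it points at `server`."""
    if not location.startswith(EOS_ROOT):
        return location
    # "root://eospublic.cern.ch//eos/..." becomes "<server>/eos/...".
    return server.rstrip("/") + location[len(EOS_ROOT):]


location = (
    "root://eospublic.cern.ch//eos/opendata/cms/software/"
    "HiggsExample20112012/BuildFile.xml"
)
print(to_http_location(location))
print(to_http_location(location, server="http://opendata-dev.cern.ch"))
```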
(1) The test suite for the filtering functionality (see test_download_files_filter_*()) currently uses the test record 3005. However, this record contains only a single file, so it cannot test various scenarios (such as getting files 1-3).
It would therefore be good to amend these tests to use record 5500, which has several input files to choose from. The tests could also exercise exactly the same examples that we promise in the docstring when somebody runs cernopendata-client download-files --help.
That said, the filters work fine; this issue is only about enriching the tests so that they have more files to filter from and cover more usage scenarios.
(2) One could also add three more simple tests: (2a) filtering for a non-existing file name (--filter-name notexisting) should return an error message, (2b) the same for regexp, and (2c) the same for a wrong range (--filter-range 0 and --filter-range 99998-99999). Albeit the last one is kind of tested elsewhere in test_validate_range() as well.
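To make the wrong-range cases concrete, here is a sketch of a range parser that such tests could exercise. It is not the project's validate_range(); the name parse_range, the 1-based "i-j" convention and the error messages are assumptions based on the examples above.

```python
def parse_range(text, count):
    """Return (first, last) 1-based file indexes parsed from "i-j".

    `count` is the number of files available in the record.
    """
    try:
        first, last = (int(part) for part in text.split("-"))
    except ValueError:
        raise ValueError("Range should have the form i-j, got {!r}".format(text))
    if first < 1 or last < first:
        raise ValueError("Range should start at 1 or higher and not be reversed")
    if last > count:
        raise ValueError(
            "Range end {} exceeds the {} available files".format(last, count)
        )
    return first, last


print(parse_range("1-3", 11))
```

With record 5500 and its 11 files, "1-3" is accepted, while "0" and "99998-99999" are both rejected.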
Current status:
$ head -1 .travis.yml
# TODO: Add License header
Expected:
"Minified" license snippet.
The code coverage looks great. We can still write more tests for not-tested else branches and stuff. No priority, something to be done on the side when time permits.
$ python setup.py test
...
----------- coverage: platform linux, python 3.8.6-final-0 -----------
Name                                Stmts   Miss  Cover   Missing
-----------------------------------------------------------------
cernopendata_client/__init__.py         4      0   100%
cernopendata_client/cli.py            137     16    88%   48, 54, 105-110, 112, 232-237, 282-285, 300-304, 342-349
cernopendata_client/config.py           7      0   100%
cernopendata_client/downloader.py      50      0   100%
cernopendata_client/printer.py         17      1    94%   37
cernopendata_client/searcher.py       104     22    79%   24-26, 41-46, 55-64, 98, 111-112, 127-133, 156, 203-207, 209
cernopendata_client/utils.py           11      0   100%
cernopendata_client/validator.py       33      2    94%   82-89
cernopendata_client/verifier.py        40      0   100%
cernopendata_client/version.py          3      0   100%
-----------------------------------------------------------------
TOTAL                                 406     41    90%
We are deploying support for HTTPS for the CERN Open Data portal, e.g. see https://opendata-dev.cern.ch
It would therefore be good to allow the https protocol everywhere, so that we offer three protocols: http, https, and xrootd.
We should amend the checker etc. and also mention this in the documentation.
Report nicer error messages instead of Python tracebacks, for example two situations:
(1) download-files
$ cernopendata-client download-files --recid 5500 --server foo
Traceback (most recent call last):
File "/home/simko/.virtualenvs/cernopendata-client/bin/cernopendata-client", line 8, in <module>
sys.exit(cernopendata_client())
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/cli.py", line 235, in download_files
record_json = get_record_as_json(server, recid, doi, title)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/searcher.py", line 167, in get_record_as_json
record_id = verify_recid(server=server, recid=record_id)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/searcher.py", line 49, in verify_recid
input_record_url_check = requests.get(input_record_url)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/sessions.py", line 519, in request
prep = self.prepare_request(req)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/sessions.py", line 452, in prepare_request
p.prepare(
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/models.py", line 313, in prepare
self.prepare_url(url, params)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/models.py", line 387, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'foo/record/5500': No schema supplied. Perhaps you meant http://foo/record/5500?
(2) verify-files
$ cernopendata-client verify-files
Traceback (most recent call last):
File "/home/simko/.virtualenvs/cernopendata-client/bin/cernopendata-client", line 8, in <module>
sys.exit(cernopendata_client())
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/cli.py", line 323, in verify_files
validate_recid(recid)
File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/validator.py", line 23, in validate_recid
if recid <= 0:
TypeError: '<=' not supported between instances of 'NoneType' and 'int'
Note that some commands are OK:
$ cernopendata-client get-file-locations
ERROR: Please provide at least one of following arguments: (recid, doi, title)
$ cernopendata-client get-file-locations --server foo
Usage: cernopendata-client get-file-locations [OPTIONS]
Error: Invalid value for '--server': Server should be a valid URL
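The TypeError in the verify-files traceback above comes from comparing None with an integer when --recid is not given. A minimal sketch of a guard, assuming a validator that receives the raw option value; the messages and the exit code 2 are illustrative assumptions, not the client's actual behaviour:

```python
import sys


def validate_recid(recid=None):
    """Return True for a positive integer recid, otherwise exit with an error."""
    # Check for a missing value first, so that None is never compared to an int.
    if recid is None:
        print("ERROR: Please provide a record ID (--recid).", file=sys.stderr)
        sys.exit(2)  # hypothetical exit code
    if recid <= 0:
        print("ERROR: Record ID must be a positive integer.", file=sys.stderr)
        sys.exit(2)  # hypothetical exit code
    return True
```

With such a guard, running the command without --recid would end with a one-line error rather than a traceback.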
Now that we are catching download errors via the --retry-limit and --retry-sleep parameters thanks to #91, we can build on this basis and enrich the error handling logic to catch more errors (such as an Error 500 received from the server).
Example of how to reproduce:
$ cd opendata.cern.ch
$ docker-compose build
$ docker-compose up
$ docker exec -i -t opendatacernch_web_1 ./scripts/populate-instance.sh --skip-records
$ docker exec -i -t opendatacernch_web_1 cernopendata fixtures records --mode insert-or-replace -f cernopendata/modules/fixtures/data/records/cms-tools-higgsexample20112012.json
$ firefox http://localhost/record/5500
$ cernopendata-client download-files --recid 5500 --verify --server http://localhost --retry-sleep 30
The current behaviour of the client is to output a Python traceback:
==> Verifying file demoanalyzer_cfg_level4data.py...
-> Expected size 3821, found 3821
-> Expected checksum adler32:177b49c0, found adler32:177b49c0
==> Downloading file 10 of 11
-> File: ./5500/mass4l_combine.pdf
Traceback (most recent call last):
File ".../bin/cernopendata-client", line 8, in <module>
sys.exit(cernopendata_client())
File ".../lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File ".../lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File ".../lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File ".../lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File ".../lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File ".../lib/python3.8/site-packages/cernopendata_client/cli.py", line 337, in download_files
download_single_file(path=path, file_location=file_location, protocol=protocol)
File ".../lib/python3.8/site-packages/cernopendata_client/downloader.py", line 109, in download_single_file
c.perform()
pycurl.error: (52, 'Empty reply from server')
Expected behaviour is to:
docker-compose up
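One way to avoid the raw traceback would be to wrap the download call in a retry loop that catches the transport error and prints a short message. This is only a sketch: call_with_retries and its retriable parameter are hypothetical names, the retry_limit and retry_sleep arguments mirror the --retry-limit and --retry-sleep options, and in the client the retriable tuple would presumably include pycurl.error as seen in the traceback above.

```python
import time


def call_with_retries(action, retry_limit=10, retry_sleep=5, retriable=(OSError,)):
    """Call action() until it succeeds or retry_limit attempts have failed."""
    for attempt in range(1, retry_limit + 1):
        try:
            return action()
        except retriable as exc:
            print("ERROR: attempt {} of {} failed: {}".format(attempt, retry_limit, exc))
            if attempt == retry_limit:
                # Give up with a non-zero exit status instead of a traceback.
                raise SystemExit(1)
            # Wait before retrying, e.g. to give the server time to come back.
            time.sleep(retry_sleep)
```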
After a user downloads a set of files belonging to a dataset (see #22), it would be useful to provide functionality to check the sizes and adler32 checksums of the downloaded files.
This information is available over the REST API; here is one example.
The user would launch:
$ cernopendata-client verify-files --recid 5500 --directory ./mydata/5550
The command would go through the given directory, calculate size and adler32-checksum of present files, and compare this with the information obtained from the server via the REST API.
The command would report back to user:
The command will exit with different error codes so that users can plug this command into their harvesting workflows.
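The size and checksum comparison described above could be sketched as follows, using zlib.adler32 from the standard library. The function names are hypothetical, and the "adler32:xxxxxxxx" format is assumed from the checksum strings shown in the client output elsewhere in these issues.

```python
import os
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as 'adler32:xxxxxxxx'."""
    value = 1  # adler32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so that large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "adler32:{:08x}".format(value & 0xFFFFFFFF)


def verify_file(path, expected_size, expected_checksum):
    """Return True if the local file matches the expected size and checksum."""
    if os.path.getsize(path) != expected_size:
        return False
    return adler32_of_file(path) == expected_checksum
```

The expected values would come from the record's file metadata obtained via the REST API; the command could then map any mismatch to a non-zero exit code.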
This "internal" issue is about refactoring some functions between cernopendata-client components.
The searcher component basically contacts the CERN Open Data portal and its API. So it is meant as a web client for /api/records and friends.
Currently, the same component also contains the get_list_directory() function, which does not exactly use the above search APIs, but rather works directly with the EOSPUBLIC /eos/opendata filesystem and directory structure.
This issue proposes to separate EOSPUBLIC directory handling functions (that are not using opendata.cern.ch's portal API in any way) into a new component.
We can invent some nice name, e.g. eospublic (but this is not a verb) or walker.py or filer.py (but this may be weird) or xrootdexplorer (for example) or something?
Create the initial package structure as we do e.g. in reana-client so that the client can be released on PyPI.
Set up RTFD documentation, Travis CI, and everything in the usual manner.
(The cernopendata-client CLI schema amendments and enrichments will come later...)
(1) Introduce pydocstyle into run-tests.sh following e.g. the reana-client example.
(2) Fix existing warnings in the master branch:
$ pydocstyle .
./docs/conf.py:1 at module level:
D100: Missing docstring in public module
./cernopendata_client/search.py:1 at module level:
D100: Missing docstring in public module
./cernopendata_client/cli.py:40 in public function `cernopendata_client`:
D103: Missing docstring in public function
./cernopendata_client/cli.py:68 in public function `get_metadata`:
D301: Use r""" if any backslashes in a docstring
./cernopendata_client/cli.py:123 in public function `get_file_locations`:
D301: Use r""" if any backslashes in a docstring
./cernopendata_client/cli.py:180 in public function `download_files`:
D202: No blank lines allowed after function docstring (found 1)
./cernopendata_client/cli.py:180 in public function `download_files`:
D301: Use r""" if any backslashes in a docstring
./cernopendata_client/downloader.py:1 at module level:
D100: Missing docstring in public module
./cernopendata_client/downloader.py:15 in public function `show_download_progress`:
D400: First line should end with a period (not 'e')
./cernopendata_client/downloader.py:28 in public function `download_single_file`:
D400: First line should end with a period (not 'e')
./cernopendata_client/downloader.py:49 in public function `get_download_files_by_name`:
D400: First line should end with a period (not 'e')
./cernopendata_client/downloader.py:49 in public function `get_download_files_by_name`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
./cernopendata_client/downloader.py:57 in public function `get_download_files_by_regexp`:
D205: 1 blank line required between summary line and description (found 0)
./cernopendata_client/downloader.py:57 in public function `get_download_files_by_regexp`:
D400: First line should end with a period (not ')')
./cernopendata_client/downloader.py:57 in public function `get_download_files_by_regexp`:
D403: First word of the first line should be properly capitalized ('Dload', not 'dload')
./cernopendata_client/downloader.py:72 in public function `get_download_files_by_range`:
D205: 1 blank line required between summary line and description (found 0)
./cernopendata_client/downloader.py:72 in public function `get_download_files_by_range`:
D400: First line should end with a period (not ')')
./cernopendata_client/downloader.py:72 in public function `get_download_files_by_range`:
D403: First word of the first line should be properly capitalized ('Dload', not 'dload')
./cernopendata_client/validator.py:1 at module level:
D100: Missing docstring in public module
./cernopendata_client/version.py:9 at module level:
D205: 1 blank line required between summary line and description (found 0)
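For illustration, a docstring that would address the D205, D400, D401, D403 and D301 warnings listed above might look like the sketch below. The function name is taken from the pydocstyle output, but its signature and body here are assumptions made only to keep the example runnable.

```python
import re


def get_download_files_by_regexp(file_locations, pattern):
    r"""Return the file locations that match the given regular expression.

    The summary line is capitalised, in imperative mood, and ends with a
    period. It is followed by one blank line before this description. The
    docstring is a raw string because it contains a backslash, as in the
    example pattern ``\.root$``.
    """
    # Assumed behaviour: keep only the locations that match the pattern.
    return [loc for loc in file_locations if re.search(pattern, loc)]
```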
Use case: look up record by CMS dataset titles:
$ cernopendata-client get-record --title '/BTau/Run2010B-Apr21ReReco-v1/AOD'
This fails in a similar manner to #18, so the fix will be similar.
Note that the same title could appear in two records, for example when the record title is not the dataset name but some free text. We should issue an error when the title lookup returns more than one hit.
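The "more than one hit" check could be sketched like this. The helper name and the shape of the hits list are assumptions; the real search response format is not shown in this issue.

```python
def pick_single_record(hits, title):
    """Return the only record found for a title, or raise ValueError."""
    if not hits:
        raise ValueError("No record found with title '{}'.".format(title))
    if len(hits) > 1:
        # Ambiguous lookup: refuse to guess which record the user meant.
        raise ValueError(
            "Title '{}' matches {} records; please use --recid or --doi.".format(
                title, len(hits)
            )
        )
    return hits[0]
```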
Observation: once a file is downloaded, the progress bar leaves the display slightly broken, see:
$ cernopendata-client download-files --recid 1 --filter-range 1-2
==> Downloading file 1 of 2
==> Downloading file: ./1/0248915F-EE71-E011-8894-0025902009E8.root
==> Downloading file 2 of 2kiB (100%)
==> Downloading file: ./1/0268F635-B671-E011-9090-002481E14E00.root
Downloading: 842237/842237 kiB (100%)
Download completed!
Note the "Downloading file 2 of 2kiB (100%)" part, where some characters remained from the previous progress bar.
Mentioning just for completeness; if you don't have a quick solution, we can also address this cosmetic issue later via a separate issue...
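A common way to avoid such leftovers is to return to the start of the line and pad the new text with spaces up to the length of the previous one. A minimal sketch, with a hypothetical helper name, not tied to how the client currently draws its progress bar:

```python
def render_progress(text, previous_length=0):
    """Return text that overwrites a previous line of the given length."""
    # "\r" moves the cursor to the start of the line; the padding blanks out
    # any characters left over from a longer previous message.
    padding = " " * max(0, previous_length - len(text))
    return "\r" + text + padding
```

The caller would write the returned string to stdout without a newline and remember len(text) for the next update.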
Originally posted by @tiborsimko in #44 (comment)
The magic string opendata.cern.ch is hard-coded in many places in the client.
It would be good to create config.py to centralise the server location.
Example: see reana-client and e.g. the TIMECHECK value set up there.
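A possible shape for such a config.py, as a sketch only: the constant name, the CERNOPENDATA_SERVER environment variable and the helper function are all hypothetical, while the /api/records path is the one that appears in the portal URLs quoted in these issues.

```python
import os

# Single place where the default server location is defined.
# The environment variable name is an assumption for illustration.
SERVER_HTTP_URI = os.environ.get("CERNOPENDATA_SERVER", "http://opendata.cern.ch")


def record_api_url(recid, server=SERVER_HTTP_URI):
    """Return the REST API URL of a record on the given server."""
    return "{}/api/records/{}".format(server.rstrip("/"), recid)
```

Other modules would then import SERVER_HTTP_URI instead of repeating the host name.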
After the addition of pycurl, the ReadTheDocs builds fail with:
FileNotFoundError: [Errno 2] No such file or directory: 'curl-config': 'curl-config'
This is most probably because some system dependencies such as libssl-dev and friends are not installed.
Investigate the best solution to fix the documentation builds.
autodoc_mock_imports?
In --download-files, improve verification and add a verification after each download.
_Originally posted by @tiborsimko in
Due to the special treatment of files, some of the default API output (notably file buckets) is not useful, since the files are being served from EOS.
These should be hidden. We should probably return the content of metadata directly.
It would be good to add shellcheck to test run-tests.sh itself:
run-tests.sh --check-shell
Goal: Get list of files belonging to a dataset.
Complexity: Should resolve index files for datasets such as recid 14. The output should be data files, not data index files. (For these, people could presumably use get-record --output-fields files as part of task #2.)
Inputs: the same as for #2, for example --recid 14, or --doi, or --title.
Outputs: list of dataset locations such as:
root://eospublic.cern.ch//eos/opendata/atlas/MasterclassDatasets/WPath/2014/1/1A.zip
root://eospublic.cern.ch//eos/opendata/atlas/MasterclassDatasets/WPath/2014/1/1B.zip
Options: --protocol root (default), or --protocol http, which should return http://opendata.cern.ch/eos/opendata/... instead of root://... paths.
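The protocol switch could be sketched as a simple prefix rewrite, assuming that every location starts with the root://eospublic.cern.ch//eos/opendata/ prefix shown in the sample output above. The function and constant names are hypothetical.

```python
ROOT_PREFIX = "root://eospublic.cern.ch//eos/opendata/"
HTTP_PREFIX = "http://opendata.cern.ch/eos/opendata/"


def to_protocol(location, protocol="root"):
    """Return a file location expressed in the requested protocol."""
    if protocol == "root":
        return location
    if protocol == "http":
        if not location.startswith(ROOT_PREFIX):
            raise ValueError("Unexpected file location: {}".format(location))
        # Swap the XRootD prefix for the HTTP one and keep the rest of the path.
        return HTTP_PREFIX + location[len(ROOT_PREFIX):]
    raise ValueError("Unsupported protocol: {}".format(protocol))
```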
Record 1 contains 2916 files and is 2.7 TB big.
Chances are people would like to download it in batches.
Currently, cernopendata-client download-files would download everything. We need to introduce finer granularity.
The goal of this issue is to introduce a new option for download-files, called perhaps --file, which would download only one particular file:
$ cernopendata-client download-files --recid 1 --filename 105FD6D0-8B71-E011-9613-00E081791775.root
Alternatively, we could offer regexp-like matching:
$ cernopendata-client download-files --recid 1 --filename '*E011*'
which would download all files matching the given glob expression.
Alternatively, since the file order is perfectly defined in JSON, we can offer downloading by chunks, either given file number N1, or several files from file number N2 to file number N3:
$ cernopendata-client download-files --recid 1 --filenumber 13
$ cernopendata-client download-files --recid 1 --filenumber 20-29
$ cernopendata-client download-files --recid 1 --filenumber 30-39
...
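Both filtering ideas above can be sketched in a few lines. The function names are hypothetical; the glob matching uses fnmatch from the standard library, and the file numbers are assumed to be 1-based and inclusive, as in the 20-29 example.

```python
import fnmatch


def filter_by_name(file_locations, pattern):
    """Return the locations whose file name matches a glob pattern."""
    return [
        loc
        for loc in file_locations
        # Match against the last path component only, case-sensitively.
        if fnmatch.fnmatchcase(loc.rsplit("/", 1)[-1], pattern)
    ]


def filter_by_number(file_locations, spec):
    """Return the locations selected by 'N' or 'N1-N2' (1-based, inclusive)."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    if start < 1 or end < start:
        raise ValueError("Invalid file number range: {}".format(spec))
    return file_locations[start - 1:end]
```

An exact file name is also a valid glob pattern, so a single option could cover both the exact and the wildcard case.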
Current behaviour:
$ cernopendata-client get-record --doi fixme | head -10
Record with given doi does not exist.
{
"created": "2019-07-18T04:51:25.556890+00:00",
"id": 1,
"links": {
"bucket": "http://opendata.cern.ch/api/files/5266a82b-96f1-43a8-874e-b51c1c87d43c",
"self": "http://opendata.cern.ch/api/records/1"
},
...
Expected behaviour:
Should just fail and not print any record=1 JSON afterwards.
Let's enrich the CLI API docs page with all commands and options that would be automatically generated.
I thought of looking into generating the CLI API in another way... E.g. we could run click-sphinx
locally and add the file into `docs` so that ReadTheDocs would simply consume it from
there. (A bit like we are doing with OpenAPI stuff in REANA.) Hence I used `addresses`
and not `closes` yet... I'll move the issue into "Ready for work" kanban column.
Originally posted by @tiborsimko in #40 (comment)
The download-files command now has useful filtering options to download only files matching a certain name (see #44).
It could be useful to have the same filtering functionality for get-file-locations too, so that users do not have to parse the output themselves.
Let's muse about possible improvements such as: (1) adding this option also to the get-file-locations command; or (2) merging the commands together, keeping only the download-files command, and introducing a sort of --dry-run option that would not do the job but only print the file locations that it would download. The latter option would also simplify the test suite, as we could test filtering more easily without really doing the download tasks.
WDYT? RFC IRL for later