cernopendata / cernopendata-client

CERN Open Data command-line client

Home Page: http://cernopendata-client.readthedocs.io/

License: GNU General Public License v3.0

Python 95.03% Shell 2.82% Dockerfile 2.15%

cernopendata-client's Issues

cli: get-record

Goal: Implement get-record CLI function to fetch some wanted record metadata.

Inputs: --recid or --doi or --title (useful for CMS datasets). One of them is required.

Outputs: Full JSON of the bibliographic record.

Optionally, if some CLI switch is used, output only that JSON subtree.

Examples:

$ cernopendata-client get-record --recid 14
$ cernopendata-client get-record --doi 10.7483/OPENDATA.ATLAS.AHKR.A3TA
$ cernopendata-client get-record --recid 14 --output-fields title,date_created
$ cernopendata-client get-record --title '/Mu/Run2010B-v1/RAW' --output-fields recid

Exit status: 0 if OK, 1 if more than one record was matched. (Should not happen.)
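The input and output handling described above could be sketched roughly as follows. This is only an illustration: the helper names pick_identifier() and select_fields() are hypothetical, and the priority order recid, doi, title is an assumption rather than something decided in this issue.

```python
def pick_identifier(recid=None, doi=None, title=None):
    """Return (kind, value) for the first identifier that was supplied.

    Hypothetical helper; the recid > doi > title priority is an assumption.
    """
    for kind, value in (("recid", recid), ("doi", doi), ("title", title)):
        if value is not None:
            return kind, value
    raise ValueError("Please provide one of --recid, --doi or --title.")


def select_fields(record, output_fields=None):
    """Return the full record, or only the requested comma-separated fields."""
    if not output_fields:
        return record
    wanted = [name.strip() for name in output_fields.split(",")]
    # Silently skip names that are not present in the record.
    return {name: record[name] for name in wanted if name in record}
```

With this shape, --output-fields title,date_created would map to select_fields(record, "title,date_created").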

cli: `get-file-locations --verbose`

Currently, we output only file locations when users use the get-file-locations command:

$ cernopendata-client get-file-locations --recid 1 | head -3
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/00E16FBB-9071-E011-83D3-003048673F12.root
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0248915F-EE71-E011-8894-0025902009E8.root
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0268F635-B671-E011-9090-002481E14E00.root

This does not inform users about the size or the checksum of what is going to be downloaded.

This information could be useful to estimate the download times and/or to plug this into automated scripts the users may have.

To improve the situation, we can introduce a new option --verbose (or perhaps two new options, --include-size and --include-checksum?) that would also print out the file size and the checksum information.

Example:

$ cernopendata-client get-file-locations --recid 1 --verbose | head -3
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/00E16FBB-9071-E011-83D3-003048673F12.root 123234234234 adler32:aaaaa
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0248915F-EE71-E011-8894-0025902009E8.root 234234324 adler32:bbbbb
http://opendata.cern.ch/eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/0268F635-B671-E011-9090-002481E14E00.root 67676767676 adler32:ccccc

I.e. the output would be a space-separated URI SIZE CHECKSUM triple.
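A minimal sketch of the proposed line format, assuming each file entry is available as a dictionary with "uri", "size" and "checksum" keys (these key names and the function name format_location() are assumptions made for illustration):

```python
def format_location(entry, verbose=False):
    """Format one file entry for printing.

    `entry` is assumed to be a dict with "uri", "size" and "checksum" keys.
    """
    if not verbose:
        return entry["uri"]
    # Space-separated URI, size in bytes, and checksum.
    return "{} {} {}".format(entry["uri"], entry["size"], entry["checksum"])
```

Keeping the three values on one line, separated by single spaces, makes the output easy to consume with tools such as awk or cut in user scripts.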

CC @jbenito3

select record by title should use exact string matching

Looking for records by title sometimes does not work due to returning more than one record:

$ cernopendata-client get-record --title '/BTau/Run2010B-Apr21ReReco-v1/AOD'
More than one record fit this title. This should not happen.

See two records: http://opendata.cern.ch/search?page=1&size=20&q=title:%22%2FBTau%2FRun2010B-Apr21ReReco-v1%2FAOD%22

However, there is precisely one dataset record matching the given title exactly, record ID 1, which should be returned...

We could look for an "exact" string match instead of a "substring" match, as it were, to fix this problem. Note that fixing this issue may require amending the Elasticsearch settings on the server side.
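Until the server side supports exact matching, a client-side fallback could filter the returned hits. The sketch below assumes the search reply has already been reduced to a list of record dictionaries, each with a "metadata" section containing "title"; the function name is hypothetical.

```python
def filter_exact_title(hits, title):
    """Keep only the records whose title equals the wanted title exactly."""
    return [
        hit for hit in hits
        if hit.get("metadata", {}).get("title") == title
    ]
```

If exactly one record remains after filtering, it can be returned; otherwise the existing "more than one record" error still applies.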

Originally posted by @tiborsimko in #19 (comment)

adopt black formatter

Following up on what we did in @reanahub, adopt the black code formatter.

Amend codebase.

Amend run-tests.sh to check for black compliance.

Remove testing isort.

cernopendata-client download-files --recid 3005 --verify --protocol https

$ cernopendata-client download-files --recid 3005 --verify --protocol https
==> Downloading file 1 of 1
  -> File: ./3005/0d0714743f0204ed3c0144941e6ce248.configFile.py
Traceback (most recent call last):
  File "/home/simko/.virtualenvs/cernopendata-client/bin/cernopendata-client", line 8, in <module>
    sys.exit(cernopendata_client())
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/cli.py", line 291, in download_files
    download_single_file(path=path, file_location=file_location, protocol=protocol)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/downloader.py", line 68, in download_single_file
    c.perform()
pycurl.error: (1, 'Protocol "root" not supported or disabled in libcurl')

tests: add test for record with TXT and JSON files.

Discussion on MM:

This is for big dataset records, and we always have a record with both TXT and JSON files. (In the past we had a mixture, so that some records did not yet have a JSON file, but nowadays all do.)
The codebase should be ready to handle a missing JSON file, but we don't actually have any "live" test records for this case. We would have to mock the REST API reply to have a test case, which might be a good thing to do anyway to catch the various non-live corner cases that the client should deal with gracefully.
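A mocked test could look roughly like the sketch below. It uses only the standard library; list_file_extensions() is a hypothetical stand-in for the client code under test, and the reply layout (a "files" list with "key" entries under "metadata") is an assumption about the REST API output.

```python
import io
import json
import urllib.request
from unittest import mock


def list_file_extensions(record_url):
    """Hypothetical helper: fetch a record and return its file extensions."""
    with urllib.request.urlopen(record_url) as response:
        record = json.load(response)
    files = record["metadata"].get("files", [])
    return sorted({entry["key"].rsplit(".", 1)[-1] for entry in files})


def fake_urlopen(payload):
    """Return a stand-in for urlopen() that serves the given JSON payload."""
    response = mock.MagicMock()
    body = io.BytesIO(json.dumps(payload).encode("utf-8"))
    response.__enter__.return_value = body
    return mock.Mock(return_value=response)


def test_record_with_txt_index_only():
    # Mocked reply for a record that has a TXT index file but no JSON one.
    payload = {"metadata": {"files": [{"key": "dataset_file_index.txt"}]}}
    with mock.patch("urllib.request.urlopen", fake_urlopen(payload)):
        url = "http://example.invalid/api/records/1"
        assert list_file_extensions(url) == ["txt"]
```

No network access is needed, so such tests can cover corner cases that no live record exhibits.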

Searching for a record from the title

I'm trying to understand how the search from the title is supposed to work.
Is there a way to also specify (part of) the additional_title key? If not, the search often seems to just throw "More than one record fit this title. This should not happen."

This is what I'm trying:

from cernopendata_client import searcher

SERVER_HTTP_URI = "http://opendata.cern.ch"

# Check if record with the given recid exists
searcher.verify_recid(server=SERVER_HTTP_URI, recid=1)

metadata_from_recid = searcher.get_record_as_json(server=SERVER_HTTP_URI, recid=1)

metadata_from_doi = searcher.get_record_as_json(server=SERVER_HTTP_URI, doi=metadata_from_recid["metadata"]["doi"])

metadata_from_title = searcher.get_record_as_json(server=SERVER_HTTP_URI, title=metadata_from_recid["metadata"]["title"])

print(metadata_from_recid == metadata_from_doi == metadata_from_title)

docs: mention xrootd and pycurl installation options

Following up #79, it would be good to do the following:

1) Mention in https://cernopendata-client.readthedocs.io/en/latest/installation.html that people can specify two installation options, pycurl and xrootd, with a brief description of what each one means.
2) Currently, we actually use xroot name in setup.py, but I think we should use xrootd name because this is the official protocol name (XRootD).

    "xroot": ["xrootdpyfs>=0.2"],

3) For consistency, let's also amend everywhere in our code base the terms like --protocol root to use --protocol xrootd, so that we say everywhere "xrootd" both in the code and the documentation.

pip3 installation fails looking for "curl-config"

Environment: Python3.8, pip 19.2.3

Running pip install cernopendata-client gives:

    ERROR: Command errored out with exit status 1:
     command: /home/avivace/.local/share/virtualenvs/bagit-create-tuzAGD-s/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-he1mpu_d/pycurl/setup.py'"'"'; __file__='"'"'/tmp/pip-install-he1mpu_d/pycurl/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-ej5txkkk
         cwd: /tmp/pip-install-he1mpu_d/pycurl/
    Complete output (22 lines):
    Traceback (most recent call last):
      File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 236, in configure_unix
        p = subprocess.Popen((self.curl_config(), '--version'),
      File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'curl-config'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 988, in <module>
        ext = get_extension(sys.argv, split_extension_source=split_extension_source)
      File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 649, in get_extension
        ext_config = ExtensionConfiguration(argv)
      File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 101, in __init__
        self.configure()
      File "/tmp/pip-install-he1mpu_d/pycurl/setup.py", line 241, in configure_unix
        raise ConfigurationError(msg)
    __main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory: 'curl-config'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Maybe a paragraph in the README could also explain how to manually install this, not using pip?

download-files: xrootd protocol

In addition to the http protocol implemented in #22, we should add support for the xrootd protocol, which provides more bandwidth.

Check whether the client system contains xrootd commands (xrdcp) or Python libraries (xrootd).

If yes, enable support for --protocol root.

If not, print a help message explaining what people have to do to install them.
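The availability check could be sketched as below, using only the standard library. The import name "XRootD" for the Python bindings and the wording of the help message are assumptions; the function names are hypothetical.

```python
import importlib.util
import shutil

XROOTD_HELP = (
    "The xrootd protocol needs either the xrdcp command or the XRootD "
    "Python bindings. Please install one of them and try again."
)


def xrootd_available(command="xrdcp", module="XRootD"):
    """Return True if the xrdcp command or the Python bindings are found."""
    has_command = shutil.which(command) is not None
    # find_spec() returns None for a missing top-level module.
    has_module = importlib.util.find_spec(module) is not None
    return has_command or has_module


def require_xrootd(**kwargs):
    """Raise an error carrying the help message when xrootd is unavailable."""
    if not xrootd_available(**kwargs):
        raise RuntimeError(XROOTD_HELP)
```

The CLI could call require_xrootd() only when the user asks for the xrootd protocol, so that http users are not affected.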

This is not an MVP functionality for the first public release; it can be added later.

tests: enrich test suite

The current test suite only has one example for testing version import, see tests/test_version.py.

We should start introducing tests as the feature set grows, see #19 (comment)

The tests to introduce are of diverse nature:

"true unit tests" for helper functions such as validate_recid() and validate_server();
"integration tests" or "regression tests" for CLI commands such as cernopendata-client get-file-locations --title '/DoubleElectron/Run2012B-v1/RAW'

For inspiration, see reana-client or reana-dev test suite.

docs: amend structure

RTFD is activated: https://cernopendata-client.readthedocs.io/en/latest/index.html

Amend structure.

Amend README.

Move old help elsewhere.

Test Issue

Test issue for github mm integration. :)

cernopendata-client get-file-locations --recid 282 --protocol xrootd --verbose

This works:

$ cernopendata-client get-file-locations --recid 282 --protocol xrootd | head -3
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/00000/2A227E10-C949-E311-B033-003048FEAF50.root
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/10000/0002E1AB-5A40-E311-9B07-0025901AF6E6.root
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/10000/02C84E11-6040-E311-8B9A-C86000151BEC.root

This does not:

$ cernopendata-client get-file-locations --recid 282 --protocol xrootd --verbose | head -3
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/file-indexes/CMS_Run2011B_Photon_AOD_12Oct2013-v1_00000_file_index.json     269     sha1:5800af77c12d31bb76ef138d0b68cd6901facd9a
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/file-indexes/CMS_Run2011B_Photon_AOD_12Oct2013-v1_10000_file_index.json     18615   sha1:5d91605740704cfba546e4766c67c8733d8de0c5
root://eospublic.cern.ch//eos/opendata/cms/Run2011B/Photon/AOD/12Oct2013-v1/file-indexes/CMS_Run2011B_Photon_AOD_12Oct2013-v1_20000_file_index.json     266736  sha1:70410acbcc711592115a14866e7485739622d188

In other words, the --expand/--no-expand command-line option is not respected when one uses --verbose.

get-record --recid fixme

Current behaviour:

$ cernopendata-client get-record --recid fixme
ERROR: The record id number you supplied is not valid.
404 Client Error: NOT FOUND for url: http://opendata.cern.ch/record/fixme

Expected behaviour:

Just fail after validation (recid should be integer) and do not contact remote server.

P.S. Ditto for --recid values such as -10 or 1.23.
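A local validation step along these lines would reject such values before any request is made. This is a sketch only; parse_recid() is a hypothetical name and not the project's existing validator.

```python
def parse_recid(value):
    """Return the record ID as a positive integer or raise ValueError."""
    text = str(value).strip()
    # isdecimal() rejects signs, decimal points and non-numeric text.
    if not text.isdecimal() or int(text) <= 0:
        raise ValueError(
            "Invalid value for --recid: {!r} is not a positive integer.".format(value)
        )
    return int(text)
```

Values such as fixme, -10 and 1.23 all fail here without contacting the remote server.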

Improve the run-tests shell script

Later improvement: instead of having commands written twice, we could define shell functions like:

check_black () {
    black --check .
}

so that we can simply call check_black in appropriate places here...

In this way the commands to run will be defined in a single place (inside the shell function).

Originally posted by @tiborsimko in #36 (comment)

get-record --title fixme

Current behaviour:

$ cernopendata-client get-record --title fixme | head -10
Record with given title does not exist.
{
    "created": "2019-07-18T04:51:25.556890+00:00",
    "id": 1,
    "links": {
        "bucket": "http://opendata.cern.ch/api/files/5266a82b-96f1-43a8-874e-b51c1c87d43c",
        "self": "http://opendata.cern.ch/api/records/1"
    },

Expected behaviour:

Should just fail and not fall back to recid=1 record and its JSON.

download-files: actual data files vs index files

One complexity for the download-files command is that some records, such as recid 1, have only index files listed. These index files contain locations of the actual data files. Other records, such as recid 5500, have actual data files directly attached.

This difference exists because of large experimental AOD/AODSIM datasets which can consist of 10,000 files, and it was not possible to store these in Invenio 3 JSON at reasonable performance, see cernopendata/opendata.cern.ch#1562

This nuance already exists for the get-file-locations command, where it was solved in this way: the command returns the list of actual data file locations, unless the option --no-expand is specified (which would return the index files only). Compare:

$ cernopendata-client get-file-locations --recid 1 --protocol http  | wc -l
2916
$ cernopendata-client get-file-locations --recid 1 --protocol http --no-expand | wc -l
12

The goal of this issue is to make sure the download-files command behaves the same:

in issue #22 we shall deal notably with records such as 5500 that are easy to download
in this issue we shall introduce --no-expand option so that records such as 1 will also work (similarly to get-file-locations)

download-files: initial release

If a user wishes to download files belonging to a record, the current technique is to list file locations:

$ cernopendata-client get-file-locations --recid 5500 --protocol http
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/BuildFile.xml
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/HiggsDemoAnalyzer.cc
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/List_indexfile.txt
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall.cc
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall_lvl3.cc
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3MC.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3data.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4MC.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4data.py
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.pdf
http://opendata.cern.ch/eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.png

and then launch wget or curl commands to download them.

The goal of this issue is to simplify this process by introducing a new command download-files that would do this for the user.

Possible options:

$ cernopendata-client download-files --recid 5500 --protocol http --parallel-processes 2

This would launch two parallel downloading processes, using a suitable Python library, to download the files into current directory.
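As a rough sketch of the parallel part, the standard library's concurrent.futures could be used. Note that this sketch uses threads rather than separate processes, and download_one is a hypothetical callable that fetches a single location and returns the local path.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(locations, download_one, parallel=2):
    """Download all locations with at most `parallel` concurrent workers.

    Returns the results of download_one() in the same order as `locations`.
    """
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # map() preserves the input order even if downloads finish out of order.
        return list(pool.map(download_one, locations))
```

Threads are usually sufficient here because downloading is I/O bound; a process pool would be an alternative if that turns out not to be the case.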

P.S.: MVP is simply to download files; resuming interrupted downloads will be part of another issue, but it is good to think about this functionality upfront.

P.S. An option --target-directory could be introduced which would recreate directory structure known from the original record. This will be important for AOD files which have subdirectory structure such as this one. So the corresponding subdirectories would have to be created in the target directory.

get-file-locations: unwind index files

Current behaviour

This is OK:

$ cernopendata-client get-file-locations --recid 5500
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/BuildFile.xml
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/HiggsDemoAnalyzer.cc
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/List_indexfile.txt
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall.cc
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/M4Lnormdatall_lvl3.cc
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3MC.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level3data.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4MC.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/demoanalyzer_cfg_level4data.py
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.pdf
root://eospublic.cern.ch//eos/opendata/cms/software/HiggsExample20112012/mass4l_combine.png

This is not:

$ cernopendata-client get-file-locations --recid 14
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0000_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0000_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0001_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0001_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0002_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0002_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0003_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0003_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0004_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0004_file_index.txt
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0005_file_index.json
root://eospublic.cern.ch//eos/opendata/cms/Run2010B/Mu/AOD/Apr21ReReco-v1/file-indexes/CMS_Run2010B_Mu_AOD_Apr21ReReco-v1_0005_file_index.txt

Expected behaviour

The location to data files should be returned, i.e. if there are index files in the output then these should be "unwound", i.e. the client should read their content and return what's inside them, as it were.
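The unwinding could be sketched as follows. The sketch assumes that each *_file_index.txt file lists one data file location per line and that every JSON index has a TXT sibling with the same content; read_text is a hypothetical callable that returns the content of a remote index file.

```python
def expand_locations(locations, read_text):
    """Replace index files in `locations` by the data files they list."""
    expanded = []
    for location in locations:
        if location.endswith("_file_index.txt"):
            # Assumed format: one data file location per non-empty line.
            for line in read_text(location).splitlines():
                if line.strip():
                    expanded.append(line.strip())
        elif location.endswith("_file_index.json"):
            # Assumed to duplicate the TXT index, so skip it.
            continue
        else:
            expanded.append(location)
    return expanded
```

Records such as 5500 that have no index files pass through unchanged.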

cli: new command list-directory

Introduce new command list-directory that would take an EOSPUBLIC path and would output files belonging to this directory and its subdirectories.

Example:

$ cernopendata-client list-directory /eos/opendata/cms/validated-runs/Commissioning10
root://eospublic.cern.ch//eos/opendata/cms/validated-runs/Commissioning10/Commissioning10-May19ReReco_7TeV.json
root://eospublic.cern.ch//eos/opendata/cms/validated-runs/Commissioning10/Commissioning10-May19ReReco_900GeV.json

Beware of several situations:

the path not starting with /eos/opendata/... should be refused as not valid
the path could give many hits, e.g. /eos/opendata/cms would want to list millions of files, so we have to stop it.

The implementation could use xrootdpyfs and a snippet like:

from xrootdpyfs import XRootDPyFS
# Open the remote directory and list the entries it contains.
fs = XRootDPyFS("root://eospublic.cern.ch//eos/opendata/cms/Run2010B/BTau/AOD/Apr21ReReco-v1/0000/")
files = fs.listdir()
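The two safeguards mentioned above could be sketched like this, independently of the listing backend. The function names and the idea of a hard limit on the number of entries are assumptions made for illustration.

```python
import itertools
import posixpath

ALLOWED_ROOT = "/eos/opendata"


def validate_directory(path):
    """Return the normalised path or raise ValueError if it is not allowed."""
    normalised = posixpath.normpath(path)
    # Normalising first also rejects tricks such as /eos/opendata/../other.
    if not normalised.startswith(ALLOWED_ROOT + "/"):
        raise ValueError("Path must start with {}/".format(ALLOWED_ROOT))
    return normalised


def take_at_most(entries, limit):
    """Return the entries as a list, raising if there are more than `limit`."""
    head = list(itertools.islice(entries, limit + 1))
    if len(head) > limit:
        raise ValueError("Too many files to list; please use a narrower path.")
    return head
```

Reading only limit + 1 entries means an oversized listing is detected without walking the whole directory tree.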

deprecate opendata_analysis_query.py

/opendata_analysis_query.py contains old code that is not used anymore.

Check functionality and port necessary one to the new structure.

Remove the file and its documentation afterwards.

get-metadata: amend `--output-fields` behaviour

Stemming from #8, we should also fix the behaviour of --output-fields. This currently allows selecting only metadata, created, ... and other top-level fields from the /api/records output. This is not very useful.

While the power users can do things like:

$ cernopendata-client get-metadata --recid 1 | jq -S '.metadata.title'
"/BTau/Run2010B-Apr21ReReco-v1/AOD"
$ cernopendata-client get-metadata --recid 1 | jq -S '.metadata.system_details.global_tag'
"FT_R_42_V10A::All"

it would be useful to amend --output-fields to allow searching inside metadata section, so that users could write instead:

$ cernopendata-client get-metadata --recid 1 --output-fields title
/BTau/Run2010B-Apr21ReReco-v1/AOD
$ cernopendata-client get-metadata --recid 1 --output-fields system_details.global_tag
FT_R_42_V10A::All

IOW, --output-fields should be able to print out a wanted JSON path value directly.
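A dotted-path lookup inside the metadata section could be sketched as below; get_field() is a hypothetical name, and error handling for missing keys is left out for brevity.

```python
def get_field(record, path):
    """Return the value found at a dotted path inside record["metadata"]."""
    value = record["metadata"]
    for key in path.split("."):
        # Descend one level per path component; a missing key raises KeyError.
        value = value[key]
    return value
```

For example, the path system_details.global_tag walks from metadata into system_details and then returns the global_tag value.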

printer: initial release

Instead of having print() and click.echo() hard-coded throughout the code base, let's introduce a new printer.py module that can take care of displaying info/error messages in a unified manner.

Example: display_message() function in reana-dev.

We could then print success in green, progress in yellow, errors in red, etc.

The code base should then be amended to use the new display_message() function instead of printing or echoing.
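A minimal sketch of such a function is shown below. It uses plain ANSI escape codes from the standard library rather than click, and the message type names and the split into format_message() and display_message() are assumptions for illustration, not the reana-dev implementation.

```python
import sys

# ANSI colour codes for the message types mentioned above.
COLOURS = {"success": "\033[32m", "progress": "\033[33m", "error": "\033[31m"}
RESET = "\033[0m"


def format_message(msg, msg_type="info"):
    """Wrap the message in a colour code if the type has one."""
    colour = COLOURS.get(msg_type)
    return "{}{}{}".format(colour, msg, RESET) if colour else msg


def display_message(msg, msg_type="info"):
    """Print errors to stderr and everything else to stdout."""
    stream = sys.stderr if msg_type == "error" else sys.stdout
    print(format_message(msg, msg_type), file=stream)
```

Separating formatting from printing keeps the formatting logic easy to unit test.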

search by DOI is broken

$ cernopendata-client get-metadata --doi 10.7483/OPENDATA.CMS.A342.9982
...
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

download-files: --verify each file as they are downloaded

Currently, the files are being verified at the end of a download:

$ cernopendata-client download-files --recid 5500 --filter-range 1-3 --verify
==> Downloading file 1 of 3
  -> File: ./5500/BuildFile.xml
  -> Progress: 0/0 kiB (100%)
==> Downloading file 2 of 3
  -> File: ./5500/HiggsDemoAnalyzer.cc
  -> Progress: 81/81 kiB (100%)
==> Downloading file 3 of 3
  -> File: ./5500/List_indexfile.txt
  -> Progress: 1/1 kiB (100%)
==> Verifying downloaded files for record 5500.
==> Verifying file BuildFile.xml...
  -> expected size 305, found 305
  -> expected checksum adler32:ff63668a, found adler32:ff63668a
==> Verifying file HiggsDemoAnalyzer.cc...
  -> expected size 83761, found 83761
  -> expected checksum adler32:f205f068, found adler32:f205f068
==> Verifying file List_indexfile.txt...
  -> expected size 1669, found 1669
  -> expected checksum adler32:46a907fc, found adler32:46a907fc
==> Success!

It would be good to improve this so that each file is immediately verified after it is downloaded:

- first filter which files the user wants to download
- then for each file to be downloaded, do:
  - download it
  - check its size and checksum
- print success

This is so that the user can spot troubles early, not only after downloading all the files.
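The per-file loop could be sketched as follows, using zlib.adler32 from the standard library to produce checksums in the adler32:xxxxxxxx form shown in the output above. The entry keys "size" and "checksum" and the download callable are assumptions made for illustration.

```python
import os
import zlib


def adler32_checksum(path, chunk_size=1024 * 1024):
    """Compute the adler32 checksum of a file, reading it in chunks."""
    value = 1  # adler32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "adler32:{:08x}".format(value & 0xFFFFFFFF)


def verify_file(path, expected_size, expected_checksum):
    """Return True if both the size and the checksum match."""
    if os.path.getsize(path) != expected_size:
        return False
    return adler32_checksum(path) == expected_checksum


def download_and_verify(entries, download):
    """Download each entry and verify it immediately afterwards."""
    paths = []
    for entry in entries:
        path = download(entry)  # hypothetical callable returning a local path
        if not verify_file(path, entry["size"], entry["checksum"]):
            raise RuntimeError("Verification failed for {}".format(path))
        paths.append(path)
    return paths
```

Raising on the first mismatch stops the run early, which is the point of verifying as we go.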

Originally posted by @tiborsimko in #58 (comment)

configuration: allow setting alternative server URL

Currently, cernopendata-client queries the opendata.cern.ch instance, and this is hard-coded:

$ rg opendata.cern.ch
cernopendata_client/cli.py
24:SEARCH_URL = "http://opendata.cern.ch/api/records/"
103:                    "root://eospublic.cern.ch/", "http://opendata.cern.ch"
117:            f.replace("root://eospublic.cern.ch/", "http://opendata.cern.ch")

It would be good to allow querying opendata-qa.cern.ch or opendata-dev.cern.ch instances, both for testing purposes and when working with open data releases that take a long time to "mature" on the DEV or QA instances before making it to the PROD instance.

We could solve it by adding a new command-line option (--server <url>) to each CLI command, where users could pass a wanted server instance, for example:

$ cernopendata-client get-file-locations --recid 15007 --protocol http --server http://opendata-dev.cern.ch

If the option is not used, the client would connect by default to http://opendata.cern.ch.
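The value passed to --server could be checked with the standard library before it is used to build request URLs. This is a sketch; check_server_url() is a hypothetical name, and restricting the scheme to http and https is an assumption.

```python
from urllib.parse import urlparse

DEFAULT_SERVER = "http://opendata.cern.ch"


def check_server_url(server=None):
    """Return a usable server URL, falling back to the default instance."""
    if server is None:
        return DEFAULT_SERVER
    parts = urlparse(server)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("Server should be a valid URL")
    # Drop a trailing slash so that paths can be appended consistently.
    return server.rstrip("/")
```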

tests: download-files filters could use record 5500

(1) The test suite for the filtering functionality (see test_download_files_filter_*()) currently uses the test record 3005. However, this record contains only a single file, which cannot test various scenarios (such as getting files 1-3).

It would therefore be good to amend these tests to use record 5500, which has several input files to choose from. It could also test exactly the same examples that we promise in the docstring when somebody does cernopendata-client download-files --help.

That said, the filters work fine 👍 this issue is only about enriching tests to be able to have more files to filter from to cover more usage scenarios.

(2) One could also add three more simple tests: (2a) filtering for a non-existing file name (--filter-name notexisting) should return an error message, (2b) the same for regexp, and (2c) the same for a wrong range (--filter-range 0 and --filter-range 99998-99999), albeit the last one is partly tested elsewhere in test_validate_range() as well.

docs: enrich file headers with license statement

Current status:

$ head -1 .travis.yml
# TODO: Add License header

Expected:

"Minified" license snippet.

tests: complete code coverage

The code coverage looks great. We can still write more tests for not-tested else branches and stuff. No priority, something to be done on the side when time permits.

$ python setup.py test
...
----------- coverage: platform linux, python 3.8.6-final-0 -----------
Name                                Stmts   Miss  Cover   Missing
-----------------------------------------------------------------
cernopendata_client/__init__.py         4      0   100%
cernopendata_client/cli.py            137     16    88%   48, 54, 105-110, 112, 232-237, 282-285, 300-304, 342-349
cernopendata_client/config.py           7      0   100%
cernopendata_client/downloader.py      50      0   100%
cernopendata_client/printer.py         17      1    94%   37
cernopendata_client/searcher.py       104     22    79%   24-26, 41-46, 55-64, 98, 111-112, 127-133, 156, 203-207, 209
cernopendata_client/utils.py           11      0   100%
cernopendata_client/validator.py       33      2    94%   82-89
cernopendata_client/verifier.py        40      0   100%
cernopendata_client/version.py          3      0   100%
-----------------------------------------------------------------
TOTAL                                 406     41    90%

protocol: add 'https' option

We are deploying support for HTTPS for CERN Open Data portal, e.g. see https://opendata-dev.cern.ch

It would therefore be good to allow the https protocol everywhere, so that we offer three protocols: http, https, and xrootd.

We should amend the checker etc and also mention this in the documentation.
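The location rewriting for the three protocols could be sketched as below, following the existing prefix replacement of root://eospublic.cern.ch/ by the web server address. The function name and the server_host parameter are hypothetical.

```python
XROOTD_PREFIX = "root://eospublic.cern.ch/"


def to_protocol(location, protocol, server_host="opendata.cern.ch"):
    """Rewrite an xrootd file location for the wanted protocol."""
    if protocol == "xrootd":
        return location
    if protocol in ("http", "https"):
        # Replace only the leading xrootd prefix with the web server address.
        web_prefix = "{}://{}".format(protocol, server_host)
        return location.replace(XROOTD_PREFIX, web_prefix, 1)
    raise ValueError("Unknown protocol: {}".format(protocol))
```

Applying the rewrite before handing the location to the downloader avoids passing root:// URIs to an HTTP client.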

cli: do not show Python tracebacks

Report nicer error messages instead of Python tracebacks, for example two situations:

(1) download-files

$ cernopendata-client download-files --recid 5500 --server foo
Traceback (most recent call last):
  File "/home/simko/.virtualenvs/cernopendata-client/bin/cernopendata-client", line 8, in <module>
    sys.exit(cernopendata_client())
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/cli.py", line 235, in download_files
    record_json = get_record_as_json(server, recid, doi, title)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/searcher.py", line 167, in get_record_as_json
    record_id = verify_recid(server=server, recid=record_id)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/searcher.py", line 49, in verify_recid
    input_record_url_check = requests.get(input_record_url)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/sessions.py", line 452, in prepare_request
    p.prepare(
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/requests/models.py", line 387, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'foo/record/5500': No schema supplied. Perhaps you meant http://foo/record/5500?

(2) verify-files

$ cernopendata-client verify-files
Traceback (most recent call last):
  File "/home/simko/.virtualenvs/cernopendata-client/bin/cernopendata-client", line 8, in <module>
    sys.exit(cernopendata_client())
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/cli.py", line 323, in verify_files
    validate_recid(recid)
  File "/home/simko/.virtualenvs/cernopendata-client/lib/python3.8/site-packages/cernopendata_client/validator.py", line 23, in validate_recid
    if recid <= 0:
TypeError: '<=' not supported between instances of 'NoneType' and 'int'

Note that some commands are OK:

$ cernopendata-client get-file-locations
ERROR: Please provide at least one of following arguments: (recid, doi, title)

$ cernopendata-client get-file-locations --server foo
Usage: cernopendata-client get-file-locations [OPTIONS]

Error: Invalid value for '--server': Server should be a valid URL
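
The verify-files traceback above comes from comparing a missing recid (None) with an integer. A minimal sketch of a guard, assuming the check belongs in validate_recid (the function name is taken from the traceback; the error messages and the use of ValueError are assumptions, and the real client may prefer a click-specific error):

```python
def validate_recid(recid=None):
    """Return True if recid is a positive integer, raise ValueError otherwise."""
    if recid is None:
        # Missing option: report it instead of comparing None with an int.
        raise ValueError("Please provide a record ID via --recid.")
    if not isinstance(recid, int) or recid <= 0:
        raise ValueError("Recid should be a positive integer, got {!r}.".format(recid))
    return True
```

The calling command could catch the ValueError, print the message as an ERROR line, and exit with a non-zero status, which would match what get-file-locations already does.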

downloader: catch more error situations

Now that we are catching downloading errors via --retry-limit and --retry-sleep parameters thanks to #91, we can build upon this basis and enrich the error handling logic to catch more errors (such as Error 500 received from server, etc).

Example of how to reproduce:

Run locally a CERN Open Data instance:

$ cd opendata.cern.ch
$ docker-compose build
$ docker-compose up
$ docker exec -i -t opendatacernch_web_1 ./scripts/populate-instance.sh --skip-records
$ docker exec -i -t opendatacernch_web_1 cernopendata fixtures records --mode insert-or-replace -f cernopendata/modules/fixtures/data/records/cms-tools-higgsexample20112012.json
$ firefox http://localhost/record/5500

Start downloading files with a large enough retry sleep time:

$ cernopendata-client download-files --recid 5500 --verify --server http://localhost --retry-sleep 30

While the download is running, bring down the web site.

The current behaviour of the client is to output a Python traceback:

==> Verifying file demoanalyzer_cfg_level4data.py...
  -> Expected size 3821, found 3821
  -> Expected checksum adler32:177b49c0, found adler32:177b49c0
==> Downloading file 10 of 11
  -> File: ./5500/mass4l_combine.pdf
Traceback (most recent call last):
  File ".../bin/cernopendata-client", line 8, in <module>
    sys.exit(cernopendata_client())
  File ".../lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File ".../lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File ".../lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File ".../lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File ".../lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File ".../lib/python3.8/site-packages/cernopendata_client/cli.py", line 337, in download_files
    download_single_file(path=path, file_location=file_location, protocol=protocol)
  File ".../lib/python3.8/site-packages/cernopendata_client/downloader.py", line 109, in download_single_file
    c.perform()
pycurl.error: (52, 'Empty reply from server')

Expected behaviour:

The client should catch errors of this kind in a try/except block and not exit immediately
.. or use the native retry functionality of the pycurl/requests/xrootd clients directly!
The client should wait 30 seconds before retrying
In the meantime, you bring the server up again via docker-compose up
The client should then reconnect and download the file

The same test can be performed not only when the server is down, but also by simulating various responses from the server, such as Error 500, Error 206, etc. This can be done easily by editing the Nginx configuration (or the Flask app) to return various kinds of errors.
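
The expected behaviour above could be sketched as a small retry wrapper. This is only an illustration: the helper name retry and its parameters are hypothetical, and in the real client the retriable exceptions would presumably include pycurl.error and the relevant requests exceptions rather than the generic default used here.

```python
import time


def retry(operation, retry_limit=10, retry_sleep=5, retriable=(Exception,),
          sleep=time.sleep):
    """Call operation() until it succeeds or retry_limit attempts are used up."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except retriable as err:
            if attempt >= retry_limit:
                # Out of attempts: let the caller see the original error.
                raise
            print("ERROR: {}. Retrying in {} seconds ({}/{}).".format(
                err, retry_sleep, attempt, retry_limit))
            sleep(retry_sleep)
```

With such a wrapper, the download call would be passed in as operation, and the values of --retry-limit and --retry-sleep would be passed as retry_limit and retry_sleep.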

verify-files: initial release

After a user downloads a set of files belonging to a dataset (see #22), it would be useful to provide functionality to check the sizes and adler32 checksums of the downloaded files.

This information is available via the REST API; here is one example.

The user would launch:

$ cernopendata-client verify-files --recid 5500 --directory ./mydata/5550

The command would go through the given directory, calculate the size and adler32 checksum of each file present, and compare these with the information obtained from the server via the REST API.

The command would report back to the user:

exit code 1 if any of the locally available files has a different size or a different checksum
exit code 2 if some of the locally available files are not present on the remote opendata.cern.ch instance
exit code 3 if some of the opendata.cern.ch files are not present locally
exit code 0 if everything corresponds perfectly

The command will exit with different exit codes so that users can plug it into their harvesting workflows.
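
A minimal sketch of the proposed check, using only the standard library. The function names, the shape of the expected mapping (file name to size and bare hex checksum, without the adler32: prefix) and the precedence between exit codes when several problems occur at once are all assumptions made for illustration.

```python
import os
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as an 8-digit hex string."""
    value = 1  # adler32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "{:08x}".format(value & 0xFFFFFFFF)


def verify_directory(directory, expected):
    """Return the proposed exit code for the files found in directory.

    expected maps a file name to a (size, checksum) tuple, as it would be
    obtained from the server.
    """
    local = set(os.listdir(directory))
    remote = set(expected)
    for name in sorted(local & remote):
        path = os.path.join(directory, name)
        size, checksum = expected[name]
        if os.path.getsize(path) != size or adler32_of_file(path) != checksum:
            return 1  # size or checksum mismatch
    if local - remote:
        return 2  # local file unknown to the server
    if remote - local:
        return 3  # server file missing locally
    return 0
```

The CLI command would then pass the returned value to sys.exit().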

searcher: separate out get_list_directory() and friends

This "internal" issue is about refactoring some functions between cernopendata-client components.

The searcher component basically contacts the CERN Open Data portal and its API, so it is meant as a web client for /api/records and friends.

Currently, the same component also contains the get_list_directory() function, which does not use the above search APIs but works directly with the EOSPUBLIC /eos/opendata filesystem and directory structure.

This issue proposes to separate EOSPUBLIC directory handling functions (that are not using opendata.cern.ch's portal API in any way) into a new component.

We can invent some nice name, e.g. eospublic (but this is not a verb) or walker.py or filer.py (but this may be weird) or xrootdexplorer (for example) or something?

package: initial release

Create initial package structure as we do e.g. in reana-client so that the client could be released on PyPI.

Set up RTFD documentation, Travis CI, and everything in the usual manner.

(The cernopendata-client CLI schema amendments and enrichments will come later...)

docs: better docstrings

(1) Introduce pydocstyle into run-tests.sh following e.g. reana-client example.

(2) Fix existing warnings in master branch:

$ pydocstyle .
./docs/conf.py:1 at module level:
        D100: Missing docstring in public module
./cernopendata_client/search.py:1 at module level:
        D100: Missing docstring in public module
./cernopendata_client/cli.py:40 in public function `cernopendata_client`:
        D103: Missing docstring in public function
./cernopendata_client/cli.py:68 in public function `get_metadata`:
        D301: Use r""" if any backslashes in a docstring
./cernopendata_client/cli.py:123 in public function `get_file_locations`:
        D301: Use r""" if any backslashes in a docstring
./cernopendata_client/cli.py:180 in public function `download_files`:
        D202: No blank lines allowed after function docstring (found 1)
./cernopendata_client/cli.py:180 in public function `download_files`:
        D301: Use r""" if any backslashes in a docstring
./cernopendata_client/downloader.py:1 at module level:
        D100: Missing docstring in public module
./cernopendata_client/downloader.py:15 in public function `show_download_progress`:
        D400: First line should end with a period (not 'e')
./cernopendata_client/downloader.py:28 in public function `download_single_file`:
        D400: First line should end with a period (not 'e')
./cernopendata_client/downloader.py:49 in public function `get_download_files_by_name`:
        D400: First line should end with a period (not 'e')
./cernopendata_client/downloader.py:49 in public function `get_download_files_by_name`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
./cernopendata_client/downloader.py:57 in public function `get_download_files_by_regexp`:
        D205: 1 blank line required between summary line and description (found 0)
./cernopendata_client/downloader.py:57 in public function `get_download_files_by_regexp`:
        D400: First line should end with a period (not ')')
./cernopendata_client/downloader.py:57 in public function `get_download_files_by_regexp`:
        D403: First word of the first line should be properly capitalized ('Dload', not 'dload')
./cernopendata_client/downloader.py:72 in public function `get_download_files_by_range`:
        D205: 1 blank line required between summary line and description (found 0)
./cernopendata_client/downloader.py:72 in public function `get_download_files_by_range`:
        D400: First line should end with a period (not ')')
./cernopendata_client/downloader.py:72 in public function `get_download_files_by_range`:
        D403: First word of the first line should be properly capitalized ('Dload', not 'dload')
./cernopendata_client/validator.py:1 at module level:
        D100: Missing docstring in public module
./cernopendata_client/version.py:9 at module level:
        D205: 1 blank line required between summary line and description (found 0)
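
For illustration, here is a docstring that would address the kinds of warnings listed above: an imperative, capitalised summary line ending with a period (D400, D401, D403), a blank line before the description (D205), a raw string because of the backslash (D301), and no blank line after the docstring (D202). The function name is taken from the report; its signature and body are assumptions.

```python
def get_download_files_by_name(names=None, file_locations=None):
    r"""Return the file locations whose base name is listed in names.

    The raw-string prefix keeps backslashes literal, e.g. in a regexp
    such as \.root$ mentioned in the description.
    """
    names = names or []
    return [loc for loc in (file_locations or []) if loc.split("/")[-1] in names]
```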

search by title not working

Use case: look up record by CMS dataset titles:

$ cernopendata-client get-record --title '/BTau/Run2010B-Apr21ReReco-v1/AOD'

This is not working in a similar manner as #18, so the fix will be similar.

Note that the same title could occur in two records, for example when the record title is not the dataset name but some free text. We should issue an error when a title lookup returns more than one hit.
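
The "more than one hit" rule could look roughly like this. The helper name and the use of LookupError are hypothetical; hits stands for whatever list of records the title search returns.

```python
def pick_single_record(hits, title):
    """Return the only record matching title, or raise LookupError."""
    if not hits:
        raise LookupError("No record found with title {!r}.".format(title))
    if len(hits) > 1:
        # Ambiguous title: refuse to guess which record the user meant.
        raise LookupError(
            "Title {!r} matches {} records; please use --recid or --doi.".format(
                title, len(hits)))
    return hits[0]
```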

download-files: prettify progress bar display

Observation: once a file is downloaded, the progress bar leaves the display slightly broken, see:

$ cernopendata-client download-files --recid 1 --filter-range 1-2
==> Downloading file 1 of 2
==> Downloading file: ./1/0248915F-EE71-E011-8894-0025902009E8.root
==> Downloading file 2 of 2kiB (100%)
==> Downloading file: ./1/0268F635-B671-E011-9090-002481E14E00.root
Downloading: 842237/842237 kiB (100%)
Download completed!

Note the Downloading file 2 of 2kiB (100%) part where some bits remained from the previous progress bar.

Mentioning just for completeness; if you don't have a quick solution, we can also address this cosmetic issue later via a separate issue...

Originally posted by @tiborsimko in #44 (comment)
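
One possible quick fix, sketched under the assumption that the output goes to a terminal that understands ANSI escape sequences: return to the start of the line and erase it before every progress update, and once more when the download finishes. The function names here are hypothetical and not the client's actual ones.

```python
import sys


def show_progress(downloaded_kib, total_kib, stream=sys.stdout):
    """Redraw the progress line in place."""
    percent = 100 * downloaded_kib // total_kib if total_kib else 0
    # "\r" moves to the start of the line, "\033[K" erases to the end of it.
    stream.write("\r\033[KDownloading: {}/{} kiB ({}%)".format(
        downloaded_kib, total_kib, percent))
    stream.flush()


def finish_progress(stream=sys.stdout):
    """Erase the progress line so the next message starts on a clean line."""
    stream.write("\r\033[K")
    stream.flush()
```

Calling finish_progress() before printing the next "==> Downloading file" message would avoid leftovers from the previous bar.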

config: opendata.cern.ch

The magic string opendata.cern.ch is hard-coded in many places in the client.

It would be good to create config.py to centralise the server location.

Example: see reana-client and e.g. TIMECHECK value set up there.
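
A possible shape for such a config.py, as a sketch only. The constant and helper names are assumptions; the two URIs and the /api/records path are the ones that appear in the outputs quoted in these issues.

```python
"""cernopendata-client configuration (sketch)."""

# Default server locations; commands could override them, e.g. via --server.
SERVER_HTTP_URI = "http://opendata.cern.ch"
SERVER_ROOT_URI = "root://eospublic.cern.ch/"


def record_api_url(recid, server=SERVER_HTTP_URI):
    """Return the REST API URL of a record on the given server."""
    return "{}/api/records/{}".format(server.rstrip("/"), recid)
```

Other modules would then import these values instead of repeating the string.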

docs: ReadTheDocs builds fail due to pycurl

After addition of pycurl, the ReadTheDocs builds fail with:

   FileNotFoundError: [Errno 2] No such file or directory: 'curl-config': 'curl-config'

This is most probably because some system dependencies such as libssl-dev and friends are not installed.

Investigate the best solution to fix the documentation builds.

Perhaps add these libraries if RTD build system allows this?
Perhaps use autodoc_mock_imports?
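
For the second option, the change would be a one-line fragment in the Sphinx configuration. This assumes the documentation is built with sphinx.ext.autodoc and that pycurl is the only import that fails on ReadTheDocs.

```python
# docs/conf.py (fragment)
# Ask autodoc to mock pycurl so that the docs build without libcurl headers.
autodoc_mock_imports = ["pycurl"]
```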

Verify files individually after each download

In download-files, improve verification by verifying each file right after it is downloaded.

Originally posted by @tiborsimko in #58 (comment)

get-metadata: expose only metadata

Due to the special treatment of files, some of the default API output (notably file buckets) is not useful, since the files are served from EOS.

This output should be hidden. We should probably return the content of metadata directly.

ci: shellcheck

It would be good to add shellcheck to test run-tests.sh itself:

add it as a target run-tests.sh --check-shell
add it to GA as a check

cli: get-file-locations

Goal: Get list of files belonging to a dataset.

Complexity: Should resolve index files for datasets such as recid 14. The output should be data files, not data index files. (For index files, people could presumably use get-record --output-fields files as part of task #2.)

Inputs: the same as for #2, for example --recid 14, or --doi or --title.

Outputs: list of dataset locations such as:

root://eospublic.cern.ch//eos/opendata/atlas/MasterclassDatasets/WPath/2014/1/1A.zip
root://eospublic.cern.ch//eos/opendata/atlas/MasterclassDatasets/WPath/2014/1/1B.zip

Options: --protocol root (default), --protocol http which should return http://opendata.cern.ch/eos/opendata/... instead of root://... paths.
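
The --protocol switch amounts to a prefix swap between the two forms shown in this issue. A minimal sketch, with hypothetical constant and function names:

```python
ROOT_PREFIX = "root://eospublic.cern.ch//eos/opendata/"
HTTP_PREFIX = "http://opendata.cern.ch/eos/opendata/"


def to_protocol(location, protocol="root"):
    """Return location rewritten for the wanted protocol ("root" or "http")."""
    if protocol == "http" and location.startswith(ROOT_PREFIX):
        return HTTP_PREFIX + location[len(ROOT_PREFIX):]
    if protocol == "root" and location.startswith(HTTP_PREFIX):
        return ROOT_PREFIX + location[len(HTTP_PREFIX):]
    # Already in the wanted form, or not an /eos/opendata path: keep as is.
    return location
```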

download-files: download particular files only

Record 1 contains 2916 files and is 2.7 TB big.

Chances are people would like to download it in batches.

Currently, cernopendata-client download-files would download everything. We need to introduce finer granularity.

The goal of this issue is to introduce a new option for download-files, called perhaps --filename, which would download only one particular file:

$ cernopendata-client download-files --recid 1 --filename 105FD6D0-8B71-E011-9613-00E081791775.root

Alternatively, we could offer glob-style matching:

$ cernopendata-client download-files --recid 1 --filename '*E011*'

which would download all files matching the given glob expression.

Alternatively, since the file order is perfectly defined in JSON, we can offer downloading by chunks, either given file number N1, or several files from file number N2 to file number N3:

$ cernopendata-client download-files --recid 1 --filenumber 13
$ cernopendata-client download-files --recid 1 --filenumber 20-29
$ cernopendata-client download-files --recid 1 --filenumber 30-39
...
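
Both selection styles proposed above can be sketched with the standard library. The function names are hypothetical, file numbers are assumed to be 1-based and ranges inclusive, and matching is done on the base name of each location.

```python
import fnmatch


def filter_by_name(file_locations, pattern):
    """Keep locations whose base name matches the glob pattern."""
    return [
        loc for loc in file_locations
        if fnmatch.fnmatchcase(loc.rsplit("/", 1)[-1], pattern)
    ]


def filter_by_number(file_locations, spec):
    """Keep file number N ("13") or files N2 to N3 inclusive ("20-29")."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    if start < 1 or end < start:
        raise ValueError("Invalid file number specification: {!r}".format(spec))
    return file_locations[start - 1:end]
```

An exact file name is just a glob pattern without wildcards, so one option could cover both the single-file and the pattern case.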

get-record --doi fixme

Current behaviour:

$ cernopendata-client get-record --doi fixme | head -10
Record with given doi does not exist.
{
    "created": "2019-07-18T04:51:25.556890+00:00",
    "id": 1,
    "links": {
        "bucket": "http://opendata.cern.ch/api/files/5266a82b-96f1-43a8-874e-b51c1c87d43c",
        "self": "http://opendata.cern.ch/api/records/1"
    },
...

Expected behaviour:

Should just fail and not print any record=1 JSON afterwards.

docs: enrich CLI API page

Let's enrich the CLI API docs page with all commands and options that would be automatically generated.

I thought of looking into generating the CLI API in another way... E.g. we could run sphinx-click locally and add the file into `docs` so that ReadTheDocs would simply consume it from there. (A bit like we are doing with OpenAPI stuff in REANA.) Hence I used `addresses` and not `closes` yet... I'll move the issue into "Ready for work" kanban column.

Originally posted by @tiborsimko in #40 (comment)

get-file-locations: add --filter functions

The download-files command now has useful filtering options for downloading only files matching a certain name (see #44).

It could be useful to have the same filtering functionality for get-file-locations too so that users do not have to parse the output themselves.

Let's muse about possible improvements such as: (1) adding this option also to get-file-locations command; or (2) merging the commands together, keeping only download-files command, and introducing a sort of --dry-run option that would not do the job but only print file locations that it would download. The latter option would also simplify test suite, as we could test filtering without really doing the download tasks more easily.

WDYT? RFC IRL for later
