
b2luigi's Introduction

Hi there, I am Nils! 👋

I am making my contributions to all projects on GitHub solely in my personal capacity and am not conveying any rights to any intellectual property of any third parties.


b2luigi's Issues

Gbasf2: Set default install directory (or `setup` script) to latest version on CVMFS

Before the gbasf2 CLI or Python interface can be used, the user has to source a setup script in the gbasf2 install directory, which sets various environment variables.

In the get_gbasf2_env function, I therefore use the b2luigi setting gbasf2_install_directory, which should be the path to the directory where gbasf2 is installed. The default was ~/gbasf2KEK, which was the install path used in the previous installation instructions. The get_gbasf2_env function then joins this directory with the sub-path BelleDIRAC/gbasf2/tools/setup to get the full path to the setup file to be sourced.

However, gbasf2 has recently become available on CVMFS, which is now the recommended way to use it. On CVMFS there are separate install directories for the different gbasf2 releases; e.g. to use the current release v5r1p3, the user can add the following setting to their settings.json:

"gbasf2_install_directory": "/cvmfs/belle.kek.jp/grid/BelleDIRAC/Belle-BNL-Certification.v5r1p3"

However, it would be useful to have a default location which always points to the latest release. For that, there is already a symlink on CVMFS. The only "problem" is that it points to a subdirectory of what gbasf2 assumes to be the gbasf2_install_directory:

/cvmfs/belle.kek.jp/grid/gbasf2/pro → /cvmfs/belle.kek.jp/grid/BelleDIRAC/Belle-KEK.v5r1p3/BelleDIRAC/gbasf2

I am considering setting the default gbasf2_install_directory to the real parent of that, i.e. to

os.path.realpath(os.path.join("/cvmfs/belle.kek.jp/grid/gbasf2/pro", os.pardir))

However, I am not sure how stable that symlink is; hopefully it will stay around.

Since we don't need anything from the gbasf2 install directory except the location of the setup script, in retrospect I think it was a mistake to make the setting point to the install directory; it would have been better to have it point to the setup script directly. Otherwise, the get_gbasf2_env function might break if the location of the setup script within the install directory changes in the future. But it's too late to change that now. Maybe I could add a gbasf2_setup_path setting and emit a DeprecationWarning when the install-directory setting is used, though I'm not sure about that.
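A rough sketch of how a gbasf2_setup_path setting with a deprecation fallback could look (the setting name, the default path below the pro symlink and the assumption that the setup script lives at tools/setup are all only proposals):

import os
import warnings

from b2luigi import get_setting

# assumption: the "pro" symlink target contains the setup script at tools/setup
DEFAULT_GBASF2_SETUP_PATH = "/cvmfs/belle.kek.jp/grid/gbasf2/pro/tools/setup"

def get_gbasf2_setup_path(task):
    """Return the path of the gbasf2 setup script to source (sketch)."""
    install_directory = get_setting("gbasf2_install_directory", default=None, task=task)
    if install_directory is not None:
        warnings.warn(
            "The gbasf2_install_directory setting is deprecated, "
            "please use gbasf2_setup_path instead.",
            DeprecationWarning,
        )
        return os.path.join(install_directory, "BelleDIRAC/gbasf2/tools/setup")
    return get_setting("gbasf2_setup_path", default=DEFAULT_GBASF2_SETUP_PATH, task=task)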

condor_q getting job status failing results in exception

Jobs on HTCondor sometimes fail because condor_q occasionally returns a non-zero exit code. A colleague reported the stack trace below to me, but I had also seen the error once myself when running with over 1000 workers. I don't know the cause yet; my guess back then was that condor couldn't handle the sheer number of requests from that many workers, but this is just a guess.

@welschma, have you ever seen this, and do you have a suggestion regarding the cause?

Even if it is not our error, I think we could handle it more gracefully, for example by giving a better error message, or by retrying the job-status query a couple of times until some maximum number of retries or a timeout is reached (I did something similar for gbasf2 in #108). But for that, knowing the cause would be useful.

-- Failed to fetch ads from: <131.169.223.41:9618?addrs=131.169.223.41-9618+[2001-638-700-10df--1-29]-9618&alias=bird-htc-sched13.desy.de&noUDP&sock=schedd_1649_a7c2_7> : bird-htc-sched13.desy.de
SECMAN:2007:Failed to end classad message.
INFO: Worker Worker(salt=379961246, workers=30, host=naf-belle12.desy.de, username=alina, pid=11924) was stopped. Shutting down Keep-Alive thread
-- Failed to fetch ads from: <131.169.223.41:9618?addrs=131.169.223.41-9618+[2001-638-700-10df--1-29]-9618&alias=bird-htc-sched13.desy.de&noUDP&sock=schedd_1649_a7c2_7> : bird-htc-sched13.desy.de
SECMAN:2007:Failed to end classad message.
Traceback (most recent call last):
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/luigi/worker.py", line 1203, in run
    self._handle_next_task()
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/luigi/worker.py", line 1058, in _handle_next_task
    self._purge_children()  # Deal with subprocess failures
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/luigi/worker.py", line 1034, in _purge_children
    if not p.is_alive() and p.exitcode:
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/b2luigi/batch/processes/__init__.py", line 135, in is_alive
    job_status = self.get_job_status()
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 154, in get_job_status
    job_status = _batch_job_status_cache[self._batch_job_id]
  File "/cvmfs/belle.cern.ch/el7/externals/v01-10-00/Linux_x86_64/common/lib/python3.8/site-packages/cachetools/ttl.py", line 81, in __getitem__
    return self.__missing__(key)
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/b2luigi/batch/cache.py", line 28, in __missing__
    self._ask_for_job_status(job_id=None)
  File "/afs/desy.de/user/a/alina/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 46, in _ask_for_job_status
    output = subprocess.check_output(q_cmd)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-10-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/cvmfs/belle.cern.ch/el7/externals/v01-10-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['condor_q', '-json', '-attributes', 'ClusterId,JobStatus,ExitStatus']' returned non-zero exit status 1.
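A minimal sketch of the retry idea mentioned above, wrapping the condor_q call (function name, defaults and the retry strategy are only suggestions):

import subprocess
import time

def check_output_with_retries(cmd, max_retries=3, retry_delay_seconds=30):
    """Retry a flaky batch-system query a few times before giving up (sketch)."""
    for attempt in range(max_retries + 1):
        try:
            return subprocess.check_output(cmd)
        except subprocess.CalledProcessError:
            if attempt == max_retries:
                raise
            print(f"{cmd[0]} failed (attempt {attempt + 1} of {max_retries + 1}), "
                  f"retrying in {retry_delay_seconds} s ...")
            time.sleep(retry_delay_seconds)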

Fix grid/gbasf2 dataset download for gbasf2 v5 duplicated job outputs

When rescheduling jobs (e.g. after a job failure), gbasf2 doesn't remove the old job outputs but adds new outputs with an incremented number. Both are visible when running gb2_ds_list, but only the latest outputs are downloaded when running gb2_ds_get.

This leads to errors in the b2luigi Gbasf2Process._download_dataset() method, which checks whether the download was complete by comparing the output of gb2_ds_list with the list of downloaded files.

@Bilokin also noticed this error and notified the gbasf2 maintainers, so this issue will be addressed in BIIDCD-1240, but maybe we can implement a hotfix before that.

For any users reading this, a workaround might be to simply delete the duplicated remote files with gb2_ds_rm or to disable the check in b2luigi.

Deprecated `message` property for exceptions is used

In b2luigi.core.utils.map_folder an exception is re-raised with a modified exception message, where the original message is appended via ex.message.

    except AttributeError as ex:
        raise type(ex)(
            "Could not determine the current script location. "
            "If you are running in an interactive shell (such as jupyter notebook) "
            "make sure to only provide absolute paths in your settings.\nMore Info:\n" +
            ex.message
        ).with_traceback(sys.exc_info()[2])

In Python 3.9 I get an error that the exception has no message attribute; it has been deprecated since Python 2.6 and removed in Python 3 (see PEP 352).

The obvious solution would be to just use str(ex) or a format string like f"More info: {ex}". But before fixing this in a PR, my question is what the "best" way to re-raise an exception with a modified message is in Python 3. The raise type(ex)(...).with_traceback(...) construct seems a bit cumbersome to me; is it really necessary? I admit I don't fully understand it, so maybe @FelixMetzner, who introduced this code in #34 (thanks for that), can help.
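For reference, one possible way to rewrite the snippet in Python 3 (just a sketch, not necessarily the "best" way I'm asking about): use an f-string instead of ex.message and chain the exception with from, which preserves the original traceback so that the explicit with_traceback() call is not strictly needed.

def map_folder(input_folder):
    try:
        ...  # original logic that may raise an AttributeError
    except AttributeError as ex:
        # f"{ex}" replaces the removed ex.message attribute;
        # "from ex" chains the original exception and keeps its traceback
        raise type(ex)(
            "Could not determine the current script location. "
            "If you are running in an interactive shell (such as jupyter notebook) "
            f"make sure to only provide absolute paths in your settings.\nMore Info:\n{ex}"
        ) from ex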

Adapt gbasf2 batch to work with gbasf2 v5 and Rucio

The output directory structure changed in release 5 of gbasf2, and this needs to be incorporated into the automatic download procedure. The grid cannot be used anymore without Rucio, so I don't have to think too much about backwards compatibility. I will need to do some further testing and research to see whether there are other things we should change.

Belle II confluence links about the new gbasf2

Log outputs of gbasf2 wrapper

One major task that I would like to do in the future is to introduce proper logging to the gbasf2 batch. When I first wrote it, my idea was that the log_file_dir should only contain logs of the actual job on the batch, which is why I copy the grid output sandbox there by default. All the outputs of the gbasf2 wrapper itself are printed to stdout and stderr, same for the output of gbasf2 commands.

But in hindsight I think it would be best to also get all the gbasf2 output into log files, both the output of the gbasf2 subprocesses and all the warnings and printouts of the wrapper itself. The exceptions are gbasf2 commands that require user input, which at the moment is only the gb2_proxy_init command, where the user has to enter the password for their grid certificate. I think the submission and download commands also require user input if --force is disabled via settings. But it should be possible to both print the output and log it. Maybe I should create an issue for that.
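A minimal sketch of how the output of non-interactive gbasf2 subprocesses could be printed and logged at the same time (the helper name is made up):

import subprocess

def run_and_log(cmd, log_file_path):
    """Run a non-interactive command, echo its output live and append it to a log file (sketch)."""
    with open(log_file_path, "a") as log_file:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, encoding="utf-8"
        )
        for line in iter(proc.stdout.readline, ""):
            print(line, end="")   # keep the usual console output
            log_file.write(line)  # ... and also persist it
        return proc.wait()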

One issue which I've been eyeing for a long time but didn't take the time yet to implement is #5 ("Include proper task overview while running") for a gridcontrol-like task view. But then there would be no screen real-estate for very verbose print-outs, which would go to the log files or only to stdout if the task is e.g. run with --test or with the task overview disabled via some setting.

Originally posted by @meliache in #98 (comment)

Read the Docs not defaulting to the version on PyPI

I realized that the hosted documentation shows the latest (development) version by default. Since most users will install b2luigi from PyPI, it would be a good idea to show the documentation of the current release by default (if that's possible?). I ran into the problem that I used the new default settings keys implemented on the master branch, which are not in a release yet, and had to think a lot before I realized that I was not using the master branch but the current release version, which was not compatible with the new settings.

Add static linter to workflows

At least for error-detection.

In #78 @nils-braun added github workflows and mentioned

This can be expanded to also include things like stylechecks (pycodestyle, autopep8, black), isort, flake8 etc - but so far let's start with the basics.

I think what can't hurt in any way is to at least add static syntax checking for errors with pyflakes (the error-part of flake8) or the error-detection of pylint. I'm not against going the extra step to do style checks, but that is a bit more work since it involves discussing which style we actually want (e.g. line length), defining it in a config file and also mentioning it in the documentation, so imo this could be split into a separate PR.

I have little experience with GitHub workflows, but personally I have used pre-commit to do pre-commit checks locally. There's also a pre-commit action in the GitHub marketplace, and that would be my first idea for how to implement style checks, since I find it extensible and straightforward and would feel confident adding that myself. I just saw that there are also existing actions for e.g. flake8 that annotate the code, but I don't think we really need that. Or we could write our own actions. I don't know what the most common approach is.

Do not rely on same environment in batch nodes

Currently in the LSF task, the environment is copied to the batch nodes in all cases.
The user should have additional options (controllable by settings):

  • use a pregenerated environment file (execute a command which sets up the environment)
  • set additional variables from a dict
  • copy the environment

Number 3 is already implemented, and number 2 should just be a matter of modifying the environment when sending the file.
Number 1 means one needs to change the LSFTask, which is also quite simple. I can help here.
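A rough sketch of how the three options could look on the b2luigi side (the setting names env_script, env and copy_environment are only suggestions, and the environment parsing is deliberately naive):

import os
import subprocess

from b2luigi import get_setting

def create_batch_environment(task):
    """Build the environment dict to send with the batch job (sketch)."""
    env_script = get_setting("env_script", default=None, task=task)
    if env_script is not None:
        # option 1: source a pregenerated environment script and capture the result
        output = subprocess.check_output(
            ["bash", "-c", f"source {env_script} > /dev/null && env"], encoding="utf-8"
        )
        env = dict(line.split("=", 1) for line in output.splitlines() if "=" in line)
    elif get_setting("copy_environment", default=True, task=task):
        # option 3 (current behaviour): copy the environment of the submitting process
        env = dict(os.environ)
    else:
        env = {}
    # option 2: overlay additional variables given as a dict setting
    env.update(get_setting("env", default={}, task=task))
    return env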

Support setting grid input files via a text file

The gbasf2 parameter --input_dslist allows providing the input datasets on the grid via a text file containing a list of grid paths (LPNs), one per line, as an alternative to setting the input files on the command line. This is especially important since the Dirac dataset searcher allows creating such text files from data queries.

That's not supported by b2luigi currently, but I could add it. The simplest way would be to just add another gbasf2-specific setting and check that at least one of the input options is set.

When thinking about it, I had the idea that we could just create an ExternalTask for luigi input datasets. That would be quite elegant and could replace the gbasf2_input_dataset setting altogether.

My initial idea was that this option is not necessary and the user can just convert the file into a string via

gbasf2_input_dataset = ",".join(open("lpns.txt").read().splitlines())

But when using this in my own code I found out that overly long command-line options result in OS errors...
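A minimal sketch of the "additional setting" approach mentioned above (the gbasf2_input_dslist setting name and the -i flag are assumptions):

from b2luigi import get_setting

def build_input_dataset_args(task):
    """Return the gbasf2 command-line arguments for the input data (sketch)."""
    dataset = get_setting("gbasf2_input_dataset", default=None, task=task)
    dslist = get_setting("gbasf2_input_dslist", default=None, task=task)
    if (dataset is None) == (dslist is None):
        raise ValueError(
            "Please set exactly one of gbasf2_input_dataset or gbasf2_input_dslist."
        )
    if dataset is not None:
        return ["-i", dataset]
    return ["--input_dslist", dslist]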

Don't print verbose gbasf2 download log when download successful

if not self._local_gb2_dataset_is_complete(output_file_name, check_temp_dir=True, verbose=True):

I think what I wanted was to only print the verbose output if the local dataset is not complete, but I just realized that the verbose output is printed in any case, whether the download was successful or not 🤦. Not your fault, so I could fix this separately from this PR.

Originally posted by @meliache in #98 (comment)

Continue running other tasks when one gbasf2 download failed

The Gbasf2Process marks a luigi task that is associated with a gbasf2 project as successful when all the jobs in that project are DONE. However, what then remains is downloading the outputs of those jobs, and there it often happens that gb2_ds_get doesn't manage to download all files. Currently, luigi then validates the download, sees that the downloaded directory is missing files and then, instead of moving the downloaded dataset from the partial download directory to the final output directory, raises a RuntimeError that the download wasn't successful.

But that exception stops the whole luigi process and the processing of all tasks. It would be more convenient to let the processing of all other tasks/projects continue in such a case.

That could be solved by just printing a warning and doing an early return in the download function, so that the partial download is never moved to the output directory.

However, I must make sure that the process isn't at any point marked as successful just because the get_job_status() method returns successful: it is currently coupled to the status on the grid and not to the download status, since I use it to decide when to start the download. I'll have to think about how to solve that cleverly.
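A minimal sketch of the warn-and-return behaviour described above, assuming a helper that is called after the download attempt:

import shutil
import warnings

def finalize_download(partial_download_dir, output_dir, download_is_complete):
    """Move a complete download to its final location; otherwise warn and keep the
    partial download in place so that other tasks can continue (sketch)."""
    if not download_is_complete:
        warnings.warn(
            f"Download into {partial_download_dir} is incomplete. Leaving the partial "
            "download in place and not marking the task as complete.",
            RuntimeWarning,
        )
        return False
    shutil.move(partial_download_dir, output_dir)
    return True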

Improve dirac proxy validity time handling

When working with gbasf2 subprocesses in the gbasf2 BatchProcess, in addition to a gbasf2 environment most commands also need a valid dirac proxy. On the command line, users initialize it with gb2_proxy_init -g belle, then they get asked for their certificate password and the proxy stays alive for 24 hours. So in my wrapper, I check whether there is still a valid dirac proxy on the system and run the initialize command only if there is not. The main reason I did that was to ensure the proxy is valid, but I also wanted to reduce superfluous password prompts.

@philiptgrace made me aware here that users can set the proxy validity time explicitly via -v HH:MM and this can surprisingly be much more than 24 hours. He also suggested here that we could consider renewing the proxy not when the remaining validity time reaches 0, but before that for increased stability.

In addition to having a user option called something like gb2_proxy_time (to be passed to the -v flag), it may be useful to have an option gb2_proxy_minimum_time (defaulting to 0) to re-initialise the proxy if the time left is less than the given number of hours. The use case I am imagining is a long submission/monitoring/downloading session where the user doesn't want their proxy to time out part-way through.

I currently have a branch in progress where I want to make this check happen for every run_with_gbasf call, so I'm not sure how important this is, but I don't see why not.

  • add setting for proxy validity time
  • maybe increase default time
  • re-initialize the dirac proxy before the remaining validity time reaches 0, and add a setting for that minimum time (see the sketch below)
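A minimal sketch of the re-initialisation check from the list above, assuming get_proxy_info() exposes the remaining validity time (the secondsLeft key and the argument names are assumptions):

import subprocess

from b2luigi.batch.processes.gbasf2 import get_proxy_info

def ensure_proxy_alive(proxy_time="24:00", minimum_hours_left=0):
    """Re-initialize the dirac proxy if its remaining validity time is too short (sketch)."""
    try:
        seconds_left = get_proxy_info().get("secondsLeft", 0)  # assumed field name
    except Exception:
        seconds_left = 0
    if seconds_left < max(minimum_hours_left, 0) * 3600:
        # interactive: the user will be prompted for their certificate password
        subprocess.run(["gb2_proxy_init", "-g", "belle", "-v", proxy_time], check=True)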

fei_merge_files command is about to have its name changed

cmd = ["b2file-merge", "-f"]

The fei_merge_files command that is used in the ntuple merger task might have its name changed soon, because there's an open issue that analysistools should follow the b2... naming convention (maybe b2fei-merge-fei-output). I just made a small, not really related PR to fei_merge_files, and as a result the analysis librarian asked me if I could work on that issue.

Secondly, there's also a separate, already existing b2file-merge command. If possible I would suggest using that, since we aren't doing anything FEI-related, but I assume there's a reason I don't know of why we can't use it, maybe because of the restrictions described in its help string:

This program is intended to merge files created by separate basf2 jobs. It's
similar to hadd but does correctly update the metadata in the file and merges
the objects in the persistent tree correctly.

The following restrictions apply:

  • The files have to be created with the same release and steering file
  • The persistent tree is only allowed to contain FileMetaData and objects
    inheriting from Mergeable and the same list of objects needs to be present
    in all files.
  • The event tree needs to contain the same DataStore entries in all files.

Maybe support non-path basf2 configuration in gbasf2/grid batch

Currently I only send the basf2 path to the grid, but sometimes the user might want to call basf2 configuration functions which change the internal state of basf2 but are not stored in the path, and which might not be available via CLI arguments. Examples include

  • basf2 aliases (stored in the variable manager)
  • testing payloads
  • conditions
  • ... more?

In principle I could let the user add all of these via additional settings, e.g. let the user provide a dictionary with all the aliases and a list of testing payloads; the steering-file wrapper would then add the aliases and the testing payloads. But it's hard to anticipate everything the user might want, and it might bloat my code.
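As an illustration of the settings-based approach for aliases, a minimal sketch (the gbasf2_aliases setting name is made up, and the snippet assumes it runs inside the basf2 environment of the steering-file wrapper):

# hypothetical setting in settings.json:
#   "gbasf2_aliases": {"sigProb": "extraInfo(SignalProbability)"}

from variables import variables as vm  # basf2 variable manager

def apply_aliases(alias_dict):
    """Re-create the user's aliases on the worker node from a plain dict (sketch)."""
    for alias_name, variable_expression in alias_dict.items():
        vm.addAlias(alias_name, variable_expression)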

It would be nice if the user could provide a function or a file in which all the basf2 configuration functions they need are called. However, I'm not sure how to call that from the steering-file wrapper. I could send an additional basf2 configuration file to the grid, but I'm not sure how to import that configuration from the steering-file wrapper without forcing the user to supply a specific function name.

Another possibility would be to have the user supply their own steering-file template. Code-wise this seems to be the simplest solution, so I already did a test implementation on a branch, but it is dangerous and would require careful documentation; here is my first attempt:
[screenshot of the test implementation, 2020-05-27]
But I am not sure I like this solution; it's not something normal users should have to resort to for some simple configuration.

I would be happy about suggestions if you have ideas for how to support basf2 configuration without adding much complexity, in a way that is simple for the user. I would be happy to implement that myself.

Allow per-task configuration of gbasf2 release

I have a particular use-case for this feature request: I would like to use release-05 for generating ntuples (to access newly introduced variables), but the FEI skimming of signal samples must be done with release-04 (since FEI performance changed significantly with the redefinition of some angular variables between releases, and it needs to be retrained).

This would not be a large change. I'm picturing a task definition like the following:

class GridFEISkim(Basf2PathTask):
    batch_system = "gbasf2"
    gbasf2_download_dataset = False
    gbasf2_release = "release-04-02-08"

where _build_gbasf2_submit_command checks for the attribute gbasf2_release. However, this would mean that gbasf2_release exists as both a setting and an attribute, which is maybe not ideal.

Rename `MasterTask`s in examples to something less culturally charged

We have already renamed our main branch to main, and we could follow through by replacing MasterTask in all documentation examples where the implementation of luigi.WrapperTask is named MasterTask. Suggestions:

  • MainTask
  • WrapperTask -- In this case the base name would be the same as the class it inherits from, just in a different namespace. Maybe it would be better to add a meaningful prefix to convey what it does, e.g. ReconstructionWrapperTask or AnalysisWrapperTask

b2luigi file paths depend on order in which parameters are declared

It's nice that the directory structure contains the metadata on the used luigi parameters. It might be intended behavior that the order of the parameters in the path is the order in which they are declared. However, this means that you can get very different behavior of your task depending on the order in which you declare the parameters. It's best explained with this example:

import b2luigi

class TaskXY(b2luigi.Task):
    "Some task requiring some luigi parameters"
    x = b2luigi.Parameter()
    y = b2luigi.Parameter()

    def output(self):
        yield self.add_to_output("output.txt")

    def run(self):
        with open(self.get_output_file_name("output.txt"), "w") as file:
            file.write(f"{self.x}{self.y}")

class TaskYX(b2luigi.Task):
    "Same as task YX, but declaring parameters in a different order"
    y = b2luigi.Parameter()
    x = b2luigi.Parameter()

    def output(self):
        yield self.add_to_output("output.txt")

    def run(self):
        with open(self.get_output_file_name("output.txt"), "w") as file:
            file.write(f"{self.x}{self.y}")

class MasterTask(b2luigi.WrapperTask):
    def requires(self):
        yield TaskYX(x=1, y=2)
        yield TaskXY(x=1, y=2)

if __name__ == "__main__":
    b2luigi.process(MasterTask(), workers=1)

This task results in the following directory structure:

├── x=1
│   └── y=2
│       └── output.txt
└── y=2
    └── x=1
        └── output.txt

The output output.txt is thus duplicated, even though it is created both times from the same parameter values (so the parameters are effectively not treated as a set) and contains the same string "12" in both cases. However, when you change TaskYX so that it declares first x, then y, you get the expected output:

└── x=1
    └── y=2
        └── output.txt

You can't change the fact that the signature of a task/class depends on the order of its class-level members. But maybe, when creating the output paths, it is possible to sort the parameters somehow (e.g. alphabetically). I imagine this might be problematic when incrementally adding new parameters in subsequent runs, resulting in a need to re-shuffle the directory structure to make room for them. Maybe this is a fundamental limitation of the simple directory-tree approach and changing it might make things worse? At least it does not produce wrong results, only duplication, which does not do too much harm.
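For illustration, a sketch of what sorting could look like when building an output path from a task's parameters (this is not how b2luigi currently builds its paths, just the idea):

def sorted_parameter_path(task):
    """Build the 'name=value/...' path fragment with parameters sorted by name,
    independently of the order in which they were declared (sketch)."""
    parameter_names = sorted(name for name, _ in task.get_params())
    return "/".join(f"{name}={getattr(task, name)}" for name in parameter_names)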

Raise error if `gb2_proxy_init` prints error message

gb2_proxy_init (like other gbasf2 commands) often fails without a non-zero exit code, only printing error messages to stdout. Thus, just running subprocess.run(..., check=True) doesn't raise a CalledProcessError, even though the proxy initialization frequently fails. If it fails, the errors often only get noticed later, e.g. when trying to get the dirac username.

One challenge with implementing this is that I need to capture the output, but at the same time I have to make sure the output is still printed to stdout and that the user can provide input.

Frequent errors:

  • Bad passphrase

    Generating proxy...
    Enter Certificate password:
    Bad passphrase
    Error: Operation not permitted ( 1 : )
    
    
    • Retry on bad passphrase
  • Can't find Certificates / Keys:

    Generating proxy...
    Can't find user certificate and key
    Error: Operation not permitted ( 1 : )
    
    • Raise CalledProcessError when keys not found
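A rough sketch of how the output could be scanned for these error messages while still being echoed to the user (this glosses over prompts that are not newline-terminated and assumes the password prompt reads from the terminal directly):

import subprocess

def run_proxy_init_checked():
    """Run gb2_proxy_init, echo its output and raise if a known error message appears,
    even when the command itself exits with code 0 (sketch)."""
    proc = subprocess.Popen(
        ["gb2_proxy_init", "-g", "belle"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, encoding="utf-8",
    )
    captured_lines = []
    for line in iter(proc.stdout.readline, ""):
        print(line, end="")  # keep the normal console output visible to the user
        captured_lines.append(line)
    returncode = proc.wait()
    output = "".join(captured_lines)
    if "Bad passphrase" in output or "Can't find user certificate" in output:
        raise subprocess.CalledProcessError(returncode or 1, proc.args, output=output)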

Issue with creating aliases in gbasf2 v5r1p1

A new version of gbasf2 for Belle II, v5r1p1, has been released, allowing for compatibility with the new releases of basf2 (light-2106-rhea) that run on a new externals package. However, this causes aliases to no longer be set correctly for any gbasf2 steering tasks.
I have attached a minimal (not-)working example to recreate this issue on the new gbasf2 release, where the regular LSF task will run, but the gbasf2 task will not run unless lines 19 and 20 are commented out.

Excerpt of error:
File "/home/belle/dfer/.local/lib/python3.8/site-packages/b2luigi/batch/processes/gbasf2_utils/pickle_utils.py", line 13, in
alias_dictionary = {alias_name: vm.getVariable(alias_name).name for alias_name in list(vm.getAliasNames())}
cppyy.ll.SegmentationViolation: const Belle2::Variable::Manager::Var* Belle2::Variable::Manager::getVariable(string name) =>
SegmentationViolation: segfault in C++; program state was reset

basic_pipeline.txt

Issue with large projects separated into subfolders

Hello,

thanks for adding support for multiple sub-directories in grid projects, i.e. projects that have multiple sub* folders in their output. Unfortunately, I think there is a bug linked to it:
When b2luigi moves the downloaded output from the temporary folders sub00, sub01, ..., subXX to the output folder, let's call it result/, the contents of the sub00 folder do not end up in result/ directly; I suspect they end up in result/sub00 instead.
I tried to run this code in the Python interpreter using sub00 and sub01 input folders and a non-existing output folder ./result, and I get the following tree:

./result
    sub01_content.root
    sub00/
         sub00_content.root

This is not what the downstream tasks expect; I would prefer the output folder to contain all files from all sub* folders directly. There shouldn't be any conflicts in the file names, but one should check for that anyway.
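A minimal sketch of the flattening behaviour I would expect, including a check for file-name conflicts:

import os
import shutil
from glob import glob

def flatten_sub_directories(result_dir):
    """Move all files from result_dir/sub*/ directly into result_dir (sketch)."""
    for sub_dir in sorted(glob(os.path.join(result_dir, "sub*"))):
        for file_path in glob(os.path.join(sub_dir, "*")):
            target_path = os.path.join(result_dir, os.path.basename(file_path))
            if os.path.exists(target_path):
                raise RuntimeError(f"File name conflict while flattening: {target_path}")
            shutil.move(file_path, target_path)
        os.rmdir(sub_dir)  # remove the now-empty sub* directory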

Proposal: Make downloaded gbasf2 dataset persist after failure

Currently, we download the gbasf2 output into a temporary directory, since this output directory is used as the output() of the task and we want to prevent falsely marking the task as done. When the download fails, the output directory disappears.

However, it turns out that downloading gbasf2 datasets is quite an operation: it can take a long time and is prone to failure. On the other hand, gbasf2 is smart enough to recognize existing local files and only download those that are missing from local storage (similar to rsync). So it would be useful to have the local temporary directory persist and only move it to the final output location when the download was successful.

The download directories can be quite large, so it would be good to warn the user about their existence and path. And maybe a setting to disable (or enable) them would be good.

@philiptgrace suggested using multiple retries in my _download_dataset method, but I think with the proposed addition the users could just use the built-in task retry-policy via the retry_count setting.

Maybe we could include that in a setting or link to it to make the users aware of it.

`get_dirac_user()` fails due to empty `get_proxy_info()`

Jake Bennet tried using b2luigi with gbasf2 for the systematic corrections framework and reported the following error:

Traceback (most recent call last):
File "/home/belle2/jbennett/.local/lib/python3.6/site-packages/b2luigi/batch/processes/gbasf2.py", line 947, in get_dirac_user
return get_proxy_info()["username"]
KeyError: 'username'

It seems like get_proxy_info() returns an empty dictionary, which happens when there is a CalledProcessError. I'm not sure whether that's because the proxy hasn't been initialized before get_dirac_user() is called or because of some other subprocess error. Gbasf2 and basf2 by themselves supposedly work for Jake, and he tried initializing the proxy in the terminal.

I would be super happy if @philiptgrace, who contributed the code, could also help or comment.

One suggestion from me for making debugging easier in the future is to not catch the CalledProcessError (and return {} instead) in the get_proxy_info() function. In the places where we want to ignore the error, it might be better to ignore it via try...except only there; e.g. in get_dirac_user it would have been useful to see the CalledProcessError.

I just asked Jake to try

from b2luigi.batch.processes.gbasf2 import get_proxy_info, setup_dirac_proxy
setup_dirac_proxy()
print(get_proxy_info())

and am currently waiting for feedback. He already tried the above without the setup_dirac_proxy() call and got an empty dictionary.

  • find the cause of the error
  • fix the error
  • #139
  • #140

Improvement: Parallelize gbasf2 dataset download

@welschma reported that gbasf2 output dataset downloads from the grid take a long time, since each job has its own output file, potentially stored on a different grid site, and the files are downloaded sequentially. Max managed to increase the download speed on the command line by parallelizing the downloads.

Points for discussion:

  • Are there any plans to implement this in gbasf2? Maybe someone should ask the devs.
  • Do parallel downloads have negative side-effects? I imagine this might increase the network I/O load on the KEKCC/NAF clusters and on the grid sites if all users did that, so maybe that is why it is not officially implemented?
  • How to partition the file download? Should each process just get a list of N_outputs / N_processes files to download? This is not optimal if one download process finishes much earlier than another. So maybe it is better to download all files separately and have each free process pick the next file from the pool of not-yet-downloaded files (see the sketch below). This might have disadvantages if there's an overhead to starting a single gbasf2 download process per file, but I don't know about that.
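A minimal sketch of the pool idea from the last point, assuming gb2_ds_get can be called for a single LPN at a time:

import subprocess
from concurrent.futures import ThreadPoolExecutor

def download_single_file(lpn):
    """Download one grid file; assumes gb2_ds_get accepts a single LPN (sketch)."""
    subprocess.run(["gb2_ds_get", "--force", lpn], check=True)

def download_dataset_in_parallel(lpns, max_workers=4):
    """Let each free worker pick the next not-yet-downloaded file from the pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all downloads and re-raises the first download error, if any
        list(pool.map(download_single_file, lpns))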

Add `CHANGELOG.md` generated from github releases

Add a CHANGELOG.md file similar to the format promoted by https://keepachangelog.com. One advantage is greater portability, e.g. if someone forks the project to a non-GitHub forge. Another advantage is additional offline visibility, e.g. for people who update b2luigi via git pull and don't subscribe to the releases.

The advantages are imo not worth a lot of duplicated effort with every release; however, maybe creating a changelog file from GitHub releases can be automated. I did a quick Google search and there are many projects for that, e.g.

Default BoolParameter has a default already set

Quite inconveniently, luigi's BoolParameter already includes a default, which is False.

One could overload the BoolParameter in b2luigi and set the default to _no_value again.

This is a very simple task and a perfect first contribution to the project.
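A possible sketch of the overload (it relies on the luigi internals _no_value and _default, so treat it as an illustration only):

import luigi
from luigi.parameter import _no_value

class BoolParameter(luigi.BoolParameter):
    """BoolParameter without luigi's implicit default of False (sketch)."""

    def __init__(self, *args, **kwargs):
        # the first positional argument of Parameter.__init__ is the default
        has_explicit_default = bool(args) or "default" in kwargs
        super().__init__(*args, **kwargs)
        if not has_explicit_default:
            # undo the default=False that luigi.BoolParameter sets implicitly
            self._default = _no_value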

Failed gbasf2 grid download because failed_files.txt contains directory

All jobs of my task were successful and it tried to download the job outputs, but the download failed with the following message:

Downloading remaining files from dataset with command  gb2_ds_get --force /belle/user/meliache/2021-06-106ac9f82805/sub00/upsilon_4s_*.root --input_dslist /nfs/dust/belle2/user/meliache/git_hash=light-2002-ichep/
job_name=2021-06-10/input_files=hashed_eb2e36a81e191f2c9fe4b2962bb9de29/mc=True/fei_charge_type=B0/upsilon_4s.root.partial/failed_files.txt
No file found

/afs/desy.de/user/m/meliache/.local/lib/python3.6/site-packages/b2luigi/batch/processes/gbasf2.py:325: RuntimeWarning: RuntimeError('No output data for gbasf2 project 2021-06-106ac9f82805 found.',)
  warnings.warn(repr(err), RuntimeWarning)

I checked the contents of failed_files.txt and it only contains the line:

/belle/user/meliache/2021-06-106ac9f82805/sub00/upsilon_4s_00039_job194473040_00.root in /nfs/dust/belle2/user/meliache/git_hash=light-2002-ichep/job_name=2021-06-10/input_files=hashed_eb2e36a81e191f2c9fe4b2962bb9de29/mc=True/fei_charge_type=B0/upsilon_4s.root.partial/2021-06-106ac9f82805/sub00

This seems to be caused by PR #91 by @ArturAkh, which introduced this log file and included some parsing of the command output, which might have failed. Judging from the helper method _failed_files_from_dataset_download, the file should only contain root files, but maybe something went wrong in all the string splitting?

Future proof gbasf2 batch by downloading all sub<xy> directories

At the moment, the gbasf2 batch process expects all output files of a project to be in a sub00 subdirectory. However, it is planned that for gbasf2 projects with more than 1000 outputs, additional sub<xy> directories will be created, so it would be good for the gbasf2 output download to look for such directories as well. At the moment this is just future-proofing, so I don't plan on working on it immediately, but I created this issue as a reminder for myself and for b2luigi gbasf2 users.

I found this out from an email on comp-users-forum:

Subject: Re: [comp-users-forum:1701] [dc_operations] new gbasf2 folder structure issue
Date: Wed, 10 Mar 2021 23:44:23 CET

If your jobs produce single-output, then you will have 1K files per datablock
(subXX). Of course for now we are still limiting the jobs per project to 1K
while we validate we can proceed with massive submission, then for now the
output is expected to be only inside sub00.

Nothing prevents that jobs have multiple output, but in that case, in the current implementation you will see more > 1K files per datablock.
We are working to fix this, by design only 1K files are expected inside the datablock and some of our tools may not work properly.

Implementing support for the other sub<xy> directories will be easier once gbasf2 allows downloading files without explicitly providing the sub-directory. It might also be achieved with wildcard downloads, which hopefully will become more stable in the future.

Include new batch systems

At the moment, there is only a reference implementation for LSF, although it should be very easy to include more batch systems.
Some examples include:

  • htcondor
  • grid (e.g. dirac)
  • Amazon Batch
  • Amazon Lambda
  • Google Cloud
  • Azure

It would also be very nice to have a batch system discovery at some point.

Handle udst/mdst etc. data level outputs in gbasf2 v5

This is an extension of fix #56. Gbasf2 release 5 with Rucio has a new directory structure. b2luigi currently expects the outputs on the grid to be in the sub00 subdirectory, but mdst and udst data-level files get moved into additional subdirectories, as seen in the following output structure of a test project that I ran:

/belle/user/meliache/test_new_gb2_7
    ├─ mdst
    │   ╰─ sub00
    │       ╰─ output_mdst_00000_job180712020_00.root
    ├─ sub00
    │   ├─ B_ntuple_00000_job180712020_00.root
    │   ├─ D_ntuple_00000_job180712020_00.root
    │   ╰─ output_datastore_00000_job180712020_00.root
    ╰─ udst
        ╰─ sub00
            ╰─ output_udst_00000_job180712020_00.root

Respect parameter visibility in output

In one of my jobs I have quite a long DictParameter.

b2luigi.DictParameter(hashed=True, visibility=ParameterVisibility.HIDDEN)

The hidden visibility means that the parameter is not shown in the scheduler (at least in the web GUI) while still being significant, but it is still shown in the b2luigi standard-output task status updates, which doesn't make for a nice overview.

[screenshot of the task status output]

(As a workaround it's possible to just use the hash of the dict as a string parameter, so that we get the hash in the output.)

Replacing `gb2_ds_get` output parsing by `--failed_lfns` option to get file of failed LFNs

When downloading a grid dataset with gb2_ds_get, gbasf2 lists the files (LFNs) that failed to download. @ArturAkh implemented a feature where we parse the output to save those failed LFNs in a text file. When the download is retried, only the files from this text file are downloaded.

Since a recent release, gb2_ds_get has the --failed_lfns option to generate a text file of failed LFNs. I think it would be good to just use that to avoid the string parsing, which works for now but breaks easily when the output format changes between releases. Sadly it's not easy to provoke failed LFNs during a download, so I'm not sure how that file looks, but I expect it to contain one LFN per line. Is the file even created when no downloads failed? We should probably write the solution to be prepared for both possibilities, an empty file and a non-existing file.

If @ArturAkh wants to help, I'd be happy, but otherwise I might also give this a look in the future.
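A minimal sketch of reading that file defensively, assuming one LFN per line:

import os

def read_failed_lfns(failed_lfns_path):
    """Return the list of failed LFNs, coping with an empty or missing file (sketch)."""
    if not os.path.isfile(failed_lfns_path):
        return []
    with open(failed_lfns_path) as failed_lfns_file:
        return [line.strip() for line in failed_lfns_file if line.strip()]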

Work with different non-file luigi targets

@Bilokin reported that he created a custom database target and found that he had to add a dummy self.path attribute and a dummy makedirs() method to get it to work; otherwise exceptions were raised.

I was surprised: I had expected that as long as no b2luigi-specific output methods like self.add_to_output() are used and b2luigi is basically only used for running on remote batch systems, all luigi targets would be supported. But I haven't looked at this properly yet, so I'm not sure whether this should be considered a bug in b2luigi. We should probably create a minimal example to reproduce the error first.

Enhancement: Async gbasf2 submission and/or download to avoid delay of scheduling

The gbasf2 submission and dataset download operations take a long time. Even when remote workers work in parallel, scheduling happens serially by default. (Except when the parallel_scheduling config option is set to true. However, this didn't work for me; if you had success with it, please message me.) The long gbasf2 submission and the dataset download seem to block the scheduling until that operation is done. This is something that I can live with, since usually only a few gbasf2 projects are required, but it would be cool to do something about it.

The gbasf2 dataset download is currently triggered in the get_job_status method as a subroutine call when the gbasf2 project is all done. Maybe we can initiate the download as an async subprocess and only mark the job as really complete when the download is done, at least when the gbasf2_download_dataset b2luigi option is set.

Something similar might be done for the submission.
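A rough sketch of the non-blocking download idea (class and method names are made up, and interactive prompts as well as error handling are ignored here):

import subprocess

class AsyncDownload:
    """Start gb2_ds_get in the background and poll it instead of blocking (sketch)."""

    def __init__(self):
        self._download_process = None

    def start(self, download_cmd):
        if self._download_process is None:
            self._download_process = subprocess.Popen(download_cmd)

    def is_done(self):
        # poll() returns None while the subprocess is still running
        return self._download_process is not None and self._download_process.poll() == 0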

This is not easy and I don't know if we can do both cases. The subprocess sometimes might require user input, e.g. a CA-certificate or SSH key password, and this should still work. Error handling should also be thought about. As I don't have much experience with async subprocesses, I'd be happy about help.

If I'm just using parallel_scheduling wrong and these blocking operations are no problem once it is properly enabled, then this can be closed. (Though parallel_scheduling also only works for picklable tasks.)

Decrease waiting time and number of calls in LSF

Depending on the concrete tasks, LSF batch tasks may wait a long time until they are marked as finished, as not all processes are checked all the time. This time could be decreased by reducing some waiting times or wait conditions, or by including a check for all tasks at once.

Tasks fail if part of the outputs (but not all) already exist.

When a task (with b2luigi.on_temporary_files) has outputs A and B, and A is missing but B exists, moving the temporary files after the task has run successfully fails because one of the outputs already exists.

Due to the nature of on_temporary_files, this doesn't happen naturally when running tasks because the final outputs can only exist if the task had been successful and thus all outputs should be there. However, this can happen when changing output definitions or when deleting some of the outputs but not others manually.

I solved this in my tasks by extending the process method with a clean-up step, but maybe this is something we want to fix in the moving of temporary files.

Or is this behaviour a feature, and do we want the users to be explicit about what to do with existing data to prevent losing it? After all, luigi itself does not clean up outputs either. In any case, if the user uses the temporary-file wrapper and force-writes the output in the process method, this will not help, because they will only force-write the temporary file location, so I think we should fix that...

Allow dash in gbasf2 project names

Hello @meliache,

thanks a lot for the gbasf2 implementation in b2luigi, it is incredibly useful to us!
I have a small feature request: to allow the "-" symbol in gbasf2 project names that are submitted by b2luigi.
It is an allowed symbol in BelleDIRAC and it would increase the readability of project names.

Also, it would be nice to have an option to not add a random string at the end of the project name, but to leave the project name entirely up to the user. The motivation is that project names in BelleDIRAC can only be up to 32 characters long, and if the limit is exceeded, the project name is just truncated without any notification.

Fix sphinx warnings when building docs

When building the sphinx documentation (by running make html in the docs dir, using sphinx 3.0.2) I get the warnings shown below. I think it should not be much work to fix them.
Further, it would be good to have an automated CI procedure (test, hook or whatever) that does the build and disallows commits/PRs which introduce warnings.

/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/core/task.py:docstring of b2luigi.Task.get_input_file_names_from_dict:13: WARNING: Definition list ends without a blank line; unexpected unindent.
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/core/task.py:docstring of b2luigi.Task.get_input_file_names_from_dict:18: WARNING: Definition list ends without a blank line; unexpected unindent.
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/core/task.py:docstring of b2luigi.Task.get_input_file_names_from_dict:20: WARNING: Unexpected indentation.
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/core/task.py:docstring of b2luigi.Task.get_input_file_names_from_dict:21: WARNING: Block quote ends without a blank line; unexpected unindent.
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/core/task.py:docstring of b2luigi.Task.get_input_file_names_from_dict:27: WARNING: Definition list ends without a blank line; unexpected unindent.
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/batch/processes/gbasf2.py:docstring of b2luigi.batch.processes.gbasf2.Gbasf2Process:53: WARNING: Unexpected indentation.

/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/basf2_helper/data.py:docstring of b2luigi.basf2_helper.data.CdstDataTask.output:13: WARNING: undefined label: task.output (if the link has no caption the label must precede a section header)
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/basf2_helper/data.py:docstring of b2luigi.basf2_helper.data.DstDataTask.output:13: WARNING: undefined label: task.output (if the link has no caption the label must precede a section header)
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/basf2_helper/data.py:docstring of b2luigi.basf2_helper.data.MdstDataTask.output:13: WARNING: undefined label: task.output (if the link has no caption the label must precede a section header)
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/basf2_helper/data.py:docstring of b2luigi.basf2_helper.data.RawDataTask.output:13: WARNING: undefined label: task.output (if the link has no caption the label must precede a section header)
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/basf2_helper/data.py:docstring of b2luigi.basf2_helper.data.SkimmedRawDataTask.output:13: WARNING: undefined label: task.output (if the link has no caption the label must precede a section header)
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/basf2_helper/tasks.py:docstring of b2luigi.basf2_helper.tasks.MergerTask.output:13: WARNING: undefined label: task.output (if the link has no caption the label must precede a section header)
/home/michael/basf2/externals/v01-09-01/Linux_x86_64/common/lib/python3.6/site-packages/b2luigi/basf2_helper/tasks.py:docstring of b2luigi.basf2_helper.tasks.SimplifiedOutputBasf2Task.output:13: WARNING: undefined label: task.output (if the link has no caption the label must precede a section header)

Include proper task overview while running

luigi prints a lot of messages to the screen while running, and the important ones (which jobs are finished, which have failed etc.) may get lost.
It would be nice to have a kind of "console overview" with the current status, like gridcontrol has. The failed tasks could for example be annotated with their corresponding log files.
I thought about using a multiprocessing.Queue to send information on some task events (e.g. on failure, on creation etc.) as JSON and read it in later.
One could use the same mechanism for accessing b2luigi from jupyter notebooks, which would be a very nice additional feature.
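A minimal sketch of the Queue idea using luigi's event handlers (the message format is just an example):

import json
import multiprocessing

import luigi

status_queue = multiprocessing.Queue()

@luigi.Task.event_handler(luigi.Event.SUCCESS)
def _on_success(task):
    # a console overview (or a jupyter frontend) could consume these messages later
    status_queue.put(json.dumps({"task": task.task_id, "event": "success"}))

@luigi.Task.event_handler(luigi.Event.FAILURE)
def _on_failure(task, exception):
    status_queue.put(json.dumps(
        {"task": task.task_id, "event": "failure", "error": str(exception)}))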

Automate deployment

Automate the bumpversion, push and publishing to PyPI, which are currently separate steps as described in the development documentation. That documentation previously contained the comment by @nils-braun:

At a later stage, I will try to automate this.

But as far as I know this has not been implemented yet. I removed that comment in the development-documentation PR #73 and decided to turn it into this issue for more visibility and to keep the docs up to date.

Replace gbasf2 output string parsing with our own scripts interacting with gbasf2

Assigning this to myself, but @philiptgrace was also interested and asked about the string parsing here.

Interacting with gbasf2 has to happen via subprocesses, there is no way around that, because gbasf2 runs with a different environment and a different PYTHONPATH than b2luigi. Often I was just lazy and called the user scripts provided by gbasf2 from the gbasf2 BatchProcess via subprocesses, capturing and parsing their output when I needed it.

But it should be more stable to write, for each place where I need it, a short Python 2 script that directly imports and works with the BelleDIRAC Python package available in the gbasf2 environment. Then I could run these scripts as subprocesses and obtain information e.g. via JSON or return codes.
It is possible to just import the gbasf2 Python packages and work with them.

Places where I could replace the string parsing with my own scripts:

  • check if the dirac proxy is initialized, instead of parsing gb2_proxy_info here -> PR #51
  • for getting the list of filenames in the output dataset instead of using gb2_ds_list here
  • for checking if an output dataset even exists on the grid instead of this
  • checking if the project already exists on the grid here -> PR #51
  • for getting the dirac user in get_dirac_user()

Set exit code of `b2luigi.process` according to pipeline result, not scheduling result

I'm using pytest and want to also test my full luigi pipeline. Currently I'm facing two issues here:

  • Issue 1: Calling process more than once: Once there is more than one test, b2luigi will make a fuss about multiple process calls.
  • Issue 2: process doesn't return exit code: Is there a clever way to figure out if the tasks ran successfully?

Currently I'm using two workarounds, neither of which are particularly pretty:

import b2luigi as luigi
import b2luigi.cli.process


def test_real_conversion_set(capsys):
    ...
    luigi.process(
        ...,
        batch=False,
        workers=1,
        ignore_additional_command_line_args=True,
    )
    captured = capsys.readouterr()
    assert "This progress looks :)" in captured.err
    # Deactivate the "run process only once" check, which might be useful IRL
    # but doesn't work with >= 2 tests
    luigi.cli.process.__has_run_already = False

This works fine, but eventually prettier solutions would be nice of course.

I also wanted to open this issue to document the workarounds. I could also open a PR for some documentation on that if you want.
