
fast-curator's Introduction


fast-curator

Create, read and write dictionary descriptions of input datasets to process. Currently all datasets are expected to be built from sets of ROOT Trees.

Installing

pip install --user fast-curator

Usage

# Local files:
fast_curator -o output_file_list.txt -t tree_name -d dataset_name --mc input/files/*root

# A single XRootD file:
fast_curator -o output_file_list.txt --mc root://my.domain.with.files://input/files/one_file.root

# XRootD files with several globs:
fast_curator -o output_file_list.txt --mc root://my.domain.with.files://inp*/files/*.root

Notes:

  1. If the command is called multiple times with the same output file (using the -o option), the additional files specified will be appended to the output file.
  2. Arbitrary meta-data (such as cross-section, data quality, generator precision, etc.) can be added to each dataset with the -m option.

For more guidance try the built-in help:

fast_curator --help

Reading dataset files back

import fast_curator
datasets = fast_curator.read.from_yaml("my_dataset_file.yml")

This returns a list of datasets, with the default section applied to each dataset.

Further Documentation

Is on its way...

fast-curator's People

Contributors

alexander-held, benkrikler, kreczko, seriksen

fast-curator's Issues

Add method / tool to convert set of dataset configs to a pandas dataframe

The YAML-based configs that curator handles are intended to contain useful meta-data that is specific to each dataset. This is currently (to my knowledge) only used in the main processing of the input ROOT trees. However, the same meta-data can be useful in later stages, for example to apply dataset-specific scale factors (e.g. a cross section). It would be helpful and relevant to curator if it could convert between the YAML format and a pandas dataframe, possibly containing a subset of fields.

The scope of this issue is first and foremost to extract metadata from the yaml configs, but elsewhere the reverse process has been requested: adding meta-data to the configs using a table / spreadsheet / dataframe. Depending on the implementation, that might be (partially) addressed as well.
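A minimal sketch of the extraction step, assuming each dataset has already been turned into a plain dict (for Namespace-style objects, vars() would do that). The function name datasets_to_rows and the example fields are hypothetical and not part of fast-curator.

```python
def datasets_to_rows(datasets, fields=None):
    """Flatten dataset descriptions into one row (dict) per dataset.

    `fields` optionally restricts the output to a subset of columns.
    """
    rows = []
    for dataset in datasets:
        # Drop the (potentially long) file list; keep the other meta-data.
        row = {key: value for key, value in dataset.items() if key != "files"}
        if fields is not None:
            # Missing fields become None so every row has the same columns.
            row = {key: row.get(key) for key in fields}
        rows.append(row)
    return rows


# Hypothetical example content, not taken from a real config.
example = [
    {"name": "ttbar", "eventtype": "mc", "nevents": 1000, "xsec": 831.76, "files": ["a.root"]},
    {"name": "data_2018", "eventtype": "data", "nevents": 500, "files": ["b.root"]},
]
rows = datasets_to_rows(example, fields=["name", "eventtype", "xsec"])
```

The resulting list of dicts can be passed directly to pandas.DataFrame(rows) to obtain the table. The reverse direction (table to configs) is not covered by this sketch.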

Wildcard support for trees

In a typical ATLAS workflow, files may contain many trees that are built with varying detector configurations (and then used for systematic uncertainties):

file.root
  - nominal
  - variation_1 
  - variation_2

It would be convenient to support wildcards for tree names to avoid having to process each tree specifically.

I am not sure how to best organize multiple trees attached to a single dataset, since they contain different information. One possibility might be to build the dataset name from the -d option plus the tree name. All datasets belonging together could be identified with additional metadata. Another option might be adding another layer, where the dataset can contain multiple "sub-datasets", which then each correspond to a specific tree.
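The first possibility (dataset name built from the -d value plus the tree name) could look roughly like the sketch below. It uses shell-style wildcards from the standard library; the function name and the underscore separator are assumptions made for illustration only.

```python
from fnmatch import fnmatchcase


def expand_tree_pattern(dataset_name, pattern, available_trees):
    """Return {derived dataset name: tree name} for every tree matching `pattern`."""
    matches = [tree for tree in available_trees if fnmatchcase(tree, pattern)]
    return {f"{dataset_name}_{tree}": tree for tree in matches}


trees_in_file = ["nominal", "variation_1", "variation_2"]
expanded = expand_tree_pattern("ttbar", "variation_*", trees_in_file)
```

Grouping the derived datasets afterwards would still need some shared meta-data, such as the original dataset name stored on each of them.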

`this_dir` in import statements fails when the sample file is in the current directory

Imported from gitlab issue 7

this_dir fails to expand properly for files in an import section if the importing sample file is contained in the current working directory, e.g.:

$ fast_curator_check control_regions_15_11_18/all_samples.yml 
Traceback (most recent call last):
  File "/users/bk17414/.local/bin/fast_curator_check", line 11, in <module>
    sys.exit(main_check())
  File "/users/bk17414/.local/lib/python2.7/site-packages/fast_curator/__main__.py", line 69, in main_check
    datasets += read.from_yaml(infile)
  File "/users/bk17414/.local/lib/python2.7/site-packages/fast_curator/read.py", line 37, in from_yaml
    datasets_dict = _load_yaml(path)
  File "/users/bk17414/.local/lib/python2.7/site-packages/fast_curator/read.py", line 29, in _load_yaml
    with open(path, 'r') as f:
IOError: [Errno 2] No such file or directory: 'control_regions_15_11_18/all_samples.yml'
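One plausible cause, which the report does not confirm, is that the directory is derived from a relative path: os.path.dirname returns an empty string for a bare file name. The sketch below shows the difference; containing_dir is a hypothetical helper, not a fast-curator function.

```python
import os


def containing_dir(path):
    """Directory of `path`, resolved against the current working directory."""
    return os.path.dirname(os.path.abspath(path))


naive = os.path.dirname("all_samples.yml")   # empty string for a bare file name
robust = containing_dir("all_samples.yml")   # absolute path of the working directory
```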

Remove the rootpy version of this package

Imported from gitlab issue 4

We might be able to drop the rootpy version of this package now completely, since the uproot version has all the same features as of gitlab:!3. Maintaining two versions of the code is extra effort, as #1 implies, and I don't expect the rootpy version will generally be used as much.

Main functions ignore `args`

Imported from gitlab issue 9

The functions in __main__.py all accept an "args" parameter, which should be a list of arguments similar to sys.argv. However, this parameter is completely ignored within the code.
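A common way to honour such a parameter with argparse is to forward it to parse_args. The parser below is a hypothetical stand-in, not the real fast_curator parser.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="example_tool")  # hypothetical name
    parser.add_argument("files", nargs="*")
    parser.add_argument("-o", "--output", default="file_list.yml")
    return parser


def main(args=None):
    # parse_args(None) falls back to sys.argv[1:], so command-line behaviour
    # is unchanged while tests can pass an explicit list of arguments.
    return build_parser().parse_args(args)


options = main(["-o", "out.yml", "a.root", "b.root"])
```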

fast-curator not found

Imported from gitlab issue 2

I tried to install and run fast-curator as per the README, but the command isn't found - see below. I have the same problem with carpenter...

soolin.dice.priv> pip install --user fast-curator

Successfully installed alphatwirl-0.20.2 atsge-0.1.8 atuproot-0.1.5 awkward-0.4.3 cachetools-3.0.0 fast-carpenter-0.2.2 fast-curator-0.1.7 fast-flow-0.1.1 funcsigs-1.0.2 future-0.17.1 llvmlite-0.25.0 lz4-2.1.2 numba-0.40.1 pandas-0.23.4 python-dateutil-2.7.5 singledispatch-3.4.0.3 uproot-3.2.11 uproot-methods-0.2.7

soolin.dice.priv> fast_curator -o output_file_list.txt -t tree_name -d dataset_name --mc /storage/DUNE/HitFinding/Temp/DAQSimAna_plus_trigprim_FastHit.root
-bash: fast_curator: command not found
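A frequent reason for this symptom with pip install --user is that the user-level scripts directory is not on PATH. The following sketch checks for that on a POSIX system; it is a diagnostic aid under that assumption, not a confirmed explanation of this report.

```python
import os
import site


def user_scripts_dir():
    """Directory where `pip install --user` places console scripts on POSIX systems."""
    return os.path.join(site.getuserbase(), "bin")


def is_on_path(directory, path_variable=None):
    """True if `directory` is one of the entries of PATH."""
    if path_variable is None:
        path_variable = os.environ.get("PATH", "")
    entries = [os.path.normpath(p) for p in path_variable.split(os.pathsep) if p]
    return os.path.normpath(directory) in entries


scripts = user_scripts_dir()
if not is_on_path(scripts):
    print(f"{scripts} is not on PATH; adding it should make the command visible")
```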

Allow to disable defaults in output file

The automatic identification of common options in the output YAML is very convenient to keep the output compact. For debugging purposes, it would be convenient to allow disabling this feature that sets defaults (via a command line argument). Instead, every option then would always be explicitly written.

This does not seem too difficult to add; I can have a look and prepare a PR.
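The idea can be sketched as a switch around the step that factors out common options. The function name, its use_defaults flag and the dict layout are assumptions for illustration; the actual writer in fast-curator may be organised differently.

```python
def factor_out_defaults(datasets, use_defaults=True):
    """Split dataset dicts into shared defaults and per-dataset remainders."""
    if not use_defaults or not datasets:
        # Debug mode: nothing is shared, every option stays explicit.
        return {}, [dict(d) for d in datasets]
    first = datasets[0]
    defaults = {
        key: value
        for key, value in first.items()
        if all(key in d and d[key] == value for d in datasets)
    }
    trimmed = [{k: v for k, v in d.items() if k not in defaults} for d in datasets]
    return defaults, trimmed


example = [
    {"name": "a", "eventtype": "mc", "tree": "events"},
    {"name": "b", "eventtype": "mc", "tree": "events"},
]
compact = factor_out_defaults(example)
explicit = factor_out_defaults(example, use_defaults=False)
```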

Add generic file-catalogue query

At the moment, file lists are created using wildcards either on the local file system or over xrootd. Many experiments use file-catalogue databases and a query language to extract a list of files from the database. In the same way as the user hook interface, it would be nice to have a command-line flag that sends the command-line query to a known file-catalogue service. The xrootd_glob function should be moved over to use this interface, and then we can look at implementing other ones (of particular interest are CMS DAS, Dirac find-lfns, and LZ's SPADE).
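One way to structure such an interface is a small registry that maps a catalogue name to a query function. Everything named below (CATALOGUES, register_catalogue, expand_query) is hypothetical, and only a local glob backend is shown.

```python
import glob

CATALOGUES = {}


def register_catalogue(name):
    """Decorator that records a query function under `name`."""
    def decorator(func):
        CATALOGUES[name] = func
        return func
    return decorator


@register_catalogue("local")
def local_glob(query):
    # For the local backend the "query" is simply a glob pattern.
    return sorted(glob.glob(query))


def expand_query(query, catalogue="local"):
    """Send `query` to the chosen catalogue backend and return a list of files."""
    try:
        backend = CATALOGUES[catalogue]
    except KeyError:
        raise ValueError(f"Unknown file catalogue: {catalogue!r}") from None
    return backend(query)
```

A command-line flag could then select the backend by name, with experiment-specific services registered in the same way.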

Support file path prefix field

A common use pattern for analyses working on multiple sites is to locate data files within a common directory structure. However, given the varying mount points between computer clusters and sites, the absolute paths might vary. It would therefore be useful to provide a prefix option which can be altered when running on different sites. This has the additional advantage that the prefix can be a root: endpoint or use some other remote scheme (e.g. https, globus, etc.). The prefix option could also be used to swap the URL, port, and so on.

There are perhaps two sensible approaches:

  1. Have a single prefix path string which provides a default, but that can be overridden by the calling code when the dataset dicts are loaded
  2. Have the author of the dataset files define a dictionary of possible prefixes. The calling code is then only able to choose from the valid options, rather than having the ability to over-ride the prefix completely.

Option 2 would be safer and possibly simpler; option 1 might be simpler for a user and offer more flexibility. In principle, both options could be supported.
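Both approaches can coexist in one resolution step, roughly as follows. The field names "prefix" and "prefixes", the function names, and the example paths are assumptions made for this sketch.

```python
def apply_prefix(files, prefix):
    """Prepend `prefix` to each file path; an empty prefix leaves paths untouched."""
    if not prefix:
        return list(files)
    return [prefix.rstrip("/") + "/" + path.lstrip("/") for path in files]


def resolve_prefix(dataset, override=None, choice=None):
    """Pick a prefix using option 1 (free override) or option 2 (named choice)."""
    if override is not None:                 # option 1: caller replaces the default
        return override
    prefixes = dataset.get("prefixes", {})   # option 2: author-defined dictionary
    if choice is not None:
        if choice not in prefixes:
            raise KeyError(f"Unknown prefix {choice!r}; valid: {sorted(prefixes)}")
        return prefixes[choice]
    return dataset.get("prefix", "")


# Hypothetical dataset description.
dataset = {
    "prefix": "/data/site_a",
    "prefixes": {"site_b": "root://example.org//store"},
    "files": ["mc/a.root"],
}
default_paths = apply_prefix(dataset["files"], resolve_prefix(dataset))
remote_paths = apply_prefix(dataset["files"], resolve_prefix(dataset, choice="site_b"))
```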

Reading / Writing dataset yamls in jupyter notebook

Datasets written using fast_curator.write.write_yaml are not readable by fast_curator.read.from_yaml.

Code snippet

from fast_curator import read
from fast_curator.read import get_datasets  # assumed location; this import was not shown in the report
from fast_curator.write import prepare_file_list, prepare_contents, write_yaml

fileLoc = "/global/projecta/projectdirs/lz/data/MDC3/calibration/LZAP-4.5.1/20180226/lz_201802260001_32_lzap.root"
fileList = prepare_file_list(fileLoc,
                             dataset="DD",
                             expand_files="local",
                             tree_name="Scatters",
                             eventtype="mc",
                             no_empty_files=True,
                             confirm_tree=False,
                             ignore_inaccessible=True
                            )

datalist = []
datalist.append(fileList)
contents = prepare_contents(datalist)
dataset = get_datasets(contents)

# write out the dataset config file
a = write_yaml(dataset, "dataset.yaml", append=False)

# reading the same file back triggers the error below
dataset_read = read.from_yaml("dataset.yaml")

The error is:

ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object:argparse.Namespace'
  in "dataset.yaml", line 2, column 7

dataset.yaml looks like this:

datasets:
  - - !!python/object:argparse.Namespace
      associates: []
      eventtype: mc
      files:
        - /global/projecta/projectdirs/lz/data/MDC3/calibration/LZAP-4.5.1/20180226/lz_201802260001_32_lzap.root
      name: DD
      nevents: 639
      nfiles: 1
      tree: Scatters

The dataset object is:

[Namespace(associates=[], eventtype='mc', files=['/global/projecta/projectdirs/lz/data/MDC3/calibration/LZAP-4.5.1/20180226/lz_201802260001_32_lzap.root'], name='DD', nevents=639, nfiles=1, tree='Scatters')]
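The Python-specific tag in the YAML comes from dumping argparse.Namespace objects directly. One possible workaround, sketched here with a hypothetical helper and a shortened file name, is to convert them to plain dicts before writing; whether write_yaml itself should do this is a separate design question.

```python
from argparse import Namespace


def to_plain_dicts(datasets):
    """Convert Namespace-style datasets into plain dicts that any YAML dumper can handle."""
    return [dict(vars(d)) if isinstance(d, Namespace) else dict(d) for d in datasets]


# Same fields as in the report; "a.root" is a placeholder file name.
datasets = [Namespace(associates=[], eventtype="mc", files=["a.root"],
                      name="DD", nevents=639, nfiles=1, tree="Scatters")]
plain = to_plain_dicts(datasets)
```

Nested Namespace objects (for instance inside associates) are not converted by this sketch.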

Failing to remove files from the list if unreadable

Run: fast_curator -d DD -t Scatters --mc --allow-missing-tree --no-empty-files FileLocation
The FileLocation directory contains some files that can be read and others that cannot, due to permissions.

fast_curator fails because the exception raised in uproot is not caught.

Traceback:

---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-17-94e797df59b4> in <module>
     16                                  tree_name="Scatters",
     17                                  eventtype="mc",
---> 18                                  no_empty_files=True,
     19                                  #confirm_tree=True
     20                               ) 

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/fast_curator/write.py in prepare_file_list(files, dataset, eventtype, tree_name, expand_files, prefix, no_empty_files, confirm_tree, include_branches)
     30                                                                no_empty=no_empty_files,
     31                                                                list_branches=include_branches,
---> 32                                                                confirm_tree=confirm_tree)
     33 
     34     data = {}

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/fast_curator/catalogues/__init__.py in check_files(*args, **kwargs)
     33     @staticmethod
     34     def check_files(*args, **kwargs):
---> 35         return check_entries_uproot(*args, **kwargs)
     36 
     37 

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/fast_curator/catalogues/__init__.py in check_entries_uproot(files, tree_names, no_empty, confirm_tree, list_branches)
     60         missing_trees = defaultdict(list)
     61         for tree in tree_names:
---> 62             totals = uproot.numentries(files, tree, total=False)
     63             for name, entries in totals.items():
     64                 n_entries[tree] += entries

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/uproot/tree.py in numentries(path, treepath, total, localsource, xrootdsource, httpsource, executor, blocking, **options)
   2004     else:
   2005         paths = [y for x in path for y in _filename_explode(x)]
-> 2006     return _numentries(paths, treepath, total, localsource, xrootdsource, httpsource, executor, blocking, [None] * len(paths), options)
   2007 
   2008 def _numentries(paths, treepath, total, localsource, xrootdsource, httpsource, executor, blocking, uuids, options):

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/uproot/tree.py in _numentries(paths, treepath, total, localsource, xrootdsource, httpsource, executor, blocking, uuids, options)
   2044     if executor is None:
   2045         for i in range(len(paths)):
-> 2046             _delayedraise(fill(i))
   2047         excinfos = ()
   2048     else:

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/uproot/tree.py in _delayedraise(excinfo)
     56             exec("raise cls, err, trc")
     57         else:
---> 58             raise err.with_traceback(trc)
     59 
     60 def _filename_explode(x):

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/uproot/tree.py in fill(i)
   2023     def fill(i):
   2024         try:
-> 2025             file = uproot.rootio.open(paths[i], localsource=localsource, xrootdsource=xrootdsource, httpsource=httpsource, read_streamers=False, **options)
   2026         except:
   2027             return sys.exc_info()

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/uproot/rootio.py in open(path, localsource, xrootdsource, httpsource, **options)
     51         else:
     52             openfcn = localsource
---> 53         return ROOTDirectory.read(openfcn(path), **options)
     54 
     55     elif _bytesid(parsed.scheme) == b"root":

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/uproot/rootio.py in <lambda>(path)
     48                 if n in options:
     49                     kwargs[n] = options.pop(n)
---> 50             openfcn = lambda path: MemmapSource(path, **kwargs)
     51         else:
     52             openfcn = localsource

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/uproot/source/memmap.py in __init__(self, path)
     19     def __init__(self, path):
     20         self.path = os.path.expanduser(path)
---> 21         self._source = numpy.memmap(self.path, dtype=numpy.uint8, mode="r")
     22 
     23     def parent(self):

~/miniconda3/envs/fast_lz/lib/python3.7/site-packages/numpy/core/memmap.py in __new__(subtype, filename, dtype, mode, offset, shape, order)
    223             f_ctx = contextlib_nullcontext(filename)
    224         else:
--> 225             f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
    226 
    227         with f_ctx as fid:

PermissionError: [Errno 13] Permission denied: '/global/projecta/projectdirs/lz/data/MDC3/calibration/LZAP-4.5.1/20180226/lz_201802260003_25_lzap.root'
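A pre-filtering step that tolerates unreadable files could look like the sketch below. It only tests whether each path can be opened and does not involve uproot; readable_files is a hypothetical helper, and the temporary files exist purely to demonstrate it.

```python
import os
import tempfile


def readable_files(paths, opener=open):
    """Split `paths` into those that can be opened for reading and those that cannot."""
    good, bad = [], []
    for path in paths:
        try:
            with opener(path, "rb"):
                pass
        except OSError as error:  # PermissionError and FileNotFoundError are subclasses
            bad.append((path, error))
        else:
            good.append(path)
    return good, bad


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.root")
    open(present, "wb").close()
    missing = os.path.join(tmp, "missing.root")
    good, bad = readable_files([present, missing])
```

The list of rejected paths and their errors can then be reported to the user instead of aborting the whole run.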

Type annotations for meta-data

Imported from gitlab issue 6

All meta-data passed through on the command line is currently interpreted as a string. It would be good to allow the user to annotate this in some way so that we can re-interpret it in another form.
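As one possible shape for such an annotation, the sketch below parses a hypothetical key:type=value syntax. Neither this syntax nor the names CONVERTERS and parse_meta exist in fast-curator; they only illustrate the idea.

```python
CONVERTERS = {
    "str": str,
    "int": int,
    "float": float,
    "bool": lambda text: text.lower() in ("1", "true", "yes"),
}


def parse_meta(argument):
    """Parse 'key=value' or 'key:type=value' into a (key, typed value) pair."""
    key, _, value = argument.partition("=")
    name, _, type_name = key.partition(":")
    # Without an annotation the value stays a string, matching current behaviour.
    convert = CONVERTERS[type_name or "str"]
    return name, convert(value)


typed = parse_meta("xsec:float=831.76")
untyped = parse_meta("label=ttbar")
```

An unknown type name raises KeyError here; a real implementation would probably report a friendlier error.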

Add version option to command line tools

Using argparse in fast_curator/__main__.py, it would be good to add a --version option to the fast_curator command which prints the __version__ variable defined in version.py.
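argparse has a built-in "version" action for this purpose. The sketch below uses a placeholder version string and a stand-in parser rather than the real one from __main__.py.

```python
import argparse

__version__ = "0.0.0"  # placeholder; the real value would come from version.py


def build_parser():
    parser = argparse.ArgumentParser(prog="fast_curator")
    # The "version" action prints the string and exits with status 0.
    parser.add_argument("--version", action="version",
                        version=f"%(prog)s {__version__}")
    return parser
```

Calling build_parser().parse_args(["--version"]) prints the program name followed by the version and then exits.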
