
hepdata-converter's Introduction

HEPData


The Durham High-Energy Physics Database (HEPData) has been built up over the past four decades as a unique open-access repository for scattering data from experimental particle physics. It currently comprises the data points from plots and tables related to several thousand publications including those from the Large Hadron Collider (LHC). HEPData is funded by a grant from the UK STFC and is based at the IPPP at Durham University.

HEPData is built upon Invenio v3 and is open source and free to use!

Research notice

Please note that this repository is participating in a study into sustainability of open source projects. Data will be gathered about this repository for approximately the next 12 months, starting from June 2021.

Data collected will include number of contributors, number of PRs, time taken to close/merge these PRs, and issues closed.

For more information, please visit the informational page or download the participant information sheet.

hepdata-converter's People

Contributors

20dm, alisonrclarke, dpiparo, eamonnmag, graemewatt, jstypka, michal-szostak


hepdata-converter's Issues

yoda: support tables with more than two independent variables

Tables with {0,1,2} independent variables are currently exported to YODA Scatter{1,2,3}D objects (respectively). Tables with more than 2 independent variables are currently skipped in the YODA export. It might be possible to use the YODA::Scatter<N> base class to export tables with more than 2 independent variables, although it currently seems to be lacking a Python implementation.
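As a rough illustration (not the converter's actual code, and assuming the yoda Python bindings are available), the existing mapping could be isolated in one place so that a future Scatter<N> wrapper only needs to be registered there:

import yoda

# Current mapping: number of independent variables -> YODA scatter class.
# A generic Scatter<N> entry would be added here once a Python binding exists.
SCATTER_CLASSES = {0: yoda.Scatter1D, 1: yoda.Scatter2D, 2: yoda.Scatter3D}

def scatter_class_for(n_independent_variables):
    try:
        return SCATTER_CLASSES[n_independent_variables]
    except KeyError:
        raise NotImplementedError(
            "No YODA scatter class for %d independent variables"
            % n_independent_variables)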

CLI help broken

hepdata-converter -h seems to give similar results to root -h?

(Screenshot of the CLI help output attached, dated 2016-09-15.)

root/yoda: parse 'value' given as a string of hyphen-separated bin limits

If there are any non-numeric independent variable values, the current code uses bins of unit width and centred on integers (1, 2, 3, etc.) in the export to ROOT/YODA. There are some records (for example, here or here) where an incorrect YAML encoding has been used such as value: 800 - 1000 or value: '1-1.5' instead of separate low and high values. Such an incorrect YAML encoding gives an acceptable table rendering and is properly handled by the visualisation code, so it could easily be missed in the review process. We should therefore tolerate these incorrect YAML encodings in the export to ROOT/YODA by checking if a string value without explicit low and high values has the format of two numbers separated by a hyphen (with or without surrounding spaces). This case should then be treated in the same way as if separate low and high values were specified.
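A minimal sketch of the kind of check that could be added (the regex and function name are illustrative, not the converter's actual code):

import re

# Accept strings like "800 - 1000" or "1-1.5" as (low, high) bin limits.
_BIN_RANGE = re.compile(r'^\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*-\s*'
                        r'([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$')

def parse_hyphenated_range(value):
    """Return (low, high) floats if 'value' is a hyphen-separated range, else None."""
    if not isinstance(value, str):
        return None
    match = _BIN_RANGE.match(value)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None

# e.g. parse_hyphenated_range('800 - 1000') -> (800.0, 1000.0)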

Python 3.12 compatibility

It looks like the find_module method got removed in Python 3.12:

 _warnings.warn("find_module() is deprecated and "
                   "slated for removal in Python 3.12; use find_spec() instead",
                   DeprecationWarning)

resulting in

    from .parsers import Parser
  File "/path/to/hepdata_converter/parsers/__init__.py", line 178, in <module>
    module = loader.find_module(name).load_module(name)
             ^^^^^^^^^^^^^^^^^^
AttributeError: 'FileFinder' object has no attribute 'find_module'

when trying to import hepdata_converter.
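A minimal sketch of the replacement pattern, assuming the (loader, name) pairs come from pkgutil.iter_modules() as in the parsers package; find_spec() plus exec_module() replaces the removed find_module()/load_module() pair and works on Python 3.12:

import importlib.util
import pkgutil

# Hypothetical example: dynamically import every module found on a search path,
# as the parsers package does, but via find_spec() instead of find_module().
search_path = ['hepdata_converter/parsers']  # illustrative path
for loader, name, _is_pkg in pkgutil.iter_modules(search_path):
    spec = loader.find_spec(name)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)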

yaml parser: investigate use of multiprocessing to parallelise loading YAML

The Kubernetes pods used in production each have 16 CPUs (16 sockets with 1 core per socket and 1 thread per core). Using the Python multiprocessing package could potentially speed up the parsing of large submissions by parallelising the loop over data tables:

for i in range(0, len(submission_data)):

It looks like an attempt to use multiprocessing.Pool was started in 980dd23 but later removed in 4b1ad68. If successful, the use of multiprocessing could be extended to other parts of the converter code.
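A rough sketch of what such a change might look like, assuming a module-level worker function (the parse_table helper here is hypothetical) that loads the YAML data file named in each submission.yaml entry:

from multiprocessing import Pool

import yaml

def parse_table(entry):
    # Hypothetical worker: load one data table; the 'data_file' key follows
    # the HEPData submission.yaml schema.
    with open(entry['data_file']) as data_file:
        return yaml.safe_load(data_file)

def parse_all_tables(submission_data, processes=16):
    # The worker must be a module-level function so multiprocessing can pickle it.
    with Pool(processes=processes) as pool:
        return pool.map(parse_table, submission_data)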

RFC OLD HEPData supported fields

I created a new issue so it will (hopefully) be easier to spot.

There are some fields in the old HEPData format which do not have a one-to-one mapping to the new format. The ones I know about (found in http://hepdata.cedar.ac.uk/resource/sample.input) are:

  • author
  • doi
  • status
  • experiment
  • detector
  • title

For now I add those fields at the end of the comment section in the new format, but I would like to hear some opinions on what should generally be done with "unsupported" input.

Additionally, there are some fields which map to record_ids, such as:

  • spiresId
  • inspireId
  • cdsId
  • durhamId

I map them to record_ids with the type equal to the field name but without "Id" at the end (similar to what is shown in https://github.com/HEPData/hepdata_submission). For example:

*spiresId: 10412

is mapped to

{value: "spires", id: 10412}

Any comments on whether this is the expected behaviour?

Convert from YAML format to YODA

Of the current HepData export formats, a conversion to the YODA format should be implemented in the new system. This format is needed by the Rivet toolkit. Currently, only total uncertainties are supported in the YODA format, which should be computed from the sum in quadrature of the individual uncertainties. This might change in the future, but for the moment we still need the separate CSV export to provide the complete breakdown of multiple uncertainties in a simple plain-text format.
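For reference, a minimal sketch of combining individual uncertainties in quadrature (the list-of-floats input is illustrative, not the exact HEPData error structure):

import math

def total_uncertainty(uncertainties):
    """Combine symmetric uncertainties, e.g. [0.3, 0.4] -> 0.5, by summing in quadrature."""
    return math.sqrt(sum(err ** 2 for err in uncertainties))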

hepdata-converter CLI

For some tests and sanity checks (and to be able to post sample outputs for discussion on GitHub) I implemented rudimentary CLI tools allowing conversions between different formats. Old HEPData has some conversion tools, so I thought that this code could be integrated into the master branch.

The question is: should it remain, or be cleared out? Any other ideas or questions to discuss? @eamonnmag ?

Sample usage:

hepdata-converter --input-format yaml --output-format csv --table "Table 1" input output

The above line is equivalent to the following code:

#!/usr/bin/python
import hepdata_converter

# 'input' and 'output' are the same paths as in the CLI invocation above
hepdata_converter.convert('input', 'output',
                          options={'input_format': 'yaml',
                                   'output_format': 'csv',
                                   'table': 'Table 1'})

All the options required by parsers / writers are automatically accepted from the command line, so running:

hepdata-converter --help

will print information about all supported input and output formats, as well as the possible parameters for each of them.

yoda: support a possible 'Reference data' qualifier

Similar to the existing 'Custom Rivet identifier' qualifier:

# Allow the standard Rivet identifier to be overridden by a custom value specified in the qualifiers.
if 'qualifiers' in table.dependent_variables[idep]:
    for qualifier in table.dependent_variables[idep]['qualifiers']:
        if qualifier['name'] == 'Custom Rivet identifier':
            rivet_identifier = qualifier['value']

the code should check for a qualifier with name 'Reference data' and value 'false'. In this case, the rivet_path should be defined with THY (theory predictions) instead of REF (reference data) and the graph.setAnnotation('IsRef', '1') line should be omitted.

Request from Christian Gutschow.
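A hedged sketch of how this could look, mirroring the snippet above (the helper and the rivet_analysis_name argument are assumptions, and the '/REF/...' path layout is only indicative):

def build_rivet_path(table, idep, rivet_analysis_name, rivet_identifier, graph):
    # Default to reference data; a 'Reference data' qualifier with value 'false'
    # switches the prefix to THY and suppresses the IsRef annotation.
    is_reference = True
    for qualifier in table.dependent_variables[idep].get('qualifiers', []):
        if qualifier['name'] == 'Reference data' and str(qualifier['value']).lower() == 'false':
            is_reference = False
    if is_reference:
        graph.setAnnotation('IsRef', '1')
    prefix = 'REF' if is_reference else 'THY'
    return '/%s/%s/%s' % (prefix, rivet_analysis_name, rivet_identifier)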

yoda: add an option to switch off export of multidimensional objects

Add an option like no_multi_dim that switches off export of Scatter3D objects (and possibly higher dimensions, see #47) and instead writes multiple Scatter2D objects, one for each independent variable. This was the behaviour of the YODA exporter in the old HepData site and it still makes sense for some tables.

Since a Scatter3D object would still be preferred for covariance matrices, the code could check the description for 'covariance', 'correlation' or 'matrix' and still write a Scatter3D object in these cases even if the no_multi_dim option is set. See https://gitlab.com/hepcedar/rivet/-/issues/318 .
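Illustrative only (the option name no_multi_dim comes from the issue text; the helper itself is hypothetical): the decision could be as simple as

def keep_multi_dim(description, no_multi_dim):
    # Even with no_multi_dim set, keep a Scatter3D for covariance/correlation matrices.
    keywords = ('covariance', 'correlation', 'matrix')
    return (not no_multi_dim) or any(word in description.lower() for word in keywords)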

yoda: write luminosity qualifier as annotation

I am wondering whether we can also address the issue of what the integrated luminosity is on a table-by-table basis, for those papers where it differs? Rivet currently allows just one integrated-luminosity number per paper, so there would have to be development there too, but is integrated luminosity something that is ever submitted to HEPData with measurements?
(For info, the reason it is useful is in working out the uncertainty / expected number of events from some BSM model projected onto a measurement.)

Originally posted by @jonbutterworth in #42 (comment)

hepdata missing functionality

Hello,

The new hepdata.net is missing some functionality compared to hepdata.cedar.ac.uk.

This is how easy it was to reproduce plots & EPS figures before on Windows: I could start the "DMelt IDE" (a Java program) and copy the URL of a script location, such as
http://hepdata.cedar.ac.uk/h8test/view/ins1269454/dmelt.py
into the DMelt dialogue (File -> Read script from URL -> press Run). It would create 6 canvases with data plus 6 EPS figures. Then I could edit the scripts and rerun the whole thing.
I could do this on Windows 10, without installing any complicated libraries or reading about "file formats".

best wishes, Sergei

csv/root/yoda: add support for underflow/overflow bins

A long-standing open issue (HEPData/hepdata#358) is to allow open bin boundaries. In the past, this issue was put on hold because it was not clear how to export to ROOT/YODA formats. However, ROOT histograms have an underflow/overflow bin, and explicit underflow/overflow bins are supported by the new YODA2 format, therefore it is now easier to address this issue. I've just made a fix to the HEPData visualisation code to support underflow/overflow bins. The CSV/ROOT/YODA1/YODA2 conversion code should also be adapted to support underflow bins (e.g. {low: -.inf, high: 0}) and overflow bins (e.g. {low: 250, high: .inf}). For objects like ROOT TGraphAsymmErrors and YODA1 Scatter2D that do not support underflow/overflow bins, a workaround would be needed like setting the low/centre/high bin values all to the finite bin limit as in the HEPData visualisation code.

@20DM : maybe you could help with the implementation at least for the YODA2 export?
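A minimal sketch of the workaround mentioned above for formats without native underflow/overflow bins (the (low, centre, high) return value is illustrative):

import math

def clamp_open_bin(low, high):
    # Collapse an infinite edge onto the finite one, as the visualisation code does.
    if math.isinf(low) and not math.isinf(high):
        return high, high, high      # underflow bin, e.g. {low: -.inf, high: 0}
    if math.isinf(high) and not math.isinf(low):
        return low, low, low         # overflow bin, e.g. {low: 250, high: .inf}
    return low, 0.5 * (low + high), high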

Convert from Rivet analysis to YAML format or vice versa

This issue is an extension of #5 ("Convert from YAML format to YODA"). A Rivet analysis consists of a .cc file (code, not relevant here), a .info file (metadata for the analysis), a .plot file (plot titles and axis labels), and a .yoda file with the actual data points. The last three files overlap with HEPData content and hence conversion between the two formats would be useful, i.e. both HEPData --> Rivet and Rivet --> HEPData. For example, I noticed yesterday that a particular Rivet analysis ATLAS_2014_I1315949 does not have a corresponding HepData record. Can we write a script to convert the relevant .info, .plot and .yoda files into the YAML format suitable for HEPData submission? Conversely, given a HEPData submission in the YAML format, can we export .info, .plot and .yoda (see #5) files suitable for inclusion in a Rivet analysis?

root: catch segfault when converting large tables

The ROOT conversion of https://www.hepdata.net/record/ins1742786 fails with an HTML file containing a message "502 Bad Gateway. The server returned an invalid or incomplete response.". By running the conversion offline, most tables can be converted other than "Statistical covariance matrix" and "Systematic covariance matrix" which gave an error message like:

Fatal in <TBufferFile::AutoExpand>: Request to expand to a negative size, likely due to an integer overflow: 0x91111b20 for a max of 0x7ffffffe.
aborting
[...]
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted

I can't find anything unusual in the YAML formatting of these tables, so I think the problem is simply caused by the large size (17424 rows and 2.6 MB). If the problem cannot be resolved, at least the segfault should be caught and a suitable error message returned from the problematic line of the conversion code.
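One possible way to catch the crash (an assumption, not a chosen fix) would be to run the per-table ROOT conversion in a child process, so that a segfault only kills the child and can be reported from its exit code:

import multiprocessing

def convert_table_safely(write_table, output_path, table):
    # 'write_table' is a hypothetical picklable function converting one table to 'output_path'.
    proc = multiprocessing.Process(target=write_table, args=(output_path, table))
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError("ROOT conversion of table '%s' failed with exit code %s"
                           % (table.name, proc.exitcode))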

yaml parser: only parse relevant YAML data file for single-table conversion

Currently, the YAML parser reads all YAML data files into a tables list even if only one table is requested in the output. This is very inefficient. Moreover, the single-table conversion can fail for very large records like https://www.hepdata.net/record/ins2132368 (313 data tables) that cannot be parsed within the 5-minute server timeout limit. The YAML parser should check for the presence of a table option and only parse the relevant table into the tables list.
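A rough sketch of that check (not the parser's actual code; the data_file handling and relative-path details are simplified):

import os
import yaml

def load_requested_table(submission_path, table_name):
    # submission.yaml is a multi-document YAML file; each table document carries
    # 'name' and 'data_file' keys, so only the requested table needs to be loaded.
    base_dir = os.path.dirname(submission_path)
    with open(submission_path) as submission_file:
        for document in yaml.safe_load_all(submission_file):
            if document and document.get('name') == table_name:
                with open(os.path.join(base_dir, document['data_file'])) as data_file:
                    return yaml.safe_load(data_file)
    raise ValueError("Table '%s' not found in %s" % (table_name, submission_path))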

setup.py missing dependencies

So I ran a fresh install in a virtualenv and it seems like setup.py is missing some deps.

I needed to pip install

  • funcsigs
  • functools32
  • yoda

Should I just add those as hard dependencies? (Maybe yoda should not be an explicit dependency, but then hepdata-converter should not crash if there is an ImportError.)
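One possible compromise (an assumption, not a decision in the repository) is to import yoda lazily and fail with a clear message only when YODA output is actually requested:

try:
    import yoda
except ImportError:
    yoda = None

def require_yoda():
    # Called by the YODA writer before any conversion work is done.
    if yoda is None:
        raise RuntimeError("YODA output requested but the 'yoda' package is not installed")
    return yoda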

Reduce dependencies of hepdata-converter

Hi all,

Thanks for providing such a nice tool. It's not trivial to seamlessly convert HEPData YAML files into formats like ROOT files in a generic way!

I would suggest, though, removing the dependencies on ROOTPy and Numpy. Having a small set of dependencies is always of great help when it comes to interoperability, and the aforementioned packages are not small.

Here are two simple ideas to replace the current usage of the two packages.

ROOTPy

Rootpy is used only here: https://github.com/HEPData/hepdata-converter/blob/master/hepdata_converter/writers/root_writer.py#L340

One could generate a random long name and delete the file after closing it.
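For example, something along these lines with only the standard library (illustrative; the actual ROOT writing step is omitted):

import os
import tempfile

fd, tmp_path = tempfile.mkstemp(suffix='.root')  # unique, hard-to-collide name
os.close(fd)
try:
    pass  # open tmp_path with ROOT.TFile(tmp_path, 'RECREATE') and write the objects
finally:
    os.remove(tmp_path)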

Numpy

The only element of Numpy I found in the codebase is numpy.array. This can be replaced directly with the native Python array.array class.
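The replacement is a one-line change in most places; for example (illustrative values, with type code 'd' meaning C doubles, which PyROOT constructors such as TGraph accept):

import array

x_values = array.array('d', [1.0, 2.0, 3.0])
y_values = array.array('d', [0.5, 1.5, 2.5])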

root: converter failing on some files with a null pointer

e.g. https://www.hepdata.net/record/sandbox/1469163359

def __init__(self, *args, **kwargs):
    super(ROOT, self).__init__(*args, **kwargs)
    self.extension = 'root'

def _write_table(self, data_out, table):
    data_out.mkdir(table.name)
    data_out.cd(table.name)

    f = ObjectFactory(self.class_list, table.independent_variables, table.dependent_variables)
    for graph in f.get_next_object():
        graph.title = table.name

ReferenceError: attempt to access a null-pointer

Conversion between same file formats

During my exchange with @GraemeWatt, the issue of converting from yaml -> yaml arose. For now the default behaviour is to not do anything at all, but after a little bit of consideration, allowing "conversion" between the same file formats may actually be of some use to part of the user base.

For example:

  • converting from multiple-file yaml to single-file yaml (this is connected to #12)

Alternatively, the default behaviour could be to just copy the input to the output. Any comments on this one, @eamonnmag?

csv/root/yoda: allow for columns with missing values

For example, Tables 2, 3, 4, 7 and 8 of https://www.hepdata.net/record/ins1216885 return an HTML file with IndexError: list index out of range when attempting to convert to CSV, ROOT or YODA formats. The URL https://www.hepdata.net/download/submission/ins1216885/1/root currently appears on the first page of results in a Google search for return self.wsgi_app(environ, start_response). This came up in "Top growing queries" for May 2022 in an email from Google Search Console.

yaml parser: add option to skip validation of data tables

The validation of data tables can be time-consuming for submissions with many large data tables. For new submissions, it is unnecessary as the validation will already have been performed during the upload stage. An option should be added to switch off data table validation, which can be used when the converter is called from the main HEPData web app.
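A sketch of what the parser could do, assuming a hypothetical 'validate' option and using the hepdata_validator package's DataFileValidator:

from hepdata_validator.data_file_validator import DataFileValidator

def maybe_validate(data_file_path, options):
    # Skip validation entirely when the caller (e.g. the main web app) sets
    # the hypothetical 'validate' option to False.
    if not options.get('validate', True):
        return
    validator = DataFileValidator()
    if not validator.validate(file_path=data_file_path):
        validator.print_errors(data_file_path)
        raise ValueError("Validation failed for %s" % data_file_path)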

root: add support for ROOT TH2Poly objects for tables with two independent variables

Related to #37, the ROOT TH2Poly object is a 2D histogram class that allows polygonal bins of arbitrary shape to be defined. In the case of a table with two independent variables, where bins of a given independent variable overlap with other bins of the same independent variable, but the 2D polygonal bins do not overlap (a check should be made), TH2Poly histograms could be written instead of the usual TH2F histograms. Here is an example record.

Suggestion from Antoni Aduszkiewicz.

oldhepdata: remove default None values for the data_license in submission.yaml

Remove these lines:

'data_license': {
    'name': None,
    'url': None,
    'description': None  # (optional)
},

since None is not a valid value of these fields in version 1.0.0 of the schema, where a string is expected. This problem has been worked around in the code for the main web app (HEPData/hepdata) by temporarily validating using version 0.1.0 of the schema if uploading an oldhepdata file, where no checks at all were made on the data_license. The temporary fix to validate using version 0.1.0 of the schema should be removed after this issue is closed and a new hepdata-converter version is made available.

oldhepdata: check for units if qualifier or header is SQRT(S)

This line gives an exception AttributeError: 'NoneType' object has no attribute 'lower' if a qualifier with name SQRT(S) has no units:

if name.startswith('SQRT(S)') and units.lower() in ('gev'):

Similarly if SQRT(S) is given as the name of a header without units:

if xheader['name'].startswith('SQRT(S)') and xheader['units'].lower() in ('gev'):

An additional condition should be added to these lines to check that the units are present and not None.
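A hedged sketch of the guarded check (note the trailing comma making ('gev',) a tuple, so the comparison is against the exact string rather than a substring of 'gev'):

def is_sqrt_s_in_gev(name, units):
    # Guard against units being None or empty before calling .lower().
    return name.startswith('SQRT(S)') and bool(units) and units.lower() in ('gev',)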
