
hepdata-converter's Introduction

HEPData


The Durham High-Energy Physics Database (HEPData) has been built up over the past four decades as a unique open-access repository for scattering data from experimental particle physics. It currently comprises the data points from plots and tables related to several thousand publications including those from the Large Hadron Collider (LHC). HEPData is funded by a grant from the UK STFC and is based at the IPPP at Durham University.

HEPData is built upon Invenio v3 and is open source and free to use!

Research notice

Please note that this repository is participating in a study into sustainability of open source projects. Data will be gathered about this repository for approximately the next 12 months, starting from June 2021.

Data collected will include number of contributors, number of PRs, time taken to close/merge these PRs, and issues closed.

For more information, please visit the informational page or download the participant information sheet.

hepdata-converter's People

Contributors

20dm, alisonrclarke, dpiparo, eamonnmag, graemewatt, jstypka, michal-szostak


hepdata-converter's Issues

yoda: support tables with more than two independent variables

Tables with {0,1,2} independent variables are currently exported to YODA Scatter{1,2,3}D objects (respectively). Tables with more than 2 independent variables are currently skipped in the YODA export. It might be possible to use the YODA::Scatter<N> base class to export tables with more than 2 independent variables, although it currently seems to be lacking a Python implementation.
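As a rough illustration (not the converter's actual code, and assuming the yoda Python bindings are available), the existing mapping could be isolated in one place so that a future Scatter<N> wrapper only needs to be registered there:

import yoda

# Current mapping: number of independent variables -> YODA scatter class.
# A generic Scatter<N> entry would be added here once a Python binding exists.
SCATTER_CLASSES = {0: yoda.Scatter1D, 1: yoda.Scatter2D, 2: yoda.Scatter3D}

def scatter_class_for(n_independent_variables):
    try:
        return SCATTER_CLASSES[n_independent_variables]
    except KeyError:
        raise NotImplementedError(
            "No YODA scatter class for %d independent variables"
            % n_independent_variables)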

CLI help broken

hepdata-converter -h seems to give similar results to root -h?

(Screenshot of the CLI help output attached, dated 2016-09-15.)

root/yoda: parse 'value' given as a string of hyphen-separated bin limits

If there are any non-numeric independent variable values, the current code uses bins of unit width and centred on integers (1, 2, 3, etc.) in the export to ROOT/YODA. There are some records (for example, here or here) where an incorrect YAML encoding has been used such as value: 800 - 1000 or value: '1-1.5' instead of separate low and high values. Such an incorrect YAML encoding gives an acceptable table rendering and is properly handled by the visualisation code, so it could easily be missed in the review process. We should therefore tolerate these incorrect YAML encodings in the export to ROOT/YODA by checking if a string value without explicit low and high values has the format of two numbers separated by a hyphen (with or without surrounding spaces). This case should then be treated in the same way as if separate low and high values were specified.
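A minimal sketch of the kind of check that could be added (the regex and function name are illustrative, not the converter's actual code):

import re

# Accept strings like "800 - 1000" or "1-1.5" as (low, high) bin limits.
_BIN_RANGE = re.compile(r'^\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*-\s*'
                        r'([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$')

def parse_hyphenated_range(value):
    """Return (low, high) floats if 'value' is a hyphen-separated range, else None."""
    if not isinstance(value, str):
        return None
    match = _BIN_RANGE.match(value)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None

# e.g. parse_hyphenated_range('800 - 1000') -> (800.0, 1000.0)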

Python 3.12 compatibility

It looks like the find_module method got removed in Python 3.12:

 _warnings.warn("find_module() is deprecated and "
                   "slated for removal in Python 3.12; use find_spec() instead",
                   DeprecationWarning)

resulting in

    from .parsers import Parser
  File "/path/to/hepdata_converter/parsers/__init__.py", line 178, in <module>
    module = loader.find_module(name).load_module(name)
             ^^^^^^^^^^^^^^^^^^
AttributeError: 'FileFinder' object has no attribute 'find_module'

when trying to import hepdata_converter.
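A minimal sketch of the replacement pattern, assuming the (loader, name) pairs come from pkgutil.iter_modules() as in the parsers package; find_spec() plus exec_module() replaces the removed find_module()/load_module() pair and works on Python 3.12:

import importlib.util
import pkgutil

# Hypothetical example: dynamically import every module found on a search path,
# as the parsers package does, but via find_spec() instead of find_module().
search_path = ['hepdata_converter/parsers']  # illustrative path
for loader, name, _is_pkg in pkgutil.iter_modules(search_path):
    spec = loader.find_spec(name)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)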

yaml parser: investigate use of multiprocessing to parallelise loading YAML

The Kubernetes pods used in production each have 16 CPUs (16 sockets with 1 core per socket and 1 thread per core). Using the Python multiprocessing package could potentially speed up the parsing of large submissions by parallelising the loop over data tables:

for i in range(0, len(submission_data)):

It looks like an attempt to use multiprocessing.Pool was started in 980dd23 but later removed in 4b1ad68. If successful, the use of multiprocessing could be extended to other parts of the converter code.
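A rough sketch of what such a change might look like, assuming a module-level worker function (the parse_table helper here is hypothetical) that loads the YAML data file named in each submission.yaml entry:

from multiprocessing import Pool

import yaml

def parse_table(entry):
    # Hypothetical worker: load one data table; the 'data_file' key follows
    # the HEPData submission.yaml schema.
    with open(entry['data_file']) as data_file:
        return yaml.safe_load(data_file)

def parse_all_tables(submission_data, processes=16):
    # The worker must be a module-level function so multiprocessing can pickle it.
    with Pool(processes=processes) as pool:
        return pool.map(parse_table, submission_data)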

RFC OLD HEPData supported fields

I created a new issue so it will (hopefully) be easier to spot.

There are some fields in the old HEPData format which do not have a one-to-one mapping to the new format. The ones I know about (found in http://hepdata.cedar.ac.uk/resource/sample.input) are:

  • author
  • doi
  • status
  • experiment
  • detector
  • title

For now I add those fields at the end of the comment section in the new format, but I would like to hear some opinions on what should generally be done with "unsupported" input.

Additionally, there are some fields which map to record_ids, such as:

  • spiresId
  • inspireId
  • cdsId
  • durhamId

I map them to record_ids with the type equal to the field name but without "Id" at the end (similar to what is shown in https://github.com/HEPData/hepdata_submission). For example:

*spiresId: 10412

is mapped to

{value: "spires", id: 10412}

Any comments on whether this is the expected behaviour?

Convert from YAML format to YODA

Of the current HepData export formats, a conversion to the YODA format should be implemented in the new system. This format is needed by the Rivet toolkit. Currently, only total uncertainties are supported in the YODA format, which should be computed from the sum in quadrature of the individual uncertainties. This might change in the future, but for the moment we still need the separate CSV export to provide the complete breakdown of multiple uncertainties in a simple plain-text format.
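For reference, a minimal sketch of combining individual uncertainties in quadrature (the list-of-floats input is illustrative, not the exact HEPData error structure):

import math

def total_uncertainty(uncertainties):
    """Combine symmetric uncertainties, e.g. [0.3, 0.4] -> 0.5, by summing in quadrature."""
    return math.sqrt(sum(err ** 2 for err in uncertainties))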

hepdata-converter CLI

For some tests and sanity checks (and to be able to post sample outputs for discussion on GitHub) I implemented rudimentary CLI tools allowing conversions between different formats. Old HEPData has some conversion tools, so I thought that this code could be integrated into the master branch.

The question is: should it remain, or be cleared out? Any other ideas or questions to discuss? @eamonnmag ?

Sample usage:

hepdata-converter --input-format yaml --output-format csv --table "Table 1" input output

The above line is equivalent to the following code:

#!/usr/bin/python
import hepdata_converter

# 'input' and 'output' are the same paths as in the CLI invocation above
hepdata_converter.convert('input', 'output',
                          options={'input_format': 'yaml',
                                   'output_format': 'csv',
                                   'table': 'Table 1'})

All the options required by parsers / writers are automatically accepted from the command line, so running:

hepdata-converter --help

will print information about all supported input and output formats, as well as the possible parameters for each of them.

yoda: support a possible 'Reference data' qualifier

Similar to the existing 'Custom Rivet identifier' qualifier:

# Allow the standard Rivet identifier to be overridden by a custom value specified in the qualifiers.
if 'qualifiers' in table.dependent_variables[idep]:
    for qualifier in table.dependent_variables[idep]['qualifiers']:
        if qualifier['name'] == 'Custom Rivet identifier':
            rivet_identifier = qualifier['value']

the code should check for a qualifier with name 'Reference data' and value 'false'. In this case, the rivet_path should be defined with THY (theory predictions) instead of REF (reference data) and the graph.setAnnotation('IsRef', '1') line should be omitted.

Request from Christian Gutschow.
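A hedged sketch of how this could look, mirroring the snippet above (the helper and the rivet_analysis_name argument are assumptions, and the '/REF/...' path layout is only indicative):

def build_rivet_path(table, idep, rivet_analysis_name, rivet_identifier, graph):
    # Default to reference data; a 'Reference data' qualifier with value 'false'
    # switches the prefix to THY and suppresses the IsRef annotation.
    is_reference = True
    for qualifier in table.dependent_variables[idep].get('qualifiers', []):
        if qualifier['name'] == 'Reference data' and str(qualifier['value']).lower() == 'false':
            is_reference = False
    if is_reference:
        graph.setAnnotation('IsRef', '1')
    prefix = 'REF' if is_reference else 'THY'
    return '/%s/%s/%s' % (prefix, rivet_analysis_name, rivet_identifier)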

yoda: add an option to switch off export of multidimensional objects

Add an option like no_multi_dim that switches off export of Scatter3D objects (and possibly higher dimensions, see #47) and instead writes multiple Scatter2D objects, one for each independent variable. This was the behaviour of the YODA exporter in the old HepData site and it still makes sense for some tables.

Since a Scatter3D object would still be preferred for covariance matrices, the code could check the description for 'covariance', 'correlation' or 'matrix' and still write a Scatter3D object in these cases even if the no_multi_dim option is set. See https://gitlab.com/hepcedar/rivet/-/issues/318 .
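Illustrative only (the option name no_multi_dim comes from the issue text; the helper itself is hypothetical): the decision could be as simple as

def keep_multi_dim(description, no_multi_dim):
    # Even with no_multi_dim set, keep a Scatter3D for covariance/correlation matrices.
    keywords = ('covariance', 'correlation', 'matrix')
    return (not no_multi_dim) or any(word in description.lower() for word in keywords)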

yoda: write luminosity qualifier as annotation

I am wondering whether we can also address the issue of what the integrated luminosity is on a table-by-table basis, for those papers where it differs? Rivet currently allows just one integrated-luminosity number per paper, so there would have to be development there too, but is integrated luminosity something that is ever submitted to HEPData with measurements?
(For info, the reason it is useful is in working out the uncertainty / expected number of events from some BSM model projected onto a measurement.)

Originally posted by @jonbutterworth in #42 (comment)

hepdata missing functionality

Hello,

The new hepdata.net is missing some functionality compared to hepdata.cedar.ac.uk.

This is how easy it was to reproduce plots & EPS figures before on Windows: I could start the "DMelt IDE" (a Java program) and copy the URL of a script location, such as
http://hepdata.cedar.ac.uk/h8test/view/ins1269454/dmelt.py
into the DMelt dialogue (File -> Read script from URL -> press Run). It would create 6 canvases with data plus 6 EPS figures. Then I could edit the scripts and rerun the whole thing.
I could do this on Windows 10, without installing any complicated libraries or reading about "file formats".

best wishes, Sergei

csv/root/yoda: add support for underflow/overflow bins

A long-standing open issue (HEPData/hepdata#358) is to allow open bin boundaries. In the past, this issue was put on hold because it was not clear how to export to ROOT/YODA formats. However, ROOT histograms have an underflow/overflow bin, and explicit underflow/overflow bins are supported by the new YODA2 format, therefore it is now easier to address this issue. I've just made a fix to the HEPData visualisation code to support underflow/overflow bins. The CSV/ROOT/YODA1/YODA2 conversion code should also be adapted to support underflow bins (e.g. {low: -.inf, high: 0}) and overflow bins (e.g. {low: 250, high: .inf}). For objects like ROOT TGraphAsymmErrors and YODA1 Scatter2D that do not support underflow/overflow bins, a workaround would be needed like setting the low/centre/high bin values all to the finite bin limit as in the HEPData visualisation code.

@20DM : maybe you could help with the implementation at least for the YODA2 export?
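A minimal sketch of the workaround mentioned above for formats without native underflow/overflow bins (the (low, centre, high) return value is illustrative):

import math

def clamp_open_bin(low, high):
    # Collapse an infinite edge onto the finite one, as the visualisation code does.
    if math.isinf(low) and not math.isinf(high):
        return high, high, high      # underflow bin, e.g. {low: -.inf, high: 0}
    if math.isinf(high) and not math.isinf(low):
        return low, low, low         # overflow bin, e.g. {low: 250, high: .inf}
    return low, 0.5 * (low + high), high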

Convert from Rivet analysis to YAML format or vice versa

This issue is an extension of #5 ("Convert from YAML format to YODA"). A Rivet analysis consists of a .cc file (code, not relevant here), a .info file (metadata for the analysis), a .plot file (plot titles and axis labels), and a .yoda file with the actual data points. The last three files overlap with HEPData content and hence conversion between the two formats would be useful, i.e. both HEPData --> Rivet and Rivet --> HEPData. For example, I noticed yesterday that a particular Rivet analysis ATLAS_2014_I1315949 does not have a corresponding HepData record. Can we write a script to convert the relevant .info, .plot and .yoda files into the YAML format suitable for HEPData submission? Conversely, given a HEPData submission in the YAML format, can we export .info, .plot and .yoda (see #5) files suitable for inclusion in a Rivet analysis?

root: catch segfault when converting large tables

The ROOT conversion of https://www.hepdata.net/record/ins1742786 fails with an HTML file containing a message "502 Bad Gateway. The server returned an invalid or incomplete response.". By running the conversion offline, most tables can be converted other than "Statistical covariance matrix" and "Systematic covariance matrix" which gave an error message like:

Fatal in <TBufferFile::AutoExpand>: Request to expand to a negative size, likely due to an integer overflow: 0x91111b20 for a max of 0x7ffffffe.
aborting
[...]
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted

I can't find anything unusual in the YAML formatting of these tables, so I think the problem is simply caused by the large size (17424 rows and 2.6 MB). If the problem cannot be resolved, at least the segfault should be caught and a suitable error message returned from the problematic line of the conversion code.
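One possible way to catch the crash (an assumption, not a chosen fix) would be to run the per-table ROOT conversion in a child process, so that a segfault only kills the child and can be reported from its exit code:

import multiprocessing

def convert_table_safely(write_table, output_path, table):
    # 'write_table' is a hypothetical picklable function converting one table to 'output_path'.
    proc = multiprocessing.Process(target=write_table, args=(output_path, table))
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError("ROOT conversion of table '%s' failed with exit code %s"
                           % (table.name, proc.exitcode))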

yaml parser: only parse relevant YAML data file for single-table conversion

Currently, the YAML parser reads all YAML data files into a tables list even if only one table is requested in the output. This is very inefficient. Moreover, the single-table conversion can fail for very large records like https://www.hepdata.net/record/ins2132368 (313 data tables) that cannot be parsed within the 5-minute server timeout limit. The YAML parser should check for the presence of a table option and only parse the relevant table into the tables list.
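A rough sketch of that check (not the parser's actual code; the data_file handling and relative-path details are simplified):

import os
import yaml

def load_requested_table(submission_path, table_name):
    # submission.yaml is a multi-document YAML file; each table document carries
    # 'name' and 'data_file' keys, so only the requested table needs to be loaded.
    base_dir = os.path.dirname(submission_path)
    with open(submission_path) as submission_file:
        for document in yaml.safe_load_all(submission_file):
            if document and document.get('name') == table_name:
                with open(os.path.join(base_dir, document['data_file'])) as data_file:
                    return yaml.safe_load(data_file)
    raise ValueError("Table '%s' not found in %s" % (table_name, submission_path))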

setup.py missing dependencies

So I ran a fresh install in a virtualenv and it seems like setup.py is missing some deps.

I needed to pip install

  • funcsigs
  • functools32
  • yoda

Should I just add those as hard dependencies? (Maybe yoda should not be an explicit dependency, but then hepdata-converter should not crash if there is an ImportError.)
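One possible compromise (an assumption, not a decision in the repository) is to import yoda lazily and fail with a clear message only when YODA output is actually requested:

try:
    import yoda
except ImportError:
    yoda = None

def require_yoda():
    # Called by the YODA writer before any conversion work is done.
    if yoda is None:
        raise RuntimeError("YODA output requested but the 'yoda' package is not installed")
    return yoda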

Reduce dependencies of hepdata-converter

Hi all,

Thanks for providing such a nice tool. It's not trivial to seamlessly convert HEPData YAML files into formats like ROOT files in a generic way!

I would suggest, though, removing the dependencies on ROOTPy and Numpy. Having a small set of dependencies is always of great help when it comes to interoperability, and the aforementioned packages are not small.

Here are two simple ideas to replace the current usage of the two packages.

ROOTPy

Rootpy is used only here: https://github.com/HEPData/hepdata-converter/blob/master/hepdata_converter/writers/root_writer.py#L340

One could generate a random long name and delete the file after closing it.
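For example, something along these lines with only the standard library (illustrative; the actual ROOT writing step is omitted):

import os
import tempfile

fd, tmp_path = tempfile.mkstemp(suffix='.root')  # unique, hard-to-collide name
os.close(fd)
try:
    pass  # open tmp_path with ROOT.TFile(tmp_path, 'RECREATE') and write the objects
finally:
    os.remove(tmp_path)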

Numpy

The only element of Numpy I found in the codebase is numpy.array. This can be replaced directly with the native Python array.array class.
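The replacement is a one-line change in most places; for example (illustrative values, with type code 'd' meaning C doubles, which PyROOT constructors such as TGraph accept):

import array

x_values = array.array('d', [1.0, 2.0, 3.0])
y_values = array.array('d', [0.5, 1.5, 2.5])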

root: converter failing on some files with a null pointer

e.g. https://www.hepdata.net/record/sandbox/1469163359

def __init__(self, *args, **kwargs):
    super(ROOT, self).__init__(*args, **kwargs)
    self.extension = 'root'

def _write_table(self, data_out, table):
    data_out.mkdir(table.name)
    data_out.cd(table.name)

    f = ObjectFactory(self.class_list, table.independent_variables, table.dependent_variables)
    for graph in f.get_next_object():
        graph.title = table.name

ReferenceError: attempt to access a null-pointer

Conversion between same file formats

During my exchange with @GraemeWatt, the issue of converting from yaml -> yaml arose. For now the default behaviour is to not do anything at all, but after a little bit of consideration, allowing "conversion" between the same file formats may actually be of some use to part of the user base.

For example:

  • converting from multiple-file yaml to single-file yaml (this is connected to #12)

Alternatively, the default behaviour could be to just copy the input to the output. Any comments on this one, @eamonnmag?

csv/root/yoda: allow for columns with missing values

For example, Tables 2, 3, 4, 7 and 8 of https://www.hepdata.net/record/ins1216885 return an HTML file with IndexError: list index out of range when attempting to convert to CSV, ROOT or YODA formats. The URL https://www.hepdata.net/download/submission/ins1216885/1/root currently appears on the first page of results in a Google search for return self.wsgi_app(environ, start_response). This came up in "Top growing queries" for May 2022 in an email from Google Search Console.

yaml parser: add option to skip validation of data tables

The validation of data tables can be time-consuming for submissions with many large data tables. For new submissions, it is unnecessary as the validation will already have been performed during the upload stage. An option should be added to switch off data table validation, which can be used when the converter is called from the main HEPData web app.
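A sketch of what the parser could do, assuming a hypothetical 'validate' option and using the hepdata_validator package's DataFileValidator:

from hepdata_validator.data_file_validator import DataFileValidator

def maybe_validate(data_file_path, options):
    # Skip validation entirely when the caller (e.g. the main web app) sets
    # the hypothetical 'validate' option to False.
    if not options.get('validate', True):
        return
    validator = DataFileValidator()
    if not validator.validate(file_path=data_file_path):
        validator.print_errors(data_file_path)
        raise ValueError("Validation failed for %s" % data_file_path)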

root: add support for ROOT TH2Poly objects for tables with two independent variables

Related to #37, the ROOT TH2Poly object is a 2D histogram class that allows polygonal bins of arbitrary shape to be defined. In the case of a table with two independent variables, where bins of a given independent variable overlap with other bins of the same independent variable, but the 2D polygonal bins do not overlap (a check should be made), TH2Poly histograms could be written instead of the usual TH2F histograms. Here is an example record.

Suggestion from Antoni Aduszkiewicz.

oldhepdata: remove default None values for the data_license in submission.yaml

Remove these lines:

'data_license': {
    'name': None,
    'url': None,
    'description': None  # (optional)
},

since None is not a valid value of these fields in version 1.0.0 of the schema, where a string is expected. This problem has been worked around in the code for the main web app (HEPData/hepdata) by temporarily validating using version 0.1.0 of the schema if uploading an oldhepdata file, where no checks at all were made on the data_license. The temporary fix to validate using version 0.1.0 of the schema should be removed after this issue is closed and a new hepdata-converter version is made available.

oldhepdata: check for units if qualifier or header is SQRT(S)

This line gives an exception AttributeError: 'NoneType' object has no attribute 'lower' if a qualifier with name SQRT(S) has no units:

if name.startswith('SQRT(S)') and units.lower() in ('gev'):

Similarly if SQRT(S) is given as the name of a header without units:

if xheader['name'].startswith('SQRT(S)') and xheader['units'].lower() in ('gev'):

An additional condition should be added to these lines to check that the units are present and not None.
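A hedged sketch of the guarded check (note the trailing comma making ('gev',) a tuple, so the comparison is against the exact string rather than a substring of 'gev'):

def is_sqrt_s_in_gev(name, units):
    # Guard against units being None or empty before calling .lower().
    return name.startswith('SQRT(S)') and bool(units) and units.lower() in ('gev',)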
