ICHiggsTauTau

Introduction

This repository is best considered in two parts. The first is a CMSSW package (UserCode/ICHiggsTauTau), which follows the standard conventions:

  • Physics object classes defined and implemented in the interface and src directories
  • CMSSW modules defined and implemented in the plugins directory that convert CMSSW objects to these formats
  • Configuration fragments for cmsRun jobs in the python directory
  • Complete configurations for cmsRun jobs in the test directory
  • Compiled using the scram tool, which will produce ROOT dictionaries for the object classes following the specification in the src/classes.h and src/classes_def.xml files (see the sketch below)
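
As a sketch of what this looks like for a single class (the class name and header path here are illustrative, not verbatim from the package):

	// src/classes.h: include the class headers and declare any template
	// instantiations that need dictionaries
	#include <vector>
	#include "UserCode/ICHiggsTauTau/interface/Candidate.hh"

	namespace {
	  struct dictionary {
	    ic::Candidate dummy_candidate;
	    std::vector<ic::Candidate> dummy_candidate_vector;
	  };
	}

	// src/classes_def.xml then lists the same types, roughly:
	//   <lcgdict>
	//     <class name="ic::Candidate"/>
	//     <class name="std::vector<ic::Candidate>"/>
	//   </lcgdict>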

The second part is an offline analysis framework, organised into a series of packages within the Analysis directory. This provides:

  • A modular build system implemented with make
  • A simple framework for analysing events, in which the work is done by module classes, in a similar fashion to the CMSSW framework
  • The same physics object classes as above, provided in the Objects package, with ROOT dictionaries built using the standard rootcint method and steered by the Objects/interface/LinkDef.h file (see the module sketch below)
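
To illustrate the module pattern, a minimal sketch of what a module looks like (the base-class name, include paths and hook methods are assumptions based on the description above, not verbatim from the package):

	#include <string>
	#include "Core/interface/ModuleBase.h"  // assumed location of the base class
	#include "Core/interface/TreeEvent.h"

	class ExampleModule : public ic::ModuleBase {
	 public:
	  explicit ExampleModule(std::string const& name) : ic::ModuleBase(name) {}
	  virtual int PreAnalysis() { return 0; }      // run once before the event loop
	  virtual int Execute(ic::TreeEvent* event) {  // run on every event
	    // retrieve collections from the event and do the real work here
	    return 0;                                  // 0 = keep event, non-zero = reject
	  }
	  virtual int PostAnalysis() { return 0; }     // run once after the event loop
	};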

Documentation can be produced by running docs/make_docs.sh from within the ICHiggsTauTau directory. An up-to-date copy can always be found here: http://ajgilbert.github.io/ICHiggsTauTau/index.html

The instructions for setting up a CMSSW area with the ICHiggsTauTau package are reproduced below.

Setting up CMSSW

The following steps set up a new area, but note that only certain CMSSW releases are supported. All object classes and CMSSW modules should compile successfully with scram in each of the supported CMSSW release series:

	CMSSW_5_3_X
	CMSSW_7_0_X, CMSSW_7_1_X, CMSSW_7_2_X, CMSSW_7_3_X

The official CMSSW repository is hosted here: https://github.com/cms-sw/cmssw. If you do not already have a GitHub account, please read through the instructions at http://cms-sw.github.io/cmssw/faq.html, in particular ensure you have configured git with your personal information:

	git config --global user.name "[First Name] [Last Name]"
	git config --global user.email [Email Address]
	git config --global user.github [GitHub account name]
	#If you are running at IC you also need to do: git config --global http.sslVerify false
	#This extra line is not needed at CERN. Anywhere else if you see a fatal error message relating to SSL certificate problems it may be required

Before working with CMSSW in git, you will need to create a copy (or fork) of the official CMSSW repository in your own account at https://github.com/cms-sw/cmssw/fork. You will be able to store your own CMSSW branches and developments in this forked repository.

Create a new CMSSW area:

	export SCRAM_ARCH=slc6_amd64_gcc491
	# Or older, e.g. slc5_amd64_gcc472, slc6_amd64_gcc481
	scramv1 project CMSSW CMSSW_X_Y_Z
	cd CMSSW_X_Y_Z/src/
	cmsenv

Initialise this area for git by adding a single package:

	git cms-addpkg FWCore/Version
	# If you are not running on a machine based at CERN, this script will ask if you want to create a new reference repository.
	# You should answer yes to this, and the script will copy the entire cmssw repository to your home folder,
	# which will make setting up subsequent release areas a lot faster.

This command will have created two remote repositories, official-cmssw and my-cmssw. It will also have created and switched to a new branch, from-CMSSW_X_Y_Z. An additional remote should be added which provides the pre-configured branches:

	git remote add ic-cmssw [email protected]:ajgilbert/cmssw.git
	# fetch from the ic-cmssw remote repository and merge the from-CMSSW_X_Y_Z branch into your own local branch.
	git pull ic-cmssw from-CMSSW_X_Y_Z
	# Check which branch you actually need to merge in here. Descriptive names are useful, e.g. "higgstautau_from-CMSSW_5_3_7"

	# At this point, if you run 'ls' you will not see any new packages in the release area.
	# This is because the repository operates in sparse-checkout mode, hiding folders unless they are
	# explicitly made visible.  This is important, as we don't want to have to compile every single package.
	# To make the packages visible that have been modified from the release tag, run these commands:

	git cms-sparse-checkout CMSSW_X_Y_Z HEAD
	git read-tree -mu HEAD

	# You may optionally make other packages visible that depend on those which have been modified, so that
	# they will be recompiled.
	git-cms-checkdeps -a
	# In principle this step was also needed when using CVS, but it is still not essential.

Next, add the IC analysis code package:

	git clone [email protected]:ajgilbert/ICHiggsTauTau.git UserCode/ICHiggsTauTau
	./UserCode/ICHiggsTauTau/init_X_Y_Z.sh  # (if it exists) This script performs a few final tasks in the new CMSSW area

At this point everything is ready, and the working area can be compiled in the normal way with scram. New developments that are relevant for everyone can be committed to a branch and pushed to the central IC repository (ic-cmssw). If you wish to test new changes, or just share them with specific people, it is safer to work on a new branch, based either on the CMSSW release tag or on some commit on the from-CMSSW_X_Y_Z branch, e.g.

	git checkout -b my-analysis-from-CMSSW_X_Y_Z CMSSW_X_Y_Z # from the release tag
	git checkout -b my-analysis-from-CMSSW_X_Y_Z from-CMSSW_X_Y_Z # from the pre-configured branch
	... do some development ...
	git push my-cmssw my-analysis-from-CMSSW_X_Y_Z

Contributors

adewit, ajgilbert, albertdow, amagnan, danielwinterbottom, gputtley, mhassans, padraic-padraic, uttleygp


Issues

What to do with HTTAnalysisTools

This concerns @adewit and @ajgilbert - so far we have kept HTTSequence capable of running the paper2013 strategy to produce flat trees in sync with those we made for the Run 1 analyses. However, it is potentially more complicated to keep the flat-tree-reading code (i.e. HTTAnalysisTools linking up with HiggsTauTauPlot4) compatible with both old and new strategies. Background methods and aliases are coded very specifically for the Run 1 selections (using the "method" quantity), and HTTAnalysisTools is already 1700 lines long - do we want to try to keep both strategies available in this part of the code? Of course if we don't then we cannot remake the 8 TeV datacards without using an old branch. If we do keep both then I expect defining a new set of "method"s would probably be the simplest. Thoughts?

New electron ID variable

It appears the new cut-based electron ID has a variable that we don't save at the moment:

https://twiki.cern.ch/twiki/bin/viewauth/CMS/CutBasedElectronIdentificationRun2#Working_points_for_2016_data_for

dEtaInSeed => appears to be calculated as:

http://cmslxr.fnal.gov/source/RecoEgamma/ElectronIdentification/plugins/cuts/GsfEleDEtaInSeedCut.cc?v=CMSSW_8_0_21#0030

also same definition here:
https://twiki.cern.ch/twiki/bin/view/CMS/HEEPElectronIdentificationRun2#Selection_Cuts_HEEP_V5_1

If we want to add this we should check that it works OK on miniAOD (it should), i.e. that:

ele->superCluster().isNonnull() && ele->superCluster()->seed().isNonnull()

evaluates to true, i.e. that both references are non-null.
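
A minimal sketch of the full calculation, following the GsfEleDEtaInSeedCut code linked above and guarding against missing references:

	#include <limits>
	#include "DataFormats/EgammaCandidates/interface/GsfElectron.h"

	float DEtaInSeed(reco::GsfElectron const& ele) {
	  // the definition needs both a valid supercluster and a valid seed cluster
	  if (ele.superCluster().isNonnull() && ele.superCluster()->seed().isNonnull()) {
	    return ele.deltaEtaSuperClusterTrackAtVtx() - ele.superCluster()->eta() +
	           ele.superCluster()->seed()->eta();
	  }
	  return std::numeric_limits<float>::max();  // sentinel when references are missing
	}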

ICGenParticlePruner doesn't copy status flags

@adewit @rcl11 While working with the new status flags I found that our ICGenParticlePruner doesn't keep the status flag information. ICGenParticlePruner seems to be a copy of an older version of the CMSSW GenParticlePruner, the new version of which does keep the status flags. I've switched from ICGenParticlePruner to GenParticlePruner in my config and have not seen any issues.
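
For the record, if anyone does want to patch ICGenParticlePruner instead, a rough sketch of the missing step when the output particle is built from an input one (illustrative names; the constructor follows the reco::GenParticle interface):

	#include "DataFormats/HepMCCandidate/interface/GenParticle.h"

	reco::GenParticle CopyWithFlags(reco::GenParticle const& src) {
	  // copy the kinematics, as the pruner already does...
	  reco::GenParticle out(src.charge(), src.p4(), src.vertex(),
	                        src.pdgId(), src.status(), true);
	  // ...and also transfer the GenStatusFlags bitset, which is what the
	  // current ICGenParticlePruner copy misses
	  out.statusFlags() = src.statusFlags();
	  return out;
	}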

Re-integrate the electron conversion veto flag into this package

It is currently calculated and provided (as a ValueMap) by this external package: https://github.com/ajgilbert/ICAnalysis-ElectronConversionCalculator

This was originally kept external due to compiler problems in CMSSW_4_X_Y and early CMSSW_5_X_Y releases.

Someone should check:

  • whether the two plugins defined in this package compile OK in our current supported releases
  • if not, whether we can use #if CMSSW_MAJOR_VERSION >= X preprocessor checks to work around this (see the sketch after this list)
  • whether there is a new definition/algorithm for the conversion veto in CMSSW_7_X_Y for Run 2. If so, this should be added as a new plugin that produces a ValueMap in the same format.
    • Related to this: check how the veto is calculated in PAT/MINIAOD in 7_3_X/7_4_X and make sure we know how to produce it consistently in AOD directly on the gedGsfElectrons
  • whether it is possible to re-calculate the veto at miniAOD level (probably not, but worth knowing)
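
For the second point, the guard would look something like this (a minimal sketch, assuming the CMSSW_MAJOR_VERSION macro is usable in all supported releases):

	#if CMSSW_MAJOR_VERSION >= 7
	  // Run 2 releases: build/use the new conversion-veto plugin here
	#else
	  // older releases: fall back to the external calculator package
	#endif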

Treatment of GenParticles in miniAOD

Have to deal with two separate collections:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookMiniAOD#MC_Truth

The prunedGenParticles are a normal reco::GenParticle collection and should contain everything we need in the analysis (matrix element, full tau chain decay, final state electrons/muons). We can just save this as normal with ICGenParticleProducer.

The packedGenParticles are the new pat::PackedGenParticle type and contain all status 1 particles up to some high rapidity; they are mainly intended for clustering gen jets.

Plan A:

  • Would like to be able to produce an ic::GenParticle from a PackedGenParticle. Can either template ICGenParticleProducer on the input type, or write a brand new producer (see the sketch after these lists)
  • As pointed out on the miniAOD twiki, some care must be taken when following mother/daughter relations that span the two collections. Currently it wouldn't be possible to do this if we saved two separate ic::GenParticle collections, so if we want to support it we have to come up with a recipe - possibly merging the two collections into one before we write it

Plan B:

  • decide we don't care about the packedGenParticles at all in the analysis and don't bother with any of this
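
For the templating idea in Plan A, a rough sketch (the ic::GenParticle setter names and header path are assumptions): both reco::GenParticle and pat::PackedGenParticle expose the same Candidate-style accessors, so one templated conversion could cover both input types.

	#include "DataFormats/HepMCCandidate/interface/GenParticle.h"
	#include "DataFormats/PatCandidates/interface/PackedGenParticle.h"
	#include "UserCode/ICHiggsTauTau/interface/GenParticle.hh"  // assumed path

	template <class T>
	ic::GenParticle MakeICGenParticle(T const& src) {
	  ic::GenParticle out;
	  out.set_pt(src.pt());  // kinematic accessors shared by both input types
	  out.set_eta(src.eta());
	  out.set_phi(src.phi());
	  out.set_energy(src.energy());
	  out.set_pdgid(src.pdgId());
	  out.set_status(src.status());
	  // mother/daughter indices would still need the merging recipe above
	  return out;
	}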

Job output (number of open files)

The IC disk server doesn't really like too many files being open at the same time, which means running systematic shifts as well as the central values from the same job makes vols super slow (and our colleagues, who then can't do any work, super annoyed). We can work around this by running the systematic shifts separately, but at some point we should investigate better options, for example writing a separate tree for each shift into one file (sketched below), then splitting them into separate files after the jobs finish. Not sure this would actually be better than the current workaround, or that it wouldn't overload the disk server in some other way, but we should check.
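
The one-file idea, very roughly (plain ROOT, illustrative names): each shift gets its own TTree in a single output file, so the job only ever holds one file handle open.

	#include "TFile.h"
	#include "TTree.h"

	void MakeShiftTrees() {
	  TFile out("output.root", "RECREATE");
	  TTree nominal("ntuple", "nominal");
	  TTree tes_up("ntuple_tscale_up", "tau ES up");
	  TTree tes_down("ntuple_tscale_down", "tau ES down");
	  // ... set up branches and fill the trees in the event loop ...
	  out.Write();  // all three trees end up in the single output file
	  out.Close();
	}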

JECs in CMSSW 7_6_X

Not very urgent, but something I noticed when testing the code in CMSSW 7_6: the L1FastjetCorrectionESProducer (which we currently use to apply JECs to reclustered jets) doesn't work anymore. I haven't found a straightforward workaround/alternative yet, though I'm almost certain there must be one. If not, I think we can fix the module ourselves, as the only reason it doesn't work is that getByLabel is used without a consumes call, which shouldn't be too hard to add in.

Corrupted double-linked list

If running ./bin/HTT on a local file with EventChecker enabled, it will run to the end and write the output tree, then crash complaining about a corrupted double-linked list. This never happens when running off files on dcache, or when running on a local file without using EventChecker. So there is some memory problem, though at the moment I fail to understand why it only appears under these very specific circumstances.

Anyway, low priority, but one to fix should we run out of stuff to do.

Jet flavour calculator for miniAOD reclustered jets

The jet flavour calculator for reclustered miniAOD jets fails because of course the prunedGenParticles don't contain all status 2 and 3 particles. I don't think there's much we can do about this as these particles are simply dropped from the event. If we did need the jet flavour for reclustered jets we could match them to the slimmedJets collection and use the stored flavour (if the reclustered jets are ak4 CHS too).
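
If it does become necessary, a rough sketch of that matching fallback (illustrative; assumes the reclustered jets are AK4 CHS and a release recent enough to have pat::Jet::hadronFlavour()):

	#include <vector>
	#include "DataFormats/JetReco/interface/Jet.h"
	#include "DataFormats/PatCandidates/interface/Jet.h"
	#include "DataFormats/Math/interface/deltaR.h"

	int MatchedFlavour(reco::Jet const& reclustered,
	                   std::vector<pat::Jet> const& slimmed,
	                   double max_dr = 0.4) {
	  int flavour = 0;  // returned if no match is found
	  double best = max_dr;
	  for (std::vector<pat::Jet>::const_iterator it = slimmed.begin();
	       it != slimmed.end(); ++it) {
	    double dr = reco::deltaR(reclustered, *it);
	    if (dr < best) {
	      best = dr;
	      flavour = it->hadronFlavour();  // flavour stored at miniAOD production
	    }
	  }
	  return flavour;
	}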

Code updates towards future analyses

A non-exhaustive list of code updates that will be needed in the next 3-4 months to be able to analyse the full 2016 dataset and to support future analyses.

CMSSW-facing part of the code/CMSSW config:

  • Switch to at least CMSSW 8_0_20
  • Switch off filtering mode of met filters
  • Add new recipe for updating T1 corrected PFMet / extracting covariance matrix
  • Check if there are any other updated IDs and store extra variables where necessary (e.g. #161 but possibly other cases too)
  • Clean up config (which still contains snippets of code for running on AOD even though we do not need this anymore/possibly loads of other unused code)
  • Test the on-the-fly miniAOD generation (in case we do want to run on AOD, though we probably won't want to do this any time soon so not urgent)
    (update 25/11):
  • Rewrite jet producer (can drop some of the jet sources/remove support for calo and jpt jets --> can get rid of the jetSrcHelper and jetDestHelper. Could take a little while to do)

Analysis code:

  • Remove 8 TeV code (already in progress)
  • Adapt for full 2016 rereco dataset (+ MC when it becomes available)
  • Apply tau energy scale shift to the full tau collection before selecting taus (more correct than current implementation). This requires gen matching of all reconstructed hadronic taus, not just the ones part of a selected pair, at the start of the chain.
  • Rewrite plotting code (to make more flexible and understandable)
  • Include option for making 2D datacards/plots in case this remains the norm in H->tautau
  • Implement fake rate method
  • Many other analysis modules could probably be rewritten to run faster (I suspect the PairGenInfo module could be more efficient), but probably not urgent

Event numbers for CheckEvents

Currently CheckEvents reads the events to be checked from within HTTSequence, which means you have to recompile every time you add/remove event numbers. It should be modified so that it can read the event numbers from an external file instead.
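
A minimal sketch of the external-file version (illustrative names; one event number per line):

	#include <fstream>
	#include <string>
	#include <vector>

	std::vector<unsigned long long> ReadEventList(std::string const& path) {
	  std::vector<unsigned long long> events;
	  std::ifstream input(path.c_str());
	  unsigned long long event;
	  while (input >> event) events.push_back(event);  // one number per line
	  return events;
	}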

Handling event weights in aMC@NLO (or other NLO generators)

Two issues here:

  • Need to extract the signed event weight for NLO MC events (until now we haven't needed this). These weights also have a magnitude, such that summing over all weights in a sample gives the NLO xsec calculated by the generator. We probably don't care about the magnitude, as these numbers are often superseded by a higher-order calculation, so I'd propose we only store the sign of the weight. From a filling-histograms point of view it's also easier to then normalise to luminosity. Need to adapt ICEventInfoProducer to extract this weight from the LHE (see the sketch after this list).
  • Not as urgent, but still interesting, is that the MadGraph5_aMC@NLO samples should contain additional weights that account for systematic variations, e.g. there should be weights for shifts of the renormalisation and factorisation scales. Could be useful to add an option to store these too - we already have the ability to add a weight in ic::EventInfo but have it disabled in the total_weight() calculation. In the analysis doing the systematic shifts would then be as simple as switching the desired weight on.
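
For the first point, the extraction in ICEventInfoProducer could look roughly like this (a sketch; "externalLHEProducer" is the usual module label but is an assumption here):

	#include "FWCore/Framework/interface/Event.h"
	#include "SimDataFormats/GeneratorProducts/interface/LHEEventProduct.h"

	// inside the producer's produce() method:
	edm::Handle<LHEEventProduct> lhe_handle;
	event.getByLabel("externalLHEProducer", lhe_handle);
	double wt = lhe_handle->originalXWGTUP();  // signed generator weight
	int sign = (wt < 0.) ? -1 : 1;             // store only the sign, as proposed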

I found these slides useful in explaining the details:
https://indico.cern.ch/event/388914/contribution/0/material/slides/0.pdf

I will try and look into this next week, but if someone else wants to get started in the meantime please go ahead.

Daughters and status codes in new MC have changed

The new Pythia 8 status codes don't provide a direct replacement for status 3. The new status range 21-29 is similar, but differs in that the lepton from a W->lnu decay, for instance, isn't always status 21-29.

I've emailed Josh Bendavid to ask if there is a recommendation, and it also appears from slides from Gen group meetings in early May that they are preparing a new set of "status flags" to try and solve this issue.

Package is broken in 7_6_0...

... and I mean really broken:

  • this doesn't compile anymore:

      edm::RefToBaseProd<T> reftobase = edm::RefToBaseProd<T>((handle->refAt(0)));
    
  • we are now forced to use the edm::consumes mechanism, which isn't available in 5_3_X, so we will need some sort of workaround - ideally without a huge amount of #ifdef'ing (see the sketch after this list)

  • Need to check if CMSSW producers work ok in multi-threaded mode. Is this enabled by default?
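
On the consumes point, a guarded migration could look roughly like this (a sketch, assuming the CMSSW_MAJOR_VERSION macro discussed for the conversion-veto issue above; names are illustrative):

	#include "FWCore/Framework/interface/EDProducer.h"
	#include "FWCore/Framework/interface/Event.h"
	#include "FWCore/Framework/interface/EventSetup.h"
	#include "FWCore/ParameterSet/interface/ParameterSet.h"
	#include "DataFormats/Candidate/interface/Candidate.h"
	#include "DataFormats/Common/interface/View.h"

	class ICExampleProducer : public edm::EDProducer {
	 public:
	  explicit ICExampleProducer(edm::ParameterSet const& config)
	      : input_(config.getParameter<edm::InputTag>("input")) {
	#if CMSSW_MAJOR_VERSION >= 7
	    token_ = consumes<edm::View<reco::Candidate> >(input_);  // register dependency
	#endif
	  }

	 private:
	  virtual void produce(edm::Event& event, edm::EventSetup const&) {
	    edm::Handle<edm::View<reco::Candidate> > handle;
	#if CMSSW_MAJOR_VERSION >= 7
	    event.getByToken(token_, handle);  // consumes-aware access
	#else
	    event.getByLabel(input_, handle);  // legacy access for 5_3_X
	#endif
	  }

	  edm::InputTag input_;
	#if CMSSW_MAJOR_VERSION >= 7
	  edm::EDGetTokenT<edm::View<reco::Candidate> > token_;
	#endif
	};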

JECs in ICPFJetProducer (for jets from PFCands)

The JECs calculated by ICPFJetProducer for jets built from (packed) PF candidates are different from the ones calculated by PAT; we need to understand why if we want to produce jets without running the PAT jet sequence.

Issues blocking 5_3_7 OOTB

Compilation:
/afs/cern.ch/work/a/agilbert/CMSSW_TEST/CMSSW_5_3_7/src/UserCode/ICHiggsTauTau/plugins/ICMetProducer.cc:16:62: fatal error: DataFormats/METReco/interface/PFMEtSignCovMatrix.h: No such file or directory
--> PFMEtSignCovMatrix no longer needed, as we won't be importing an external MET covariance matrix anymore. Will remove this option from the code. Possible knock-on effect to any cfg files where the option "InputSig" is defined.

/afs/cern.ch/work/a/agilbert/CMSSW_TEST/CMSSW_5_3_7/src/UserCode/ICHiggsTauTau/plugins/ICPhotonProducer.hh:10:73: fatal error: EgammaAnalysis/ElectronTools/interface/PFIsolationEstimator.h: No such file or directory
--> Pending

Large number of SVFit jobs with new workflow

For @ajgilbert and @adewit: it is clear that the number of events in the output trees is now a lot larger than it was in Run 1, because cuts like isolation have moved to after the output tree. This poses a problem for our SVFit workflow, which currently requires more than 10 000 jobs for the full set of MC samples (at 7000 events per job, which I think is what we had in Run 1 and still makes each job several hours long), and this number will only grow as we add the exclusive samples. The question is whether we live with this and simply risk waiting longer for SVFit jobs (since so many of our cuts are now at ntuple level we should in principle need to run it less often, although that has not been the reality so far, given the fairly frequent changes of triggers and other pre-ntuple choices), or whether it would be worth altering the workflow to apply a preselection and only run the calculation for a subset of the events in the tree (filling the SVFit mass branch with -999 for events without a calculation). Thoughts? Experiences of trying to get >10000 jobs of that length through any kind of batch system?

Need to rename Analysis/*/data directories

We currently store various root files and inputs that are needed for the analyses here (JEC files, MVA trainings etc.). Unfortunately it turns out that crab scans recursively for any directory named data under $CMSSW_BASE/src and ships this off with each job. This wastes storage space and time packing and sending these files to the crab server. I suggest we rename the folder from "data" to "input" to avoid this.

Need to revisit job splitting

In the latest of the IC batch improvements we can now only use < 1/3 of the short queue at the same time, so we need to do something slightly cleverer with the hundreds of jobs that we have at the moment... or face half-day-long waits to rerun everything.
I haven't got any ideas (or time to mess about with it), but it is definitely something to sort out in the next few months.

Memory leak (in GetPtr/GetPtrVec?)

I'm not sure if anybody else is affected by this, but as I can't figure out how to solve this problem...

This is what happens:

  • When running HTT.cpp with more than 39 input files the job crashes throwing a bad_alloc. The same thing happens when running HiggsTauTau.cpp after 35 input files.
  • When I added a dry run through the filelist to determine the needed vector size and reserved enough space in the vector of files, the job still crashed at the same file, now spitting out a bunch of these errors:
R__unzipLZMA: error 5 in lzma_code
Error in <TBasket::ReadBasketBuffers>: fNbytes = 4312, fKeylen = 91, fObjlen = 30012, noutot = 0, nout=0, nin=4221, nbuf=30012
Error in <TBranchElement::GetBasket>: File: root://xrootd.grid.hep.ph.ic.ac.uk//store/user/adewit/July08_MC_74X/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/crab_DYJetsToLL-2/150710_084817/0000/EventTree_964.root at byte:187660, branch:genParticles.pdgid_, entry:2, badread=1, nerrors=1, basketnumber=0

before throwing a bad_alloc

  • I ran with valgrind but I'm not sure the output is hugely useful, for example:
==10381== 7,024 (80 direct, 6,944 indirect) bytes in 2 blocks are definitely lost in loss record 269,416 of 269,644
==10381==    at 0x4806FB5: operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/valgrind/3.10.0/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10381==    by 0x4935D60: std::vector<ic::Tau*, std::allocator<ic::Tau*> >& ic::TreeEvent::GetPtrVec<ic::Tau>(std::string const&, std::string) (in /vols/cms04/amd12/CMSSW_7_2_0/src/UserCode/ICHiggsTauTau/Analysis/HiggsTauTau/lib/libICHiggsTauTau.so)
==10381==    by 0x49DB4F7: ic::SimpleFilter<ic::Tau>::Execute(ic::TreeEvent*) (in /vols/cms04/amd12/CMSSW_7_2_0/src/UserCode/ICHiggsTauTau/Analysis/HiggsTauTau/lib/libICHiggsTauTau.so)
==10381==    by 0x4B13F1F: ic::AnalysisBase::RunAnalysis() (in /vols/cms04/amd12/CMSSW_7_2_0/src/UserCode/ICHiggsTauTau/Analysis/Core/lib/libICCore.so)
==10381==    by 0x40DF11: main (in /vols/cms04/amd12/CMSSW_7_2_0/src/UserCode/ICHiggsTauTau/Analysis/HiggsTauTau/bin/HTT)

This suggests GetPtrVec is causing a memory leak. I don't see what the issue with it is though. Has anybody else seen this before/any ideas where else to look for the problem?

Clean up of FnPredicates and FnPairs

Inspired by #26 - should go through FnPredicates and FnPairs and:

  • remove old functions that aren't used anywhere (and won't be used again)
  • organise and document functions, ideally with links to twiki pages or other documentation for ID/iso selectors. Should aim to replicate the style of Plotting.h.

This can be considered a low priority task :-)

Need to use FileBased splitting with 1 file per job for ntuple production when including mvamet

_NOT AN ISSUE ON OUR SIDE_ but still relevant to anyone trying to use the code to produce ntuples with mvamet included (so mainly @pjdunne and @amagnan) - in CMSSW_7_4_12 (and _15), puJetIdForPFMVAMEt crashes if you try to run on more than one file in the same job.
The solution (for now) is to use FileBased splitting and run one file per job - this behaviour has been flagged up to the puJetID people, will report back when I know more.
