From 598297d368a336078b22231633233e5617af479c Mon Sep 17 00:00:00 2001 From: adewit Date: Thu, 14 Dec 2023 09:59:18 +0100 Subject: [PATCH] Documentation language edit /spell check (#871) * First iteration of spell check * Spell check for setup * physics model spell check * Add commonstatsmethods language check * Finish spellcheck part 3-4-5 * Fixing Combine small caps text in docs --- contributing.md | 20 +- docs/index.md | 60 +++--- docs/part2/bin-wise-stats.md | 22 +-- docs/part2/physicsmodels.md | 56 +++--- docs/part2/settinguptheanalysis.md | 280 +++++++++++++------------- docs/part3/commonstatsmethods.md | 302 ++++++++++++++--------------- docs/part3/debugging.md | 2 +- docs/part3/nonstandard.md | 254 ++++++++++++------------ docs/part3/regularisation.md | 48 ++--- docs/part3/runningthetool.md | 204 +++++++++---------- docs/part3/simplifiedlikelihood.md | 60 +++--- docs/part3/validation.md | 54 +++--- docs/part4/usefullinks.md | 42 ++-- docs/part5/longexercise.md | 50 ++--- docs/part5/roofit.md | 98 +++++----- 15 files changed, 777 insertions(+), 775 deletions(-) diff --git a/contributing.md b/contributing.md index 172ed36f2b3..4f0992a2461 100644 --- a/contributing.md +++ b/contributing.md @@ -1,6 +1,6 @@ # Contributing -Contributions to combine of all sizes, from minor documentation updates to big code improvements, are welcome and encouraged. +Contributions to Combine of all sizes, from minor documentation updates to big code improvements, are welcome and encouraged. To ensure good development of the tool, we try to coordinate contributions. However, we are happy to help overcome any steps that may pose issues for contributors. @@ -42,19 +42,19 @@ ensure `flake8` and `black` are installed: and then from the main directory of this repository run -flake8: +`flake8`: ``` flake8 . ``` -and black: +and `black`: ``` black -l 160 --check --diff . ``` If you'd like to see the details of the configuration `flake8` is using, check the `.flake8` file in the main directory. -The black linting uses the default [black style](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html) (for v23.3.0), with only the command line options shown above. +The `black` linting uses the default [black style](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html) (for v23.3.0), with only the command line options shown above. ## Updating Documentation @@ -65,7 +65,7 @@ For that reason, whenever you make a change you should consider whether this req If the change is user-facing it almost certainly does require a documentation update. Documentation is **very important** to us. -Therefore, we will be picky and make sure it is done well! +Therefore, we will be meticulous and make sure it is done well! However, we don't want to put extra burden on you, so we are happy to help and will make our own edits and updates to improve the documentation of your change. We appreciate you putting in some effort and thought to ensure: @@ -76,7 +76,7 @@ We appreciate you putting in some effort and thought to ensure: ### Technical details of the documentation -We use [mkdocs](www.mkdocs.org) to produce the static website that documents combine. +We use [mkdocs](www.mkdocs.org) to produce the static website that documents Combine. The documentation files are all under the `docs/` folder. Which pages get included in the site, and other configuration details are set in the `mkdocs.yml` file. @@ -97,17 +97,17 @@ mkdocs serve from the main repository directory. 
mkdocs will then print a link you can open to check the page generated in your browser. -**NOTE:** mkdocs builds which use internal links (or images etc) with absolute paths will work for local deployment, but will break when deployed to the public documentations pages. -Please ensure you use relative paths. Currently, this is the only known feature where the behvaiour differs between local mkdocs and public pages deployment. +**NOTE:** mkdocs builds that use internal links (or images, etc.) with absolute paths will work for local deployment, but will break when deployed to the public documentations pages. +Please ensure you use relative paths. Currently, this is the only known feature where the behvaiour differs between local mkdocs and public page deployment. If you'd like to test the deployment directly, the suggested method is to set up a docs page using your personal github account; this should mimic the exact settings of the official page. ## Big Contributions -We welcome large contributions to combine. +We welcome large contributions to Combine. Note, however, that we also follow long term planning, and there is a dedicated group stewarding the overall direction and development of the code. This means that the code development should fit in with our long term vision; -if you have an idea for a big improvement or change it will be most efficient if you [contact us](mailto:cms-cat-stats-conveners@cern.ch) first, in order to ensure that we can integrate it as seemlessly as possible into our plans. +if you have an idea for a big improvement or change it will be most efficient if you [contact us](mailto:cms-cat-stats-conveners@cern.ch) first, in order to ensure that we can integrate it as seamlessly as possible into our plans. This will simplify any potential conflicts when you make your pull request. ## Requested Contributions diff --git a/docs/index.md b/docs/index.md index e21061d2aa4..da40bb43012 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,16 +2,16 @@ These pages document the [RooStats](https://twiki.cern.ch/twiki/bin/view/RooStats/WebHome) / -[RooFit](https://root.cern.ch/roofit) - based software tools used for -statistical analysis within the [Higgs PAG](HiggsWG) - **combine**. +[RooFit](https://root.cern.ch/roofit) - based software tool used for +statistical analysis within the CMS experiment - Combine. Note that while this tool was originally developed in the [Higgs PAG](HiggsWG), its usage is now widespread within CMS. -Combine provides a command line interface to many different statistical techniques available inside RooFit/RooStats used widely inside CMS. +Combine provides a command-line interface to many different statistical techniques, available inside RooFit/RooStats, that are used widely inside CMS. 
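As a first orientation, a typical use of the tool consists of passing it a datacard (or a workspace built from one) together with the name of a statistical method. A minimal sketch, assuming a datacard named `datacard.txt`, is shown below; the individual methods and their options are described in detail in the later parts of this documentation.

```sh
# Compute observed and expected 95% CL upper limits on the signal strength
# using the asymptotic CLs method (the datacard name is just an example)
combine -M AsymptoticLimits datacard.txt
```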
-The package exists in GIT under [https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit) +The package exists on GitHub under [https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit) -For more information about GIT and its usage in CMS, see [http://cms-sw.github.io/cmssw/faq.html](http://cms-sw.github.io/cmssw/faq.html) +For more information about Git, GitHub and its usage in CMS, see [http://cms-sw.github.io/cmssw/faq.html](http://cms-sw.github.io/cmssw/faq.html) -The code can be checked out from GIT and compiled on top of a CMSSW release that includes a recent RooFit/RooStats +The code can be checked out from GitHub and compiled on top of a CMSSW release that includes a recent RooFit/RooStats # Installation instructions @@ -22,7 +22,7 @@ Earlier versions are not guaranteed to follow the standard. ## Within CMSSW (recommended for CMS users) The instructions below are for installation within a CMSSW environment. For end -users that don't need to commit or do any development, the following recipes +users that do not need to commit or do any development, the following recipes should be sufficient. To choose a release version, you can find the latest releases on github under [https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/releases](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/releases) @@ -30,9 +30,9 @@ releases on github under ### Combine v9 - recommended version The nominal installation method is inside CMSSW. The current release targets -CMSSW `11_3_X` series because this release has both python2 and python3 ROOT -bindings, allowing a more gradual migration of user code to python3. Combine is -fully python3-compatible and can work also in 12_X releases. +the CMSSW `11_3_X` series because this release has both python2 and python3 ROOT +bindings, allowing a more gradual migration of user code to python3. Combine is +fully python3-compatible and, with some adaptations, can also work in 12_X releases. ```sh cmsrel CMSSW_11_3_4 @@ -99,11 +99,11 @@ scramv1 b clean; scramv1 b # always make a clean build ## Standalone compilation The standalone version can be easily compiled using -[cvmfs](https://cernvm.cern.ch/fs/) as it relies on dependencies which are +[cvmfs](https://cernvm.cern.ch/fs/) as it relies on dependencies that are already installed at `/cvmfs/cms.cern.ch/`. Access to `/cvmfs/cms.cern.ch/` can be obtained from lxplus machines or via `CernVM`. See [CernVM](CernVM.md) for further details on the latter. In case you do not want to use the `cvmfs` -area, you will need to adapt the location of the dependencies listed in both +area, you will need to adapt the locations of the dependencies listed in both the `Makefile` and `env_standalone.sh` files. ``` @@ -114,18 +114,20 @@ cd HiggsAnalysis/CombinedLimit/ make -j 4 ``` -You will need to source `env_standalone.sh` each time you want to use the package, or add it to your login. +You will need to source `env_standalone.sh` each time you want to use the package, or add it to your login environment. ### Standalone compilation with LCG -For compilation outside of CMSSW, for example to use ROOT versions not yet available in CMSSW, one can compile against LCG releases. The current default is to compile with LCG_102 which contains ROOT 6.26: +For compilation outside of CMSSW, for example to use ROOT versions not yet available in CMSSW, one can compile against LCG releases. 
The current default is to compile with LCG_102, which contains ROOT 6.26: ```sh git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit cd HiggsAnalysis/CombinedLimit source env_lcg.sh make LCG=1 -j 8 ``` -To change the LCG version, edit `env_lcg.sh`. The resulting binaries can be relocated (e.g. for use in a -batch job) if the following files are included in the job tarball: +To change the LCG version, edit `env_lcg.sh`. + +The resulting binaries can be moved for use in a +batch job if the following files are included in the job tarball: ```sh tar -zcf Combine_LCG_env.tar.gz build interface src/classes.h --exclude=obj ``` @@ -148,40 +150,42 @@ conda activate combine make CONDA=1 -j 8 ``` -Using combine from then on should only require sourcing the conda environment +Using Combine from then on should only require sourcing the conda environment ``` conda activate combine ``` -**Note:** on OS X, `combine` can only accept workspaces, so run `text2workspace.py` first. -This is due to some ridiculous issue with child processes and `LD_LIBRARY_PATH` (see note in Makefile) +**Note:** on OS X, Combine can only accept workspaces, so run `text2workspace.py` first. +This is due to an issue with child processes and `LD_LIBRARY_PATH` (see note in Makefile) # What has changed between tags? -You can generate a diff of any two tags (eg for `v7.0.8` and `v7.0.6`) by using following the url: +You can generate a diff of any two tags (eg for `v9.1.0` and `v9.0.0`) by using the following url: -[https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/compare/v7.0.6...v7.0.7](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/compare/v7.0.6...v7.0.7) +[https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/compare/v9.0.0...v9.1.0](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/compare/v9.0.0...v9.1.0) -Replace the tag names in the url to any tags you which to compare. +Replace the tag names in the url to any tags you would like to compare. # For developers -We use the _Fork and Pull_ model for development: each user creates a copy of the repository on github, commits their requests there and then sends pull requests for the administrators to merge. +We use the _Fork and Pull_ model for development: each user creates a copy of the repository on GitHub, commits their requests there, and then sends pull requests for the administrators to merge. _Prerequisites_ -1. Register on github, as needed anyway for CMSSW development: [http://cms-sw.github.io/cmssw/faq.html](http://cms-sw.github.io/cmssw/faq.html) +1. Register on GitHub, as needed anyway for CMSSW development: [http://cms-sw.github.io/cmssw/faq.html](http://cms-sw.github.io/cmssw/faq.html) + +2. Register your SSH key on GitHub: [https://help.github.com/articles/generating-ssh-keys](https://help.github.com/articles/generating-ssh-keys) -2. Register your SSH key on github: [https://help.github.com/articles/generating-ssh-keys](https://help.github.com/articles/generating-ssh-keys) 1 Fork the repository to create your copy of it: [https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/fork](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/fork) (more documentation at [https://help.github.com/articles/fork-a-repo](https://help.github.com/articles/fork-a-repo) ) +3. 
Fork the repository to create your copy of it: [https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/fork](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/fork) (more documentation at [https://help.github.com/articles/fork-a-repo](https://help.github.com/articles/fork-a-repo) ) You will now be able to browse your fork of the repository from [https://github.com/your-github-user-name/HiggsAnalysis-CombinedLimit](https://github.com/your-github-user-name/HiggsAnalysis-CombinedLimit) -We strongly encourage you to contribute any developments you make back into the main repository. +We strongly encourage you to contribute any developments you make back to the main repository. See [contributing.md](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/contributing.md) for details about contributing. -# Combine Tool +# CombineHarvester/CombineTools -An additional tool for submitting combine jobs to batch/crab, developed originally for HiggsToTauTau. Since the repository contains a certain amount of analysis-specific code, the following scripts can be used to clone it with a sparse checkout for just the core [`CombineHarvester/CombineTools`](https://github.com/cms-analysis/CombineHarvester/blob/master/CombineTools/) subpackage, speeding up the checkout and compile times: +CombineTools is an additional tool for submitting Combine jobs to batch systems or crab, which was originally developed in the context of Higgs to tau tau analyses. Since the repository contains a certain amount of analysis-specific code, the following scripts can be used to clone it with a sparse checkout for just the core [`CombineHarvester/CombineTools`](https://github.com/cms-analysis/CombineHarvester/blob/master/CombineTools/) subpackage, speeding up the checkout and compile times: git clone via ssh: diff --git a/docs/part2/bin-wise-stats.md b/docs/part2/bin-wise-stats.md index 15aaeba808e..10a02591bab 100644 --- a/docs/part2/bin-wise-stats.md +++ b/docs/part2/bin-wise-stats.md @@ -1,9 +1,9 @@ # Automatic statistical uncertainties ## Introduction -The `text2workspace.py` script is now able to produce a new type of workspace in which bin-wise statistical uncertainties are added automatically. This can be built for shape-based datacards where the inputs are in TH1 format. Datacards that use RooDataHists are not supported. The bin errrors (i.e. values returned by TH1::GetBinError) are used to model the uncertainties. +The `text2workspace.py` script is able to produce a type of workspace, using a set of new histogram classes, in which bin-wise statistical uncertainties are added automatically. This can be built for shape-based datacards where the inputs are in TH1 format. Datacards that use RooDataHists are not supported. The bin errrors (i.e. values returned by `TH1::GetBinError`) are used to model the uncertainties. -By default the script will attempt to assign a single nuisance parameter to scale the sum of the process yields in each bin, constrained by the total uncertainty, instead of requiring separate parameters, one per process. This is sometimes referred to as the [Barlow-Beeston](http://inspirehep.net/record/35053)-lite approach, and is useful as it minimises the number of parameters required in the maximum-likelihood fit. A useful description of this approach may be found in section 5 of [this report](https://arxiv.org/pdf/1103.0354.pdf). 
+By default the script will attempt to assign a single nuisance parameter to scale the sum of the process yields in each bin, constrained by the total uncertainty, instead of requiring separate parameters, one per process. This is sometimes referred to as the [Barlow-Beeston](http://inspirehep.net/record/35053)-lite approach, and is useful as it minimises the number of parameters required in the maximum likelihood fit. A useful description of this approach may be found in section 5 of [this report](https://arxiv.org/pdf/1103.0354.pdf). ## Usage instructions @@ -15,22 +15,22 @@ The following line should be added at the bottom of the datacard, underneath the The first string `channel` should give the name of the channels (bins) in the datacard for which the new histogram classes should be used. The wildcard `*` is supported for selecting multiple channels in one go. The value of `threshold` should be set to a value greater than or equal to zero to enable the creation of automatic bin-wise uncertainties, or `-1` to use the new histogram classes without these uncertainties. A positive value sets the threshold on the effective number of unweighted events above which the uncertainty will be modeled with the Barlow-Beeston-lite approach described above. Below the threshold an individual uncertainty per-process will be created. The algorithm is described in more detail below. -The last two settings are optional. The first of these, `include-signal` has a default value of `0` but can be set to `1` as an alternative. By default the total nominal yield and uncertainty used to test the threshold excludes signal processes, as typically the initial signal normalisation is arbitrary, and could unduly lead to a bin being considered well-populated despite poorly populated background templates. Setting this flag will include the signal processes in the uncertainty analysis. Note that this option only affects the logic for creating a single Barlow-Beeston-lite parameter vs. separate per-process parameters - the uncertainties on all signal processes are always included in the actual model! The second flag changes the way the normalisation effect of shape-altering uncertainties is handled. In the default mode (`1`) the normalisation is handled separately from the shape morphing via a an asymmetric log-normal term. This is identical to how combine has always handled shape morphing. When set to `2`, the normalisation will be adjusted in the shape morphing directly. Unless there is a strong motivation we encourage users to leave this on the default setting. +The last two settings are optional. The first of these, `include-signal` has a default value of `0` but can be set to `1` as an alternative. By default, the total nominal yield and uncertainty used to test the threshold excludes signal processes. The reason for this is that typically the initial signal normalization is arbitrary, and could unduly lead to a bin being considered well-populated despite poorly populated background templates. Setting this flag will include the signal processes in the uncertainty analysis. Note that this option only affects the logic for creating a single Barlow-Beeston-lite parameter vs. separate per-process parameters - the uncertainties on all signal processes are always included in the actual model! The second flag changes the way the normalization effect of shape-altering uncertainties is handled. In the default mode (`1`) the normalization is handled separately from the shape morphing via a an asymmetric log-normal term. 
This is identical to how Combine has always handled shape morphing. When set to `2`, the normalization will be adjusted in the shape morphing directly. Unless there is a strong motivation we encourage users to leave this on the default setting. ## Description of the algorithm When `threshold` is set to a number of effective unweighted events greater than or equal to zero, denoted $n^{\text{threshold}}$, the following algorithm is applied to each bin: - 1. Sum the yields $n_{i}$ and uncertainities $e_{i}$ of each background process $i$ in the bin. Note that the $n_i$ and $e_i$ include the nominal effect of any scaling parameters that have been set in the datacard, for example [`rateParams`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/settinguptheanalysis/#rate-parameters). + 1. Sum the yields $n_{i}$ and uncertainties $e_{i}$ of each background process $i$ in the bin. Note that the $n_i$ and $e_i$ include the nominal effect of any scaling parameters that have been set in the datacard, for example [`rateParams`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/settinguptheanalysis/#rate-parameters). $n_{\text{tot}} = \sum_{i\,\in\,\text{bkg}}n_i$, $e_{\text{tot}} = \sqrt{\sum_{i\,\in\,\text{bkg}}e_i^{2}}$ - 2. If $e_{\text{tot}} = 0$, the bin is skipped and no parameters are created. (Though you might want to check why there is no uncertainty on the background prediction in this bin!) + 2. If $e_{\text{tot}} = 0$, the bin is skipped and no parameters are created. If this is the case, it is a good idea to check why there is no uncertainty in the background prediction in this bin! 3. The effective number of unweighted events is defined as $n_{\text{tot}}^{\text{eff}} = n_{\text{tot}}^{2} / e_{\text{tot}}^{2}$, rounded to the nearest integer. 4. If $n_{\text{tot}}^{\text{eff}} \leq n^{\text{threshold}}$: separate uncertainties will be created for each process. Processes where $e_{i} = 0$ are skipped. If the number of effective events for a given process is lower than $n^{\text{threshold}}$ a Poisson-constrained parameter will be created. Otherwise a Gaussian-constrained parameter is used. 5. If $n_{\text{tot}}^{\text{eff}} \gt n^{\text{threshold}}$: A single Gaussian-constrained Barlow-Beeston-lite parameter is created that will scale the total yield in the bin. - 6. Note that the values of $e_{i}$ and therefore $e_{tot}$ will be updated automatically in the model whenever the process normalisations change. + 6. Note that the values of $e_{i}$, and therefore $e_{tot}$, will be updated automatically in the model whenever the process normalizations change. 7. A Gaussian-constrained parameter $x$ has a nominal value of zero and scales the yield as $n_{\text{tot}} + x \cdot e_{\text{tot}}$. The Poisson-constrained parameters are expressed as a yield multiplier with nominal value one: $n_{\text{tot}} \cdot x$. -The output from `text2workspace.py` will give details on how each bin has been treated by this alogorithm, for example: +The output from `text2workspace.py` will give details on how each bin has been treated by this algorithm, for example:
Show example output @@ -68,9 +68,9 @@ Bin Contents Error Notes
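As a small worked example of the threshold logic above (the numbers are purely illustrative): a bin with total background yield $n_{\text{tot}} = 10$ and uncertainty $e_{\text{tot}} = 2$ gives

$$
n_{\text{tot}}^{\text{eff}} = n_{\text{tot}}^{2} / e_{\text{tot}}^{2} = 100/4 = 25,
$$

so with a threshold of $n^{\text{threshold}} = 10$ this bin would be assigned a single Barlow-Beeston-lite parameter. A bin with the same yield but $e_{\text{tot}} = 4$ would instead give $n_{\text{tot}}^{\text{eff}} \approx 6$, falling below the threshold, and separate per-process parameters would be created.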
## Analytic minimisation -One significant advantage of the Barlow-Beeston-lite approach is that the maximum likelihood estimate of each nuisance parameter has a simple analytic form that depends only on $n_{\text{tot}}$, $e_{\text{tot}}$ and the observed number of data events in the relevant bin. Therefore when minimising the negative log-likelihood of the whole model it is possible to remove these parameters from the fit and set them to their best-fit values automatically. For models with large numbers of bins this can reduce the fit time and increase the fit stability. The analytic minimisation is enabled by default starting in combine v8.2.0, you can disable it by adding the option `--X-rtd MINIMIZER_no_analytic` when running combine. +One significant advantage of the Barlow-Beeston-lite approach is that the maximum likelihood estimate of each nuisance parameter has a simple analytic form that depends only on $n_{\text{tot}}$, $e_{\text{tot}}$ and the observed number of data events in the relevant bin. Therefore when minimising the negative log-likelihood of the whole model it is possible to remove these parameters from the fit and set them to their best-fit values automatically. For models with large numbers of bins this can reduce the fit time and increase the fit stability. The analytic minimisation is enabled by default starting in combine v8.2.0, you can disable it by adding the option `--X-rtd MINIMIZER_no_analytic` when running Combine. -The figure below shows a performance comparison of the analytical minimization versus the number of bins in the likelihood function. The real time (in sections) for a typical minimisation of a binned likelihood is shown as a function of the number of bins when invoking the analytic minimisation of the nuisance parameters versus the default numerical approach. +The figure below shows a performance comparison of the analytical minimisation versus the number of bins in the likelihood function. The real time (in sections) for a typical minimisation of a binned likelihood is shown as a function of the number of bins when invoking the analytic minimisation of the nuisance parameters versus the default numerical approach. /// details | **Show Comparison** @@ -81,10 +81,10 @@ The figure below shows a performance comparison of the analytical minimization v ## Technical details -Up until recently `text2workspace.py` would only construct the PDF for each channel using a `RooAddPdf`, i.e. each component process is represented by a separate PDF and normalisation coefficient. However in order to model bin-wise statistical uncertainties the alternative `RooRealSumPdf` can be more useful, as each process is represented by a RooFit function object instead of a PDF, and we can vary the bin yields directly. As such, a new RooFit histogram class `CMSHistFunc` is introduced, which offers the same vertical template morphing algorithms offered by the current default histogram PDF, `FastVerticalInterpHistPdf2`. Accompanying this is the `CMSHistErrorPropagator` class. This evaluates a sum of `CMSHistFunc` objects, each multiplied by a coefficient. It is also able to scale the summed yield of each bin to account for bin-wise statistical uncertainty nuisance parameters. +Up until recently `text2workspace.py` would only construct the PDF for each channel using a `RooAddPdf`, i.e. each component process is represented by a separate PDF and normalization coefficient. 
However, in order to model bin-wise statistical uncertainties, the alternative `RooRealSumPdf` can be more useful, as each process is represented by a RooFit function object instead of a PDF, and we can vary the bin yields directly. As such, a new RooFit histogram class `CMSHistFunc` is introduced, which offers the same vertical template morphing algorithms offered by the current default histogram PDF, `FastVerticalInterpHistPdf2`. Accompanying this is the `CMSHistErrorPropagator` class. This evaluates a sum of `CMSHistFunc` objects, each multiplied by a coefficient. It is also able to scale the summed yield of each bin to account for bin-wise statistical uncertainty nuisance parameters. !!! warning - One disadvantage of this new approach comes when evaluating the expectation for individual processes, for example when using the `--saveShapes` option in the `FitDiagnostics` mode of combine. The Barlow-Beeston-lite parameters scale the sum of the process yields directly, so extra work is needed in the distribution this total scaling back to each individual process. To achieve this an additional class `CMSHistFuncWrapper` has been created that, given a particular `CMSHistFunc`, the `CMSHistErrorPropagator` will distribute an appropriate fraction of the total yield shift to each bin. As a consequence of the extra computation needed to distribute the yield shifts in this way the evaluation of individual process shapes in `--saveShapes` can take longer then previously. + One disadvantage of this new approach comes when evaluating the expectation for individual processes, for example when using the `--saveShapes` option in the `FitDiagnostics` mode of Combine. The Barlow-Beeston-lite parameters scale the sum of the process yields directly, so extra work is needed to distribute this total scaling back to each individual process. To achieve this, an additional class `CMSHistFuncWrapper` has been created that, given a particular `CMSHistFunc`, the `CMSHistErrorPropagator` will distribute an appropriate fraction of the total yield shift to each bin. As a consequence of the extra computation needed to distribute the yield shifts in this way, the evaluation of individual process shapes in `--saveShapes` can take longer then previously. diff --git a/docs/part2/physicsmodels.md b/docs/part2/physicsmodels.md index e7c3c54c1d8..d779aaf4adb 100644 --- a/docs/part2/physicsmodels.md +++ b/docs/part2/physicsmodels.md @@ -1,6 +1,6 @@ # Physics Models -Combine can be run directly on the text based datacard. However, for more advanced physics models, the internal step to convert the datacard to a binary workspace can be performed by the user. To create a binary workspace starting from a `datacard.txt`, just do +Combine can be run directly on the text-based datacard. However, for more advanced physics models, the internal step to convert the datacard to a binary workspace should be performed by the user. To create a binary workspace starting from a `datacard.txt`, you can run ```sh text2workspace.py datacard.txt -o workspace.root @@ -8,12 +8,12 @@ text2workspace.py datacard.txt -o workspace.root By default (without the `-o` option), the binary workspace will be named `datacard.root` - i.e the **.txt** suffix will be replaced by **.root**. -A full set of options for `text2workspace` can be found by using `--help`. +A full set of options for `text2workspace` can be found by running `text2workspace.py --help`. 
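Once a workspace has been produced, its contents can be inspected interactively in ROOT. A minimal sketch is shown below; it assumes the workspace was written to `workspace.root` as in the example above, and that the workspace object inside the file carries the default name `w`.

```sh
# Open the workspace file with ROOT
root -l workspace.root
# then, at the ROOT prompt:
#   RooWorkspace *w = (RooWorkspace*)_file0->Get("w");
#   w->Print();
```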
-The default model which will be produced when running `text2workspace` is one in which all processes identified as signal are multiplied by a common multiplier **r**. This is all that is needed for simply setting limits or calculating significances. +The default model that will be produced when running `text2workspace` is one in which all processes identified as signal are multiplied by a common multiplier **r**. This is all that is needed for simply setting limits or calculating significances. -`text2workspace` will convert the datacard into a pdf which summaries the analysis. -For example, lets take a look at the [data/tutorials/counting/simple-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/simple-counting-experiment.txt) datacard. +`text2workspace` will convert the datacard into a PDF that summarizes the analysis. +For example, let's take a look at the [data/tutorials/counting/simple-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/simple-counting-experiment.txt) datacard. ```nohighlight # Simple counting experiment, with one signal and one background process @@ -40,9 +40,9 @@ deltaS lnN 1.20 - 20% uncertainty on signal deltaB lnN - 1.50 50% uncertainty on background ``` -If we run `text2workspace.py` on this datacard and take a look at the workspace (`w`) inside the `.root` file produced, we will find a number of different objects representing the signal, background and observed event rates as well as the nuisance parameters and signal strength **r**. +If we run `text2workspace.py` on this datacard and take a look at the workspace (`w`) inside the `.root` file produced, we will find a number of different objects representing the signal, background, and observed event rates, as well as the nuisance parameters and signal strength **r**. -From these objects, the necessary pdf has been constructed (named `model_s`). For this counting experiment we will expect a simple pdf of the form +From these objects, the necessary PDF has been constructed (named `model_s`). For this counting experiment we will expect a simple PDF of the form $$ p(n_{\mathrm{obs}}| r,\delta_{S},\delta_{B})\propto @@ -54,9 +54,9 @@ $$ where the expected signal and background rates are expressed as functions of the nuisance parameters, $n_{S}(\delta_{S}) = 4.76(1+0.2)^{\delta_{S}}~$ and $~n_{B}(\delta_{B}) = 1.47(1+0.5)^{\delta_{B}}$. -The first term represents the usual Poisson expression for observing $n_{\mathrm{obs}}$ events while the second two are the Gaussian constraint terms for the nuisance parameters. In this case ${\delta^{\mathrm{In}}_S}={\delta^{\mathrm{In}}_B}=0$, and the widths of both Gaussians are 1. +The first term represents the usual Poisson expression for observing $n_{\mathrm{obs}}$ events, while the second two are the Gaussian constraint terms for the nuisance parameters. In this case ${\delta^{\mathrm{In}}_S}={\delta^{\mathrm{In}}_B}=0$, and the widths of both Gaussians are 1. -A combination of counting experiments (or a binned shape datacard) will look like a product of pdfs of this kind. For a parametric/unbinned analyses, the pdf for each process in each channel is provided instead of the using the Poisson terms and a product is over the bin counts/events. +A combination of counting experiments (or a binned shape datacard) will look like a product of PDFs of this kind. 
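Schematically, and using the same notation as above with an additional channel index $c$, such a product has the structure

$$
p(\mathrm{data}|r,\vec{\delta}) \propto
\prod_{c}\mathrm{Poisson}\left(n_{\mathrm{obs},c}\,|\,r\,n_{S,c}(\vec{\delta})+n_{B,c}(\vec{\delta})\right)\cdot\prod_{k} p_{k}(\delta_{k}),
$$

where $\vec{\delta}$ collectively denotes the nuisance parameters and $p_{k}$ their constraint terms. This is only an illustrative sketch of the structure, not the exact expression that is built internally.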
For parametric/unbinned analyses, the PDF for each process in each channel is provided instead of the using the Poisson terms and a product runs over the bin counts/events. ## Model building @@ -68,13 +68,13 @@ text2workspace.py datacard -P HiggsAnalysis.CombinedLimit.PythonFile:modelName Generic models can be implemented by writing a python class that: -- defines the model parameters (by default it's just the signal strength modifier **`r`**) -- defines how signal and background yields depend on the parameters (by default, signal scale linearly with **`r`**, backgrounds are constant) -- potentially also modifies the systematics (e.g. switch off theory uncertainties on cross section when measuring the cross section itself) +- defines the model parameters (by default it is just the signal strength modifier **`r`**) +- defines how signal and background yields depend on the parameters (by default, the signal scales linearly with **`r`**, backgrounds are constant) +- potentially also modifies the systematic uncertainties (e.g. switch off theory uncertainties on cross section when measuring the cross section itself) -In the case of SM-like Higgs searches the class should inherit from **`SMLikeHiggsModel`** (redefining **`getHiggsSignalYieldScale`**), while beyond that one can inherit from **`PhysicsModel`**. You can find some examples in [PhysicsModel.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/PhysicsModel.py). +In the case of SM-like Higgs boson measurements, the class should inherit from **`SMLikeHiggsModel`** (redefining **`getHiggsSignalYieldScale`**), while beyond that one can inherit from **`PhysicsModel`**. You can find some examples in [PhysicsModel.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/PhysicsModel.py). -In the 4-process model (`PhysicsModel:floatingXSHiggs`, you will see that each of the 4 dominant Higgs production modes get separate scaling parameters, **`r_ggH`**, **`r_qqH`**, **`r_ttH`** and **`r_VH`** (or **`r_ZH`** and **`r_WH`**) as defined in, +In the 4-process model (`PhysicsModel:floatingXSHiggs`, you will see that each of the 4 dominant Higgs boson production modes get separate scaling parameters, **`r_ggH`**, **`r_qqH`**, **`r_ttH`** and **`r_VH`** (or **`r_ZH`** and **`r_WH`**) as defined in, ```python def doParametersOfInterest(self): @@ -106,27 +106,27 @@ You should note that `text2workspace` will look for the python module in `PYTHON A number of models used in the LHC Higgs combination paper can be found in [LHCHCGModels.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/LHCHCGModels.py). These can be easily accessed by providing for example `-P HiggsAnalysis.CombinedLimit.HiggsCouplings:c7` and others defined un [HiggsCouplings.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/HiggsCouplings.py). -Below are some (more generic) example models which also exist in gitHub. +Below are some (more generic) example models that also exist in GitHub. ### MultiSignalModel ready made model for multiple signal processes -Combine already contains a model **`HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel`** that can be used to assign different signal strengths to multiple processes in a datacard, configurable from the command line. 
+Combine already contains a model **`HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel`** that can be used to assign different signal strengths to multiple processes in a datacard, configurable from the command line. -The model is configured passing to text2workspace one or more mappings in the form **`--PO 'map=bin/process:parameter'`** +The model is configured by passing one or more mappings in the form **`--PO 'map=bin/process:parameter'`** to text2workspace: -- **`bin`** and **`process`** can be arbitrary regular expressions matching the bin names and process names in the datacard +- **`bin`** and **`process`** can be arbitrary regular expressions matching the bin names and process names in the datacard. Note that mappings are applied both to signals and to background processes; if a line matches multiple mappings, precedence is given to the last one in the order they are in the command line. - it is suggested to put quotes around the argument of **`--PO`** so that the shell does not try to expand any **`*`** signs in the patterns. -- **`parameter`** is the POI to use to scale that process (`name[starting_value,min,max]` the first time a parameter is defined, then just `name` if used more than once) - Special values are **`1`** and **`0==; ==0`** means to drop the process completely from the card, while **`1`** means to keep the yield as is in the card with no scaling (as normally done for backgrounds); **`1`** is the default that is applied to processes that have no mappings, so it's normally not needed, but it may be used either to make the thing explicit, or to override a previous more generic match on the same command line (e.g. `--PO 'map=.*/ggH:r[1,0,5]' --PO 'map=bin37/ggH:1'` would treat ggH as signal in general, but count it as background in the channel `bin37`) + It is suggested to put quotes around the argument of **`--PO`** so that the shell does not try to expand any **`*`** signs in the patterns. +- **`parameter`** is the POI to use to scale that process (`name[starting_value,min,max]` the first time a parameter is defined, then just `name` if used more than once). + Special values are **`1`** and **`0==; ==0`** means "drop the process completely from the model", while **`1`** means to "keep the yield as is in the card with no scaling" (as normally done for backgrounds); **`1`** is the default that is applied to processes that have no mappings. Therefore it is normally not needed, but it may be used to override a previous more generic match in the same command line (e.g. `--PO 'map=.*/ggH:r[1,0,5]' --PO 'map=bin37/ggH:1'` would treat ggH as signal in general, but count it as background in the channel `bin37`). -Passing the additional option **`--PO verbose`** will set the code to verbose mode, printing out the scaling factors for each process; people are encouraged to use this option to make sure that the processes are being scaled correctly. +Passing the additional option **`--PO verbose`** will set the code to verbose mode, printing out the scaling factors for each process; we encourage the use this option to make sure that the processes are being scaled correctly. -The MultiSignalModel will define all parameters as parameters of interest, but that can be then changed from the command line of combine, as described in the following sub-section. +The MultiSignalModel will define all parameters as parameters of interest, but that can be then changed from the command line, as described in the following subsection. 
Some examples, taking as reference the toy datacard [test/multiDim/toy-hgg-125.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/multiDim/toy-hgg-125.txt): -- Scale both `ggH` and `qqH` with the same signal strength `r` (that's what the default physics model of combine does for all signals; if they all have the same systematic uncertainties, it is also equivalent to adding up their yields and writing them as a single column in the card) +- Scale both `ggH` and `qqH` with the same signal strength `r` (that is what the default physics model of Combine does for all signals; if they all have the same systematic uncertainties, it is also equivalent to adding up their yields and writing them as a single column in the card) ```nohighlight $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/ggH:r[1,0,10]' --PO 'map=.*/qqH:r' toy-hgg-125.txt -o toy-1d.root @@ -197,7 +197,7 @@ Some examples, taking as reference the toy datacard [test/multiDim/toy-hgg-125.t ### Two Hypothesis testing -The `PhysicsModel` that encodes the signal model above is the [twoHypothesisHiggs](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/HiggsJPC.py), which assumes that there will exist signal processes with suffix **_ALT** in the datacard. An example of such a datacard can be found under [data/benchmarks/simple-counting/twoSignals-3bin-bigBSyst.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/simple-counting/twoSignals-3bin-bigBSyst.txt) +The `PhysicsModel` that encodes the signal model above is the [twoHypothesisHiggs](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/HiggsJPC.py), which assumes signal processes with suffix **_ALT** will exist in the datacard. An example of such a datacard can be found under [data/benchmarks/simple-counting/twoSignals-3bin-bigBSyst.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/simple-counting/twoSignals-3bin-bigBSyst.txt) ```nohighlight $ text2workspace.py twoSignals-3bin-bigBSyst.txt -P HiggsAnalysis.CombinedLimit.HiggsJPC:twoHypothesisHiggs -m 125.7 --PO verbose -o jcp_hww.root @@ -211,11 +211,11 @@ The `PhysicsModel` that encodes the signal model above is the [twoHypothesisHigg Process S_ALT will get norm x ``` -The two processes (S and S_ALT) will get different scaling parameters. The LEP-style likelihood for hypothesis testing can now be performed by setting **x** or **not_x** to 1 and 0 and comparing two likelihood evaluations. +The two processes (S and S_ALT) will get different scaling parameters. The LEP-style likelihood for hypothesis testing can now be used by setting **x** or **not_x** to 1 and 0 and comparing the two likelihood evaluations. ### Signal-background interference -Since there are no such things as negative probability distribution functions, the recommended way to implement this is to start from the expression for the individual amplitudes $A$ and the parameter of interest $k$, +Since negative probability distribution functions do not exist, the recommended way to implement this is to start from the expression for the individual amplitudes $A$ and the parameter of interest $k$, $$ \mathrm{Yield} = |k * A_{s} + A_{b}|^2 @@ -330,7 +330,7 @@ is helpful for extracting the lower triangle of a square matrix. You could pick any nominal template, and adjust the scaling as appropriate. 
Generally it is advisable to use a nominal template corresponding to near where you expect the -POIs to land so that the shape systematic effects are well-modeled in that +best-fit values of the POIs to be so that the shape systematic effects are well-modeled in that region. It may be the case that the relative contributions of the terms are themselves diff --git a/docs/part2/settinguptheanalysis.md b/docs/part2/settinguptheanalysis.md index 48cc5a46d64..7a695f23e91 100644 --- a/docs/part2/settinguptheanalysis.md +++ b/docs/part2/settinguptheanalysis.md @@ -1,41 +1,41 @@ # Preparing the datacard -The input to combine, which defines the details of the experiment, is a datacard file is a plain ASCII file. This is true whether the experiment is a simple counting experiment or a shape analysis. +The input to Combine, which defines the details of the analysis, is a plain ASCII file we will refer to as datacard. This is true whether the analysis is a simple counting experiment or a shape analysis. ## A simple counting experiment The file [data/tutorials/counting/realistic-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/realistic-counting-experiment.txt) shows an example of a counting experiment. -The first lines can be used as a description and are not parsed by the program. They have to begin with a "#": +The first lines can be used to add some descriptive information. Those lines must start with a "#", and they are not parsed by Combine: ```nohighlight # Simple counting experiment, with one signal and a few background processes # Simplified version of the 35/pb H->WW analysis for mH = 160 GeV ``` -Then one declares the **number of observables**, **`imax`**, that are present in the model used to calculate limits/significances. The number of observables will typically be the number of channels in a counting experiment or the number of bins in a binned shape fit. (If one specifies for **`imax`** the value **`*`** it means "figure it out from the rest of the datacard", but in order to better catch mistakes it's recommended to specify it explicitly) +Following this, one declares the **number of observables**, **`imax`**, that are present in the model used to set limits / extract confidence intervals. The number of observables will typically be the number of channels in a counting experiment. The value **`*`** can be specified for **`imax`**, which tells Combine to determine the number of observables from the rest of the datacard. In order to better catch mistakes, it is recommended to explicitly specify the value. ```nohighlight imax 1 number of channels ``` -Then one declares the number of background sources to be considered, **`jmax`**, and the number of **independent sources of systematic uncertainties**, **`kmax`**: +This declaration is followed by a specification of the number of background sources to be considered, **`jmax`**, and the number of **independent sources of systematic uncertainty**, **`kmax`**: ```nohighlight jmax 3 number of backgrounds -kmax 5 number of nuisance parameters (sources of systematic uncertainties) +kmax 5 number of nuisance parameters (sources of systematic uncertainty) ``` In the example there is 1 channel, there are 3 background sources, and there are 5 independent sources of systematic uncertainty. -Then there are the lines describing what is actually observed: the number of events observed in each channel. The first line, starting with **`bin`** defines the label used for each channel. 
In the example we have 1 channel, labelled **`1`**, and in the following line, **`observation`**, are listed the events observed, **`0`** in this example: +After providing this information, the following lines describe what is observed in data: the number of events observed in each channel. The first line, starting with **`bin`**, defines the label used for each channel. In the example we have 1 channel, labelled **`1`**, and in the following line, **`observation`**, the number of observed events is given: **`0`** in this example. ```nohighlight # we have just one channel, in which we observe 0 events bin bin1 observation 0 ``` -Following is the part related to the number of events expected, for each bin and process, arranged in (#channels)*(#processes) columns. +This is followed by information related to the expected number of events, for each bin and process, arranged in (#channels)*(#processes) columns. ```nohighlight bin bin1 bin1 bin1 bin1 @@ -44,14 +44,14 @@ process 0 1 2 3 rate 1.47 0.63 0.06 0.22 ``` -- The **`bin`** line identifies the channel the column is referring to. It goes from **`1`** to the **`imax`** declared above. -- The first **`process`** line contains the labels of the various sources -- The second **`process`** line must have a positive number for backgrounds, and **`0`** or a negative number for the signals. You should use different process ids for different processes. -- The last line, **`rate`**, tells the expected yield of events in the specified bin and process +- The **`bin`** line identifies the channel that the column refers to. It ranges from **`1`** to the value of **`imax`** declared above. +- The first **`process`** line contains the names of the various process sources +- The second **`process`** line is a numerical process identifier. Backgrounds are given a positive number, while **`0`** and negative numbers are used for signal processes. Different process identifiers must be used for different processes. +- The last line, **`rate`**, gives the expected number of events for the given process in the specified bin -All bins should be declared in increasing order, and within each bin one should include all processes in increasing order, specifying a **0** for processes that do not contribute. +If a process does not contribute in a given bin, it can be removed from the datacard, or the rate can be set to **0**. -The last section contains the description of the systematic uncertainties: +The final section of the datacard describes the systematic uncertainties: ```nohighlight lumi lnN 1.11 - 1.11 - lumi affects both signal and gg->WW (mc-driven). lnN = lognormal @@ -61,68 +61,68 @@ xs_ggWW lnN - - 1.50 - 50% uncertainty on gg->WW cross section bg_others lnN - - - 1.30 30% uncertainty on the rest of the backgrounds ``` -- the first columns is a label identifying the uncertainty -- the second column identifies the type of distribution used +- The first column is the name of the nuisance parameter, a label that is used to identify the uncertainty +- The second column identifies the type of distribution used to describe the nuisance parameter - **`lnN`** stands for [Log-normal](http://en.wikipedia.org/wiki/Log-normal_distribution), which is the recommended choice for multiplicative corrections (efficiencies, cross sections, ...). - If **Δx/x** is the relative uncertainty on the multiplicative correction, one should put **1+Δx/x** in the column corresponding to the process and channel. 
Asymmetric log-normals are instead supported by providing κdownup where κdown is the ratio of the the yield to the nominal value for a -1σ deviation of the nuisance and κup is the ratio of thyield to the nominal value for $+1\sigma$ deviation. Note that for single-value log-normal with value $\kappa=1+\Delta x/x$, the yield of the process it is associated with is multiplied by $\kappa^{\theta}$. At $\theta=0$ the nominal yield is retained, at $\theta=1\sigma$ the yield is multiplied by $\kappa$ and at $\theta=-1\sigma$ the yield is multiplied by $1/\kappa$. This means that an uncertainty represented as 1.2 does not multiply the nominal yield by 0.8 for $\theta=-1\sigma$; but by 0.8333. For large uncertainties that have a symmetric effect on the yield it may therefore be desirable to encode them as asymmetric log-normals instead. - - **`gmN`** stands for [Gamma](http://en.wikipedia.org/wiki/Gamma_distribution), and is the recommended choice for the statistical uncertainty on a background coming from the number of events in a control region (or in a MC sample with limited statistics). - If the control region or MC contains **N** events, and the extrapolation factor from the control region to the signal region is **α** then one shoud put **N** just after the **`gmN`** keyword, and then the value of **α** in the proper column. Also, the yield in the **`rate`** row should match with **Nα** - - **`lnU`** stands for log-uniform distribution. A value of **1+ε** in the column will imply that the yield of this background is allowed to float freely between **x(1+ε)** and **x/(1+ε)** (in particular, if ε is small, then this is approximately **(x-Δx,x+Δx)** with **ε=Δx/x** ) - This is normally useful when you want to set a large a-priori uncertainty on a given background and then rely on the correlation between channels to constrain it. Beware that while Gaussian-like uncertainties behave in a similar way under profiling and marginalization, uniform uncertainties do not, so the impact of the uncertainty on the result will depend on how the nuisances are treated. -- then there are (#channels)*(#processes) columns reporting the relative effect of the systematic uncertainty on the rate of each process in each channel. The columns are aligned with the ones in the previous lines declaring bins, processes and rates. + If **Δx/x** is the relative uncertainty in the multiplicative correction, one should put **1+Δx/x** in the column corresponding to the process and channel. Asymmetric log-normals are instead supported by providing κdownup where κdown is the ratio of the the yield to the nominal value for a -1σ deviation of the nuisance parameter and κup is the ratio of the yield to the nominal value for a $+1\sigma$ deviation. Note that for a single-value log-normal with value $\kappa=1+\Delta x/x$, the yield of the process it is associated with is multiplied by $\kappa^{\theta}$. At $\theta=0$ the nominal yield is retained, at $\theta=1\sigma$ the yield is multiplied by $\kappa$ and at $\theta=-1\sigma$ the yield is multiplied by $1/\kappa$. This means that an uncertainty represented as 1.2 does not multiply the nominal yield by 0.8 for $\theta=-1\sigma$; but by 0.8333. It may therefore be desirable to encode large uncertainties that have a symmetric effect on the yield as asymmetric log-normals instead. 
+ - **`gmN`** stands for [Gamma](http://en.wikipedia.org/wiki/Gamma_distribution), and is the recommended choice for the statistical uncertainty in a background determined from the number of events in a control region (or in an MC sample with limited sample size). + If the control region or simulated sample contains **N** events, and the extrapolation factor from the control region to the signal region is **α**, one shoud put **N** just after the **`gmN`** keyword, and then the value of **α** in the relevant (bin,process) column. The yield specified in the **`rate`** line for this (bin,process) combination should equal **Nα**. + - **`lnU`** stands for log-uniform distribution. A value of **1+ε** in the column will imply that the yield of this background is allowed to float freely between **x(1+ε)** and **x/(1+ε)**. In particular, if ε is small, this is approximately **(x-Δx,x+Δx)** with **ε=Δx/x**. + This distribution is typically useful when you want to set a large a-priori uncertainty on a given background process, and then rely on the correlation between channels to constrain it. Note that for this use case, we usually recommend using [a `rateParam`](#rate-parameters) instead. If you do use **`lnU`**, please be aware that while Gaussian-like uncertainties behave in a similar way under profiling and marginalization, uniform uncertainties do not. This means the impact of the uncertainty on the result will depend on how the nuisance parameters are treated. +- The next (#channels)*(#processes) columns indicate the relative effect of the systematic uncertainty on the rate of each process in each channel. The columns are aligned with those in the previous lines declaring bins, processes, and rates. In the example, there are 5 uncertainties: -- the first uncertainty affects the signal by 11%, and affects the **`ggWW`** process by 11% -- the second uncertainty affects the signal by 16% leaving the backgrounds unaffected -- the third line specifies that the **`qqWW`** background comes from a sideband with 4 observed events and an extrapolation factor of 0.16; the resulting uncertainty on the expected yield is $1/\sqrt{4+1}$ = 45% -- the fourth uncertainty does not affect the signal, affects the **`ggWW`** background by 50%, leaving the other backgrounds unaffected -- the last uncertainty does not affect the signal, affects by 30% the **`others`** backgrounds, leaving the rest of the backgrounds unaffected +- The first uncertainty has an 11% effect on the signal and on the **`ggWW`** process. +- The second uncertainty affects the signal by 16%, but leaves the background processes unaffected +- The third line specifies that the **`qqWW`** background comes from a sideband with 4 observed events and an extrapolation factor of 0.16; the resulting uncertainty in the expected yield is $1/\sqrt{4+1}$ = 45% +- The fourth uncertainty does not affect the signal, has a 50% effect on the **`ggWW`** background, and leaves the other backgrounds unaffected +- The fifth uncertainty does not affect the signal, has a 30% effect on the **`others`** background process, and does not affect the remaining backgrounds. -## Shape analysis +## Shape analyses The datacard has to be supplemented with two extensions: - * A new block of lines defining how channels and processes are mapped into shapes - * The block for systematics that can contain also rows with shape uncertainties. + * A new block of lines defining how channels and processes are mapped into shapes. 
+ * The block for systematics can now also contain rows with shape uncertainties.

-The expected shape can be parametric or not parametric. In the first case the parametric pdfs have to be given as input to the tool. In the latter case, for each channel, histograms have to be provided for the expected shape of each process. For what concerns data, they have to be provided as input to the tool as a histogram to perform a ***binned*** shape analysis and as a RooDataSet to perform an ***unbinned*** shape analysis.
+The expected shape can be parametric, or not. In the first case the parametric PDFs have to be given as input to the tool. In the latter case, for each channel, histograms have to be provided for the expected shape of each process. The data have to be provided as input as a histogram to perform a ***binned*** shape analysis, and as a RooDataSet to perform an ***unbinned*** shape analysis.

 !!! warning
-    If using RooFit based inputs (RooDataHists/RooDataSets/RooAbsPdfs) then you should be careful to use *different* RooRealVars as the observable in each category being combined. It is possible to use the same RooRealVar if the observable has the same range (and binning if using binned data) in each category though in most cases it is simpler to avoid doing so.
+    If using RooFit-based inputs (RooDataHists/RooDataSets/RooAbsPdfs) then you need to ensure you are using *different* RooRealVars as the observable in each category entering the statistical analysis. It is possible to use the same RooRealVar if the observable has the same range (and binning if using binned data) in each category, although in most cases it is simpler to avoid doing this.

-### Rates for shape analysis
+### Rates for shape analyses

-As with the counting experiment, the total nominal *rate* of a given process must be identified in the **rate** line of the datacard. However, there are special options for shape based analyses as follows
+As with the counting experiment, the total nominal *rate* of a given process must be identified in the **rate** line of the datacard. However, there are special options for shape-based analyses, as follows:

- * A value of **-1** in the rate line indicates to combine to calculate the rate from the input TH1 (via TH1::Integral) or RooDataSet/RooDataHist (via RooAbsData::sumEntries)
- * For parametric shapes (RooAbsPdf), if a parameter is found in the input workspace with the name pdfname**_norm** the rate will be multiplied by the value of that parameter. Note that since this parameter can be freely floating, the normalization of a shape can be made to freely float this way. This can also be achieved through the use of [`rateParams`](#rate-parameters)
+ * A value of **-1** in the rate line means Combine will calculate the rate from the input TH1 (via TH1::Integral) or RooDataSet/RooDataHist (via RooAbsData::sumEntries).
+ * For parametric shapes (RooAbsPdf), if a parameter with the name pdfname**_norm** is found in the input workspace, the rate will be multiplied by the value of that parameter. Note that since this parameter can be freely floating, the normalization of a process can be set to float freely this way. This can also be achieved through the use of [`rateParams`](#rate-parameters).

-### Binned shape analysis
+### Binned shape analyses

 For each channel, histograms have to be provided for the observed shape and for the expected shape of each process.

 - Within each channel, all histograms must have the same binning.
-- The normalization of the data histogram must correspond to the number of observed events
-- The normalization of the expected histograms must match the expected yields
+- The normalization of the data histogram must correspond to the number of observed events.
+- The normalization of the expected histograms must match the expected event yields.

-The combine tool can take as input histograms saved as TH1, as RooAbsHist in a RooFit workspace (an example of how to create a RooFit workspace and save histograms is available in [github](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/shapes/make_simple_shapes.cxx)), or from a pandas dataframe ([example](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-df.txt))
+The Combine tool can take as input histograms saved as TH1, as RooAbsHist in a RooFit workspace (an example of how to create a RooFit workspace and save histograms is available in [github](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/shapes/make_simple_shapes.cxx)), or from a pandas dataframe ([example](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-df.txt)).

-The block of lines defining the mapping (first block in the datacard) contains one or more rows in the form
+The block of lines defining the mapping (first block in the datacard) contains one or more rows of the form

- **shapes *process* *channel* *file* *histogram* *[histogram_with_systematics]* **

-In this line
+In this line,

-- ***process*** is any one the process names, or **\*** for all processes, or **data_obs** for the observed data
-- ***channel*** is any one the process names, or **\*** for all channels
-- *file*, *histogram* and *histogram_with_systematics* identify the names of the files and of the histograms within the file, after doing some replacements (if any are found):
- - **$PROCESS** is replaced with the process name (or "**data_obs**" for the observed data)
- - **$CHANNEL** is replaced with the channel name
- - **$SYSTEMATIC** is replaced with the name of the systematic + (**Up, Down**)
- - **$MASS** is replaced with the higgs mass value which is passed as option in the command line used to run the limit tool
+- ***process*** is any one of the process names, or **\*** for all processes, or **data_obs** for the observed data;
+- ***channel*** is any one of the channel names, or **\*** for all channels;
+- *file*, *histogram* and *histogram_with_systematics* identify the names of the files and of the histograms within the file, after making some replacements (if any are found):
+ - **$PROCESS** is replaced with the process name (or "**data_obs**" for the observed data);
+ - **$CHANNEL** is replaced with the channel name;
+ - **$SYSTEMATIC** is replaced with the name of the systematic + (**Up, Down**);
+ - **$MASS** is replaced with the chosen (Higgs boson) mass value that is passed as a command-line option when running the tool

-In addition, user defined keywords can be included to be replaced. Any word in the datacard **$WORD** will be replaced by **VALUE** when including the option `--keyword-value WORD=VALUE`. The option can be repeated multiple times for multiple keywords.
+In addition, user-defined keywords can be used. Any word in the datacard **$WORD** will be replaced by **VALUE** when including the option `--keyword-value WORD=VALUE`. This option can be repeated multiple times for multiple keywords.
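+As a hypothetical illustration (the file name and the `ERA` keyword below are not part of any tutorial input), a mapping line using such a keyword could look like
+
+```nohighlight
+shapes * * shapes_$ERA.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC
+```
+
+and running with, for example, `--keyword-value ERA=2016` would then read the histograms from `shapes_2016.root`.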
#### Template shape uncertainties

@@ -130,27 +130,27 @@ Shape uncertainties can be taken into account by vertical interpolation of the h

$$ f(\theta) = \frac{1}{2} \left( (\delta^{+}-\delta^{-})\theta + \frac{1}{8}(\delta^{+}+\delta^{-})(3\theta^6 - 10\theta^4 + 15\theta^2) \right) $$

-and for $|\theta|> 1$ ($|\theta|<-1$), $f(\theta)$ is a straight line with gradient $\delta^{+}$ ($\delta^{-}$), where $\delta^{+}=f(\theta=1)-f(\theta=0)$, and $\delta^{-}=f(\theta=-1)-f(\theta=0)$, derived using the nominal and up/down histograms. and
+and for $\theta > 1$ ($\theta < -1$), $f(\theta)$ is a straight line with gradient $\delta^{+}$ ($\delta^{-}$), where $\delta^{+}=f(\theta=1)-f(\theta=0)$, and $\delta^{-}=f(\theta=-1)-f(\theta=0)$, derived using the nominal and up/down histograms. This interpolation is designed so that the values of $f(\theta)$ and its derivatives are continuous for all values of $\theta$.

-The normalizations are interpolated linearly in log scale just like we do for log-normal uncertainties. If the value in a given bin is negative for some value of $\theta$, the value will be truncated at 0.
+The normalizations are interpolated linearly in log scale, just like we do for log-normal uncertainties. If the value in a given bin is negative for some value of $\theta$, the value will be truncated at 0.

-For each shape uncertainty and process/channel affected by it, two additional input shapes have to be provided, obtained shifting that parameter up and down by one standard deviation. When building the likelihood, each shape uncertainty is associated to a nuisance parameter taken from a unit gaussian distribution, which is used to interpolate or extrapolate using the specified histograms.
+For each shape uncertainty and process/channel affected by it, two additional input shapes have to be provided. These are obtained by shifting the parameter up and down by one standard deviation. When building the likelihood, each shape uncertainty is associated to a nuisance parameter taken from a unit gaussian distribution, which is used to interpolate or extrapolate using the specified histograms.

-For each given source of shape uncertainty, in the part of the datacard containing shape uncertainties (last block), there must be a row
+For each given shape uncertainty, the part of the datacard describing shape uncertainties must contain a row

- ** *name* *shape* *effect_for_each_process_and_channel* **

-The effect can be "-" or 0 for no effect, 1 for normal effect, and possibly something different from 1 to test larger or smaller effects (in that case, the unit gaussian is scaled by that factor before using it as parameter for the interpolation)
+The effect can be "-" or 0 for no effect, 1 for the normal effect, and something different from 1 to test larger or smaller effects (in that case, the unit gaussian is scaled by that factor before using it as parameter for the interpolation).

-The datacard in [data/tutorials/shapes/simple-shapes-TH1.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-TH1.txt) is a clear example of how to include shapes in the datacard. In the first block the following line specifies the shape mapping:
+The datacard in [data/tutorials/shapes/simple-shapes-TH1.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-TH1.txt) provides an example of how to include shapes in the datacard.
In the first block the following line specifies the shape mapping:

```nohighlight
shapes * * simple-shapes-TH1.root $PROCESS $PROCESS_$SYSTEMATIC
```

-The last block concerns the treatment of the systematics affecting shapes. In this part the two uncertainties effecting on the shape are listed.
+The last block concerns the treatment of the systematic uncertainties that affect shapes. In this case there are two uncertainties with a shape-altering effect.

```nohighlight
alpha shape - 1 uncertainty on background shape and normalization
@@ -158,13 +158,13 @@ sigma shape 0.5 - uncertainty on signal resolution. Assume the his
# so divide the unit gaussian by 2 before doing the interpolation
```

-There are two options for the interpolation algorithm in the "shape" uncertainty. Putting **`shape`** will result in a of the **fraction of events in each bin** - i.e the histograms are first normalised before interpolation. Putting **`shapeN`** while instead base the interpolation on the logs of the fraction in each bin. For _both_ **`shape`** and **`shapeN`**, the total normalisation is interpolated using an asymmetric log-normal so that the effect of the systematic on both the shape and normalisation are accounted for. The following image shows a comparison of those two algorithms for this datacard.
+There are two options for the interpolation algorithm in the "shape" uncertainty. Putting **`shape`** will result in an interpolation of the **fraction of events in each bin**. That is, the histograms are first normalized before interpolation. Putting **`shapeN`** will instead base the interpolation on the logs of the fraction in each bin. For _both_ **`shape`** and **`shapeN`**, the total normalization is interpolated using an asymmetric log-normal, so that the effect of the systematic on both the shape and normalization are accounted for. The following image shows a comparison of the two algorithms for the example datacard.

![](images/compare-shape-algo.png)

-In this case there are two processes, *signal* and *background*, and two uncertainties affecting background (*alpha*) and signal shape (*sigma*). Within the root file 2 histograms per systematic have to be provided, they are the shape obtained, for the specific process, shifting up and down the parameter associated to the uncertainty: `background_alphaUp` and `background_alphaDown`, `signal_sigmaUp` and `signal_sigmaDown`.
+In this case there are two processes, *signal* and *background*, and two uncertainties affecting the background (*alpha*) and signal shapes (*sigma*). In the ROOT file, two histograms per systematic have to be provided; these are the shapes obtained, for the specific process, by shifting the parameter associated with the uncertainty up and down by a standard deviation: `background_alphaUp` and `background_alphaDown`, `signal_sigmaUp` and `signal_sigmaDown`.
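+For orientation, the following is a minimal PyROOT sketch (with placeholder histogram contents; it is not the script used to produce the tutorial inputs) of how a set of templates following the `$PROCESS` / `$PROCESS_$SYSTEMATIC` naming convention of this datacard might be written to a ROOT file:
+
+```python
+import ROOT
+
+# Open an output file; histograms created afterwards are attached to it.
+fout = ROOT.TFile("my-shapes.root", "RECREATE")
+
+names = [
+    "data_obs",                                    # observed data
+    "signal", "background",                        # nominal templates
+    "background_alphaUp", "background_alphaDown",  # background shape systematic
+    "signal_sigmaUp", "signal_sigmaDown",          # signal shape systematic
+]
+for name in names:
+    hist = ROOT.TH1F(name, name, 10, 0.0, 10.0)
+    hist.FillRandom("gaus", 1000)  # placeholder content only
+    hist.Write()
+
+fout.Close()
+```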
-This is the content of the root file [simple-shapes-TH1.root ](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/shapes/simple-shapes-TH1.root) associated to the datacard [data/tutorials/shapes/simple-shapes-TH1.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/shapes/simple-shapes-TH1.txt):
+The content of the ROOT file [simple-shapes-TH1.root ](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/shapes/simple-shapes-TH1.root) associated with the datacard [data/tutorials/shapes/simple-shapes-TH1.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/benchmarks/shapes/simple-shapes-TH1.txt) is:

```nohighlight
root [0]
@@ -182,9 +182,9 @@ TFile** simple-shapes-TH1.root
 KEY: TH1F data_sig;1 Histogram of data_sig__x
```

-For example, without shape uncertainties you could have just one row with
+For example, without shape uncertainties there would only be one row with
`shapes * * shapes.root $CHANNEL/$PROCESS`
-Then for a simple example for two channels "e", "mu" with three processes "higgs", "zz", "top" you should create a rootfile that contains the following
+Then, to give a simple example for two channels ("e", "mu") with three processes ("higgs", "zz", "top"), the ROOT file contents should look like:

| histogram | meaning |
|:--------------|:---------------------------------------------|
@@ -197,35 +197,35 @@ Then for a simple example for two channels "e", "mu" with three processes "higgs
| `mu/zz` | expected shape for ZZ in muon channel |
| `mu/top` | expected shape for top in muon channel |

-If you also have one uncertainty that affects the shape, e.g. jet energy scale, you should create shape histograms for the jet energy scale shifted up by one sigma, you could for example do one folder for each process and write a like like
+If there is also an uncertainty that affects the shape, e.g. the jet energy scale, shape histograms for the jet energy scale shifted up and down by one sigma need to be included. This could be done by creating a folder for each process and writing a line like
`shapes * * shapes.root $CHANNEL/$PROCESS/nominal $CHANNEL/$PROCESS/$SYSTEMATIC`
-or just attach a postifx to the name of the histogram
+or a postfix can be added to the histogram name:
 `shapes * * shapes.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC`
!!! warning
-    If you have a nuisance parameter which has shape effects (using `shape`) *and* rate effects (using `lnN`) you should use a single line for the systemstic uncertainty with `shape?`. This will tell combine to fist look for Up/Down systematic templates for that process and if it doesnt find them, it will interpret the number that you put for the process as a `lnN` instead.
+    If you have a nuisance parameter that has shape effects on some processes (using `shape`) *and* rate effects on other processes (using `lnN`) you should use a single line for the systematic uncertainty with `shape?`. This will tell Combine to first look for Up/Down systematic templates for that process and if it does not find them, it will interpret the number that you put for the process as a `lnN` instead.
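+As a hypothetical illustration of `shape?` (the uncertainty name and values below are not taken from any tutorial datacard), a single line such as
+
+```nohighlight
+jes    shape?    1    1.05
+```
+
+would act as a shape uncertainty for the first process if the corresponding `*_jesUp`/`*_jesDown` templates are found in the input file, while for the second process, if no such templates exist, the value 1.05 would be interpreted as a `lnN` uncertainty.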
-For a detailed example of a template based binned analysis see the [H→ττ 2014 DAS tutorial](https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDataAnalysisSchool2014HiggsCombPropertiesExercise#A_shape_analysis_using_templates) +For a detailed example of a template-based binned analysis, see the [H→ττ 2014 DAS tutorial](https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDataAnalysisSchool2014HiggsCombPropertiesExercise#A_shape_analysis_using_templates) -### Unbinned or parametric shape analysis +### Unbinned or parametric shape analyses -In some cases, it can be convenient to describe the expected signal and background shapes in terms of analytical functions rather than templates; a typical example are the searches where the signal is apparent as a narrow peak over a smooth continuum background. In this context, uncertainties affecting the shapes of the signal and backgrounds can be implemented naturally as uncertainties on the parameters of those analytical functions. It is also possible to adapt an agnostic approach in which the parameters of the background model are left freely floating in the fit to the data, i.e. only requiring the background to be well described by a smooth function. +In some cases, it can be convenient to describe the expected signal and background shapes in terms of analytical functions, rather than templates. Typical examples are searches/measurements where the signal is apparent as a narrow peak over a smooth continuum background. In this context, uncertainties affecting the shapes of the signal and backgrounds can be implemented naturally as uncertainties in the parameters of those analytical functions. It is also possible to adopt an agnostic approach in which the parameters of the background model are left freely floating in the fit to the data, i.e. only requiring the background to be well described by a smooth function. -Technically, this is implemented by means of the RooFit package, that allows writing generic probability density functions, and saving them into ROOT files. The pdfs can be either taken from RooFit's standard library of functions (e.g. Gaussians, polynomials, ...) or hand-coded in C++, and combined together to form even more complex shapes. +Technically, this is implemented by means of the RooFit package, which allows writing generic probability density functions, and saving them into ROOT files. The PDFs can be either taken from RooFit's standard library of functions (e.g. Gaussians, polynomials, ...) or hand-coded in C++, and combined together to form even more complex shapes. -In the datacard using templates, the column after the file name would have been the name of the histogram. For the parametric analysis we need two names to identify the mapping, separated by a colon (**`:`**). +In the datacard using templates, the column after the file name would have been the name of the histogram. For parametric analysis we need two names to identify the mapping, separated by a colon (**`:`**). **shapes process channel shapes.root *workspace_name:pdf_name*** -The first part identifies the name of the input [RooWorkspace](http://root.cern.ch/root/htmldoc/RooWorkspace.html) containing the pdf, and the second part the name of the [RooAbsPdf](http://root.cern.ch/root/htmldoc/RooAbsPdf.html) inside it (or, for the observed data, the [RooAbsData](http://root.cern.ch/root/htmldoc/RooAbsData.html)). There can be multiple input workspaces, just as there can be multiple input root files. 
You can use any of the usual RooFit pre-defined pdfs for your signal and background models. +The first part identifies the name of the input [RooWorkspace](http://root.cern.ch/root/htmldoc/RooWorkspace.html) containing the PDF, and the second part the name of the [RooAbsPdf](http://root.cern.ch/root/htmldoc/RooAbsPdf.html) inside it (or, for the observed data, the [RooAbsData](http://root.cern.ch/root/htmldoc/RooAbsData.html)). It is possible to have multiple input workspaces, just as there can be multiple input ROOT files. You can use any of the usual RooFit pre-defined PDFs for your signal and background models. !!! warning - If you are using RooAddPdfs in your model in which the coefficients are *not defined recursively*, combine will not interpret them properly. You can add the option `--X-rtd ADDNLL_RECURSIVE=0` to any combine command in order to recover the correct interpretation, however we recommend that you instead redefine your pdf so that the coefficients are recursive (as described on the [RooAddPdf documentation](https://root.cern.ch/doc/master/classRooAddPdf.html)) and keep the total normalisation (i.e extended term) as a separate object as in the case of the tutorial datacard. + If in your model you are using RooAddPdfs, in which the coefficients are *not defined recursively*, Combine will not interpret them correctly. You can add the option `--X-rtd ADDNLL_RECURSIVE=0` to any Combine command in order to recover the correct interpretation, however we recommend that you instead re-define your PDF so that the coefficients are recursive (as described in the [RooAddPdf documentation](https://root.cern.ch/doc/master/classRooAddPdf.html)) and keep the total normalization (i.e the extended term) as a separate object, as in the case of the tutorial datacard. -For example, take a look at the [data/tutorials/shapes/simple-shapes-parametric.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-parametric.txt). We see the following line. +For example, take a look at the [data/tutorials/shapes/simple-shapes-parametric.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-parametric.txt). We see the following line: ```nohighlight shapes * * simple-shapes-parametric_input.root w:$PROCESS @@ -234,7 +234,7 @@ bin 1 1 process sig bkg ``` -which indicates that the input file `simple-shapes-parametric_input.root` should contain an input workspace (`w`) with pdfs named `sig` and `bkg` since these are the names of the two processes in the datacard. Additionally, we expect there to be a dataset named `data_obs`. If we look at the contents of the workspace inside `data/tutorials/shapes/simple-shapes-parametric_input.root`, this is indeed what we see... +which indicates that the input file `simple-shapes-parametric_input.root` should contain an input workspace (`w`) with PDFs named `sig` and `bkg`, since these are the names of the two processes in the datacard. Additionally, we expect there to be a data set named `data_obs`. If we look at the contents of the workspace in `data/tutorials/shapes/simple-shapes-parametric_input.root`, this is indeed what we see: ```nohighlight root [1] w->Print() @@ -256,17 +256,17 @@ RooDataSet::data_obs(j) ``` -In this datacard, the signal is parameterised in terms of the hypothesised mass (`MH`). Combine will use this variable, instead of creating its own, which will be interpreted as the value for `-m`. 
For this reason, we should add the option `-m 30` (or something else within the observable range) when running combine. You will also see there is a variable named `bkg_norm`. This is used to normalize the background rate (see the section on [Rate parameters](#rate-parameters) below for details).
+In this datacard, the signal is parameterized in terms of the hypothesized mass (`MH`). Combine will use this variable, instead of creating its own, which will be interpreted as the value for `-m`. For this reason, we should add the option `-m 30` (or something else within the observable range) when running Combine. You will also see there is a variable named `bkg_norm`. This is used to normalize the background rate (see the section on [Rate parameters](#rate-parameters) below for details).

!!! warning
-    Combine will not accept RooExtendedPdfs as an input. This is to alleviate a bug that lead to improper treatment of normalization when using multiple RooExtendedPdfs to describe a single process. You should instead use RooAbsPdfs and provide the rate as a separate object (see the [Rate parameters](#rate-parameters) section).
+    Combine will not accept RooExtendedPdfs as input. This is to alleviate a bug that led to improper treatment of the normalization when using multiple RooExtendedPdfs to describe a single process. You should instead use RooAbsPdfs and provide the rate as a separate object (see the [Rate parameters](#rate-parameters) section).

The part of the datacard related to the systematics can include lines with the syntax

- **name *param* X Y**

-These lines encode uncertainties on the parameters of the signal and background pdfs. The parameter is to be assigned a Gaussian uncertainty of **Y** around its mean value of **X**. One can change the mean value from 0 to 1 (or really any value, if one so chooses) if the parameter in question is multiplicative instead of additive.
+These lines encode uncertainties in the parameters of the signal and background PDFs. The parameter is to be assigned a Gaussian uncertainty of **Y** around its mean value of **X**. One can change the mean value from 0 to 1 (or any value, if one so chooses) if the parameter in question is multiplicative instead of additive.

In the [data/tutorials/shapes/simple-shapes-parametric.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-parametric.txt) datacard, there are lines for one such parametric uncertainty,

@@ -274,36 +274,36 @@ sigma param 1.0 0.1
```

-meaning there is a parameter already contained in the input workspace called **`sigma`** which should be *constrained* with a Gaussian centered at 1.0 with a width of 0.1. Note that, the exact interpretation (i.e all combine knows is that 1.0 should be the most likely value and 0.1 is its 1σ uncertainy) of these parameters is left to the user since the signal pdf is constructed externally by you. Asymmetric uncertainties are written as with `lnN` using the syntax **-1σ/+1σ** in the datacard.
+meaning there is a parameter in the input workspace called **`sigma`**, that should be *constrained* with a Gaussian centered at 1.0 with a width of 0.1. Note that the exact interpretation of these parameters is left to the user since the signal PDF is constructed externally by you. All Combine knows is that 1.0 should be the most likely value and 0.1 is its 1σ uncertainty.
Asymmetric uncertainties are written using the syntax **-1σ/+1σ** in the datacard, as is the case for `lnN` uncertainties. -If one wants to specify a parameter that is freely floating across its given range, and not gaussian constrained, the following syntax is used: +If one wants to specify a parameter that is freely floating across its given range, and not Gaussian constrained, the following syntax is used: - **name *flatParam* ** -Though this is *not strictly necessary* in frequentist methods using profiled likelihoods as combine will still profile these nuisances when performing fits (as is the case for the `simple-shapes-parametric.txt` datacard). +Though this is *not strictly necessary* in frequentist methods using profiled likelihoods, as Combine will still profile these nuisances when performing fits (as is the case for the `simple-shapes-parametric.txt` datacard). !!! warning - All parameters which are floating or constant in the user's input workspaces will remain floating or constant. Combine will ***not*** modify those for you! + All parameters that are floating or constant in the user's input workspaces will remain floating or constant. Combine will ***not*** modify those for you! -A full example of a parametric analysis can be found in this [H→γγ 2014 DAS tutorial](https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDataAnalysisSchool2014HiggsCombPropertiesExercise#A_parametric_shape_analysis_H) +A full example of a parametric analysis can be found in this [H→γγ 2014 DAS tutorial](https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDataAnalysisSchool2014HiggsCombPropertiesExercise#A_parametric_shape_analysis_H). -#### Caveat on using parametric pdfs with binned datasets +#### Caveat on using parametric PDFs with binned datasets -Users should be aware of a feature that affects the use of parametric pdfs together with binned datasets. +Users should be aware of a feature that affects the use of parametric PDFs together with binned datasets. -RooFit uses the integral of the pdf, computed analytically (or numerically, but disregarding the binning), to normalize it, but then computes the expected event yield in each bin evaluating only the pdf at the bin center. This means that if the variation of the pdf is sizeable within the bin then there is a mismatch between the sum of the event yields per bin and the pdf normalization, and that can cause a bias in the fits (more properly, the bias is there if the contribution of the second derivative integrated on the bin size is not negligible, since for linear functions evaluating them at the bin center is correct). There are two reccomended ways to work around this ... +RooFit uses the integral of the PDF, computed analytically (or numerically, but disregarding the binning), to normalize it, but computes the expected event yield in each bin by evaluating the PDF at the bin center. This means that if the variation of the pdf is sizeable within the bin, there is a mismatch between the sum of the event yields per bin and the PDF normalization, which can cause a bias in the fits. More specifically, the bias is present if the contribution of the second derivative integrated in the bin size is not negligible. For linear functions, an evaluation at the bin center is correct. There are two recommended ways to work around this issue: **1. Use narrow bins** -It is recommended to use bins that are significantly finer than the characteristic scale of the pdfs - which would anyway be the recommended thing even in the absence of this feature. 
Obviously, this caveat does not apply to analyses using templates (they're constant across each bin, so there's no bias), or using unbinned datasets. +It is recommended to use bins that are significantly finer than the characteristic scale of the PDFs. Even in the absence of this feature, this would be advisable. Note that this caveat does not apply to analyses using templates (they are constant across each bin, so there is no bias), or using unbinned datasets. **2. Use a RooParametricShapeBinPdf** -Another solution (currently implemented for 1-dimensional histograms only) is to use a custom pdf which performs the correct integrals internally as in [RooParametricShapeBinPdf](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/src/RooParametricShapeBinPdf.cc) +Another solution (currently only implemented for 1-dimensional histograms) is to use a custom PDF that performs the correct integrals internally, as in [RooParametricShapeBinPdf](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/src/RooParametricShapeBinPdf.cc). -Note that this pdf class now allows parameters that are themselves **RooAbsReal** objects (i.e. functions of other variables). The integrals are handled internally by calling the underlying pdf’s `createIntegral()` method with named ranges created for each of the bins. This means that if the analytical integrals for the underlying pdf are available, they will be used. +Note that this PDF class now allows parameters that are themselves **RooAbsReal** objects (i.e. functions of other variables). The integrals are handled internally by calling the underlying PDF's `createIntegral()` method with named ranges created for each of the bins. This means that if the analytical integrals for the underlying PDF are available, they will be used. -The constructor for this class requires a **RooAbsReal** (eg any **RooAbsPdf**)along with a list of **RooRealVars** (the parameters, excluding the observable $x$), +The constructor for this class requires a **RooAbsReal** (eg any **RooAbsPdf**) along with a list of **RooRealVars** (the parameters, excluding the observable $x$), ```c++ RooParametricShapeBinPdf(const char *name, const char *title, RooAbsReal& _pdf, RooAbsReal& _x, RooArgList& _pars, const TH1 &_shape ) @@ -314,15 +314,15 @@ Below is a comparison of a fit to a binned dataset containing 1000 events with o ![Narrow bins](images/narrow.png) ![Wide bins](images/wide.png) -In the upper plot, the data are binned in 100 evenly spaced bins, while in the lower plot, there are 3 irregular bins. The blue lines show the result of the fit -when using the **RooExponential** directly while the red shows the result when wrapping the pdf inside a **RooParametricShapeBinPdf**. In the narrow binned case, the two -agree well while for wide bins, accounting for the integral over the bin yields a better fit. +In the upper plot, the data are binned in 100 evenly-spaced bins, while in the lower plot, there are three irregular bins. The blue lines show the result of the fit +when using the **RooExponential** directly, while the red lines show the result when wrapping the PDF inside a **RooParametricShapeBinPdf**. In the narrow binned case, the two +agree well, while for wide bins, accounting for the integral over the bin yields a better fit. -You should note that using this class will result in slower fits so you should first decide if the added accuracy is enough to justify the reduced efficiency. 
+You should note that using this class will result in slower fits, so you should first decide whether the added accuracy is enough to justify the reduced efficiency.

## Beyond simple datacards

-Datacards can be extended in order to provide additional functionality and flexibility during runtime. These can also allow for the production of more complicated models and performing advanced computation of results beyond limits and significances.
+Datacards can be extended in order to provide additional functionality and flexibility during runtime. These can also allow for the production of more complicated models and for producing more advanced results.

### Rate parameters

@@ -332,25 +332,25 @@ The overall rate "expected" of a particular process in a particular bin does not
name rateParam bin process initial_value [min,max]
```

-The `[min,max]` argument is optional and if not included, combine will remove the range of this parameter. This will produce a new parameter in the model (unless it already exists) which multiplies the rate of that particular **process** in the given **bin** by its value.
+The `[min,max]` argument is optional. If it is not included, Combine will remove the range of this parameter. This will produce a new parameter in the model (unless it already exists), which multiplies the rate of that particular **process** in the given **bin** by its value.

-You can attach the same `rateParam` to multiple processes/bins by either using a wild card (eg `*` will match everything, `QCD_*` will match everything starting with `QCD_` etc.) in the name of the bin and/or process or by repeating the `rateParam` line in the datacard for different bins/processes with the same name.
+You can attach the same `rateParam` to multiple processes/bins by either using a wild card (eg `*` will match everything, `QCD_*` will match everything starting with `QCD_`, etc.) in the name of the bin and/or process, or by repeating the `rateParam` line in the datacard for different bins/processes with the same name.

!!! warning
-    `rateParam` is not a shortcut to evaluate the post-fit yield of a process since **other nuisances can also change the normalisation**. E.g., finding that the `rateParam` best-fit value is 0.9 does not necessarily imply that the process yield is 0.9 times the initial one. The best is to evaluate the yield taking into account the values of all nuisance parameters using [`--saveNormalizations`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/nonstandard/#normalizations).
+    `rateParam` is not a shortcut to evaluate the post-fit yield of a process since **other nuisance parameters can also change the normalization**. E.g., finding that the `rateParam` best-fit value is 0.9 does not necessarily imply that the process yield is 0.9 times the initial yield. The best approach is to evaluate the yield taking into account the values of all nuisance parameters using [`--saveNormalizations`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/nonstandard/#normalizations).

-This parameter is by default, freely floating. It is possible to include a Gaussian constraint on any `rateParam` which is floating (i.e not a `formula` or spline) by adding a `param` nuisance line in the datacard with the same name.
+This parameter is, by default, freely floating. It is possible to include a Gaussian constraint on any `rateParam` that is floating (i.e not a `formula` or spline) by adding a `param` nuisance line in the datacard with the same name.
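+As an illustrative sketch (the bin, process, and parameter names here are hypothetical), a floating background normalization with an optional Gaussian constraint could be written as
+
+```nohighlight
+bkgNorm rateParam bin1 bkg 1.0 [0.1,10]
+bkgNorm param 1.0 0.3
+```
+
+where the second line constrains `bkgNorm` with a Gaussian of width 0.3 centered at 1.0; omitting it leaves the parameter freely floating within the given range.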
-In addition to rate modifiers which are freely floating, modifiers which are functions of other parameters can be included using the following syntax,
+In addition to rate modifiers that are freely floating, modifiers that are functions of other parameters can be included using the following syntax,

```nohighlight
name rateParam bin process formula args
```

-where `args` is a comma separated list of the arguments for the string `formula`. You can include other nuisance parameters in the `formula`, including ones which are Gaussian constrained (i,e via the `param` directive.)
+where `args` is a comma-separated list of the arguments for the string `formula`. You can include other nuisance parameters in the `formula`, including ones that are Gaussian constrained (i.e. via the `param` directive).

-Below is an example datacard which uses the `rateParam` directive to implement an ABCD like method in combine. For a more realistic description of it's use for ABCD, see the single-lepton SUSY search implementation described [here](http://cms.cern.ch/iCMS/jsp/openfile.jsp?tp=draft&files=AN2015_207_v5.pdf)
+Below is an example datacard that uses the `rateParam` directive to implement an ABCD-like method in Combine. For a more realistic description of its use for ABCD, see the single-lepton SUSY search implementation described [here](http://cms.cern.ch/iCMS/jsp/openfile.jsp?tp=draft&files=AN2015_207_v5.pdf).

```nohighlight
imax 4 number of channels
@@ -372,20 +372,20 @@ gamma rateParam C bkg 100
delta rateParam D bkg 500
```

-For more examples of using `rateParam` (eg for fitting process normalisations in control regions and signal regions simultaneously) see this [2016 CMS tutorial](https://indico.cern.ch/event/577649/contributions/2339440/attachments/1380196/2097805/beyond_simple_datacards.pdf)
+For more examples of using `rateParam` (eg for fitting process normalizations in control regions and signal regions simultaneously) see this [2016 CMS tutorial](https://indico.cern.ch/event/577649/contributions/2339440/attachments/1380196/2097805/beyond_simple_datacards.pdf)

-Finally, any pre-existing RooAbsReal inside some rootfile with a workspace can be imported using the following
+Finally, any pre-existing RooAbsReal inside some ROOT file with a workspace can be imported using the following:

```nohighlight
name rateParam bin process rootfile:workspacename
```

-The name should correspond to the name of the object which is being picked up inside the RooWorkspace. A simple example using the SM XS and BR splines available in HiggsAnalysis/CombinedLimit can be found under [data/tutorials/rate_params/simple_sm_datacard.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/rate_params/simple_sm_datacard.txt)
+The name should correspond to the name of the object that is being picked up inside the RooWorkspace. A simple example using the SM XS and BR splines available in HiggsAnalysis/CombinedLimit can be found under [data/tutorials/rate_params/simple_sm_datacard.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/rate_params/simple_sm_datacard.txt)

### Extra arguments

-If a parameter is intended to be used and it is *not* a user defined `param` or `rateParam`, it can be picked up by first issuing an `extArgs` directive before this line in the datacard.
The syntax for `extArgs` is
+If a parameter is intended to be used, and it is *not* a user-defined `param` or `rateParam`, it can be picked up by first issuing an `extArg` directive before this line in the datacard. The syntax for `extArg` is:

```nohighlight
name extArg rootfile:workspacename
```

@@ -401,69 +401,69 @@ Note that the `[min,max]` argument is optional and if not included, the code wil

### Manipulation of Nuisance parameters

-It can often be useful to modify datacards, or the runtime behavior, without having to modify individual systematics lines. This can be acheived through the following.
+It can often be useful to modify datacards, or the runtime behavior, without having to modify individual systematic lines. This can be achieved through nuisance parameter modifiers.

#### Nuisance modifiers

-If a nuisance parameter needs to be renamed for certain processes/channels, it can be done so using a single `nuisance edit` directive at the end of a datacard
+If a nuisance parameter needs to be renamed for certain processes/channels, it can be done using a single `nuisance edit` directive at the end of a datacard

```nohighlight
nuisance edit rename process channel oldname newname [options]
```

-Note that the wildcard (**\***) can be used for either/both of process and channel.
-This will have the effect that nuisance parameter effecting a given process/channel will be renamed, thereby de-correlating it from other processes/channels. Use options `ifexists` to skip/avoid error if nuisance not found.
-This kind of command will only effect nuisances of the type **`shape[N]`**, **`lnN`**. Instead, if you also want to change the names of **`param`** type nuisances, you can use a global version
+Note that the wildcard (**\***) can be used for either a process, a channel, or both.
+This will have the effect that nuisance parameters affecting a given process/channel will be renamed, thereby de-correlating between processes/channels. Use the option `ifexists` to skip/avoid an error if the nuisance parameter is not found.
+This kind of command will only affect nuisances of the type **`shape[N]`**, **`lnN`**. Instead, if you also want to change the names of **`param`** type nuisances, you can use a global version

```nohighlight
nuisance edit rename oldname newname
```

which will rename all **`shape[N]`**, **`lnN`** and **`param`** nuisances found in one go. You should make sure these commands come after any process/channel specific ones in the datacard. This version does not accept options.

-Other edits are also supported as follows,
+Other edits are also supported, as follows:

- * `nuisance edit add process channel name pdf value [options]` -> add a new or add to a nuisance.
- * `nuisance edit drop process channel name [options]` -> remove this nuisance from the process/channel. Use options `ifexists` to skip/avoid error if nuisance not found.
- * `nuisance edit changepdf name newpdf` -> change the pdf type of a given nuisance to `newpdf`.
- * `nuisance edit split process channel oldname newname1 newname2 value1 value2` -> split a nuisance line into two separate nuisances called `newname1` and `newname2` with values `value1` and `value2`. Will produce two separate lines to that the original nuisance `oldname` becomes two uncorrelated nuisances.
- * `nuisance edit freeze name [options]` -> set nuisance to frozen by default. Can be over-ridden in `combine` command line using `--floatNuisances` option Use options `ifexists` to skip/avoid error if nuisance not found.
- * `nuisance edit merge process channel name1 name2` -> merge systematic `name2` into `name1` by adding their values in quadrature and removing `name2`. This only works if, for each process and channel included, they go in the same direction. For example, you can add 1.1 to 1.2, but not to 0.9.
+ * `nuisance edit add process channel name pdf value [options]` -> add a new nuisance parameter to a process
+ * `nuisance edit drop process channel name [options]` -> remove this nuisance from the process/channel. Use the option `ifexists` to skip/avoid errors if the nuisance parameter is not found.
+ * `nuisance edit changepdf name newpdf` -> change the PDF type of a given nuisance parameter to `newpdf`.
+ * `nuisance edit split process channel oldname newname1 newname2 value1 value2` -> split a nuisance parameter line into two separate nuisance parameters called `newname1` and `newname2` with values `value1` and `value2`. This will produce two separate lines so that the original nuisance parameter `oldname` is split into two uncorrelated nuisances.
+ * `nuisance edit freeze name [options]` -> set nuisance parameter frozen by default. Can be overridden on the command line using the `--floatNuisances` option. Use the option `ifexists` to skip/avoid errors if the nuisance parameter is not found.
+ * `nuisance edit merge process channel name1 name2` -> merge systematic `name2` into `name1` by adding their values in quadrature and removing `name2`. This only works if, for each process and channel included, the uncertainties both increase or both reduce the process yield. For example, you can add 1.1 to 1.2, but not to 0.9.

-The above edits (excluding the renaming) support nuisances which are any of **`shape[N]`**, **`lnN`**, **`lnU`**, **`gmN`**, **`param`**, **`flatParam`**, **`rateParam`** or **`discrete`** types.
+The above edits (excluding the renaming) support nuisance parameters of the types **`shape[N]`**, **`lnN`**, **`lnU`**, **`gmN`**, **`param`**, **`flatParam`**, **`rateParam`**, or **`discrete`**.

#### Groups of nuisances

-Often it is desirable to freeze one or more nuisances to check the impact they have on limits, likelihood scans, significances etc.
+Often it is desirable to freeze one or more nuisance parameters to check the impact they have on limits, likelihood scans, significances etc.

-However, for large groups of nuisances (eg everything associated to theory) it is easier to define ***nuisance groups*** in the datacard. The following line in a datacard will, for example, produce a group of nuisances with the group name
-`theory` which contains two parameters, `QCDscale` and `pdf`.
+However, for large groups of nuisance parameters (eg everything associated to theory) it is easier to define ***nuisance groups*** in the datacard. The following line in a datacard will, for example, produce a group of nuisance parameters with the group name
+`theory` that contains two parameters, `QCDscale` and `pdf`.

```nohighlight
theory group = QCDscale pdf
```

-Multiple groups can be defined in this way. It is also possible to extend nuisance groups in datacards using **+=** in place of **=**.
+Multiple groups can be defined in this way. It is also possible to extend nuisance parameter groups in datacards using **+=** in place of **=**.

-These groups can be manipulated at runtime (eg for freezing all nuisances associated to a group at runtime, see [Running the tool](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/)).
You can find more info on groups of nuisances [here](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/81x-root606/data/tutorials/groups)
+These groups can be manipulated at runtime (eg for freezing all nuisance parameters associated to a group at runtime, see [Running the tool](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/)). You can find more info on groups of nuisances [here](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/81x-root606/data/tutorials/groups)

-Note that when using the automatic addition of statistical uncertainties (autoMCStats), the corresponding nuisance parameters are created by `text2workspace.py` and so do not exist in the datacards. It is therefore not possible to add autoMCStats parameters to groups of nuisances in the way described above. However, `text2workspace.py` will automatically create a group labelled **`autoMCStats`** which contains all autoMCStats parameters.
+Note that when using the automatic addition of statistical uncertainties (autoMCStats), the corresponding nuisance parameters are created by `text2workspace.py` and so do not exist in the datacards. It is therefore not possible to add autoMCStats parameters to groups of nuisances in the way described above. However, `text2workspace.py` will automatically create a group labelled **`autoMCStats`**, which contains all autoMCStats parameters.

-This group is useful for freezing all parameters created by autoMCStats. For freezing subsets of the parameters, for example if the datacard contains two categories, **cat_label_1** and **cat_label_2**, to only freeze the autoMCStat parameters created for category **cat_label_1** the regular expression features can be used. In this example this can be achieved by using `--freezeParameters 'rgx{prop_bincat_label_1_bin.*}'`.
+This group is useful for freezing all parameters created by autoMCStats. For freezing subsets of the parameters, for example if the datacard contains two categories, **cat_label_1** and **cat_label_2**, to only freeze the autoMCStat parameters created for category **cat_label_1**, the regular expression features can be used. In this example this can be achieved by using `--freezeParameters 'rgx{prop_bincat_label_1_bin.*}'`.

### Combination of multiple datacards

-If you have separate channels each with it's own datacard, it is possible to produce a combined datacard using the script **`combineCards.py`**
+If you have separate channels, each with their own datacard, it is possible to produce a combined datacard using the script **`combineCards.py`**

The syntax is simple: **`combineCards.py Name1=card1.txt Name2=card2.txt .... > card.txt`**

-If the input datacards had just one bin each, then the output channels will be called `Name1`, `Name2`, and so on. Otherwise, a prefix `Name1_` ... `Name2_` will be added to the bin labels in each datacard. The supplied bin names `Name1`, `Name2`, etc. must themselves conform to valid C++/python identifier syntax.
+If the input datacards had just one bin each, the output channels will be called `Name1`, `Name2`, and so on. Otherwise, a prefix `Name1_` ... `Name2_` will be added to the bin labels in each datacard. The supplied bin names `Name1`, `Name2`, etc. must themselves conform to valid C++/python identifier syntax.

!!! warning
-    When combining datacards, you should pay attention that systematics which have different names will be assumed to be uncorrelated, and the ones with the same name will be assumed 100% correlated.
A systematic correlated across channels must have the same p.d.f. in all cards (i.e. always **`lnN`**, or all **`gmN`** with same `N`). Furthermore, when using *parametric models*, "parameter" objects such as `RooRealVar`, `RooAbsReal`, and `RooAbsCategory` (parameters, pdf indices etc) with the same name will be assumed to be the same object. If this is not intended, you may find unintended behaviour such as the order of combining cards having an impact on the results! Make sure that such objects are named differently in your inputs if they represent different things! Instead, Combine will try to rename other "shape" objects (such as pdfs) automatically.
+    When combining datacards, you should keep in mind that systematic uncertainties that have different names will be assumed to be uncorrelated, and those with the same name will be assumed 100% correlated. An uncertainty correlated across channels must have the same PDF in all cards (i.e. always **`lnN`**, or all **`gmN`** with same `N`. Note that `shape` and `lnN` can be interchanged via the `shape?` directive). Furthermore, when using *parametric models*, "parameter" objects such as `RooRealVar`, `RooAbsReal`, and `RooAbsCategory` (parameters, PDF indices etc) with the same name will be assumed to be the same object. If this is not intended, you may encounter unexpected behaviour, such as the order of combining cards having an impact on the results. Make sure that such objects are named differently in your inputs if they represent different things! Instead, Combine will try to rename other "shape" objects (such as PDFs) automatically.

-The `combineCards.py` script will complain if you are trying to combine a *shape* datacard with a *counting* datacard. You can however convert a *counting* datacard in an equivalent shape-based one by adding a line `shapes * * FAKE` in the datacard after the `imax`, `jmax` and `kmax` section. Alternatively, you can add the option `-S` in `combineCards.py` which will do this for you while making the combination.
+The `combineCards.py` script will fail if you are trying to combine a *shape* datacard with a *counting* datacard. You can however convert a *counting* datacard into an equivalent shape-based one by adding a line `shapes * * FAKE` in the datacard after the `imax`, `jmax`, and `kmax` section. Alternatively, you can add the option `-S` to `combineCards.py`, which will do this for you while creating the combined datacard.

### Automatic production of datacards and workspaces

-For complicated analyses or cases in which multiple datacards are needed (e.g. optimisation studies), you can avoid writing these by hand. The object [Datacard](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/Datacard.py) defines the analysis and can be created as a python object. The template python script below will produce the same workspace as running `textToWorkspace.py` (see the section on [Physics Models](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/physicsmodels/)) on the [realistic-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/realistic-counting-experiment.txt) datacard.
+For complicated analyses or cases in which multiple datacards are needed (e.g. optimization studies), you can avoid writing these by hand. The object [Datacard](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/python/Datacard.py) defines the analysis and can be created as a python object.
The template python script below will produce the same workspace as running `text2workspace.py` (see the section on [Physics Models](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/physicsmodels/)) on the [realistic-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/realistic-counting-experiment.txt) datacard.

```python
from HiggsAnalysis.CombinedLimit.DatacardParser import *
@@ -522,26 +522,26 @@ MB.setPhysics(defaultModel)
MB.doModel()
```

-Any existing datacard can be converted into such a template python script by using the `--dump-datacard` option in `text2workspace.py` in case a more complicated template is needed.
+Any existing datacard can be converted into such a template python script by using the `--dump-datacard` option in `text2workspace.py`, in case a more complicated template is needed.

!!! warning
-    The above is **not advised** for final results as this script is not easily combined with other analyses so should only be used for internal studies.
+    The above is **not advised** for final results, as this script is not easily combined with other analyses so should only be used for internal studies.

-For the automatic generation of datacards (which are combinable), you should instead use the [CombineHarvester](http://cms-analysis.github.io/CombineHarvester/) package which includes many features for producing complex datacards in a reliable, automated way.
+For the automatic generation of datacards that **are** combinable, you should instead use the [CombineHarvester](http://cms-analysis.github.io/CombineHarvester/) package, which includes many features for producing complex datacards in a reliable, automated way.

## Sanity checking the datacard

-For large combinations with multiple channels/processes etc, the `.txt` file can get unweildy to read through. There are some simple tools to help check and disseminate the contents of the cards.
+For large combinations with multiple channels/processes etc, the `.txt` file can get unwieldy to read through. There are some simple tools to help check and disseminate the contents of the cards.

-In order to get a quick view of the systematic uncertainties included in the datacard, you can use the `test/systematicsAnalyzer.py` tool. This will produce a list of the systematic uncertainties (normalisation and shape), indicating what type they are, which channels/processes they affect and the size of the affect on the normalisation (for shape uncertainties, this will just be the overall uncertaintly on the normalisation information).
+In order to get a quick view of the systematic uncertainties included in the datacard, you can use the `test/systematicsAnalyzer.py` tool. This will produce a list of the systematic uncertainties (normalization and shape), indicating what type they are, which channels/processes they affect and the size of the effect on the normalization (for shape uncertainties, this will just be the overall uncertainty on the normalization).

-The default output is a `.html` file which allows you to expand to give more details about the affect of the systematic for each channel/process. Add the option `--format brief` to give a simpler summary report direct to the terminal. An example output for the tutorial card `data/tutorials/shapes/simple-shapes-TH1.txt` is shown below.
+The default output is a `.html` file that can be expanded to give more details about the effect of the systematic uncertainty for each channel/process.
Add the option `--format brief` to obtain a simpler summary report direct to the terminal. An example output for the tutorial card `data/tutorials/shapes/simple-shapes-TH1.txt` is shown below. ```nohighlight $ python test/systematicsAnalyzer.py data/tutorials/shapes/simple-shapes-TH1.txt --all -f html > out.html ``` -which will produce the following output in html format. +This will produce the following output in html format: @@ -616,9 +616,9 @@ function toggleChann(id) { -In case you only have a cut-and-count style card, include the option `--noshape`. +In case you only have a counting experiment datacard, include the option `--noshape`. -If you have a datacard which uses several `rateParams` or a Physics model which includes some complicated product of normalisation terms in each process, you can check the values of the normalisation (and which objects in the workspace comprise them) using the `test/printWorkspaceNormalisations.py` tool. As an example, below is the first few blocks of output for the tutorial card `data/tutorials/counting/realistic-multi-channel.txt`. +If you have a datacard that uses several `rateParams` or a Physics model that includes a complicated product of normalization terms in each process, you can check the values of the normalization (and which objects in the workspace comprise them) using the `test/printWorkspaceNormalisations.py` tool. As an example, the first few blocks of output for the tutorial card `data/tutorials/counting/realistic-multi-channel.txt` are given below: /// details | **Show example output**

@@ -690,7 +690,7 @@ Dumping ProcessNormalization n_exp_bine_mu_proc_ZTT @ 0x6bc8910
 ///
 
 
-As you can see, for each channel, a report is given for the top-level rate object in the workspace, for each process contributing to that channel. You can also see the various terms which make up that rate. The default value is for the default parameters in the workspace (i.e when running `text2workspace`, these are the values created as default).
+As you can see, for each channel, a report is given for the top-level rate object in the workspace, for each process contributing to that channel. You can also see the various terms that make up that rate. The values printed correspond to the default values of the parameters in the workspace (i.e the values that are set when running `text2workspace`).
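+
+If you want to cross-check a single normalization term interactively, a minimal PyROOT sketch such as the one below can be used. The file and function names here are only illustrative (they assume the workspace built from the realistic-multi-channel tutorial card); substitute the names reported for your own workspace.
+
+```python
+# Minimal sketch: inspect one top-level rate object directly in the workspace.
+# "realistic-multi-channel.root" and "n_exp_bine_mu_proc_ZTT" are example names;
+# replace them with your own text2workspace output and rate object.
+import ROOT
+
+f = ROOT.TFile.Open("realistic-multi-channel.root")
+w = f.Get("w")                               # workspaces produced by text2workspace are named "w"
+norm = w.function("n_exp_bine_mu_proc_ZTT")  # top-level rate object for one process
+print("default value:", norm.getVal())
+norm.Print("v")                              # verbose printout of the object
+```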
 
 Another example is shown below for the workspace produced from the [data/tutorials/shapes/simple-shapes-parametric.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/shapes/simple-shapes-parametric.txt) datacard.
 
@@ -728,5 +728,5 @@ Another example is shown below for the workspace produced from the [data/tutoria
 
/// 

-This tells us that the normalisation for the background process, named `n_exp_final_binbin1_proc_bkg` is a product of two objects `n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm`. The first object is just from the **rate** line in the datacard (equal to 1) and the second is a floating parameter. For the signal, the normalisation is called `n_exp_binbin1_proc_sig` and is a `ProcessNormalization` object which contains the rate modifications due to the systematic uncertainties. You can see that it also has a "*nominal value*" which again is just from the value given in the **rate** line of the datacard (again=1).
+This tells us that the normalization for the background process, named `n_exp_final_binbin1_proc_bkg`, is a product of two objects `n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm`. The first object is just from the **rate** line in the datacard (equal to 1) and the second is a floating parameter. For the signal, the normalization is called `n_exp_binbin1_proc_sig` and is a `ProcessNormalization` object that contains the rate modifications due to the systematic uncertainties. You can see that it also has a "*nominal value*", which again is just from the value given in the **rate** line of the datacard (again=1).
diff --git a/docs/part3/commonstatsmethods.md b/docs/part3/commonstatsmethods.md
index 059b2f4bc87..e1087812115 100644
--- a/docs/part3/commonstatsmethods.md
+++ b/docs/part3/commonstatsmethods.md
@@ -1,27 +1,27 @@
 # Common Statistical Methods

-In this section, the most commonly used statistical methods from combine will be covered including specific instructions on how to obtain limits, significances and likelihood scans. For all of these methods, the assumed parameters of interest (POI) is the overall signal strength **r** (i.e the default PhysicsModel). In general however, the first POI in the list of POIs (as defined by the PhysicsModel) will be taken instead of **r** which may or may not make sense for a given method ... use your judgment!
+In this section, the most commonly used statistical methods from Combine will be covered, including specific instructions on how to obtain limits, significances, and likelihood scans. For all of these methods, the assumed parameter of interest (POI) is the overall signal strength **r** (i.e the default PhysicsModel). In general however, the first POI in the list of POIs (as defined by the PhysicsModel) will be taken instead of **r**. This may or may not make sense for any particular method, so care must be taken.

-This section will assume that you are using the default model unless otherwise specified.
+This section will assume that you are using the default physics model, unless otherwise specified.

 ## Asymptotic Frequentist Limits

-The `AsymptoticLimits` method allows to compute quickly an estimate of the observed and expected limits, which is fairly accurate when the event yields are not too small and the systematic uncertainties don't play a major role in the result.
-The limit calculation relies on an asymptotic approximation of the distributions of the **LHC** test-statistic, which is based on a profile likelihood ratio, under signal and background hypotheses to compute two p-values $p_{\mu}, p_{b}$ and therefore $CL_s=p_{\mu}/(1-p_{b})$ (see the (see the [FAQ](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part4/usefullinks/#faq) section for a description of these) - i.e it is the asymptotic approximation of computing limits with frequentist toys using the LHC test-statistic for limits: +The `AsymptoticLimits` method can be used to quickly compute an estimate of the observed and expected limits, which is accurate when the event yields are not too small and the systematic uncertainties do not play a major role in the result. +The limit calculation relies on an asymptotic approximation of the distributions of the **LHC** test statistic, which is based on a profile likelihood ratio, under the signal and background hypotheses to compute two p-values $p_{\mu}, p_{b}$ and therefore $CL_s=p_{\mu}/(1-p_{b})$ (see the [FAQ](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part4/usefullinks/#faq) section for a description). This means it is the asymptotic approximation for evaluating limits with frequentist toys using the LHC test statistic for limits. - * The test statistic is defined using the ratio of likelihoods $q_{r} = -2\ln[\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})/\mathcal{L}(\mathrm{data}|r=\hat{r},\hat{\theta})]$ , in which the nuisance parameters are profiled separately for $r=\hat{r}$ and $r$. The value of $q_{r}$ set to 0 when $\hat{r}>r$ giving a one sided limit. Furthermore, the constraint $r>0$ is enforced in the fit. This means that if the unconstrained value of $\hat{r}$ would be negative, the test statistic $q_{r}$ is evaluated as $-2\ln[\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})/\mathcal{L}(\mathrm{data}|r=0,\hat{\theta}_{0})]$ + * The test statistic is defined using the ratio of likelihoods $q_{r} = -2\ln[\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})/\mathcal{L}(\mathrm{data}|r=\hat{r},\hat{\theta})]$ , in which the nuisance parameters are profiled separately for $r=\hat{r}$ and $r$. The value of $q_{r}$ is set to 0 when $\hat{r}>r$, giving a one-sided limit. Furthermore, the constraint $r>0$ is enforced in the fit. This means that if the unconstrained value of $\hat{r}$ would be negative, the test statistic $q_{r}$ is evaluated as $-2\ln[\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})/\mathcal{L}(\mathrm{data}|r=0,\hat{\theta}_{0})]$ -This method is so commonly used that it is the default method (i.e not specifying `-M` will run `AsymptoticLimits`) +This method is the default Combine method: if you call Combine without specifying `-M`, the `AsymptoticLimits` method will be run. 
-A realistic example of datacard for a counting experiment can be found in the HiggsCombination package: [data/tutorials/counting/realistic-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/realistic-counting-experiment.txt)
+A realistic example of a datacard for a counting experiment can be found in the HiggsCombination package: [data/tutorials/counting/realistic-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/realistic-counting-experiment.txt)

-The method can be run using
+The `AsymptoticLimits` method can be run using

```sh
combine -M AsymptoticLimits realistic-counting-experiment.txt
```

-The program will print out the limit on the signal strength r (number of signal events / number of expected signal events) e .g. `Observed Limit: r < 1.6297 @ 95% CL` , the median expected limit `Expected 50.0%: r < 2.3111` and edges of the 68% and 95% ranges for the expected limits.
+The program will print the limit on the signal strength r (number of signal events / number of expected signal events), e.g. `Observed Limit: r < 1.6297 @ 95% CL`, the median expected limit `Expected 50.0%: r < 2.3111`, and edges of the 68% and 95% ranges for the expected limits.

```nohighlight
 <<< Combine >>>

@@ -43,13 +43,13 @@ By default, the limits are calculated using the CLs prescription, as

 !!! warning
-    You may find that combine issues a warning that the best fit for the background-only Asimov dataset returns a non-zero value for the signal strength for example;
+    You may find that Combine issues a warning that the best fit for the background-only Asimov dataset returns a nonzero value for the signal strength;

     `WARNING: Best fit of asimov dataset is at r = 0.220944 (0.011047 times`
     `rMax), while it should be at zero`

-    If this happens, you should check to make sure that there are no issues with the datacard or the Asimov generation used for your setup. For details on debugging it is recommended that you follow the simple checks used by the HIG PAG [here](https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWG/HiggsPAGPreapprovalChecks).
+    If this happens, you should check to make sure that there are no issues with the datacard or the Asimov generation used for your setup. For details on debugging, it is recommended that you follow the simple checks used by the HIG PAG [here](https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWG/HiggsPAGPreapprovalChecks).

-The program will also create a rootfile `higgsCombineTest.AsymptoticLimits.mH120.root` containing a root tree `limit` that contains the limit values and other bookeeping information. The important columns are `limit` (the limit value) and `quantileExpected` (-1 for observed limit, 0.5 for median expected limit, 0.16/0.84 for the edges of the 65% interval band of expected limits, 0.025/0.975 for 95%).
+The program will also create a ROOT file `higgsCombineTest.AsymptoticLimits.mH120.root` containing a ROOT tree `limit` that contains the limit values and other bookkeeping information. The important columns are `limit` (the limit value) and `quantileExpected` (-1 for observed limit, 0.5 for median expected limit, 0.16/0.84 for the edges of the 68% interval band of expected limits, 0.025/0.975 for 95%).

```nohighlight
$ root -l higgsCombineTest.AsymptoticLimits.mH120.root
root [0] limit->Scan("*")
@@ -68,17 +68,17 @@

 ### Blind limits

-The `AsymptoticLimits` calculation follows the frequentist paradigm for calculating expected limits.
This means that the routine will first fit the observed data, conditionally for a fixed value of **r** and set the nuisance parameters to the values obtained in the fit for generating the Asimov data, i.e it calculates the **post-fit** or **a-posteriori** expected limit. In order to use the **pre-fit** nuisance parameters (to calculate an **a-priori** limit), you must add the option `--noFitAsimov` or `--bypassFrequentistFit`. +The `AsymptoticLimits` calculation follows the frequentist paradigm for calculating expected limits. This means that the routine will first fit the observed data, conditionally for a fixed value of **r**, and set the nuisance parameters to the values obtained in the fit for generating the Asimov data set. This means it calculates the **post-fit** or **a-posteriori** expected limit. In order to use the **pre-fit** nuisance parameters (to calculate an **a-priori** limit), you must add the option `--noFitAsimov` or `--bypassFrequentistFit`. For blinding the results completely (i.e not using the data) you can include the option `--run blind`. !!! warning - You should *never* use `-t -1` to get blind limits! + While you *can* use `-t -1` to get blind limits, if the correct options are passed, we strongly recommend to use `--run blind`. ### Splitting points -In case your model is particularly complex, you can perform the asymptotic calculation by determining the value of CLs for a set grid of points (in `r`) and merging the results. This is done by using the option `--singlePoint X` for multiple values of X, hadding the output files and reading them back in, +In case your model is particularly complex, you can perform the asymptotic calculation by determining the value of CLs for a set grid of points (in `r`) and merging the results. This is done by using the option `--singlePoint X` for multiple values of X, hadd'ing the output files and reading them back in, ```sh combine -M AsymptoticLimits realistic-counting-experiment.txt --singlePoint 0.1 -n 0.1 @@ -93,14 +93,14 @@ combine -M AsymptoticLimits realistic-counting-experiment.txt --getLimitFromGrid ## Asymptotic Significances -The significance of a result is calculated using a ratio of profiled likelihoods, one in which the signal strength is set to 0 and the other in which it is free to float, i.e the quantity is $-2\ln[\mathcal{L}(\textrm{data}|r=0,\hat{\theta}_{0})/\mathcal{L}(\textrm{data}|r=\hat{r},\hat{\theta})]$, in which the nuisance parameters are profiled separately for $r=\hat{r}$ and $r=0$. +The significance of a result is calculated using a ratio of profiled likelihoods, one in which the signal strength is set to 0 and the other in which it is free to float. The evaluated quantity is $-2\ln[\mathcal{L}(\textrm{data}|r=0,\hat{\theta}_{0})/\mathcal{L}(\textrm{data}|r=\hat{r},\hat{\theta})]$, in which the nuisance parameters are profiled separately for $r=\hat{r}$ and $r=0$. -The distribution of this test-statistic can be determined using Wilke's theorem provided the number of events is large enough (i.e in the *Asymptotic limit*). The significance (or p-value) can therefore be calculated very quickly and uses the `Significance` method. +The distribution of this test statistic can be determined using Wilks' theorem provided the number of events is large enough (i.e in the *Asymptotic limit*). The significance (or p-value) can therefore be calculated very quickly. The `Significance` method can be used for this. 
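+
+As a standalone illustration of this asymptotic behaviour (the snippet below is not part of Combine), the observed value of the profile likelihood ratio test statistic for $r=0$, often called $q_{0}$, can be converted into a significance and a p-value directly, assuming the usual one-sided convention:
+
+```python
+# Illustration only: under Wilks' theorem q0 follows a chi-square distribution with one
+# degree of freedom, so Z = sqrt(q0) and the p-value is the one-sided Gaussian tail.
+import math
+
+def significance_from_q0(q0):
+    z = math.sqrt(max(q0, 0.0))
+    pval = 0.5 * math.erfc(z / math.sqrt(2.0))  # one-sided tail probability
+    return z, pval
+
+print(significance_from_q0(9.0))  # q0 = 9 corresponds to Z = 3, p ~ 1.3e-3
+```
+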
-It is also possible to calculate the ratio of likelihoods between the freely floating signal strength to that of a fixed signal strength *other than 0*, by specifying it with the option `--signalForSignificance=X` +It is also possible to calculate the ratio of likelihoods between the freely floating signal strength to that of a fixed signal strength *other than 0*, by specifying it with the option `--signalForSignificance=X`. !!! info - This calculation assumes that the signal strength can only be positive (i.e we are not interested in negative signal strengths). This can be altered by including the option `--uncapped` + This calculation assumes that the signal strength can only be positive (i.e we are not interested in negative signal strengths). This behaviour can be altered by including the option `--uncapped`. ### Compute the observed significance @@ -123,13 +123,13 @@ Done in 0.00 min (cpu), 0.01 min (real) which is not surprising since 0 events were observed in that datacard. -The output root file will contain the significance value in the branch **limit**. To store the p-value instead, include the option `--pval`. These can be converted between one another using the RooFit functions `RooFit::PValueToSignificance` and `RooFit::SignificanceToPValue`. +The output ROOT file will contain the significance value in the branch **limit**. To store the p-value instead, include the option `--pval`. The significance and p-value can be converted between one another using the RooFit functions `RooFit::PValueToSignificance` and `RooFit::SignificanceToPValue`. -You may find it useful to resort to a brute-force fitting algorithm when calculating the significance which scans the nll (repeating fits until a tolerance is reached), bypassing MINOS, which can be activated with the option `bruteForce`. This can be tuned using the options `setBruteForceAlgo`, `setBruteForceTypeAndAlgo` and `setBruteForceTolerance`. +When calculating the significance, you may find it useful to resort to a brute-force fitting algorithm that scans the nll (repeating fits until a certain tolerance is reached), bypassing MINOS, which can be activated with the option `bruteForce`. This can be tuned using the options `setBruteForceAlgo`, `setBruteForceTypeAndAlgo` and `setBruteForceTolerance`. ### Computing the expected significance -The expected significance can be computed from an Asimov dataset of signal+background. There are two options for this +The expected significance can be computed from an Asimov data set of signal+background. There are two options for this: * a-posteriori expected: will depend on the observed dataset. * a-priori expected (the default behavior): does not depend on the observed dataset, and so is a good metric for optimizing an analysis when still blinded. @@ -140,9 +140,9 @@ The **a-priori** expected significance from the Asimov dataset is calculated as combine -M Significance datacard.txt -t -1 --expectSignal=1 ``` -In order to produced the **a-posteriori** expected significance, just generate a post-fit Asimov (i.e add the option `--toysFreq` in the command above). +In order to produce the **a-posteriori** expected significance, just generate a post-fit Asimov data set by adding the option `--toysFreq` in the command above. 
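+
+The resulting expected significance can be read back from the output file with a few lines of PyROOT. The file name below assumes the default naming scheme (`higgsCombine<name>.Significance.mH<mass>.root`, with the default name `Test` and mass 120); adjust it to whatever your job produced.
+
+```python
+# Sketch: read the (expected) significance stored in the "limit" branch.
+# The file name is an assumption based on the default Combine output naming.
+import ROOT
+
+f = ROOT.TFile.Open("higgsCombineTest.Significance.mH120.root")
+tree = f.Get("limit")
+for entry in tree:
+    print("significance:", entry.limit)
+```
+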
-The output format is the same as for observed signifiances: the variable **limit** in the tree will be filled with the significance (or with the p-value if you put also the option `--pvalue`) +The output format is the same as for observed significances: the variable **limit** in the tree will be filled with the significance (or with the p-value if you put also the option `--pvalue`) ## Bayesian Limits and Credible regions @@ -151,7 +151,7 @@ Bayesian calculation of limits requires the user to assume a particular prior di ### Computing the observed bayesian limit (for simple models) -The `BayesianSimple` method computes a Bayesian limit performing classical numerical integration; very fast and accurate but only works for simple models (a few channels and nuisance parameters). +The `BayesianSimple` method computes a Bayesian limit performing classical numerical integration. This is very fast and accurate, but only works for simple models (a few channels and nuisance parameters). ```nohighlight combine -M BayesianSimple simple-counting-experiment.txt @@ -162,11 +162,11 @@ Limit: r < 0.672292 @ 95% CL Done in 0.04 min (cpu), 0.05 min (real) ``` -The output tree will contain a single entry corresponding to the observed 95% upper limit. The confidence level can be modified to **100*X%** using `--cl X`. +The output tree will contain a single entry corresponding to the observed 95% confidence level upper limit. The confidence level can be modified to **100*X%** using `--cl X`. ### Computing the observed bayesian limit (for arbitrary models) -The `MarkovChainMC` method computes a Bayesian limit performing a monte-carlo integration. From the statistics point of view it is identical to the `BayesianSimple` method, only the technical implementation is different. The method is slower, but can also handle complex models. For this method, you can increase the accuracy of the result by increasing the number of markov chains at the expense of a longer running time (option `--tries`, default is 10). Let's use the realistic counting experiment datacard to test the method +The `MarkovChainMC` method computes a Bayesian limit performing a Monte Carlo integration. From the statistical point of view it is identical to the `BayesianSimple` method, only the technical implementation is different. The method is slower, but can also handle complex models. For this method you can increase the accuracy of the result by increasing the number of Markov Chains, at the expense of a longer running time (option `--tries`, default is 10). Let's use the realistic counting experiment datacard to test the method. To use the MarkovChainMC method, users need to specify this method in the command line, together with the options they want to use. For instance, to set the number of times the algorithm will run with different random seeds, use option `--tries`: @@ -180,9 +180,9 @@ Average chain acceptance: 0.078118 Done in 0.14 min (cpu), 0.15 min (real) ``` -Again, the resulting limit tree will contain the result. You can also save the chains using the option `--saveChain` which will then also be included in the output file. +Again, the resulting limit tree will contain the result. You can also save the chains using the option `--saveChain`, which will then also be included in the output file. -Exclusion regions can be made from the posterior once an ordering principle is defined to decide how to grow the contour (there's infinite possible regions that contain 68% of the posterior pdf). 
Below is a simple example script which can be used to plot the posterior distribution from these chains and calculate the *smallest* such region. Note that in this example we are ignoring the burn-in (but you can add it by just editing `for i in range(mychain.numEntries()):` to `for i in range(200,mychain.numEntries()):` eg for a burn-in of 200. +Exclusion regions can be made from the posterior once an ordering principle is defined to decide how to grow the contour (there is an infinite number of possible regions that contain 68% of the posterior pdf). Below is a simple example script that can be used to plot the posterior distribution from these chains and calculate the *smallest* such region. Note that in this example we are ignoring the burn-in. This can be added by e.g. changing `for i in range(mychain.numEntries()):` to `for i in range(200,mychain.numEntries()):` for a burn-in of 200. /// details | **Show example script**

@@ -253,57 +253,57 @@ Running the script on the output file produced for the same datacard (including
 
 	0.950975 % (0.95 %) interval (target)  = 0 < r < 2.2
 
-along with a plot of the posterior shown below. This is the same as the output from combine but the script can also be used to find lower limits (for example) or credible intervals.
+along with a plot of the posterior distribution shown below. This is the same as the output from Combine, but the script can also be used to find lower limits (for example) or credible intervals.
 
 ![](images/bayes1D.png)
 
-An example to make contours when ordering by probability density is in [bayesContours.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/multiDim/bayesContours.cxx), but the implementation is very simplistic, with no clever handling of bin sizes nor any smoothing of statistical fluctuations.
+An example to make contours when ordering by probability density can be found in [bayesContours.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/multiDim/bayesContours.cxx). Note that the implementation is simplistic, with no clever handling of bin sizes nor smoothing of statistical fluctuations.
 
-The `MarkovChainMC` algorithm has many configurable parameters, and you're encouraged to experiment with those because the default configuration might not be the best for you (or might not even work for you at all).
+The `MarkovChainMC` algorithm has many configurable parameters, and you are encouraged to experiment with those. The default configuration might not be the best for your analysis.
 
 #### Iterations, burn-in, tries
 
 Three parameters control how the MCMC integration is performed:
 
--   the number of **tries** (option `--tries`): the algorithm will run multiple times with different ransom seeds and report as result the truncated mean and rms of the different results. The default value is 10, which should be ok for a quick computation, but for something more accurate you might want to increase this number even up to ~200.
--   the number of **iterations** (option `-i`) determines how many points are proposed to fill a single Markov Chain. The default value is 10k, and a plausible range is between 5k (for quick checks) and 20-30k for lengthy calculations. Usually beyond 30k you get a better tradeoff in time vs accuracy by increasing the number of chains (option `--tries`)
--   the number of **burn-in steps** (option `-b`) is the number of points that are removed from the beginning of the chain before using it to compute the limit. The default is 200. If your chain is very long, you might want to try increase this a bit (e.g. to some hundreds). Instead using a burn-on below 50 is likely to result in bias towards earlier stages of the chain before a reasonable convergence.
+-   the number of **tries** (option `--tries`): the algorithm will run multiple times with different random seeds. The truncated mean and RMS of the different results are reported. The default value is 10, which should be sufficient for a quick computation. For a more accurate result you might want to increase this number up to even ~200.
+-   the number of **iterations** (option `-i`) determines how many points are proposed to fill a single Markov Chain. The default value is 10k, and a plausible range is between 5k (for quick checks) and 20-30k for lengthy calculations. Beyond 30k, the time vs accuracy can be balanced better by increasing the number of chains (option `--tries`).
+-   the number of **burn-in steps** (option `-b`) is the number of points that are removed from the beginning of the chain before using it to compute the limit. The default is 200. If the chain is very long, we recommend to increase this value a bit (e.g. to several hundreds). Using a number of burn-in steps below 50 is likely to result in a bias towards earlier stages of the chain before a reasonable convergence.
 
 #### Proposals
 
 The option `--proposal` controls the way new points are proposed to fill in the MC chain.
 
 -   **uniform**: pick points at random. This works well if you have very few nuisance parameters (or none at all), but normally fails if you have many.
--   **gaus**: Use a product of independent gaussians one for each nuisance parameter; the sigma of the gaussian for each variable is 1/5 of the range of the variable (this can be controlled using the parameter `--propHelperWidthRangeDivisor`). This proposal appears to work well for a reasonable number of nuisances (up to ~15), provided that the range of the nuisance parameters is reasonable - something like ±5σ. This method does **not** work when there are no nuisance parameters.
--   **ortho** (**default**): This proposal is similar to the multi-gaussian proposal but at every step only a single coordinate of the point is varied, so that the acceptance of the chain is high even for a large number of nuisances (i.e. more than 20).
--   **fit**: Run a fit and use the uncertainty matrix from HESSE to construct a proposal (or the one from MINOS if the option `--runMinos` is specified). This sometimes work fine, but sometimes gives biased results, so we don't recommend it in general.
+-   **gaus**: Use a product of independent gaussians, one for each nuisance parameter. The sigma of the gaussian for each variable is 1/5 of the range of the variable. This behaviour can be controlled using the parameter `--propHelperWidthRangeDivisor`. This proposal appears to work well for up to around 15 nuisance parameters, provided that the range of the nuisance parameters is in the range ±5σ. This method does **not** work when there are no nuisance parameters.
+-   **ortho** (**default**): This proposal is similar to the multi-gaussian proposal. However, at every step only a single coordinate of the point is varied, so that the acceptance of the chain is high even for a large number of nuisance parameters (i.e. more than 20).
+-   **fit**: Run a fit and use the uncertainty matrix from HESSE to construct a proposal (or the one from MINOS if the option `--runMinos` is specified). This can give biased results, so this method is not recommended in general.
 
-If you believe there's something going wrong, e.g. if your chain remains stuck after accepting only a few events, the option `--debugProposal` can be used to have a printout of the first *N* proposed points to see what's going on (e.g. if you have some region of the phase space with probability zero, the **gaus** and **fit** proposal can get stuck there forever)
+If you believe there is something going wrong, e.g. if your chain remains stuck after accepting only a few events, the option `--debugProposal` can be used to obtain a printout of the first *N* proposed points. This can help you understand what is happening; for example if you have a region of the phase space with probability zero, the **gaus** and **fit** proposal can get stuck there forever.
 
 
 ### Computing the expected bayesian limit
 
-The expected limit is computed by generating many toy mc observations and compute the limit for each of them. This can be done passing the option `-t` . E.g. to run 100 toys with the `BayesianSimple` method, just do
+The expected limit is computed by generating many toy MC data sets and computing the limit for each of them. This can be done passing the option `-t` . E.g. to run 100 toys with the `BayesianSimple` method, you can run
 
     combine -M BayesianSimple datacard.txt -t 100 
 
-The program will print out the mean and median limit, and the 68% and 95% quantiles of the distributions of the limits. This time, the output root tree will contain **one entry per toy**.
+The program will print out the mean and median limit, as well as the 68% and 95% quantiles of the distributions of the limits. This time, the output ROOT tree will contain **one entry per toy**.
 
-For more heavy methods (eg the `MarkovChainMC`) you'll probably want to split this in multiple jobs. To do this, just run `combine` multiple times specifying a smaller number of toys (can be as low as `1`) each time using a different seed to initialize the random number generator (option `-s` if you set it to -1, the starting seed will be initialized randomly at the beginning of the job), then merge the resulting trees with `hadd` and look at the distribution in the merged file.
+For more heavy methods (eg the `MarkovChainMC`) you will probably want to split this calculation into multiple jobs. To do this, just run Combine multiple times specifying a smaller number of toys (as low as `1`), using a different seed to initialize the random number generator each time. The option `-s` can be used for this; if you set it to -1, the starting seed will be initialized randomly at the beginning of the job. Finally, you can merge the resulting trees with `hadd` and look at the distribution in the merged file.
 
 ### Multidimensional bayesian credible regions
 
-The `MarkovChainMC` method allows the user to produce the posterior pdf as a function of (in principle) any number of parameter of interest. In order to do so, you first need to create a workspace with more than one parameter, as explained in the [physics models](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/physicsmodels/) section.
+The `MarkovChainMC` method allows the user to produce the posterior PDF as a function of (in principle) any number of POIs. In order to do so, you first need to create a workspace with more than one parameter, as explained in the [physics models](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/physicsmodels/) section.
 
-For example, lets use the toy datacard [data/tutorials/multiDim/toy-hgg-125.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/multiDim/toy-hgg-125.txt) (counting experiment which vaguely resembles the H→γγ analysis at 125 GeV) and convert the datacard into a workspace with 2 parameters, ggH and qqH cross sections using `text2workspace`.
+For example, let us use the toy datacard [data/tutorials/multiDim/toy-hgg-125.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/multiDim/toy-hgg-125.txt) (counting experiment that vaguely resembles an early H→γγ analysis at 125 GeV) and convert the datacard into a workspace with 2 parameters, the ggH and qqH cross sections, using `text2workspace`.
 
     text2workspace.py data/tutorials/multiDim/toy-hgg-125.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingXSHiggs --PO modes=ggH,qqH -o workspace.root
 
-Now we just run one (or more) MCMC chain(s) and save them in the output tree.By default, the nuisance parameters will be marginalized (integrated) over their pdfs. You can ignore the complaints about not being able to compute an upper limit (since for more than 1D, this isn't well defined),
+Now we just run one (or more) MCMC chain(s) and save them in the output tree. By default, the nuisance parameters will be marginalized (integrated) over their PDFs. You can ignore the complaints about not being able to compute an upper limit (since for more than 1D, this is not well-defined),
 
     combine -M MarkovChainMC workspace.root --tries 1 --saveChain -i 1000000 -m 125 -s 12345 
 
-The output of the markov chain is again a RooDataSet of weighted events distributed according to the posterior pdf (after you cut out the burn in part), so it can be used to make histograms or other distributions of the posterior pdf. See as an example [bayesPosterior2D.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/multiDim/bayesPosterior2D.cxx).
+The output of the Markov Chain is again a RooDataSet of weighted events distributed according to the posterior PDF (after you cut out the burn in part), so it can be used to make histograms or other distributions of the posterior PDF. See as an example [bayesPosterior2D.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/multiDim/bayesPosterior2D.cxx).
 
 Below is an example of the output of the macro,
 
@@ -317,15 +317,15 @@ bayesPosterior2D("bayes2D","Posterior PDF")
 
 ## Computing Limits with toys
 
-The `HybridNew` method is used to compute either the hybrid bayesian-frequentist limits popularly known as "CLs of LEP or Tevatron type" or the fully frequentist limits which are the current recommended method by the LHC Higgs Combination Group. Note that these methods can be resource intensive for complex models.
+The `HybridNew` method is used to compute either the hybrid bayesian-frequentist limits, popularly known as "CLs of LEP or Tevatron type", or the fully frequentist limits, which are the current recommended method by the LHC Higgs Combination Group. Note that these methods can be resource intensive for complex models.
 
-It is possible to define the criterion used for setting limits using `--rule CLs` (to use the CLs criterion) or `--rule CLsplusb` (to calculate the limit using $p_{\mu}$) and as always the confidence level desired using `--cl=X`
+It is possible to define the criterion used for setting limits using `--rule CLs` (to use the CLs criterion) or `--rule CLsplusb` (to calculate the limit using $p_{\mu}$) and as always the confidence level desired using `--cl=X`.
 
-The choice of test-statistic can be made via the option `--testStat` and different methodologies for treatment of the nuisance parameters are available. While it is possible to mix different test-statistics with different nuisance parameter treatments, this is highly **not-reccomended**. Instead one should follow one of the following three procedures,
+The choice of test statistic can be made via the option `--testStat`. Different methodologies for the treatment of the nuisance parameters are available. While it is possible to mix different test statistics with different nuisance parameter treatments, **we strongly do not recommend  this**. Instead one should follow one of the following three procedures,
 
 * **LEP-style**: `--testStat LEP --generateNuisances=1 --fitNuisances=0`
     * The test statistic is defined using the ratio of likelihoods $q_{\mathrm{LEP}}=-2\ln[\mathcal{L}(\mathrm{data}|r=0)/\mathcal{L}(\mathrm{data}|r)]$.
-    * The nuisance parameters are fixed to their nominal values for the purpose of evaluating the likelihood, while for generating toys, the nuisance parameters are first randomized within their pdfs before generation of the toy.
+    * The nuisance parameters are fixed to their nominal values for the purpose of evaluating the likelihood, while for generating toys, the nuisance parameters are first randomized within their PDFs before generation of the toy.
 
 * **TEV-style**: `--testStat TEV --generateNuisances=0 --generateExternalMeasurements=1 --fitNuisances=1`
     * The test statistic is defined using the ratio of likelihoods $q_{\mathrm{TEV}}=-2\ln[\mathcal{L}(\mathrm{data}|r=0,\hat{\theta}_{0})/\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})]$, in which the nuisance parameters are profiled separately for $r=0$ and $r$.
@@ -334,23 +334,23 @@ The choice of test-statistic can be made via the option `--testStat` and differe
 * **LHC-style**: `--LHCmode LHC-limits`
 , which is the shortcut for `--testStat LHC --generateNuisances=0 --generateExternalMeasurements=1 --fitNuisances=1`
     * The test statistic is defined using the ratio of likelihoods $q_{r} = -2\ln[\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})/\mathcal{L}(\mathrm{data}|r=\hat{r},\hat{\theta})]$ , in which the nuisance parameters are profiled separately for $r=\hat{r}$ and $r$.
-    * The value of $q_{r}$ set to 0 when $\hat{r}>r$ giving a one sided limit. Furthermore, the constraint $r>0$ is enforced in the fit. This means that if the unconstrained value of $\hat{r}$ would be negative, the test statistic $q_{r}$ is evaluated as $-2\ln[\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})/\mathcal{L}(\mathrm{data}|r=0,\hat{\theta}_{0})]$
+    * The value of $q_{r}$ set to 0 when $\hat{r}>r$ giving a one-sided limit. Furthermore, the constraint $r>0$ is enforced in the fit. This means that if the unconstrained value of $\hat{r}$ would be negative, the test statistic $q_{r}$ is evaluated as $-2\ln[\mathcal{L}(\mathrm{data}|r,\hat{\theta}_{r})/\mathcal{L}(\mathrm{data}|r=0,\hat{\theta}_{0})]$
     * For the purposes of toy generation, the nuisance parameters are fixed to their **post-fit** values from the data (conditionally on the value of **r**), while the constraint terms are randomized in the evaluation of the likelihood.
 
 !!! warning
-    The recommended style is the **LHC-style**. Please note that this method is sensitive to the *observation in data* since the *post-fit* (after a fit to the data) values of the nuisance parameters (assuming different values of **r**) are used when generating the toys. For completely blind limits you can first generate a *pre-fit* asimov toy dataset (described in the [toy data generation](runningthetool.md#toy-data-generation) section) and use that in place of the data.  You can then use this toy by passing `-D toysFileName.root:toys/toy_asimov`
+    The recommended style is the **LHC-style**. Please note that this method is sensitive to the *observation in data* since the *post-fit* (after a fit to the data) values of the nuisance parameters (assuming different values of **r**) are used when generating the toys. For completely blind limits you can first generate a *pre-fit* asimov toy data set (described in the [toy data generation](runningthetool.md#toy-data-generation) section) and use that in place of the data.  You can use this toy by passing the argument `-D toysFileName.root:toys/toy_asimov`
 
-While the above shortcuts are the common variants, you can also try others. The treatment of the nuisances can be changed to the so-called "Hybrid-Bayesian" method which effectively integrates over the nuisance parameters. This is especially relevant when you have very few expected events in your data and you are using those events to constrain background processes. This can be achieved by setting `--generateNuisances=1 --generateExternalMeasurements=0`. You might also want to avoid first fitting to the data to choose the nominal values in this case, which can be done by also setting `--fitNuisances=0`. 
+While the above shortcuts are the commonly used versions, variations can be tested. The treatment of the nuisances can be changed to the so-called "Hybrid-Bayesian" method, which effectively integrates over the nuisance parameters. This is especially relevant when you have very few expected events in your data, and you are using those events to constrain background processes. This can be achieved by setting `--generateNuisances=1 --generateExternalMeasurements=0`. In case you want to avoid first fitting to the data to choose the nominal values you can additionally pass `--fitNuisances=0`. 
 
 !!! warning
-    If you have unconstrained parameters in your model (`rateParam` or if using a `_norm` variable for a pdf) and you want to use the "Hybrid-Bayesian" method, you **must** declare these as `flatParam` in your datacard and when running text2workspace you must add the option `--X-assign-flatParam-prior` in the command line. This will create uniform priors for these parameters, which is needed for this method and which otherwise would not get created.   
+    If you have unconstrained parameters in your model (`rateParam`, or if you are using a `_norm` variable for a PDF) and you want to use the "Hybrid-Bayesian" method, you **must** declare these as `flatParam` in your datacard. When running text2workspace you must add the option `--X-assign-flatParam-prior` in the command line. This will create uniform priors for these parameters. These are needed for this method and they would otherwise not get created.   
 
 !!! info
-    Note that (observed and toy) values of the test statistic stored in the instances of `RooStats::HypoTestResult` when the option `--saveHybridResult` has been specified, are defined without the factor 2 and therefore are twice as small as the values given by the formulas above. This factor is however included automatically by all plotting script supplied within the Combine package.
+    Note that (observed and expected) values of the test statistic stored in the instances of `RooStats::HypoTestResult` when the option `--saveHybridResult` is passed are defined without the factor 2. They are therefore twice as small as the values given by the formulas above. This factor is however included automatically by all plotting scripts supplied within the Combine package. If you use your own plotting scripts, you need to make sure to incorporate the factor 2. 
 
 ### Simple models
 
-For relatively simple models, the observed and expected limits can be calculated interactively. Since the **LHC-style** is the reccomended procedure for calculating limits using toys, we will use that in this section but the same applies to the other methods.
+For relatively simple models, the observed and expected limits can be calculated interactively. Since the **LHC-style** is the recommended set of options for calculating limits using toys, we will use that in this section. However, the same procedure can be followed with the other sets of options.
 
 ```sh
 combine realistic-counting-experiment.txt -M HybridNew --LHCmode LHC-limits
@@ -530,75 +530,75 @@ Failed to delete temporary file roostats-Sprxsw.root: No such file or directory
 
/// -The result stored in the **limit** branch of the output tree will be the upper limit (and its error stored in **limitErr**). The default behavior will be, as above, to search for the upper limit on **r** however, the values of $p_{\mu}, p_{b}$ and CLs can be calculated for a particular value **r=X** by specifying the option `--singlePoint=X`. In this case, the value stored in the branch **limit** will be the value of CLs (or $p_{\mu}$) (see the [FAQ](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part4/usefullinks/#faq) section). +The result stored in the **limit** branch of the output tree will be the upper limit (and its error, stored in **limitErr**). The default behaviour will be, as above, to search for the upper limit on **r**. However, the values of $p_{\mu}, p_{b}$ and CLs can be calculated for a particular value **r=X** by specifying the option `--singlePoint=X`. In this case, the value stored in the branch **limit** will be the value of CLs (or $p_{\mu}$) (see the [FAQ](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part4/usefullinks/#faq) section). #### Expected Limits -For the simple models, we can just run interactively 5 times to compute the median expected and the 68% and 95% interval boundaries. Use the `HybridNew` method with the same options as per the observed limit but adding a `--expectedFromGrid=` where the quantile is 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, 0.025 for the -ve side of the 95% band. +For simple models, we can run interactively 5 times to compute the median expected and the 68% and 95% central interval boundaries. For this, we can use the `HybridNew` method with the same options as for the observed limit, but adding a `--expectedFromGrid=`. Here, the quantile should be set to 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, and 0.025 for the -ve side of the 95% band. -The output file will contain the value of the quantile in the branch **quantileExpected** which can be used to separate the points. +The output file will contain the value of the quantile in the branch **quantileExpected**. This branch can therefore be used to separate the points. #### Accuracy -The search for the limit is performed using an adaptive algorithm, terminating when the estimate of the limit value is below some limit or when the precision cannot be futher improved with the specified options. The options controlling this behaviour are: +The search for the limit is performed using an adaptive algorithm, terminating when the estimate of the limit value is below some limit or when the precision cannot be improved further with the specified options. The options controlling this behaviour are: - `rAbsAcc`, `rRelAcc`: define the accuracy on the limit at which the search stops. The default values are 0.1 and 0.05 respectively, meaning that the search is stopped when Δr < 0.1 or Δr/r < 0.05. -- `clsAcc`: this determines the absolute accuracy up to which the CLs values are computed when searching for the limit. The default is 0.5%. Raising the accuracy above this value will increase significantly the time to run the algorithm, as you need N2 more toys to improve the accuracy by a factor N, you can consider enlarging this value if you're computing limits with a larger CL (e.g. 90% or 68%). 
Note that if you're using the `CLsplusb` rule then this parameter will control the uncertainty on $p_{\mu}$ rather than CLs. -- `T` or `toysH`: controls the minimum number of toys that are generated for each point. The default value of 500 should be ok when computing the limit with 90-95% CL. You can decrease this number if you're computing limits at 68% CL, or increase it if you're using 99% CL. +- `clsAcc`: this determines the absolute accuracy up to which the CLs values are computed when searching for the limit. The default is 0.5%. Raising the accuracy above this value will significantly increase the time needed to run the algorithm, as you need N2 more toys to improve the accuracy by a factor N. You can consider increasing this value if you are computing limits with a larger CL (e.g. 90% or 68%). Note that if you are using the `CLsplusb` rule, this parameter will control the uncertainty on $p_{\mu}$ rather than CLs. +- `T` or `toysH`: controls the minimum number of toys that are generated for each point. The default value of 500 should be sufficient when computing the limit at 90-95% CL. You can decrease this number if you are computing limits at 68% CL, or increase it if you are using 99% CL. -Note, to further improve the accuracy when searching for the upper limit, combine will also fit an exponential function to several of the points and interpolate to find the crossing. +Note, to further improve the accuracy when searching for the upper limit, Combine will also fit an exponential function to several of the points and interpolate to find the crossing. ### Complex models -For complicated models, it is best to produce a *grid* of test statistic distributions at various values of the signal strength, and use it to compute the observed and expected limit and bands. This approach is good for complex models since the grid of points can be distributed across any number of jobs. In this approach we will store the distributions of the test-statistic at different values of the signal strength using the option `--saveHybridResult`. The distribution at a single value of **r=X** can be determined by +For complicated models, it is best to produce a *grid* of test statistic distributions at various values of the signal strength, and use it to compute the observed and expected limit and central intervals. This approach is convenient for complex models, since the grid of points can be distributed across any number of jobs. In this approach we will store the distributions of the test statistic at different values of the signal strength using the option `--saveHybridResult`. The distribution at a single value of **r=X** can be determined by ```sh combine datacard.txt -M HybridNew --LHCmode LHC-limits --singlePoint X --saveToys --saveHybridResult -T 500 --clsAcc 0 ``` !!! warning - We have specified the accuracy here by including `clsAcc=0` which turns off adaptive sampling and specifying the number of toys to be 500 with the `-T N` option. For complex models, it may be necessary to split the toys internally over a number of instances of `HybridNew` using the option `--iterations I`. The **total** number of toys will be the product **I*N**. + We have specified the accuracy here by including `--clsAcc=0`, which turns off adaptive sampling, and specifying the number of toys to be 500 with the `-T N` option. For complex models, it may be necessary to internally split the toys over a number of instances of `HybridNew` using the option `--iterations I`. 
The **total** number of toys will be the product **I*N**.

-The above can be repeated several times, in parallel, to build the distribution of the test-statistic (giving the random seed option `-s -1`). Once all of the distributions are finished, the resulting output files can be merged into one using **hadd** and read back to calculate the limit, specifying the merged file with `--grid=merged.root`.
+The above can be repeated several times, in parallel, to build the distribution of the test statistic (passing the random seed option `-s -1`). Once all of the distributions have been calculated, the resulting output files can be merged into one using **hadd**, and read back to calculate the limit, specifying the merged file with `--grid=merged.root`.

The observed limit can be obtained with

```sh
-combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --grid=merged.root
+combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --toysFile=merged.root
```

and similarly, the median expected and quantiles can be determined using

```sh
-combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --grid=merged.root --expectedFromGrid
+combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --toysFile=merged.root --expectedFromGrid=<quantile>
```

-substituting `` with 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, 0.025 for the -ve side of the 95% band.
+substituting `<quantile>` with 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, and 0.025 for the -ve side of the 95% band.

-The splitting of the jobs can be left to the user's preference. However, users may wish to use the **combineTool** for automating this as described in the section on [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission)
+The splitting of the jobs can be left to the user's preference. However, users may wish to use the **combineTool** for automating this, as described in the section on [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission).

#### Plotting

-A plot of the CLs (or $p_{\mu}$) as a function of **r**, which is used to find the crossing, can be produced using the option `--plot=limit_scan.png`. This can be useful for judging if the grid was sufficient in determining the upper limit.
+A plot of the CLs (or $p_{\mu}$) as a function of **r**, which is used to find the crossing, can be produced using the option `--plot=limit_scan.png`. This can be useful for judging if the chosen grid was sufficient for determining the upper limit.

If we use our [realistic-counting-experiment.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/counting/realistic-counting-experiment.txt) datacard and generate a grid of points $r\varepsilon[1.4,2.2]$ in steps of 0.1, with 5000 toys for each point, the plot of the observed CLs vs **r** should look like the following,

![](images/limit_scan.png)

-You should judge in each case if the limit is accurate given the spacing of the points and the precision of CLs at each point. If it is not sufficient, simply generate more points closer to the limit and/or more toys at each point.
+You should judge in each case whether the limit is accurate given the spacing of the points and the precision of CLs at each point. If it is not sufficient, simply generate more points closer to the limit and/or more toys at each point. -The distributions of the test-statistic can also be plotted, at each value in the grid, using the simple python tool, +The distributions of the test statistic can also be plotted, at each value in the grid, using ```sh python test/plotTestStatCLs.py --input mygrid.root --poi r --val all --mass MASS ``` -The resulting output file will contain a canvas showing the distribution of the test statistic background only and signal+background hypothesis at each value of **r**. Use `--help` to see more options for this script. +The resulting output file will contain a canvas showing the distribution of the test statistics for the background only and signal+background hypotheses at each value of **r**. Use `--help` to see more options for this script. !!! info - If you used the TEV or LEP style test statistic (using the commands as described above), then you should include the option `--doublesided`, which will also take care of defining the correct integrals for $p_{\mu}$ and $p_{b}$. Click on the examples below to see what a typical output of this plotting tool will look like when using the LHC test statistic, or TEV test statistic. + If you used the TEV or LEP style test statistic (using the commands as described above), then you should include the option `--doublesided`, which will also take care of defining the correct integrals for $p_{\mu}$ and $p_{b}$. Click on the examples below to see what a typical output of this plotting tool will look like when using the LHC test statistic, or the TEV test statistic. /// details | **qLHC test stat example** @@ -615,7 +615,7 @@ The resulting output file will contain a canvas showing the distribution of the ## Computing Significances with toys -Computation of expected significance with toys is a two step procedure: first you need to run one or more jobs to construct the expected distribution of the test statistic. As with setting limits, there are a number of different configurations for generating toys but we will use the preferred option using, +Computation of the expected significance with toys is a two-step procedure: first you need to run one or more jobs to construct the expected distribution of the test statistic. As for setting limits, there are a number of different possible configurations for generating toys. However, we will use the most commonly used option, * **LHC-style**: `--LHCmode LHC-significance` , which is the shortcut for `--testStat LHC --generateNuisances=0 --generateExternalMeasurements=1 --fitNuisances=1 --significance` @@ -625,13 +625,13 @@ Computation of expected significance with toys is a two step procedure: first yo ### Observed significance -To construct the distribution of the test statistic run as many times as necessary, +To construct the distribution of the test statistic, the following command should be run as many times as necessary ```sh combine -M HybridNew datacard.txt --LHCmode LHC-significance --saveToys --fullBToys --saveHybridResult -T toys -i iterations -s seed ``` -with different seeds, or using `-s -1` for random seeds, then merge all those results into a single root file with `hadd`. +with different seeds, or using `-s -1` for random seeds, then merge all those results into a single ROOT file with `hadd`. 
The *observed* significance can be calculated as

@@ -643,83 +643,83 @@ where the option `--pvalue` will replace the result stored in the **limit** bran

### Expected significance, assuming some signal

-The *expected* significance, assuming a signal with **r=X** can be calculated, by including the option `--expectSignal X` when generating the distribution of the test statistic and using the option `--expectedFromGrid=0.5` when calculating the significance for the median. To get the ±1σ bands, use 0.16 and 0.84 instead of 0.5, and so on...
+The *expected* significance, assuming a signal with **r=X**, can be calculated by including the option `--expectSignal X` when generating the distribution of the test statistic and using the option `--expectedFromGrid=0.5` when calculating the significance for the median. To get the ±1σ bands, use 0.16 and 0.84 instead of 0.5, and so on.

-You need a total number of background toys large enough to compute the value of the significance, but you need less signal toys (especially if you only need the median). For large significance, you can then run most of the toys without the `--fullBToys` option (about a factor 2 faster), and only a smaller part with that option turned on.
+The total number of background toys needs to be large enough to compute the value of the significance, but you need fewer signal toys (especially when you are only computing the median expected significance). For large significances, you can run most of the toys without the `--fullBToys` option, which will be about a factor 2 faster. Only a small part of the toys needs to be run with that option turned on.

-As with calculating limits with toys, these jobs can be submitted to the grid or batch systems with the help of the `combineTool` as described in the section on [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission)
+As with calculating limits with toys, these jobs can be submitted to the grid or batch systems with the help of the `combineTool`, as described in the section on [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission)

## Goodness of fit tests

-The `GoodnessOfFit` method can be used to evaluate how compatible the observed data are with the model pdf.
+The `GoodnessOfFit` method can be used to evaluate how compatible the observed data are with the model PDF.

-The module can be run specifying an algorithm, and will compute a goodness of fit indicator for that algorithm and the data. The procedure is therefore to first run on the real data
+This method implements several algorithms, and will compute a goodness-of-fit indicator for the chosen algorithm and the data. The procedure is therefore to first run on the real data

```sh
combine -M GoodnessOfFit datacard.txt --algo=<algorithm>
```

-and then to run on many toy mc datasets to determine the distribution of the goodness of fit indicator
+and then to run on many toy MC data sets to determine the distribution of the goodness-of-fit indicator

```sh
combine -M GoodnessOfFit datacard.txt --algo=<algorithm> -t <number-of-toys> -s <seed>
```

-When computing the goodness of fit, by default the signal strength is left floating in the fit, so that the measure is independent from the presence or absence of a signal. It is possible to instead keep it fixed to some value by passing the option `--fixedSignalStrength=`.
+When computing the goodness-of-fit, by default the signal strength is left floating in the fit, so that the measure is independent of the presence or absence of a signal. It is possible to fix the signal strength to some value by passing the option `--fixedSignalStrength=<value>`.

-The following algorithms are supported:
+The following algorithms are implemented:

-- **`saturated`**: Compute a goodness-of-fit measure for binned fits based on the *saturated model* method, as prescribed by the StatisticsCommittee [(note)](http://www.physics.ucla.edu/~cousins/stats/cousins_saturated.pdf). This quantity is similar to a chi-square, but can be computed for an arbitrary combination of binned channels with arbitrary constraints.
+- **`saturated`**: Compute a goodness-of-fit measure for binned fits based on the *saturated model*, as prescribed by the Statistics Committee [(note)](http://www.physics.ucla.edu/~cousins/stats/cousins_saturated.pdf). This quantity is similar to a chi-square, but can be computed for an arbitrary combination of binned channels with arbitrary constraints.

-- **`KS`**: Compute a goodness-of-fit measure for binned fits using the *Kolmogorov-Smirnov* test. It is based on the highest difference between the cumulative distribution function and the empirical distribution function of any bin.
+- **`KS`**: Compute a goodness-of-fit measure for binned fits using the *Kolmogorov-Smirnov* test. It is based on the largest difference between the cumulative distribution function and the empirical distribution function of any bin.

- **`AD`**: Compute a goodness-of-fit measure for binned fits using the *Anderson-Darling* test. It is based on the integral of the difference between the cumulative distribution function and the empirical distribution function over all bins. It also gives the tail ends of the distribution a higher weighting.

-The output tree will contain a branch called **`limit`** which contains the value of the test-statistic in each toy. You can make a histogram of this test-statistic $t$ and from this distribution ($f(t)$) and the single value obtained in the data ($t_{0}$) you can calculate the p-value $$p = \int_{t=t_{0}}^{\mathrm{+inf}} f(t) dt $$. Note: in rare cases the test statistic value for the toys can be undefined (for AS and KD), and in this case we set the test statistic value to -1. When plotting the test statistic distribution, those toys should be excluded. This is automatically taken care of if you use the GoF collection script in CombineHarvester described below.
+The output tree will contain a branch called **`limit`**, which contains the value of the test statistic in each toy. You can make a histogram of this test statistic $t$. From the distribution that is obtained in this way ($f(t)$) and the single value obtained by running on the observed data ($t_{0}$) you can calculate the p-value $$p = \int_{t=t_{0}}^{\mathrm{+inf}} f(t) dt $$. Note: in rare cases the test statistic value for the toys can be undefined (for AD and KS). In this case we set the test statistic value to -1. When plotting the test statistic distribution, those toys should be excluded. This is automatically taken care of if you use the GoF collection script in CombineHarvester, which is described below.

-When generating toys, the default behavior will be used. See the section on [toy generation](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#toy-data-generation) for options on how to generate/fit nuisance parameters in these tests.
It is recomended to use the *frequentist toys* (`--toysFreq`) when running the **`saturated`** model, and the default toys for the other two tests. +When generating toys, the default behavior will be used. See the section on [toy generation](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#toy-data-generation) for options that control how nuisance parameters are generated and fitted in these tests. It is recommended to use *frequentist toys* (`--toysFreq`) when running the **`saturated`** model, and the default toys for the other two tests. -Further goodness of fit methods could be added on request, especially if volunteers are available to code them. -The output limit tree will contain the value of the test-statistic in each toy (or the data) +Further goodness-of-fit methods could be added on request, especially if volunteers are available to code them. +The output limit tree will contain the value of the test statistic in each toy (or the data) !!! warning The above algorithms are all concerned with *one-sample* tests. For *two-sample* tests, you can follow an example CMS HIN analysis described [in this Twiki](https://twiki.cern.ch/twiki/bin/viewauth/CMS/HiggsCombineTwoDatasetCompatibility) ### Masking analysis regions in the saturated model -For searches that employs a simultaneous fit across signal and control regions, it may be useful to mask one or more analysis regions either when the likelihood is maximized (fit) or when the test-statistic is computed. This can be done by using the options `--setParametersForFit` and `--setParametersForEval`, respectively. The former will set parameters *before* each fit while the latter is used to set parameters *after* each fit, but before the NLL is evauated. Note of course that if the parameter in the list is floating, it will be still floating in each fit so will not effect the results when using `--setParametersForFit`. +For analyses that employ a simultaneous fit across signal and control regions, it may be useful to mask one or more analysis regions, either when the likelihood is maximized (fit) or when the test statistic is computed. This can be done by using the options `--setParametersForFit` and `--setParametersForEval`, respectively. The former will set parameters *before* each fit, while the latter is used to set parameters *after* each fit, but before the NLL is evaluated. Note, of course, that if the parameter in the list is floating, it will still be floating in each fit. Therefore, it will not affect the results when using `--setParametersForFit`. -A realistic example for a binned shape analysis performed in one signal region and two control samples can be found in this directory of the Higgs-combine package [Datacards-shape-analysis-multiple-regions](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/81x-root606/data/tutorials/rate_params). +A realistic example for a binned shape analysis performed in one signal region and two control samples can be found in this directory of the Combine package [Datacards-shape-analysis-multiple-regions](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/81x-root606/data/tutorials/rate_params). 
-First of all, one needs to combine the individual datacards to build a single model and to introduce the channel-masking variables as follow:
+First of all, one needs to combine the individual datacards to build a single model, and to introduce the channel masking variables as follows:

```sh
combineCards.py signal_region.txt dimuon_control_region.txt singlemuon_control_region.txt > combined_card.txt
text2workspace.py combined_card.txt --channel-masks
```

-More information about the channel-masking can be found in this
-section [Channel Masking](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/nonstandard/#channel-masking). The saturated test-static value for a simultaneous fit across all the analysis regions can be calculated as:
+More information about the channel masking can be found in this
+section [Channel Masking](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/nonstandard/#channel-masking). The saturated test statistic value for a simultaneous fit across all the analysis regions can be calculated as:

```sh
combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_sb
```

-In this case, signal and control regions are included in both the fit and in the evaluation of the test-static, and the signal strength is freely floating. This measures the compatibility between the signal+background fit and the observed data. Moreover, it can be interesting to assess the level of compatibility between the observed data in all the regions and the background prediction obtained by only fitting the control regions (CR-only fit). This is computed as follow:
+In this case, signal and control regions are included in both the fit and in the evaluation of the test statistic, and the signal strength is freely floating. This measures the compatibility between the signal+background fit and the observed data. Moreover, it can be interesting to assess the level of compatibility between the observed data in all the regions and the background prediction obtained by only fitting the control regions (CR-only fit). This can be evaluated as follows:

```sh
combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_bonly_CRonly --setParametersForFit mask_ch1=1 --setParametersForEval mask_ch1=0 --freezeParameters r --setParameters r=0
```

-where the signal strength is frozen and the signal region is not considered in the fit (`--setParametersForFit mask_ch1=1`), but it is included in the test-statistic computation (`--setParametersForEval mask_ch1=0`). To show the differences between the two models being tested, one can perform a fit to the data using the FitDiagnostics method as:
+where the signal strength is frozen and the signal region is not considered in the fit (`--setParametersForFit mask_ch1=1`), but it is included in the test statistic computation (`--setParametersForEval mask_ch1=0`).
To show the differences between the two models being tested, one can perform a fit to the data using the FitDiagnostics method as: ```sh combine -M FitDiagnostics -d combined_card.root -n _fit_result --saveShapes --saveWithUncertainties combine -M FitDiagnostics -d combined_card.root -n _fit_CRonly_result --saveShapes --saveWithUncertainties --setParameters mask_ch1=1 ``` -By taking the total background, the total signal, and the data shapes from FitDiagnostics output, we can compare the post-fit predictions from the S+B fit (first case) and the CR-only fit (second case) with the observation as reported below: +By taking the total background, the total signal, and the data shapes from the FitDiagnostics output, we can compare the post-fit predictions from the S+B fit (first case) and the CR-only fit (second case) with the observation as reported below: /// details | **FitDiagnostics S+B fit** @@ -733,7 +733,7 @@ By taking the total background, the total signal, and the data shapes from FitDi /// -To compute a p-value for the two results, one needs to compare the observed goodness-of-fit value previously computed with expected distribution of the test-statistic obtained in toys: +To compute a p-value for the two results, one needs to compare the observed goodness-of-fit value previously computed with the expected distribution of the test statistic obtained in toys: ```sh combine -M GoodnessOfFit combined_card.root --algo=saturated -n result_toy_sb --toysFrequentist -t 500 @@ -752,9 +752,9 @@ where the former gives the result for the S+B model, while the latter gives the ![](images/gof_CRonly.png) /// -### Making a plot of the GoF test-statistic distribution +### Making a plot of the GoF test statistic distribution -If you have also checked out the [combineTool](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool), you can use this to run batch jobs or on the grid (see [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission)) and produce a plot of the results. Once you have the jobs, you can hadd them together and run (e.g for the saturated model), +If you have also checked out the [combineTool](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool), you can use this to run batch jobs or on the grid (see [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission)) and produce a plot of the results. Once the jobs have completed, you can hadd them together and run (e.g for the saturated model), ```sh combineTool.py -M CollectGoodnessOfFit --input data_run.root toys_run.root -m 125.0 -o gof.json @@ -763,18 +763,18 @@ plotGof.py gof.json --statistic saturated --mass 125.0 -o gof_plot --title-right ## Channel Compatibility -The `ChannelCompatibilityCheck` method can be used to evaluate how compatible are the measurements of the signal strength from the separate channels of a combination. +The `ChannelCompatibilityCheck` method can be used to evaluate how compatible the measurements of the signal strength from the separate channels of a combination are with each other. -The method performs two fits of the data, first with the nominal model in which all channels are assumed to have the *same signal strength multiplier* $r$, and then another allowing *separate signal strengths* $r_{i}$ in each channel. A chisquare-like quantity is computed as $-2 \ln \mathcal{L}(\mathrm{data}| r)/L(\mathrm{data}|\{r_{i}\}_{i=1}^{N_{\mathrm{chan}}})$. 
Just like for the goodness of fit indicators, the expected distribution of this quantity under the nominal model can be computed from toy mc. +The method performs two fits of the data, first with the nominal model in which all channels are assumed to have the *same signal strength modifier* $r$, and then another allowing *separate signal strengths* $r_{i}$ in each channel. A chisquare-like quantity is computed as $-2 \ln \mathcal{L}(\mathrm{data}| r)/L(\mathrm{data}|\{r_{i}\}_{i=1}^{N_{\mathrm{chan}}})$. Just like for the goodness-of-fit indicators, the expected distribution of this quantity under the nominal model can be computed from toy MC data sets. By default, the signal strength is kept floating in the fit with the nominal model. It can however be fixed to a given value by passing the option `--fixedSignalStrength=`. -In the default models build from the datacards the signal strengths in all channels are constrained to be non-negative. One can allow negative signal strengths in the fits by changing the bound on the variable (option `--rMin=`), which should make the quantity more chisquare-like under the hypothesis of zero signal; this however can create issues in channels with small backgrounds, since total expected yields and pdfs in each channel must be positive. +In the default model built from the datacards, the signal strengths in all channels are constrained to be non-negative. One can allow negative signal strengths in the fits by changing the bound on the variable (option `--rMin=`), which should make the quantity more chisquare-like under the hypothesis of zero signal; this however can create issues in channels with small backgrounds, since total expected yields and PDFs in each channel must be positive. -When run with the a verbosity of 1, as the default, the program also prints out the best fit signal strengths in all channels; as the fit to all channels is done simultaneously, the correlation between the other systematical uncertainties is taken into account, and so these results can differ from the ones obtained fitting each channel separately. +When run with a verbosity of 1, as is the default, the program also prints out the best fit signal strengths in all channels. As the fit to all channels is done simultaneously, the correlation between the other systematic uncertainties is taken into account. Therefore, these results can differ from the ones obtained when fitting each channel separately. -Below is an example output from combine, +Below is an example output from Combine, ```nohighlight $ combine -M ChannelCompatibilityCheck comb_hww.txt -m 160 -n HWW @@ -797,9 +797,9 @@ Chi2-like compatibility variable: 2.16098 Done in 0.08 min (cpu), 0.08 min (real) ``` -The output tree will contain the value of the compatibility (chisquare variable) in the **limit** branch. If the option `--saveFitResult` is specified, the output root file contains also two [RooFitResult](http://root.cern.ch/root/htmldoc/RooFitResult.html) objects **fit_nominal** and **fit_alternate** with the results of the two fits. +The output tree will contain the value of the compatibility (chi-square variable) in the **limit** branch. If the option `--saveFitResult` is specified, the output ROOT file also contains two [RooFitResult](http://root.cern.ch/root/htmldoc/RooFitResult.html) objects **fit_nominal** and **fit_alternate** with the results of the two fits. 
-This can be read and used to extract the best fit for each channel and the overall best fit using +This can be read and used to extract the best fit value for each channel, and the overall best fit value, using ```c++ $ root -l @@ -808,13 +808,13 @@ fit_alternate->floatParsFinal().selectByName("*ChannelCompatibilityCheck*")->Pri fit_nominal->floatParsFinal().selectByName("r")->Print("v"); ``` -The macro [cccPlot.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/plotting/cccPlot.cxx) can be used to produce a comparison plot of the best fit signals from all channels. +The macro [cccPlot.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/plotting/cccPlot.cxx) can be used to produce a comparison plot of the best fit signal strengths from all channels. ## Likelihood Fits and Scans -The `MultiDimFit` method can do multi-dimensional fits and likelihood based scans/contours using models with several parameters of interest. +The `MultiDimFit` method can be used to perform multi-dimensional fits and likelihood-based scans/contours using models with several parameters of interest. -Taking a toy datacard [data/tutorials/multiDim/toy-hgg-125.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/multiDim/toy-hgg-125.txt) (counting experiment which vaguely resembles the H→γγ analysis at 125 GeV), we need to convert the datacard into a workspace with 2 parameters, ggH and qqH cross sections +Taking a toy datacard [data/tutorials/multiDim/toy-hgg-125.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/multiDim/toy-hgg-125.txt) (counting experiment which vaguely resembles an early H→γγ analysis at 125 GeV), we need to convert the datacard into a workspace with 2 parameters, the ggH and qqH cross sections: ```sh text2workspace.py toy-hgg-125.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingXSHiggs --PO modes=ggH,qqH @@ -822,29 +822,29 @@ text2workspace.py toy-hgg-125.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsM A number of different algorithms can be used with the option `--algo `, -- **`none`** (default): Perform a maximum likelihood fit `combine -M MultiDimFit toy-hgg-125.root`; The output root tree will contain two columns, one for each parameter, with the fitted values. +- **`none`** (default): Perform a maximum likelihood fit `combine -M MultiDimFit toy-hgg-125.root`; The output ROOT tree will contain two columns, one for each parameter, with the fitted values. -- **`singles`**: Perform a fit of each parameter separately, treating the others as *unconstrained nuisances*: `combine -M MultiDimFit toy-hgg-125.root --algo singles --cl=0.68` . The output root tree will contain two columns, one for each parameter, with the fitted values; there will be one row with the best fit point (and `quantileExpected` set to -1) and two rows for each fitted parameter, where the corresponding column will contain the maximum and minimum of that parameter in the 68% CL interval, according to a *one-dimensional chisquare* (i.e. uncertainties on each fitted parameter *do not* increase when adding other parameters if they're uncorrelated). Note that if you run, for example, with `--cminDefaultMinimizerStrategy=0`, these uncertainties will be derived from the Hessian, while `--cminDefaultMinimizerStrategy=1` will invoke Minos to derive them. 
+- **`singles`**: Perform a fit of each parameter separately, treating the other parameters of interest as *unconstrained nuisance parameters*: `combine -M MultiDimFit toy-hgg-125.root --algo singles --cl=0.68`. The output ROOT tree will contain two columns, one for each parameter, with the fitted values; there will be one row with the best fit point (and `quantileExpected` set to -1) and two rows for each fitted parameter, where the corresponding column will contain the maximum and minimum of that parameter in the 68% CL interval, according to a *one-dimensional chi-square* (i.e. uncertainties on each fitted parameter *do not* increase when adding other parameters if they are uncorrelated). Note that if you run, for example, with `--cminDefaultMinimizerStrategy=0`, these uncertainties will be derived from the Hessian, while `--cminDefaultMinimizerStrategy=1` will invoke Minos to derive them.

-- **`cross`**: Perform joint fit of all parameters: `combine -M MultiDimFit toy-hgg-125.root --algo=cross --cl=0.68`. The output root tree will have one row with the best fit point, and two rows for each parameter, corresponding to the minimum and maximum of that parameter on the likelihood contour corresponding to the specified CL, according to a *N-dimensional chisquare* (i.e. uncertainties on each fitted parameter *do* increase when adding other parameters, even if they're uncorrelated). Note that the output of this way of running *are not* 1D uncertainties on each parameter, and shouldn't be taken as such.
+- **`cross`**: Perform a joint fit of all parameters: `combine -M MultiDimFit toy-hgg-125.root --algo=cross --cl=0.68`. The output ROOT tree will have one row with the best fit point, and two rows for each parameter, corresponding to the minimum and maximum of that parameter on the likelihood contour corresponding to the specified CL, according to an *N-dimensional chi-square* (i.e. the uncertainties on each fitted parameter *do* increase when adding other parameters, even if they are uncorrelated). Note that this method *does not* produce 1D uncertainties on each parameter, and its outputs should not be interpreted as such.

-- **`contour2d`**: Make a 68% CL contour a la minos `combine -M MultiDimFit toy-hgg-125.root --algo contour2d --points=20 --cl=0.68`. The output will contain values corresponding to the best fit point (with `quantileExpected` set to -1) and for a set of points on the contour (with `quantileExpected` set to 1-CL, or something larger than that if the contour is hitting the boundary of the parameters). Probabilities are computed from the the n-dimensional $\chi^{2}$ distribution. For slow models, you can split it up by running several times with *different* number of points and merge the outputs (something better can be implemented). You can look at the [contourPlot.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/multiDim/contourPlot.cxx) macro for how to make plots out of this algorithm.
+- **`contour2d`**: Make a 68% CL contour à la minos `combine -M MultiDimFit toy-hgg-125.root --algo contour2d --points=20 --cl=0.68`. The output will contain values corresponding to the best fit point (with `quantileExpected` set to -1) and for a set of points on the contour (with `quantileExpected` set to 1-CL, or something larger than that if the contour hits the boundary of the parameters). Probabilities are computed from the n-dimensional $\chi^{2}$ distribution.
For slow models, this method can be split by running several times with a *different* number of points, and merging the outputs. The [contourPlot.cxx](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/multiDim/contourPlot.cxx) macro can be used to make plots out of this algorithm. -- **`random`**: Scan N random points and compute the probability out of the profile likelihood `combine -M MultiDimFit toy-hgg-125.root --algo random --points=20 --cl=0.68`. Again, best fit will have `quantileExpected` set to -1, while each random point will have `quantileExpected` set to the probability given by the profile likelihood at that point. +- **`random`**: Scan N random points and compute the probability out of the profile likelihood ratio `combine -M MultiDimFit toy-hgg-125.root --algo random --points=20 --cl=0.68`. Again, the best fit will have `quantileExpected` set to -1, while each random point will have `quantileExpected` set to the probability given by the profile likelihood ratio at that point. - **`fixed`**: Compare the log-likelihood at a fixed point compared to the best fit. `combine -M MultiDimFit toy-hgg-125.root --algo fixed --fixedPointPOIs r=r_fixed,MH=MH_fixed`. The output tree will contain the difference in the negative log-likelihood between the points ($\hat{r},\hat{m}_{H}$) and ($\hat{r}_{fixed},\hat{m}_{H,fixed}$) in the branch `deltaNLL`. -- **`grid`**: Scan on a fixed grid of points not with approximately N points in total. `combine -M MultiDimFit toy-hgg-125.root --algo grid --points=10000`. - * You can partition the job in multiple tasks by using options `--firstPoint` and `--lastPoint`, for complicated scans, the points can be split as described in the [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission) section. The output file will contain a column `deltaNLL` with the difference in negative log likelihood with respect to the best fit point. Ranges/contours can be evaluated by filling TGraphs or TH2 histograms with these points. - * By default the "min" and "max" of the POI ranges are *not* included and the points which are in the scan are *centered* , eg `combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 5` will scan at the points $r=0.5, 1.5, 2.5, 3.5, 4.5$. You can instead include the option `--alignEdges 1` which causes the points to be aligned with the endpoints of the parameter ranges - eg `combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 6 --alignEdges 1` will now scan at the points $r=0, 1, 2, 3, 4, 5$. NB - the number of points must be increased by 1 to ensure both end points are included. +- **`grid`**: Scan a fixed grid of points with approximately N points in total. `combine -M MultiDimFit toy-hgg-125.root --algo grid --points=10000`. + * You can partition the job in multiple tasks by using the options `--firstPoint` and `--lastPoint`. For complicated scans, the points can be split as described in the [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission) section. The output file will contain a column `deltaNLL` with the difference in negative log-likelihood with respect to the best fit point. Ranges/contours can be evaluated by filling TGraphs or TH2 histograms with these points. 
+ * By default the "min" and "max" of the POI ranges are *not* included and the points that are in the scan are *centred* , eg `combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 5` will scan at the points $r=0.5, 1.5, 2.5, 3.5, 4.5$. You can include the option `--alignEdges 1`, which causes the points to be aligned with the end-points of the parameter ranges - e.g. `combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 6 --alignEdges 1` will scan at the points $r=0, 1, 2, 3, 4, 5$. Note - the number of points must be increased by 1 to ensure both end points are included. -With the algorithms `none` and `singles` you can save the RooFitResult from the initial fit using the option `--saveFitResult`. The fit result is saved into a new file called `muiltidimfit.root`. +With the algorithms `none` and `singles` you can save the RooFitResult from the initial fit using the option `--saveFitResult`. The fit result is saved into a new file called `multidimfit.root`. -As usual, any *floating* nuisance parameters will be *profiled* which can be turned of using the `--freezeParameters` option. +As usual, any *floating* nuisance parameters will be *profiled*. This behaviour can be modified by using the `--freezeParameters` option. -For most of the methods, for lower precision results you can turn off the profiling of the nuisances setting option `--fastScan`, which for complex models speeds up the process by several orders of magnitude. **All** nuisance parameters will be kept fixed at the value corresponding to the best fit point. +For most of the methods, for lower-precision results you can turn off the profiling of the nuisance parameters by using the option `--fastScan`, which for complex models speeds up the process by several orders of magnitude. **All** nuisance parameters will be kept fixed at the value corresponding to the best fit point. -As an example, lets produce the $-2\Delta\ln{\mathcal{L}}$ scan as a function of **`r_ggH`** and **`r_qqH`** from the toy H→γγ datacard, with the nuisance parameters *fixed* to their global best fit values. +As an example, let's produce the $-2\Delta\ln{\mathcal{L}}$ scan as a function of **`r_ggH`** and **`r_qqH`** from the toy H→γγ datacard, with the nuisance parameters *fixed* to their global best fit values. ```sh combine toy-hgg-125.root -M MultiDimFit --algo grid --points 2000 --setParameterRanges r_qqH=0,10:r_ggH=0,4 -m 125 --fastScan @@ -903,28 +903,28 @@ best_fit->SetMarkerSize(3); best_fit->SetMarkerStyle(34); best_fit->Draw("p same ![](images/nll2D.png) -To make the full profiled scan just remove the `--fastScan` option from the combine command. +To make the full profiled scan, just remove the `--fastScan` option from the Combine command. -Similarly, 1D scans can be drawn directly from the tree, however for 1D likelihood scans, there is a python script from the [`CombineHarvester/CombineTools`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool) package [plot1DScan.py](https://github.com/cms-analysis/CombineHarvester/blob/113x/CombineTools/scripts/plot1DScan.py) which can be used to make plots and extract the crossings of the `2*deltaNLL` - e.g the 1σ/2σ boundaries. 
+Similarly, 1D scans can be drawn directly from the tree, however for 1D likelihood scans, there is a python script from the [`CombineHarvester/CombineTools`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool) package [plot1DScan.py](https://github.com/cms-analysis/CombineHarvester/blob/113x/CombineTools/scripts/plot1DScan.py) that can be used to make plots and extract the crossings of the `2*deltaNLL` - e.g the 1σ/2σ boundaries. ### Useful options for likelihood scans A number of common, useful options (especially for computing likelihood scans with the **grid** algo) are, -* `--autoBoundsPOIs arg`: Adjust bounds for the POIs if they end up close to the boundary. This can be a comma separated list of POIs, or "*" to get all of them. +* `--autoBoundsPOIs arg`: Adjust bounds for the POIs if they end up close to the boundary. This can be a comma-separated list of POIs, or "*" to get all of them. * `--autoMaxPOIs arg`: Adjust maxima for the POIs if they end up close to the boundary. Can be a list of POIs, or "*" to get all. * `--autoRange X`: Set to any **X >= 0** to do the scan in the $\hat{p}$ $\pm$ Xσ range, where $\hat{p}$ and σ are the best fit parameter value and uncertainty from the initial fit (so it may be fairly approximate). In case you do not trust the estimate of the error from the initial fit, you can just centre the range on the best fit value by using the option `--centeredRange X` to do the scan in the $\hat{p}$ $\pm$ X range centered on the best fit value. -* `--squareDistPoiStep`: POI step size based on distance from midpoint ( either (max-min)/2 or the best fit if used with `--autoRange` or `--centeredRange` ) rather than linear separation. -* `--skipInitialFit`: Skip the initial fit (saves time if for example a snapshot is loaded from a previous fit) +* `--squareDistPoiStep`: POI step size based on distance from the midpoint ( either (max-min)/2 or the best fit if used with `--autoRange` or `--centeredRange` ) rather than linear separation. +* `--skipInitialFit`: Skip the initial fit (saves time if, for example, a snapshot is loaded from a previous fit) -Below is a comparison in a likelihood scan, with 20 points, as a function of **`r_qqH`** with our `toy-hgg-125.root` workspace with and without some of these options. The options added tell combine to scan more points closer to the minimum (best-fit) than with the default. +Below is a comparison in a likelihood scan, with 20 points, as a function of **`r_qqH`** with our `toy-hgg-125.root` workspace with and without some of these options. The options added tell Combine to scan more points closer to the minimum (best-fit) than with the default. ![](images/r_qqH.png) You may find it useful to use the `--robustFit=1` option to turn on robust (brute-force) for likelihood scans (and other algorithms). You can set the strategy and tolerance when using the `--robustFit` option using the options `--setRobustFitAlgo` (default is `Minuit2,migrad`), `setRobustFitStrategy` (default is 0) and `--setRobustFitTolerance` (default is 0.1). If these options are not set, the defaults (set using `cminDefaultMinimizerX` options) will be used. 
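As an illustrative sketch (using the toy workspace from above; the strategy and tolerance values shown are arbitrary choices, not recommendations), these options can be combined as follows:

```sh
# Robust (brute-force) uncertainty determination for r_qqH, with a custom strategy and tolerance
combine toy-hgg-125.root -M MultiDimFit --algo singles -P r_qqH --floatOtherPOIs 1 --cl=0.68 -m 125 \
    --robustFit=1 --setRobustFitStrategy 1 --setRobustFitTolerance 0.2
```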
-If running `--robustFit=1` with the algo **singles**, you can tune the accuracy of the routine used to find the crossing points of the likelihood using the option `--setCrossingTolerance` (default is set to 0.0001)
+If running `--robustFit=1` with the algo **singles**, you can tune the accuracy of the routine used to find the crossing points of the likelihood using the option `--setCrossingTolerance` (the default is set to 0.0001)

If you suspect your fits/uncertainties are not stable, you may also try to run custom HESSE-style calculation of the covariance matrix. This is enabled by running `MultiDimFit` with the `--robustHesse=1` option. A simple example of how the default behaviour in a simple datacard is given [here](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/issues/498).

@@ -943,23 +943,23 @@ If `--floatOtherPOIs` is set to 0, the other parameters of interest (POIs), whic

- When running with `--algo=singles`, the other floating POIs are treated as unconstrained nuisance parameters.
- When running with `--algo=cross` or `--algo=contour2d`, the other floating POIs are treated as other POIs, and so they increase the number of dimensions of the chi-square.

-As a result, when running with `floatOtherPOIs` set to 1, the uncertainties on each fitted parameters do not depend on what's the selection of POIs passed to MultiDimFit, but only on the number of parameters of the model.
+As a result, when running with `--floatOtherPOIs` set to 1, the uncertainties on each fitted parameter do not depend on the selection of POIs passed to MultiDimFit, but only on the number of parameters of the model.

!!! info
-    Note that `poi` given to the the option `-P` can also be any nuisance parameter. However, by default, the other nuisance parameters are left *floating*, so you do not need to specify that.
+    Note that `poi` given to the option `-P` can also be any nuisance parameter. However, by default, the other nuisance parameters are left *floating*, so in general this does not need to be specified.

-You can save the values of the other parameters of interest in the output tree by adding the option `saveInactivePOI=1`. You can additionally save the post-fit values any nuisance parameter, function or discrete index (RooCategory) defined in the workspace using the following options;
+You can save the values of the other parameters of interest in the output tree by passing the option `--saveInactivePOI=1`. You can additionally save the post-fit values of any nuisance parameter, function, or discrete index (RooCategory) defined in the workspace using the following options;

-- `--saveSpecifiedNuis=arg1,arg2,...` will store the fitted value of any specified *constrained* nuisance parameter. Use `all` to save every constrained nuisance parameter. **Note** that if you want to store the values of `flatParams` (or floating parameters which are not defined in the datacard) or `rateParams`, which are *unconstrained*, you should instead use the generic option `--trackParameters` as described [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#common-command-line-options).
+- `--saveSpecifiedNuis=arg1,arg2,...` will store the fitted value of any specified *constrained* nuisance parameter. Use `all` to save every constrained nuisance parameter.
**Note** that if you want to store the values of `flatParams` (or floating parameters that are not defined in the datacard) or `rateParams`, which are *unconstrained*, you should instead use the generic option `--trackParameters` as described [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#common-command-line-options). - `--saveSpecifiedFunc=arg1,arg2,...` will store the value of any function (eg `RooFormulaVar`) in the model. - `--saveSpecifiedIndex=arg1,arg2,...` will store the index of any `RooCategory` object - eg a `discrete` nuisance. ### Using best fit snapshots -This can be used to save time when performing scans so that the best-fit needs not be redone and can also be used to perform scans with some nuisances frozen to the best-fit values. Sometimes it is useful to scan freezing certain nuisances to their *best-fit* values as opposed to the default values. To do this here is an example, +This can be used to save time when performing scans so that the best fit does not need to be repeated. It can also be used to perform scans with some nuisance parameters frozen to their best-fit values. This can be done as follows, -- Create a workspace workspace for a floating $r,m_{H}$ fit +- Create a workspace for a floating $r,m_{H}$ fit ```sh text2workspace.py hgg_datacard_mva_8TeV_bernsteins.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingHiggsMass --PO higgsMassRange=120,130 -o testmass.root` @@ -971,7 +971,7 @@ text2workspace.py hgg_datacard_mva_8TeV_bernsteins.txt -m 125 -P HiggsAnalysis.C combine -m 123 -M MultiDimFit --saveWorkspace -n teststep1 testmass.root --verbose 9 ``` -Now we can load the best-fit $\hat{r},\hat{m}_{H}$ and fit for $r$ freezing $m_{H}$ and **lumi_8TeV** to the best-fit values, +Now we can load the best fit $\hat{r},\hat{m}_{H}$ and fit for $r$ freezing $m_{H}$ and **lumi_8TeV** to their best-fit values, ```sh combine -m 123 -M MultiDimFit -d higgsCombineteststep1.MultiDimFit.mH123.root -w w --snapshotName "MultiDimFit" -n teststep2 --verbose 9 --freezeParameters MH,lumi_8TeV @@ -980,7 +980,7 @@ combine -m 123 -M MultiDimFit -d higgsCombineteststep1.MultiDimFit.mH123.root -w The Feldman-Cousins (FC) procedure for computing confidence intervals for a generic model is, -- use the profile likelihood as the test-statistic $q(x) = - 2 \ln \mathcal{L}(\mathrm{data}|x,\hat{\theta}_{x})/\mathcal{L}(\mathrm{data}|\hat{x},\hat{\theta})$ where $x$ is a point in the (N-dimensional) parameter space, and $\hat{x}$ is the point corresponding to the best fit. In this test-statistic, the nuisance parameters are profiled, separately both in the numerator and denominator. +- use the profile likelihood ratio as the test statistic, $q(x) = - 2 \ln \mathcal{L}(\mathrm{data}|x,\hat{\theta}_{x})/\mathcal{L}(\mathrm{data}|\hat{x},\hat{\theta})$ where $x$ is a point in the (N-dimensional) parameter space, and $\hat{x}$ is the point corresponding to the best fit. In this test statistic, the nuisance parameters are profiled, both in the numerator and denominator. - for each point $x$: - compute the observed test statistic $q_{\mathrm{obs}}(x)$ - compute the expected distribution of $q(x)$ under the hypothesis of $x$ as the true value. @@ -988,7 +988,7 @@ The Feldman-Cousins (FC) procedure for computing confidence intervals for a gene With a critical value $\alpha$. 
-In `combine`, you can perform this test on each individual point (**param1, param2,...**) = (**value1,value2,...**) by doing, +In Combine, you can perform this test on each individual point (**param1, param2,...**) = (**value1,value2,...**) by doing, ```sh combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --clsAcc 0 --singlePoint param1=value1,param2=value2,param3=value3,... --saveHybridResult [Other options for toys, iterations etc as with limits] @@ -997,7 +997,7 @@ combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --clsAcc 0 --s The point belongs to your confidence region if $p_{x}$ is larger than $\alpha$ (e.g. 0.3173 for a 1σ region, $1-\alpha=0.6827$). !!! warning - You should not use this method without the option `--singlePoint`. Although combine will not complain, the algorithm to find the crossing will only find a single crossing and therefore not find the correct interval. Instead you should calculate the Feldman-Cousins intervals as described above. + You should not use this method without the option `--singlePoint`. Although Combine will not complain, the algorithm to find the crossing will only find a single crossing and therefore not find the correct interval. Instead you should calculate the Feldman-Cousins intervals as described above. ### Physical boundaries @@ -1007,7 +1007,7 @@ Imposing physical boundaries (such as requiring $\mu>0$ for a signal strength) i --setParameterRanges param1=param1_min,param1_max:param2=param2_min,param2_max .... ``` -The boundary is imposed by **restricting the parameter range(s)** to those set by the user, in the fits. Note that this is a trick! The actual fitted value, as one of an ensemble of outcomes, can fall outside of the allowed region, while the boundary should be imposed on the physical parameter. The effect of restricting the parameter value in the fit is such that the test-statistic is modified as follows ; +The boundary is imposed by **restricting the parameter range(s)** to those set by the user, in the fits. Note that this is a trick! The actual fitted value, as one of an ensemble of outcomes, can fall outside of the allowed region, while the boundary should be imposed on the physical parameter. The effect of restricting the parameter value in the fit is such that the test statistic is modified as follows ; $q(x) = - 2 \ln \mathcal{L}(\mathrm{data}|x,\hat{\theta}_{x})/\mathcal{L}(\mathrm{data}|\hat{x},\hat{\theta})$, if $\hat{x}$ in contained in the bounded range @@ -1018,9 +1018,7 @@ $q(x) = - 2 \ln \mathcal{L}(\mathrm{data}|x,\hat{\theta}_{x})/\mathcal{L}(\mathr This can sometimes be an issue as Minuit may not know if has successfully converged when the minimum lies outside of that range. If there is no upper/lower boundary, just set that value to something far from the region of interest. !!! info - One can also imagine imposing the boundaries by first allowing Minuit to find the minimum in the *un-restricted* region and then setting the test-statistic to that above in the case that minimum lies outside the physical boundary. This would avoid potential issues of convergence - If you are interested in implementing this version in combine, please contact the development team. - -As in general for `HybridNew`, you can split the task into multiple tasks (grid and/or batch) and then merge the outputs, as described in the [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission) section. 
+ One can also imagine imposing the boundaries by first allowing Minuit to find the minimum in the *unrestricted* region and then setting the test statistic to that in the case that minimum lies outside the physical boundary. This would avoid potential issues of convergence. If you are interested in implementing this version in Combine, please contact the development team. ### Extracting contours from results files @@ -1029,27 +1027,27 @@ As in general for `HybridNew`, you can split the task into multiple tasks (grid #### Extracting 1D intervals -For *one-dimensional* models only, and if the parameter behaves like a cross-section, the code is somewhat able to do interpolation and determine the values of your parameter on the contour (just like it does for the limits). As with limits, read in the grid of points and extract 1D intervals using, +For *one-dimensional* models only, and if the parameter behaves like a cross section, the code is able to interpolate and determine the values of your parameter on the contour (just like it does for the limits). As with limits, read in the grid of points and extract 1D intervals using, ```sh -combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --readHybridResults --grid=mergedfile.root --cl <1-alpha> +combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --readHybridResults --toysFile=mergedfile.root --cl <1-alpha> ``` -The output tree will contain the values of the POI which crosses the critical value ($\alpha$) - i.e, the boundaries of the confidence intervals, +The output tree will contain the values of the POI that crosses the critical value ($\alpha$) - i.e, the boundaries of the confidence intervals. You can produce a plot of the value of $p_{x}$ vs the parameter of interest $x$ by adding the option `--plot `. #### Extracting 2D contours -There is a tool for extracting *2D contours* from the output of `HybridNew` located in `test/makeFCcontour.py` provided the option `--saveHybridResult` was included when running `HybridNew`. It can be run with the usual combine output files (or several of them) as input, +There is a tool for extracting *2D contours* from the output of `HybridNew` located in `test/makeFCcontour.py`. This can be used provided the option `--saveHybridResult` was included when running `HybridNew`. It can be run with the usual Combine output files (or several of them) as input, ```sh ./test/makeFCcontour.py toysfile1.root toysfile2.root .... [options] -out outputfile.root ``` -To extract 2D contours, the names of each parameter must be given `--xvar poi_x --yvar poi_y`. The output will be a root file containing a 2D histogram of value of $p_{x,y}$ for each point $(x,y)$ which can be used to draw 2D contours. There will also be a histogram containing the number of toys found for each point. +To extract 2D contours, the names of each parameter must be given `--xvar poi_x --yvar poi_y`. The output will be a ROOT file containing a 2D histogram of value of $p_{x,y}$ for each point $(x,y)$ which can be used to draw 2D contours. There will also be a histogram containing the number of toys found for each point. -There are several options for reducing the running time (such as setting limits on the region of interest or the minimum number of toys required for a point to be included) Finally, adding the option `--storeToys` in this python tool will add histograms for each point to the output file of the test-statistic distribution. 
This will increase the momory usage however as all of the toys will be stored in memory. +There are several options for reducing the running time, such as setting limits on the region of interest or the minimum number of toys required for a point to be included. Finally, adding the option `--storeToys` in this script will add histograms for each point to the output file of the test statistic distribution. This will increase the memory usage, as all of the toys will be kept in memory. diff --git a/docs/part3/debugging.md b/docs/part3/debugging.md index 92927a318ed..cecb0178fd2 100644 --- a/docs/part3/debugging.md +++ b/docs/part3/debugging.md @@ -1,6 +1,6 @@ # Debugging fits -When a fit fails there are several things you can do to investigate. Have a look at [these slides](https://indico.cern.ch/event/976099/contributions/4138476/attachments/2163625/3651175/CombineTutorial-2020-debugging.pdf) from the combine tutorial. +When a fit fails there are several things you can do to investigate. CMS users can have a look at [these slides](https://indico.cern.ch/event/976099/contributions/4138476/attachments/2163625/3651175/CombineTutorial-2020-debugging.pdf) from a previous Combine tutorial. This section contains a few pointers for some of the methods mentioned in the slides. ## Analyzing the NLL shape in each parameter diff --git a/docs/part3/nonstandard.md b/docs/part3/nonstandard.md index 347f7f805a3..fcc66ebe89e 100644 --- a/docs/part3/nonstandard.md +++ b/docs/part3/nonstandard.md @@ -1,18 +1,18 @@ # Advanced Use Cases -This section will cover some of the more specific use cases for combine which are not necessarily related to the main statistics results. +This section will cover some of the more specific use cases for Combine that are not necessarily related to the main results of the analysis. -## Fitting Diagnostics +## Fit Diagnostics -You may first want to look at the HIG PAG standard checks applied to all datacards if you want to diagnose your limit setting/fitting results which can be found [here](https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWG/HiggsPAGPreapprovalChecks) +If you want to diagnose your limits/fit results, you may first want to look at the HIG PAG standard checks, which are applied to all datacards and can be found [here](https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWG/HiggsPAGPreapprovalChecks). -If you have already found the higgs boson but it's an exotic one, instead of computing a limit or significance you might want to extract it's cross section by performing a maximum-likelihood fit. Or, more seriously, you might want to use this same package to extract the cross section of some other process (e.g. the di-boson production). Or you might want to know how well the data compares to you model, e.g. how strongly it constraints your other nuisance parameters, what's their correlation, etc. These general diagnostic tools are contained in the method `FitDiagnostics`. +If you have already found the Higgs boson but it's an exotic one, instead of computing a limit or significance you might want to extract its cross section by performing a maximum-likelihood fit. Alternatively, you might want to know how compatible your data and your model are, e.g. how strongly your nuisance parameters are constrained, to what extent they are correlated, etc. These general diagnostic tools are contained in the method `FitDiagnostics`. 
``` combine -M FitDiagnostics datacard.txt ``` -The program will print out the result of the *two fits* performed with signal strength **r** (or first POI in the list) set to zero and a second with floating **r**. The output root tree will contain the best fit value for **r** and it's uncertainty. You will also get a `fitDiagnostics.root` file containing the following objects: +The program will print out the result of *two fits*. The first one is performed with the signal strength **r** (or the first POI in the list, in models with multiple POIs) set to zero and a second with floating **r**. The output ROOT tree will contain the best fit value for **r** and its uncertainty. You will also get a `fitDiagnostics.root` file containing the following objects: | Object | Description | |------------------------|---------------------------------------------------------------------------------------------------------------------------------------| @@ -23,22 +23,22 @@ The program will print out the result of the *two fits* performed with signal st | **`tree_fit_sb`** | `TTree` of fitted nuisance parameter values and constraint terms (_In) with floating signal strength | | **`tree_fit_b`** | `TTree` of fitted nuisance parameter values and constraint terms (_In) with signal strength set to 0 | -by including the option `--plots`, you will additionally find the following contained in the root file . +by including the option `--plots`, you will additionally find the following contained in the ROOT file: | Object | Description | |------------------------|---------------------------------------------------------------------------------------------------------------------------------------| | **`covariance_fit_s`** | `TH2D` Covariance matrix of the parameters in the fit with floating signal strength | | **`covariance_fit_b`** | `TH2D` Covariance matrix of the parameters in the fit with signal strength set to zero | -| **`category_variable_prefit`** | `RooPlot` plot of the prefit pdfs with the data (or toy if running with `-t` overlaid) | -| **`category_variable_fit_b`** | `RooPlot` plot of the pdfs from the background only fit with the data (or toy if running with `-t` overlaid) | -| **`category_variable_fit_s`** | `RooPlot` plot of the pdfs from the signal+background fit with the data (or toy if running with `-t` overlaid) | +| **`category_variable_prefit`** | `RooPlot` plot of the pre-fit PDFs/templates with the data (or toy if running with `-t`) overlaid | +| **`category_variable_fit_b`** | `RooPlot` plot of the PDFs/templates from the background only fit with the data (or toy if running with `-t`) overlaid | +| **`category_variable_fit_s`** | `RooPlot` plot of the PDFs/templates from the signal+background fit with the data (or toy if running with `-t`) overlaid | -where for the `RooPlot` objects, you will get one per category in the likelihood and one per variable if using a multi-dimensional dataset. You will also get a png file for each of these additional objects. +There will be one `RooPlot` object per category in the likelihood, and one per variable if using a multi-dimensional dataset. For each of these additional objects a png file will also be produced. !!! info - If you use the option `--name` this name will be inserted into the file name for this output file too. + If you use the option `--name`, this additional name will be inserted into the file name for this output file. 
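As a hypothetical example (the datacard name and the `_hww` label are placeholders), enabling the additional plots and tagging the output files might look like:

```sh
# Run the two fits, produce the pre-fit and post-fit RooPlot objects, and append a label to the output files
combine -M FitDiagnostics datacard.txt --plots --name _hww
# The diagnostics should then be written to fitDiagnostics_hww.root
```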
-As well as values of the constrained nuisance parameters (and their constraint values) in the toys, you will also find branches for the number of "bad" nll calls (which you should check is not too large) and the status of the fit `fit_status`. The fit status is computed as follows +As well as the values of the constrained nuisance parameters (and their constraints), you will also find branches for the number of "bad" nll calls (which you should check is not too large) and the status of the fit `fit_status`. The fit status is computed as follows ``` fit_status = 100 * hesse_status + 10 * minos_status + minuit_summary_status @@ -46,21 +46,21 @@ fit_status = 100 * hesse_status + 10 * minos_status + minuit_summary_status The `minuit_summary_status` is the usual status from Minuit, details of which can be found [here](https://root.cern.ch/root/htmldoc/ROOT__Minuit2__Minuit2Minimizer.html#ROOT__Minuit2__Minuit2Minimizer:Minimize). For the other status values, check these documentation links for the [`hesse_status`](https://root.cern.ch/root/htmldoc/ROOT__Minuit2__Minuit2Minimizer.html#ROOT__Minuit2__Minuit2Minimizer:Hesse) and the [`minos_status`](https://root.cern.ch/root/htmldoc/ROOT__Minuit2__Minuit2Minimizer.html#ROOT__Minuit2__Minuit2Minimizer:GetMinosError). -A fit status of -1 indicates that the fit failed (Minuit summary was not 0 or 1) and hence the fit is **not** valid. +A fit status of -1 indicates that the fit failed (Minuit summary was not 0 or 1) and hence the fit result is **not** valid. ### Fit options -- If you need only the signal+background fit, you can run with `--justFit`. This can be useful if the background-only fit is not interesting or not converging (e.g. if the significance of your signal is very very large) -- You can use `--rMin` and `--rMax` to set the range of the first POI; a range that is not too large compared to the uncertainties you expect from the fit usually gives more stable and accurate results. -- By default, the uncertainties are computed using MINOS for the first POI and HESSE for all other parameters (and hence they will be symmetric for the nuisance parameters). You can run MINOS for *all* parameters using the option `--minos all`, or for *none* of the parameters using `--minos none`. Note that running MINOS is slower so you should only consider using it if you think the HESSE uncertainties are not accurate. -- If MINOS or HESSE fails to converge, you can try running with `--robustFit=1` that will do a slower but more robust likelihood scan; this can be further controlled by the parameter `--stepSize` (the default is 0.1, and is relative to the range of the parameter) -- You can set the strategy and tolerance when using the `--robustFit` option using the options `setRobustFitAlgo` (default is `Minuit2,migrad`), `setRobustFitStrategy` (default is 0) and `--setRobustFitTolerance` (default is 0.1). If these options are not set, the defaults (set using `cminDefaultMinimizerX` options) will be used. You can also tune the accuracy of the routine used to find the crossing points of the likelihood using the option `--setCrossingTolerance` (default is set to 0.0001) -- If you find the covariance matrix provided by HESSE is not accurate (i.e. `fit_s->Print()` reports this was forced positive-definite) then a custom HESSE-style calculation of the covariance matrix can be used instead. This is enabled by running FitDiagnostics with the `--robustHesse 1` option. 
Please note that the status reported by `RooFitResult::Print()` will contain `covariance matrix quality: Unknown, matrix was externally provided` when robustHesse is used, this is normal and does not indicate a problem. NB: one feature of the robustHesse algorithm is that if it still cannot calculate a positive-definite covariance matrix it will try to do so by dropping parameters from the hessian matrix before inverting. If this happens it will be reported in the output to the screen. +- If you only want to run the signal+background fit, and do not need the output file, you can run with `--justFit`. In case you would like to run only the signal+background fit but would like to produce the output file, you should use the option `--skipBOnlyFit` instead. +- You can use `--rMin` and `--rMax` to set the range of the first POI; a range that is not too large compared with the uncertainties you expect from the fit usually gives more stable and accurate results. +- By default, the uncertainties are computed using MINOS for the first POI and HESSE for all other parameters. For the nuisance parameters the uncertainties will therefore be symmetric. You can run MINOS for *all* parameters using the option `--minos all`, or for *none* of the parameters using `--minos none`. Note that running MINOS is slower so you should only consider using it if you think the HESSE uncertainties are not accurate. +- If MINOS or HESSE fails to converge, you can try running with `--robustFit=1`. This will do a slower, but more robust, likelihood scan, which can be further controlled with the parameter `--stepSize` (the default value is 0.1, and is relative to the range of the parameter). +- The strategy and tolerance when using the `--robustFit` option can be set using the options `--setRobustFitAlgo` (default is `Minuit2,migrad`), `--setRobustFitStrategy` (default is 0) and `--setRobustFitTolerance` (default is 0.1). If these options are not set, the defaults (set using `cminDefaultMinimizerX` options) will be used. You can also tune the accuracy of the routine used to find the crossing points of the likelihood using the option `--setCrossingTolerance` (the default is set to 0.0001). +- If you find the covariance matrix provided by HESSE is not accurate (i.e. `fit_s->Print()` reports this was forced positive-definite) then a custom HESSE-style calculation of the covariance matrix can be used instead. This is enabled by running `FitDiagnostics` with the `--robustHesse 1` option. Please note that the status reported by `RooFitResult::Print()` will contain `covariance matrix quality: Unknown, matrix was externally provided` when robustHesse is used; this is normal and does not indicate a problem. NB: one feature of the robustHesse algorithm is that if it still cannot calculate a positive-definite covariance matrix it will try to do so by dropping parameters from the Hessian matrix before inverting. If this happens it will be reported in the output to the screen. - For other fitting options see the [generic minimizer options](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/wiki/runningthetool#generic-minimizer-options) section. ### Fit parameter uncertainties -If you get a warning message when running `FitDiagnostics` which says `Unable to determine uncertainties on all fit parameters`. This means the covariance matrix calculated in FitDiagnostics was not correct. +You may see a warning message when running `FitDiagnostics` that says `Unable to determine uncertainties on all fit parameters`.
This means the covariance matrix calculated in `FitDiagnostics` was not correct. The most common problem is that the covariance matrix is forced positive-definite. In this case the constraints on fit parameters as taken from the covariance matrix are incorrect and should not be used. In particular, if you want to make post-fit plots of the distribution used in the signal extraction fit and are extracting the uncertainties on the signal and background expectations from the covariance matrix, the resulting values will not reflect the truth if the covariance matrix was incorrect. By default if this happens and you passed the `--saveWithUncertainties` flag when calling `FitDiagnostics`, this option will be ignored as calculating the uncertainties would lead to incorrect results. This behaviour can be overridden by passing `--ignoreCovWarning`. @@ -73,103 +73,103 @@ A discontinuity in the NLL function or its derivatives at or near the minimum. If you are aware that your analysis has any of these features you could try resolving these. Setting `--cminDefaultMinimizerStrategy 0` can also help with this problem. -### Pre and post fit nuisance parameters and pulls +### Pre- and post-fit nuisance parameters -It is possible to compare pre-fit and post-fit nuisance parameters with the script [diffNuisances.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/diffNuisances.py). Taking as input a `fitDiagnostics.root` file, the script will by default print out the parameters which have changed significantly w.r.t. their initial estimate. For each of those parameters, it will print out the shift in value and the post-fit uncertainty, both normalized to the input values, and the linear correlation between the parameter and the signal strength. +It is possible to compare pre-fit and post-fit nuisance parameter values with the script [diffNuisances.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/diffNuisances.py). Taking as input a `fitDiagnostics.root` file, the script will by default print out the parameters that have changed significantly with respect to their initial estimate. For each of those parameters, it will print out the shift in value and the post-fit uncertainty, both normalized to the initial (pre-fit) value. The linear correlation between the parameter and the signal strength will also be printed. python diffNuisances.py fitDiagnostics.root -The script has several options to toggle the thresholds used to decide if a parameter has changed significantly, to get the printout of the absolute value of the nuisance parameters, and to get the output in another format for easy cut-n-paste (supported formats are `html`, `latex`, `twiki`). To print *all* of the parameters, use the option `--all`. +The script has several options to toggle the thresholds used to decide whether a parameter has changed significantly, to get the printout of the absolute value of the nuisance parameters, and to get the output in another format for use on a webpage or in a note (the supported formats are `html`, `latex`, `twiki`). To print *all* of the parameters, use the option `--all`. -The output by default will be the changes in the nuisance parameter values and uncertainties, relative to their initial (pre-fit) values (usually relative to initial values of 0 and 1 for most nuisance types). 
+By default, the changes in the nuisance parameter values and uncertainties are given relative to their initial (pre-fit) values (usually relative to initial values of 0 and 1 for most nuisance types). -The values in the output will be $(\theta-\theta_{I})/\sigma_{I}$ if the nuisance has a pre-fit uncertainty, otherwise it will be $\theta-\theta_{I}$ if not (eg, a `flatParam` has no pre-fit uncertainty). +The values in the output will be $(\theta-\theta_{I})/\sigma_{I}$ if the nuisance has a pre-fit uncertainty, otherwise they will be $\theta-\theta_{I}$ (for example, a `flatParam` has no pre-fit uncertainty). -The uncertainty reported will be the ratio $\sigma/\sigma_{I}$ - i.e the ratio of the post-fit to the pre-fit uncertainty. If there is no pre-fit uncertainty (as for `flatParam` nuisances) then the post-fit uncertainty is shown. +The reported uncertainty will be the ratio $\sigma/\sigma_{I}$ - i.e the ratio of the post-fit to the pre-fit uncertainty. If there is no pre-fit uncertainty (as for `flatParam` nuisances), the post-fit uncertainty is shown. -With the option `--abs`, instead the pre-fit and post-fit values and (asymmetric) uncertainties will be reported in full. +To print the pre-fit and post-fit values and (asymmetric) uncertainties, rather than the ratios, the option `--abs` can be used. !!! info - We recommend you include the options `--abs` and `--all` to get the full information on all of the parameters (including unconstrained nuisance parameters) at least once when checking your datacards. + We recommend that you include the options `--abs` and `--all` to get the full information on all of the parameters (including unconstrained nuisance parameters) at least once when checking your datacards. -If instead of the plain values, you wish to report the _pulls_, you can do so with the option `--pullDef X` with `X` being one of the following options; You should note that since the pulls below are only defined when the pre-fit uncertainty exists, *nothing* will be reported for parameters which have no prior constraint (except in the case of the `unconstPullAsym` choice as described below). You may want to run without this option and `--all` to get information on those parameters. +If instead of the nuisance parameter values, you wish to report the _pulls_, you can do so using the option `--pullDef X`, with `X` being one of the options listed below. You should note that since the pulls below are only defined when the pre-fit uncertainty exists, *nothing* will be reported for parameters that have no prior constraint (except in the case of the `unconstPullAsym` choice as described below). You may want to run without this option and `--all` to get information about those parameters. -- `relDiffAsymErrs`: This is the same as the default output of the tool except that only constrained parameters (pre-fit uncertainty defined) are reported. The error is also reported and calculated as $\sigma/\sigma_{I}$. +- `relDiffAsymErrs`: This is the same as the default output of the tool, except that only constrained parameters (i.e. where the pre-fit uncertainty is defined) are reported. The uncertainty is also reported and calculated as $\sigma/\sigma_{I}$. -- `unconstPullAsym`: Report the pull as $\frac{\theta-\theta_{I}}{\sigma}$ where $\theta_{I}$ and $\sigma$ are the initial value and **post-fit** uncertainty of that nuisance parameter. The pull defined in this way will have no error bar, but *all* nuisance parameters will have a result in this case. 
+- `unconstPullAsym`: Report the pull as $\frac{\theta-\theta_{I}}{\sigma}$, where $\theta_{I}$ and $\sigma$ are the initial value and **post-fit** uncertainty of that nuisance parameter. The pull defined in this way will have no error bar, but *all* nuisance parameters will have a result in this case. -- `compatAsym`: The pull is defined as $\frac{\theta-\theta_{D}}{\sqrt{\sigma^{2}+\sigma_{D}^{2}}}$, where $\theta_{D}$ and $\sigma_{D}$ are calculated as $\sigma_{D} = (\frac{1}{\sigma^{2}} - \frac{1}{\sigma_{I}^{2}})^{-1}$ and $\theta_{D} = \sigma_{D}(\theta - \frac{\theta_{I}}{\sigma_{I}^{2}})$, where $\theta_{I}$ and $\sigma_{I}$ are the initial value and uncertainty of that nuisance parameter. This can be thought of as a _compatibility_ between the initial measurement (prior) an imagined measurement where only the data (with no constraint) is used to measure the nuisance parameter. There is no error bar associated to this value. +- `compatAsym`: The pull is defined as $\frac{\theta-\theta_{D}}{\sqrt{\sigma^{2}+\sigma_{D}^{2}}}$, where $\theta_{D}$ and $\sigma_{D}$ are calculated as $\sigma_{D} = (\frac{1}{\sigma^{2}} - \frac{1}{\sigma_{I}^{2}})^{-1}$ and $\theta_{D} = \sigma_{D}(\theta - \frac{\theta_{I}}{\sigma_{I}^{2}})$. In this expression $\theta_{I}$ and $\sigma_{I}$ are the initial value and uncertainty of that nuisance parameter. This can be thought of as a _compatibility_ between the initial measurement (prior) and an imagined measurement where only the data (with no constraint on the nuisance parameter) is used to measure the nuisance parameter. There is no error bar associated with this value. - `diffPullAsym`: The pull is defined as $\frac{\theta-\theta_{I}}{\sqrt{\sigma_{I}^{2}-\sigma^{2}}}$, where $\theta_{I}$ and $\sigma_{I}$ are the pre-fit value and uncertainty (from [L. Demortier and L. Lyons](http://physics.rockefeller.edu/luc/technical_reports/cdf5776_pulls.pdf)). If the denominator is close to 0 or the post-fit uncertainty is larger than the pre-fit (usually due to some failure in the calculation), the pull is not defined and the result will be reported as `0 +/- 999`. -If using `--pullDef`, the results for *all* parameters for which the pull can be calculated will be shown (i.e `--all` will be set to `true`), not just those which have moved by some metric. +If using `--pullDef`, the results for *all* parameters for which the pull can be calculated will be shown (i.e `--all` will be set to `true`), not just those that have moved by some metric. -This script has the option (`-g outputfile.root`) to produce plots of the fitted _values_ of the nuisance parameters and their post-fit, asymmetric uncertainties. Instead, the pulls defined using one of the options above, can be plotted using the option `--pullDef X`. In addition this will produce a plot showing directly a comparison of the post-fit to pre-fit nuisance (symmetrized) uncertainties. +This script has the option (`-g outputfile.root`) to produce plots of the fitted _values_ of the nuisance parameters and their post-fit, asymmetric uncertainties. Instead, the pulls defined using one of the options above, can be plotted using the option `--pullDef X`. In addition this will produce a plot showing a comparison between the post-fit and pre-fit (symmetrized) uncertainties on the nuisance parameters. !!! 
info - In the above options, if an asymmetric uncertainty is associated to the nuisance parameter, then the choice of which uncertainty is used in the definition of the pull will depend on the sign of $\theta-\theta_{I}$. + In the above options, if an asymmetric uncertainty is associated with the nuisance parameter, then the choice of which uncertainty is used in the definition of the pull will depend on the sign of $\theta-\theta_{I}$. ### Normalizations -For a certain class of models, like those made from datacards for shape-based analysis, the tool can also compute and save to the output root file the best fit yields of all processes. If this feature is turned on with the option `--saveNormalizations`, the file will also contain three RooArgSet `norm_prefit`, `norm_fit_s`, `norm_fit_b` objects each containing one RooConstVar for each channel `xxx` and process `yyy` with name **`xxx/yyy`** and value equal to the best fit yield. You can use `RooRealVar::getVal` and `RooRealVar::getError` to estimate both the post-(or pre-)fit values and uncertainties of these normalisations. +For a certain class of models, like those made from datacards for shape-based analysis, the tool can also compute and save the best fit yields of all processes to the output ROOT file. If this feature is turned on with the option `--saveNormalizations`, the file will also contain three `RooArgSet` objects `norm_prefit`, `norm_fit_s`, and `norm_fit_b`. These each contain one `RooConstVar` for each channel `xxx` and process `yyy` with name **`xxx/yyy`** and value equal to the best fit yield. You can use `RooRealVar::getVal` and `RooRealVar::getError` to estimate both the post-fit (or pre-fit) values and uncertainties of these normalizations. -The sample pyroot macro [mlfitNormsToText.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/mlfitNormsToText.py) can be used to convert the root file into a text table with four columns: channel, process, yield from the signal+background fit and yield from the background-only fit. To include the uncertainties in the table, add the option `--uncertainties` +The sample `pyROOT` macro [mlfitNormsToText.py](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/test/mlfitNormsToText.py) can be used to convert the ROOT file into a text table with four columns: channel, process, yield from the signal+background fit, and yield from the background-only fit. To include the uncertainties in the table, add the option `--uncertainties`. !!! warning - Note that when running with multiple toys, the `norm_fit_s`, `norm_fit_b` and `norm_prefit` objects will be stored for the _last_ toy dataset generated and so may not be useful to you. + Note that when running with multiple toys, the `norm_fit_s`, `norm_fit_b`, and `norm_prefit` objects will be stored for the _last_ toy dataset generated and so may not be useful to you. -Note that this procedure works only for "extended likelihoods" like the ones used in shape-based analysis, not for the cut-and-count datacards. You can however convert a cut-and-count datacard in an equivalent shape-based one by adding a line `shapes * * FAKE` in the datacard after the `imax`, `jmax`, `kmax` or using `combineCards.py countingcard.txt -S > shapecard.txt`. +Note that this procedure works only for "extended likelihoods" like the ones used in shape-based analysis, not for counting experiment datacards. 
You can however convert a counting experiment datacard to an equivalent shape-based one by adding a line `shapes * * FAKE` in the datacard after the `imax`, `jmax`, `kmax` lines. Alternatively, you can use `combineCards.py countingcard.txt -S > shapecard.txt` to do this conversion. #### Per-bin norms for shape analyses -If you have a shape based analysis, you can also (instead) include the option `--savePredictionsPerToy`. With this option, additional branches will be filled in the three output trees contained in `fitDiagnostics.root`. +If you have a shape-based analysis, you can include the option `--savePredictionsPerToy`. With this option, additional branches will be filled in the three output trees contained in `fitDiagnostics.root`. -The normalisation values for each toy will be stored in the branches inside the `TTrees` named **n\_exp[\_final]\_binxxx\_proc\_yyy**. The **\_final** will only be there if there are systematics affecting this process. +The normalization values for each toy will be stored in the branches inside the `TTrees` named **n\_exp[\_final]\_binxxx\_proc\_yyy**. The **\_final** will only be there if there are systematic uncertainties affecting this process. -Additionally, there will be filled branches which provide the value of the expected **bin** content for each process, in each channel. These will are named as **n\_exp[\_final]\_binxxx\_proc\_yyy_i** (where **\_final** will only be in the name if there are systematics affecting this process) for channel `xxx`, process `yyy` bin number `i`. In the case of the post-fit trees (`tree_fit_s/b`), these will be resulting expectations from the _fitted_ models, while for the pre-fit tree, they will be the expectation from the generated model (i.e if running toys with `-t N` and using `--genNuisances`, they will be randomised for each toy). These can be useful, for example, for calculating correlations/covariances between different bins, in different channels or processes, within the model from toys. +Additionally, there will be branches that provide the value of the expected **bin** content for each process, in each channel. These are named **n\_exp[\_final]\_binxxx\_proc\_yyy_i** (where **\_final** will only be in the name if there are systematic uncertainties affecting this process) for channel `xxx`, process `yyy`, bin number `i`. In the case of the post-fit trees (`tree_fit_s/b`), these will be the expectations from the _fitted_ models, while for the pre-fit tree, they will be the expectation from the generated model (i.e if running toys with `-t N` and using `--genNuisances`, they will be randomized for each toy). These can be useful, for example, for calculating correlations/covariances between different bins, in different channels or processes, within the model from toys. !!! info - Be aware that for *unbinned* models, a binning scheme is adopted based on the `RooRealVar::getBinning` for the observable defining the shape, if it exists, or combine will adopt some appropriate binning for each observable. + Be aware that for *unbinned* models, a binning scheme is adopted based on the `RooRealVar::getBinning` for the observable defining the shape, if it exists, or Combine will adopt some appropriate binning for each observable. ### Plotting -`FitDiagnostics` can also produce pre- and post-fit plots the model in the same directory as `fitDiagnostics.root` along with the data. To get them, you have to specify the option `--plots`, and then *optionally specify* what are the names of the signal and background pdfs, e.g. 
`--signalPdfNames='ggH*,vbfH*'` and `--backgroundPdfNames='*DY*,*WW*,*Top*'` (by default, the definitions of signal and background are taken from the datacard). For models with more than 1 observable, a separate projection onto each observable will be produced. +`FitDiagnostics` can also produce pre- and post-fit plots of the model along with the data. They will be stored in the same directory as `fitDiagnostics.root`. To obtain these, you have to specify the option `--plots`, and then *optionally specify* the names of the signal and background PDFs/templates, e.g. `--signalPdfNames='ggH*,vbfH*'` and `--backgroundPdfNames='*DY*,*WW*,*Top*'` (by default, the definitions of signal and background are taken from the datacard). For models with more than 1 observable, a separate projection onto each observable will be produced. -An alternative is to use the options `--saveShapes`. The result will be additional folders in `fitDiagnostics.root` for each category, with pre and post-fit distributions of the signals and backgrounds as TH1s and the data as TGraphAsymmErrors (with Poisson intervals as error bars). +An alternative is to use the option `--saveShapes`. This will add additional folders in `fitDiagnostics.root` for each category, with pre- and post-fit distributions of the signals and backgrounds as TH1s, and the data as `TGraphAsymmErrors` (with Poisson intervals as error bars). !!! info - If you want to save post-fit shapes at a specific r value, add the options `--customStartingPoint` and `--skipSBFit`, and set the r value. The result will appear in **shapes\_fit\_b**, as described below. + If you want to save post-fit shapes at a specific **r** value, add the options `--customStartingPoint` and `--skipSBFit`, and set the **r** value. The result will appear in **shapes\_fit\_b**, as described below. -Three additional folders (**shapes\_prefit**, **shapes\_fit\_sb** and **shapes\_fit\_b** ) will contain the following distributions, +Three additional folders (**shapes\_prefit**, **shapes\_fit\_sb** and **shapes\_fit\_b** ) will contain the following distributions: | Object | Description | |------------------------|---------------------------------------------------------------------------------------------------------------------------------------| | **`data`** | `TGraphAsymmErrors` containing the observed data (or toy data if using `-t`). The vertical error bars correspond to the 68% interval for a Poisson distribution centered on the observed count. | -| **`$PROCESS`** (id <= 0) | `TH1F` for each signal process in channel, named as in the datacard | -| **`$PROCESS`** (id > 0) | `TH1F` for each background process in channel, named as in the datacard| +| **`$PROCESS`** (id <= 0) | `TH1F` for each signal process in each channel, named as in the datacard | +| **`$PROCESS`** (id > 0) | `TH1F` for each background process in each channel, named as in the datacard| | **`total_signal`** | `TH1F` Sum over the signal components| | **`total_background`** | `TH1F` Sum over the background components| | **`total`** | `TH1F` Sum over all of the signal and background components | -The above distributions are provided *for each channel included in the datacard*, in separate sub-folders, named as in the datacard: There will be one sub-folder per channel. +The above distributions are provided *for each channel included in the datacard*, in separate subfolders, named as in the datacard: There will be one subfolder per channel. !!! 
warning - The pre-fit signal is by default for `r=1` but this can be modified using the option `--preFitValue`. + The pre-fit signal is evaluated for `r=1` by default, but this can be modified using the option `--preFitValue`. -The distributions and normalisations are guaranteed to give the correct interpretation: +The distributions and normalizations are guaranteed to give the correct interpretation: -- For shape datacards whose inputs are TH1, the histograms/data points will have the bin number as the x-axis and the content of each bin will be a number of events. +- For shape datacards whose inputs are `TH1`, the histograms/data points will have the bin number as the x-axis and the content of each bin will be a number of events. -- For datacards whose inputs are RooAbsPdf/RooDataHists, the x-axis will correspond to the observable and the bin content will be the PDF density / events divided by the bin width. This means the absolute number of events in a given bin, i, can be obtained from `h.GetBinContent(i)*h.GetBinWidth(i)` or similar for the data graphs. **Note** that for *unbinned* analyses combine will make a reasonable guess as to an appropriate binning. +- For datacards whose inputs are `RooAbsPdf`/`RooDataHist`s, the x-axis will correspond to the observable and the bin content will be the PDF density / events divided by the bin width. This means the absolute number of events in a given bin, i, can be obtained from `h.GetBinContent(i)*h.GetBinWidth(i)` or similar for the data graphs. **Note** that for *unbinned* analyses Combine will make a reasonable guess as to an appropriate binning. Uncertainties on the shapes will be added with the option `--saveWithUncertainties`. These uncertainties are generated by re-sampling of the fit covariance matrix, thereby accounting for the full correlation between the parameters of the fit. !!! warning - It may be tempting to sum up the uncertainties in each bin (in quadrature) to get the *total* uncertainty on a process however, this is (usually) incorrect as doing so would not account for correlations *between the bins*. Instead you can refer to the uncertainties which will be added to the post-fit normalizations described above. + It may be tempting to sum up the uncertainties in each bin (in quadrature) to get the *total* uncertainty on a process. However, this is (usually) incorrect, as doing so would not account for correlations *between the bins*. Instead you can refer to the uncertainties which will be added to the post-fit normalizations described above. Additionally, the covariance matrix **between** bin yields (or yields/bin-widths) in each channel will also be saved as a `TH2F` named **total_covar**. If the covariance between *all bins* across *all channels* is desired, this can be added using the option `--saveOverallShapes`. Each folder will now contain additional distributions (and covariance matrices) corresponding to the concatenation of the bins in each channel (and therefore the covaraince between every bin in the analysis). The bin labels should make it clear as to which bin corresponds to which channel. @@ -187,7 +187,7 @@ $ root -l plotParametersFromToys("fitDiagnosticsToys.root","fitDiagnosticsData.root","workspace.root","r<0") ``` -The first argument is the name of the output file from running with toys, and the second and third (optional) arguments are the name of the file containing the result from a fit to the data and the workspace (created from `text2workspace.py`). 
The fourth argument can be used to specify a cut string applied to one of the branches in the tree which can be used to correlate strange behaviour with specific conditions. The output will be 2 pdf files (**`tree_fit_(s)b.pdf`**) and 2 root files (**`tree_fit_(s)b.root`**) containing canvases of the fit results of the tool. For details on the output plots, consult [AN-2012/317](http://cms.cern.ch/iCMS/user/noteinfo?cmsnoteid=CMS%20AN-2012/317). +The first argument is the name of the output file from running with toys, and the second and third (optional) arguments are the name of the file containing the result from a fit to the data and the workspace (created from `text2workspace.py`). The fourth argument can be used to specify a cut string applied to one of the branches in the tree, which can be used to correlate strange behaviour with specific conditions. The output will be 2 pdf files (**`tree_fit_(s)b.pdf`**) and 2 ROOT files (**`tree_fit_(s)b.root`**) containing canvases of the fit results of the tool. For details on the output plots, consult [AN-2012/317](http://cms.cern.ch/iCMS/user/noteinfo?cmsnoteid=CMS%20AN-2012/317). ## Scaling constraints @@ -205,17 +205,17 @@ To add a *constant scaling factor* we use the option `--X-rescale-nuisance`, eg text2workspace.py datacard.txt --X-rescale-nuisance '[some regular expression]' 0.5 -will create the workspace in which ever nuisance parameter whose name matches the specified regular expression will have the width of the gaussian constraint scaled by a factor 0.5. +will create the workspace in which every nuisance parameter whose name matches the specified regular expression will have the width of the gaussian constraint scaled by a factor 0.5. Multiple `--X-rescale-nuisance` options can be specified to set different scalings for different nuisances (note that you actually have to write `--X-rescale-nuisance` each time as in `--X-rescale-nuisance 'theory.*' 0.5 --X-rescale-nuisance 'exp.*' 0.1`). -To add a *functional scaling factor* we use the option `--X-nuisance-function`, which works in a similar way. Instead of a constant value you should specify a RooFit factory expression. +To add a *functional scaling factor* we use the option `--X-nuisance-function`, which works in a similar way. Instead of a constant value you should specify a `RooFit` factory expression. -A typical case would be scaling by $1/\sqrt{L}$, where $L$ is a luminosity scale factor eg assuming there is some parameter in the datacard/workspace called **`lumiscale`**, +A typical case would be scaling by $1/\sqrt{L}$, where $L$ is a luminosity scale factor. For example, assuming there is some parameter in the datacard/workspace called **`lumiscale`**, text2workspace.py datacard.txt --X-nuisance-function '[some regular expression]' 'expr::lumisyst("1/sqrt(@0)",lumiscale[1])' -This factory syntax is quite flexible, but for our use case the typical format will be: `expr::[function name]("[formula]", [arg0], [arg1], ...)`. The `arg0`, `arg1` ... are represented in the formula by `@0`, `@1`,... placeholders. +This factory syntax is flexible, but for our use case the typical format will be: `expr::[function name]("[formula]", [arg0], [arg1], ...)`. The `arg0`, `arg1` ... are represented in the formula by `@0`, `@1`,... placeholders. !!! warning We are playing a slight trick here with the `lumiscale` parameter. At the point at which `text2workspace.py` is building these scaling terms the `lumiscale` for the `rateParam` has not yet been created. 
By writing `lumiscale[1]` we are telling RooFit to create this variable with an initial value of 1, and then later this will be re-used by the `rateParam` creation. @@ -231,7 +231,7 @@ The impact of a nuisance parameter (NP) θ on a parameter of interest (POI) μ i This is effectively a measure of the correlation between the NP and the POI, and is useful for determining which NPs have the largest effect on the POI uncertainty. -It is possible to use the `FitDiagnostics` method of combine with the option `--algo impact -P parameter` to calculate the impact of a particular nuisance parameter on the parameter(s) of interest. We will use the `combineTool.py` script to automate the fits (see the [`combineTool`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool) section to check out the tool. +It is possible to use the `FitDiagnostics` method of Combine with the option `--algo impact -P parameter` to calculate the impact of a particular nuisance parameter on the parameter(s) of interest. We will use the `combineTool.py` script to automate the fits (see the [`combineTool`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool) section to check out the tool). We will use an example workspace from the [$H\rightarrow\tau\tau$ datacard](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/data/tutorials/htt/125/htt_tt.txt), @@ -240,7 +240,7 @@ $ cp HiggsAnalysis/CombinedLimit/data/tutorials/htt/125/htt_tt.txt . $ text2workspace.py htt_tt.txt -m 125 ``` -Calculating the impacts is done in a few stages. First we just fit for each POI, using the `--doInitialFit` option with `combineTool.py`, and adding the `--robustFit 1` option that will be passed through to combine, +Calculating the impacts is done in a few stages. First we just fit for each POI, using the `--doInitialFit` option with `combineTool.py`, and adding the `--robustFit 1` option that will be passed through to Combine, combineTool.py -M Impacts -d htt_tt.root -m 125 --doInitialFit --robustFit 1 @@ -250,13 +250,13 @@ Next we perform a similar scan for each nuisance parameter with the `--doFits` o combineTool.py -M Impacts -d htt_tt.root -m 125 --robustFit 1 --doFits -Note that this will run approximately 60 scans, and to speed things up the option `--parallel X` can be given to run X combine jobs simultaneously. The batch and grid submission methods described in the [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission) section can also be used. +Note that this will run approximately 60 scans, and to speed things up the option `--parallel X` can be given to run X Combine jobs simultaneously. The batch and grid submission methods described in the [combineTool for job submission](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#combinetool-for-job-submission) section can also be used. -Once all jobs are completed the output can be collected and written into a json file: +Once all jobs are completed, the output can be collected and written into a json file: combineTool.py -M Impacts -d htt_tt.root -m 125 -o impacts.json -A plot summarising the nuisance parameter values and impacts can be made with `plotImpacts.py`, +A plot summarizing the nuisance parameter values and impacts can be made with `plotImpacts.py`, plotImpacts.py -i impacts.json -o impacts @@ -267,12 +267,12 @@ The first page of the output is shown below. The direction of the +1σ and -1σ impacts (i.e.
when the NP is moved to its +1σ or -1σ values) on the POI indicates whether the parameter is correlated or anti-correlated with it. -For models with multiple POIs, the combine option `--redefineSignalPOIs X,Y,Z...` should be specified in all three of the `combineTool.py -M Impacts [...]` steps above. The final step will produce the `impacts.json` file which will contain the impacts for all the specified POIs. In the `plotImpacts.py` script, a particular POI can be specified with `--POI X`. +For models with multiple POIs, the Combine option `--redefineSignalPOIs X,Y,Z...` should be specified in all three of the `combineTool.py -M Impacts [...]` steps above. The final step will produce the `impacts.json` file which will contain the impacts for all the specified POIs. In the `plotImpacts.py` script, a particular POI can be specified with `--POI X`. !!! warning - The plot also shows the *best fit* value of the POI at the top and its uncertainty. You may wish to allow the range to go -ve (i.e using `--setParameterRanges` or `--rMin`) to avoid getting one-sided impacts! + The plot also shows the *best fit* value of the POI at the top and its uncertainty. You may wish to allow the range to go negative (i.e using `--setParameterRanges` or `--rMin`) to avoid getting one-sided impacts! -This script also accepts an optional json-file argument with `-`t which can be used to provide a dictionary for renaming parameters. A simple example would be to create a file `rename.json`, +This script also accepts an optional json-file argument with `-t`, which can be used to provide a dictionary for renaming parameters. A simple example would be to create a file `rename.json`, ```python { @@ -285,11 +285,11 @@ that will rename the POI label on the plot. !!! info Since `combineTool` accepts the usual options for combine you can also generate the impacts on an Asimov or toy dataset. -The left panel in the summary plot shows the value of $(\theta-\theta_{0})/\Delta_{\theta}$ where $\theta$ and $\theta_{0}$ are the **post** and **pre**-fit values of the nuisance parameter and $\Delta_{\theta}$ is the **pre**-fit uncertainty. The asymmetric error bars show the **post**-fit uncertainty divided by the **pre**-fit uncertainty meaning that parameters with error bars smaller than $\pm 1$ are constrained in the fit. As with the `diffNuisances.py` script, use the option `--pullDef` are defined (eg to show the *pull* instead). +The left panel in the summary plot shows the value of $(\theta-\theta_{0})/\Delta_{\theta}$ where $\theta$ and $\theta_{0}$ are the **post** and **pre**-fit values of the nuisance parameter and $\Delta_{\theta}$ is the **pre**-fit uncertainty. The asymmetric error bars show the **post**-fit uncertainty divided by the **pre**-fit uncertainty meaning that parameters with error bars smaller than $\pm 1$ are constrained in the fit. The pull will additionally be shown. As with the `diffNuisances.py` script, the option `--pullDef` can be used (to modify the definition of the *pull* that is shown). ## Breakdown of uncertainties -Often you will want to report the breakdown of your total (systematic) uncertainty on a measured parameter due to one or more groups of nuisance parameters. For example these groups could be theory uncertainties, trigger uncertainties, ... The prodecude to do this in combine is to sequentially freeze groups of nuisance parameters and subtract (in quadrature) from the total uncertainty. Below are the steps to do so. We will use the `data/tutorials/htt/125/htt_tt.txt` datacard for this. 
+Often you will want to report the breakdown of your total (systematic) uncertainty on a measured parameter due to one or more groups of nuisance parameters. For example, these groups could be theory uncertainties, trigger uncertainties, ... The procedure to do this in Combine is to sequentially freeze groups of nuisance parameters and subtract (in quadrature) from the total uncertainty. Below are the steps to do so. We will use the `data/tutorials/htt/125/htt_tt.txt` datacard for this. 1. Add groups to the datacard to group nuisance parameters. Nuisance parameters not in groups will be considered as "rest" in the later steps. The lines should look like the following and you should add them to the end of the datacard ``` @@ -300,14 +300,14 @@ efficiency group = CMS_eff_b_8TeV CMS_eff_t_tt_8TeV CMS_fake_b_8TeV 2. Create the workspace with `text2workspace.py data/tutorials/htt/125/htt_tt.txt -m 125`. -3. Run a post-fit with all nuisance parameters floating and store the workspace in an output file - `combine data/tutorials/htt/125/htt_tt.root -M MultiDimFit --saveWorkspace -n htt.postfit` +3. Run a fit with all nuisance parameters floating and store the workspace in an output file - `combine data/tutorials/htt/125/htt_tt.root -M MultiDimFit --saveWorkspace -n htt.postfit` 4. Run a scan from the postfit workspace ``` combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit -n htt.total --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4 ``` -5. Run additional scans using the post-fit workspace sequentially adding another group to the list of groups to freeze +5. Run additional scans using the post-fit workspace, sequentially adding another group to the list of groups to freeze ``` combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4 --freezeNuisanceGroups theory -n htt.freeze_theory @@ -316,32 +316,32 @@ combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo gri combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4 --freezeNuisanceGroups theory,calibration,efficiency -n htt.freeze_theory_calibration_efficiency ``` -6. Run one last scan freezing all of the constrained nuisances (this represents the statistics only uncertainty). +6. Run one last scan freezing all of the constrained nuisance parameters (this represents the statistical uncertainty only). ``` combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4 --freezeParameters allConstrainedNuisances -n htt.freeze_all ``` -7. Use the `combineTool` script `plot1D.py` to report the breakdown of uncertainties. +7. Use the `combineTool` script `plot1DScan.py` to report the breakdown of uncertainties. ``` plot1DScan.py higgsCombinehtt.total.MultiDimFit.mH120.root --main-label "Total Uncert."
--others higgsCombinehtt.freeze_theory.MultiDimFit.mH120.root:"freeze theory":4 higgsCombinehtt.freeze_theory_calibration.MultiDimFit.mH120.root:"freeze theory+calibration":7 higgsCombinehtt.freeze_theory_calibration_efficiency.MultiDimFit.mH120.root:"freeze theory+calibration+efficiency":2 higgsCombinehtt.freeze_all.MultiDimFit.mH120.root:"stat only":6 --output breakdown --y-max 10 --y-cut 40 --breakdown "theory,calibration,efficiency,rest,stat" ``` -The final step calculates the contribution of each group of nuisances as the subtraction in quadrature of each scan from the previous one. This procedure guarantees that the sum in quadrature of the individual components is the same as the total uncertainty. +The final step calculates the contribution of each group of nuisance parameters as the subtraction in quadrature of each scan from the previous one. This procedure guarantees that the sum in quadrature of the individual components is the same as the total uncertainty. The plot below is produced, ![](images/breakdown.png) !!! warning - While the above procedure is guaranteed the have the effect that the sum in quadrature of the breakdown will equal the total uncertainty, the order in which you freeze the groups can make a difference due to correlations induced by the fit. You should check if the answers change significantly if changing the order and we reccomend you start with the largest group (in terms of overall contribution to the uncertainty) first and work down the list in order of contribution. + While the above procedure is guaranteed to have the effect that the sum in quadrature of the breakdown will equal the total uncertainty, the order in which you freeze the groups can make a difference due to correlations induced by the fit. You should check whether the answers change significantly when changing the order, and we recommend you start with the largest group (in terms of overall contribution to the uncertainty) first, working down the list in order of the size of the contribution. ## Channel Masking -The `combine` tool has a number of features for diagnostics and plotting results of fits. It can often be useful to turn off particular channels in a combined analysis to see how constraints/pulls can vary. It can also be helpful to plot post-fit shapes + uncertainties of a particular channel (for example a signal region) *without* including the constraints from the data in that region. +The Combine tool has a number of features for diagnostics and plotting results of fits. It can often be useful to turn off particular channels in a combined analysis to see how constraints/shifts in parameter values can vary. It can also be helpful to plot the post-fit shapes and uncertainties of a particular channel (for example a signal region) *without* including the constraints from the data in that region. -This can in some cases be achieved by removing a specific datacard when running `combineCards.py` however, when doing so the information of particular nuisances and pdfs in that region will be lost. Instead, it is possible to ***mask*** that channel from the likelihood! This is acheived at the `text2Workspace` step using the option `--channel-masks`. +This can in some cases be achieved by removing a specific datacard when running `combineCards.py`. However, when doing so, the information about particular nuisance parameters and PDFs in that region will be lost. Instead, it is possible to ***mask*** that channel from the likelihood.
This is achieved at the `text2Workspace` step using the option `--channel-masks`. ### Example: removing constraints from the signal region @@ -350,29 +350,29 @@ We will take the control region example from the rate parameters tutorial from [ The first step is to combine the cards combineCards.py signal=signal_region.txt dimuon=dimuon_control_region.txt singlemuon=singlemuon_control_region.txt > datacard.txt -Note that we use the directive `CHANNELNAME=CHANNEL_DATACARD.txt` so that the names of the channels are under our control and easier to interpret. Next, we make a workspace and tell combine to create the parameters used to *mask channels* +Note that we use the directive `CHANNELNAME=CHANNEL_DATACARD.txt` so that the names of the channels are under our control and easier to interpret. Next, we make a workspace and tell Combine to create the parameters used to *mask channels* text2workspace.py datacard.txt --channel-masks -Now lets try a fit *ignoring* the signal region. We can turn off the signal region by setting the channel mask parameter on: `--setParameters mask_signal=1`. Note that `text2workspace` has created a masking parameter for every channel with the naming scheme **mask_CHANNELNAME**. By default, every parameter is set to 0 so that the channel is unmasked by default. +Now we will try to do a fit *ignoring* the signal region. We can turn off the signal region by setting the corresponding channel mask parameter to 1: `--setParameters mask_signal=1`. Note that `text2workspace` has created a masking parameter for every channel with the naming scheme **mask_CHANNELNAME**. By default, every parameter is set to 0 so that the channel is unmasked by default. combine datacard.root -M FitDiagnostics --saveShapes --saveWithUncertainties --setParameters mask_signal=1 !!! warning - There will be a lot of warning from combine. This is safe to ignore as this is due to the s+b fit not converging since the free signal parameter cannot be constrained as the data in the signal region is being ignored. + There will be a lot of warnings from Combine. These are safe to ignore as they are due to the s+b fit not converging. This is expected as the free signal parameter cannot be constrained because the data in the signal region is being ignored. -We can compare the background post-fit and uncertainties with and without the signal region by re-running with `--setParameters mask_signal=0` (or just removing that command). Below is a comparison of the background in the signal region with and without masking the data in the signal region. We take these from the shapes folder +We can compare the post-fit background and uncertainties with and without the signal region included by re-running with `--setParameters mask_signal=0` (or just removing that option completely). Below is a comparison of the background in the signal region with and without masking the data in the signal region. We take these from the shapes folder **shapes_fit_b/signal/total_background** in the `fitDiagnostics.root` output. ![](images/masking_tutorial.png) -Clearly the background shape is different and much less constrained *without including the signal region*, as expected. Channel masking can be used with *any method* in combine. +Clearly the background shape is different and much less constrained *without including the signal region*, as expected. Channel masking can be used with *any method* in Combine. 
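If you want to compare the two background estimates quantitatively rather than by eye, the saved histograms can be read back with pyROOT. Below is a minimal sketch, assuming the two `FitDiagnostics` outputs have been copied to the hypothetical file names `fitDiagnostics_masked.root` and `fitDiagnostics_unmasked.root` so that they do not overwrite each other:

```python
import ROOT

# Hypothetical copies of the two FitDiagnostics outputs (signal region masked / unmasked)
f_masked = ROOT.TFile.Open("fitDiagnostics_masked.root")
f_unmasked = ROOT.TFile.Open("fitDiagnostics_unmasked.root")

# Post-fit background in the signal region, from the shapes folder described above
h_masked = f_masked.Get("shapes_fit_b/signal/total_background")
h_unmasked = f_unmasked.Get("shapes_fit_b/signal/total_background")

print("background yield, signal region masked:   %.1f" % h_masked.Integral())
print("background yield, signal region unmasked: %.1f" % h_unmasked.Integral())

# Per-bin comparison (bin errors are filled when --saveWithUncertainties is used)
for i in range(1, h_masked.GetNbinsX() + 1):
    print("bin %2d: %8.2f +/- %6.2f (masked)   %8.2f +/- %6.2f (unmasked)" % (
        i, h_masked.GetBinContent(i), h_masked.GetBinError(i),
        h_unmasked.GetBinContent(i), h_unmasked.GetBinError(i)))
```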
## RooMultiPdf conventional bias studies -Several analyses within the Higgs group use a functional form to describe their background which is fit to the data (eg the Higgs to two photons (Hgg) analysis). Often however, there is some uncertainty associated to the choice of which background function to use and this choice will impact results of a fit. It is therefore often the case that in these analyses, a Bias study is performed which will indicate how much potential bias can be present given a certain choice of functional form. These studies can be conducted using combine. +Several analyses in CMS use a functional form to describe the background. This functional form is fit to the data. Often however, there is some uncertainty associated with the choice of which background function to use, and this choice will impact the fit results. It is therefore often the case that in these analyses, a bias study is performed. This study will give an indication of the size of the potential bias in the result, given a certain choice of functional form. These studies can be conducted using Combine. -Below is an example script which will produce a workspace based on a simplified Hgg analysis with a *single* category. It will produce the data and pdfs necessary for this example (use it as a basis to cosntruct your own studies). +Below is an example script that will produce a workspace based on a simplified Higgs to diphoton (Hgg) analysis with a *single* category. It will produce the data and PDFs necessary for this example, and you can use it as a basis to construct your own studies. ```c++ @@ -458,31 +458,31 @@ void makeRooMultiPdfWorkspace(){ } ``` -The signal is modelled as a simple Gaussian with a width approximately that of the diphoton resolution and the background is a choice of 3 functions. An exponential, a power-law and a 2nd order polynomial. This choice is accessible inside combine through the use of the [RooMultiPdf](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/interface/RooMultiPdf.h) object which can switch between the functions by setting its associated index (herein called **pdf_index**). This (as with all parameters in combine) is accessible via the `--setParameters` option. +The signal is modelled as a simple Gaussian with a width approximately that of the diphoton resolution. For the background there is a choice of 3 functions: an exponential, a power-law, and a 2nd order polynomial. This choice is accessible within Combine through the use of the [RooMultiPdf](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/interface/RooMultiPdf.h) object, which can switch between the functions by setting their associated indices (herein called **pdf_index**). This (as with all parameters in Combine) can be set via the `--setParameters` option. -To asses the bias, one can throw toys using one function and fit with another. All of this only needs to use one datacard [hgg_toy_datacard.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/main/data/tutorials/bias_studies/hgg_toy_datacard.txt) +To assess the bias, one can throw toys using one function and fit with another. To do this, only a single datacard is needed: [hgg_toy_datacard.txt](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/main/data/tutorials/bias_studies/hgg_toy_datacard.txt). -The bias studies are performed in two stages. The first is to generate toys using one of the functions under some value of the signal strength **r** (or $\mu$). 
This can be repeated for several values of **r** and also at different masses, but here the Higgs mass is fixed to 125 GeV. +The bias studies are performed in two stages. The first is to generate toys using one of the functions, under some value of the signal strength **r** (or $\mu$). This can be repeated for several values of **r** and also at different masses, but in this example the Higgs boson mass is fixed to 125 GeV. ```bash combine hgg_toy_datacard.txt -M GenerateOnly --setParameters pdf_index=0 --toysFrequentist -t 100 --expectSignal 1 --saveToys -m 125 --freezeParameters pdf_index ``` !!! warning - It is important to freeze `pdf_index` otherwise combine will try to iterate over the index in the frequentist fit. + It is important to freeze `pdf_index`, otherwise Combine will try to iterate over the index in the frequentist fit. -Now we have 100 toys which, by setting `pdf_index=0`, sets the background pdf to the exponential function i.e assumes the exponential is the *true* function. Note that the option `--toysFrequentist` is added. This first performs a fit of the pdf, assuming a signal strength of 1, to the data before generating the toys. This is the most obvious choice as to where to throw the toys from. +Now we have 100 toys that were generated with `pdf_index=0`, which selects the exponential function as the background PDF. This means we assume that the exponential is the *true* function. Note that the option `--toysFrequentist` is added; this first performs a fit of the PDF, assuming a signal strength of 1, to the data before generating the toys. This is the most obvious choice as to where to throw the toys from. -The next step is to fit the toys under a different background pdf hypothesis. This time we set the `pdf_index` to be 1, the powerlaw and run fits with the `FitDiagnostics` method again freezing `pdf_index`. +The next step is to fit the toys under a different background PDF hypothesis. This time we set the `pdf_index` to 1, which selects the power-law, and run fits with the `FitDiagnostics` method, again freezing `pdf_index`. ```bash combine hgg_toy_datacard.txt -M FitDiagnostics --setParameters pdf_index=1 --toysFile higgsCombineTest.GenerateOnly.mH125.123456.root -t 100 --rMin -10 --rMax 10 --freezeParameters pdf_index --cminDefaultMinimizerStrategy=0 ``` -Note how we add the option `--cminDefaultMinimizerStrategy=0`. This is because we don't need the Hessian, as `FitDiagnostics` will run minos to get the uncertainty on `r`. If we don't do this, Minuit will think the fit failed as we have parameters (those not attached to the current pdf) for which the likelihood is flat. +Note how we add the option `--cminDefaultMinimizerStrategy=0`. This is because we do not need the Hessian, as `FitDiagnostics` will run MINOS to get the uncertainty on `r`. If we do not do this, Minuit will think the fit failed as we have parameters (those not attached to the current PDF) for which the likelihood is flat. !!! warning - You may get warnings about non-accurate errors such as `[WARNING]: Unable to determine uncertainties on all fit parameters in b-only fit` - These can be ignored since they are related to the free parameters of the background pdfs which are not active.
-In the output file `fitDiagnostics.root` there is a tree which contains the best fit results under the signal+background hypothesis. One measure of the bias is the *pull* defined as the difference between the measured value of $\mu$ and the generated value (here we used 1) relative to the uncertainty on $\mu$. The pull distribution can be drawn and the mean provides an estimate of the pull. In this example, we are averaging the +ve and -ve errors, but we could do something smarter if the errors are very asymmetric. +In the output file `fitDiagnostics.root` there is a tree that contains the best fit results under the signal+background hypothesis. One measure of the bias is the *pull* defined as the difference between the measured value of $\mu$ and the generated value (here we used 1) relative to the uncertainty on $\mu$. The pull distribution can be drawn and the mean provides an estimate of the pull. In this example, we are averaging the positive and negative uncertainties, but we could do something smarter if the uncertainties are very asymmetric. ```c++ root -l fitDiagnostics.root @@ -493,16 +493,16 @@ h->Fit("gaus") ![](images/biasexample.png) -From the fitted Gaussian, we see the mean is at -1.29 which would indicate a bias of 129% of the uncertainty on mu from choosing the polynomial when the true function is an exponential! +From the fitted Gaussian, we see the mean is at -1.29, which would indicate a bias of 129% of the uncertainty on mu from choosing the polynomial when the true function is an exponential. ### Discrete profiling -If the `discrete` nuisance is left floating, it will be profiled by looping through the possible index values and finding the pdf which gives the best fit. This allows for the [**discrete profiling method**](https://arxiv.org/pdf/1408.6865.pdf) to be applied for any method which involves a profiled likelihood (frequentist methods). +If the `discrete` nuisance is left floating, it will be profiled by looping through the possible index values and finding the PDF that gives the best fit. This allows for the [**discrete profiling method**](https://arxiv.org/pdf/1408.6865.pdf) to be applied for any method which involves a profiled likelihood (frequentist methods). !!! warning - You should be careful since MINOS knows nothing about the discretenuisances and hence estimations of uncertainties will be incorrect via MINOS. Instead, uncertainties from scans and limits will correctly account for these nuisances. Currently the Bayesian methods will *not* properly treat the nuisances so some care should be taken when interpreting Bayesian results. + You should be careful since MINOS knows nothing about the discrete nuisances and hence estimations of uncertainties will be incorrect via MINOS. Instead, uncertainties from scans and limits will correctly account for these nuisance parameters. Currently the Bayesian methods will *not* properly treat the nuisance parameters, so some care should be taken when interpreting Bayesian results. -As an example, we can use peform a likelihood scan as a function of the Higgs signal strength in the toy Hgg datacard. By leaving the object `pdf_index` non-constant, at each point in the likelihood scan, the pdfs will be iterated over and the one which gives the lowest -2 times log-likelihood, including the correction factor $c$ (as defined in the paper) will be stored in the output tree. We can also check the scan fixing to each pdf individually to check that the envelope is acheived. 
For this, you will need to include the option `--X-rtd REMOVE_CONSTANT_ZERO_POINT=1`. In this way, we can take a look at the absolute value to compare the curves, if we also include `--saveNLL`.
+As an example, we can perform a likelihood scan as a function of the Higgs boson signal strength in the toy Hgg datacard. By leaving the object `pdf_index` non-constant, at each point in the likelihood scan, the PDFs will be iterated over and the one that gives the lowest -2 times log-likelihood, including the correction factor $c$ (as defined in the paper linked above) will be stored in the output tree. We can also check the scan when we fix at each PDF individually to check that the envelope is achieved. For this, you will need to include the option `--X-rtd REMOVE_CONSTANT_ZERO_POINT=1`. In this way, we can take a look at the absolute value to compare the curves, if we also include `--saveNLL`.

For example for a full scan, you can run

```

@@ -518,26 +518,26 @@ and for the individual `pdf_index` set to `X`, for `X=0,1,2`

-You can then plot the value of `2*(deltaNLL+nll+nll0)` to plot the absolute value of (twice) the negative log-likelihood, including the correction term for extra parameters in the different pdfs.
+You can then plot the value of `2*(deltaNLL+nll+nll0)` to plot the absolute value of (twice) the negative log-likelihood, including the correction term for extra parameters in the different PDFs.

The above output will produce the following scans.
![](images/discrete_profile.png)

As expected, the curve obtained by allowing the `pdf_index` to float (labelled "Envelope") picks out the best function (maximum corrected likelihood) for each value of the signal strength.

-In general, you can improve the performance of combine, when using the disccrete profiling method, by including the following options `--X-rtd MINIMIZER_freezeDisassociatedParams`, which will stop parameters not associated to the current pdf from floating in the fits. Additionaly, you can also include the following
+In general, the performance of Combine can be improved when using the discrete profiling method by including the option `--X-rtd MINIMIZER_freezeDisassociatedParams`. This will stop parameters not associated to the current PDF from floating in the fits. Additionally, you can include the following options:

 * `--X-rtd MINIMIZER_multiMin_hideConstants`: hide the constant terms in the likelihood when recreating the minimizer
 * `--X-rtd MINIMIZER_multiMin_maskConstraints`: hide the constraint terms during the discrete minimization process
- * `--X-rtd MINIMIZER_multiMin_maskChannels=` mask in the NLL the channels that are not needed:
+ * `--X-rtd MINIMIZER_multiMin_maskChannels=` mask the channels that are not needed from the NLL:
   * ` 1`: keeps unmasked all channels that are participating in the discrete minimization.
   * ` 2`: keeps unmasked only the channel whose index is being scanned at the moment.

-You may want to check with the combine dev team if using these options as they are somewhat for *expert* use.
+You may want to check with the Combine development team if you are using these options, as they are somewhat for *expert* use.

## RooSplineND multidimensional splines

-[RooSplineND](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/interface/RooSplineND.h) can be used to interpolate from tree of points to produce a continuous function in N-dimensions.
This function can then be used as input to workspaces allowing for parametric rates/cross-sections/efficiencies etc OR can be used to up-scale the resolution of likelihood scans (i.e like those produced from combine) to produce smooth contours. +[RooSplineND](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/interface/RooSplineND.h) can be used to interpolate from a tree of points to produce a continuous function in N-dimensions. This function can then be used as input to workspaces allowing for parametric rates/cross-sections/efficiencies. It can also be used to up-scale the resolution of likelihood scans (i.e like those produced from Combine) to produce smooth contours. The spline makes use of a radial basis decomposition to produce a continous $N \to 1$ map (function) from $M$ provided sample points. The function of the $N$ variables $\vec{x}$ is assumed to be of the form, @@ -566,7 +566,7 @@ $$ The solution is obtained using the `eigen` c++ package. -The typical constructor of the object is done as follows; +The typical constructor of the object is as follows; ```c++ RooSplineND(const char *name, const char *title, RooArgList &vars, TTree *tree, const char* fName="f", double eps=3., bool rescale=false, std::string cutstring="" ) ; @@ -574,19 +574,19 @@ RooSplineND(const char *name, const char *title, RooArgList &vars, TTree *tree, where the arguments are: - * `vars`: A RooArgList of RooRealVars representing the $N$ dimensions of the spline. The length of this list determines the dimension $N$ of the spline. + * `vars`: A `RooArgList` of `RooRealVars` representing the $N$ dimensions of the spline. The length of this list determines the dimension $N$ of the spline. * `tree`: a TTree pointer where each entry represents a sample point used to construct the spline. The branch names must correspond to the names of the variables in `vars`. * `fName`: is a string representing the name of the branch to interpret as the target function $f$. * `eps` : is the value of $\epsilon$ and represents the *width* of the basis functions $\phi$. - * `rescale` : is an option to re-scale the input sample points so that each variable has roughly the same range (see above in the definition of $||.||$). + * `rescale` : is an option to rescale the input sample points so that each variable has roughly the same range (see above in the definition of $||.||$). * `cutstring` : a string to remove sample points from the tree. Can be any typical cut string (eg "var1>10 && var2<3"). -The object can be treaeted as a `RooAbsArg` and its value for the current values of the parameters is obtained as usual by using the `getVal()` method. +The object can be treated as a `RooAbsArg`; its value for the current values of the parameters is obtained as usual by using the `getVal()` method. !!! warning - You should not include more variable branches than contained in `vars` in the tree as the spline will interpret them as additional sample points. You should get a warning if there are two *nearby* points in the input samples and this will cause a failure in determining the weights. If you cannot create a reduced tree, you can remove entries by using the `cutstring`. + You should not include more variable branches than contained in `vars` in the tree, as the spline will interpret them as additional sample points. You will get a warning if there are two *nearby* points in the input samples and this will cause a failure in determining the weights. 
If you cannot create a reduced tree, you can remove entries by using the `cutstring`. -The following script is an example of its use which produces a 2D spline (`N=2`) from a set of 400 points (`M=400`) generated from a function. +The following script is an example that produces a 2D spline (`N=2`) from a set of 400 points (`M=400`) generated from a function. ```c++ void splinend(){ @@ -694,11 +694,11 @@ Running the script will produce the following plot. The plot shows the sampled p ## RooParametricHist gammaN for shapes -Currently, there is no straight-forward implementation of using per-bin **gmN** like uncertainties with shape (histogram) analyses. Instead, it is possible to tie control regions (written as datacards) with the signal region using three methods. +Currently, there is no straightforward implementation of using per-bin **gmN**-like uncertainties with shape (histogram) analyses. Instead, it is possible to tie control regions (written as datacards) with the signal region using three methods. -For analyses who take the normalisation of some process from a control region, it is possible to use either **lnU** or **rateParam** directives to float the normalisation in a correlated way of some process between two regions. Instead if each bin is intended to be determined via a control region, one can use a number of RooFit histogram pdfs/functions to accomplish this. The example below shows a simple implementation of a [RooParametricHist](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/interface/RooParametricHist.h) to achieve this. +For analyses that take the normalization of some process from a control region, it is possible to use either **lnU** or **rateParam** directives to float the normalization in a correlated way of some process between two regions. Instead if each bin is intended to be determined via a control region, one can use a number of `RooFit` histogram PDFs/functions to accomplish this. The example below shows a simple implementation of a [RooParametricHist](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/interface/RooParametricHist.h) to achieve this. -copy the script below into a file called `examplews.C` and create the input workspace using `root -l examplews.C`... +Copy the script below into a file called `examplews.C` and create the input workspace using `root -l examplews.C`... ```c++ @@ -869,12 +869,12 @@ void examplews(){ } ``` -Lets go through what the script is doing. First, the observable for the search is the missing energy so we create a parameter to represent that. +We will now discuss what the script is doing. First, the observable for the search is the missing energy, so we create a parameter to represent this observable. ```c++ RooRealVar met("met","E_{T}^{miss}",xmin,xmax); ``` -First, the following lines create a freely floating parameter for each of our bins (in this example, there are only 4 bins, defined for our observable `met`. +The following lines create a freely floating parameter for each of our bins (in this example, there are only 4 bins, defined for our observable `met`). ```c++ @@ -898,24 +898,24 @@ They are put into a list so that we can create a `RooParametricHist` and its nor RooAddition p_bkg_norm("bkg_SR_norm","Total Number of events from background in signal region",bkg_SR_bins); ``` -For the control region, the background process will be dependent on the yields of the background in the signal region using a *transfer factor*. 
The transfer factor `TF` must account for acceptance/efficiency etc differences in the signal to control regions. +For the control region, the background process will be dependent on the yields of the background in the signal region using a *transfer factor*. The transfer factor `TF` must account for acceptance/efficiency/etc differences between the signal region and the control regions. -In this example lets assume the control region is populated by the same process decaying to a different final state with twice as large branching ratio compared to the one in the signal region. +In this example we will assume the control region is populated by the same process decaying to a different final state with twice as large branching fraction as the one in the signal region. -We could imagine that the transfer factor could be associated with some uncertainty - lets say a 1% uncertainty due to efficiency and 2% due to acceptance. We need to make nuisance parameters ourselves to model this and give them a nominal value of 0. +We could imagine that the transfer factor could be associated with some uncertainty - for example a 1% uncertainty due to efficiency and a 2% uncertainty due to acceptance differences. We need to make nuisance parameters ourselves to model this, and give them a nominal value of 0. ```c++ RooRealVar efficiency("efficiency", "efficiency nuisance parameter",0); RooRealVar acceptance("acceptance", "acceptance nuisance parameter",0); ``` -We need to make the transfer factor a function of these parameters since variations in these uncertainties will lead to variations of the transfer factor. Here we've assumed Log-normal effects (i.e the same as putting lnN in the CR datacard) but we could use *any function* which could be used to parameterise the effect - eg if the systematic is due to some alternate template, we could use polynomials for example. +We need to make the transfer factor a function of these parameters, since variations in these uncertainties will lead to variations of the transfer factor. Here we have assumed Log-normal effects (i.e the same as putting lnN in the CR datacard), but we could use *any function* which could be used to parameterize the effect - for example if the systematic uncertainty is due to some alternate template, we could use polynomials. ```c++ RooFormulaVar TF("TF","Trasnfer factor","2*TMath::Power(1.01,@0)*TMath::Power(1.02,@1)",RooArgList(efficiency,acceptance) ); ``` -Then need to make each bin of the background in the control region a function of the background in the signal and the transfer factor - i.e $N_{CR} = N_{SR} \times TF $. +Then, we need to make each bin of the background in the control region a function of the background in the signal region and the transfer factor - i.e $N_{CR} = N_{SR} \times TF $. ```c++ RooFormulaVar CRbin1("bkg_CR_bin1","Background yield in control region, bin 1","@0*@1",RooArgList(TF,bin1)); @@ -924,7 +924,7 @@ Then need to make each bin of the background in the control region a function of RooFormulaVar CRbin4("bkg_CR_bin4","Background yield in control region, bin 4","@0*@1",RooArgList(TF,bin4)); ``` -As before, we also need to create the `RooParametricHist` for this process in the control region but this time the bin yields will be the `RooFormulaVars` we just created instead of free floating parameters. 
+As before, we also need to create the `RooParametricHist` for this process in the control region but this time the bin yields will be the `RooFormulaVars` we just created instead of freely floating parameters. ```c++ RooArgList bkg_CR_bins; @@ -937,7 +937,7 @@ As before, we also need to create the `RooParametricHist` for this process in th RooAddition p_CRbkg_norm("bkg_CR_norm","Total Number of events from background in control region",bkg_CR_bins); ``` -Finally, we can also create alternative shape variations (Up/Down) that can be fed to combine as we do with `TH1` or `RooDataHist` type workspaces. These need +Finally, we can also create alternative shape variations (Up/Down) that can be fed to Combine as we do with `TH1` or `RooDataHist` type workspaces. These need to be of type `RooDataHist`. The example below is for a Jet Energy Scale type shape uncertainty. ```c++ @@ -1024,9 +1024,9 @@ acceptance param 0 1 ``` -Note that for the control region, our nuisance parameters appear as `param` types so that combine will correctly constrain them. +Note that for the control region, our nuisance parameters appear as `param` types, so that Combine will correctly constrain them. -If we combine the two cards and fit the result with `-M MultiDimFit -v 3` we can see that the parameters which give the rate of background in each bin of the signal region, along with the nuisance parameters and signal strength, are determined by the fit - i.e we have properly included the constraint from the control region, just as with the 1-bin `gmN`. +If we combine the two cards and fit the result with `-M MultiDimFit -v 3` we can see that the parameters that give the rate of background in each bin of the signal region, along with the nuisance parameters and signal strength, are determined by the fit - i.e we have properly included the constraint from the control region, just as with the 1-bin `gmN`. ``` acceptance = 0.00374312 +/- 0.964632 (limited) @@ -1039,20 +1039,20 @@ lumi_8TeV = -0.0025911 +/- 0.994458 r = 0.00716347 +/- 12.513 (limited) ``` -The example given here is extremely basic and it should be noted that additional complexity in the transfer factors, additional uncertainties/backgrounds etc in the cards are supported as always. +The example given here is extremely basic and it should be noted that additional complexity in the transfer factors, as well as additional uncertainties/backgrounds etc in the cards are, as always, supported. !!! danger - If trying to implement parametric uncertainties in this setup (eg on transfer factors) which are correlated with other channels and implemented separately, you ***MUST*** normalise the uncertainty effect so that the datacard line can read `param name X 1`. That is the uncertainty on this parameter must be 1. Without this, there will be inconsistency with other nuisances of the same name in other channels implemented as **shape** or **lnN**. + If trying to implement parametric uncertainties in this setup (eg on transfer factors) that are correlated with other channels and implemented separately, you ***MUST*** normalize the uncertainty effect so that the datacard line can read `param name X 1`. That is, the uncertainty on this parameter must be 1. Without this, there will be inconsistency with other nuisances of the same name in other channels implemented as **shape** or **lnN**. ## Look-elsewhere effect for one parameter In case you see an excess somewhere in your analysis, you can evaluate the look-elsewhere effect (LEE) of that excess. 
For an explanation of the LEE, take a look at the CMS Statistics Committee Twiki [here](https://twiki.cern.ch/twiki/bin/viewauth/CMS/LookElsewhereEffect).

-To calculate the look-elsewhere effect for a single parameter (in this case the mass of the resonance), you can follow the instructions. Note that these instructions assume you have a workspace which is parametric in your resonance mass $m$, otherwise you need to fit each background toy with separate workspaces. Assume the local significance for your excess is $\sigma$.
+To calculate the look-elsewhere effect for a single parameter (in this case the mass of the resonance), you can follow the instructions below. Note that these instructions assume you have a workspace that is parametric in your resonance mass $m$, otherwise you need to fit each background toy with separate workspaces. We will assume the local significance for your excess is $\sigma$.

   * Generate background-only toys `combine ws.root -M GenerateOnly --toysFrequentist -m 16.5 -t 100 --saveToys --expectSignal=0`. The output will be something like `higgsCombineTest.GenerateOnly.mH16.5.123456.root`.

-   * For each toy, calculate the significance for a predefined range - e.g $m\in [10,35]$ GeV in steps suitable to the resolution - eg 1 GeV. For `toy_1` the procedure would be: `for i in $(seq 10 35); do combine ws.root -M Significance --redefineSignalPOI r --freezeParameters MH --setParameter MH=$i -n $i -D higgsCombineTest.GenerateOnly.mH16.5.123456.root:toys/toy_1`. Calculate the maximum significance over all of these mass points - call this $\sigma_{max}$.
+   * For each toy, calculate the significance for a predefined range (e.g. $m\in [10,35]$ GeV) in steps suitable to the resolution (e.g. 1 GeV). For `toy_1` the procedure would be: `for i in $(seq 10 35); do combine ws.root -M Significance --redefineSignalPOI r --freezeParameters MH --setParameter MH=$i -n $i -D higgsCombineTest.GenerateOnly.mH16.5.123456.root:toys/toy_1; done`. Calculate the maximum significance over all of these mass points - call this $\sigma_{max}$.

   * Count how many toys have a maximum significance larger than the local one for your observed excess. This fraction of toys with $\sigma_{max}>\sigma$ is the global p-value.

You can find more tutorials on the LEE [here](https://indico.cern.ch/event/456547/contributions/1126036/attachments/1188691/1724680/20151117_comb_tutorial_Lee.pdf)

diff --git a/docs/part3/regularisation.md b/docs/part3/regularisation.md
index 51751211a5f..6cf42ced570 100644
--- a/docs/part3/regularisation.md
+++ b/docs/part3/regularisation.md
@@ -1,11 +1,11 @@
# Unfolding & regularization

-This section details how to perform an unfolded cross-section measurement, *including regularization*, inside Combine.
+This section details how to perform an unfolded cross-section measurement, *including regularization*, within Combine.

-There are many resources available that describe unfolding, including when to use it (or not), and what are the usual issues around it. A useful summary is available at the [CMS Statistics Committee pages](https://twiki.cern.ch/twiki/bin/view/CMS/ScrecUnfolding) on unfolding. You can also
-find a nice overview of unfolding and its usage in combine in [these slides](https://indico.cern.ch/event/399923/contributions/956409/attachments/800899/1097609/2015_06_24_LHCXSWG.pdf#search=Marini%20AND%20StartDate%3E%3D2015-06-24%20AND%20EndDate%3C%3D2015-06-24).
+There are many resources available that describe unfolding, including when to use it (or not), and what the common issues surrounding it are. For CMS users, a useful summary is available in the [CMS Statistics Committee pages](https://twiki.cern.ch/twiki/bin/view/CMS/ScrecUnfolding) on unfolding. You can also
+find an overview of unfolding and its usage in Combine in [these slides](https://indico.cern.ch/event/399923/contributions/956409/attachments/800899/1097609/2015_06_24_LHCXSWG.pdf#search=Marini%20AND%20StartDate%3E%3D2015-06-24%20AND%20EndDate%3C%3D2015-06-24).

-The basic idea behind the unfolding technique to describe smearing introduced through the reconstruction (eg of the particle energy) in a given truth level bin $x_{i}$ through a linear relationship to the effects in the nearby truth-bins. We can make statements about the probability $p_{j}$ that the event falling in the truth bin $x_{i}$ is reconstructed in the bin $y_{i}$ via the linear relationship,
+The basic idea behind the unfolding technique is to describe smearing introduced through the reconstruction (e.g. of the particle energy) in a given truth level bin $x_{i}$ through a linear relationship with the effects in the nearby truth-bins. We can make statements about the probability $p_{j}$ that the event falling in the truth bin $x_{i}$ is reconstructed in the bin $y_{i}$ via the linear relationship,

$$
y_{obs} = \tilde{\boldsymbol{R}}\cdot x_{true} + b
$$

@@ -22,7 +22,7 @@ Unfolding aims to find the distribution at truth level $x$, given the observatio

## Likelihood-based unfolding

-Since Combine has access to the full likelihood for any analysis written in the usual datacard format, we will use likelihood-based unfolding
+Since Combine has access to the full likelihood for any analysis written in the usual datacard format, we will use likelihood-based unfolding
The results can be extracted through a -simple maximum likelihood fit with, +simple maximum-likelihood fit with, ``` text2workspace.py -m 125 --X-allow-no-background -o datacard.root datacard.txt @@ -49,7 +49,7 @@ simple maximum likelihood fit with, combine -M MultiDimFit --setParameters=r_Bin0=1,r_Bin1=1,r_Bin2=1,r_Bin3=1,r_Bin4=1 -t -1 -m 125 --algo=grid --points=100 -P r_Bin1 --setParameterRanges r_Bin1=0.5,1.5 --floatOtherPOIs=1 datacard.root ``` -Notice that one can also perform the so called bin-by-bin unfolding (though it is strongly discouraged except for testing) with, +Notice that one can also perform the so called bin-by-bin unfolding (though it is strongly discouraged, except for testing) with, ``` text2workspace.py -m 125 --X-allow-no-background -o datacard.root datacard.txt @@ -58,7 +58,7 @@ Notice that one can also perform the so called bin-by-bin unfolding (though it i Nuisance parameters can be added to the likelihood function and profiled in the usual way via the datacards. Theory uncertainties on the inclusive cross section are typically not included in unfolded measurements. -The figure below shows a comparison of Likelihood based unfolding and a least-squares based unfolding as implemented in `RooUnfold`. +The figure below shows a comparison of likelihood-based unfolding and a least-squares based unfolding as implemented in `RooUnfold`. /// details | **Show comparison** @@ -70,24 +70,24 @@ The figure below shows a comparison of Likelihood based unfolding and a least-sq The main difference with respect to other models with multiple signal contributions is the introduction of **Regularization**, which is used to stabilize the unfolding process. -An example of unfolding in combine with and without regularization, can be found under +An example of unfolding in Combine with and without regularization, can be found under [data/tutorials/regularization](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/tree/102x/data/tutorials/regularization). Running `python createWs.py [-r]` will create a simple datacard and perform a fit both with and without including regularization. -The simplest way to introduce regularization in the likelihood based approach, is to apply a penalty term in the likelihood function which -depends on the values of the truth bins (so called *Tickonov regularization*): +The simplest way to introduce regularization in the likelihood based approach, is to apply a penalty term, which +depends on the values of the truth bins, in the likelihood function (so-called *Tikhonov regularization*): $$ -2\ln L = -2\ln L + P(\vec{x}) $$ -where $P$ is a linear operator. There are two different approches which are supported to construct $P$. -If instead you run `python makeModel.py`, you will create a more complex datacard with each the two regularization scheme implemented. You will need -to uncomment the relevant sections of code to activate SVD or TUnfold type regularization. +Here, $P$ is a linear operator. There are two different approaches that are supported to construct $P$. +If you run `python makeModel.py`, you will create a more complex datacard with the two regularization schemes implemented. You will need +to uncomment the relevant sections of code to activate `SVD` or `TUnfold`-type regularization. !!! 
warning - Any unfolding method which makes use of regularization must perform studies of the potential bias/coverage properties introduced through the + When using any unfolding method with regularization, you must perform studies of the potential bias/coverage properties introduced through the inclusion of regularization, and how strong the associated regularization is. Advice on this can be found in the CMS Statistics Committee pages. ### Singular Value Decomposition (SVD) @@ -115,8 +115,8 @@ row of the product $A\cdot\vec{\mu}$, by including them as lines in the datacard ``` name constr formula dependents delta ``` -where the regularization strength $\delta=\frac{1}{\sqrt{\tau}}$ and can either be a fixed value (eg by putting directly `0.01`) or as -a modifiable parameter with eg `delta[0.01]`. +where the regularization strength is $\delta=\frac{1}{\sqrt{\tau}}$ and can either be a fixed value (e.g. by directly putting `0.01`) or as +a modifiable parameter with e.g. `delta[0.01]`. For example, for 3 bins and a regularization strength of 0.03, the first line would be @@ -124,7 +124,7 @@ For example, for 3 bins and a regularization strength of 0.03, the first line wo name constr @0-2*@2+@1 r_Bin0,r_Bin1,r_Bin2 0.03 ``` -Alternative, valid syntaxes are +Alternative valid syntaxes are ``` constr1 constr r_bin0-r_bin1 0.01 @@ -133,7 +133,7 @@ Alternative, valid syntaxes are constr1 constr r_bin0+r_bin1 {r_bin0,r_bin1} delta[0.01] ``` -The figure below shows an example unfolding using the "SVD regularization" approach with the least squares method (as implemented by `RooUnfold`) and implemented as a penalty term added to the likelihood using the maximum likelihood approach in `Combine`. +The figure below shows an example unfolding using the "SVD regularization" approach with the least squares method (as implemented by `RooUnfold`) and implemented as a penalty term added to the likelihood using the maximum likelihood approach in Combine. /// details | **Show comparison** @@ -143,10 +143,10 @@ The figure below shows an example unfolding using the "SVD regularization" appro ### TUnfold method -The Tikhonov regularization as implemented in `TUnfold` uses the MC information, or rather the densities prediction, as a bias vector. -In order to give this information to Combine, a single datacard for each reco-level bin needs to be produced, so that we have access to the proper normalization terms during the minimization. In this case the bias vector is $\vec{x}_{obs}-\vec{x}_{true}$ +The Tikhonov regularization as implemented in `TUnfold` uses the MC information, or rather the density prediction, as a bias vector. +In order to give this information to Combine, a single datacard for each reconstruction-level bin needs to be produced, so that we have access to the proper normalization terms during the minimization. In this case the bias vector is $\vec{x}_{obs}-\vec{x}_{true}$ -Then one can write a constraint term in the datacard via (eg.) 
+Then one can write a constraint term in the datacard via, for example, ``` constr1 constr (r_Bin0-1.)*(shapeSig_GenBin0_RecoBin0__norm+shapeSig_GenBin0_RecoBin1__norm+shapeSig_GenBin0_RecoBin2__norm+shapeSig_GenBin0_RecoBin3__norm+shapeSig_GenBin0_RecoBin4__norm)+(r_Bin2-1.)*(shapeSig_GenBin2_RecoBin0__norm+shapeSig_GenBin2_RecoBin1__norm+shapeSig_GenBin2_RecoBin2__norm+shapeSig_GenBin2_RecoBin3__norm+shapeSig_GenBin2_RecoBin4__norm)-2*(r_Bin1-1.)*(shapeSig_GenBin1_RecoBin0__norm+shapeSig_GenBin1_RecoBin1__norm+shapeSig_GenBin1_RecoBin2__norm+shapeSig_GenBin1_RecoBin3__norm+shapeSig_GenBin1_RecoBin4__norm) {r_Bin0,r_Bin1,r_Bin2,shapeSig_GenBin1_RecoBin0__norm,shapeSig_GenBin0_RecoBin0__norm,shapeSig_GenBin2_RecoBin0__norm,shapeSig_GenBin1_RecoBin1__norm,shapeSig_GenBin0_RecoBin1__norm,shapeSig_GenBin2_RecoBin1__norm,shapeSig_GenBin1_RecoBin2__norm,shapeSig_GenBin0_RecoBin2__norm,shapeSig_GenBin2_RecoBin2__norm,shapeSig_GenBin1_RecoBin3__norm,shapeSig_GenBin0_RecoBin3__norm,shapeSig_GenBin2_RecoBin3__norm,shapeSig_GenBin1_RecoBin4__norm,shapeSig_GenBin0_RecoBin4__norm,shapeSig_GenBin2_RecoBin4__norm} delta[0.03] diff --git a/docs/part3/runningthetool.md b/docs/part3/runningthetool.md index dad20633347..bffcc5c520c 100644 --- a/docs/part3/runningthetool.md +++ b/docs/part3/runningthetool.md @@ -1,100 +1,100 @@ # How to run the tool -The executable **`combine`** provided by the package allows to use the Higgs Combination Tool indicating by command line which is the method to use for limit combination and which are user's preferences to run it. To see the entire list of all available options ask for the help: +The executable Combine provided by the package is used to invoke the tools via the command line. The statistical analysis method, as well as user settings, are also specified on the command line. To see the full list of available options, you can run: ```sh combine --help ``` -The option `-M` allows to chose the method used. There are several groups of statistical methods: +The option `-M` is used to choose the statistical evaluation method. There are several groups of statistical methods: - **Asymptotic** likelihood methods: - - `AsymptoticLimits`: limits calculated according to the asymptotic formulas in [arxiv:1007.1727](http://arxiv.org/abs/1007.1727) + - `AsymptoticLimits`: limits calculated according to the asymptotic formulae in [arxiv:1007.1727](http://arxiv.org/abs/1007.1727). - `Significance`: simple profile likelihood approximation, for calculating significances. - **Bayesian** methods: - - `BayesianSimple`: performing a classical numerical integration (for simple models only) + - `BayesianSimple`: performing a classical numerical integration (for simple models only). - `MarkovChainMC`: performing Markov Chain integration, for arbitrarily complex models. - **Frequentist** or hybrid bayesian-frequentist methods: - `HybridNew`: compute modified frequentist limits, significance/p-values and confidence intervals according to several possible prescriptions with toys. 
- **Fitting** - - `FitDiagnostics`: performs maximum likelihood fits to extract the signal yield and provide diagnostic tools such as pre and post-fit models and correlations - - `MultiDimFit`: perform maximum likelihood fits in multiple parameters and likelihood scans -- **Miscellaneous** other modules that don't compute limits but use the same framework: - - `GoodnessOfFit`: perform a goodness of fit test for models including shape information using several GOF estimators - - `ChannelConsistencyCheck`: check how consistent are the individual channels of a combination are + - `FitDiagnostics`: performs maximum likelihood fits to extract the signal rate, and provides diagnostic tools such as pre- and post-fit figures and correlations + - `MultiDimFit`: performs maximum likelihood fits and likelihood scans with an arbitrary number of parameters of interest. +- **Miscellaneous** other modules that do not compute limits or confidence intervals, but use the same framework: + - `GoodnessOfFit`: perform a goodness of fit test for models including shape information. Several GoF tests are implemented. + - `ChannelConsistencyCheck`: study the consistency between individual channels in a combination. - `GenerateOnly`: generate random or asimov toy datasets for use as input to other methods The command help is organized into five parts: -- *Main options* section indicates how to pass the datacard as input to the tool (`-d datacardName`) and how to choose the statistical method (`-M MethodName`) to compute a limit and level of verbosity for output `-v` -- *Common statistics options* include options common to different statistical methods such as `--cl` to specify the CL (default is 0.95) or `-t` to give the number of toy MC extractions required. -- *Common input-output options*. Is it possible to specify hypothesis point under analysis using `-m` or include specific string in output filename `--name`. +- The *Main options* section indicates how to pass the datacard as input to the tool (`-d datacardName`), how to choose the statistical method (`-M MethodName`), and how to set the verbosity level `-v` +- Under *Common statistics options*, options common to different statistical methods are given. Examples are `--cl`, to specify the confidence level (default is 0.95), or `-t`, to give the number of toy MC extractions required. +- The *Common input-output options* section includes, for example, the options to specify the mass hypothesis under study (`-m`) or to include a specific string in the output filename (`--name`). - *Common miscellaneous options*. -- Method specific options sections are dedicated to each method. By providing the Method name with the `-M` option, only the options for that specific method are shown in addition to the common options +- Further method-specific options are available for each method. By passing the method name via the `-M` option, along with `--help`, the options for that specific method are shown in addition to the common options. -Those options reported above are just a sample of all available.The command `--help` provides documentation of all of them. +Not all the available options are discussed in this online documentation; use `--help` to get the documentation of all options. 
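To illustrate how these pieces fit together, the sketch below shows a typical invocation that combines options from several of the groups described above. The datacard name and the label passed to `-n` are placeholders; the options themselves (`-M`, `-d`, `-m`, `-t`, `--cl`, `-v`, `-n`) are the common options documented in this section and later in this page.

```bash
# Expected 90% CL limit for a 125 GeV mass hypothesis, computed on an Asimov data set (-t -1),
# with the output file labelled "Example" and a moderate verbosity level
combine -M AsymptoticLimits -d datacard.txt -m 125 -t -1 --cl 0.90 -n Example -v 1
```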
-## Common command line options +## Common command-line options -There are a number of useful command line options which can be used to alter the model (or parameters of the model) at run These are the most commonly used, generic options, +There are a number of useful command-line options that can be used to alter the model (or parameters of the model) at run time. The most commonly used, generic options, are: -- `-H`: run first another faster algorithm (e.g. the ProfileLikelihood described below) to get a hint of the limit, allowing the real algorithm to converge more quickly. We **strongly recommend** to use this option when using MarkovChainMC, HybridNew or FeldmanCousins calculators, unless you know in which range your limit lies and you set it manually (the default is `[0, 20]`) +- `-H`: first run a different, faster, algorithm (e.g. the `ProfileLikelihood` described below) to obtain an approximate indication of the limit, which will allow the precise chosen algorithm to converge more quickly. We **strongly recommend** to use this option when using the `MarkovChainMC`, `HybridNew` or `FeldmanCousins` calculators, unless you know in which range your limit lies and you set this range manually (the default is `[0, 20]`) -- `--rMax`, `--rMin`: manually restrict the range of signal strengths to consider. For Bayesian limits with MCMC, `rMax` a rule of thumb is that rMax should be 3-5 times the limit (a too small value of `rMax` will bias your limit towards low values, since you are restricting the integration range, while a too large value will bias you to higher limits) +- `--rMax`, `--rMin`: manually restrict the range of signal strengths to consider. For Bayesian limits with MCMC, a rule of thumb is that `rMax` should be 3-5 times the limit (a too small value of `rMax` will bias your limit towards low values, since you are restricting the integration range, while a too large value will bias you to higher limits) -- `--setParameters name=value[,name2=value2,...]` sets the starting values of the parameters, useful e.g. when generating toy MC or when also setting the parameters as fixed. This option supports the use of regexp via by replacing `name` with `rgx{some regular expression}`. +- `--setParameters name=value[,name2=value2,...]` sets the starting values of the parameters, useful e.g. when generating toy MC or when setting the parameters as fixed. This option supports the use of regular expressions by replacing `name` with `rgx{some regular expression}`. -- `--setParameterRanges name=min,max[:name2=min2,max2:...]` sets the ranges of the parameters (useful e.g. for scanning in MultiDimFit, or for Bayesian integration). This option supports the use of regexp via by replacing `name` with `rgx{some regular expression}`. +- `--setParameterRanges name=min,max[:name2=min2,max2:...]` sets the ranges of the parameters (useful e.g. for scans in `MultiDimFit`, or for Bayesian integration). This option supports the use of regular expressions by replacing `name` with `rgx{some regular expression}`. - `--redefineSignalPOIs name[,name2,...]` redefines the set of parameters of interest. - - if the parameters where constant in the input workspace, they are re-defined to be floating. - - nuisances promoted to parameters of interest are removed from the list of nuisances, and thus they are not randomized in methods that randomize nuisances (e.g. HybridNew in non-frequentist mode, or BayesianToyMC, or in toy generation with `-t` but without `--toysFreq`). 
This doesn't have any impact on algorithms that don't randomize nuisances (e.g. fits, AsymptoticLimits, or HybridNew in fequentist mode) or on algorithms that treat all parameters in the same way (e.g. MarkovChainMC).
-    - Note that constraint terms for the nuisances are **dropped** after promotion to a POI using `--redefineSignalPOI`. To produce a likelihood scan for a nuisance parameter, using MultiDimFit with **`--algo grid`**, you should instead use the `--parameters (-P)` option which will not cause the loss of the constraint term when scanning.
-    - parameters of interest of the input workspace that are not selected by this command become unconstrained nuisance parameters, but they are not added to the list of nuisances so they will not be randomized (see above).
+    - If the parameters were constant in the input workspace, they are set to be floating.
+    - Nuisance parameters promoted to parameters of interest are removed from the list of nuisances, and thus they are not randomized in methods that randomize nuisances (e.g. `HybridNew` in non-frequentist mode, or `BayesianToyMC`, or in toy generation with `-t` but without `--toysFreq`). This does not have any impact on algorithms that do not randomize nuisance parameters (e.g. fits, `AsymptoticLimits`, or `HybridNew` in frequentist mode) or on algorithms that treat all parameters in the same way (e.g. `MarkovChainMC`).
+    - Note that constraint terms for the nuisances are **dropped** after promotion to a POI using `--redefineSignalPOIs`. To produce a likelihood scan for a nuisance parameter, using `MultiDimFit` with **`--algo grid`**, you should instead use the `--parameters (-P)` option, which will not cause the loss of the constraint term when scanning.
+    - Parameters of interest of the input workspace that are not selected by this command become unconstrained nuisance parameters, but they are not added to the list of nuisances so they will not be randomized (see above).

-- `--freezeParameters name1[,name2,...]` Will freeze the parameters with the given names to their set values. This option supports the use of regexps via by replacing `name` with `rgx{some regular expression}` for matching to *constrained nuisance parameters* or `var{some regular expression}` for matching to *any* parameter. For example `--freezeParameters rgx{CMS_scale_j.*}` will freeze all constrained nuisance parameters with the prefix `CMS_scale_j`, while `--freezeParameters var{.*rate_scale}` will freeze any parameter (constrained nuisance or otherwise) with the suffix `rate_scale`.
-   - use the option `--freezeParameters allConstrainedNuisances` to freeze all nuisance parameters that have a constraint term (i.e not `flatParams` or `rateParams` or other freely floating parameters).
-   - similarly the option `--floatParameters` sets the parameter floating.
-   - groups of nuisances (constrained or otherwise), as defined in the datacard, can be frozen using `--freezeNuisanceGroups`. You can also specify to freeze nuisances which are *not* contained in a particular group using a **^** before the group name (`--freezeNuisanceGroups=^group_name` will freeze everything except nuisance parameters in the group "group_name".)
-   - all *constrained* nuisance parameters (not `flatParam` or `rateParam`) can be set floating using `--floatAllNuisances`.

+- `--freezeParameters name1[,name2,...]` Will freeze the parameters with the given names to their set values.
This option supports the use of regular expression by replacing `name` with `rgx{some regular expression}` for matching to *constrained nuisance parameters* or `var{some regular expression}` for matching to *any* parameter. For example `--freezeParameters rgx{CMS_scale_j.*}` will freeze all constrained nuisance parameters with the prefix `CMS_scale_j`, while `--freezeParameters var{.*rate_scale}` will freeze any parameter (constrained nuisance parameter or otherwise) with the suffix `rate_scale`. + - Use the option `--freezeParameters allConstrainedNuisances` to freeze all nuisance parameters that have a constraint term (i.e not `flatParams` or `rateParams` or other freely floating parameters). + - Similarly, the option `--floatParameters name1[,name2,...]` sets the parameter(s) floating. + - Groups of nuisance parameters (constrained or otherwise), as defined in the datacard, can be frozen using `--freezeNuisanceGroups`. You can also freeze all nuisances that are *not* contained in a particular group using a **^** before the group name (`--freezeNuisanceGroups=^group_name` will freeze everything except nuisance parameters in the group "group_name".) + - All *constrained* nuisance parameters (not `flatParam` or `rateParam`) can be set floating using `--floatAllNuisances`. !!! warning - Note that the floating/freezing options have a priority ordering from lowest to highest as `floatParameters < freezeParameters < freezeNuisanceGroups < floatAllNuisances`. Options with higher priority will override those with lower priority. + Note that the floating/freezing options have a priority ordering from lowest to highest as `floatParameters < freezeParameters < freezeNuisanceGroups < floatAllNuisances`. Options with higher priority will take precedence over those with lower priority. -- `--trackParameters name1[,name2,...]` will add a branch to the output tree for each of the named parameters. This option supports the use of regexp via by replacing `name` with `rgx{some regular expression}` +- `--trackParameters name1[,name2,...]` will add a branch to the output tree for each of the named parameters. This option supports the use of regular expressions by replacing `name` with `rgx{some regular expression}` - - the name of the branch will be **trackedParam_*name***. - - the exact behaviour depends on the method. For example, when using `MultiDimFit` with the `--algo scan`, the value of the parameter at each point in the scan will be saved while for `FitDiagnostics`, only the value at the end of the method will be saved. + - The name of the branch will be **trackedParam_*name***. + - The exact behaviour depends on the method used. For example, when using `MultiDimFit` with `--algo scan`, the value of the parameter at each point in the scan will be saved, while for `FitDiagnostics`, only the value at the end of the fit will be saved. -- `--trackErrors name1[,name2,...]` will add a branch to the output tree for the error of each of the named parameters. This option supports the use of regexp via by replacing `name` with `rgx{some regular expression}` +- `--trackErrors name1[,name2,...]` will add a branch to the output tree for the error of each of the named parameters. This option supports the use of regular expressions by replacing `name` with `rgx{some regular expression}` - - the name of the branch will be **trackedError_*name***. - - the behaviour is the same as `--trackParameters` above. + - The name of the branch will be **trackedError_*name***. 
+ - The behaviour, in terms of which values are saved, is the same as `--trackParameters` above. -By default, the dataset used by combine will be the one pointed to in the datacard. You can tell combine to use a different dataset (for example a toy one that you generated) by using the option `--dataset`. The argument should be `rootfile.root:workspace:location` or `rootfile.root:location`. In order to use this option, you must first convert your datacard to a binary workspace and use this binary workspace as the input to the command line. +By default, the data set used by Combine will be the one listed in the datacard. You can tell Combine to use a different data set (for example a toy data set that you generated) by using the option `--dataset`. The argument should be `rootfile.root:workspace:location` or `rootfile.root:location`. In order to use this option, you must first convert your datacard to a binary workspace and use this binary workspace as the input to Combine. ### Generic Minimizer Options -Combine uses its own minimizer class which is used to steer Minuit (via RooMinimizer) named the `CascadeMinimizer`. This allows for sequential minimization which can help in case a particular setting/algo fails. Also, the `CascadeMinimizer` knows about extra features of Combine such as *discrete* nuisance parameters. +Combine uses its own minimizer class, which is used to steer Minuit (via RooMinimizer), named the `CascadeMinimizer`. This allows for sequential minimization, which can help in case a particular setting or algorithm fails. The `CascadeMinimizer` also knows about extra features of Combine such as *discrete* nuisance parameters. -All of the fits which are performed in several of the methods available use this minimizer. This means that the fits can be tuned using these common options, +All of the fits that are performed in Combine's methods use this minimizer. This means that the fits can be tuned using these common options, * `--cminPoiOnlyFit`: First, perform a fit floating *only* the parameters of interest. This can be useful to find, roughly, where the global minimum is. -* `--cminPreScan`: Do a scan before first minimization -* `--cminPreFit` arg: If set to a value N > 0, the minimizer will perform a pre-fit with strategy (N-1) with frozen nuisance parameters. - * `--cminApproxPreFitTolerance arg`: If non-zero, do first a pre-fit with this tolerance (or 10 times the final tolerance, whichever is largest) +* `--cminPreScan`: Do a scan before the first minimization. +* `--cminPreFit arg` If set to a value N > 0, the minimizer will perform a pre-fit with strategy (N-1), with the nuisance parameters frozen. + * `--cminApproxPreFitTolerance arg`: If non-zero, first do a pre-fit with this tolerance (or 10 times the final tolerance, whichever is largest) * `--cminApproxPreFitStrategy arg`: Strategy to use in the pre-fit. The default is strategy 0. -* `--cminDefaultMinimizerType arg`: Set the default minimizer Type. Default is Minuit2. -* `--cminDefaultMinimizerAlgo arg`: Set the default minimizer Algo. The default is Migrad -* `--cminDefaultMinimizerTolerance arg`: Set the default minimizer Tolerance, the default is 0.1 -* `--cminDefaultMinimizerStrategy arg`: Set the default minimizer Strategy between 0 (speed), 1 (balance - *default*), 2 (robustness). The [Minuit documentation](http://www.fresco.org.uk/minuit/cern/node6.html) for this is pretty sparse but in general, 0 means evaluate the function less often, while 2 will waste function calls to get precise answers. 
An important note is that Hesse (error/correlation estimation) will be run *only* if the strategy is 1 or 2. -* `--cminFallbackAlgo arg`: Provides a list of fallback algorithms if the default minimizer fails. You can provide multiple ones using the syntax is `Type[,algo],strategy[:tolerance]`: eg `--cminFallbackAlgo Minuit2,Simplex,0:0.1` will fall back to the simplex algo of Minuit2 with strategy 0 and a tolerance 0.1, while `--cminFallbackAlgo Minuit2,1` will use the default algo (migrad) of Minuit2 with strategy 1. +* `--cminDefaultMinimizerType arg`: Set the default minimizer type. By default this is set to Minuit2. +* `--cminDefaultMinimizerAlgo arg`: Set the default minimizer algorithm. The default algorithm is Migrad. +* `--cminDefaultMinimizerTolerance arg`: Set the default minimizer tolerance, the default is 0.1. +* `--cminDefaultMinimizerStrategy arg`: Set the default minimizer strategy between 0 (speed), 1 (balance - *default*), 2 (robustness). The [Minuit documentation](http://www.fresco.org.uk/minuit/cern/node6.html) for this is pretty sparse but in general, 0 means evaluate the function less often, while 2 will waste function calls to get precise answers. An important note is that the `Hesse` algorithm (for error and correlation estimation) will be run *only* if the strategy is 1 or 2. +* `--cminFallbackAlgo arg`: Provides a list of fallback algorithms, to be used in case the default minimizer fails. You can provide multiple options using the syntax `Type[,algo],strategy[:tolerance]`: eg `--cminFallbackAlgo Minuit2,Simplex,0:0.1` will fall back to the simplex algorithm of Minuit2 with strategy 0 and a tolerance 0.1, while `--cminFallbackAlgo Minuit2,1` will use the default algorithm (Migrad) of Minuit2 with strategy 1. * `--cminSetZeroPoint (0/1)`: Set the reference of the NLL to 0 when minimizing, this can help faster convergence to the minimum if the NLL itself is large. The default is true (1), set to 0 to turn off. -The allowed combinations of minimizer types and minimizer algos are as follows +The allowed combinations of minimizer types and minimizer algorithms are as follows: -| **Minimizer Type** | **Minimizer Algo** | +| **Minimizer type** | **Minimizer algorithm** | |--------------------|--------------------| |`Minuit` | `Migrad`, `Simplex`, `Combined`, `Scan` | |`Minuit2` | `Migrad`, `Simplex`, `Combined`, `Scan` | @@ -107,125 +107,125 @@ More of these options can be found in the **Cascade Minimizer options** section ### Output from combine -Most methods will print the results of the computation to the screen, however, in addition, combine will also produce a root file containing a tree called **limit** with these results. The name of this file will be of the format, +Most methods will print the results of the computation to the screen. However, in addition, Combine will also produce a root file containing a tree called **limit** with these results. The name of this file will be of the format, higgsCombineTest.MethodName.mH$MASS.[word$WORD].root where **$WORD** is any user defined keyword from the datacard which has been set to a particular value. -A few command line options of combine can be used to control this output: +A few command-line options can be used to control this output: -- The option `-n` allows you to specify part of the name of the rootfile. e.g. 
if you do `-n HWW` the roofile will be called `higgsCombineHWW....` instead of `higgsCombineTest`
-- The option `-m` allows you to specify the higgs boson mass, which gets written in the filename and also in the tree (this simplifies the bookeeping because you can merge together multiple trees corresponding to different higgs masses using `hadd` and then use the tree to plot the value of the limit vs mass) (default is m=120)
-- The option `-s` allows to specify the seed (eg `-s 12345`) used in toy generation. If this option is given, the name of the file will be extended by this seed, eg `higgsCombineTest.AsymptoticLimits.mH120.12345.root`
+- The option `-n` allows you to specify part of the name of the root file. For example, if you pass `-n HWW` the root file will be called `higgsCombineHWW....` instead of `higgsCombineTest`
+- The option `-m` allows you to specify the (Higgs boson) mass hypothesis, which gets written in the filename and in the output tree. This simplifies the bookkeeping, as it becomes possible to merge multiple trees corresponding to different (Higgs boson) masses using `hadd`. Quantities can then be plotted as a function of the mass. The default value is m=120.
+- The option `-s` can be used to specify the seed (eg `-s 12345`) used in toy generation. If this option is given, the name of the file will be extended by this seed, eg `higgsCombineTest.AsymptoticLimits.mH120.12345.root`
 - The option `--keyword-value` allows you to specify the value of a keyword in the datacard such that **$WORD** (in the datacard) will be given the value of **VALUE** in the command `--keyword-value WORD=VALUE`, eg `higgsCombineTest.AsymptoticLimits.mH120.WORDVALUE.12345.root`

 The output file will contain a `TDirectory` named **toys**, which will be empty if no toys are generated (see below for details) and a `TTree` called **limit** with the following branches;

 | **Branch name** | **Type** | **Description** |
 |------------------------|---------------| ------------------------------------------------------------------------------------------------------------------------|
-| **`limit`** | `Double_t` | Main result of combine run with method dependent meaning |
+| **`limit`** | `Double_t` | Main result of the Combine run, with method-dependent meaning |
 | **`limitErr`** | `Double_t` | Estimated uncertainty on the result |
 | **`mh`** | `Double_t` | Value of **MH**, specified with `-m` option |
 | **`iToy`** | `Int_t` | Toy number identifier if running with `-t`|
 | **`iSeed`** | `Int_t` | Seed specified with `-s`|
 | **`t_cpu`** | `Float_t` | Estimated CPU time for algorithm|
 | **`t_real`** | `Float_t` | Estimated real time for algorithm|
-| **`quantileExpected`** | `Float_t` | Quantile identifier for methods which calculated expected (quantiles) and observed results (eg conversions from $\Delta\ln L$ values) with method dependent meaning. Negative values are reserved for entries which *do not* related to quantiles of a calculation with the default being set to -1 (usually meaning the *observed* result). |
+| **`quantileExpected`** | `Float_t` | Quantile identifier for methods that calculate expected (quantile) and observed results (eg conversions from $\Delta\ln L$ values), with method-dependent meaning. Negative values are reserved for entries that *do not* relate to quantiles of a calculation, with the default being set to -1 (usually meaning the *observed* result). 
| -The value of any user defined keyword **$WORD** which is set using `keyword-value` described above will also be included as a branch with type `string` named **WORD**. The option can be repeated multiple times for multiple keywords. +The value of any user-defined keyword **$WORD** that is set using `keyword-value` described above will also be included as a branch with type `string` named **WORD**. The option can be repeated multiple times for multiple keywords. -In some cases, the precise meanings of the branches will depend on the Method being used, which is included in this documentation. +In some cases, the precise meanings of the branches will depend on the method being used. In this case, it will be specified in this documentation. ## Toy data generation -By default, each of these methods will be run using the **observed data** as the input. In several cases (as detailed below), it might be useful to run the tool using toy datasets, including Asimov data. +By default, each of the methods described so far will be run using the **observed data** as the input. In several cases (as detailed below), it is useful to run the tool using toy datasets, including Asimov data sets. -The option `-t` is used to specify to combine to first generate a toy dataset(s) which will be used in replacement of the real data. There are two versions of this, +The option `-t` is used to tell Combine to first generate one or more toy data sets, which will be used instead of the observed data. There are two versions, - * `-t N` with N > 0. Combine will generate N toy datasets from the model and re-run the method once per toy. The seed for the toy generation can be modified with the option `-s` (use `-s -1` for a random seed). The output file will contain one entry in the tree for each of these toys. + * `-t N` with N > 0. Combine will generate N toy datasets from the model and re-run the method once per toy. The seed for the toy generation can be modified with the option `-s` (use `-s -1` for a random seed). The output file will contain one entry in the tree for each of these toys. - * `-t -1` will produce an Asimov dataset in which statistical fluctuations are suppressed. The procedure to generate this Asimov dataset depends on which type of analysis you are using, see below for details. + * `-t -1` will produce an Asimov data set, in which statistical fluctuations are suppressed. The procedure for generating this Asimov data set depends on the type of analysis you are using. More details are given below. !!! warning - The default values of the nuisance parameters (or any parameter) are used to generate the toy. This means that if, for example, you are using parametric shapes and the parameters inside the workspace are set to arbitrary values, *those* arbitrary values will be used to generate the toy. This behaviour can be modified through the use of the option `--setParameters x=value_x,y=value_y...` which will set the values of the parameters (`x` and `y`) before toy generation. You can also load a snap-shot from a previous fit to set the nuisances to their *post-fit* values (see below). + The default values of the nuisance parameters (or any parameter) are used to generate the toy. This means that if, for example, you are using parametric shapes and the parameters inside the workspace are set to arbitrary values, *those* arbitrary values will be used to generate the toy. 
This behaviour can be modified through the use of the option `--setParameters x=value_x,y=value_y...`, which will set the values of the parameters (`x` and `y`) before toy generation. You can also load a snapshot from a previous fit to set the nuisance parameters to their *post-fit* values (see below).

The output file will contain the toys (as `RooDataSets` for the observables, including global observables) in the **toys** directory if the option `--saveToys` is provided. If you include this option, the `limit` TTree in the output will have an entry corresponding to the state of the POI used for the generation of the toy, with the value of **`quantileExpected`** set to **-2**.

!!! info
-    The branches that are created by methods like `MultiDimFit` *will not* show the values used to generate the toy. If you also want the TTree to show the values of the POIs used to generate to toy, you should add additional branches using the `--trackParameters` option as described in the [common command line options](#common-command-line-options) section above. These branches will behave as expected when adding the option `--saveToys`.
+    The branches that are created by methods like `MultiDimFit` *will not* show the values used to generate the toy. If you also want the TTree to show the values of the POIs used to generate the toy, you should add additional branches using the `--trackParameters` option as described in the [common command-line options](#common-command-line-options) section above. These branches will behave as expected when adding the option `--saveToys`.

### Asimov datasets

-If you are using wither `-t -1` or using `AsymptoticLimits`, combine will calculate results based on an Asimov dataset.
+If you are using either `-t -1` or `AsymptoticLimits`, Combine will calculate results based on an Asimov data set.

-   * For counting experiments, the Asimov data will just be set to the total number of expected events (given the values of the nuisance parameters and POIs of the model)
+   * For counting experiments, the Asimov data set will just be the total number of expected events (given the values of the nuisance parameters and POIs of the model)

-   * For shape analyses with templates, the Asimov dataset will be constructed as a histogram using the same binning which is defined for your analysis.
+   * For shape analyses with templates, the Asimov data set will be constructed as a histogram using the same binning that is defined for your analysis.

-   * If your model uses parametric shapes (for example when you are using binned data, there are some options as to what Asimov dataset to produce. By *default*, combine will produce the Asimov data as a histogram using the binning which is associated to each observable (ie as set using `RooRealVar::setBins`). If this binning doesn't exist, combine will **guess** a suitable binning - it is therefore best to use `RooRealVar::setBins` to associate a binning to each observable, even if your data is unbinned, if you intend to use Asimov datasets.
+   * If your model uses parametric shapes, there are some options as to what Asimov data set to produce. By *default*, Combine will produce the Asimov data set as a histogram using the binning that is associated with each observable (ie as set using `RooRealVar::setBins`). If this binning does not exist, Combine will **guess** a suitable binning - it is therefore best to use `RooRealVar::setBins` to associate a binning with each observable, even if your data is unbinned, if you intend to use Asimov data sets.
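To illustrate the options above, the following sketch generates an Asimov data set and saves it for inspection. It only uses options described on this page; the workspace name and mass value are placeholders.

```sh
# A minimal sketch, assuming a binary workspace "workspace.root" already exists:
# generate the Asimov data set (-t -1) and store it in the "toys" directory of the output file
combine -M GenerateOnly workspace.root -t -1 --saveToys --expectSignal=1 -m 125 -n .asimov
```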
-You can also ask combine to use a **Pseudo-Asimov** dataset, which is created from many weighted unbinned events. +You can also ask Combine to use a **Pseudo-Asimov** dataset, which is created from many weighted unbinned events. Setting `--X-rtd TMCSO_AdaptivePseudoAsimov=`$\beta$ with $\beta>0$ will trigger the internal logic of whether to produce a Pseudo-Asimov dataset. This logic is as follows; - 1. For each observable in your dataset, the number of bins, $n_{b}$ is determined either from the value of `RooRealVar::getBins` if it exists or assumed to be 100. + 1. For each observable in your dataset, the number of bins, $n_{b}$ is determined either from the value of `RooRealVar::getBins`, if it exists, or assumed to be 100. - 2. If $N_{b}=\prod_{b}n_{b}>5000$, the number of expected events $N_{ev}$ is determined. *Note* if you are combining multiple channels, $N_{ev}$ refers to the number of expected events in a single channel, the logic is separate for each channel. If $N_{ev}/N_{b}<0.01$ then a Pseudo-Asimov dataset is created with the number of events equal to $\beta \cdot \mathrm{max}\{100*N_{ev},1000\}$. If $N_{ev}/N_{b}\geq 0.01$ , then a normal Asimov dataset is produced. + 2. If $N_{b}=\prod_{b}n_{b}>5000$, the number of expected events $N_{ev}$ is determined. *Note* if you are combining multiple channels, $N_{ev}$ refers to the number of expected events in a single channel. The logic is separate for each channel. If $N_{ev}/N_{b}<0.01$ then a Pseudo-Asimov data set is created with the number of events equal to $\beta \cdot \mathrm{max}\{100*N_{ev},1000\}$. If $N_{ev}/N_{b}\geq 0.01$ , then a normal Asimov data set is produced. - 3. If $N_{b}\leq 5000$ then a normal Asimov dataset will be produced + 3. If $N_{b}\leq 5000$ then a normal Asimov data set will be produced -The production of a Pseudo-Asimov dataset can be *forced* by using the option `--X-rtd TMCSO_PseudoAsimov=X` where `X>0` will determine the number of weighted events for the Pseudo-Asimov dataset. You should try different values of `X` since larger values leads to more events in the Pseudo-Asimov dataset resulting in higher precision but in general the fit will be slower. +The production of a Pseudo-Asimov data set can be *forced* by using the option `--X-rtd TMCSO_PseudoAsimov=X` where `X>0` will determine the number of weighted events for the Pseudo-Asimov data set. You should try different values of `X`, since larger values lead to more events in the Pseudo-Asimov data set, resulting in higher precision. However, in general, the fit will be slower. -You can turn off the internal logic by setting `--X-rtd TMCSO_AdaptivePseudoAsimov=0 --X-rtd TMCSO_PseudoAsimov=0` thereby forcing histograms to be generated. +You can turn off the internal logic by setting `--X-rtd TMCSO_AdaptivePseudoAsimov=0 --X-rtd TMCSO_PseudoAsimov=0`, thereby forcing histograms to be generated. !!! info - If you set `--X-rtd TMCSO_PseudoAsimov=X` with `X>0` and also turn on `--X-rtd TMCSO_AdaptivePseudoAsimov=`$\beta$, with $\beta>0$, the internal logic will be used but this time the default will be to generate Pseudo-Asimov datasets, rather than the normal Asimov ones. + If you set `--X-rtd TMCSO_PseudoAsimov=X` with `X>0` and also turn on `--X-rtd TMCSO_AdaptivePseudoAsimov=`$\beta$, with $\beta>0$, the internal logic will be used, but this time the default will be to generate Pseudo-Asimov data sets, rather than the standard Asimov ones. 
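As an illustration of the runtime flags above (again treating the workspace name and mass value as placeholders), a Pseudo-Asimov data set with 5000 weighted events could be forced with:

```sh
# A sketch: bypass the adaptive logic and force a Pseudo-Asimov data set with 5000 weighted events
combine -M AsymptoticLimits workspace.root -m 125 --X-rtd TMCSO_PseudoAsimov=5000
```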
### Nuisance parameter generation -The default method of dealing with systematics is to generate random values (around their nominal values, see above) for the nuisance parameters, according to their prior pdfs centred around their default values, *before* generating the data. The *unconstrained* nuisance parameters (eg `flatParam` or `rateParam`) or those with *flat* priors are **not** randomised before the data generation. If you wish to also randomise these parameters, you **must** declare these as `flatParam` in your datacard and when running text2workspace you must add the option `--X-assign-flatParam-prior` in the command line. +The default method of handling systematics is to generate random values (around their nominal values, see above) for the nuisance parameters, according to their prior PDFs centred around their default values, *before* generating the data. The *unconstrained* nuisance parameters (eg `flatParam` or `rateParam`), or those with *flat* priors are **not** randomized before the data generation. If you wish to also randomize these parameters, you **must** declare them as `flatParam` in your datacard and, when running text2workspace, you must add the option `--X-assign-flatParam-prior` to the command line. -The following are options which define how the toys will be generated, +The following options define how the toys will be generated, - * `--toysNoSystematics` the nuisance parameters in each toy are *not* randomised when generating the toy datasets - i.e their nominal values are used to generate the data. Note that for methods which profile (fit) the nuisances, the parameters are still floating when evaluating the likelihood. + * `--toysNoSystematics` the nuisance parameters in each toy are *not* randomized when generating the toy data sets - i.e their nominal values are used to generate the data. Note that for methods which profile (fit) the nuisances, the parameters are still floating when evaluating the likelihood. - * `--toysFrequentist` the nuisance parameters in each toy are set to their nominal values which are obtained *after fitting first to the data*, with POIs fixed, before generating the data. For evaluating likelihoods, the constraint terms are instead randomised within their pdfs around the post-fit nuisance parameter values. + * `--toysFrequentist` the nuisance parameters in each toy are set to their nominal values which are obtained *after first fitting to the observed data*, with the POIs fixed, before generating the toy data sets. For evaluating likelihoods, the constraint terms are instead randomized within their PDFs around the post-fit nuisance parameter values. -If you are using `toysFrequentist`, be aware that the values set by `--setParameters` will be *ignored* for the toy generation as the *post-fit* values will instead be used (except for any parameter which is also a parameter of interest). You can override this behaviour and choose the nominal values for toy generation for any parameter by adding the option `--bypassFrequentistFit` which will skip the initial fit to data or by loading a snapshot (see below). +If you are using `toysFrequentist`, be aware that the values set by `--setParameters` will be *ignored* for the toy generation as the *post-fit* values will instead be used (except for any parameter that is also a parameter of interest). 
You can override this behaviour and choose the nominal values for toy generation for any parameter by adding the option `--bypassFrequentistFit`, which will skip the initial fit to data, or by loading a snapshot (see below). !!! warning - The methods such as `AsymptoticLimits` and `HybridNew --LHCmode LHC-limits`, the "nominal" nuisance parameter values are taken from fits to the data and are therefore not "blind" to the observed data by default (following the fully frequentist paradigm). See the detailed documentation on these methods for avoiding this and running in a completely "blind" mode. + For methods such as `AsymptoticLimits` and `HybridNew --LHCmode LHC-limits`, the "nominal" nuisance parameter values are taken from fits to the data and are, therefore, not "blind" to the observed data by default (following the fully frequentist paradigm). See the detailed documentation on these methods for how to run in fully "blinded" mode. ### Generate only -It is also possible to generate the toys first and then feed them to the Methods in combine. This can be done using `-M GenerateOnly --saveToys`. The toys can then be read and used with the other methods by specifying `--toysFile=higgsCombineTest.GenerateOnly...` and using the same options for the toy generation. +It is also possible to generate the toys first, and then feed them to the methods in Combine. This can be done using `-M GenerateOnly --saveToys`. The toys can then be read and used with the other methods by specifying `--toysFile=higgsCombineTest.GenerateOnly...` and using the same options for the toy generation. !!! warning - Some Methods also use toys within the method itself (eg `AsymptoticLimits` and `HybridNew`). For these, you should **not** specify the toy generation with `-t` or the options above and instead follow the specific instructions. + Some methods also use toys within the method itself (eg `AsymptoticLimits` and `HybridNew`). For these, you should **not** specify the toy generation with `-t` or the options above. Instead, you should follow the method-specific instructions. ### Loading snapshots -Snapshots from workspaces can be loaded and used in order to generate toys using the option `--snapshotName `. This will first set the parameters to the values in the snapshot *before* any other parameter options are set and toys are generated. +Snapshots from workspaces can be loaded and used in order to generate toys using the option `--snapshotName `. This will first set the parameters to the values in the snapshot, *before* any other parameter options are set and toys are generated. See the section on [saving post-fit workspaces](/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#using-best-fit-snapshots) for creating workspaces with post-fit snapshots from `MultiDimFit`. 
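The examples below read a snapshot from `higgsCombinemumhfit.MultiDimFit.mH125.root`. A sketch of how such a file might first be produced, assuming a physics model in which both `r` and `MH` are parameters of interest and using the `--saveWorkspace` option described in the linked section:

```sh
# A sketch: fit the data with MultiDimFit and save a post-fit snapshot (named "MultiDimFit")
# into the output workspace higgsCombinemumhfit.MultiDimFit.mH125.root
combine -M MultiDimFit workspace.root -m 125 --saveWorkspace -n mumhfit
```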
Here are a few examples of calculations with toys from post-fit workspaces using a workspace with $r, m_{H}$ as parameters of interest -- Throw post-fit toy with b from s+b(floating $r,m_{H}$) fit, s with **r=1.0**, **m=best fit MH**, using nuisance values and constraints re-centered on s+b(floating $r,m_{H}$) fit values (aka frequentist post-fit expected) and compute post-fit expected **r** uncertainty profiling **MH** +- Throw post-fit toy with b from s+b(floating $r,m_{H}$) fit, s with **r=1.0**, **m=best fit MH**, using nuisance parameter values and constraints re-centered on s+b(floating $r,m_{H}$) fit values (aka frequentist post-fit expected) and compute post-fit expected **r** uncertainty profiling **MH** `combine higgsCombinemumhfit.MultiDimFit.mH125.root --snapshotName MultiDimFit -M MultiDimFit --verbose 9 -n randomtest --toysFrequentist --bypassFrequentistFit -t -1 --expectSignal=1 -P r --floatOtherPOIs=1 --algo singles` -- Throw post-fit toy with b from s+b(floating $r,m_{H}$) fit, s with **r=1.0, m=128.0**, using nuisance values and constraints re-centered on s+b(floating $r,m_{H}$) fit values (aka frequentist post-fit expected) and compute post-fit expected significance (with **MH** fixed at 128 implicitly) +- Throw post-fit toy with b from s+b(floating $r,m_{H}$) fit, s with **r=1.0, m=128.0**, using nuisance parameter values and constraints re-centered on s+b(floating $r,m_{H}$) fit values (aka frequentist post-fit expected) and compute post-fit expected significance (with **MH** fixed at 128 implicitly) `combine higgsCombinemumhfit.MultiDimFit.mH125.root -m 128 --snapshotName MultiDimFit -M ProfileLikelihood --significance --verbose 9 -n randomtest --toysFrequentist --bypassFrequentistFit --overrideSnapshotMass -t -1 --expectSignal=1 --redefineSignalPOIs r --freezeParameters MH` -- Throw post-fit toy with b from s+b(floating $r,m_{H}$) fit, s with **r=0.0**, using nuisance values and constraints re-centered on s+b(floating $r,m_{H}$) fit values (aka frequentist post-fit expected) and compute post-fit expected and observed asymptotic limit (with **MH** fixed at 128 implicitly) +- Throw post-fit toy with b from s+b(floating $r,m_{H}$) fit, s with **r=0.0**, using nuisance parameter values and constraints re-centered on s+b(floating $r,m_{H}$) fit values (aka frequentist post-fit expected) and compute post-fit expected and observed asymptotic limit (with **MH** fixed at 128 implicitly) `combine higgsCombinemumhfit.MultiDimFit.mH125.root -m 128 --snapshotName MultiDimFit -M AsymptoticLimits --verbose 9 -n randomtest --bypassFrequentistFit --overrideSnapshotMass--redefineSignalPOIs r --freezeParameters MH` ## combineTool for job submission -For longer tasks which cannot be run locally, several methods in combine can be split to run on the *LSF batch* or the *Grid*. The splitting and submission is handled using the `combineTool` (see [this getting started](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool) section to get the tool) +For longer tasks that cannot be run locally, several methods in Combine can be split to run on a *batch* system or on the *Grid*. 
The splitting and submission is handled using the `combineTool` (see [this getting started](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool) section to check out the tool) ### Submission to Condor @@ -236,40 +236,40 @@ The syntax for running on condor with the tool is combineTool.py -M ALGO [options] --job-mode condor --sub-opts='CLASSADS' --task-name NAME [--dry-run] ``` -with `options` being the usual list of `combine` options. The help option `-h` will give a list of both `combine` and `combineTool` sets of options. This can be used with several different methods from `combine`. +with `options` being the usual list of Combine options. The help option `-h` will give a list of both Combine and `combineTool` options. It is possible to use this tool with several different methods from Combine. The `--sub-opts` option takes a string with the different ClassAds that you want to set, separated by `\n` as argument (e.g. `'+JobFlavour="espresso"\nRequestCpus=1'`). The `--dry-run` option will show what will be run without actually doing so / submitting the jobs. -For example, to generate toys (eg for use with limit setting) users running on lxplus at CERN the **condor** mode can be used eg +For example, to generate toys (eg for use with limit setting) users running on lxplus at CERN can use the **condor** mode: ```sh combineTool.py -d workspace.root -M HybridNew --LHCmode LHC-limits --clsAcc 0 -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveHybridResult -m 125 --job-mode condor --task-name condor-test --sub-opts='+JobFlavour="tomorrow"' ``` -The `--singlePoint` option is over-ridden so that this will produce a script for each value of the POI in the range 0.2 to 2.0 in steps of 0.05. You can merge multiple points into a script using `--merge` - e.g adding `--merge 10` to the above command will mean that each job contains *at most* 10 of the values. The scripts are labelled by the `--task-name` option. These will be submitted directly to condor adding any options in `--sub-opts` to the condor submit script. Make sure multiple options are separated by `\n`. The jobs will run and produce output in the **current directory**. +The `--singlePoint` option is over-ridden, so that this will produce a script for each value of the POI in the range 0.2 to 2.0 in steps of 0.05. You can merge multiple points into a script using `--merge` - e.g adding `--merge 10` to the above command will mean that each job contains *at most* 10 of the values. The scripts are labelled by the `--task-name` option. They will be submitted directly to condor, adding any options in `--sub-opts` to the condor submit script. Make sure multiple options are separated by `\n`. The jobs will run and produce output in the **current directory**. Below is an example for splitting points in a multi-dimensional likelihood scan. #### Splitting jobs for a multi-dimensional likelihood scan -The option `--split-points` issues the command to split the jobs for `MultiDimFit` when using `--algo grid`. The following example will split the jobs such that there are **10 points** in each of the jobs, which will be submitted to the **8nh** queue. +The option `--split-points` issues the command to split the jobs for `MultiDimFit` when using `--algo grid`. The following example will split the jobs such that there are **10 points** in each of the jobs, which will be submitted to the **workday** queue. 
```sh combineTool.py datacard.txt -M MultiDimFit --algo grid --points 50 --rMin 0 --rMax 1 --job-mode condor --split-points 10 --sub-opts='+JobFlavour="workday"' --task-name mytask -n mytask ``` -Remember, any usual options (such as redefining POIs or freezing parameters) are passed to combine and can be added to the command line for `combineTool`. +Remember, any usual options (such as redefining POIs or freezing parameters) are passed to Combine and can be added to the command line for `combineTool`. !!! info - The option `-n NAME` should be included to avoid overwriting output files as the jobs will be run inside the directory from which the command is issued. + The option `-n NAME` should be included to avoid overwriting output files, as the jobs will be run inside the directory from which the command is issued. ### Grid submission with combineTool For more CPU-intensive tasks, for example determining limits for complex models using toys, it is generally not feasible to compute all the results interactively. Instead, these jobs can be submitted to the Grid. -In this example we will use the `HybridNew` method of combine to determine an upper limit for a sub-channel of the Run 1 SM $H\rightarrow\tau\tau$ analysis. For full documentation, see the section on [computing limits with toys](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#computing-limits-with-toys). +In this example we will use the `HybridNew` method of Combine to determine an upper limit for a sub-channel of the Run 1 SM $H\rightarrow\tau\tau$ analysis. For full documentation, see the section on [computing limits with toys](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#computing-limits-with-toys). With this model it would take too long to find the limit in one go, so instead we create a set of jobs in which each one throws toys and builds up the test statistic distributions for a fixed value of the signal strength. These jobs can then be submitted to a batch system or to the Grid using `crab3`. From the set of output distributions it is possible to extract the expected and observed limits. @@ -282,7 +282,7 @@ $ text2workspace.py data/tutorials/htt/125/htt_mt.txt -m 125 $ mv data/tutorials/htt/125/htt_mt.root ./ ``` -To get an idea of the range of signal strength values we will need to build test-statistic distributions for we will first use the `AsymptoticLimits` method of combine, +To get an idea of the range of signal strength values we will need to build test-statistic distributions for, we will first use the `AsymptoticLimits` method of Combine, ```nohighlight $ combine -M Asymptotic htt_mt.root -m 125 @@ -299,9 +299,9 @@ Expected 97.5%: r < 1.7200 Based on this, a range of 0.2 to 2.0 should be suitable. -We can use the same command for generating the distribution of test statistics with `combineTool`. The `--singlePoint` option is now enhanced to support expressions that generate a set of calls to combine with different values. The accepted syntax is of the form **MIN:MAX:STEPSIZE**, and multiple comma-separated expressions can be specified. +We can use the same command for generating the distribution of test statistics with `combineTool`. The `--singlePoint` option is now enhanced to support expressions that generate a set of calls to Combine with different values. The accepted syntax is of the form **MIN:MAX:STEPSIZE**, and multiple comma-separated expressions can be specified. 
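For instance, multiple comma-separated ranges with different step sizes could be combined in a single call (a sketch reusing the workspace above; the `--dry-run` flag, described next, just prints the commands):

```sh
# A sketch: finer scan spacing below r = 1.0, coarser spacing above it
combineTool.py -M HybridNew -d htt_mt.root --LHCmode LHC-limits --singlePoint 0.2:1.0:0.05,1.0:2.0:0.2 -T 2000 -s -1 --saveToys --saveHybridResult -m 125 --dry-run
```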
-The script also adds an option `--dry-run` which will not actually call combine but just prints out the commands that would be run, e.g,
+The script also adds an option `--dry-run`, which will not actually call Combine but just prints out the commands that would be run, e.g,

```sh
combineTool.py -M HybridNew -d htt_mt.root --LHCmode LHC-limits --singlePoint 0.2:2.0:0.2 -T 2000 -s -1 --saveToys --saveHybridResult -m 125 --dry-run
```

When the `--dry-run` option is removed each command will be run in sequence.

### Grid submission with crab3

-Submission to the grid with `crab3` works in a similar way. Before doing so ensure that the `crab3` environment has been sourced in addition to the CMSSW environment. We will use the example of generating a grid of test-statistic distributions for limits.
+Submission to the grid with `crab3` works in a similar way. Before doing so, ensure that the `crab3` environment has been sourced in addition to the CMSSW environment. We will use the example of generating a grid of test-statistic distributions for limits.

```sh
$ cmsenv; source /cvmfs/cms.cern.ch/crab3/crab.sh
```

def custom_crab(config):
  config.Site.blacklist = ['SOME_SITE', 'SOME_OTHER_SITE']
```

-Again it is possible to use the option `--dry-run` to see what the complete crab config will look like before actually submitted it.
+Again it is possible to use the option `--dry-run` to see what the complete crab config will look like before actually submitting it.

-Once submitted the progress can be monitored using the standard `crab` commands. When all jobs are completed copy the output from your sites storage element to the local output folder.
+Once submitted, the progress can be monitored using the standard `crab` commands. When all jobs are completed, copy the output from your site's storage element to the local output folder.

```sh
$ crab getoutput -d crab_grid-test
$ mv higgsCombine*.root ../../
$ cd ../../
```

-These output files should be combined with `hadd`, after which we invoke combine as usual to calculate observed and expected limits from the merged grid as usual.
+These output files should be combined with `hadd`, after which we invoke Combine as usual to calculate observed and expected limits from the merged grid.
diff --git a/docs/part3/simplifiedlikelihood.md b/docs/part3/simplifiedlikelihood.md
index 382dda73a49..cde56c9fee1 100644
--- a/docs/part3/simplifiedlikelihood.md
+++ b/docs/part3/simplifiedlikelihood.md
@@ -4,11 +4,11 @@ This page is to give a brief outline for the creation of (potentially aggregated

## Requirements

-You need an up to date version of combine. Note You should use the latest release of combine for the exact commands on this page. You should be using combine tag `v9.0.0` or higher or the latest version of the `112x` branch to follow these instructions.
+You need an up-to-date version of Combine. Note: you should use the latest release of Combine for the exact commands on this page. You should be using Combine tag `v9.0.0` or higher, or the latest version of the `112x` branch, to follow these instructions.
-You will find the python scripts needed to convert combine outputs into simplified likelihood inputs under `test/simplifiedLikelihood` +You will find the python scripts needed to convert Combine outputs into simplified likelihood inputs under `test/simplifiedLikelihood` -If you're using the `102x` branch (not reccomended), then you can obtain these scripts from here by running: +If you're using the `102x` branch (not recommended), then you can obtain these scripts from here by running: ``` curl -s https://raw.githubusercontent.com/nucleosynthesis/work-tools/master/sparse-checkout-SL-ssh.sh > checkoutSL.sh bash checkoutSL.sh @@ -22,7 +22,7 @@ git clone https://gitlab.cern.ch/SimplifiedLikelihood/SLtools.git ## Producing covariance for recasting -Producing the necessary predictions and covariance for recasting varies depending on whether control regions are explicitly included in the datacard when running fits. Instructions for cases where the control regions *are* and *are not* included are detailed below. +Producing the necessary predictions and covariance for recasting varies depending on whether or not control regions are explicitly included in the datacard when running fits. Instructions for cases where the control regions *are* and *are not* included are detailed below. !!! warning The instructions below will calculate moments based on the assumption that $E[x]=\hat{x}$, i.e it will use the maximum likelihood estimators for the yields as the expectation values. If instead you want to use the full definition of the moments, you can run the `FitDiagnostics` method with the `-t` option and include `--savePredictionsPerToy` and remove the other options, which will produce a tree of the toys in the output from which moments can be calculated. @@ -35,14 +35,14 @@ For an example datacard 'datacard.txt' including two signal channels 'Signal1' a text2workspace.py --channel-masks --X-allow-no-signal --X-allow-no-background datacard.txt -o datacard.root ``` -Run the fit making the covariance (output saved as `fitDiagnostics.root`) masking the Signal channel. Note that all signal channels must be masked! +Run the fit making the covariance (output saved as `fitDiagnostics.root`) masking the signal channels. Note that all signal channels must be masked! ``` combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 2000 --setParameters mask_Signal1=1,mask_Signal2=1 --saveOverall -N Name ``` Where "Name" can be specified by you. -Outputs including predictions and covariance will be saved in `fitDiagnosticsName.root` folder `shapes_fit_b` +Outputs, including predictions and covariance, will be saved in `fitDiagnosticsName.root` folder `shapes_fit_b` ### Type B - Control regions not included in datacard @@ -52,16 +52,16 @@ For an example datacard 'datacard.txt' including two signal channels 'Signal1' a text2workspace.py --X-allow-no-signal --X-allow-no-background datacard.txt -o datacard.root ``` -Run the fit making the covariance (output saved as `fitDiagnosticsName.root`) setting no signal contribution in the prefit. Note we *must* set `--preFitValue 0` in this case since we will be using the pre-fit uncertainties for the covariance calculation and we don't want to include the uncertainties on the signal. +Run the fit making the covariance (output saved as `fitDiagnosticsName.root`) setting no pre-fit signal contribution. 
Note we *must* set `--preFitValue 0` in this case since we will be using the pre-fit uncertainties for the covariance calculation and we do not want to include the uncertainties on the signal.

```
combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 2000 --saveOverall --preFitValue 0 -n Name
```
Where "Name" can be specified by you.

-Outputs including predictions and covariance will be saved in `fitDiagnosticsName.root` folder `shapes_prefit`
+Outputs, including predictions and covariance, will be saved in `fitDiagnosticsName.root` folder `shapes_prefit`

-In order to also pull out the signal yields corresponding to `r=1` (in case you want to run the validation step later), you also need to produce a second file with the prefit value set to 1. For this you don't need to run many toys so to save time, just set `--numToysForShape` to some low value.
+In order to also extract the signal yields corresponding to `r=1` (in case you want to run the validation step later), you also need to produce a second file with the pre-fit value set to 1. For this you do not need to run many toys. To save time you can set `--numToysForShape` to a low value.

```
combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 1 --saveOverall --preFitValue 1 -n Name2
```

@@ -72,12 +72,12 @@ You should check that the order of the bins in the covariance matrix is as expec

## Produce simplified likelihood inputs

-Head over to the `test/simplifiedLikelihoods` directory inside your combine area. The following instructions depend on whether you are aggregating or not aggregating your signal regions. Choose the instructions for your case.
+Head over to the `test/simplifiedLikelihoods` directory inside your Combine area. The following instructions depend on whether you are aggregating or not aggregating your signal regions. Choose the instructions for your case.

### Not Aggregating

-Run the `makeLHInputs.py` script to prepare the inputs for the simplified likelihood. The filter flag can be used to select only signal regions based on the channel names. To include all channels don't include the filter flag.
+Run the `makeLHInputs.py` script to prepare the inputs for the simplified likelihood. The filter flag can be used to select only signal regions based on the channel names. To include all channels do not include the filter flag.

-The SL input must NOT include any control regions which were not masked in the fit.
+The SL input must NOT include any control regions that were not masked in the fit.

If your analysis is Type B (i.e everything in the datacard is a signal region), then you can just run

```
python makeLHInputs.py -i fitDiagnosticsName.root -o SLinput.root
```

-If necessary (i.e as in Type B analyses) you may also need to run the same on the run where the prefit value was set to 1.
+If necessary (i.e. as in Type B analyses) you may also need to run the same on the output of the run where the pre-fit value was set to 1.

```
python makeLHInputs.py -i fitDiagnosticsName2.root -o SLinput2.root
```

-If you instead have a Type A analysis (some of the regions are control regions that were used to fit but not masked) then you should add the option `--filter SignalName` where `SignalName` is some string that defines the signal regions in your datacards (eg "SR" is a common name for these).
+If you instead have a Type A analysis (some of the regions are control regions that were used to fit but not masked) then you should add the option `--filter SignalName` where `SignalName` is some string that defines the signal regions in your datacards (for example, "SR" is a common name for these). -Note: If your signal regions cannot be easily identified by a string, follow the instructions below for aggregating but define only one channel for each aggregate region which will maintain the full information and won't actually aggregate any regions. +Note: If your signal regions cannot be easily identified by a string, follow the instructions below for aggregating, but define only one channel for each aggregate region. This will maintain the full information and will not actually aggregate any regions. ### Aggregating -If aggregating based on covariance edit the config file `aggregateCFG.py` to define aggregate regions based on channel names, note that wildcards are supported. You can then make likelihood inputs using +If aggregating based on covariance, edit the config file `aggregateCFG.py` to define aggregate regions based on channel names. Note that wildcards are supported. You can then make likelihood inputs using ``` python makeLHInputs.py -i fitDiagnosticsName.root -o SLinput.root --config aggregateCFG.py ``` -At this point you now have the inputs as ROOT files necessary to publish and run the simplified likelihood. +At this point you have the inputs as ROOT files necessary to publish and run the simplified likelihood. ## Validating the simplified likelihood approach -The simplified likelihood relies on several assumptions (detailed in the documentation at the top). To test the validity for your analysis, statistical results between combine and the simplified likelihood can be compared. +The simplified likelihood relies on several assumptions (detailed in the documentation at the top). To test the validity for your analysis, statistical results between Combine and the simplified likelihood can be compared. We will use the package [SLtools](https://gitlab.cern.ch/SimplifiedLikelihood/SLtools/-/blob/master/README.md) from the [Simplified Likelihood Paper](https://link.springer.com/article/10.1007/JHEP04(2019)064) for this. The first step is to convert the ROOT files into python configs to run in the tool. @@ -121,13 +121,13 @@ If you followed the steps above, you have all of the histograms already necessar * `-d/--data` : The data TGraph, should be of format `file.root:location/to/graph` * `-c/--covariance` : The covariance TH2 histogram, should be of format `file.root:location/to/histogram` -For example to get the correct output from a Type B analysis with no Aggregating, you can run +For example, to get the correct output from a Type B analysis with no aggregating, you can run ```sh python test/simplifiedLikelihoods/convertSLRootToPython.py -O mymodel.py -s SLinput.root:shapes_prefit/total_signal -b SLinput.root:shapes_prefit/total_M2 d -d SLinput.root:shapes_prefit/total_data -c SLinput.root:shapes_prefit/total_M2 ``` -The output will be a python file with the right format for the SL tool. You can mix different ROOT files for these inputs. Note that the SLtools package also has some tools to covert `.yaml` based inputs into the python config for you. +The output will be a python file with the right format for the SL tool. You can mix different ROOT files for these inputs. 
Note that the `SLtools` package also has some tools to covert `.yaml`-based inputs into the python config for you. ### Run a likelihood scan with the SL @@ -149,7 +149,7 @@ plt.plot(mus,tmus1) plt.show() ``` -Where, the `mymodel.py` config is a simple python file defined as; +Where the `mymodel.py` config is a simple python file defined as; * `data` : A python array of observed data, one entry per bin. * `background` : A python array of expected background, one entry per bin. @@ -178,26 +178,26 @@ covariance = array.array('d', [ 18774.2, -2866.97, -5807.3, -4460.52, -2777.25, ## Example using tutorial datacard -For this example, we'll use the tutorial datacard `data/tutorials/longexercise/datacard_part3.txt`. This datacard is of **Type B** since there are no control regions (all regions are signal regions). +For this example, we will use the tutorial datacard `data/tutorials/longexercise/datacard_part3.txt`. This datacard is of **Type B** since there are no control regions (all regions are signal regions). -First, we'll create the binary file (run `text2workspace`) +First, we will create the binary file (run `text2workspace`) ``` text2workspace.py --X-allow-no-signal --X-allow-no-background data/tutorials/longexercise/datacard_part3.txt -m 200 ``` -And next, we'll generate the covariance between the bins of the background model. +And next, we will generate the covariance between the bins of the background model. ``` combine data/tutorials/longexercise/datacard_part3.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 10000 --saveOverall --preFitValue 0 -n SimpleTH1 -m 200 combine data/tutorials/longexercise/datacard_part3.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 1 --saveOverall --preFitValue 1 -n SimpleTH1_Signal1 -m 200 ``` -We will also want to compare our scan to that from the full likelihood, which we can get as usual from combine. +We will also want to compare our scan to that from the full likelihood, which we can get as usual from Combine. ``` combine -M MultiDimFit data/tutorials/longexercise/datacard_part3.root --rMin -0.5 --rMax 2 --algo grid -n SimpleTH1 -m 200 ``` -Next, since we don't plan to aggregate any of the bins, we'll follow the instructions for this and pick out the right covariance matrix. +Next, since we do not plan to aggregate any of the bins, we will follow the instructions for this and pick out the right covariance matrix. ``` python test/simplifiedLikelihoods/makeLHInputs.py -i fitDiagnosticsSimpleTH1.root -o SLinput.root @@ -205,7 +205,7 @@ python test/simplifiedLikelihoods/makeLHInputs.py -i fitDiagnosticsSimpleTH1.roo python test/simplifiedLikelihoods/makeLHInputs.py -i fitDiagnosticsSimpleTH1_Signal1.root -o SLinput_Signal1.root ``` -We now have everything we need to provide the simplified likelihood inputs. E.G +We now have everything we need to provide the simplified likelihood inputs: ``` $ root -l SLinput.root @@ -221,13 +221,13 @@ TFile** SLinput.root KEY: TDirectoryFile shapes_fit_s;1 shapes_fit_s ``` -We can convert this to a python module that we can use to run a scan with the SLtools package. Note, since we have a **Type B** datacard, we'll be using the *pre-fit* covariance matrix. Also, this means we want to take the signal from the file where the prefit value of `r` was 1. +We can convert this to a python module that we can use to run a scan with the `SLtools` package. Note, since we have a **Type B** datacard, we will be using the *pre-fit* covariance matrix. 
Also, this means we want to take the signal from the file where the prefit value of `r` was 1.

```
python test/simplifiedLikelihoods/convertSLRootToPython.py -O mymodel.py -s SLinput_Signal1.root:shapes_prefit/total_signal -b SLinput.root:shapes_prefit/total_M1 -d SLinput.root:shapes_prefit/total_data -c SLinput.root:shapes_prefit/total_M2
```

-Let's compare the profiled likelihood scans from our simplified likelihood (using the python file we just created) and from the full likelihood (that we created with combine.). For the former, we need to first checkout the SLtools package
+We can compare the profiled likelihood scans from our simplified likelihood (using the python file we just created) and from the full likelihood (that we created with Combine). For the former, we first need to check out the `SLtools` package

```
git clone https://gitlab.cern.ch/SimplifiedLikelihood/SLtools.git
```

@@ -277,4 +277,4 @@ This will produce a figure like the one below.

![](SLexample.jpg)

-It is also possible to include the 3rd moment of each bin to improve the precision of the simplified likelihood [ [JHEP 64 2019](https://link.springer.com/article/10.1007/JHEP04(2019)064) ]. The necessary information is stored in the outputs from combine so you just need to include the option `-t SLinput.root:shapes_prefit/total_M3` in the options list for `convertSLRootToPython.py` to include this in the model file. The 3rd moment information can be included in SLtools by using ` sl.SLParams(background, covariance, third_moment, obs=data, sig=signal)`
+It is also possible to include the third moment of each bin to improve the precision of the simplified likelihood [ [JHEP 64 2019](https://link.springer.com/article/10.1007/JHEP04(2019)064) ]. The necessary information is stored in the outputs from Combine, therefore you just need to include the option `-t SLinput.root:shapes_prefit/total_M3` in the options list for `convertSLRootToPython.py` to include this in the model file. The third moment information can be included in `SLtools` by using `sl.SLParams(background, covariance, third_moment, obs=data, sig=signal)`
diff --git a/docs/part3/validation.md b/docs/part3/validation.md
index 64afe41beaf..06499c3fadb 100644
--- a/docs/part3/validation.md
+++ b/docs/part3/validation.md
@@ -1,12 +1,12 @@
# Validating datacards

-This section covers the main features of the datacard validation tool which helps you spot potential problems with your datacards at an early stage. The tool is implemented
+This section covers the main features of the datacard validation tool that helps you spot potential problems with your datacards at an early stage. The tool is implemented
in the [`CombineHarvester/CombineTools`](https://github.com/cms-analysis/CombineHarvester/blob/113x/CombineTools) subpackage. See the
[`combineTool`](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#combine-tool) section of the documentation for checkout instructions.

-The datacard validation tool contains a number of checks. It is possible to call sub-sets of these checks when creating datacards within CombineHarvester. However, for now we will only
-describe the usage of the validation tool on already existing datacards. If you create your datacards with CombineHarvester and would like to include the checks at the datacard creation
-stage, please contact us via [https://hypernews.cern.ch/HyperNews/CMS/get/higgs-combination.html](https://hypernews.cern.ch/HyperNews/CMS/get/higgs-combination.html).
+The datacard validation tool contains a number of checks.
It is possible to call subsets of these checks when creating datacards within `CombineHarvester`. However, for now we will only +describe the usage of the validation tool on already existing datacards. If you create your datacards with `CombineHarvester` and would like to include the checks at the datacard creation +stage, please contact us via [https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279](https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279). ## How to use the tool @@ -53,7 +53,7 @@ optional arguments: --readOnly If this is enabled, skip validation and only read the output json --checkUncertOver CHECKUNCERTOVER, -c CHECKUNCERTOVER - Report uncertainties which have a normalisation effect + Report uncertainties which have a normalization effect larger than this fraction (default:0.1) --reportSigUnder REPORTSIGUNDER, -s REPORTSIGUNDER Report signals contributing less than this fraction of @@ -63,21 +63,21 @@ optional arguments: --mass MASS Signal mass to use (default:*) ``` `printLevel` adjusts how much information is printed to the screen. When set to 0, the results are only written to the json file, but not to the screen. When set to 1 (default), the number of warnings/alerts -of a given type is printed to the screen. Setting this option to 2 prints the same information as level 1, and additionally which uncertainties are affected (if the check is related to uncertainties) or which processes are affected (if the check is related only to processes). When `printLevel` is set to 3, the information from level 2 is printed, and additionaly for checks related to uncertainties prints which processes are affected. +of a given type is printed to the screen. Setting this option to 2 prints the same information as level 1, and additionally prints which uncertainties are affected (if the check is related to uncertainties) or which processes are affected (if the check is related only to processes). When `printLevel` is set to 3, the information from level 2 is printed, and additionaly for checks related to uncertainties it prints which processes are affected. -To print information to screen, the script parses the json file which contains the results of the validation checks, so if you have already run the validation tool and produced this json file, you can simply change the `printLevel` by re-running the tool with `printLevel` set to a different value, and enabling the `--readOnly` option. +To print information to screen, the script parses the json file that contains the results of the validation checks. Therefore, if you have already run the validation tool and produced this json file, you can simply change the `printLevel` by re-running the tool with `printLevel` set to a different value, and enabling the `--readOnly` option. The options `--checkUncertOver` and `--reportSigUnder` will be described in more detail in the section that discusses the checks for which they are relevant. -Note: the `--mass` argument should only be set if you normally use it when running Combine, otherwise you can leave it at the default. +Note: the `--mass` argument should only be set if you normally use it when running Combine, otherwise you can leave it at the default. -The datacard validation tool is primarily intended for shape (histogram)-based analyses. However, when running on a parametric model or counting experiment the checks for small signal processes, empty processes and uncertainties with large normalisation effects will still be performed. 
+The datacard validation tool is primarily intended for shape (histogram) based analyses. However, when running on a parametric model or counting experiment the checks for small signal processes, empty processes, and uncertainties with large normalization effects can still be performed. ## Details on checks -### Uncertainties with large normalisation effect +### Uncertainties with large normalization effect -This check highlights nuisance parameters which have a normalisation effect larger than the fraction set by the setting `--checkUncertOver`. The default value is 0.1, meaning that any uncertainties with a normalisation +This check highlights nuisance parameters that have a normalization effect larger than the fraction set by the option `--checkUncertOver`. The default value is 0.1, meaning that any uncertainties with a normalization effect larger than 10% are flagged up. The output file contains the following information for this check: @@ -95,7 +95,7 @@ largeNormEff: { } ``` -Where `value_u` and `value_d` are the values of the 'up' and 'down' normalisation effects. +Where `value_u` and `value_d` are the values of the 'up' and 'down' normalization effects. ### At least one of the Up/Down systematic templates is empty @@ -116,11 +116,11 @@ emptySystematicShape: { } } ``` -Where `value_u` and `value_d` are the values of the 'up' and 'down' normalisation effects. +Where `value_u` and `value_d` are the values of the 'up' and 'down' normalization effects. ### Identical Up/Down templates -This check applies to shape uncertainties only, and will highlight cases where the shape uncertainties have identical Up and Down templates (identical in shape and in normalisation). +This check applies to shape uncertainties only, and will highlight cases where the shape uncertainties have identical Up and Down templates (identical in shape and in normalization). The information given in the output file for this check is: @@ -136,12 +136,12 @@ uncertTemplSame: { } } ``` -Where `value_u` and `value_d` are the values of the 'up' and 'down' normalisation effects. +Where `value_u` and `value_d` are the values of the 'up' and 'down' normalization effects. ### Up and Down templates vary the yield in the same direction -Again this check only applies to shape uncertainties - it highlights cases where the 'Up' template and the 'Down' template both have the effect of increasing or decreasing the normalisation of a process. +Again, this check only applies to shape uncertainties - it highlights cases where the 'Up' template and the 'Down' template both have the effect of increasing or decreasing the normalization of a process. The information given in the output file for this check is: @@ -158,15 +158,15 @@ uncertVarySameDirect: { } } ``` -Where `value_u` and `value_d` are the values of the 'up' and 'down' normalisation effects. +Where `value_u` and `value_d` are the values of the 'up' and 'down' normalization effects. ### Uncertainty probably has no genuine shape effect -In this check, applying only to shape uncertainties, the normalised nominal templates are compared with the normalised templates for the 'up' and 'down' systematic variations. The script calculates +In this check, applying only to shape uncertainties, the normalized nominal templates are compared with the normalized templates for the 'up' and 'down' systematic variations. 
The script calculates $$ \Sigma_i \frac{2|\text{up}(i) - \text{nominal}(i)|}{|\text{up}(i)| + |\text{nominal}(i)|}$$ and $$ \Sigma_i \frac{2|\text{down}(i) - \text{nominal}(i)|}{|\text{down}(i)| + |\text{nominal}(i)|} $$ -where the sums run over all bins in the histograms, and 'nominal', 'up', and 'down' are the central template and up and down varied templates, all normalised. +where the sums run over all bins in the histograms, and 'nominal', 'up', and 'down' are the central template and up and down varied templates, all normalized. If both sums are smaller than 0.001, the uncertainty is flagged up as probably not having a genuine shape effect. This means a 0.1% variation in one bin is enough to avoid being reported, but many smaller variations can also sum to be large enough to pass the threshold. It should be noted that the chosen threshold is somewhat arbitrary: if an uncertainty is flagged up as probably having no genuine shape effect you should take this as a starting point to investigate. @@ -204,9 +204,9 @@ emptyProcessShape: { ``` -### Bins which have signal but no background +### Bins that have signal but no background -For shape-based analyses, this checks whether there are any bins in the nominal templates which have signal contributions, but no background contributions. +For shape-based analyses, this checks whether there are any bins in the nominal templates that have signal contributions, but no background contributions. The information given in the output file for this check is: @@ -223,7 +223,7 @@ emptyBkgBin: { ### Small signal process -This reports signal processes which contribute less than the fraction specified by `--reportSigUnder` (default 0.001 = 0.1%) of the total signal in a given category. This produces an alert, not a warning, as it does not hint at a potential problem. +This reports signal processes that contribute less than the fraction specified by `--reportSigUnder` (default 0.001 = 0.1%) of the total signal in a given category. This produces an alert, not a warning, as it does not hint at a potential problem. However, in analyses with many signal contributions and with long fitting times, it can be helpful to remove signals from a category in which they do not contribute a significant amount. The information given in the output file for this check is: @@ -243,12 +243,12 @@ Where `sigrate_tot` is the total signal yield in the analysis category and `proc ## What to do in case of a warning -These checks are mostly a tool to help you investigate your datacards: a warning does not necessarily mean there is a mistake in your datacard, but you should use it as a starting point to investigate. Empty processes and emtpy shape uncertainties connected to nonempty processes will most likely be unintended. The same holds for cases where the 'up' and 'down' shape templates are identical. If there are bins which contain signal but no background contributions, this should be corrected. See the [FAQ](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part4/usefullinks/#faq) for more information on that point. +These checks are mostly a tool to help you investigate your datacards: a warning does not necessarily mean there is a mistake in your datacard, but you should use it as a starting point to investigate. Empty processes and empty shape uncertainties connected to nonempty processes will most likely be unintended. The same holds for cases where the 'up' and 'down' shape templates are identical.
If there are bins that contain signal but no background contributions, this should be corrected. See the [FAQ](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part4/usefullinks/#faq) for more information on that point. -For other checks it depends on where the check is fired whether there is a problem or not. Some examples: +For other checks it depends on the situation whether there is a problem or not. Some examples: -- An analysis-specific noncloser uncertainty could be larger than 10%. A theoretical uncertainty in the ttbar normalisation probably not. -- In an analysis with a selection that requires the presence of exactly 1 jet, 'up' and 'down' variations in the jet energy uncertainty *could* both change the process normalisation in the same direction. (But they don't have to!) +- An analysis-specific nonclosure uncertainty could be larger than 10%. A theoretical uncertainty in the ttbar normalization probably not. +- In an analysis with a selection that requires the presence of exactly 1 jet, 'up' and 'down' variations in the jet energy uncertainty *could* both change the process normalization in the same direction. (But they do not have to!) -As always: think about whether you expect a check to yield a warning in case of your analysis, and investigate to make sure. +As always: think about whether you expect a check to yield a warning in case of your analysis, and if not, investigate to make sure there are no issues. diff --git a/docs/part4/usefullinks.md b/docs/part4/usefullinks.md index 08900587eee..43d44bcc89d 100644 --- a/docs/part4/usefullinks.md +++ b/docs/part4/usefullinks.md @@ -2,7 +2,7 @@ ### Tutorials and reading material -There are several tutorials which have been run over the last few years with instructions and examples for running the combine tool. +There are several tutorials that have been run over the last few years with instructions and examples for running the Combine tool. Tutorial Sessions: @@ -15,7 +15,7 @@ Tutorial Sessions: * [7th tutorial 3rd Feb 2023](https://indico.cern.ch/event/1227742/) - Uses `113x` branch -Worked examples from Higgs analyses using combine: +Worked examples from Higgs analyses using Combine: * [The CMS DAS at CERN 2014](https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDataAnalysisSchool2014HiggsCombPropertiesExercise) * [The CMS DAS at DESY 2018](https://twiki.cern.ch/twiki/bin/view/CMS/SWGuideCMSDataAnalysisSchoolHamburg2018LongStatisticsExercise) @@ -25,11 +25,11 @@ Higgs combinations procedures * [Conventions to be used when preparing inputs for Higgs combinations](https://twiki.cern.ch/twiki/bin/view/CMS/HiggsWG/HiggsCombinationConventions) - * [CMS AN-2011/298](http://cms.cern.ch/iCMS/jsp/db_notes/noteInfo.jsp?cmsnoteid=CMS AN-2011/298) Procedure for the LHC Higgs boson search combination in summer 2011. This describes in more detail some of the methods used in Combine. + * [CMS AN-2011/298](http://cms.cern.ch/iCMS/jsp/db_notes/noteInfo.jsp?cmsnoteid=CMS AN-2011/298) Procedure for the LHC Higgs boson search combination in summer 2011. This describes in more detail some of the methods used in Combine. 
### Citations -There is no document currently which can be cited for using the combine tool, however you can use the following publications for the procedures we use, +There is no document currently which can be cited for using the Combine tool, however, you can use the following publications for the procedures we use, * [Summer 2011 public ATLAS-CMS note](https://cds.cern.ch/record/1379837) for any Frequentist limit setting procedures with toys or Bayesian limits, constructing likelihoods, descriptions of nuisance parameter options (like log-normals (`lnN`) or gamma (`gmN`), and for definitions of test-statistics. @@ -45,7 +45,7 @@ There is no document currently which can be cited for using the combine tool, ho ### Combine based packages -* [SWGuideHiggs2TauLimits](https://twiki.cern.ch/twiki/bin/view/CMS/SWGuideHiggs2TauLimits) +* [SWGuideHiggs2TauLimits](https://twiki.cern.ch/twiki/bin/view/CMS/SWGuideHiggs2TauLimits) (Deprecated) * [ATGCRooStats](https://twiki.cern.ch/twiki/bin/view/CMS/ATGCRooStats) @@ -53,7 +53,7 @@ There is no document currently which can be cited for using the combine tool, ho ### Contacts -* **Hypernews forum**: hn-cms-higgs-combination [https://hypernews.cern.ch/HyperNews/CMS/get/higgs-combination.html](https://hypernews.cern.ch/HyperNews/CMS/get/higgs-combination.html) +* **CMStalk forum**: [https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279](https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279) ### CMS Statistics Committee @@ -61,34 +61,34 @@ There is no document currently which can be cited for using the combine tool, ho # FAQ -* _Why does combine have trouble with bins that have zero expected contents?_ - * If you're computing only upper limits, and your zero-prediction bins are all empty in data, then you can just set the background to a very small value instead of zero as anyway the computation is regular for background going to zero (e.g. a counting experiment with $B\leq1$ will have essentially the same expected limit and observed limit as one with $B=0$). If you're computing anything else, e.g. p-values, or if your zero-prediction bins are not empty in data, you're out of luck, and you should find a way to get a reasonable background prediction there (and set an uncertainty on it, as per the point above) +* _Why does Combine have trouble with bins that have zero expected contents?_ + * If you are computing only upper limits, and your zero-prediction bins are all empty in data, then you can just set the background to a very small value instead of zero as the computation is regular for background going to zero (e.g. a counting experiment with $B\leq1$ will have essentially the same expected limit and observed limit as one with $B=0$). If you are computing anything else, e.g. p-values, or if your zero-prediction bins are not empty in data, you're out of luck, and you should find a way to get a reasonable background prediction there (and set an uncertainty on it, as per the point above) * _How can an uncertainty be added to a zero quantity?_ - * You can put an uncertainty even on a zero event yield if you use a gamma distribution. That's in fact the more proper way of doing it if the prediction of zero comes from the limited size of your MC or data sample used to compute it. + * You can put an uncertainty even on a zero event yield if you use a gamma distribution. That is in fact the more proper way of doing it if the prediction of zero comes from the limited size of your MC or data sample used to compute it. 
* _Why does changing the observation in data affect my expected limit?_ * The expected limit (if using either the default behaviour of `-M AsymptoticLimits` or using the `LHC-limits` style limit setting with toys) uses the _**post-fit**_ expectation of the background model to generate toys. This means that first the model is fit to the _**observed data**_ before toy generation. See the sections on [blind limits](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#blind-limits) and [toy generation](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#toy-data-generation) to avoid this behavior. * _How can I deal with an interference term which involves a negative contribution?_ - * You will need to set up a specific PhysicsModel to deal with this, however you can [see this section](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/physicsmodels/#interference) to implement such a model which can incorperate a negative contribution to the physics process -* _How does combine work?_ - * That is not a question which can be answered without someone's head exploding so please try to formulate something specific. + * You will need to set up a specific PhysicsModel to deal with this, however you can [see this section](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/physicsmodels/#interference) to implement such a model that can incorporate a negative contribution to the physics process. +* _How does Combine work?_ + * That is not a question that can be answered without someone's head exploding; please try to formulate something specific. * _What does fit status XYZ mean?_ - * Combine reports the fit status in some routines (for example in the `FitDiagnostics` method). These are typically the status of the last call from Minuit. For details on the meanings of these status codes see the [Minuit2Minimizer](https://root.cern.ch/root/html/ROOT__Minuit2__Minuit2Minimizer.html) documentation page. + * Combine reports the fit status in some routines (for example in the `FitDiagnostics` method). These are typically the status of the last call from Minuit. For details on the meanings of these status codes see the [Minuit2Minimizer](https://root.cern.ch/root/html/ROOT__Minuit2__Minuit2Minimizer.html) documentation page. * _Why does my fit not converge?_ - * There are several reasons why some fits may not converge. Often some indication can be obtained from the `RooFitResult` or status which you will see information from when using the `--verbose X` (with $X>2$) option. Sometimes however, it can be that the likelihood for your data is very unusual. You can get a rough idea about what the likelihood looks like as a function of your parameters (POIs and nuisances) using `combineTool.py -M FastScan -w myworkspace.root` (use --help for options). - * We have seen often that fits in combine using `RooCBShape` as a parametric function will fail. This is related to an optimisation that fails. You can try to fix the problem as described in this issue: [issues#347](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/issues/347) (i.e add the option `--X-rtd ADDNLL_CBNLL=0`). + * There are several reasons why some fits may not converge. Often some indication can be obtained from the `RooFitResult` or from the fit status, which you will see when using the `--verbose X` (with $X>2$) option. Sometimes however, it can be that the likelihood for your data is very unusual.
You can get a rough idea about what the likelihood looks like as a function of your parameters (POIs and nuisances) using `combineTool.py -M FastScan -w myworkspace.root` (use --help for options). + * We have often seen that fits in Combine using `RooCBShape` as a parametric function will fail. This is related to an optimization that fails. You can try to fix the problem as described in this issue: [issues#347](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/issues/347) (i.e add the option `--X-rtd ADDNLL_CBNLL=0`). * _Why does the fit/fits take so long?_ - * The minimisation routines are common to many methods in combine. You can tune the fitting using the generic optimisation command line options described [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#generic-minimizer-options). For example, setting the default minimizer strategy to 0 can greatly improve the speed since this avoids running Hesse. In calculations such as `AsymptoticLimits`, Hesse is not needed and hence this can be done, however, for `FitDiagnostics` the uncertainties and correlations are part of the output so using strategy 0 may not be particularly accurate. + * The minimization routines are common to many methods in Combine. You can tune the fits using the generic optimization command line options described [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#generic-minimizer-options). For example, setting the default minimizer strategy to 0 can greatly improve the speed, since this avoids running HESSE. In calculations such as `AsymptoticLimits`, HESSE is not needed and hence this can be done, however, for `FitDiagnostics` the uncertainties and correlations are part of the output, so using strategy 0 may not be particularly accurate. * _Why are the results for my counting experiment so slow or unstable?_ - * There is a known issue with counting experiments with ***large*** numbers of events which will cause unstable fits or even the fit to fail. You can avoid this by creating a "fake" shape datacard (see [this section](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/settinguptheanalysis/#combination-of-multiple-datacards) from the setting up the datacards page). The simplest way to do this is to run `combineCards.py -S mycountingcard.txt > myshapecard.txt`. You may still find that your parameter uncertainties are not correct when you have large numbers of events. This can be often fixed using the `--robustHesse` option. An example of this issue is detailed [here](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/issues/498). + * There is a known issue with counting experiments with ***large*** numbers of events that will cause unstable fits or even the fit to fail. You can avoid this by creating a "fake" shape datacard (see [this section](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/settinguptheanalysis/#combination-of-multiple-datacards) from the setting up the datacards page). The simplest way to do this is to run `combineCards.py -S mycountingcard.txt > myshapecard.txt`. You may still find that your parameter uncertainties are not correct when you have large numbers of events. This can be often fixed using the `--robustHesse` option. An example of this issue is detailed [here](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/issues/498). 
* _Why do some of my nuisance parameters have uncertainties > 1?_ - * When running `-M FitDiagnostics` you may find that the post-fit uncertainties of the nuisances are $> 1$ (or larger than their pre-fit values). If this is the case, you should first check if the same is true when adding the option `--minos all` which will invoke minos to scan the likelihood as a function of these parameters to determine the crossing at $-2\times\Delta\log\mathcal{L}=1$ rather than relying on the estimate from Hesse. However, this is not guaranteed to succeed, in which case you can scan the likelihood yourself using `MultiDimFit` (see [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#likelihood-fits-and-scans) ) and specifying the option `--poi X` where `X` is your nuisance parameter. + * When running `-M FitDiagnostics` you may find that the post-fit uncertainties of the nuisances are $> 1$ (or larger than their pre-fit values). If this is the case, you should first check if the same is true when adding the option `--minos all`, which will invoke MINOS to scan the likelihood as a function of these parameters to determine the crossing at $-2\times\Delta\log\mathcal{L}=1$ rather than relying on the estimate from HESSE. However, this is not guaranteed to succeed, in which case you can scan the likelihood yourself using `MultiDimFit` (see [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#likelihood-fits-and-scans) ) and specifying the option `--poi X` where `X` is your nuisance parameter. * _How can I avoid using the data?_ * For almost all methods, you can use toy data (or an Asimov dataset) in place of the real data for your results to be blind. You should be careful however as in some methods, such as `-M AsymptoticLimits` or `-M HybridNew --LHCmode LHC-limits` or any other method using the option `--toysFrequentist`, the data will be used to determine the most likely nuisance parameter values (to determine the so-called a-posteriori expectation). See the section on [toy data generation](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/runningthetool/#toy-data-generation) for details on this. * _What if my nuisance parameters have correlations which are not 0 or 1?_ - * Combine is designed under the assumption that each *source* of nuisance parameter is uncorrelated with the other sources. If you have a case where some pair (or set) of nuisances have some known correlation structure, you can compute the eigenvectors of their correlation matrix and provide these *diagonalised* nuisances to combine. You can also model *partial correlations*, between different channels or data taking periods, of a given nuisance parameter using the `combineTool` as described in [this page](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/issues/503). + * Combine is designed under the assumption that each *source* of nuisance parameter is uncorrelated with the other sources. If you have a case where some pair (or set) of nuisances have some known correlation structure, you can compute the eigenvectors of their correlation matrix and provide these *diagonalised* nuisances to Combine. You can also model *partial correlations*, between different channels or data taking periods, of a given nuisance parameter using the `combineTool` as described in [this page](https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/issues/503). 
* _My nuisances are (artificially) constrained and/or the impact plot shows some strange behaviour, especially after including MC statistical uncertainties. What can I do?_ - * Depending on the details of the analysis, several solutions can be adopted to mitigate these effects. We advise to run the validation tools at first, to identify possible redundant shape uncertainties that can be safely eliminated or replaced with lnN ones. Any remaining artificial constrain should be studies. Possible mitigating strategies can be to (a) smooth the templates or (b) adopt some rebinning in order to reduce statistical fluctuations in the templates. A description of possible strategies and effects can be found in [this talk by Margaret Eminizer](https://indico.cern.ch/event/788727/contributions/3401374/attachments/1831680/2999825/higgs_combine_4_17_2019_fitting_details.pdf) + * Depending on the details of the analysis, several solutions can be adopted to mitigate these effects. We advise running the validation tools first, to identify possible redundant shape uncertainties that can be safely eliminated or replaced with lnN ones. Any remaining artificial constraints should be studied. Possible mitigating strategies can be to (a) smooth the templates or (b) adopt some rebinning in order to reduce statistical fluctuations in the templates. A description of possible strategies and effects can be found in [this talk by Margaret Eminizer](https://indico.cern.ch/event/788727/contributions/3401374/attachments/1831680/2999825/higgs_combine_4_17_2019_fitting_details.pdf) * _What do CLs, CLs+b and CLb in the code mean?_ - * The names CLs+b and CLb what are found within some of the RooStats tools are rather outdated and should instead be referred to as p-values - $p_{\mu}$ and $1-p_{b}$, respectively. We use the CLs (which itself is not a p-value) criterion often in High energy physics as it is designed to avoid excluding a signal model when the sensitivity is low (and protects against excluding due to underfluctuations in the data). Typically, when excluding a signal model the p-value $p_{\mu}$ often refers to the p-value under the signal+background hypothesis, assuming a particular value of the signal stregth ($\mu$) while $p_{b}$ is the p-value under the background only hypothesis. You can find more details and definitions of the CLs criterion and $p_{\mu}$ and $p_{b}$ in section 39.4.2.4 of the [2016 PDG review](http://pdg.lbl.gov/2016/reviews/rpp2016-rev-statistics.pdf). + * The names CLs+b and CLb that are found within some of the `RooStats` tools are rather outdated and should instead be referred to as p-values - $p_{\mu}$ and $1-p_{b}$, respectively. We use the CLs (which itself is not a p-value) criterion often in High energy physics as it is designed to avoid excluding a signal model when the sensitivity is low (and protects against excluding due to underfluctuations in the data). Typically, when excluding a signal model the p-value $p_{\mu}$ often refers to the p-value under the signal+background hypothesis, assuming a particular value of the signal strength ($\mu$) while $p_{b}$ is the p-value under the background only hypothesis. You can find more details and definitions of the CLs criterion and $p_{\mu}$ and $p_{b}$ in section 39.4.2.4 of the [2016 PDG review](http://pdg.lbl.gov/2016/reviews/rpp2016-rev-statistics.pdf).
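In terms of these p-values, the CLs criterion discussed in the last item can be written as

$$ CL_s = \frac{p_{\mu}}{1-p_{b}}, $$

and a signal model is typically excluded at the 95% CL when $CL_s < 0.05$.

Relating to the earlier FAQ item on adding an uncertainty to a zero event yield, the following is a minimal, purely illustrative counting-experiment sketch (all names and numbers here are made up, not taken from any real analysis). The background is estimated as $\alpha N$ from a control region with $N=0$ observed events and an assumed transfer factor $\alpha=1.5$, so the `gmN` (gamma) term carries the statistical uncertainty of the empty control region even though the central background prediction is zero:

```
imax 1  number of channels
jmax 1  number of backgrounds
kmax 1  number of nuisance parameters
------------
bin         sr
observation 0
------------
bin         sr    sr
process     sig   bkg
process     0     1
rate        1.0   0.0
------------
bkg_stat  gmN 0   -     1.5
```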
diff --git a/docs/part5/longexercise.md b/docs/part5/longexercise.md index 0f02c08c612..c291669e8d3 100644 --- a/docs/part5/longexercise.md +++ b/docs/part5/longexercise.md @@ -1,5 +1,5 @@ # Long exercise: main features of Combine -This exercise is designed to give a broad overview of the tools available for statistical analysis in CMS using the combine tool. Combine is a high-level tool for building RooFit/RooStats models and running common statistical methods. We will cover the typical aspects of setting up an analysis and producing the results, as well as look at ways in which we can diagnose issues and get a deeper understanding of the statistical model. This is a long exercise - expect to spend some time on it especially if you are new to Combine. If you get stuck while working through this exercise or have questions specifically about the exercise, you can ask them on [this mattermost channel](https://mattermost.web.cern.ch/cms-exp/channels/hcomb-tutorial). Finally, we also provide some solutions to some of the questions that are asked as part of the exercise. These are available [here](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexerciseanswers). +This exercise is designed to give a broad overview of the tools available for statistical analysis in CMS using the combine tool. Combine is a high-level tool for building `RooFit`/`RooStats` models and running common statistical methods. We will cover the typical aspects of setting up an analysis and producing the results, as well as look at ways in which we can diagnose issues and get a deeper understanding of the statistical model. This is a long exercise - expect to spend some time on it especially if you are new to Combine. If you get stuck while working through this exercise or have questions specifically about the exercise, you can ask them on [this mattermost channel](https://mattermost.web.cern.ch/cms-exp/channels/hcomb-tutorial). Finally, we also provide some solutions to some of the questions that are asked as part of the exercise. These are available [here](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part5/longexerciseanswers). For the majority of this course we will work with a simplified version of a real analysis, that nonetheless will have many features of the full analysis. The analysis is a search for an additional heavy neutral Higgs boson decaying to tau lepton pairs. Such a signature is predicted in many extensions of the standard model, in particular the minimal supersymmetric standard model (MSSM). You can read about the analysis in the paper [here](https://arxiv.org/pdf/1803.06553.pdf). The statistical inference makes use of a variable called the total transverse mass ($M_{\mathrm{T}}^{\mathrm{tot}}$) that provides good discrimination between the resonant high-mass signal and the main backgrounds, which have a falling distribution in this high-mass region. The events selected in the analysis are split into a several categories which target the main di-tau final states as well as the two main production modes: gluon-fusion (ggH) and b-jet associated production (bbH). One example is given below for the fully-hadronic final state in the b-tag category which targets the bbH signal: @@ -12,7 +12,7 @@ You can find a presentation with some more background on likelihoods and extract If you are not yet familiar with these concepts, or would like to refresh your memory, we recommend that you have a look at these presentations before you start with the exercise. 
## Getting started -We need to set up a new CMSSW area and checkout the combine package: +We need to set up a new CMSSW area and checkout the Combine package: ```shell cmsrel CMSSW_11_3_4 @@ -26,7 +26,7 @@ git fetch origin git checkout v9.0.0 ``` -We will also make use another package, `CombineHarvester`, which contains some high-level tools for working with combine. The following command will download the repository and checkout just the parts of it we need for this tutorial: +We will also make use another package, `CombineHarvester`, which contains some high-level tools for working with Combine. The following command will download the repository and checkout just the parts of it we need for this tutorial: ```shell bash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-https.sh) ``` @@ -82,10 +82,10 @@ The layout of the datacard is as follows: - The first line starting with `bin` gives a unique label to each channel, and the following line starting with `observation` gives the number of events observed in data. - In the remaining part of the card there are several columns: each one represents one process in one channel. The first four lines labelled `bin`, `process`, `process` and `rate` give the channel label, the process label, a process identifier (`<=0` for signal, `>0` for background) and the number of expected events respectively. - The remaining lines describe sources of systematic uncertainty. Each line gives the name of the uncertainty, (which will become the name of the nuisance parameter inside our RooFit model), the type of uncertainty ("lnN" = log-normal normalisation uncertainty) and the effect on each process in each channel. E.g. a 20% uncertainty on the yield is written as 1.20. - - It is also possible to add a hash symbol (`#`) at the start of a line, which combine will then ignore when it reads the card. + - It is also possible to add a hash symbol (`#`) at the start of a line, which Combine will then ignore when it reads the card. -We can now run combine directly using this datacard as input. The general format for running combine is: +We can now run Combine directly using this datacard as input. The general format for running Combine is: ```shell combine -M [method] [datacard] [additional options...] @@ -93,15 +93,15 @@ combine -M [method] [datacard] [additional options...] ### A: Computing limits using the asymptotic approximation -As we are searching for a signal process that does not exist in the standard model, it's natural to set an upper limit on the cross section times branching fraction of the process (assuming our dataset does not contain a significant discovery of new physics). Combine has dedicated method for calculating upper limits. The most commonly used one is `AsymptoticLimits`, which implements the [CLs criterion](https://inspirehep.net/literature/599622) and uses the profile likelihood ratio as the test statistic. As the name implies, the test statistic distributions are determined analytically in the [asymptotic approximation](https://arxiv.org/abs/1007.1727), so there is no need for more time-intensive toy throwing and fitting. Try running the following command: +As we are searching for a signal process that does not exist in the standard model, it's natural to set an upper limit on the cross section times branching fraction of the process (assuming our dataset does not contain a significant discovery of new physics). Combine has dedicated method for calculating upper limits. 
The most commonly used one is `AsymptoticLimits`, which implements the [CLs criterion](https://inspirehep.net/literature/599622) and uses the profile likelihood ratio as the test statistic. As the name implies, the test statistic distributions are determined analytically in the [asymptotic approximation](https://arxiv.org/abs/1007.1727), so there is no need for more time-intensive toy throwing and fitting. Try running the following command: ```shell combine -M AsymptoticLimits datacard_part1.txt -n .part1A ``` -You should see the results of the observed and expected limit calculations printed to the screen. Here we have added an extra option, `-n .part1A`, which is short for `--name`, and is used to label the output file combine produces, which in this case will be called `higgsCombine.part1A.AsymptoticLimits.mH120.root`. The file name depends on the options we ran with, and is of the form: `higgsCombine[name].[method].mH[mass].root`. The file contains a TTree called `limit` which stores the numerical values returned by the limit computation. Note that in our case we did not set a signal mass when running combine (i.e. `-m 800`), so the output file just uses the default value of `120`. This does not affect our result in any way though, just the label that is used on the output file. +You should see the results of the observed and expected limit calculations printed to the screen. Here we have added an extra option, `-n .part1A`, which is short for `--name`, and is used to label the output file Combine produces, which in this case will be called `higgsCombine.part1A.AsymptoticLimits.mH120.root`. The file name depends on the options we ran with, and is of the form: `higgsCombine[name].[method].mH[mass].root`. The file contains a TTree called `limit` which stores the numerical values returned by the limit computation. Note that in our case we did not set a signal mass when running Combine (i.e. `-m 800`), so the output file just uses the default value of `120`. This does not affect our result in any way though, just the label that is used on the output file. -The limits are given on a parameter called `r`. This is the default **parameter of interest (POI)** that is added to the model automatically. It is a linear scaling of the normalisation of all signal processes given in the datacard, i.e. if $s_{i,j}$ is the nominal number of signal events in channel $i$ for signal process $j$, then the normalisation of that signal in the model is given as $r\cdot s_{i,j}(\vec{\theta})$, where $\vec{\theta}$ represents the set of nuisance parameters which may also affect the signal normalisation. We therefore have some choice in the interpretation of r: for the measurement of a process with a well defined SM prediction we may enter this as the nominal yield in the datacard, such that $r=1$ corresponds to this SM expectation, whereas for setting limits on BSM processes we may choose the nominal yield to correspond to some cross section, e.g. 1 pb, such that we can interpret the limit as a cross section limit directly. In this example the signal has been normalised to a cross section times branching fraction of 1 fb. +The limits are given on a parameter called `r`. This is the default **parameter of interest (POI)** that is added to the model automatically. It is a linear scaling of the normalization of all signal processes given in the datacard, i.e. 
if $s_{i,j}$ is the nominal number of signal events in channel $i$ for signal process $j$, then the normalization of that signal in the model is given as $r\cdot s_{i,j}(\vec{\theta})$, where $\vec{\theta}$ represents the set of nuisance parameters which may also affect the signal normalization. We therefore have some choice in the interpretation of r: for the measurement of a process with a well-defined SM prediction we may enter this as the nominal yield in the datacard, such that $r=1$ corresponds to this SM expectation, whereas for setting limits on BSM processes we may choose the nominal yield to correspond to some cross section, e.g. 1 pb, such that we can interpret the limit as a cross section limit directly. In this example the signal has been normalised to a cross section times branching fraction of 1 fb. The expected limit is given under the background-only hypothesis. The median value under this hypothesis as well as the quantiles needed to give the 68% and 95% intervals are also calculated. These are all the ingredients needed to produce the standard limit plots you will see in many CMS results, for example the $\sigma \times \mathcal{B}$ limits for the $\text{bb}\phi\rightarrow\tau\tau$ process: @@ -116,19 +116,19 @@ In this case we only computed the values for one signal mass hypothesis, indicat - Now try changing the number of observed events. The observed limit will naturally change, but the expected does too - why might this be? -There are other command line options we can supply to combine which will change its behaviour when run. You can see the full set of supported options by doing `combine -h`. Many options are specific to a given method, but others are more general and are applicable to all methods. Throughout this tutorial we will highlight some of the most useful options you may need to use, for example: +There are other command line options we can supply to Combine which will change its behaviour when run. You can see the full set of supported options by doing `combine -h`. Many options are specific to a given method, but others are more general and are applicable to all methods. Throughout this tutorial we will highlight some of the most useful options you may need to use, for example: - - The range on the signal strength modifier: `--rMin=X` and `--rMax=Y`: In RooFit parameters can optionally have a range specified. The implication of this is that their values cannot be adjusted beyond the limits of this range. The min and max values can be adjusted though, and we might need to do this for our POI `r` if the order of magnitude of our measurement is different from the default range of `[0, 20]`. This will be discussed again later in the tutorial. + - The range on the signal strength modifier: `--rMin=X` and `--rMax=Y`: In `RooFit` parameters can optionally have a range specified. The implication of this is that their values cannot be adjusted beyond the limits of this range. The min and max values can be adjusted though, and we might need to do this for our POI `r` if the order of magnitude of our measurement is different from the default range of `[0, 20]`. This will be discussed again later in the tutorial. - Verbosity: `-v X`: By default combine does not usually produce much output on the screen other the main result at the end. However, much more detailed information can be printed by setting the `-v N` with N larger than zero. For example at `-v 3` the logs from the minimizer, Minuit, will also be printed. 
These are very useful for debugging problems with the fit. ### Advanced section: B: Computing limits with toys -Now we will look at computing limits without the asymptotic approximation, so instead using toy datasets to determine the test statistic distributions under the signal+background and background-only hypotheses. This can be necessary if we are searching for signal in bins with a small number of events expected. In combine we will use the `HybridNew` method to calculate limits using toys. This mode is capable of calculating limits with several different test statistics and with fine-grained control over how the toy datasets are generated internally. To calculate LHC-style profile likelihood limits (i.e. the same as we did with the asymptotic) we set the option `--LHCmode LHC-limits`. You can read more about the different options in the [Combine documentation](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#computing-limits-with-toys). +Now we will look at computing limits without the asymptotic approximation, so instead using toy datasets to determine the test statistic distributions under the signal+background and background-only hypotheses. This can be necessary if we are searching for signal in bins with a small number of events expected. In Combine we will use the `HybridNew` method to calculate limits using toys. This mode is capable of calculating limits with several different test statistics and with fine-grained control over how the toy datasets are generated internally. To calculate LHC-style profile likelihood limits (i.e. the same as we did with the asymptotic) we set the option `--LHCmode LHC-limits`. You can read more about the different options in the [Combine documentation](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/commonstatsmethods/#computing-limits-with-toys). Run the following command: ```shell combine -M HybridNew datacard_part1.txt --LHCmode LHC-limits -n .part1B --saveHybridResult --fork 0 ``` -In contrast to `AsymptoticLimits` this will only determine the observed limit, and will take a few minutes. There will not be much output to the screen while combine is running. You can add the option `-v 1` to get a better idea of what is going on. You should see combine stepping around in `r`, trying to find the value for which CLs = 0.05, i.e. the 95% CL limit. The `--saveHybridResult` option will cause the test statistic distributions that are generated at each tested value of `r` to be saved in the output ROOT file. +In contrast to `AsymptoticLimits` this will only determine the observed limit, and will take a few minutes. There will not be much output to the screen while combine is running. You can add the option `-v 1` to get a better idea of what is going on. You should see Combine stepping around in `r`, trying to find the value for which CLs = 0.05, i.e. the 95% CL limit. The `--saveHybridResult` option will cause the test statistic distributions that are generated at each tested value of `r` to be saved in the output ROOT file. To get an expected limit add the option `--expectedFromGrid X`, where `X` is the desired quantile, e.g. for the median: @@ -136,7 +136,7 @@ To get an expected limit add the option `--expectedFromGrid X`, where `X` is the combine -M HybridNew datacard_part1.txt --LHCmode LHC-limits -n .part1B --saveHybridResult --fork 0 --expectedFromGrid 0.500 ``` -Calculate the median expected limit and the 68% range. 
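If you would like to inspect the numbers stored in the output file directly, one simple approach is to open it in ROOT and scan the `limit` tree. The file name below is only what the naming scheme described earlier would produce for this example; adjust it to whatever your run actually wrote out:

```shell
root -l higgsCombine.part1B.HybridNew.mH120.root
# then, at the ROOT prompt:
#   limit->Scan("limit:limitErr:quantileExpected")
```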
The 95% range could also be done, but note it will take much longer to run the 0.025 quantile. While combine is running you can move on to the next steps below. +Calculate the median expected limit and the 68% range. The 95% range could also be done, but note it will take much longer to run the 0.025 quantile. While Combine is running you can move on to the next steps below. **Tasks and questions:** - In contrast to `AsymptoticLimits`, with `HybridNew` each limit comes with an uncertainty. What is the origin of this uncertainty? @@ -169,7 +169,7 @@ Note that for more complex models the fitting time can increase significantly, m Topics covered in this section: - A: Setting up the datacard - - B: Running combine for a blind analysis + - B: Running Combine for a blind analysis - C: Using FitDiagnostics - D: MC statistical uncertainties @@ -247,12 +247,12 @@ A more general way of blinding is to use combine's toy and Asimov dataset genera **Task:** Calculate a blind limit by generating a background-only Asimov with the `-t -1` option instead of using the `AsymptoticLimits` specific options. You should find the observed limit is the same as the expected. Then see what happens if you inject a signal into the Asimov dataset using the `--expectSignal [X]` option. ### C: Using FitDiagnostics -We will now explore one of the most commonly used modes of combine: `FitDiagnostics` . As well as allowing us to make a **measurement** of some physical quantity (as opposed to just setting a limit on it), this method is useful to gain additional information about the model and the behaviour of the fit. It performs two fits: +We will now explore one of the most commonly used modes of Combine: `FitDiagnostics` . As well as allowing us to make a **measurement** of some physical quantity (as opposed to just setting a limit on it), this method is useful to gain additional information about the model and the behaviour of the fit. It performs two fits: - A "background-only" (b-only) fit: first POI (usually "r") fixed to zero - A "signal+background" (s+b) fit: all POIs are floating -With the s+b fit combine will report the best-fit value of our signal strength modifier `r`. As well as the usual output file, a file named `fitDiagnosticsTest.root` is produced which contains additional information. In particular it includes two `RooFitResult` objects, one for the b-only and one for the s+b fit, which store the fitted values of all the **nuisance parameters (NPs)** and POIs as well as estimates of their uncertainties. The covariance matrix from both fits is also included, from which we can learn about the correlations between parameters. Run the `FitDiagnostics` method on our workspace: +With the s+b fit Combine will report the best-fit value of our signal strength modifier `r`. As well as the usual output file, a file named `fitDiagnosticsTest.root` is produced which contains additional information. In particular it includes two `RooFitResult` objects, one for the b-only and one for the s+b fit, which store the fitted values of all the **nuisance parameters (NPs)** and POIs as well as estimates of their uncertainties. The covariance matrix from both fits is also included, from which we can learn about the correlations between parameters. 
Run the `FitDiagnostics` method on our workspace: ```shell combine -M FitDiagnostics workspace_part2.root -m 800 --rMin -20 --rMax 20 @@ -334,13 +334,13 @@ The numbers in each column are respectively $\frac{\theta-\theta_I}{\sigma_I}$ ( So far there is an important source of uncertainty we have neglected. Our estimates of the backgrounds come either from MC simulation or from sideband regions in data, and in both cases these estimates are subject to a statistical uncertainty on the number of simulated or data events. In principle we should include an independent statistical uncertainty for every bin of every process in our model. -It's important to note that combine/RooFit does not take this into account automatically - statistical fluctuations of the data are implicitly accounted +It's important to note that Combine/`RooFit` does not take this into account automatically - statistical fluctuations of the data are implicitly accounted for in the likelihood formalism, but statistical uncertainties in the model must be specified by us. One way to implement these uncertainties is to create a `shape` uncertainty for each bin of each process, in which the up and down histograms have the contents of the bin shifted up and down by the $1\sigma$ uncertainty. However this makes the likelihood evaluation computationally inefficient, and can lead to a large number of nuisance parameters -in more complex models. Instead we will use a feature in combine called `autoMCStats` that creates these automatically from the datacard, +in more complex models. Instead we will use a feature in Combine called `autoMCStats` that creates these automatically from the datacard, and uses a technique called "Barlow-Beeston-lite" to reduce the number of systematic uncertainties that are created. This works on the assumption that for high MC event counts we can model the uncertainty with a Gaussian distribution. Given the uncertainties in different bins are independent, the total uncertainty of several processes in a particular bin is just the sum of $N$ individual Gaussians, which is itself a Gaussian distribution. So instead of $N$ nuisance parameters we need only one. This breaks down when the number of events is small and we are not in the Gaussian regime. @@ -484,7 +484,7 @@ To produce these distributions add the `--saveShapes` and `--saveWithUncertainti combine -M FitDiagnostics workspace_part3.root -m 200 --rMin -1 --rMax 2 --saveShapes --saveWithUncertainties -n .part3B ``` -Combine will produce pre- and post-fit distributions (for fit_s and fit_b) in the fitDiagnosticsTest.root output file: +Combine will produce pre- and post-fit distributions (for fit_s and fit_b) in the fitDiagnosticsTest.root output file: ![](images/fit_diag_shapes.png) @@ -497,7 +497,7 @@ Combine will produce pre- and post-fit distributions (for fit_s and fit_b) in th ### D: Calculating the significance -In the event that you observe a deviation from your null hypothesis, in this case the b-only hypothesis, combine can be used to calculate the p-value or significance. To do this using the asymptotic approximation simply do: +In the event that you observe a deviation from your null hypothesis, in this case the b-only hypothesis, Combine can be used to calculate the p-value or significance. 
To do this using the asymptotic approximation simply do: ```shell combine -M Significance workspace_part3.root -m 200 --rMin -1 --rMax 2 @@ -601,7 +601,7 @@ python plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root --others 'higgsC ![](images/freeze_first_attempt.png) -This doesn't look quite right - the best-fit has been shifted because unfortunately the `--freezeParameters` option acts before the initial fit, whereas we only want to add it for the scan after this fit. To remedy this we can use a feature of combine that lets us save a "snapshot" of the best-fit parameter values, and reuse this snapshot in subsequent fits. First we perform a single fit, adding the `--saveWorkspace` option: +This doesn't look quite right - the best-fit has been shifted because unfortunately the `--freezeParameters` option acts before the initial fit, whereas we only want to add it for the scan after this fit. To remedy this we can use a feature of Combine that lets us save a "snapshot" of the best-fit parameter values, and reuse this snapshot in subsequent fits. First we perform a single fit, adding the `--saveWorkspace` option: ```shell combine -M MultiDimFit workspace_part3.root -n .part3E.snapshot -m 200 --rMin -1 --rMax 2 --saveWorkspace @@ -642,7 +642,7 @@ While it is perfectly fine to just list the relevant nuisance parameters in the - How important are these tau-related uncertainties compared to the others? ### F: Use of channel masking -We will now return briefly to the topic of blinding. We've seen that we can compute expected results by performing any combine method on an Asimov dataset generated using `-t -1`. This is useful, because we can optimise our analysis without introducing any accidental bias that might come from looking at the data in the signal region. However our control regions have been chosen specifically to be signal-free, and it would be useful to use the data here to set the normalisation of our backgrounds even while the signal region remains blinded. Unfortunately there's no easy way to generate a partial Asimov dataset just for the signal region, but instead we can use a feature called "channel masking" to remove specific channels from the likelihood evaluation. One useful application of this feature is to make post-fit plots of the signal region from a control-region-only fit. +We will now return briefly to the topic of blinding. We've seen that we can compute expected results by performing any Combine method on an Asimov dataset generated using `-t -1`. This is useful, because we can optimise our analysis without introducing any accidental bias that might come from looking at the data in the signal region. However our control regions have been chosen specifically to be signal-free, and it would be useful to use the data here to set the normalisation of our backgrounds even while the signal region remains blinded. Unfortunately there's no easy way to generate a partial Asimov dataset just for the signal region, but instead we can use a feature called "channel masking" to remove specific channels from the likelihood evaluation. One useful application of this feature is to make post-fit plots of the signal region from a control-region-only fit. 
To use the masking we first need to rerun `text2workspace.py` with an extra option that will create variables named like `mask_[channel]` in the workspace: @@ -659,7 +659,7 @@ Topics covered in this section: - A: Writing a simple physics model - B: Performing and plotting 2D likelihood scans -With combine we are not limited to parametrising the signal with a single scaling parameter `r`. In fact we can define any arbitrary scaling using whatever functions and parameters we would like. +With Combine we are not limited to parametrising the signal with a single scaling parameter `r`. In fact we can define any arbitrary scaling using whatever functions and parameters we would like. For example, when measuring the couplings of the Higgs boson to the different SM particles we would introduce a POI for each coupling parameter, for example $\kappa_{\text{W}}$, $\kappa_{\text{Z}}$, $\kappa_{\tau}$ etc. We would then generate scaling terms for each $i\rightarrow \text{H}\rightarrow j$ process in terms of how the cross section ($\sigma_i(\kappa)$) and branching ratio ($\frac{\Gamma_i(\kappa)}{\Gamma_{\text{tot}}(\kappa)}$) scale relative to the SM prediction. This parametrisation of the signal (and possibly backgrounds too) is specified in a **physics model**. This is a python class that is used by `text2workspace.py` to construct the model in terms of RooFit objects. There is documentation on using phyiscs models [here](http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/physicsmodels/#physics-models). @@ -692,7 +692,7 @@ dasModel = DASModel() In this we override two methods of the basic `PhysicsModel` class: `doParametersOfInterest` and `getYieldScale`. In the first we define our POI variables, using the doVar function which accepts the RooWorkspace factory syntax for creating variables, and then define all our POIs in a set via the doSet function. The second function will be called for every process in every channel (bin), and using the corresponding strings we have to specify how that process should be scaled. Here we check if the process was declared as signal in the datacard, and if so scale it by `r`, otherwise if it is a background no scaling is applied (`1`). -To use the physics model with `text2workspace.py` first copy it to the python directory in the combine package: +To use the physics model with `text2workspace.py` first copy it to the python directory in the Combine package: ```shell cp DASModel.py $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/python/ ``` @@ -712,7 +712,7 @@ combine -M MultiDimFit workspace_part4.root -n .part4A -m 200 --rMin 0 --rMax 2 ### B: Performing and plotting 2D likelihood scans -For a model with two POIs it is often useful to look at the how well we are able to measure both simultaneously. A natural extension of determining 1D confidence intervals on a single parameter like we did in part 3D is to determine confidence level regions in 2D. To do this we also use combine in a similar way, with `-M MultiDimFit --algo grid`. When two POIs are found combine will scan a 2D grid of points instead of a 1D array. +For a model with two POIs it is often useful to look at the how well we are able to measure both simultaneously. A natural extension of determining 1D confidence intervals on a single parameter like we did in part 3D is to determine confidence level regions in 2D. To do this we also use combine in a similar way, with `-M MultiDimFit --algo grid`. When two POIs are found, Combine will scan a 2D grid of points instead of a 1D array. 
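As a rough sketch of what such a scan could look like (the two POI names `r_ggH` and `r_bbH`, the number of points, and the ranges are only assumptions for illustration; use the names defined in your own physics model):

```shell
# Hypothetical 2D grid scan over two POIs
combine -M MultiDimFit workspace_part4.root -n .part4B -m 200 --algo grid --points 900 \
    --setParameterRanges r_ggH=0,2:r_bbH=0,2
```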
**Tasks and questions:** diff --git a/docs/part5/roofit.md b/docs/part5/roofit.md index 5508e2135fc..cbd8cf2d220 100644 --- a/docs/part5/roofit.md +++ b/docs/part5/roofit.md @@ -4,8 +4,8 @@ This section covers a few of the basics of `RooFit`. There are many more tutorials available at this link: [https://root.cern.ch/root/html600/tutorials/roofit/index.html](https://root.cern.ch/root/html600/tutorials/roofit/index.html) ## Objects -In Roofit, any variable, data point, function, PDF (etc.) is represented by a c++ object -The most basic of these is the `RooRealVar`. Let's create one which will represent the mass of some hypothetical particle, we name it and give it an initial starting value and range. +In `RooFit`, any variable, data point, function, PDF (etc.) is represented by a C++ object. +The most basic of these is the `RooRealVar`. We will create one that will represent the mass of some hypothetical particle; we name it and give it an initial starting value and range. ```c++ RooRealVar MH("MH","mass of the Hypothetical Boson (H-boson) in GeV",125,120,130); @@ -15,26 +15,26 @@ MH.Print(); RooRealVar::MH = 125 L(120 - 130) ``` -ok, great. This variable is now an object we can play around with. We can access this object and modify it's properties, such as its value. +Ok, great. This variable is now an object we can play around with. We can access this object and modify its properties, such as its value. ```c++ MH.setVal(130); MH.getVal(); ``` -In particle detectors we typically don't observe this particle mass but usually define some observable which is *sensitive* to this mass. Lets assume we can detect and reconstruct the decay products of the H-boson and measure the invariant mass of those particles. We need to make another variable which represents that invariant mass. +In particle detectors we typically do not observe this particle mass, but usually define some observable which is *sensitive* to this mass. We will assume we can detect and reconstruct the decay products of the H-boson and measure the invariant mass of those particles. We need to make another variable that represents that invariant mass. ```c++ RooRealVar mass("m","m (GeV)",100,80,200); ``` -In the perfect world we would perfectly measure the exact mass of the particle in every single event. However, our detectors are usually far from perfect so there will be some resolution effect. Lets assume the resolution of our measurement of the invariant mass is 10 GeV and call it "sigma" +In a perfect world we would perfectly measure the exact mass of the particle in every single event. However, our detectors are usually far from perfect, so there will be some resolution effect. We will assume the resolution of our measurement of the invariant mass is 10 GeV and call it "sigma". ```c++ RooRealVar sigma("resolution","#sigma",10,0,20); ``` -More exotic variables can be constructed out of these `RooRealVar`s using `RooFormulaVars`. For example, suppose we wanted to make a function out of the variables which represented the relative resolution as a function of the hypothetical mass MH. +More exotic variables can be constructed out of these `RooRealVar`s using `RooFormulaVars`. For example, suppose we wanted to make a function out of the variables that represented the relative resolution as a function of the hypothetical mass MH. ```c++ RooFormulaVar func("R","@0/@1",RooArgList(sigma,mass)); @@ -67,7 +67,7 @@ func.Print("v"); -Notice how there is a list of the variables we passed (the servers or "actual vars").
+Notice how there is a list of the variables we passed (the servers or "actual vars"). We can now plot the function. `RooFit` has a special plotting object `RooPlot` which keeps track of the objects (and their normalisations) that we want to draw. Since `RooFit` does not know the difference between objects that are and are not dependent, we need to tell it.

Right now, we have the relative resolution as $R(m,\sigma)$, whereas we want to plot $R(m,\sigma(m))$!

@@ -84,13 +84,13 @@ can->Draw();
```
![](images/expo.png)

-The main objects we are interested in using from RooFit are *probability denisty functions* or (PDFs). We can construct the PDF,
+The main objects we are interested in using from `RooFit` are *probability density functions*, or PDFs. We can construct the PDF,

$$ f(m|M_{H},\sigma) $$

-as a simple Gaussian shape for example or a `RooGaussian` in RooFit language (think McDonald's logic, everything is a `RooSomethingOrOther`)
+as a simple Gaussian shape, for example, or a `RooGaussian` in `RooFit` language (think McDonald's logic: everything is a `RooSomethingOrOther`).

```c++
RooGaussian gauss("gauss","f(m|M_{H},#sigma)",mass,MH,sigma);
@@ -150,7 +150,7 @@ can->Draw();

Note that as we change the value of `MH`, the PDF gets updated at the same time.

-PDFs can be used to generate Monte Carlo data. One of the benefits of RooFit is that to do so only uses a single line of code! As before, we have to tell `RooFit` which variables to generate in (e.g which are the observables for an experiment). In this case, each of our events will be a single value of "mass" $m$. The arguments for the function are the set of observables, follwed by the number of events,
+PDFs can be used to generate Monte Carlo data. One of the benefits of `RooFit` is that doing so takes only a single line of code! As before, we have to tell `RooFit` which variables to generate in (e.g. which are the observables for an experiment). In this case, each of our events will be a single value of "mass" $m$. The arguments for the function are the set of observables, followed by the number of events,

```c++
RooDataSet *gen_data = (RooDataSet*) gauss.generate(RooArgSet(mass),500);
@@ -172,11 +172,11 @@ can->Draw();

![](images/gausdata.png)

-Of course we're not in the business of generating MC events, but collecting *real data!*. Next we will look at using real data in `RooFit`.
+Of course we are not in the business of generating MC events, but collecting *real data*! Next we will look at using real data in `RooFit`.

## Datasets

-A dataset is essentially just a collection of points in N-dimensional (N-observables) space. There are two basic implementations in RooFit,
+A dataset is essentially just a collection of points in N-dimensional (N-observables) space. There are two basic implementations in `RooFit`,

1) an "unbinned" dataset - `RooDataSet`

2) a "binned" dataset - `RooDataHist`

both of these use the same basic structure as below

![](images/datastructure.png)

-Lets create an empty dataset where the only observable, the mass. Points can be added to the dataset one by one ...
+We will create an empty dataset where the only observable is the mass. Points can be added to the dataset one by one ...
```c++
RooDataSet mydata("dummy","My dummy dataset",RooArgSet(mass));
@@ -213,7 +213,7 @@ There are also other ways to manipulate datasets in this way as shown in the dia

Luckily there are also constructors for a `RooDataSet` from a `TTree` and for a `RooDataHist` from a `TH1`, so it is simple to convert from your usual ROOT objects.

-Let's take an example dataset put together already. The file `tutorial.root` can be downloaded [here](https://github.com/amarini/Prefit2020/blob/master/Session%201/tutorial.root).
+We will take an example dataset that has already been put together. The file `tutorial.root` can be downloaded [here](https://github.com/amarini/Prefit2020/blob/master/Session%201/tutorial.root).

```c++
TFile *file = TFile::Open("tutorial.root");
@@ -230,9 +230,9 @@ TFile** tutorial.root

-Inside the file, there is something called a `RooWorkspace`. This is just the RooFit way of keeping a persistent link between the objects for a model. It is a very useful way to share data and PDFs/functions etc among CMS collaborators.
+Inside the file, there is something called a `RooWorkspace`. This is just the `RooFit` way of keeping a persistent link between the objects for a model. It is a very useful way to share data and PDFs/functions etc. among CMS collaborators.

-Let's take a look at it. It contains a `RooDataSet` and one variable. This time we called our variable (or observable) `CMS_hgg_mass`, let's assume now that this is the invariant mass of photon pairs where we assume our H-boson decays to photons.
+We will now take a look at it. It contains a `RooDataSet` and one variable. This time we called our variable (or observable) `CMS_hgg_mass`; we will assume that this is the invariant mass of photon pairs, where our H-boson decays to photons.

```c++
RooWorkspace *wspace = (RooWorkspace*) file->Get("workspace");
@@ -254,7 +254,7 @@ RooDataSet::dataset(CMS_hgg_mass)
```

-Let's have a look at the data. The `RooWorkspace` has several accessor functions, we will use the `RooWorkspace::data` one.
+Now we will have a look at the data. The `RooWorkspace` has several accessor functions; we will use the `RooWorkspace::data` one.
There are also `RooWorkspace::var`, `RooWorkspace::function` and `RooWorkspace::pdf` with (hopefully) obvious purposes.

```c++
@@ -275,29 +275,29 @@ hggcan->Draw();

# Likelihoods and Fitting to data

-The data we have in our file doesn't look like a Gaussian distribution. Instead, we could probably use something like an exponential to describe it.
+The data we have in our file does not look like a Gaussian distribution. Instead, we could probably use something like an exponential to describe it.

-There is an exponential PDF already in `RooFit` (yep you guessed it) `RooExponential`. For a pdf, we only need one parameter which is the exponential slope $\alpha$ so our pdf is,
+There is an exponential PDF already in `RooFit` (yes, you guessed it): `RooExponential`. For a PDF, we only need one parameter, which is the exponential slope $\alpha$, so our PDF is,

$$ f(m|\alpha) = \dfrac{1}{N} e^{-\alpha m}$$

Where of course, $N = \int_{110}^{150} e^{-\alpha m} dm$ is the normalisation constant.
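For reference, this normalisation constant can be evaluated in closed form for our fit range; `RooFit` performs this normalisation integral for us automatically whenever the PDF is used, so we never have to code it by hand:

$$ N(\alpha) \;=\; \int_{110}^{150} e^{-\alpha m}\,\mathrm{d}m \;=\; \frac{e^{-110\alpha} - e^{-150\alpha}}{\alpha}, \qquad \alpha \neq 0. $$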
-You can find a bunch of available RooFit functions here: [https://root.cern.ch/root/html/ROOFIT_ROOFIT_Index.html](https://root.cern.ch/root/html/ROOFIT_ROOFIT_Index.html)
+You can find several available `RooFit` functions here: [https://root.cern.ch/root/html/ROOFIT_ROOFIT_Index.html](https://root.cern.ch/root/html/ROOFIT_ROOFIT_Index.html)

-There is also support for a generic pdf in the form of a `RooGenericPdf`, check this link: [https://root.cern.ch/doc/v608/classRooGenericPdf.html](https://root.cern.ch/doc/v608/classRooGenericPdf.html)
+There is also support for a generic PDF in the form of a `RooGenericPdf`; check this link: [https://root.cern.ch/doc/v608/classRooGenericPdf.html](https://root.cern.ch/doc/v608/classRooGenericPdf.html)

-Let's create an exponential PDF for our background,
+Now we will create an exponential PDF for our background,

```c++
RooRealVar alpha("alpha","#alpha",-0.05,-0.2,0.01);
RooExponential expo("exp","exponential function",*hgg_mass,alpha);
```

-We can use RooFit to tell us to estimate the value of $\alpha$ using this dataset. You will learn more about parameter estimation but for now we will just assume you know about maximising likelihoods. This *maximum likelihood estimator* is common in HEP and is known to give unbiased estimates for things like distribution means etc.
+We can use `RooFit` to estimate the value of $\alpha$ using this dataset. You will learn more about parameter estimation, but for now we will just assume you know about maximizing likelihoods. This *maximum likelihood estimator* is common in HEP and is known to give unbiased estimates for things like distribution means, etc.

-This also introduces the other main use of PDFs in RooFit. They can be used to construct *likelihoods* easily.
+This also introduces the other main use of PDFs in `RooFit`. They can be used to construct *likelihoods* easily.

The likelihood $\mathcal{L}$ is defined for a particular dataset (and model) as being proportional to the probability to observe the data assuming some PDF. For our data, the probability to observe an event with a value in an interval bounded by a and b is given by,

@@ -314,7 +314,7 @@ Note that for a specific dataset, the $dm$ factors which should be there are con

The maximum likelihood estimator for $\alpha$, usually written as $\hat{\alpha}$, is found by maximising $\mathcal{L}(\alpha)$.

-Note that this won't depend on the value of the constant of proportionality so we can ignore it. This is true in most scenarios because usually only the *ratio* of likelihoods is needed, in which the constant factors out.
+Note that this will not depend on the value of the constant of proportionality so we can ignore it. This is true in most scenarios because usually only the *ratio* of likelihoods is needed, in which the constant factors out.

Obviously this multiplication of exponentials can lead to very large (or very small) numbers which can lead to numerical instabilities. To avoid this, we can take logs of the likelihood. It is also common to multiply this by -1 and minimize the resulting **N**egative **L**og **L**ikelihood: $\mathrm{-Log}\mathcal{L}(\alpha)$.

@@ -345,9 +345,9 @@ nll->Print("v");
```

-Notice that the NLL object knows which RooRealVar is the parameter because it doesn't find that one in the dataset. This is how RooFit distiguishes between *observables* and *parameters*.
+Notice that the NLL object knows which RooRealVar is the parameter because it doesn't find that one in the dataset. This is how `RooFit` distinguishes between *observables* and *parameters*.
-RooFit has an interface to Minuit via the `RooMinimizer` class which takes the NLL as an argument. To minimize, we just call the `RooMinimizer::minimize()` function. **`Minuit2`** is the program and **`migrad`** is the minimization routine which uses gradient descent.
+`RooFit` has an interface to Minuit via the `RooMinimizer` class, which takes the NLL as an argument. To minimize, we just call the `RooMinimizer::minimize()` function. **`Minuit2`** is the program and **`migrad`** is the minimization routine, which uses gradient descent.

```c++
RooMinimizer minim(*nll);
@@ -427,7 +427,7 @@ alpha.Print("v");
Error = 0.00291959
```

-Lets plot the resulting exponential on the data. Notice that the value of $\hat{\alpha}$ is used for the exponential.
+We will plot the resulting exponential on top of the data. Notice that the value of $\hat{\alpha}$ is used for the exponential.

```c++
expo.plotOn(plot);
@@ -439,9 +439,9 @@ hggcan->Draw();

![](images/expofit.png)

-It looks like there could be a small region near 125 GeV for which our fit doesn't quite go through the points. Maybe our hypothetical H-boson isn't so hypothetical after all!
+It looks like there could be a small region near 125 GeV for which our fit does not quite go through the points. Maybe our hypothetical H-boson is not so hypothetical after all!

-Let's see what happens if we include some resonant signal into the fit. We can take our Gaussian function again and use that as a signal model. A reasonable value for the resolution of a resonant signal with a mass around 125 GeV decaying to a pair of photons is around a GeV.
+We will now see what happens if we include some resonant signal in the fit. We can take our Gaussian function again and use that as a signal model. A reasonable value for the resolution of a resonant signal with a mass around 125 GeV decaying to a pair of photons is around a GeV.

```c++
sigma.setVal(1.);
@@ -453,9 +453,9 @@ MH.setConstant();
RooGaussian hgg_signal("signal","Gaussian PDF",*hgg_mass,MH,sigma);
```

-By setting these parameters constant, RooFit knows (either when creating the NLL by hand or when using `fitTo`) that there is not need to fit for these parameters.
+By setting these parameters constant, `RooFit` knows (either when creating the NLL by hand or when using `fitTo`) that there is no need to fit for these parameters.

-We need to add this to our exponential model and fit a "Sigmal+Background model" by creating a `RooAddPdf`. In RooFit there are two ways to add PDFs, recursively where the fraction of yields for the signal and background is a parameter or absolutely where each PDF has its own normalisation. We're going to use the second one.
+We need to add this to our exponential model and fit a "Signal+Background model" by creating a `RooAddPdf`. In `RooFit` there are two ways to add PDFs: recursively, where the fraction of yields for the signal and background is a parameter, or absolutely, where each PDF has its own normalization. We are going to use the second one.

```c++
RooRealVar norm_s("norm_s","N_{s}",10,100);
@@ -498,14 +498,14 @@ Cached value = 0
```

-Ok now lets fit the model. Note this time we add the option `Extended()` which tells RooFit that we care about the overall number of observed events in the data $n$ too. It will add an additional Poisson term in the likelihood to account for this so our likelihood this time looks like,
+Ok, now we will fit the model. Note this time we add the option `Extended()`, which tells `RooFit` that we care about the overall number of observed events in the data $n$ too. It will add an additional Poisson term in the likelihood to account for this, so our likelihood this time looks like,
$$L_{s+b}(N_{s},N_{b},\alpha) = \dfrac{ (N_{s}+N_{b})^{n} e^{-(N_{s}+N_{b})} }{n!} \cdot \prod_{i}^{n} \left[ c f_{s}(m_{i}|M_{H},\sigma)+ (1-c)f_{b}(m_{i}|\alpha) \right] $$

where $c = \dfrac{ N_{s} }{ N_{s} + N_{b} }$, $f_{s}(m|M_{H},\sigma)$ is the Gaussian signal PDF and $f_{b}(m|\alpha)$ is the exponential PDF. Remember that $M_{H}$ and $\sigma$ are fixed so that they are no longer parameters of the likelihood.

-There is a simpler interface for maximum likelihood fits which is the `RooAbsPdf::fitTo` method. With this simple method, RooFit will construct the negative log-likelihood function, from the pdf, and minimize all of the free parameters in one step.
+There is a simpler interface for maximum-likelihood fits, which is the `RooAbsPdf::fitTo` method. With this simple method, `RooFit` will construct the negative log-likelihood function from the PDF and minimize all of the free parameters in one step.

```c++
model.fitTo(*hgg_data,RooFit::Extended());
@@ -687,14 +687,14 @@ nominal_values = (MH=124.627 +/- 0.398094,resolution=1[C],norm_s=33.9097 +/- 11.
```

-This is exactly what needs to be done when you want to use shape based datacards in combine with parametric models.
+This is exactly what needs to be done when you want to use shape-based datacards in Combine with parametric models.

## A likelihood for a counting experiment
An introductory presentation about likelihoods and interval estimation is available [here](https://indico.cern.ch/event/976099/contributions/4138517/).

**Note: We will use python syntax in this section; you should use a .py script. Make sure to do `import ROOT` at the top of your script.**

-We've seen how to create variables and pdfs, and how to fit a pdf to data. But what if we have a counting experiment, or a histogram template shape? And what about systematic uncertainties? Let's build a likelihood
+We have seen how to create variables and PDFs, and how to fit a PDF to data. But what if we have a counting experiment, or a histogram template shape? And what about systematic uncertainties? We are going to build a likelihood
for this:

$\mathcal{L} \propto p(\text{data}|\text{parameters})$

@@ -704,12 +704,12 @@ where our parameters are parameters of interest, $\mu$, and nuisance parameters,

So we have $\mathcal{L} \propto p(\text{data}|\mu,\vec{\theta})\cdot \pi(\vec{\theta}_0|\vec{\theta})$

-let's try to build the likelihood by hand for a 1-bin counting experiment.
-The data is the number of observed events $N$, and the probability is just a poisson probability $p(N|\lambda) = \frac{\lambda^N e^{-\lambda}}{N!}$, where $\lambda$ is the number of events expected in our signal+background model: $\lambda = \mu\cdot s(\vec{\theta}) + b(\vec{\theta})$.
+Now we will try to build the likelihood by hand for a 1-bin counting experiment.
+The data is the number of observed events $N$, and the probability is just a Poisson probability $p(N|\lambda) = \frac{\lambda^N e^{-\lambda}}{N!}$, where $\lambda$ is the number of events expected in our signal+background model: $\lambda = \mu\cdot s(\vec{\theta}) + b(\vec{\theta})$.

-In the expression, s and b are the numbers of expected signal- and background events, which both depend on the nuisance parameters. Let's start by building a simple likelihood function with one signal process and one background process. We'll assume there are no nuisance parameters for now. The number of observed events in data is 15, the expected number of signal events is 5 and the expected number of background events 8.1.
+In the expression, s and b are the numbers of expected signal and background events, which both depend on the nuisance parameters. We will start by building a simple likelihood function with one signal process and one background process. We will assume there are no nuisance parameters for now. The number of observed events in data is 15, the expected number of signal events is 5, and the expected number of background events is 8.1.

-It's easiest to use the RooFit workspace factory to build our model ([this tutorial](https://root.cern/doc/master/rf511__wsfactory__basic_8py.html) has more information on the factory syntax).
+It is easiest to use the `RooFit` workspace factory to build our model ([this tutorial](https://root.cern/doc/master/rf511__wsfactory__basic_8py.html) has more information on the factory syntax).

```
import ROOT
@@ -720,18 +720,18 @@ We need to create an expression for the number of events in our model, $\mu s +b
```
w.factory('expr::n("mu*s +b", mu[1.0,0,4], s[5],b[8.1])')
```
-Now we can build the likelihood, which is just our poisson pdf:
+Now we can build the likelihood, which is just our Poisson PDF:
```
w.factory('Poisson::poisN(N[15],n)')
```

-To find the best-fit value for our parameter of interest $\mu$ we need to maximize the likelihood. In practice it's actually easier to minimize the **N**egative **l**og of the **l**ikelihood, or NLL:
+To find the best-fit value for our parameter of interest $\mu$ we need to maximize the likelihood. In practice it is actually easier to minimize the **N**egative **l**og of the **l**ikelihood, or NLL:

```
w.factory('expr::NLL("-log(@0)",poisN)')
```

-We can now use the RooMinimizer to find the minimum of the NLL
+We can now use the `RooMinimizer` to find the minimum of the NLL.

```
@@ -743,7 +743,7 @@ bestfitnll = nll.getVal()
```
Notice that we need to set the error level to 0.5 to get the uncertainties (relying on Wilks' theorem!) - note that there is a more reliable way of extracting the confidence interval (explicitly rather than relying on migrad). We will discuss this a bit later in this section.

-Now let's add a nuisance parameter, *lumi*, which represents the luminosity uncertainty. It has a 2.5% effect on both the signal and the background. The parameter will be log-normally distributed: when it's 0, the normalization of the signal and background are not modified; at $+1\sigma$ the signal and background normalizations will be multiplied by 1.025 and at $-1\sigma$ they will be divided by 1.025. We should modify the expression for the number of events in our model:
+Now we will add a nuisance parameter, *lumi*, which represents the luminosity uncertainty. It has a 2.5% effect on both the signal and the background. The parameter will be log-normally distributed: when it is 0, the normalizations of the signal and background are not modified; at $+1\sigma$ the signal and background normalizations will be multiplied by 1.025 and at $-1\sigma$ they will be divided by 1.025. We should modify the expression for the number of events in our model:
```
w.factory('expr::n("mu*s*pow(1.025,lumi) +b*pow(1.025,lumi)", mu[1.0,0,4], s[5],b[8.1],lumi[0,-4,4])')
```
@@ -765,11 +765,11 @@ w.factory('expr::NLL("-log(@0)",likelihood)')
```

Which we can minimize in the same way as before.

-Now let's extend our model a bit.
+Now we will extend our model a bit.

- Expanding on what was demonstrated above, build the likelihood for $N=15$, a signal process *s* with expectation 5 events, a background *ztt* with expectation 3.7 events and a background *tt* with expectation 4.4 events. The luminosity uncertainty applies to all three processes. The signal process is further subject to a 5% log-normally distributed uncertainty *sigth*, *tt* is subject to a 6% log-normally distributed uncertainty *ttxs*, and *ztt* is subject to a 4% log-normally distributed uncertainty *zttxs*. Find the best-fit value and the associated uncertainty.
-- Also perform an explicit scan of the $\Delta$ NLL ( = log of profile likelihood ratio) and make a graph of the scan. Some example code can be found below to get you started. Hint: you'll need to perform fits for different values of mu, where mu is fixed. In RooFit you can set a variable to be constant as `var("VARNAME").setConstant(True)`
-- From the curve that you've created by performing an explicit scan, we can extract the 68% CL interval. You can do so by eye or by writing some code to find the relevant intersections of the curve.
+- Also perform an explicit scan of the $\Delta$ NLL (= log of the profile likelihood ratio) and make a graph of the scan. Some example code can be found below to get you started. Hint: you will need to perform fits for different values of mu, where mu is fixed. In `RooFit` you can set a variable to be constant as `var("VARNAME").setConstant(True)`
+- From the curve that you have created by performing an explicit scan, we can extract the 68% CL interval. You can do so by eye or by writing some code to find the relevant intersections of the curve.

```
gr = ROOT.TGraph()
@@ -791,7 +791,7 @@ canv.SaveAs("likelihoodscan.pdf")
```
Well, this is doable - but we were only looking at a simple one-bin counting experiment. This might become rather cumbersome for large models... $[*]$

-We'll now switch to Combine which will make it a lot easier to set up your model and do the statistical analysis than trying to build the likelihood yourself.
+We will now switch to Combine, which will make it a lot easier to set up your model and do the statistical analysis than trying to build the likelihood yourself.

-$[*]$ Side note - RooFit does have additional functionality to help with statistical model building, but we won't go into detail today.
+$[*]$ Side note - `RooFit` does have additional functionality to help with statistical model building, but we will not go into detail today.
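To make the hint in the bullet points above more concrete, a minimal sketch of such an explicit $\Delta$ NLL scan could look like the following. It assumes the workspace `w`, the `NLL` expression and the parameter `mu` defined earlier in this section are available in the same script; the number of scan points and the scan range are arbitrary choices, and this is only one possible way to implement it.

```
import ROOT

# Sketch only: assumes the workspace `w` from the steps above already exists
# in this script, with the "NLL" expression and the POI "mu" defined.
nll = w.function("NLL")
mu = w.var("mu")

# Global fit with everything floating, to define the minimum of the NLL
minim = ROOT.RooMinimizer(nll)
minim.setErrorLevel(0.5)
minim.minimize("Minuit2", "migrad")
nll_min = nll.getVal()

# Profile scan: fix mu at each scan point and re-minimise the other parameters
gr = ROOT.TGraph()
npoints = 50
for i in range(npoints):
    mu_val = 4.0 * i / (npoints - 1)  # scan range chosen to match mu[1.0,0,4]
    mu.setVal(mu_val)
    mu.setConstant(True)
    ROOT.RooMinimizer(nll).minimize("Minuit2", "migrad")
    gr.SetPoint(i, mu_val, nll.getVal() - nll_min)
mu.setConstant(False)

# The 68% CL interval is (approximately) where the curve crosses Delta NLL = 0.5
canv = ROOT.TCanvas()
gr.SetTitle(";#mu;#Delta NLL")
gr.Draw("ACP")
canv.SaveAs("likelihoodscan.pdf")
```

The same loop structure works for the three-process model in the exercise; only the factory expressions for the expected yields and the constraint terms change.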