Add script to run ttbar analysis with MLFlow instrumentation #65

Open · wants to merge 1 commit into `main`
6 changes: 6 additions & 0 deletions .gitignore
@@ -5,8 +5,11 @@ __pycache__
venv/
.env/
.ipynb_checkpoints
.idea/

servicex.yml
.servicex


analyses/**/*.root
analyses/**/*.pdf
@@ -24,3 +27,6 @@ workshops/agctools2022/statistical-inference/input

# CMS ttbar
analyses/cms-open-data-ttbar/workspace.json

# MLFlow
mlruns/
26 changes: 26 additions & 0 deletions analyses/cms-open-data-ttbar/MLproject
@@ -0,0 +1,26 @@
name: analysis-grand-challenge

conda_env: conda.yaml

entry_points:
ttbar:
parameters:
num-input-files: {type: int, default: 10}
num-bins: {type: int, default: 25}
bin-low: {type: int, default: 50}
bin-high: {type: int, default: 550}
pt-threshold: {type: int, default: 25}

command: "python analysis.py --num-input-files {num-input-files} --num-bins {num-bins} --bin-low {bin-low} --bin-high {bin-high} --pt-threshold {pt-threshold}"

# Use Hyperopt to optimize hyperparams of the ttbar entry_point.
hyperopt:
parameters:
max_runs: {type: int, default: 12}
metric: {type: string, default: "ttbar_norm_bestfit"}
algo: {type: string, default: "tpe.suggest"}
command: "python -O search_hyperparameter.py
--max-runs {max_runs}
--metric {metric}
--algo {algo}"

58 changes: 58 additions & 0 deletions analyses/cms-open-data-ttbar/README.md
@@ -0,0 +1,58 @@
# CMS Open Data $t\bar{t}$: from data delivery to statistical inference

We are using [2015 CMS Open Data](https://cms.cern/news/first-cms-open-data-lhc-run-2-released)
in this demonstration to showcase an analysis pipeline. It features data
delivery and processing, histogram construction and visualization, as well as
statistical inference.

This notebook was developed in the context of the
[IRIS-HEP AGC tools 2022 workshop](https://indico.cern.ch/e/agc-tools-2). This
work was supported by the U.S. National Science Foundation (NSF) Cooperative
Agreement OAC-1836650 (IRIS-HEP).

This is a technical demonstration. We are including the relevant workflow
aspects that physicists need in their work, but we are not focusing on making
every piece of the demonstration physically meaningful. This concerns in
particular systematic uncertainties: we capture the workflow, but the actual
implementations are more complex in practice. If you are interested in the
physics side of analyzing top pair production, check out the latest results from
ATLAS and CMS! If you would like to see more technical demonstrations, also
check out an ATLAS Open Data example demonstrated previously.

## Tracking Analysis Runs with MLFlow
A version of this analysis has been instrumented with
[MLFlow](https://mlflow.org) to record each run along with its input
parameters, fit results, and generated plots. To use the tracking
service you will need:
* Conda
* Access to an MLFlow tracking service instance
* Environment variables set to allow the script to communicate with the tracking service and the back-end object store:
* `MLFLOW_TRACKING_URI`
* `MLFLOW_S3_ENDPOINT_URL`
* `AWS_ACCESS_KEY_ID`
* `AWS_SECRET_ACCESS_KEY`
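As a convenience, these variables can be exported in your shell before invoking `mlflow run`. The values below are hypothetical placeholders, not real endpoints or credentials:

```shell
# Hypothetical placeholder values -- substitute the URI of your own MLFlow
# tracking service, the endpoint of your S3-compatible object store, and
# the credentials for that store.
export MLFLOW_TRACKING_URI="https://mlflow.example.org"
export MLFLOW_S3_ENDPOINT_URL="https://minio.example.org"
export AWS_ACCESS_KEY_ID="my-access-key"
export AWS_SECRET_ACCESS_KEY="my-secret-key"
```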

If you would like to install a local instance of the MLFlow tracking service on
your Kubernetes cluster, this
[helm chart](https://artifacthub.io/packages/helm/ncsa/mlflow) is a good starting point.

For reproducibility, MLFlow insists on running the analysis in a conda
environment. This is defined in `conda.yaml`.
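For orientation, a `conda.yaml` for an MLFlow project typically looks like the sketch below. The channel and dependency list here are assumptions for illustration, not the actual contents of this repository's `conda.yaml`:

```yaml
# Illustrative structure only -- the dependencies listed are assumptions.
name: analysis-grand-challenge
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pip
  - pip:
      - mlflow
```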

The MLFlow project is defined in `MLproject`. This file specifies two different
_entrypoints_:

`ttbar` is the entrypoint for running a single analysis. It offers a number
of command line parameters to control the analysis. It can be run as
```shell
mlflow run -P num-bins=25 -P pt-threshold=25 .
```
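The parameters declared in `MLproject` are passed straight through to `analysis.py` as command-line flags. A minimal sketch of that interface, assuming an argparse-based script whose defaults mirror the `MLproject` entry point (the real `analysis.py` may differ):

```python
import argparse

# Sketch of the command-line interface that the `ttbar` entrypoint drives.
# Defaults mirror those declared in MLproject; the real analysis.py may
# define additional options.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="ttbar analysis parameters")
    parser.add_argument("--num-input-files", type=int, default=10)
    parser.add_argument("--num-bins", type=int, default=25)
    parser.add_argument("--bin-low", type=int, default=50)
    parser.add_argument("--bin-high", type=int, default=550)
    parser.add_argument("--pt-threshold", type=int, default=25)
    return parser.parse_args(argv)

if __name__ == "__main__":
    print(parse_args())
```

With this shape, any `-P` value given to `mlflow run` overrides the corresponding default, and unspecified parameters fall back to the `MLproject` defaults.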
### Hyperparameter Searches
MLFlow is often used to optimize models by running them with different
hyperparameters until the loss function is minimized. We've borrowed this
approach for optimizing an analysis: you can orchestrate a number of analysis
runs with different input settings by using the `hyperopt` entrypoint.

```shell
mlflow run -e hyperopt -P max_runs=20 .
```
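The search loop behind this entrypoint can be sketched as follows. This is not the actual `search_hyperparameter.py`: random search stands in for Hyperopt's TPE algorithm, and `run_analysis` is a hypothetical stand-in for launching one `ttbar` run and reading back its metric.

```python
import random

def run_analysis(pt_threshold):
    # Hypothetical stand-in objective: pretend the tracked metric
    # is minimized near a 25 GeV threshold.
    return (pt_threshold - 25) ** 2

def search(max_runs=12, seed=0):
    """Try max_runs parameter settings and keep the best (lowest) metric."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_runs):
        params = {"pt_threshold": rng.randint(15, 40)}
        metric = run_analysis(**params)
        if best is None or metric < best[0]:
            best = (metric, params)
    return best

if __name__ == "__main__":
    print(search())
```

In the real entrypoint, each trial would itself be an MLFlow run, so every parameter setting and its resulting metric are recorded by the tracking service.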