Skip to content

Latest commit

 

History

History

Spark_Physics

Apache Spark for High Energy Physics

This collects a few simple examples of how Apache Spark can be used in the domain of High Energy Physics data analysis.
Most of the examples are just for education purposes, use a small subset of data and can be run on laptop-sized computing resources.
See also the blog post Can High Energy Physics Analysis Profit from Apache Spark APIs?

Contents:

  1. Dimuon mass spectrum analysis
  2. HEP analysis benchmark
  3. ATLAS Higgs analysis
  4. CMS Higgs analysis
  5. LHCb matter antimatter analysis
  6. Spark ML on HEP data

1. Dimuon mass spectrum analysis

This is a sort of "Hello World!" example for High Energy Physics analysis.
The implementations proposed here using Apache Spark APIs are a direct "Spark translation" of a tutorial using ROOT DataFrame

Data

  • These examples use CMS open data from 2012:
  • Data has been converted and made available for this work in snappy-compressed Apache Parquet and Apache ORC formats
  • You can download the following datasets:
    • 61 million events (2 GB)
      • original files in ROOT format: root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root
        • see notes on how to access data using the XRootD protocol (root://) and how to read it.
      • dataset converted to Parquet: Run2012BC_DoubleMuParked_Muons.parquet
        • wget https://sparkdltrigger.web.cern.ch/sparkdltrigger/Run2012BC_DoubleMuParked_Muons.parquet
      • dataset converted to ORC: Run2012BC_DoubleMuParked_Muons.orc
    • 6.5 billion events (200 GB, this is the 2GB dataset repeated 105 times)
      • original files, in ROOT format root://eospublic.cern.ch//eos/root-eos/benchmark/CMSOpenDataDimuon
      • dataset converted to Parquet: CMSOpenDataDimuon_large.parquet
        • download using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/CMSOpenDataDimuon_large.parquet/
      • dataset converted to ORC: CMSOpenDataDimuon_large.orc
        • download using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/CMSOpenDataDimuon_large.orc/

Notebooks

Multiple notebook solutions are provided, to illustrate different approaches with Apache Spark.
Notes on the execution environment:

  • The notebooks use the dataset with 61 million events (Except the SCALE test that uses 6.5 billion events)
  • Compatibility and tests: these notebooks have been developed with Spark 3.2.1 and tested up to Spark 3.5.1
    • For this we use Apache Spark vectorized reader for complex types in Parquet, mapInArrow UDF (introduced in Spark 3.3.0).
    • The server used for testing and measuring the execution time has 4 physical CPU cores, SSD disk storage , and 32 GB of RAM (which is more than the minimum requirements for running this workload)
Notebook Run Time Short description
1. DataFrame API, Parquet 11 sec The analysis is implemented using Apache Spark DataFrame API. This uses the dataset in Apache Parquet format
2. DataFrame API, ORC 11 sec Same as 1a., with the exception that this uses the dataset in Apache ORC format
3. Spark SQL, Parquet 11 sec The analysis is implemented using Spark SQL
4. Scala UDF 12 sec Implementation that mixes DataFrame API and Scala UDF. The dimuon invariant mass formula computation is implemented using a UDF written in Scala. Link to Scala UDF code
5a. Pandas UDF flattened data 19 sec Implementation that mixes DataFrame API and Python Pandas UDF. The dimuon invariant mass formula computation is implemented using a Pandas UDF
5b. Pandas UDF data arrays 82 sec Same as 5a, but the Pandas UDF in this case uses data in arrays and lists
6. MapInArrow flattened data 22 sec This uses mapInArrow, introduced in SPARK-37227
7. RumbleDB on Spark 155 sec This implementation runs with RumbleDB query engine on top of Apache Spark. RumbleDB implements the JSONiq language.
8. DataFrame API at scale, Parquet (*)35 sec (*)This has processed 6.5 billion events, at scale on a cluster using 200 CPU cores.
Dimuon spectrum analysis on Colab - You can run this on Google's Colaboratory
Dimuon spectrum analysis on CERN SWAN - You can run this on CERN SWAN (requires CERN SSO credentials)

2. HEP analysis benchmark

Here you will find an implementation of the High Energy Physics benchmark tasks using Apache Spark.
It follows the IRIS-HEP benchmark specifications and solutions linked there.
Solutions to the benchmark tasks are also directly inspired by the article Evaluating Query Languages and Systems for High-Energy Physics Data.

Data

  • For this exercise we use CMS open data from 2012, made available via the CERN opendata portal: DOI:10.7483/OPENDATA.CMS.IYVQ.1J0W
  • Data has been converted and made available for this work in snappy-compressed Apache Parquet and Apache ORC formats
  • A list of the downloadable datasets for this analysis:
  • 53 million events (16 GB), original files in ROOT format: root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root
    • see notes on how to access data using the XRootD protocol (root://) and how to read it.
  • 53 million events (16 GB), converted to Parquet: Run2012B_SingleMu.parquet
    • download using wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/Run2012B_SingleMu.parquet/
  • 7 million events (2 GB) Parquet format Run2012B_SingleMu_sample.parquet

Notebooks

Notes on the execution environment:

  • The notebooks can be run using the datasets with 53 million events for more accuracy or with the reduced dataset with 7 million events for faster execution
    • Compatibility and tests: these notebooks have been developed with Spark 3.2.1 and tested up to Spark 3.5.1
    • Notable Spark features used: Apache Spark vectorized reader for complex types in Parquet, mapInArrow UDF.
    • The server used for testing and measuring the execution time has 4 physical CPU cores, SSD disk storage , and 32 GB of RAM (which is more than the minimum requirements for running this workload)
Notebook Short description
Benchmark tasks 1 to 5, Parquet and SparkHistogram The analysis is implemented using Apache Spark DataFrame API. It uses the dataset in Apache Parquet format and the histogram generation is done using the sparkhistogram package.
Benchmark task 6 Three different solution are provided. This is the hardest task to implement in Spark. The proposed solutions use also Scala UDFs: link to the Scala UDF code.
Benchmark task 7 Two different solutions provided, one using the explode function, the other with Spark's higher order functions for array processing.
Benchmark task 8 This combines Spark DataFrame API for filtering and Scala UDFs for processing. Link to the Scala UDF code.
Benchmark tasks 1 to 5 on Colab You can run this on Google's Colaboratory.
Benchmark tasks 1 to 5 on CERN SWAN You can run this on CERN SWAN (requires CERN SSO credentials).

3. ATLAS Higgs boson analysis - outreach-style

This is an example analysis of the Higgs boson detection via the decay channel H → ZZ* → 4l From the decay products measured at the ATLAS experiment and provided as open data, you will be able to produce a few histograms, comparing experimental data and Monte Carlo (simulation) data. From there you can infer the invariant mass of the Higgs boson.
Disclaimer: this is for educational purposes only, it is not the code nor the data of the official Higgs boson discovery paper.
It is based on the original work on ATLAS outreach notebooks and derived work at this repo and this work Reference: ATLAS paper on the discovery of the Higgs boson (mostly Section 4 and 4.1)

Data

Notebooks

  • Compatibility and tests: these notebooks have been developed with Spark 3.2.1 and tested up to Spark 3.5.1
  • These analyses use a very small dataset and are mostly intended to show how the Spark API can be applied in this context, rather than its performance and scalability.
Notebook Short description
1. ATLAS opendata Higgs H-ZZ*-4l basic analysis Basic analysis with experiment (detector) data.
2. H-ZZ*-4l basic analysis with additional filters and cuts Analysis with extra cuts and data operations, this uses Monte Carlo (simulation) data.
3. H-ZZ*-4l reproduce Fig 2 of the paper - with experiment data and monte carlo Analysis with extra cuts and data operations, this uses experiment (detector) data and Monte Carlo (simulation) data for signal and backgroud. It roughly reproduces Figure 2 of the ATLAS Higgs paper.
Run ATLAS opendata Higgs H-ZZ*-4l basic analysis on Colab Basic analysis with experiment (detector) data. This notebook opens on Google's Colab.
Run ATLAS opendata Higgs H-ZZ*-4l basic analysis on CERN SWAN Basic analysis with experiment (detector) data. This notebook opens on CERN's SWAN notebook service (requires CERN SSO credentials)

4. CMS Higgs boson analysis - outreach-style

This is an example analysis of the Higgs boson detection via the decay channel H → ZZ* → 4l Disclaimer: this is for educational purposes only, it is not the code nor the data of the official Higgs boson discovery paper.
It is based on the original work on cms opendata notebooks and this derived work
Reference: link to the original article with CMS Higgs boson discovery

Data

  • The notebooks presented here use datasets from the original open data events converted to snappy-compressed Apache Parquet format.
    • Download from: CMS Higgs notebook opendata in Parquet format
    • CLI command to download all files (12 GB):
      wget -r -np -R "index.html*" -e robots=off https://sparkdltrigger.web.cern.ch/sparkdltrigger/CMS_Higgs_opendata/
    • The original data is from CMS open data and is stored in ROOT format
      • See notes on how to access data using the XRootD protocol (root://) and how to read ROOT data, download from this URL: root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod

Notebooks

Notebook Short description
CMS opendata Higgs H-ZZ*-4l, basic with monte carlo signal Basic analysis and visualization with Monte Carlo (simulation) data.
TBC

5. LHCb matter antimatter asymmetries analysis - outreach-style

This notebook provides an example of how to use Spark to perform a simple analysis using high energy physics data from a LHC experiment. Credits:

Data

Notebooks

Notebook Short description
LHCb outreach analysis LHCb analysis notebook using open data
Run LHCb opendata analysis notebook on Colab This notebook opens on Google's Colab
Run LHCb opendata analysis notebook on CERN SWAN This notebook opens on CERN SWAN notebook service (requires CERN SSO credentials)

Notes on reading and converting data from ROOT format

How to convert from ROOT format to Apache Parquet or ORC

High Energy Physics uses the ROOT data format extensively. Here are a few notes on how to read and convert data from ROOT format to Apache Parquet or ORC:

How to read files via the XRootD protocol

  • This allows to read files from URLs like root://eospublic.cern.ch/..
    • It is used extensively for HEP data
    • CERN users can read files stored in EOS also using CERNBOx
    • Use Apache Spark to read root:// URLs using the Hadoop-XRootD connector
    • Use CLI to access XRootD using the toolset from XRootD project
      • CLI example: xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
    • CERN users can read files stored in EOS also using CERNBOx

Spark ML examples with Physics data

A few related references on the topic of ML tools used with HEP data:

Physics references

A few links with additional details on the terms and formulas used: