Chapter 1 Introduction

This report partially fulfills the requirements for the HarvardX course PH125.9x: “Data Science: Capstone”. The objective of this project is to apply machine learning techniques beyond standard linear regression to a publicly available dataset of our choice.

1.1 The HIGGS dataset

The HIGGS dataset is a synthetic dataset simulating particle accelerator data (Baldi, Sadowski, and Whiteson 2014). Although the details of the simulated particle collisions are beyond the scope of this project, we summarize the contents of the dataset as follows.

The HIGGS dataset is available from the UCI Machine Learning Repository. The following code loads the dataset, downloading and unzipping the CSV file as necessary:

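# Packages used in this chapter
library(R.utils)    # gunzip()
library(data.table) # fread(), as.data.table()
library(dplyr)      # select()
library(caret)      # createDataPartition()

options(timeout=1800) # Give more time for the download to complete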

# Download and unzip the data file if needed
if(!dir.exists('data_raw')) {
  dir.create('data_raw')
}
if(!file.exists('data_raw/HIGGS.csv')) {
  download.file(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz',
    'data_raw/HIGGS.csv.gz')
  gunzip('data_raw/HIGGS.csv.gz')
}
# Load dataset into memory
higgs_all <- fread('data_raw/HIGGS.csv')

# Assign column names (csv contains no headers)
colnames(higgs_all) <- c('signal',
                     'lepton_pT', 'lepton_eta', 'lepton_phi',
                     'missing_E_mag', 'missing_E_phi',
                     'jet1_pT', 'jet1_eta', 'jet1_phi', 'jet1_btag',
                     'jet2_pT', 'jet2_eta', 'jet2_phi', 'jet2_btag',
                     'jet3_pT', 'jet3_eta', 'jet3_phi', 'jet3_btag',
                     'jet4_pT', 'jet4_eta', 'jet4_phi', 'jet4_btag',
                     'm_jj', 'm_jjj', 'm_lv', 'm_jlv',
                     'm_bb', 'm_wbb', 'm_wwbb')

# Separate input and output columns
xAll <- higgs_all |> select(-signal) |> as.data.table()
yAll <- higgs_all |> select(signal) |> as.data.table()

rm(higgs_all)

The HIGGS dataset contains 28 features and one binary target, signal. The signal value of an observation indicates whether the particle collision produced a Higgs boson as an intermediate product (1 for signal, 0 for background). Two possible processes with the same input particles are considered: one that generates the Higgs boson (the “signal” process) and one that does not (the “background” process).

Of the 28 features in the dataset, the last seven, prefixed ‘m_’, are called “high-level” features. They are computed masses of the intermediate decay products expected in the signal and background processes, under the assumption that the observed final decay products were generated by each process. The 21 “low-level” features consist of momentum data for a lepton and four jets, a b-tag for each jet marking the likelihood that the jet is associated with a bottom quark, and partial information about the “missing” total momentum caused by undetected decay products such as neutrinos. Due to the nature of the simulated particle colliders and detectors, full directional data for this missing momentum is not available.
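The grouping described above can be recovered from the column names alone. The following is a minimal sketch in base R; the variable names feature_names, high_level, and low_level are illustrative assumptions and are not used elsewhere in this report.

```r
# Feature names as assigned to the dataset columns (target 'signal' excluded)
feature_names <- c('lepton_pT', 'lepton_eta', 'lepton_phi',
                   'missing_E_mag', 'missing_E_phi',
                   'jet1_pT', 'jet1_eta', 'jet1_phi', 'jet1_btag',
                   'jet2_pT', 'jet2_eta', 'jet2_phi', 'jet2_btag',
                   'jet3_pT', 'jet3_eta', 'jet3_phi', 'jet3_btag',
                   'jet4_pT', 'jet4_eta', 'jet4_phi', 'jet4_btag',
                   'm_jj', 'm_jjj', 'm_lv', 'm_jlv',
                   'm_bb', 'm_wbb', 'm_wwbb')

# High-level features are exactly those prefixed with 'm_'
is_high_level <- startsWith(feature_names, 'm_')
high_level <- feature_names[is_high_level]
low_level  <- feature_names[!is_high_level]

length(low_level)   # 21 low-level features
length(high_level)  # 7 high-level features
```

Note that 'missing_E_mag' and 'missing_E_phi' begin with 'm' but not with 'm_', so they are correctly counted as low-level features.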

1.2 Creating the final test splits

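Keras, one of the libraries we will use, applies the term “validation” to the data it sets aside internally while fitting a model. To avoid confusion, we will use the term “final validation set” to denote the final hold-out set reserved for evaluating the finished model. We reserve 10% of the observations for this purpose; in the code, this set is stored in xFinalTest and yFinalTest.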

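# Create 10% test set split
set.seed(1)
# list = FALSE returns a one-column matrix; [, 1] keeps a plain index vector
idx <- createDataPartition(yAll$signal, p = 0.1, list = FALSE)[, 1]
gc()

# Rows outside idx form the working data; rows in idx form the final hold-out set.
# Both subsets must be indexed from the full tables (xAll, yAll).
x <- xAll[-idx,]
y <- yAll[-idx,]
xFinalTest <- xAll[idx,]
yFinalTest <- yAll[idx,]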

# Clean-up
rm(xAll, yAll)
gc()
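To illustrate the logic of the split on a small scale, the following sketch builds a toy dataset and holds out roughly 10% of each class. It is an assumption-laden stand-in: it uses base R's sample() instead of caret's createDataPartition(), and the names toy_x, toy_y, x_train, and x_holdout are hypothetical. The point it demonstrates is that the training and hold-out subsets are both indexed from the same full table, so they are disjoint and together cover every row.

```r
# Toy data standing in for the HIGGS table (values are random, for illustration only)
set.seed(1)
n <- 1000
toy_x <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
toy_y <- rbinom(n, size = 1, prob = 0.5)

# Draw about 10% of the row indices within each class (simple stratification)
idx <- unlist(
  lapply(split(seq_len(n), toy_y),
         function(rows) sample(rows, size = round(0.1 * length(rows)))),
  use.names = FALSE)

# Index both subsets from the full data
x_train   <- toy_x[-idx, ]
y_train   <- toy_y[-idx]
x_holdout <- toy_x[idx, ]
y_holdout <- toy_y[idx]

# Sanity checks: the subsets are disjoint and together cover all rows
stopifnot(nrow(x_train) + nrow(x_holdout) == n)
stopifnot(length(intersect(rownames(x_train), rownames(x_holdout))) == 0)

# The class balance of the hold-out set should be close to that of the full data
c(full = mean(toy_y), holdout = mean(y_holdout))
```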

1.3 Optimization metric

We will build a classification model for the HIGGS dataset, optimizing the area under the receiver operating characteristic (ROC) curve, abbreviated AUC. The AUC is defined such that a perfect classifier has an AUC of 1 and a random classifier has an AUC of 0.5. An advantage of the AUC is that it summarizes the trade-off between the true and false positive rates of a model across all classification thresholds, rather than at a single chosen threshold. We will use the pROC package to create ROC curves and compute their area.
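As a small worked example of the metric, the sketch below computes the AUC for a handful of toy predictions. The helper auc_rank is a hypothetical function written for this illustration; it uses the rank-based (Mann-Whitney) formulation of the AUC in base R, not the pROC implementation used in this report. If pROC happens to be installed, the last lines compute the same quantity with pROC::roc() and pROC::auc() for comparison.

```r
# Rank-based AUC: the probability that a randomly chosen positive observation
# receives a higher score than a randomly chosen negative one (ties count 1/2)
auc_rank <- function(labels, scores) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)

auc_rank(labels, c(0.1, 0.4, 0.35, 0.8))  # 0.75: one positive is ranked below a negative
auc_rank(labels, c(0.1, 0.2, 0.8, 0.9))   # 1: perfect separation
auc_rank(labels, c(0.5, 0.5, 0.5, 0.5))   # 0.5: uninformative scores

# Optional comparison with pROC, only if the package is available
if (requireNamespace("pROC", quietly = TRUE)) {
  roc_obj <- pROC::roc(labels, c(0.1, 0.4, 0.35, 0.8),
                       levels = c(0, 1), direction = "<")
  print(pROC::auc(roc_obj))
}
```

Because the AUC depends only on the ranking of the scores, it is unaffected by any monotonic rescaling of the model output.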

References

Baldi, P., P. Sadowski, and D. Whiteson. 2014. “Searching for Exotic Particles in High-Energy Physics with Deep Learning.” Nature Communications 5 (1). https://doi.org/10.1038/ncomms5308.