Chapter 1 Introduction
This report partially fulfills the requirements for the HarvardX course PH125.9x: “Data Science: Capstone”. The objective of this project is to apply machine learning techniques beyond standard linear regression to a publicly available dataset of choice.
1.1 The HIGGS dataset
The HIGGS dataset is a synthetic dataset simulating particle accelerator data (Baldi, Sadowski, and Whiteson 2014). Although the details of the simulated particle collisions are beyond the scope of this project, we summarize the contents of the dataset as follows.
The HIGGS dataset is available from the UCI Machine Learning Repository. The following code loads the dataset, downloading and unzipping the CSV file as necessary:
# Packages assumed here, if not already loaded elsewhere in the report
library(data.table) # fread
library(R.utils)    # gunzip
library(dplyr)      # select

options(timeout=1800) # Give more time for the download to complete
# Download and unzip the data file if needed
if(!dir.exists('data_raw')) {
  dir.create('data_raw')
}
if(!file.exists('data_raw/HIGGS.csv')) {
  download.file(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz',
    'data_raw/HIGGS.csv.gz')
  gunzip('data_raw/HIGGS.csv.gz')
}
# Load dataset into memory
higgs_all <- fread('data_raw/HIGGS.csv')
# Assign column names (csv contains no headers)
colnames(higgs_all) <- c('signal',
  'lepton_pT', 'lepton_eta', 'lepton_phi',
  'missing_E_mag', 'missing_E_phi',
  'jet1_pT', 'jet1_eta', 'jet1_phi', 'jet1_btag',
  'jet2_pT', 'jet2_eta', 'jet2_phi', 'jet2_btag',
  'jet3_pT', 'jet3_eta', 'jet3_phi', 'jet3_btag',
  'jet4_pT', 'jet4_eta', 'jet4_phi', 'jet4_btag',
  'm_jj', 'm_jjj', 'm_lv', 'm_jlv',
  'm_bb', 'm_wbb', 'm_wwbb')
# Separate input and output columns
xAll <- higgs_all |> select(-signal) |> as.data.table()
yAll <- higgs_all |> select(signal) |> as.data.table()
rm(higgs_all)
The HIGGS dataset contains 28 features and one binary target, signal. The signal value of an
observation indicates whether a particle collision produced a Higgs
boson as an intermediate product. Two possible processes are considered with the same input
particles, one of which generates the Higgs boson (the “signal” process) and one which does not
(the “background” process).
Of the 28 features in the dataset, the last seven features, prefixed ‘m_’, are called
“high-level” features and are based on computing the mass of expected intermediate decay products
in the signal and background processes, assuming that the observed final decay products were
generated by each process. The 21 “low-level” features consist of momentum data for a lepton
and four jets, b-tags for each jet marking the likelihood that the jet is associated with a
bottom quark, and partial information about the “missing” total momentum caused by undetected decay
products such as neutrinos. Due to the nature of the simulated particle colliders and detectors,
full directional data of this missing momentum is not available.
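As a minimal sketch of the grouping described above, the 28 feature names can be split into the two groups by their prefix. The vector names features, high_level, and low_level are introduced here for illustration only and do not appear elsewhere in this report.

```r
# Feature names as assigned in the loading code above (target 'signal' excluded)
features <- c(
  'lepton_pT', 'lepton_eta', 'lepton_phi',
  'missing_E_mag', 'missing_E_phi',
  'jet1_pT', 'jet1_eta', 'jet1_phi', 'jet1_btag',
  'jet2_pT', 'jet2_eta', 'jet2_phi', 'jet2_btag',
  'jet3_pT', 'jet3_eta', 'jet3_phi', 'jet3_btag',
  'jet4_pT', 'jet4_eta', 'jet4_phi', 'jet4_btag',
  'm_jj', 'm_jjj', 'm_lv', 'm_jlv',
  'm_bb', 'm_wbb', 'm_wwbb')

# High-level features carry the 'm_' prefix; all remaining features are low-level
high_level <- features[startsWith(features, 'm_')]
low_level <- setdiff(features, high_level)

length(high_level) # 7
length(low_level)  # 21
```

Such a grouping could be useful for comparing models trained on low-level features alone against models that also see the high-level features.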
1.2 Creating the final test splits
Since Keras, one of the libraries we will use, uses the term “validation” internally when fitting
a model, we will use the term “final test set” to denote the final hold-out set for
post-model evaluation.
# Create 10% test set split
set.seed(1)
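# createDataPartition is provided by the caret package
idx <- createDataPartition(yAll$signal, p = 0.1, list = FALSE)
gc()
x <- xAll[-idx,]
y <- yAll[-idx,]
# Take the test rows from the full tables; x and y no longer contain them
xFinalTest <- xAll[idx,]
yFinalTest <- yAll[idx,]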
# Clean-up
rm(xAll, yAll)
gc()
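The following sketch illustrates the property a hold-out split should satisfy: the training and test rows are disjoint and together cover the full table. It assumes a small toy data frame in place of the HIGGS data and uses base R's sample rather than caret's stratified createDataPartition; the names toy, train, and holdout are hypothetical.

```r
set.seed(1)
# Toy stand-in for the HIGGS table (hypothetical data, 100 rows)
toy <- data.frame(id = 1:100, signal = rbinom(100, 1, 0.5))

# Draw 10% of the row positions for the hold-out set
idx <- sample(nrow(toy), size = nrow(toy) %/% 10)

train <- toy[-idx, ]
holdout <- toy[idx, ] # index the full table, not the training subset

# The two sets should partition the rows of the full table
stopifnot(nrow(train) + nrow(holdout) == nrow(toy))
stopifnot(length(intersect(train$id, holdout$id)) == 0)
```

Both index operations must be applied to the same full table, since the positions in idx refer to rows of that table.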
1.3 Optimization metric
We will build a classification model for the HIGGS dataset,
optimizing the area under the receiver operating characteristic (ROC) curve (AUC). The AUC
is defined such that a perfect classifier has an AUC of 1 and a random classifier has an AUC of 0.5.
An advantage of using the AUC is that it summarizes the trade-off between the true and false positive
rates of a model across all possible classification thresholds, rather than at a single chosen threshold.
We will use the pROC package to create ROC curves and compute their area.
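To make the two reference values of the AUC concrete, the sketch below computes it on synthetic scores using the rank-based (Mann-Whitney) formulation, in which the AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. The helper auc_rank and the synthetic data are assumptions made for illustration, not part of the modeling code in this report; the pROC call at the end runs only if the package is installed.

```r
# Rank-based AUC: probability that a random positive outscores a random
# negative (ties receive average ranks, i.e. count as 1/2)
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
labels <- rep(c(0, 1), each = 500)
perfect <- labels + runif(1000, 0, 0.5) # every positive scores above every negative
random <- runif(1000)                   # scores unrelated to the labels

auc_perfect <- auc_rank(labels, perfect) # equals 1
auc_random <- auc_rank(labels, random)   # close to 0.5

# The same quantity via pROC, if the package is available
if (requireNamespace('pROC', quietly = TRUE)) {
  roc_obj <- pROC::roc(response = labels, predictor = random,
                       levels = c(0, 1), direction = '<', quiet = TRUE)
  print(pROC::auc(roc_obj))
}
```

Setting levels and direction explicitly keeps pROC from choosing the comparison direction automatically, which could otherwise report an AUC above 0.5 for a classifier that is worse than random.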