Comparative analysis of machine learning methods by the example of the problem of determining muon decay

Machine learning algorithms have long been used to analyze statistical models, and advances in computer technology have given them a new lease of life. Deep learning is now the mainstream and most popular area of machine learning. However, the authors believe that many researchers try to use deep learning methods beyond their area of applicability. This happens because software systems that implement deep learning algorithms are widely available and because such research appears simple. All this motivates the authors to compare deep learning algorithms with classical machine learning algorithms. A Large Hadron Collider experiment is chosen for this task because the authors are familiar with this scientific field and because the experimental data are openly available. The article compares various machine learning algorithms as applied to the problem of recognizing the decay reaction τ− → μ− + μ− + μ+ at the Large Hadron Collider. The authors use open source implementations of machine learning algorithms and compare the algorithms with each other on the basis of calculated metrics. The research shows that all the considered machine learning methods are quite comparable with each other (with respect to the selected metrics), while different methods have different areas of applicability.


Introduction
Machine learning is a branch of mathematical modeling concerned with the construction of surrogate statistical models. In recent years this area has grown intensively, driven by the development of computer technology and the ability to analyze large amounts of data (Big Data). Machine learning approaches, deep learning in particular, now demonstrate high efficiency in data science. Particularly significant results are obtained in the classification and cluster analysis of data with unknown structure. Deep learning is the most popular trend in machine learning: it has become the mainstream, and other areas have been pushed aside.
In this paper, the authors study whether deep learning is really superior to all other machine learning methods. Previously, the authors conducted a comparative analysis of the most popular software products for working with neural networks [1], and also tried to generalize the methodology for working with machine learning models [2].

Paper structure
This paper has the following structure. In section 2 we describe the problem of recognizing the decay reaction $\tau^- \to \mu^- + \mu^- + \mu^+$. A brief introduction to the physics of the process is given.
Section 3 briefly describes the software we use. Section 4 briefly describes the classification task, introduces the terminology from the field of machine learning, and considers the metrics that are used to evaluate the efficiency of classifiers.
Finally, we apply the Python language and the modules described in section 3 to the problem, and use the metrics to evaluate the effectiveness of the various machine learning methods.

The violations of the Standard model
Currently, the main model that describes particle physics is the Standard model, formulated in the 1960s and 1970s [3]. The Standard model has passed many experimental tests. However, from a methodological point of view, this theory is not satisfactory [4]. For example, the Standard model does not explain a number of phenomena, such as the matter-antimatter asymmetry. Research in theoretical and experimental physics that tries to extend the Standard model and describe phenomena beyond its reach has a collective name: physics beyond the Standard model.

Preservation of lepton numbers
The Large Hadron Collider (LHC) is the main tool for studying physics beyond the Standard model. Experiments performed at the LHCb detector (LHC beauty experiment) [5] aim to detect phenomena that contradict the theoretical framework of the Standard model. In particular, one of these phenomena is associated with violation of the conservation of the lepton number ($L$) and the lepton flavor numbers ($L_e$, $L_\mu$, $L_\tau$). Leptons are heuristically divided into three generations, a division that is necessary for the existence of the asymmetry of matter and antimatter:
- the first generation consists of the electron and the electron neutrino ($e^-$, $\nu_e$);
- the second generation consists of the muon and the muon neutrino ($\mu^-$, $\nu_\mu$);
- the third generation consists of the $\tau$-lepton (tau) and the tau neutrino ($\tau^-$, $\nu_\tau$).
As we can see from Table 1, according to the Standard model each lepton has four numbers, $L$, $L_e$, $L_\mu$ and $L_\tau$, and for every reaction between particles the sum of these numbers on the right side of the reaction equation must be equal to the sum on the left side (lepton number conservation).
Table 1. Reactions between particles in the standard model
This rule holds in the tau decay reactions allowed by the Standard model. However, there is a hypothetical tau decay reaction that violates it: $\tau^- \to \mu^- + \mu^- + \mu^+$.
Proton collisions at ultrahigh energies are performed at the LHC. On average, a collision generates about 80 different particles, most of which are unstable and decay quickly. Among them are tau leptons, which can be produced in one of five reactions.
The task is to build a classification model that is trained to recognize the decay reaction $\tau^- \to \mu^- + \mu^- + \mu^+$. To train the classifier, real data from the LHC (background events) are provided [6], with simulated signal data (signal events) added. The signal data are a simulation of the reaction $\tau^- \to \mu^- + \mu^- + \mu^+$. The classifier must have the following two properties.
- Small discrepancy between real data and simulation. To estimate the discrepancy, the data file check_agreement.csv is provided. These data relate to the reaction $D_s^+ \to \phi(\to \mu^- \mu^+)\pi^+$, which is topologically very similar to the desired decay of the $\tau^-$. The value of the Kolmogorov-Smirnov test statistic must be less than 0.09.
- Weak correlation with the mass of the $\tau^-$. To evaluate the correlation, the data file check_correlation.csv and the Cramér-von Mises (CvM) test are provided.

Software
To apply all the described classification methods, we use the Python language and a number of modules: SciKit Learn [7], Keras [8], XGBoost [9] and hep_ml [10]. We briefly describe each of them here.
SciKit Learn [7] is a library for data processing that implements various methods of classification, regression analysis, clustering, and other machine learning algorithms that do not use neural networks. The library is written in Python and uses a number of libraries from the SciPy stack to accelerate calculations. Although the current version number is only 0.22.2, the project is quite mature.
SciKit Learn implements almost all of the classification algorithms we describe. The Logistic Regression method is implemented in the submodule linear_model, the Gaussian Naive Bayes method in the submodule naive_bayes, and the Random Forest and Gradient Boosting Classifier methods in the submodule ensemble. The submodule sklearn.metrics contains functions that calculate various metrics for estimating the quality of a classifier.
The XGBoost library [9] is widely regarded as one of the best implementations of gradient boosting. It has an API for many languages, including Python. We use it along with SciKit Learn to apply the Gradient Boosting Classifier. In addition, because of the specifics of the task, we use the hep_ml module [10], which is specially designed for physics problems.
The Keras [8] library provides a high-level programming interface for building neural networks. It can work on top of TensorFlow, Microsoft Cognitive Toolkit (CNTK) [11] or Theano [12]. The library is written entirely in Python and distributed under the MIT license. The current version is 2.3.1. The library is based on the following principles: ease of use, modularity, and extensibility. Our choice of this library is justified in the article [1].
The modularity principle allows one to describe neural layers, optimizers, activation functions, and so on separately, and then combine them into one model. The model is fully described in Python. A created model can be saved to disk for future use and distribution.

Classification models
A classification model is built from an array of data presented in tabular form. Model construction usually consists of fitting numeric parameters and is also called model training. The purpose of the model is to predict the value of a dependent variable. In the case of a binary classifier, the dependent variable can take only two values: 0 or 1. In this case, the dependent variable is most often called the binary response or just the response. One can also meet the terms target, outcome, label, and Y-variable [13]-[15].
The model parameters are fitted on the basis of independent variables, which are represented by the columns of the table. The following terms are also used: predictor variable, attribute, and X-variable.
There are two types of predictor variables: numeric and factor (also called categorical) variables. Numeric variables are continuous and can take any value from some interval on the numeric axis, whereas factor variables are discrete, not necessarily numeric, and take values from a finite set. A special type of factor variable is the indicator variable, which takes only two values (0 or 1).
Depending on the model, it may be necessary to convert factor variables to numeric values or numeric variables to factors. For instance, when applying multiple linear regression to an array of data with factor variables, we need to convert them to a numeric type, for example by means of the logit conversion. Conversely, to apply the naive Bayesian classifier to continuous data, the data must be converted to a factor type.
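The two directions of conversion can be illustrated with a minimal sketch. It does not reproduce the logit conversion mentioned above: as a simpler stand-in, a factor variable is turned into indicator variables with OneHotEncoder from SciKit Learn, and a numeric variable is turned into a factor by binning with numpy.digitize. The variable names, values and bin edges are hypothetical and are not taken from the LHC data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical factor variable with three levels (one column, four records).
region = np.array([["barrel"], ["endcap"], ["barrel"], ["forward"]])

# Factor -> numeric: one indicator (0/1) column per level, levels sorted alphabetically.
encoder = OneHotEncoder()
indicators = encoder.fit_transform(region).toarray()

# Hypothetical numeric variable and bin edges chosen only for illustration.
momentum = np.array([0.4, 2.7, 5.1, 9.8])
bin_edges = np.array([1.0, 5.0])

# Numeric -> factor: bin index 0 (below 1.0), 1 (from 1.0 to 5.0) or 2 (5.0 and above).
momentum_factor = np.digitize(momentum, bin_edges)

print(encoder.categories_[0])
print(indicators)
print(momentum_factor)
```

The choice of bin edges determines how much information about the continuous variable is kept, so in practice it would need to be tuned to the data.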

Metrics for evaluating classification models
A number of numerical characteristics (metrics) are used to evaluate a classifier's performance. They allow us to compare different classifiers with each other and to choose the most suitable one for the given task [14].
The classifier is evaluated on a control sample (also called a test or verification sample). This sample consists of already classified elements and allows one to measure the performance of the classifier.
Let us assume that the size of the control sample is $N$ and that the binary classifier detects the response and assigns it the value 1 or 0. Since this detection is performed on a control sample, the class of each event is already known and we can check the classification results. All possible predictions fall into four cases.
1. True-positive (TP): the classification result is 1 and the true value is 1;
2. False-negative (FN): the classification result is 0 but the true value is 1;
3. False-positive (FP): the classification result is 1 but the true value is 0;
4. True-negative (TN): the classification result is 0 and the true value is 0.
Let us describe the main metrics that are used for classifier evaluation and specify the functions from the module sklearn.metrics [7], [16] that are used to calculate these metrics.
Let the total sample size be $N$, and let the classifier produce $TP$ true-positive, $FN$ false-negative, $FP$ false-positive and $TN$ true-negative cases, so that $N = TP + FN + FP + TN$. These counts form Table 2, called the confusion matrix.
Table 2. Confusion matrix
              Predicted 1    Predicted 0
True 1        TP             FN
True 0        FP             TN
The classification metrics are based on this matrix. It shows the number of correct and incorrect predictions grouped into categories by response type. Another name for this matrix is the error matrix. To calculate it we use the confusion_matrix function from the SciKit Learn library.
Accuracy is the proportion of events that the classifier identified correctly. It is calculated using the formula
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
and with the accuracy_score function.
DCM&ACS. 2020, 28 (2) 105-119
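Recall is the proportion of correctly classified events of type 1 (also called ones). It is calculated using the formula
$$\text{recall} = \frac{TP}{TP + FN}$$
and with the recall_score function. The terms sensitivity and true-positive rate are also used.
Specificity is the proportion of correctly classified events of type 0 (also called zeros). It is calculated using the formula
$$\text{specificity} = \frac{TN}{TN + FP}$$
and also with the recall_score function (for a binary classifier, calling it with pos_label=0 returns the specificity, and calling it with average=None returns both the specificity and the recall). The term true-negative rate is also used; the false-positive rate equals 1 − specificity.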
Precision is the proportion of predicted ones that are actually ones. It is calculated using the formula
$$\text{precision} = \frac{TP}{TP + FP}$$
and with the precision_score function.
One can create a classifier that assigns all events to class 1. For such a classifier, the recall is equal to 1 and the specificity to 0. An ideal classifier should detect events of class 1 without incorrectly identifying events of class 0 as events of class 1. Thus a balance must be maintained between recall and specificity. To evaluate this balance, one uses a graphical method called the ROC curve (receiver operating characteristic curve).
The ROC curve is a graph of recall versus specificity: recall (the true-positive rate) is plotted on one axis and specificity, usually in the form of the false-positive rate 1 − specificity, on the other. The graph of a completely ineffective classifier is a diagonal line. More effective classifiers have a graph in the form of an arc. The more tightly the arc is pressed against the upper-left corner, the more effective the classifier. The data required to build the curve are calculated with the roc_curve function.
To summarize the ROC curve in a single number, one uses the AUC metric, the area under the ROC curve. A classifier whose ROC curve is a diagonal line has AUC = 0.5. The more effective the classifier, the closer the AUC value is to 1. AUC is calculated by the auc function.
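The metrics of this section can be illustrated with a minimal sketch that uses only the sklearn.metrics functions named in the text. The ten labels and scores below are invented for illustration and are not taken from the LHC data; the cut-off threshold of 0.5 is likewise an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, auc, confusion_matrix,
                             precision_score, recall_score, roc_curve)

# Hypothetical control sample: true classes and classifier scores for class 1.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05])
y_pred = (y_score >= 0.5).astype(int)  # assumed cut-off threshold of 0.5

# scikit-learn orders rows (true class) and columns (prediction) by label,
# so the flattened matrix reads TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)                # (TP + TN) / N
recall = recall_score(y_true, y_pred)                    # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)  # TN / (TN + FP)
precision = precision_score(y_true, y_pred)              # TP / (TP + FP)

# ROC curve: false-positive rate (1 - specificity) against recall over all thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print(tp, fn, fp, tn)
print(accuracy, recall, specificity, precision, roc_auc)
```

Note that accuracy, recall, specificity and precision depend on the chosen threshold, whereas the ROC curve and AUC are computed from the scores themselves.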

Logistic Regression
Logistic regression [17], [18] is an analog of multiple linear regression, except that the response is binary. To adapt multiple linear regression to this case, it is necessary to do the following:
- represent the dependent variable as a probability function with values in the segment [0, 1] (probabilistic outcome);
- apply a cutoff rule: any outcome with a probability greater than the threshold is classified as 1.
Whereas classical multiple regression models the response as a linear function of the predictor variables,
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_n x_n,$$
logistic regression models the probability $p$ of the response using the logistic response function (inverse logit function, or sigmoid):
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_n x_n)}}.$$
The range of values of this function is the interval (0, 1), so we can interpret its values as the probability of the response.
To fit the parameters, we consider not the function itself but the log-odds function
$$\ln\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots + \beta_n x_n,$$
which maps the probability from the interval (0, 1) to the set of real numbers. Then one uses the maximum likelihood method to select the parameters on the basis of a training sample. After the parameters are selected, it remains to choose the cut-off threshold. For example, if it is set to 0.5, then all responses with a value < 0.5 are classified as 0, and those with a value >= 0.5 as 1.
In the sklearn library, logistic regression is implemented by the LogisticRegression class, located in the linear_model module.
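The relation between the fitted coefficients, the log-odds, the sigmoid and the cut-off rule can be checked with a minimal sketch. The data are synthetic, and the coefficient values used to generate them are assumptions made only for this illustration; LogisticRegression is used with its default settings, which include regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: three numeric predictors and a binary response drawn
# from a logistic model with assumed coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
assumed_beta = np.array([1.5, -2.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-(0.3 + X @ assumed_beta)))
y = (rng.random(500) < p_true).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Log-odds: beta_0 + beta_1 x_1 + ... + beta_n x_n with the fitted parameters.
log_odds = model.intercept_[0] + X @ model.coef_[0]

# Logistic response function (sigmoid) maps the log-odds to a probability in (0, 1).
prob = 1.0 / (1.0 + np.exp(-log_odds))

# Cut-off rule with threshold 0.5.
labels = (prob >= 0.5).astype(int)

print(model.intercept_, model.coef_)
print((labels == y).mean())
```

The probabilities computed by hand agree with model.predict_proba(X)[:, 1], and a different threshold can be applied to them when recall and specificity need to be balanced differently.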

Gaussian Naive Bayes
The naive Bayesian classifier [15], [19] is a binary classifier. Assignment to a particular class is based on the conditional probability $P(Y \mid x_1, \ldots, x_n)$, which is calculated using Bayes' theorem:
$$P(Y \mid x_1, \ldots, x_n) = \frac{P(Y)\, P(x_1, \ldots, x_n \mid Y)}{P(x_1, \ldots, x_n)}.$$
Next, we make the «naive» assumption that all predictor variables are independent within each class, so that the joint probability $P(Y, x_1, \ldots, x_n)$ can be calculated using the formula
$$P(Y, x_1, \ldots, x_n) = P(Y) \prod_{i=1}^{n} P(x_i \mid Y).$$
If the predictor variables are assumed to be numeric (i.e. continuous), then a second «naive» assumption is made about the continuity of the distribution function and about the type of distribution. Most often the normal distribution is used, which gives the Gaussian Naive Bayes classifier.
The advantages of a Naive Bayes classifier include simplicity (there are only a few hyperparameters to set) and speed.
In the sklearn library, the Gaussian Naive Bayes classifier is implemented by the GaussianNB class, located in the naive_bayes module.

Bagging and Random Forest
Tree models [20], or decision trees, are a popular, relatively simple, and yet effective classification method.
Decision trees define a set of classification rules. The rules correspond to the sequential splitting of the data into segments. Each rule can be expressed as an «if-then» condition imposed on a predictor variable. For each predictor, a split value is defined that divides the records into those where the value of the predictor variable is greater and those where it is less. A set of such rules forms a tree whose leaves correspond to one of the two required classes (for a binary classifier).
The advantages of tree models are the simplicity of interpreting the results and the ability to reproduce the branching rules in natural language. However, one should avoid overtraining these models. Overtraining (overfitting) means that the branching rules start to take random noise into account. To prevent overtraining, one should limit the depth of the tree.
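The «if-then» rules and the effect of limiting the depth can be seen in a minimal sketch with DecisionTreeClassifier from the sklearn.tree module, which is not named in the text and is used here as an assumption. The data are synthetic: the true rule depends on two thresholds, and 10% of the labels are flipped to imitate random noise.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: class 1 when x1 > 0.2 and x2 < 0.5, with 10% label noise.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0.2) & (X[:, 1] < 0.5)).astype(int)
flip = rng.random(400) < 0.1
y_noisy = np.where(flip, 1 - y, y)

# A depth-limited tree and an unrestricted tree trained on the same noisy data.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y_noisy)
deep = DecisionTreeClassifier(random_state=0).fit(X, y_noisy)

# The branching rules of the shallow tree as readable if-then conditions.
print(export_text(shallow, feature_names=["x1", "x2"]))

# The unrestricted tree keeps splitting until it also reproduces the noisy labels.
print(shallow.get_depth(), deep.get_depth())
print(shallow.score(X, y_noisy), deep.score(X, y_noisy))
```

The unrestricted tree reaches a perfect score on the training sample only because it memorizes the flipped labels, which is the overtraining described above.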
Trees became particularly popular with the introduction of the ensemble approach. Its essence is to train a set of decision trees on the same data and then take the average or weighted average of their results.
One of the ensemble training methods is called bagging, or bootstrap aggregation. The bootstrap process involves repeatedly drawing a random set of records from a sample. The number of extracted records is less than the sample size. The most common variant is the bootstrap with replacement. Replacement means that the extracted data are returned to the sample after use, mixed, and used for subsequent draws. The bagging process consists of training trees on multiple bootstrap samples drawn with replacement.
The random forest machine learning algorithm uses bagging and, in addition to the bootstrap, also samples the predictor variables. In other words, each new tree is built on a random subset of variables rather than on all possible variables. There is an empirical rule that it is most efficient to select only $\sqrt{n}$ of the $n$ predictor variables each time.
In the sklearn library, the random forest algorithm is implemented by the RandomForestClassifier class, located in the ensemble module.
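The difference between plain bagging and a random forest can be sketched as follows. BaggingClassifier is not named in the text and is used here as an assumption to represent bagging of decision trees; the number of trees, the fraction of records per bootstrap sample (0.8) and the synthetic data with 16 predictors are also chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 16 predictors, so that sqrt(16) = 4.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 16))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

# Bagging: every tree is trained on a bootstrap sample (drawn with replacement,
# smaller than the full sample) but may use all predictors at every split.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,
    bootstrap=True,
    random_state=0,
).fit(X, y)

# Random forest: bootstrap samples plus a random subset of sqrt(n) predictors
# considered at each split.
forest = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",
    random_state=0,
).fit(X, y)

print(len(bagging.estimators_), len(forest.estimators_))
print(forest.estimators_[0].max_features_)  # predictors tried per split
```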

Gradient Boosting Classifier
The gradient boosting method [21] consists of combining a large number of simple models to produce one that is more accurate than any individual simple model. A set of simple models is called an ensemble, and boosting means the sequential process of building simple models.
The gradient boosting algorithm is one of the most commonly used machine learning algorithms. We will give only a brief qualitative description here, without going into mathematical details [22], [23].
At each step of gradient boosting, the selected loss function is minimized by gradient descent. The loss function is constructed for the selected base algorithm. Most often the base algorithm is the decision tree algorithm. When each subsequent model is built, the errors of the previous one are taken into account. This is done by identifying the data that do not fit the previous simple model and adding a next model that processes these data correctly. When the algorithm is configured, the maximum number of models in the ensemble is specified, and the algorithm stops when this number of iterations is reached. Each model in the ensemble is assigned a certain weight, and their predictions are combined.
In the sklearn library, the gradient boosting algorithm is implemented by the GradientBoostingClassifier class, located in the ensemble module.
In addition to the implementation included in SciKit Learn, Python also has the XGBoost [9] library, which is highly optimized and has interfaces for a large number of programming languages (C/C++, Java, Ruby, Julia, R).
In addition to the implementations from these two libraries, we used the gradient boosting implementation from the hep_ml [10] library, which contains machine learning methods used in the field of high-energy physics.
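The sequential character of boosting can be observed with a minimal sketch based on GradientBoostingClassifier from SciKit Learn only; XGBoost and hep_ml are not used here so that the example stays self-contained. The synthetic data and the hyperparameter values (100 trees of depth 3, learning rate 0.1) are assumptions and not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Synthetic data with a nonlinear decision boundary.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)

# n_estimators is the maximum number of simple models (trees) in the ensemble.
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=0,
).fit(X, y)

# Training loss after each boosting step: every added tree corrects
# part of the errors left by the trees built before it.
losses = [log_loss(y, proba) for proba in gbc.staged_predict_proba(X)]

print(losses[0], losses[-1])
```

A decreasing training loss alone does not show that the model generalizes, so in practice the same curve would also be examined on a test sample.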

Neural Network
In the article [1], the authors compared various libraries for building neural networks [8], [12], [24]-[26]. The results of the speed and accuracy tests show that the Keras library provides the best solution. Therefore, to solve the problem of recognizing the decay reaction $\tau^- \to \mu^- + \mu^- + \mu^+$, we build a neural network using this library.

Application of the considered methods
We carry out a comparative analysis of the classifiers from section 4 by applying them to the problem of determining muon decay. This is a binary classification problem based on data from the LHC and on generated data for detecting muon decay. Training and test data are provided in csv files. The data contain the values of 40 analyzed parameters. The target attribute is the «signal» attribute, which takes the values 0 or 1. To train the classifiers, the data set was divided into training and test samples in the ratio of 8 to 2: the number of records for training is 54042, and the number of records for testing is 13511.
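The 8 to 2 division can be sketched with train_test_split from SciKit Learn. The text does not say which function or random seed was used, so both are assumptions, and random numbers stand in for the csv data; only the total number of records (54042 + 13511 = 67553) and the number of parameters are taken from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the data set: 67553 records with 40 parameters
# and a binary «signal» attribute.
n_records, n_features = 67553, 40
rng = np.random.default_rng(5)
X = rng.normal(size=(n_records, n_features))
signal = rng.integers(0, 2, size=n_records)

# Division into training and test samples in the ratio of 8 to 2.
X_train, X_test, y_train, y_test = train_test_split(
    X, signal, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```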
We choose the MLP (multi-layer perceptron) architecture for the neural network. The network consists of fully connected layers with Batch Normalization layers between them, which help to prevent overtraining. Each of the fully connected layers contains a different number of neurons. The input layer consists of 28 neurons, the hidden layers contain 100, 120, 60, and 20 neurons, and the output layer contains 2, according to the number of classes in the data.
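The described architecture might be assembled in Keras roughly as follows. This is a sketch under several assumptions that the text does not confirm: the 28 input neurons are treated as the width of the input vector, the hidden layers use the ReLU activation, the output layer uses softmax, and the model is compiled with the Adam optimizer and a cross-entropy loss. The import goes through tensorflow.keras rather than the standalone Keras 2.3.1 package mentioned earlier, and the helper name build_mlp is hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_mlp(n_inputs=28, hidden=(100, 120, 60, 20), n_classes=2):
    """Fully connected network with Batch Normalization after each hidden layer."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_inputs,)))
    for units in hidden:
        model.add(layers.Dense(units, activation="relu"))  # assumed activation
        model.add(layers.BatchNormalization())
    # Two output neurons, one per class; softmax is an assumption.
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(
        optimizer="adam",                        # assumed optimizer
        loss="sparse_categorical_crossentropy",  # assumed loss for labels 0/1
        metrics=["accuracy"],
    )
    return model


model = build_mlp()
model.summary()
```

Training would then be a call such as model.fit(X_train, y_train, ...) on the training sample, with the number of epochs and the batch size chosen separately.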
All classifiers were tested for a small discrepancy between real data and simulation (Kolmogorov-Smirnov test; the test value for the classifier should be less than 0.09) and for a weak correlation with mass (Cramér-von Mises (CvM) test; the test value for the classifier should be less than 0.002). Table 3 lists the values of these tests (see also Table 4).
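The idea of the agreement check can be sketched with the two-sample Kolmogorov-Smirnov statistic from SciPy. This is a simplified, unweighted illustration and not the exact procedure applied to check_agreement.csv, which may involve event weights; the classifier outputs below are random numbers drawn from beta distributions chosen only for this example.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(6)

# Hypothetical classifier outputs on real and simulated control-channel events.
pred_real = rng.beta(2, 5, size=5000)
pred_sim = rng.beta(2, 5, size=5000)      # same distribution as the real data
pred_shifted = rng.beta(3, 5, size=5000)  # classifier that tells simulation apart

# KS statistic: the largest distance between the two empirical distribution functions.
ks_ok = ks_2samp(pred_real, pred_sim).statistic
ks_bad = ks_2samp(pred_real, pred_shifted).statistic

threshold = 0.09  # limit quoted in the text
print(ks_ok, ks_ok < threshold)
print(ks_bad, ks_bad < threshold)
```

A classifier whose outputs are distributed differently on real and simulated events exceeds the limit, which signals that it has learned features of the simulation rather than of the decay itself.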

Discussion
The paper presents a comparative analysis of various machine learning algorithms on the example of the problem of determining the decay reaction $\tau^- \to \mu^- + \mu^- + \mu^+$ at the LHC. We study the following machine learning algorithms (MLA): Logistic Regression, the Gaussian Naive Bayes classifier, the gradient boosting classifier, bootstrap aggregating (bagging) and random forest, and a neural network model. For each of the algorithms, we build a classifier using Python libraries and calculate metrics that can be used to determine the most effective model.
All classifiers successfully passed the tests for a small discrepancy between real data and simulation (Kolmogorov-Smirnov test) and for a weak correlation with mass (Cramér-von Mises test), which indicates a good quality of the constructed classifiers for this problem.
To conduct a comparative analysis of machine learning methods, we calculate the most important metrics for each model: accuracy, ROC-AUC score, recall, F1-score, and precision. Taking all metrics together, the random forest and the gradient boosting method (and their modifications) show the best results. Logistic Regression, Gaussian Naive Bayes and the model based on a fully connected neural network show worse results. However, the neural network surpasses the other classifiers in the value of the precision metric. This means that, among the events it assigns to class 1, the neural network makes fewer false-positive errors than the other classifiers.