If you are starting your studies in machine learning, a good first step is to look at some classical models.
In this post, the following classical models will be briefly discussed:
- k-nearest neighbors;
- decision tree;
- random forest;
- support vector machine;
- least squares support vector machine;
- gradient boosting;
- LightGBM.
You can find simulations on GitHub applying these classical models to benchmark series, financial series, and solar energy datasets at the following link: https://github.com/kaikerochaalves/ClassicalModels.git
K-NEAREST NEIGHBORS ALGORITHM
The k-nearest neighbors algorithm, k-NN or KNN for short, introduced by Fix and Hodges [1] and expanded by Cover and Hart [2], is a nonparametric, supervised model used for classification and regression problems. In KNN, the user defines the value of k, an integer greater than zero and lower than the number of training samples, which represents the number of neighbors checked to determine the output of a query point. The model tends to have higher bias and lower variance for larger k values; conversely, lower k values tend to result in lower bias and higher variance. Furthermore, odd k values are recommended in classification to avoid ties. To find the closest data points to the query point, it is necessary to compute the distance between the query point and the other data points, and different distance metrics can be defined. Three well-known distances are the Euclidean, Manhattan, and Minkowski.
- Euclidean:
\begin{equation}\label{Euclidean}
d(x,y) = \sqrt{\sum_{i=1}^{n}(y_{i} - x_{i})^{2} }
\end{equation}
- Manhattan:
\begin{equation}\label{Manhattan}
d(x,y) = \sum_{i=1}^{n} |y_{i} - x_{i}|
\end{equation}
- Minkowski:
\begin{equation}\label{Minkowski}
d(x,y) = \left( \sum_{i=1}^{n} |y_{i} - x_{i}|^{p} \right)^{\frac{1}{p}}
\end{equation}
where x and y are two vectors, n is the number of elements in x and y, and p is an integer greater than zero. Manhattan distance and Euclidean distance are special cases of Minkowski distance with p equal to 1 and 2, respectively.
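As a quick sanity check of the formulas above, the following minimal NumPy sketch (with arbitrary, hypothetical vectors) computes the Minkowski distance and shows that it recovers the Manhattan and Euclidean distances for p = 1 and p = 2:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between vectors x and y for a given p."""
    return np.sum(np.abs(y - x) ** p) ** (1.0 / p)

# Arbitrary example vectors, for illustration only.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

manhattan = np.sum(np.abs(y - x))          # p = 1
euclidean = np.sqrt(np.sum((y - x) ** 2))  # p = 2

print(manhattan, minkowski(x, y, p=1))  # same value twice
print(euclidean, minkowski(x, y, p=2))  # same value twice
```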
Among the advantages of KNN, the following can be highlighted:
- KNN provides good results and has a simple implementation.
- KNN adapts well to new data. When new training samples are available, the algorithm simply incorporates them, since no retraining is required.
- KNN has fewer hyperparameters than most other machine learning models: the k value and the distance metric.
Among the disadvantages of KNN, the following can be highlighted:
- KNN is a lazy algorithm, i.e., it doesn't perform any generalization in the training phase; all samples are kept in memory to be used in the testing phase. This can result in higher memory and computational costs, mainly with larger datasets.
- KNN usually has higher errors with high-dimensional data. Consequently, dimensionality reduction techniques often improve KNN's performance.
- KNN is more prone to underfit if the k value is too high and to overfit if it is too low.
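As a concrete example, the sketch below fits a KNN regressor with scikit-learn on hypothetical toy data; the k value (n_neighbors) and the distance metric are exactly the two hyperparameters discussed above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: y is a noisy function of a single feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k (n_neighbors) and the distance metric are the main hyperparameters.
model = KNeighborsRegressor(n_neighbors=5, metric="minkowski", p=2)
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))
```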
DECISION TREES
The decision tree (DT) algorithm is widely applied to classification and regression problems. A DT, also referred to as a regression tree when used for regression, is a supervised, nonparametric machine learning model inspired by the structure of trees. The work of Feigenbaum and Simon [3] is credited as one of the first remarkable papers to use a decision tree, and Hunt et al. [4] later established the foundations of many decision tree algorithms in their work on modeling human learning in psychology, dating from the 1960s.
The basic structure of a DT is composed of nodes, branches, and leaf nodes. The first node is named the root node, and the branches represent the connections between the root node and internal nodes, between two internal nodes, or between an internal node and a leaf node. The internal nodes keep information on the attributes, and the leaf nodes contain the target values. The literature presents many criteria for splitting nodes in decision trees, but two remarkable methods are based on information gain and Gini impurity. Moreover, pruning, which consists of discarding some branches of the tree, is an essential technique in DT to maintain the model's accuracy and avoid overfitting, as sketched below.
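To make the pruning point concrete, here is a minimal scikit-learn sketch (with hypothetical toy data) that compares a fully grown regression tree with one constrained by max_depth and cost-complexity pruning (ccp_alpha):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree tends to overfit the training data.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Limiting the depth and applying cost-complexity pruning discards weak branches.
pruned_tree = DecisionTreeRegressor(
    max_depth=4, ccp_alpha=0.01, random_state=0
).fit(X_train, y_train)

print("full tree   R^2:", full_tree.score(X_test, y_test))
print("pruned tree R^2:", pruned_tree.score(X_test, y_test))
```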
Among the advantages of DT, the following can be highlighted:
- DT consists of logical and visual ways to present information. It is also easy to see the order of importance among the attributes.
- There is little or no need to preprocess the data. DT can deal with many types of data and can handle missing values.
- Two highly correlated attributes don't represent a problem for DT.
Among the disadvantages of DT, the following can be highlighted:
- DT is more prone to overfitting when it becomes more complex. That is the reason why pruning the tree is essential.
- DT can present more computational cost in the training phase than other algorithms.
- DT presents high variance: slight variations in the data can result in entirely different tree structures.
RANDOM FOREST
Random forest (RF) is an ensemble model proposed by Breiman that can be applied to classification or regression problems [5]. RF has three main hyperparameters: the node size, the number of trees, and the number of features sampled at each split. For each tree, RF randomly selects samples from the training set with replacement; this set, called a bootstrap sample, contains on average about two-thirds of the distinct training samples and is used to grow one tree. The remaining samples, called the out-of-bag (OOB) sample, are set aside for validation. This procedure is repeated until the model obtains the number of trees defined by the user.
Moreover, each tree is generated using only part of the attributes, i.e., at each split the model randomly selects the number of features defined by the user. Finally, the model combines the outputs of all decision trees: for regression, the output is the average of the individual trees' predictions; for classification, it is the most frequent label among the trees. At the end of training, the model has a forest of weakly correlated decision trees, which reduces the chance of overfitting and the overall variance and, consequently, increases the model's accuracy.
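The following scikit-learn sketch (with hypothetical toy data) exposes exactly these hyperparameters: the number of trees (n_estimators), the number of features sampled at each split (max_features), the node size (min_samples_leaf), and the out-of-bag score computed from the samples left out of each bootstrap:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=500)

model = RandomForestRegressor(
    n_estimators=200,     # number of trees
    max_features="sqrt",  # features sampled at each split
    min_samples_leaf=3,   # node size
    oob_score=True,       # validate each tree on its out-of-bag samples
    random_state=0,
)
model.fit(X, y)
print("out-of-bag R^2:", model.oob_score_)
print("feature importances:", model.feature_importances_)
```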
Among the advantages of RF, the following can be highlighted:
- RF reduces the chance of overfitting due to the averaging of uncorrelated trees.
- RF can be used to estimate missing data.
- RF provides an easy way to estimate the attributes' importance, as it can measure the drop in accuracy when a feature's values are permuted.
Among the disadvantages of RF, the following can be highlighted:
- RF can present a higher execution time when dealing with large datasets, since it must build and evaluate every decision tree in the ensemble.
- RF needs more computational resources to deal with larger datasets.
- RF presents more complexity in interpreting the results since the model produces many decision trees.
SUPPORT VECTOR MACHINE
Boser et al. [6] introduced a binary linear classification technique called the support vector machine (SVM). After its proposal, SVM was extended to multi-class, non-linear, and regression problems. Given a dataset with N features, SVM looks for the hyperplane in the N-dimensional space that separates the data points into two classes with the maximum distance between the hyperplane and the nearest data points of each class, i.e., the largest margin. Finding the hyperplane with the maximum margin gives more confidence about future classifications.
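The scikit-learn sketch below (with hypothetical, roughly separable two-class data) fits a linear SVM classifier; the support vectors it reports are the training points that define the maximum-margin hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, roughly separable two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(100, 2)),
               rng.normal(loc=2.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# C trades off a wide margin against training errors; a non-linear
# kernel (e.g. "rbf") would handle classes that are not linearly separable.
model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

print("support vectors per class:", model.n_support_)
print("training accuracy:", model.score(X, y))
```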
Among the advantages of SVM, the following can be highlighted:
- SVM has improved performance when there is an evident separation between the classes.
- SVM tends to perform well in high-dimensional spaces.
- Compared with other algorithms, SVM is memory efficient.
Among the disadvantages of SVM, the following can be highlighted:
- SVM is not suitable for dealing with large datasets.
- SVM presents long training times.
- SVM is not recommended for noisy datasets, as overlapping classes significantly increase misclassification.
LEAST SQUARES SUPPORT VECTOR MACHINE
Suykens and Vandewalle [7] proposed a variation of SVM called the least squares support vector machine (LS-SVM). LS-SVM obtains a linear set of equations in the dual space by adopting equality constraints and a quadratic (squared) error term instead of the inequality constraints of SVM, so training reduces to solving a linear system rather than a quadratic programming problem.
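Since the original formulation is only summarized here, the following is a minimal NumPy sketch of LS-SVM regression under the standard formulation: with an RBF kernel matrix Omega, the bias b and the dual coefficients alpha come from a single linear system, and predictions are kernel expansions (gamma and sigma are hypothetical hyperparameter choices):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM regression dual: one linear system instead of a QP."""
    n = X.shape[0]
    Omega = rbf_kernel(X, X, sigma)
    A = np.block([[np.zeros((1, 1)), np.ones((1, n))],
                  [np.ones((n, 1)), Omega + np.eye(n) / gamma]])
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, dual coefficients alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)
b, alpha = lssvm_fit(X, y)
print(lssvm_predict(X, alpha, b, X[:5]))  # fitted values for a few points
```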
Among the advantages of LS-SVM, the following can be highlighted:
- LS-SVM presents faster convergence than SVM, since training reduces to solving a linear system.
- LS-SVM can obtain higher accuracy than SVM on some problems.
Among the disadvantages of LS-SVM, the following can be highlighted:
- On other problems, LS-SVM classification performance is worse than that of standard SVM.
- LS-SVM is sensitive to outliers.
GRADIENT BOOSTING
Based on a statistical framework, Freund and Schapire [8], Friedman et al. [9], and Friedman [10] derived the first concepts of boosting methods that led to the development of the gradient boosting machine (GBM). GBM is an ensemble machine-learning technique used for regression or classification, in which weak learners (usually shallow decision trees) are added sequentially, each one fitted to correct the errors of the current ensemble. The stopping criterion is typically based on the number of decision trees or on the value of the loss function. The first successful boosting algorithm was Adaptive Boosting (AdaBoost), proposed by Freund and Schapire [8] for binary classification. After that, many other GBM variations were presented [11, 12].
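A minimal scikit-learn sketch (with hypothetical toy data) shows the sequential ensemble and the two stopping criteria mentioned above: a fixed maximum number of trees (n_estimators) or early stopping when the loss on a validation fraction stops improving (n_iter_no_change):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=500)

model = GradientBoostingRegressor(
    n_estimators=500,         # maximum number of trees
    learning_rate=0.05,       # shrinkage applied to each tree's contribution
    max_depth=3,              # weak learners are shallow trees
    validation_fraction=0.2,  # held-out data used for early stopping
    n_iter_no_change=10,      # stop when the validation loss stops improving
    random_state=0,
)
model.fit(X, y)
print("trees actually fitted:", model.n_estimators_)
```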
Among the advantages of GBMs, the following can be highlighted:
- GBM is robust and can capture dependencies in complex non-linear functions.
- Once trained, a GBM can be applied in parallel, since predictions for different samples are independent.
Among the disadvantages of GBMs, the following can be highlighted:
- GBMs have a high memory consumption demand.
- GBMs are not considered fast models.
LightGBM
Ke et al. [13] proposed a GBM algorithm called LightGBM. LightGBM combines two approaches, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to create a model that performs as well as or better than classical GBM with reduced computational cost. LightGBM implements GOSS to reduce the effective size of the dataset, keeping the samples with large gradients and randomly subsampling those with small gradients. EFB, in turn, reduces the number of features by bundling mutually exclusive features (features that are rarely non-zero at the same time) together.
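Assuming the lightgbm Python package is installed, the sketch below uses its scikit-learn interface on hypothetical toy data; the hyperparameter values are illustrative choices, not values taken from the paper:

```python
import numpy as np
from lightgbm import LGBMRegressor

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 10))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=1000)

model = LGBMRegressor(
    n_estimators=300,      # number of boosting rounds
    learning_rate=0.05,    # shrinkage per tree
    num_leaves=31,         # leaf-wise growth: the main complexity control
    colsample_bytree=0.8,  # feature subsampling per tree
    random_state=0,
)
model.fit(X, y)
print("training R^2:", model.score(X, y))
```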
Among the advantages of LightGBM, the following can be highlighted:
- LightGBM presents faster computational speed and lower memory consumption than other GBMs.
- LightGBM often achieves accuracy comparable to or better than other GBMs.
Among the disadvantages of LightGBM, the following can be highlighted:
- LightGBM is more prone to overfitting, especially on small datasets, because of its leaf-wise tree growth.
REFERENCES
[1] Fix, E., Hodges, J.: Discriminatory analysis, nonparametric discrimination (1951)
[2] Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27 (1967). https://doi.org/10.1109/TIT.1967.1053964
[3] Feigenbaum, E.A., Simon, H.A.: A theory of the serial position effect. British Journal of Psychology 53(3), 307–320 (1962). https://doi.org/10.1111/j.2044-8295.1962.tb00836.x
[4] Hunt, E.B., Marin, J., Stone, P.J.: Experiments in Induction. Academic Press, New York (1966)
[5] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001).
https://doi.org/10.1023/A:1010933404324
[6] Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992).
https://doi.org/10.1145/130385.130401
[7] Suykens, J.A., Vandewalle, J.: Chaos control using least-squares support vector machines. International Journal of Circuit Theory and Applications 27(6), 605–615 (1999).
https://doi.org/10.1002/(SICI)1097-007X(199911/12)27:6%3C605::AID-CTA86%3E3.0.CO;2-Z
[8] Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119–139 (1997).
https://doi.org/10.1006/jcss.1997.1504
[9] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics 28(2), 337–407 (2000).
https://doi.org/10.1214/aos/1016218223
[10] Friedman, J.H.: Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29(5), 1189–1232 (2001).
https://doi.org/10.1214/aos/1013203451
[11] Natekin, A., Knoll, A.: Gradient boosting machines, a tutorial. Frontiers in Neurorobotics 7, 21 (2013). https://doi.org/10.3389/fnbot.2013.00021
[12] Bentéjac, C., Csörgõ, A., Martínez-Muñoz, G.: A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review 54, 1937–1967 (2021).
https://doi.org/10.1007/s10462-020-09896-5
[13] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y.: LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30 (2017).