Classification: Predicting the onset of diabetes

Classification
Machine Learning
Supervised Learning
SVM
Random Forest
Decision Tree
XGBoost
Author

Anushka S

Published

December 3, 2023

Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence. It is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.

Classification uses machine learning algorithms that learn how to assign a class label to examples from the problem domain. The class labels are a set of discrete values. A model will use the training dataset to learn how to best map examples of input data to specific class labels. As such, the training dataset must be sufficiently representative of the problem and have many examples of each class label. Based on the number of class labels, classification can be binary classification (2 class labels) or multi-class classification (>2 class labels). You can read more on classification here.

In this blog, we will be dealing with binary classification on the Pima Indian Diabetes dataset from the UCI Machine Learning Repository. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Below, we can see a sample of the dataset chosen.

Code
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

df_diabetes = pd.read_csv("diabetes.csv")
df_diabetes.head(10)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1

We can plot the correlation between the features (columns), and in the chart below, we can see which features have a higher correlation with the target variable.

Code
import seaborn as sns
import matplotlib.pyplot as plt


f, ax = plt.subplots(figsize=(8, 8))

corr = df_diabetes.corr()
sns.heatmap(corr,
    cmap=sns.diverging_palette(220, 10, as_cmap=True),
    vmin=-1.0, vmax=1.0,
    annot = True,
    square=True, ax=ax);
plt.show()

We can view the info about the data. We see that there are no null values, but several columns contain 0 values that actually represent missing measurements. It is important to handle missing data and prepare it well before it is passed to a classification model.

Code
print(df_diabetes.info())
print(df_diabetes.drop(columns=['Pregnancies', 'Outcome']).isin([0, 0.0]).sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
dtype: int64

We convert these 0 placeholders into NaN values so that we can apply the k-nearest neighbours imputation algorithm.

Code
# store the Outcome labels in y, and keep Pregnancies aside temporarily
# (Pregnancies can legitimately be 0, so it is excluded from the 0-as-missing replacement)
pregnancies = df_diabetes['Pregnancies']
y = df_diabetes['Outcome']
df_diabetes = df_diabetes.drop(columns=['Pregnancies', 'Outcome'])
# convert the 0 placeholder values into NaN for imputing
df_diabetes.replace(0, np.nan, inplace=True)
print(f"Number of missing values = {np.isnan(df_diabetes.to_numpy()).sum()}")
df_diabetes['Pregnancies'] = pregnancies
columns = df_diabetes.columns
df_diabetes.head(5)
Number of missing values = 652
Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Pregnancies
0 148.0 72.0 35.0 NaN 33.6 0.627 50 6
1 85.0 66.0 29.0 NaN 26.6 0.351 31 1
2 183.0 64.0 NaN NaN 23.3 0.672 32 8
3 89.0 66.0 23.0 94.0 28.1 0.167 21 1
4 137.0 40.0 35.0 168.0 43.1 2.288 33 0

Before imputing the data, we:

  1. Split the data into a train-test split.

  2. Scale the training data and the testing data separately.

The reason for splitting the data first, then scaling, and only then applying imputation is to avoid data leakage between the training and test sets. Data leakage can bias our model, leading to incorrect results and misleadingly optimistic evaluation scores.

The training set and the test set are then imputed with the KNNImputer using 5 neighbours: the imputer is fit on the training set and then applied to both the training and test sets. Each sample’s missing values are imputed using the mean value from the n_neighbors nearest neighbours found in the training set. Two samples are close if the features that neither is missing are close.
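To make the mechanism concrete, here is a minimal sketch on a toy matrix (the values are made up purely for illustration):

Code
import numpy as np
from sklearn.impute import KNNImputer

# toy matrix: the middle row is missing its second feature
toy = np.array([[ 1.0,  2.0],
                [ 1.1, np.nan],
                [10.0, 20.0]])
# the NaN is filled with the mean of feature 2 over the n_neighbors
# nearest complete rows; here both complete rows are used: (2 + 20) / 2 = 11
print(KNNImputer(n_neighbors=2).fit_transform(toy))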

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df_diabetes.to_numpy()
# 80-20 Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) 

scaling_x=StandardScaler()
X_train=scaling_x.fit_transform(X_train)
X_test=scaling_x.transform(X_test)

# Imputing missing values using knn
# knn imputation transform for the dataset

from sklearn.impute import KNNImputer

# print total missing
print('Missing: %d' % sum(np.isnan(X).flatten()))
# define imputer
imputer = KNNImputer(n_neighbors=5) # taking 5 neighbours
# fit transform on the dataset for training and testing set
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
# print total missing
X_trans = np.concatenate((X_train_imputed, X_test_imputed), axis=0)
print('Missing: %d' % sum(np.isnan(X_trans).flatten()))
Missing: 652
Missing: 0

We can see that all values have been standardized and there are no missing values left in the dataset.

Code
df_diabetes_cleaned = pd.DataFrame(X_trans, columns = columns)
df_diabetes_cleaned.head(5)
Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Pregnancies
0 0.757757 0.607781 1.665478 -0.227452 0.831683 0.529526 0.567932 1.516591
1 0.233976 -0.808447 0.698824 0.513708 1.313142 -0.069689 0.398450 1.812018
2 -0.649904 0.135705 1.085486 -0.433789 0.729555 -0.794249 0.991638 0.925736
3 -0.060651 0.450422 -0.905821 1.088148 -1.050384 -0.167519 2.601722 1.221164
4 -0.060651 0.293064 0.795490 -0.433789 1.094297 -0.760619 -0.364222 -0.551400

We can now begin classification and compare various popular classification models such as Support Vector Machines (SVM), Decision Trees (DT), Random Forest (RF), and XGBoost (XGB). We also explore some hyperparameter tuning methods and compare all the models on their performance on this dataset. Hyperparameter tuning relies more on experimental results than theory, and thus the best way to determine the optimal settings is to try many different combinations and evaluate the performance of each model.

The evaluation metrics are:

  1. Accuracy: Accuracy is a measure of overall correctness and is calculated as the ratio of correctly predicted instances to the total number of instances in the dataset. \(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\)

  2. Precision: Precision is the ratio of correctly predicted positive instances to the total predicted positive instances. It measures the accuracy of positive predictions. \(\text{Precision} = \frac{TP}{TP + FP}\)

  3. Recall: Recall is the ratio of correctly predicted positive instances to the total actual positive instances. It measures the model’s ability to capture all positive instances. \(\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}\)

  4. F1 score: The F1 score is the harmonic mean of precision and recall. It provides a balance between precision and recall, especially useful when the class distribution is imbalanced. \(F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)

where TP and TN denote the samples correctly labeled as the positive or negative class, FP denotes the samples falsely labeled as positive, and FN denotes the samples falsely labeled as negative.
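As a quick worked example with a hypothetical confusion matrix (not results from this dataset), suppose TP = 40, TN = 80, FP = 10, FN = 20:

Code
# worked example on a hypothetical confusion matrix
TP, TN, FP, FN = 40, 80, 10, 20
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 120 / 150 = 0.800
precision = TP / (TP + FP)                          # 40 / 50   = 0.800
recall = TP / (TP + FN)                             # 40 / 60   ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.727
print(accuracy, precision, recall, f1)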

Code
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from pprint import pprint
best_preds = []
model_names = []

Support Vector Machine Classification

Support Vector Machines (SVM) is a supervised machine learning algorithm used for classification and regression tasks. SVM works by finding a hyperplane in a high-dimensional space that best separates the data points of different classes. The “support vectors” are the data points closest to the decision boundary, and the margin is the distance between the support vectors and the decision boundary. SVM aims to maximize this margin, providing robust generalization to unseen data. SVM is fundamentally a linear classifier, but it can handle non-linear relationships through the use of kernel functions, which implicitly map the input data into a higher-dimensional space. SVM is particularly effective in high-dimensional spaces and is widely used in various applications, including image classification and text categorization. In the code, we make use of the scikit-learn library for the SVM implementation. The parameter information can be found on the implementation page.
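As a quick illustration of why kernels matter, consider a toy dataset of concentric circles (built with scikit-learn's make_circles, separate from our diabetes data): a linear kernel cannot separate the classes, while the RBF kernel can.

Code
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# two concentric circles: not linearly separable in the original 2-D space
Xc, yc = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=0)
for kernel in ('linear', 'rbf'):
    score = SVC(kernel=kernel).fit(Xc_train, yc_train).score(Xc_test, yc_test)
    print(f"{kernel}: test accuracy = {score:.2f}")  # rbf near 1.0, linear near 0.5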

from sklearn.svm import SVC

model_names.append('Support Vector Machine')

# Create an SVM model
svm_model = SVC()

print("Current params:")
pprint(svm_model.get_params())

svm_model.fit(X_train_imputed, y_train)

y_pred_best = svm_model.predict(X_test_imputed)

best_preds.append([accuracy_score(y_test, y_pred_best), precision_score(y_test, y_pred_best), recall_score(y_test, y_pred_best), f1_score(y_test, y_pred_best)])
Current params:
{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

SVM classifier
Accuracy: 0.7987012987012987
F1 score: 0.6666666666666666

Decision Tree Classification

Decision Trees are a non-linear, hierarchical model that partitions the input space into regions and assigns a class label or regression value to each region. The tree structure is built by recursively splitting the data based on the most informative features at each node. A common splitting technique is to use an impurity measure (such as Gini impurity or entropy) to decide whether and how a node should be split. Decision Trees are interpretable, easy to visualize, and capable of handling both categorical and numerical features. However, they are prone to overfitting, especially when deep trees are constructed. Techniques like pruning and limiting tree depth help mitigate overfitting and improve generalization. In the code, we make use of the scikit-learn library for the Decision Tree implementation. The parameter information can be found on the implementation page.
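For intuition, here is a minimal sketch of the Gini impurity computation a tree can use to score a candidate split (a simplified illustration, not scikit-learn's internal implementation):

Code
import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1])
print(gini(parent))                     # 0.5: a 50/50 node is maximally impure
# a split is scored by the weighted impurity of its children;
# this perfect split drives the weighted impurity down to 0
left, right = parent[:3], parent[3:]
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(weighted)                         # 0.0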

Grid-Search Hyperparameter tuning

GridSearchCV is a method that, instead of sampling randomly from a distribution, evaluates all combinations of the parameters we define. In the code, we make use of the scikit-learn library for the GridSearchCV implementation. The parameter information can be found on the implementation page.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

model_names.append('Decision Tree')

dt = DecisionTreeClassifier()

print("Current params:")
pprint(dt.get_params())

dt.fit(X_train_imputed, y_train)

params = {
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': range(1, 5),
    'max_features': ['sqrt', 'log2', None],  # 'auto' was deprecated and later removed in scikit-learn
    'criterion': ['gini', 'entropy'],
}

grid_search_dt = GridSearchCV(dt, params, cv=3, scoring='accuracy')

# Fit the model to the data and perform hyperparameter tuning
grid_search_dt.fit(X_train_imputed, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:")
pprint(grid_search_dt.best_params_)

# Get the best model
best_model_dt = grid_search_dt.best_estimator_

y_pred = dt.predict(X_test_imputed)
y_pred_best = best_model_dt.predict(X_test_imputed)

best_preds.append([accuracy_score(y_test, y_pred_best), precision_score(y_test, y_pred_best), recall_score(y_test, y_pred_best), f1_score(y_test, y_pred_best)])
Current params:
{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}
Best Hyperparameters:
{'criterion': 'gini',
 'max_depth': 5,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 10}

DT without hyperparameter tuning
Accuracy: 0.7207792207792207
F1 score: 0.6055045871559633

DT with hyperparameter tuning
Accuracy: 0.7402597402597403
F1 score: 0.6363636363636364

Random Forest Classification

Random Forest is an ensemble learning method that builds a multitude of decision trees during training and outputs the mode of the classes for classification tasks or the average prediction for regression tasks. Each tree in the forest is constructed using a random subset of the training data and a random subset of features. The randomness and diversity among trees help mitigate overfitting and improve overall model accuracy. Random Forest is known for its robustness, versatility, and effectiveness in handling high-dimensional data. In the code, we make use of the scikit-learn library for the Random Forest implementation. The parameter information can be found on the implementation page.
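To illustrate the voting mechanism, here is a hand-rolled sketch of bagging with a majority vote (a simplified illustration of the idea, not how scikit-learn implements RandomForestClassifier internally):

Code
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_vote_predict(X_tr, y_tr, X_te, n_trees=25):
    y_tr = np.asarray(y_tr)
    rng = np.random.default_rng(42)
    preds = []
    for t in range(n_trees):
        # each tree sees a bootstrap sample of the rows...
        idx = rng.integers(0, len(X_tr), size=len(X_tr))
        # ...and considers a random subset of features at every split
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=t)
        preds.append(tree.fit(X_tr[idx], y_tr[idx]).predict(X_te))
    # the forest's prediction is the mode (majority vote) of the per-tree predictions
    return (np.mean(preds, axis=0) >= 0.5).astype(int)

# e.g. bagged_vote_predict(X_train_imputed, y_train, X_test_imputed)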

RandomSearch Hyperparameter tuning

RandomizedSearchCV is a method that samples randomly from distributions and evaluates the randomly chosen parameter combinations we define. In contrast to GridSearchCV, not all parameter values are tried out; rather, a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter. In the code, we make use of the scikit-learn library for the RandomizedSearchCV implementation. The parameter information can be found on the implementation page.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

model_names.append('Random Forest')

rf = RandomForestClassifier()
print("Current params:")
pprint(rf.get_params())

rf.fit(X_train_imputed, y_train)

max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Create the random grid
random_grid = {'n_estimators': [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
               'max_features': ['sqrt'],  # 'auto' (an alias for 'sqrt' in classifiers) was removed in newer scikit-learn
               'max_depth': max_depth,
               'min_samples_split': [2, 5, 10],
               'min_samples_leaf': [1, 2, 4],
               'bootstrap': [True, False]}

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train_imputed, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:")
pprint(rf_random.best_params_)

# Get the best model
best_model_rf = rf_random.best_estimator_

y_pred = rf.predict(X_test_imputed)
y_pred_best = best_model_rf.predict(X_test_imputed)

best_preds.append([accuracy_score(y_test, y_pred_best), precision_score(y_test, y_pred_best), recall_score(y_test, y_pred_best), f1_score(y_test, y_pred_best)])
Current params:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best Hyperparameters:
{'bootstrap': False,
 'max_depth': 30,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 10,
 'n_estimators': 800}

RF without hyperparameter tuning
Accuracy: 0.8311688311688312
F1 score: 0.7400000000000001

RF with hyperparameter tuning
Accuracy: 0.8051948051948052
F1 score: 0.7058823529411765

XGBoost

XGBoost, or Extreme Gradient Boosting, is a machine learning algorithm renowned for its efficiency and performance in predictive modeling tasks. It belongs to the ensemble learning family and is an extension of traditional gradient boosting methods. The core idea behind XGBoost is the sequential addition of weak learners, often decision trees, which are trained to correct the errors made by preceding models. XGBoost introduces several key innovations, including regularization techniques to prevent overfitting, parallelized tree construction for faster training, and gradient-based optimization for rapid convergence. At its core, XGBoost structures its ensemble as a collection of decision trees, each contributing to the final prediction; gradient boosting thus takes the form of an ensemble of weak prediction models. In each iteration, new trees are fit to the gradients of the loss with respect to the current predictions, so that subsequent trees focus on the samples the ensemble currently predicts poorly. During training, XGBoost uses this gradient-based optimization to efficiently navigate the solution space and arrive at an ensemble of trees that collectively delivers a robust and accurate prediction.
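To make the "correct the errors of preceding models" idea concrete, here is a minimal sketch of gradient boosting for squared-error regression, where the negative gradient is simply the residual (XGBoost adds regularization, second-order gradients, and many other refinements on top of this idea):

Code
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

n_rounds, lr = 50, 0.1
pred = np.full_like(y_toy, y_toy.mean())   # start from a constant model
trees = []
for _ in range(n_rounds):
    residual = y_toy - pred                # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    pred += lr * tree.predict(X_toy)       # each weak learner corrects the ensemble
    trees.append(tree)
print(f"training MSE after boosting: {np.mean((y_toy - pred) ** 2):.4f}")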

In the code, we make use of the XGBoost library for the XGBoost implementation. The parameter information can be found on the implementation page.

BayesSearch Hyperparameter tuning

Bayesian optimization is a more sophisticated technique that uses Bayesian methods to model the underlying function that maps hyperparameters to model performance. It tries to find the optimal set of hyperparameters by making smart guesses based on previous results. Bayesian optimization is more efficient than grid or random search because it attempts to balance exploration and exploitation of the search space. It can also handle cases with a large number of hyperparameters and a large search space. However, it can be more difficult to implement than grid search or random search and may require more computational resources.

In the code, we make use of the skopt library for the BayesSearchCV implementation. The parameter information can be found on the implementation page.

from xgboost import XGBClassifier
from skopt import BayesSearchCV

model_names.append('XGBoost')

# Create an XGBoost classifier
xgb = XGBClassifier()

print("Current params:")
pprint(xgb.get_params())

xgb.fit(X_train_imputed, y_train)

# Define the parameter search space
param_space = {
    'max_depth': (3, 10),
    'learning_rate': (0.01, 1.0, 'log-uniform'),
    'n_estimators': (50, 200),
    'min_child_weight': (1, 10),
    'subsample': (0.1, 1.0, 'uniform'),
    'gamma': (0.0, 1.0, 'uniform'),
    'colsample_bytree': (0.1, 1.0, 'uniform'),
}

# Instantiate BayesSearchCV
bayes_search_xgb = BayesSearchCV(
    xgb,
    param_space,
    cv=3,  # Number of cross-validation folds
)

np.int = np.int_  # compatibility shim: some skopt versions reference the deprecated np.int alias, which was removed in NumPy 1.24
# Fit the model to the training data and perform hyperparameter tuning
bayes_search_xgb.fit(X_train_imputed, y_train)

# Print the best hyperparameters
print("Best Hyperparameters:")
pprint(bayes_search_xgb.best_params_)

# Get the best model
best_model_xgb = bayes_search_xgb.best_estimator_


y_pred = xgb.predict(X_test_imputed)
y_pred_best = best_model_xgb.predict(X_test_imputed)

best_preds.append([accuracy_score(y_test, y_pred_best), precision_score(y_test, y_pred_best), recall_score(y_test, y_pred_best), f1_score(y_test, y_pred_best)])
Current params:
{'base_score': None,
 'booster': None,
 'callbacks': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'device': None,
 'early_stopping_rounds': None,
 'enable_categorical': False,
 'eval_metric': None,
 'feature_types': None,
 'gamma': None,
 'grow_policy': None,
 'importance_type': None,
 'interaction_constraints': None,
 'learning_rate': None,
 'max_bin': None,
 'max_cat_threshold': None,
 'max_cat_to_onehot': None,
 'max_delta_step': None,
 'max_depth': None,
 'max_leaves': None,
 'min_child_weight': None,
 'missing': nan,
 'monotone_constraints': None,
 'multi_strategy': None,
 'n_estimators': None,
 'n_jobs': None,
 'num_parallel_tree': None,
 'objective': 'binary:logistic',
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'sampling_method': None,
 'scale_pos_weight': None,
 'subsample': None,
 'tree_method': None,
 'validate_parameters': None,
 'verbosity': None}
Best Hyperparameters:
OrderedDict([('colsample_bytree', 0.16330485291293845),
             ('gamma', 0.5998228910473469),
             ('learning_rate', 0.31016606360093674),
             ('max_depth', 8),
             ('min_child_weight', 2),
             ('n_estimators', 87),
             ('subsample', 0.9281642866051433)])

XGB without hyperparameter tuning
Accuracy: 0.7727272727272727
F1 score: 0.6728971962616823

XGB with hyperparameter tuning
Accuracy: 0.7272727272727273
F1 score: 0.625

Analyzing the results of all the chosen models, we get the table below:

Code
# tabulate their classification report
evaluation_metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
plt.rcParams["figure.figsize"] = [30, 7]
plt.rcParams["figure.autolayout"] = True
fig, axs = plt.subplots(1, 1)
axs.axis('tight')
axs.axis('off')

table1 = axs.table(cellText=best_preds,
                      cellLoc = 'left',
                      rowLabels = model_names,
                      rowColours= ["palegreen"] * 10,
                      colLabels=evaluation_metrics,
                      colColours= ["palegreen"] * 10,
                      loc='center')

# Highlight the cell with the maximum value in each column
for col_idx, metric in enumerate(evaluation_metrics):
    col_values = [row[col_idx] for row in best_preds]
    max_value_idx = col_values.index(max(col_values))

    # Highlight the cell with maximum value in coral color
    table1[max_value_idx + 1, col_idx].set_facecolor("coral")
        
table1.auto_set_font_size(False)
table1.set_fontsize(14)
table1.scale(1, 4)
fig.tight_layout()
plt.show()

This blog only discusses a few classification algorithms and model tuning techniques. By choosing the right model, tuning its hyperparameters, and regularizing it appropriately, we can aim to improve the accuracy of the model.