Bias and Variance — digging deeper

Dr. Saptarsi Goswami
Published in Geek Culture
Aug 23, 2021 · 4 min read


The two animals to tame

Photo source: https://unsplash.com/photos/8RXmc8pLX_I

Like human beings, machine learning models make errors. We need to work to minimize that error, but before doing so, we need to become more familiar with it. The following figure summarizes the different components of error.

Fig 1: Components of Error ( Image Source: Author)
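For completeness (this is the standard decomposition for squared-error loss, not spelled out in the figure), the expected error at a point splits into a reducible part and an irreducible part:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

Here $\sigma^2$ is the noise in the data itself, which no model can remove.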

The reducible components, i.e. bias and variance, are what we are interested in. Machine learning models are very similar to students: they can adopt unfair means (memorizing, keeping a cheat sheet) to excel. You already know this tendency is called overfitting, and the more complex a model, the greater its tendency to overfit. To keep a handle on the situation, we don't show the model the entire data to memorize; rather, we keep a portion aside to test its understanding.
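As a quick illustration of that memorizing tendency (a minimal sketch on synthetic data, not part of the article's experiment), an unconstrained decision tree typically scores near-perfectly on the data it has seen and noticeably worse on held-out data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, purely for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("train accuracy:", deep_tree.score(X_tr, y_tr))  # usually 1.0 (memorized)
print("test accuracy:", deep_tree.score(X_te, y_te))   # noticeably lower on unseen data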

Now there can be two types of students.

  • Type 1: they are consistently off the mark in their predictions, but within a small range.
  • Type 2: they are not far off the mark on average, but sometimes a little to the left, sometimes a little to the right, sometimes up, sometimes down. In other words, the error is somewhat random.

The kind of error made by Type 1 is called bias, and the kind made by Type 2 is called variance. Let's develop this intuition a little further with a decision tree.
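A tiny simulation (not part of the original post; the numbers are made up purely for illustration) makes the distinction concrete: the Type 1 student is consistently off by roughly the same amount, while the Type 2 student is right on average but scattered.

import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0

# Type 1 student: always off by roughly 5, within a tight range (high bias, low variance)
type1_preds = true_value - 5 + rng.normal(0, 0.5, size=1000)

# Type 2 student: correct on average, but scattered around the target (low bias, high variance)
type2_preds = true_value + rng.normal(0, 5, size=1000)

for name, preds in [("Type 1", type1_preds), ("Type 2", type2_preds)]:
    errors = preds - true_value
    print(f"{name}: mean error (bias) = {errors.mean():+.2f}, spread of error (std) = {errors.std():.2f}")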

Step 1: Setting up the experiment

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

sonar = pd.read_csv("/kaggle/input/sonar.csv")
sonar.head(5)

# Separate features and label (assumed: the class label is the last column)
X = sonar.iloc[:, :-1]
y = sonar.iloc[:, -1]

Sonar is a dataset used for binary classification. It has 60 independent attributes, essentially capturing characteristics of sonar signals. The two classes are ‘R’ (rock) and ‘M’ (mine, i.e., a metal cylinder). You can read more about the dataset here.
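As a quick sanity check (an addition, not in the original post), the shape and class balance can be inspected; this assumes the label sits in the last column, as in the setup above:

print(sonar.shape)        # roughly (208, 61) for the standard Sonar data
print(y.value_counts())   # the two classes are fairly balanced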

Step 2: We are going to create different train-test pairs with different seed values. For reproducibility, we create an array of 200 different seed values.

sds=np.array([ 7201, 65639, 79429, 97556, 83446,  4040, 19136, 60465,  4508,85159, 47490, 78933, 38786, 71735, 40947, 93921, 23817, 49387,29389, 77211,  9631, 51530, 42189, 39644, 26031, 92416, 43836,
14515, 98603, 89179, 36509, 6660, 1014, 63248, 71190, 85926, 30030, 39332, 60098, 75754, 2610, 21638, 75559, 90547, 88348, 63420, 61094, 47382, 62301, 2633, 45287, 4219, 46123, 84365, 35592, 94491, 21469, 85004, 35004, 99191, 34292, 70491, 76716,
30571, 80030, 9096, 94893, 99482, 24740, 17996, 6926, 29095,52196, 99592, 73362, 78224, 4137, 28615, 43236, 90798, 27307,64625, 29798, 9837, 12678, 66675, 458, 55317, 371, 70754,7143, 97278, 24314, 4906, 73279, 92083, 22524, 11067, 40521, 29496, 79254, 72593, 26576, 83370, 9189, 46891, 41156, 68492, 19550, 50102, 99913, 41639, 7876, 40158, 29276, 27296, 51437, 91329, 90905, 9766, 91473, 35803, 56797, 91852, 11818, 82792,11675, 1815, 88774, 94928, 66445, 27668, 97092, 96118, 44631, 82373, 24251, 84652, 70962, 82970, 13656, 93935, 25024, 65802,37281, 23082, 47794, 2920, 46094, 98047, 77605, 6782, 8969,86184, 8220, 58567, 42360, 40531, 89866, 85664, 47358, 89345,61521, 21490, 28711, 35516, 73557, 75779, 5745, 85077, 90047,
3622, 98625, 57135, 22317, 57947, 45689, 30491, 43600, 63679,
64541, 19675, 77626, 35308, 28497, 35008, 85613, 15440, 52657,
6497, 47109, 41854, 4720, 64677, 86208, 88470, 63428, 2000,7064, 51788])
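Hard-coding the seeds keeps the runs reproducible. Equivalently (a sketch, not the article's code, and it will not reproduce the hand-picked values above), a 200-element seed array could be generated from a single master seed:

rng = np.random.default_rng(2021)                  # any fixed master seed
sds = rng.integers(low=0, high=100_000, size=200)  # 200 reproducible seed values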

Step 3: We are going to create different variants of the decision tree, with varying complexity:

  • Tree 1: A regular (unconstrained) tree
  • Tree 2: A tree with max depth = 1
  • Tree 3: A tree with max depth = 1 and the minimum number of samples to split set to 100
  • Tree 4: A tree with the minimum number of samples in a leaf set to 10

acc1 = np.empty(200)
acc2 = np.empty(200)
acc3 = np.empty(200)
acc4 = np.empty(200)
i = 0
for k in np.nditer(sds):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=int(k))
    # Tree 1: a regular, fully grown tree
    tree_clf = DecisionTreeClassifier(random_state=42)
    tree_clf.fit(X_train, y_train)
    y_pred_tree = tree_clf.predict(X_test)
    acc1[i] = accuracy_score(y_test, y_pred_tree)
    # Tree 2: max depth = 1
    tree_clf1 = DecisionTreeClassifier(random_state=42, max_depth=1)
    tree_clf1.fit(X_train, y_train)
    y_pred_tree = tree_clf1.predict(X_test)
    acc2[i] = accuracy_score(y_test, y_pred_tree)
    # Tree 3: max depth = 1 and minimum samples to split = 100
    tree_clf2 = DecisionTreeClassifier(random_state=42, max_depth=1, min_samples_split=100)
    tree_clf2.fit(X_train, y_train)
    y_pred_tree = tree_clf2.predict(X_test)
    acc3[i] = accuracy_score(y_test, y_pred_tree)
    # Tree 4: minimum number of samples in a leaf = 10
    tree_clf3 = DecisionTreeClassifier(random_state=42, min_samples_leaf=10)
    tree_clf3.fit(X_train, y_train)
    y_pred_tree = tree_clf3.predict(X_test)
    acc4[i] = accuracy_score(y_test, y_pred_tree)
    i = i + 1

Step 4: Validating the result

Before validating, let's build some intuition about what to expect. Each of the four models produces an array of accuracies over the different test sets, and each array has 200 values, one per seed. How can we visualize them? A box plot is a natural choice. If a model is simple, it will have high bias, that is, low classification accuracy, and low variance, i.e. a smaller spread of the box. If it is complex, the classification accuracy will be higher, but the spread of the corresponding box will be larger.

import seaborn as sns

# One box per model: the median reflects accuracy (bias), the spread of the box reflects variance
dftree = pd.DataFrame({'Regular': acc1, 'Depth 1': acc2, 'Depth 1, Min Split': acc3, 'Min sample leaf': acc4})
sns.boxplot(data=dftree)
plt.show()
Figure 2: Comparing bias and variance using decision trees (Image Source: Author)

Conclusion:

We can clearly see that the regular tree has the highest median accuracy as well as the widest box compared to the other three. In other words, the unconstrained tree has the lowest bias but the highest variance, while the simpler, constrained trees trade some accuracy (more bias) for a tighter spread (less variance), which is exactly the bias-variance trade-off we set out to observe.
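To back up the visual impression with numbers (a small addition, not in the original post), the per-model summary statistics can be printed from the same DataFrame:

# Median accuracy is a proxy for bias; the standard deviation is a proxy for variance
print(dftree.median())
print(dftree.std())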


Dr. Saptarsi Goswami

Asst Prof — CS Bangabasi Morning Clg, Lead Researcher University of Calcutta Data Science Lab, ODSC Kolkata Chapter Lead