
Model Evaluation Metrics

1. Common Model Classification Evaluation Metrics#

Accuracy#

Accuracy: The percentage of correct predictions out of the total samples.

code

from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]

accuracy_score(y_true, y_pred)  # 2 of the 4 predictions are correct -> 0.5

Disadvantages
Accuracy can fail when samples are imbalanced. For example, in predicting whether users browsing a shopping website will make a purchase, if there are 100 browsing users but only 1 will buy, the model could predict that no one will buy, resulting in an accuracy of 99%.
1️⃣ Handling Sample Imbalance: Resampling, undersampling, oversampling, etc.

2️⃣ Switching to Appropriate Metrics: the F1 score, which considers not only the number of incorrect predictions but also the types of errors (see the sketch below).
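For a concrete picture of the shopping-site example above (100 visitors, only 1 buyer, as in the text), here is a minimal sketch: a model that always predicts "no purchase" reaches 99% accuracy while its F1 score is 0.

code

from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 100 visitors, only one of them buys (label 1)
y_true = [1] + [0] * 99
# A lazy "model" that always predicts "no purchase"
y_pred = [0] * 100

accuracy_score(y_true, y_pred)  # 0.99, looks impressive
f1_score(y_true, y_pred)        # 0.0, exposes that no buyer is ever found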

Confusion Matrix#

Correct predictions lie on the diagonal; everything off the diagonal is a misclassification.
code

import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
clf = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=1)
clf.fit(X, y)
clf.score(X, y)  # in-sample accuracy, just a sanity check
pred = clf.predict(X)

# normalize=True plots per-class proportions instead of raw counts
skplt.metrics.plot_confusion_matrix(y, pred, normalize=True)
plt.show()

[Figure: normalized confusion matrix for the digits classifier]

Metrics Derived from the Binary Confusion Matrix#

[Figure: binary confusion matrix with TP, FP, FN, TN cells]

  • True Positive (TP): Positive samples predicted as positive by the model;

  • False Positive (FP): Negative samples predicted as positive by the model;

  • False Negative (FN): Positive samples predicted as negative by the model;

  • True Negative (TN): Negative samples predicted as negative by the model;

  • Precision = TP/(TP+FP)

  • Recall = TP/(TP+FN)

  • F1 Score = 2*(P * R)/(P+R)
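To make these definitions concrete, here is a minimal sketch (reusing the toy labels from the code examples below) that reads TN, FP, FN, TP off sklearn's confusion_matrix and plugs them into the formulas by hand:

code

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # 3, 0, 5, 2

precision = tp / (tp + fp)                           # 2/2 = 1.0
recall = tp / (tp + fn)                              # 2/7 ≈ 0.2857
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.4444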

Precision#

Precision = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}

[Figure: precision highlighted on the confusion matrix]

The percentage of samples predicted as 1 that are actually 1.
Case:

  • When predicting stocks, we care more about precision, meaning among the stocks we predict will rise, how many actually do, because those are the stocks we invest in.

  • For predicting criminals, we want the predictions to be very accurate; even if some real criminals are let go, we cannot wrongly accuse an innocent person.

code

from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# None: Precision for each class (not averaged)
precision_score(y_true, y_pred, average=None)
# [0.375 1.   ] 

# 'macro': Average precision for each class (not weighted)
precision_score(y_true, y_pred, average='macro')
# (0.375 + 1.)/2 = 0.6875

# 'weighted': Weighted average precision based on the number of samples in each class
precision_score(y_true, y_pred, average='weighted')
# 0.375*0.3+1*0.7 = 0.8125

# 'micro': Overall precision for all samples
precision_score(y_true, y_pred, average='micro')
# Equals Accuracy 0.5

accuracy_score(y_true, y_pred) 
# 0.5

Recall#

Recall = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}

[Figure: recall highlighted on the confusion matrix]

The proportion of actual 1 (positive) samples that the model manages to recall; also known as sensitivity.

Case:

  • If there are 10 earthquakes, we would prefer to issue 1000 alerts to cover all 10 earthquakes (in this case, recall is 100%, precision is 1%), rather than issue 100 alerts where 8 earthquakes are predicted but 2 are missed (in this case, recall is 80%, precision is 8%).

  • In the context of predicting patients, we focus more on recall, meaning we want to minimize the number of actual patients we incorrectly predict, as failing to detect a real patient can have serious consequences; the previous naive algorithm had a recall of 0.

code

from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

recall_score(y_true, y_pred, average=None)  
# All 3 zeros were recalled; 2 of the 7 ones were recalled: [1.         0.28571429]

recall_score(y_true, y_pred, average='macro')  
# (1. + 0.28571429)/2 = 0.6428571428571428

recall_score(y_true, y_pred, average='weighted')  
# 1*0.3+0.28571429*0.7 = 0.5

recall_score(y_true, y_pred, average='micro')  
# Equals Accuracy =0.5
accuracy_score(y_true, y_pred)
# 0.5

Why Precision and Recall are Contradictory#

1️⃣ If you want higher recall, the model needs to cover more samples, but this increases the likelihood of making mistakes, meaning precision will be lower.

2️⃣ If the model is conservative and only detects samples it is very certain about, precision will be high, but recall will be relatively low.
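One way to see this trade-off is to sweep the decision threshold of a probabilistic classifier; the dataset and model below (sklearn's breast-cancer data with logistic regression) are just an illustrative choice, not from the original text:

code

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Lowering the threshold covers more positives (recall up) but admits more mistakes (precision down)
for threshold in (0.9, 0.5, 0.1):
    pred = (proba >= threshold).astype(int)
    print(threshold,
          round(precision_score(y, pred), 3),
          round(recall_score(y, pred), 3))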

F1 Score#

\frac{1}{F1}=\frac{1}{2}\left(\frac{1}{\text{precision}}+\frac{1}{\text{recall}}\right)
F1 = \frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}

Harmonic Mean#

  • What is the harmonic mean?
H=\frac{n}{\frac{1}{x_{1}}+\frac{1}{x_{2}}+\ldots+\frac{1}{x_{n}}}

🔴 Because it is calculated based on the reciprocals of the variables, it is also known as the reciprocal mean.

  • For a journey of 2 kilometers, with a speed of 20 km/h for the first kilometer and 10 km/h for the second kilometer, what is the average speed?

    Simple Average:
    (20 + 10)/2 = 15 km/h

    Time-Weighted Average:

    Total time = 1/20 + 1/10 = 0.15 h

    Time for the first kilometer: 1/20 = 0.05 h, weight ≈ 33%

    Time for the second kilometer: 1/10 = 0.1 h, weight ≈ 67%

    Time-weighted average speed = 20 × 33% + 10 × 67% ≈ 13.33 km/h

    Harmonic Mean

\text{Average Speed} = \frac{\text{Total Distance}}{\text{Total Time}}
\text{Average Speed} = \frac{2}{\frac{1}{20}+\frac{1}{10}} \approx 13.33
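A quick check of the numbers above with Python's statistics module:

code

from statistics import mean, harmonic_mean

speeds = [20, 10]      # km/h, one value for each 1 km leg

mean(speeds)           # 15.0, simple (arithmetic) average
harmonic_mean(speeds)  # 13.33, matches total distance / total time
2 / (1/20 + 1/10)      # 13.33, the same thing written out by hand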

Why Use Harmonic Mean for F1?#

  • If using simple average, P=0.8, R=0.8 and P=0.7, R=0.9 would both yield an arithmetic average of 0.8, suggesting that precision and recall are interchangeable.
  • The harmonic mean adds a penalty mechanism: higher values receive lower weights (in the speed example above, the 20 km/h leg gets a weight of only about 33%).
  • This avoids the inflated result the arithmetic mean gives when one value is high and the other is low (for instance, with P = 1.0 and R = 0.1, the arithmetic mean is about 0.5 while the harmonic mean is about 0.2).
  • The idea behind the F1 score is that an algorithm with balanced precision and recall is more reliable than one that excels in one metric while failing in the other.
  • In summary: both metrics need to be good for the score to be good.
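Plugging the precision/recall pairs from the bullets above into both means makes the penalty visible:

code

def arithmetic_mean(p, r):
    return (p + r) / 2

def f1(p, r):  # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

for p, r in [(0.8, 0.8), (0.7, 0.9), (1.0, 0.1)]:
    print(p, r, arithmetic_mean(p, r), round(f1(p, r), 3))
# (0.8, 0.8): arithmetic 0.80, F1 0.8   (balanced, no penalty)
# (0.7, 0.9): arithmetic 0.80, F1 0.787 (mildly penalized)
# (1.0, 0.1): arithmetic 0.55, F1 0.182 (heavily penalized)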

code

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

f1_score(y_true, y_pred, average=None)
# [0.54545455 0.44444444]

f1_score(y_true, y_pred, average='macro')  
# (0.54545455+0.44444444)/2 = 0.4949494949494949

f1_score(y_true, y_pred, average='weighted')  
# 0.54545455*0.3+0.44444444*0.7 = 0.47474747474747475

f1_score(y_true, y_pred, average='micro')  
# Equals Accuracy 0.5
accuracy_score(y_true, y_pred)
# 0.5