Classification : Evaluation Metrics

Rekha V S
4 min read · Nov 22, 2020

Model building for the problem is done, so what's next? How do we know our model generalizes? How accurate is our classification model? Is our model performing well? Let's find answers to these questions.

We can measure our model's performance with the help of classification metrics like:

  1. ACCURACY
  2. RECALL OR SENSITIVITY
  3. PRECISION
  4. F1 SCORE
  5. ROC
  6. AUC

Before moving further, let's recall the CONFUSION MATRIX.

If both the actual and the predicted class are positive, the case is a TRUE POSITIVE (TP), and if both the predicted and the actual class are negative, it is a TRUE NEGATIVE (TN). If the prediction is negative but the actual class is positive, it is classified as a FALSE NEGATIVE (FN); similarly, if the prediction is positive but the case actually belongs to the negative class, it is said to be a FALSE POSITIVE (FP).
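
As a minimal sketch (using scikit-learn and made-up labels, not data from this post), we can pull all four cells out of a confusion matrix like this:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels purely for illustration (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# For binary labels {0, 1}, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=3  TN=3  FP=1  FN=1
```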

  1. ACCURACY

It is defined as the total number of correct classifications divided by the total number of classifications.

ACCURACY = (TP+TN)/(TP+TN+FP+FN)

We should not always rely on accuracy by itself. Consider a case where there are 100 patients, of whom 60 have cancer and 40 are cancer-free. Suppose our model classifies 20 patients as TP (have cancer and predicted as cancer), misses the remaining 40 cancer patients as FN, and classifies all 40 cancer-free patients as TN (correctly identified cancer-free patients).

Here the accuracy [(20+40)/100] is 0.6, i.e. the model correctly classifies 60% of cases. But out of the 60 cancer patients it correctly identifies only 20 and labels the other 40 as cancer-free, which is very critical. Because accuracy lumps all correct predictions together, a model that misses most positive cases can still score deceptively well. That is the reason we should not rely completely on accuracy.
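
Here is that arithmetic as a short snippet, using the counts from the example above:

```python
# Counts from the cancer example: 60 patients with cancer, 40 without
tp = 20  # cancer patients correctly predicted as cancer
fn = 40  # cancer patients the model missed
tn = 40  # cancer-free patients correctly cleared
fp = 0   # no cancer-free patient was flagged as cancer

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.6 -> "60% accurate" even though most cancer cases are missed
```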

2. RECALL

RECALL = TP/(TP+FN)

This metric says how many of the positive cases the model is able to recall, i.e. it measures, out of the total number of actual positives, how many our model correctly predicts as positive.

recall = 20/(20+40) = 0.33. Here our recall value is 33%, which says that out of all the cancer patients, we could correctly classify only 33% of them.
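
With the same counts:

```python
tp, fn = 20, 40  # counts from the cancer example
recall = tp / (tp + fn)
print(round(recall, 2))  # 0.33 -> only a third of the cancer patients are caught
```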

3. PRECISION

It measures, among all positive predictions, how many are actually positive.

PRECISION = TP/(TP+FP)

precision = 20/(20+0) = 1. This says that every patient our model predicted as cancer-prone actually has cancer; the model never flags a cancer-free patient as cancer-prone (FP = 0).
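
Again with the counts from our example:

```python
tp, fp = 20, 0  # counts from the cancer example
precision = tp / (tp + fp)
print(precision)  # 1.0 -> every "cancer" prediction was correct
```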

COMPARISON

Our model produced an accuracy of 60%, a precision of 100%, and a recall of 33%. Oh my god!!! Which one to choose?

Here we need to find the maximum number of people with cancer; if our model fails to classify a positive person correctly, it will lead to disaster. Hence, in this case we should consider RECALL, which tells us how many of the positive cases we classified correctly. But our recall value is only 33%, so we should reject this model. When we consider precision we got 100%, but we should not depend on it, because in our case a model predicting a normal person as cancer-prone causes far less harm than missing a real cancer patient. Hence, depending on the case and the problem, we should decide whether to consider precision or recall.

4. F1 SCORE

In some cases we need to consider both precision and recall. The F1 SCORE is the harmonic mean of recall and precision.

F1 SCORE = 2*(precision * recall)/(precision + recall)
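
Plugging in the precision and recall from our cancer example:

```python
precision, recall = 1.0, 1 / 3  # values from the cancer example
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)  # 0.5 -> the harmonic mean is dragged down by the poor recall
```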

5. ROC

Before moving to ROC, let's understand what a threshold is. A threshold is set on the predicted probability: any probability below the threshold is assigned to the negative class, whereas any probability above it is assigned to the positive class.

Here the threshold is assumed to be 0.5, so if the resultant probability falls below 0.5 we classify the case as negative.

If we change or alter the threshold for our model, we end up with a different confusion matrix (in the original figure, each black dotted point marks the confusion matrix for a different threshold value). Plotting the true positive rate (TPR) against the false positive rate (FPR) across all thresholds gives the ROC curve. We should choose the threshold where TPR is high while FPR stays low; this is the optimal threshold for our data.
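
As a sketch of how one might locate such a threshold with scikit-learn's roc_curve (the probabilities below are made up, and maximizing TPR minus FPR, known as Youden's J statistic, is just one common heuristic):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # actual classes
y_score = np.array([0.1, 0.4, 0.35, 0.8,            # made-up predicted
                    0.2, 0.7, 0.55, 0.9])           # probabilities

# roc_curve evaluates TPR and FPR at every candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Pick the threshold that maximizes TPR - FPR (Youden's J statistic)
best = np.argmax(tpr - fpr)
print(f"threshold={thresholds[best]:.2f}  TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")
```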

6. AUC

Okay, if we build models with different algorithms (LOGISTIC REGRESSION, KNN, DECISION TREE, …), then which model fits our data best? Which model should we choose?

In the original figure, the red ROC curve occupies more area than the blue one; hence we should choose the model with the higher area under the curve (AUC).
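
As a sketch, we can compare two hypothetical models by computing the AUC of their predicted scores (both score lists below are invented for illustration):

```python
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]                   # actual classes
scores_a = [0.2, 0.3, 0.7, 0.8, 0.1, 0.9, 0.45, 0.4]  # e.g. logistic regression
scores_b = [0.4, 0.6, 0.5, 0.7, 0.3, 0.6, 0.7, 0.5]   # e.g. KNN

print("model A:", roc_auc_score(y_true, scores_a))  # the model with the
print("model B:", roc_auc_score(y_true, scores_b))  # higher AUC wins
```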

CONCLUSION

Let's wrap up our discussion: based on the use case, we need to choose the appropriate evaluation metric for our data. Through this blog you may have understood when to choose which metric, why we should not depend completely on accuracy, and which model to choose when dealing with different algorithms.
