Predictive models can be categorised into two main classes: 1) regression models, where the output is continuous, and 2) classification models, where the output is nominal or binary. The appropriate evaluation metrics depend on which type of predictive model we are using. In this blog post, I will go through the most important evaluation metrics for both classification and regression models.
Most classification metrics were originally designed for binary classification tasks. To extend a binary metric to multiclass problems, the data is usually treated as a collection of binary problems, one per class, with that class as the positive class and all the others as the negative class. There are then a number of methods to average the binary metric calculations across the set of classes (see here).
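To make the one-vs-rest idea concrete, here is a minimal sketch in plain Python. The function names and the toy labels are my own assumptions for illustration, and it shows only two common averaging schemes (macro and micro) applied to precision; treating a class with no predictions as having precision 0 is also just a convention chosen here.

```python
def per_class_counts(y_true, y_pred, labels):
    """Treat each label in turn as the positive class (one-vs-rest)."""
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def macro_precision(counts):
    # Compute precision per class, then average: every class weighs the same.
    precisions = [
        tp / (tp + fp) if (tp + fp) else 0.0  # convention: 0 if never predicted
        for tp, fp, _ in counts.values()
    ]
    return sum(precisions) / len(precisions)


def micro_precision(counts):
    # Pool the counts over all classes first, then compute a single precision.
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    return total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0.0


# Made-up, imbalanced toy data: four cats, one dog, one bird.
y_true = ["cat", "cat", "cat", "cat", "dog", "bird"]
y_pred = ["cat", "cat", "cat", "cat", "cat", "bird"]

counts = per_class_counts(y_true, y_pred, ["cat", "dog", "bird"])
print(counts)
print("macro precision:", macro_precision(counts))
print("micro precision:", micro_precision(counts))
```

On this toy data the macro average is pulled down by the class that is never predicted, while the micro average is dominated by the frequent class, which is why the choice of averaging method matters on imbalanced data.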
In a binary classification task, the terms “positive” and “negative” refer to the classifier’s prediction, and the terms “true” and “false” refer to whether that prediction corresponds to the external judgment (the actual label). Given this definition, the confusion matrix/table is formulated as below:
Using the four cells of the confusion matrix, true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), we need a number of definitions to analyse it:
Accuracy: the proportion of the total number of predictions that were correct.
Precision: the proportion of predicted positive cases that are actually positive.
Recall or Sensitivity: the proportion of actual positive cases that are correctly identified.
Specificity: the proportion of actual negative cases that are correctly identified.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / #Samples
Precision = TP / (TP + FP)
Recall or Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
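The four formulas translate directly into code. The following is a minimal sketch, not a library implementation: the function name is hypothetical, the counts in the usage line are made up for illustration, and returning 0.0 when a denominator is zero is a convention I am assuming here.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the four metrics from the cells of a binary confusion matrix."""

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the metric is undefined.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # also called sensitivity
        "specificity": ratio(tn, tn + fp),
    }


# Made-up counts, only to show the call.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```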
The following example makes the above definitions clearer. Consider a data set of 100 samples from patients who have been tested for a specific cancer: 20% belong to the class of patients who actually have cancer and 80% are healthy. Assume the results for a given classifier are as follows (TP = 1, FN = 19, FP = 5, TN = 75):
Accuracy: (1 + 75) / (1 + 75 + 5 + 19) = 0.76
Precision: 1 / (1 + 5) = 0.17
Recall: 1 / (1 + 19) = 0.05
Specificity: 75 / (75 + 5) = 0.94
The accuracy of the classifier is 76%, which seems reasonably good, but in our case accuracy is not a metric we can rely on: our data is imbalanced (the number of observations in each class is not equal) and the classes do not have the same importance (e.g. classifying a person who has cancer as healthy is riskier than classifying a healthy person as having cancer). As you can see above, the other metrics help us to analyse the classifier better: a recall of 0.05 shows that it finds only 1 of the 20 patients who actually have cancer.
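A quick way to see why accuracy misleads on imbalanced data is to score a trivial baseline on the same class split. The sketch below uses a hypothetical classifier that is not part of the example above: it simply predicts "healthy" for every patient.

```python
# Same class split as the example: 20 patients with cancer (1), 80 healthy (0).
y_true = [1] * 20 + [0] * 80
# Hypothetical baseline (an assumption for illustration): always predict healthy.
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print("accuracy:", accuracy)  # high, because most patients are healthy
print("recall:", recall)      # zero, because no cancer case is ever found
```

This baseline reaches a high accuracy without detecting a single cancer patient, so accuracy alone cannot distinguish it from a useful model.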
Area Under ROC Curve (AUC - ROC)
Before discussing the ROC (Receiver Operating Characteristic) curve, we need to look briefly at the types of output that classification algorithms produce.
In classification, the output of the algorithm can be either a class label or a probability. For example, Support Vector Machines and KNN in their basic form output 0 or 1 (a class label), while other algorithms such as Logistic Regression and Random Forest output a probability indicating how likely it is that an input belongs to a given class. Converting a probability output to a class label is just a matter of choosing a threshold probability (e.g. a threshold equal to 0.5).
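As a small illustration of the thresholding step, here is a sketch in plain Python. The function name and the probabilities are made up, and whether a probability exactly equal to the threshold counts as positive is a convention (I assume "greater than or equal" here).

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities for the positive class into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up predicted probabilities for four inputs.
probs = [0.10, 0.40, 0.55, 0.80]

default_labels = to_labels(probs)                 # threshold 0.5
lenient_labels = to_labels(probs, threshold=0.3)  # lower threshold, more positives

print(default_labels)
print(lenient_labels)
```

Lowering the threshold turns more inputs into positives, which typically raises recall at the cost of more false positives.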
The ROC curve is created by plotting the true positive rate (Sensitivity) against the false positive rate (1 - Specificity) at various threshold probabilities. Please see here to understand the ROC curve through a graphical demo.
As the above figure shows, a classifier performs better the more closely its ROC curve runs straight up the Y axis and then along the top of the plot, parallel to the X axis. A classifier has no discriminative power if its curve sits on the diagonal. A ROC curve is therefore very useful for selecting a proper threshold probability for a classifier, one that keeps the true positive rate high while keeping the false positive rate low.
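The construction of the curve can be sketched in a few lines: sweep the threshold over the observed scores and record one (false positive rate, true positive rate) point per threshold. This is a simplified illustration with made-up labels and scores and a hypothetical function name; it assumes both classes are present in the data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, from strict to lenient."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up labels (1 = positive) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each printed point corresponds to one candidate threshold, so picking a threshold amounts to picking the point on the curve with an acceptable trade-off between the two rates.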
The ROC curve is also very useful for evaluating the performance of a given classifier and for comparing two or more classifiers. To this end, the Area Under the Curve (AUC) is used as a metric. As you can see in the above figure, the AUC for the random classifier is 0.5 and the ideal classifier has an AUC equal to 1.0; for most classifiers the AUC lies between these two values. It is worth mentioning that an AUC below 0.5 might indicate that something is wrong with your classifier, e.g. the positives and negatives might have been mislabelled.
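One way to compute the AUC without drawing the curve relies on an equivalent formulation that is not spelled out above: the AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below uses that pairwise formulation; the function name and data are made up, and it assumes both classes are present.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_pairwise(y_true, scores)
# Swapping the labels simulates the mislabelling case mentioned above.
flipped_auc = auc_pairwise([1 - y for y in y_true], scores)

print("AUC:", auc)
print("AUC with flipped labels:", flipped_auc)
```

With the labels flipped, the AUC becomes one minus the original value, which is why a result well below 0.5 is a hint to check how the positives and negatives were encoded.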
Mean Absolute Error (MAE) and Mean Square Error (MSE)
MAE is the average of the absolute differences between predictions and actual values. It gives an idea of the magnitude of the error without taking its direction into account. MSE is the average of the squared differences between predictions and actual values. Because the errors are squared, MSE exaggerates the effect of outliers, whereas MAE weights all errors linearly.
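The difference in how the two metrics react to an outlier is easy to see on a toy example. The following sketch uses made-up numbers and hypothetical function names; the second prediction list differs from the first only in its last value, which is deliberately far off.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up values for illustration.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
predicted_with_outlier = [2.0, 5.0, 4.0, 17.0]  # last prediction is off by 10

print("MAE:", mae(actual, predicted), "MSE:", mse(actual, predicted))
print(
    "with outlier -> MAE:", mae(actual, predicted_with_outlier),
    "MSE:", mse(actual, predicted_with_outlier),
)
```

The single bad prediction roughly triples the MAE but increases the MSE more than tenfold, which is the exaggeration effect described above.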
R Squared (R^2)
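R^2 indicates how close the actual data are to the fitted regression line. For a least-squares fit evaluated on its training data, the value lies between 0 (no fit) and 1 (perfect fit); on new data it can even be negative when the model predicts worse than simply using the mean of the actual values.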
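A common way to compute R^2, which I am assuming here since the formula is not given above, is 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below uses made-up numbers and a hypothetical function name.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot (assumed definition, see the text above)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


# Made-up values for illustration.
actual = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(actual, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
decent = r_squared(actual, [1.5, 2.5, 2.5, 3.5])     # small errors
mean_only = r_squared(actual, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_fit = r_squared(actual, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, decent, mean_only, reversed_fit)
```

Always predicting the mean gives exactly 0, a perfect fit gives 1, and predictions that are worse than the mean push the value below 0.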