Evaluation Metrics

 

Intersection over Union (IoU) and Generalized IoU have already been discussed. A few other important metrics are discussed here.

Top-1 and Top-5 Accuracy

In multi-class object detection problems, the model outputs a probability for each class for every predicted bounding-box.

Top-1 Accuracy checks whether the top class (the one with the highest probability) is the same as the target label. Top-5 Accuracy checks whether the target label is among the top-5 predicted classes (the ones with the highest probabilities).
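As a quick illustration, here is a minimal NumPy sketch of how Top-1 and Top-5 Accuracy might be computed from a matrix of per-class probabilities; the shapes and numbers below are purely illustrative.

```python
import numpy as np

def topk_accuracy(probs, targets, k=5):
    """Fraction of samples whose target label is among the k highest-probability classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]          # indices of the k largest probabilities
    hits = np.any(topk == targets[:, None], axis=1)   # does the target appear among them?
    return hits.mean()

# purely illustrative: 3 samples, 6 classes
probs = np.array([[0.05, 0.10, 0.50, 0.15, 0.10, 0.10],
                  [0.40, 0.20, 0.10, 0.10, 0.10, 0.10],
                  [0.05, 0.30, 0.20, 0.15, 0.15, 0.15]])
targets = np.array([2, 0, 5])
print("top-1:", topk_accuracy(probs, targets, k=1))   # 2/3
print("top-5:", topk_accuracy(probs, targets, k=5))   # 1.0
```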

Precision and Recall

In object detection models, we are trying to find the ground-truth bounding-boxes, and each resulting detection must first be classified as one of:

  • True Positive (TP): a correct detection of a ground-truth bounding-box.
  • True Negative (TN): a correct non-detection, i.e. an empty region correctly left unlabeled.
  • False Positive (FP): an incorrect detection, i.e. an empty region labeled as a bounding-box.
  • False Negative (FN): an undetected ground-truth bounding-box, i.e. an actual object that the model missed.

Precision is the ability of a model to identify only relevant objects, whereas Recall is its ability to find all relevant cases (all ground-truth bounding-boxes). If we have a dataset with $G$ ground-truths and a model that outputs \(N\) detections, of which \(S\) are correct \((S \le G)\), then Precision (P) and Recall (R) can be written as:

\[\text{Precision (P)} = \frac{\sum_{n=1}^S TP_{n}}{\sum_{n=1}^S TP_{n} + \sum_{n=1}^{N-S} FP_{n}} \tag{1}\] \[\text{Recall (R)} = \frac{\sum_{n=1}^S TP_{n}}{\sum_{n=1}^S TP_{n} + \sum_{n=1}^{G-S} FN_{n}} \tag{2}\]
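As a minimal sketch (with purely illustrative counts), Precision and Recall follow directly from the TP, FP and FN counts:

```python
def precision_recall(num_tp, num_fp, num_fn):
    """Precision and Recall from detection counts (Eq. 1 and Eq. 2)."""
    precision = num_tp / (num_tp + num_fp) if (num_tp + num_fp) else 0.0
    recall = num_tp / (num_tp + num_fn) if (num_tp + num_fn) else 0.0
    return precision, recall

# illustrative numbers: S = 7 correct detections out of N = 10 predictions, G = 12 ground-truths
print(precision_recall(num_tp=7, num_fp=3, num_fn=5))   # (0.7, 0.583...)
```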

ROC and AUC

An ROC curve, or Receiver Operating Characteristic curve, is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: TPR and FPR.

The True Positive Rate (TPR) \(\equiv \text{Recall} = \frac{TP}{TP + FN}\), whereas the False Positive Rate (FPR) \(= \frac{FP}{FP + TN}\).

Lowering the classification threshold classifies more items as positive, thereby increasing both the false positives and the true positives. A typical ROC curve is shown below:

Figure: a typical ROC curve.

AUC stands for Area Under the ROC Curve. It measures the entire two-dimensional area underneath the ROC curve from $(0,0)$ to $(1,1)$. AUC ranges in value from \(0\) to \(1\): a model whose predictions are \(100\%\) correct has an AUC of \(1.0\), and one whose predictions are \(100\%\) incorrect has an AUC of \(0.0\).
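Below is a minimal NumPy sketch of how the ROC points and the AUC might be obtained by sweeping the threshold over the predicted scores; the scores and labels are purely illustrative.

```python
import numpy as np

def roc_points(scores, labels):
    """TPR/FPR pairs obtained by sweeping the classification threshold over the scores."""
    order = np.argsort(-scores)            # descending score = decreasing threshold
    labels = labels[order]
    tps = np.cumsum(labels)                # true positives accumulated so far
    fps = np.cumsum(1 - labels)            # false positives accumulated so far
    tpr = tps / labels.sum()
    fpr = fps / (1 - labels).sum()
    return np.r_[0.0, fpr], np.r_[0.0, tpr]

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return np.trapz(tpr, fpr)

# purely illustrative scores and binary labels (1 = positive class)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])
labels = np.array([1, 1, 0, 1, 0, 0])
fpr, tpr = roc_points(scores, labels)
print("AUC:", auc(fpr, tpr))
```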

Two key things here:

  • AUC is scale-invariant. It measures how well predictions are ranked, not their absolute values (see the small demo after this list).
  • AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of which classification threshold is chosen.
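For instance, continuing the sketch above (reusing roc_points, auc, scores and labels), a strictly monotonic rescaling of the scores preserves their ranking and therefore leaves the AUC unchanged:

```python
# a sigmoid is strictly increasing, so it preserves the ranking of the scores:
# the ROC points, and hence the AUC, are exactly the same
squashed = 1.0 / (1.0 + np.exp(-scores))
print(auc(*roc_points(scores, labels)), auc(*roc_points(squashed, labels)))   # identical values
```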

Average Precision

We know that a detection is matched to a ground-truth box and counted as a true positive only when their IoU is higher than a certain threshold; otherwise it counts as a false positive. Keeping only the detections whose confidence score is at least \(\tau\), Eq. (1) and Eq. (2) can be written as:

\(P(\tau) = \frac{TP(\tau)}{TP(\tau) + FP(\tau)}\) and \(R(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)}\)

In the equations above, both \(TP(\tau)\) and \(FP(\tau)\) are decreasing functions of \(\tau\), whereas \(FN(\tau)\) is an increasing function.

Therefore, we can say that:

\(R(\tau)\) is a decreasing function of \(\tau\), whereas nothing definite can be said about \(P(\tau)\). Consequently, the \(P(\tau) \times R(\tau)\) curve exhibits a zig-zag pattern.
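To make \(TP(\tau)\), \(FP(\tau)\) and \(FN(\tau)\) concrete, here is a minimal single-class sketch that keeps detections scoring at least \(\tau\), greedily matches them to ground-truth boxes by IoU, and returns \(P(\tau)\) and \(R(\tau)\). The (x1, y1, x2, y2) box format, the greedy highest-score-first matching, the 0.5 IoU threshold and the sample boxes are all assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def precision_recall_at(detections, gt_boxes, conf_thr, iou_thr=0.5):
    """P(tau) and R(tau): keep detections with score >= tau and match them greedily to ground truth."""
    kept = sorted((d for d in detections if d["score"] >= conf_thr),
                  key=lambda d: -d["score"])
    matched, tp = set(), 0
    for det in kept:
        # best still-unmatched ground-truth box for this detection
        best_g, best_iou = None, 0.0
        for g, gt in enumerate(gt_boxes):
            if g in matched:
                continue
            overlap = iou(det["box"], gt)
            if overlap > best_iou:
                best_g, best_iou = g, overlap
        if best_g is not None and best_iou >= iou_thr:
            matched.add(best_g)          # true positive: correct detection
            tp += 1
    fp = len(kept) - tp                  # detections with no matching ground truth
    fn = len(gt_boxes) - tp              # ground-truth boxes that were missed
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / (tp + fn) if gt_boxes else 0.0
    return precision, recall

# purely illustrative detections and ground truth for one image
detections = [{"box": (10, 10, 50, 50), "score": 0.9},
              {"box": (12, 15, 55, 60), "score": 0.6},
              {"box": (100, 100, 140, 150), "score": 0.4}]
gt_boxes = [(11, 12, 52, 54), (95, 98, 135, 145)]
print(precision_recall_at(detections, gt_boxes, conf_thr=0.5))   # (0.5, 0.5)
```

Sweeping conf_thr over the detection scores yields the \(\big(P(\tau(k)), R(\tau(k))\big)\) pairs used below.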

Average Precision (AP) is the area under the \(\text{Precision} \times \text{Recall}\) curve, after the curve has been pre-processed to remove the zig-zag pattern and made monotonic. It is similar to AUC and ranges between $0$ and $1$.

To compute AP, first collect pairs of Precision and Recall at different confidence values \(\tau(k)\), where \(k = 1, 2, \cdots, K\). These Precisions can be written as \(P(\tau(k))\) and the Recalls as \(R(\tau(k))\).

Next, interpolation is applied to make the \(\text{Precision} \times \text{Recall}\) curve monotonic. The interpolation function \(P_{ip}(R)\), where \(R\) is a real value in the interval \([0,1]\), is defined as:

\[P_{ip}(R) = \max_{k \,\mid\, R(\tau(k)) \ge R} \left[ P(\tau(k)) \right]\]
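A minimal sketch of this interpolation, assuming recalls and precisions hold the collected \(R(\tau(k))\) and \(P(\tau(k))\) values:

```python
import numpy as np

def interpolated_precision(recalls, precisions, r):
    """P_ip(R): the highest precision among all points whose recall is at least r."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    mask = recalls >= r
    return precisions[mask].max() if mask.any() else 0.0
```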

Now, Average Precision (AP) can be calculated by sampling \(P_{ip}(R)\) at the \(N\) reference recall values \(R_{r}(k)\), taken in decreasing order with \(R_{r}(N+1) = 0\):

\[AP = \sum_{k=1}^{N} \left(R_{r}(k) - R_{r}(k+1)\right) P_{ip}(R_{r}(k))\]
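Putting the two steps together, here is a minimal sketch of the AP computation using interpolated_precision from above; the precision/recall pairs are purely illustrative:

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP: area under the interpolated (monotonic) Precision x Recall curve."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    r = np.r_[recalls[np.argsort(-recalls)], 0.0]    # reference recalls, decreasing, padded with 0
    ap = 0.0
    for k in range(len(r) - 1):
        ap += (r[k] - r[k + 1]) * interpolated_precision(recalls, precisions, r[k])
    return ap

# illustrative (Recall, Precision) pairs collected over decreasing confidence thresholds
recalls    = [0.2, 0.4, 0.4, 0.6, 0.8]
precisions = [1.0, 0.8, 0.67, 0.6, 0.5]
print("AP:", average_precision(recalls, precisions))   # 0.58
```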

Mean Average Precision (mAP)

If we have many classes in the dataset, then the mean average precision is simply the AP averaged over all classes:

\[mAP = \frac{1}{C} \sum_{i=1}^C AP_{i}\]

where \(AP_{i}\) is the AP value for the \(i^{th}\) class and \(C\) is the total number of classes.
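Finally, a minimal sketch of the mAP computation, assuming the per-class AP values have already been obtained (the class names and numbers are illustrative):

```python
def mean_average_precision(ap_per_class):
    """mAP: the AP values of all C classes, averaged."""
    return sum(ap_per_class.values()) / len(ap_per_class)

print(mean_average_precision({"car": 0.72, "person": 0.65, "dog": 0.58}))   # 0.65
```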