Performance Metrics in Machine Learning Cheat Sheet

Regression Metrics

| Metric | Formula | Description | Advantages | Disadvantages | Interpretation | Minimize or Maximize |
| --- | --- | --- | --- | --- | --- | --- |
| Mean Absolute Error (MAE) | \[ \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \] | Average of absolute differences between actual and predicted values. | Less sensitive to outliers than MSE or RMSE. | Does not consider the direction of the error and does not emphasize larger errors the way MSE and RMSE do. | Suppose our model has an MAE of 20,000 USD. On average, our house-price predictions are off by 20,000 USD, so if the model predicts a house at 500,000 USD, the actual price is typically around 480,000 to 520,000 USD. | Minimize |
| Mean Squared Error (MSE) | \[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] | Average of squared differences between actual and predicted values. | Emphasizes larger errors due to squaring. | Sensitive to outliers because it squares the prediction errors. | If the MSE of our model is 1,000,000,000 USD², the average squared difference between predicted and actual house prices is 1,000,000,000. Raw MSE is hard to interpret because of the squared units, so it is often more helpful to look at its square root (RMSE) instead. | Minimize |
| Root Mean Squared Error (RMSE) | \[ \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \] | Square root of MSE. | Easier to interpret than MSE because it is in the same unit as the target variable. | Like MSE, it weights larger errors more heavily due to squaring. | If the RMSE of our model is 31,623 USD (the square root of 1,000,000,000), the typical magnitude of our prediction errors is roughly 31,623 USD; that is, predictions are scattered around the actual house prices by about 31,623 USD on average. | Minimize |
| Mean Absolute Percentage Error (MAPE) | \[ \frac{100\%}{n} \sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert \] | Average absolute percent difference between observed and predicted values. | Useful when comparing targets of varying scales, since errors are expressed as percentages. | Undefined when actual values are zero and unstable for values close to zero. | If a model predicting house prices has a MAPE of 15%, its predictions are off by 15% of the actual price on average. | Minimize |
| R-Squared (R²) | \[ 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \] | Proportion of the variance in the dependent variable that is predictable from the independent variable(s). | Can be interpreted as a percentage of variance explained. | Measures fit relative to a simple mean model, not absolute fit, and it never decreases when more predictors are added. | If a model predicting house prices has an R² of 0.85, it explains 85% of the variability in house prices from the features. | Maximize |
| Adjusted R-Squared | \[ 1 - (1 - R^2)\frac{n-1}{n-p-1} \] | Like R², but adjusted for the number of predictors p in the model (n is the number of observations). | Penalizes predictors that do not improve the model, so it is better suited for comparing models with different numbers of features. | Slightly more complex to compute and explain than plain R². | If a model predicting house prices with 3 features has an Adjusted R² of 0.82, then after adjusting for the number of predictors the model explains 82% of the variability in house prices. | Maximize |
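To make the regression formulas above concrete, here is a minimal NumPy sketch that computes each metric on a small set of made-up house prices. The function name `regression_metrics`, the `n_features` argument, and the toy numbers are illustrative assumptions rather than part of the original cheat sheet, and the MAPE line assumes no actual value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Compute the regression metrics from the table above with plain NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                     # Mean Absolute Error
    mse = np.mean(errors ** 2)                        # Mean Squared Error
    rmse = np.sqrt(mse)                               # Root Mean Squared Error
    mape = 100 * np.mean(np.abs(errors / y_true))     # assumes no actual value is 0
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape,
            "R2": r2, "Adjusted R2": adj_r2}

# Toy example: house prices in USD for a hypothetical model with 3 features.
y_true = [500_000, 350_000, 420_000, 610_000, 275_000]
y_pred = [520_000, 340_000, 400_000, 590_000, 300_000]
print(regression_metrics(y_true, y_pred, n_features=3))
```

In practice you would typically reach for the equivalents in `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`); the hand-rolled version is shown only to mirror the formulas.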
Classification Metrics

| Metric | Formula | Description | Advantages | Disadvantages | Interpretation |
| --- | --- | --- | --- | --- | --- |
| Accuracy | \[ \frac{TP + TN}{TP + FP + FN + TN} \] | The proportion of correct predictions among the total number of cases examined. | Easy to interpret. | Can be misleading on imbalanced datasets. | If a model has 90% accuracy, 90 out of every 100 predictions are correct. |
| Precision | \[ \frac{TP}{TP + FP} \] | The proportion of positive identifications that were actually correct. | Useful when the cost of false positives is high. | Ignores false negatives, so a model can score well simply by making very few, very confident positive predictions. | If the model's precision is 0.75, 75% of the cases it labeled positive are actual positives. |
| Recall (Sensitivity) | \[ \frac{TP}{TP + FN} \] | The proportion of actual positives that were correctly identified. | Useful when the cost of false negatives is high. | Ignores false positives; a model that labels everything positive achieves perfect recall. | If the model's recall is 0.8, it found 80% of all positive cases. |
| Specificity | \[ \frac{TN}{TN + FP} \] | The proportion of actual negatives that were correctly identified. | Useful when the cost of false positives is high. | Ignores false negatives; a model that labels everything negative achieves perfect specificity. | If the model's specificity is 0.7, it correctly identified 70% of all negative cases. |
| F1 Score | \[ 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \] | The harmonic mean of precision and recall. | Balances precision and recall; useful on imbalanced datasets. | Harder to interpret on its own than precision or recall, since it folds both into a single number. | An F1 score of 0.7 indicates that the model identifies positive cases reasonably well without producing too many false positives or missing too many actual positives. |
| AUC-ROC | \[ \text{AUC} = \frac{\sum R_{\text{positive}} - \frac{n_{\text{positive}} (n_{\text{positive}} + 1)}{2}}{n_{\text{positive}} \cdot n_{\text{negative}}} \] | The area under the curve obtained by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) across classification thresholds; in the rank formula, the sum runs over the ranks of the positive examples when all predicted scores are sorted. | Handles both balanced and imbalanced datasets. | Less directly interpretable, since it summarizes performance across all classification thresholds. | An AUC of 0.9 means there is a 90% chance that the model ranks a randomly chosen positive example above a randomly chosen negative example. |
| Log Loss | \[ -\frac{1}{N} \sum_{i=1}^{N} \left(y_i \log(p_i) + (1-y_i) \log(1-p_i)\right) \] | Measures the performance of a classification model whose predictions are probabilities between 0 and 1. | Accounts for the uncertainty of the predictions, which makes it well suited to probabilistic models. | Sensitive to miscalibrated confidence; penalizes confident but wrong predictions heavily. | A lower Log Loss indicates a better model; 0 means perfect predictions. |
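As a companion sketch for the classification table, the function below computes these metrics from binary labels and predicted probabilities using plain NumPy. The function name `classification_metrics`, the 0.5 decision threshold, and the toy data are illustrative assumptions; the AUC line uses the rank formula from the table and assumes no tied scores, and probabilities are clipped away from 0 and 1 so the Log Loss never takes log(0).

```python
import numpy as np

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute the classification metrics from the table above with plain NumPy."""
    y_true = np.asarray(y_true, dtype=int)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)   # hard labels from probabilities

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)

    # AUC via the rank formula in the table (assumes no tied scores).
    ranks = y_prob.argsort().argsort() + 1       # 1-based ranks of the scores
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    # Log Loss, clipping probabilities to avoid log(0).
    p = np.clip(y_prob, 1e-15, 1 - 1e-15)
    log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall,
            "Specificity": specificity, "F1": f1, "AUC-ROC": auc, "Log Loss": log_loss}

# Toy example with made-up labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.65, 0.8, 0.4, 0.1, 0.3, 0.55]
print(classification_metrics(y_true, y_prob))
```

Note that precision, recall, specificity, and F1 divide by counts that can be zero on degenerate inputs (for example, no predicted positives); the sketch assumes both classes are present and predicted.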
Miss Factorial Academy