We use predictive models as our advisors, helping us make better decisions using their output. A reasonable question, then, is “is my model accurate enough to be useful”? An already-present part of the process for most ML practitioners is Cross Validation, that beloved Swiss Army Knife of model validation. Anyone doing their due diligence when training a predictive model will try a few out, and select the one with the minimum Mean Squared Error, or perhaps use the one standard error rule. Either way, you’ll look at some out-of-sample errors for different models.

This doesn’t usually answer our question, though. Model selection tells us which choice is the best among the available options, but it’s unclear whether even the best one is actually good enough to be useful. I myself have had the frustrating experience of performing an in-depth model selection process, only to realize at the end that all my careful optimizing has given me a model which is better than the baseline, but still so bad at predicting that it is unusable for any practical purpose.

So, back to our question. What does “accurate enough to be useful” mean, exactly? How do we know if we’re there?

We could try imposing a rule of thumb like “your MSE must be this small”, but this seems to require context. After all, different tasks require different levels of precision in the real world - this is why dentists do not (except in extreme situations) use jackhammers, preferring tools with a diameter measured in millimeters.

Statistical measures of model or coefficient significance don’t seem to help either; knowing that a given coefficient (or all of them) are statistically significantly different from zero is handy, but does not tell us that the model is ready for prime time. Even the legendary $R^2$ doesn’t really have a clear a priori “threshold of good enough” (though surprisingly, I see to frequently run into people who are willing to do so, often claiming 80% or 90% as if their model is trying to make the Honor Roll this semester). If you’re used to using $R^2$, a perspective I found really helpful is Ch. 10 of Cosma Shalizi’s The Truth About Linear Regression.

An actual viable method is to look at whether your prediction intervals are both practically precise enough for the task and also cover the data, an approach detailed here. This is a perfectly sensible choice if your model provides you with an easy way to compute prediction intervals. However, if you’re using something like scikit-learn you’ll usually be creating just a single point estimate (ie, a single fitted model of $\mathbb{E}[y \mid X]$ which you can deploy), and it may not be easy to generate prediction intervals for your model.

The method that I’ve found most effective is to work with my stakeholders and try to determine **what size of relative (percent) error would be good enough for decision making**, and then see how often the model predictions meet that requirement. Usually, I ask a series of questions like:

- Imagine the model was totally accurate and precise, ie it hit the real value 100% of the time. What would that let us do? What value would that success bring us in terms of outcomes? Presumably, there is a clear answer here, and this would let us increase output, sell more products, or something else we want.
- Now imagine that the model’s accuracy was off by a little bit, say 5%. Would you still be able to achieve the desired outcome?
- If so, what if it was 10%? 20%? How large could the error be and still allow you to achieve your desired outcome?
- Take this threshold, and consider every prediction within it to be a “hit”, and everything else is a “miss”. In that case, we can evaluate the model’s practical usefulness by seeing how often it produces a hit.

This allows us to take our error measure, which is a continuous number, and discretize it. We could add more categories by defining what it means to have a “direct hit”, a “near miss”, a “bad miss” etc. You could then attach a predicted outcome to each of those discrete categories, and you’ve learned something not just about how the model makes **predictions**, but how it lets you make **decisions**. In this sense, it’s the regression-oriented sequel to our previous discussion about analyzing the confusion matrix for classifiers - we go from pure regression analysis to decision analysis using a diagnostic. The “direct hits” for a regression model are like landing in the main diagonal of the confusion matrix.

In a sense, this is a check of the model’s “calibration quality”. While I usually hear that term referring to probability calibration, I think it’s relevant here too. In the regression setting, a model is “well calibrated” when its prediction are at or near the actual value. We’ll plot the regression equivalent of the calibration curve, and highlight the region that counts as a good enough fit.

Let’s do a quick example using this dataset of California House Prices along with their attributes. Imagine that you’re planning on using this to figure out what the potential price of your house might be when you sell it; you want to know how much you might get for it so you can figure out how to budget for your other purchases. We’ll use a Gradient Boosting model, but that’s not especially important - whatever black-box method you’re using should work.

First, lets get all our favorite toys out of the closet, grabbing our data and desired model:

```
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
data = california_housing.frame
input_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
target_variable = 'MedHouseVal'
X = data[input_features]
y = data[target_variable]
model = HistGradientBoostingRegressor()
```

In this context, the acceptable amount of error is probably dictated by how much money you have in the bank as a backup in case you get less for the house than you expected. For your purposes, you decide that a difference of 35% compared to the actual value would be too much additional cost for you to bear.

We’ll first come up with out-of-sample predictions using the cross validation function, and then we’ll plot the actual vs predicted values along with the “good enough” region we want to hit.

```
predictions = cross_val_predict(model, X, y, cv=5) # cv=5 for 5-fold cross-validation
from matplotlib import pyplot as plt
import seaborn as sns
x_y_line = np.array([min(predictions), max(predictions)])
p = 0.35 # Size of threshold, 35%
sns.histplot(x=predictions, y=y) # Plot the predicted vs actual values
plt.plot(x_y_line, x_y_line, label='Perfect accuracy', color='orange') # Plot the "perfect calibration" line
plt.fill_between(x_y_line, x_y_line*(1+p), x_y_line*(1-p), label='Acceptable error region', color='orange', alpha=.1) # Plot the "good enough" region
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.legend()
plt.show()
```

For reasons I can’t really explain, I find it very amusing that this diagram looks like the flag of Seychelles, and would look even more so if we added finer gradations of hit vs missed targets.

In addition to a chart like this, it’s also handy to define a numeric score - we could even use this for model selection, if we wanted to. One that seems like it would be an easy step to me is the percentage of time our model makes predictions that land in the bound of acceptable error. Hopefully, that number is high, indicating that we can usually expect this model to produce outputs of good enough quality to use for decision-making.

If we define $p$ as the acceptable percent change, we can compute the estimated *percent of predictions within acceptable error as*:

To think about this from an engineering perspective, our use case defines the “tolerance”, similar to the tolerance which is set in machining parts. This quantity tells us how often the product which our model produces (ie its output) is within the tolerance for error that we can handle.

```
# Within target region calculation
within_triangle = sum((y*(1-p) < predictions) & (predictions < y*(1+p)))
print(round(100 * (within_triangle / len(y))), 2)
```

That gives us 66% for this model on this data set - a strong start, though there’s probably room for improvement. It seems unlikely that we’d be willing to deploy this model as-is, and we’d want to improve performance by adding more features, more data, or improving the model design. However, even though this model is not usable currently, it’s useful to now have a way of measuring how well the model fits the task at hand.

This is just one example of doing decision-oriented model validation, but the method could be expanded or taken in different directions. If we wanted to get a finer idea of how our decisions might play out, we could break the plot into more segments, like introducing regions for “near misses” or “catastrophic misses”. You could also probably analyze the relationship between predicted and actual with quantile regression, learning what the “usual” lower bound on actual value given the predicted value is.

Written on September 24th, 2023 by Louis CialdellaFeel free to share!