
Take off your point prediction blinders: Conformal inference in Python with MAPIE

All exact science is dominated by the idea of approximation. When a man tells you that he knows the exact truth about anything, you are safe in inferring that he is an inexact man.

Bertrand Russell

You may be suffering from point prediction tunnel vision

Let’s say you want to sell your house (also, you own a house, congratulations), and you decide to build a model to predict what kind of price you might get for it. You do all the usual steps: you collect historical data, try a wide range of features, and perform model selection with a holdout set. Armed with a model that can tell you the expected value of your house, you run model.predict(my_house_features) and find out that your model predicts a value of…drum roll please…one million dollars! Nice, you’re going to be a millionaire! Well, maybe a sub-millionaire after all applicable taxes and legal fees, but still not bad.

Of course, you know enough about models to realize that your house won’t sell for exactly one million dollars and zero cents. It’ll be a little more or a little less. How much more or less, though? If the answer is “a few dollars” more or less, great. If the answer is “a million dollars” more or less, then your forecast isn’t actually saying much. Reviewing our basic machine learning, our model is really giving us the expected house price given all the house’s features. That is, the model is a model of $\mathbb{E}[y \mid X]$, where $y$ is the house price and $X$ are the house features.

Quantifying the range of possible outcomes around the prediction is really important. Despite that, people frequently take the point prediction made by a model as gospel. I sometimes think of this as point prediction tunnel vision: if we only consider the point prediction, we miss all the other possible cases we should be factoring into our decision.

So what is a house seller to do? This is the familiar problem of producing a prediction interval. A couple of options present themselves:

  1. Go back and refit everything with a linear model, which comes with textbook prediction intervals.
  2. Go back and refit everything as a full Bayesian model, and read intervals off the posterior predictive distribution.

Both of those are a little unsatisfying. Plenty of models are not linear models, and after all, we went through all this work to build a model of $\mathbb{E}[y \mid X]$. Can’t we use that somehow?

One tempting option is to build multiple models by bootstrapping and look at the distribution of their predictions. However, that’s not quite what we want. The variation across bootstrap samples doesn’t tell us what range of actual values this particular house might sell for; rather, it tells us how uncertain we should be about the model’s estimate of the expected value, $\mathbb{E}[y \mid X]$. Bootstrapping gives us a confidence interval for the conditional mean, not a prediction interval for the actual values we might observe.
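To see the distinction concretely, here is a small simulated sketch (the data and numbers are made up for illustration): the bootstrap spread of the predicted conditional mean is tiny and shrinks with the sample size, while the spread of actual outcomes around that mean stays put.

import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data: y = 2x + noise, with noise standard deviation 1.0
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 2 * X[:, 0] + rng.normal(scale=1.0, size=1000)

# Refit the model on 200 bootstrap resamples and predict at one new point
x_new = np.array([[1.0]])
boot_preds = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))
    boot_preds.append(LinearRegression().fit(X[idx], y[idx]).predict(x_new)[0])

# Spread of the bootstrap predictions: our uncertainty about E[y|X].
# This comes out around 0.04 - but actual outcomes scatter around E[y|X]
# with standard deviation 1.0, and that wider spread is what a prediction
# interval has to capture.
print(np.std(boot_preds))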

Let’s go back to the way we construct prediction intervals for linear models. In a linear regression, we assume that the observations are distributed according to:

\[y \sim Normal(\alpha + \beta X, \sigma)\]

We can make a point prediction by computing the expected value of $y$, i.e.:

\[\mathbb{E}[y \mid X] = \alpha + \beta X\]

If we actually knew the parameter values $\alpha$ and $\beta$, then the observed value of $y$ would be within $2 \sigma$ of the point prediction 95% of the time. If we don’t know $\sigma$, we can estimate it from the data by looking at the standard deviation of the residuals, and use that to generate prediction intervals (technically the residual standard deviation is only proportional to $\sigma$, but let’s ignore that for the moment).
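Here is a minimal sketch of that recipe on simulated data (the data and variable names are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated linear data with known noise level sigma = 0.8
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.8, size=500)

model = LinearRegression().fit(X, y)

# Estimate sigma from the standard deviation of the residuals
sigma_hat = np.std(y - model.predict(X))

# ~95% prediction interval: point prediction +/- 2 sigma
x_new = rng.normal(size=(1, 3))
point = model.predict(x_new)[0]
low, high = point - 2 * sigma_hat, point + 2 * sigma_hat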

The key intuition from the linear model is that we can look at the shape of the residual distribution, i.e. at how far predictions usually are from the real values, and use that to calibrate our prediction interval. There is a generalization of this idea to non-linear regression models and beyond, and it is called conformal inference.

The key idea in conformal inference: the prediction interval contains all the points within “error distance” of the point prediction

As far as I can tell, the idea generalizes like this: if we can define a distance in the output space, we can do conformal prediction. The distance plays the role of the residual, and the prediction interval is simply the set of points within the “usual” error distance of the point prediction. Nothing about that requires a linear model, or even a regression model - it gives us prediction intervals in arbitrary output spaces, where conformity = distance. The distance framing of conformal inference goes like this:

  1. Pick a distance measure in the Y-space (the conformity score).
  2. Generate out-of-sample predictions.
  3. Look at how close the predictions “usually” are to the actual values.
  4. Make a prediction for a new point; the points within the “usual” distance of that prediction form the prediction interval.

This recipe actually generalizes the prediction interval we built for the linear model: there, the distance measure was the absolute residual, and the “usual” distance worked out to about $2\sigma$. A minimal sketch of the recipe in code is below.
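Here is that sketch: the split conformal variant, which uses a single held-out calibration set and the absolute residual as the distance (I’m ignoring the small finite-sample correction to the quantile, and the variable names are mine):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
# Hold out a calibration set that the model never trains on
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: the distance measure is the absolute residual |y - prediction|
# Step 2: generate out-of-sample predictions on the calibration set
model = HistGradientBoostingRegressor().fit(X_train, y_train)
cal_distances = np.abs(y_cal - model.predict(X_cal))

# Step 3: how far off are predictions "usually"? (the 1 - alpha quantile)
alpha = 0.05
usual_distance = np.quantile(cal_distances, 1 - alpha)

# Step 4: the interval is every point within the usual distance
y_hat = model.predict(X_cal[:5])
low, high = y_hat - usual_distance, y_hat + usual_distance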

Black box PIs with conformal inference

When applying conformal inference, the user makes three choices (following section 2.1 of the MAPIE paper, linked in the appendix):

  1. Pick a conformity score (the distance measure).
  2. Decide how to generate the out-of-sample predictions (jackknife, cross-validation, etc.).
  3. Choose the risk level $\alpha$: the intervals should contain the true value a fraction $1 - \alpha$ of the time.

MAPIE example for regression: Training and prediction

import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from mapie.regression import MapieRegressor
from mapie.metrics import regression_mean_width_score, regression_coverage_score

california_housing = fetch_california_housing(as_frame=True)
data = california_housing.frame

input_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
target_variable = 'MedHouseVal'

X = data[input_features]
y = data[target_variable]
# Keep the test set small so the interval plot below stays readable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.005)

regressor = HistGradientBoostingRegressor()

# Wrap any sklearn-compatible regressor; method='plus' selects the CV+
# strategy, and cv=5 generates the out-of-sample predictions by 5-fold CV
mapie_regressor = MapieRegressor(estimator=regressor, method='plus', cv=5)

mapie_regressor = mapie_regressor.fit(X_train, y_train)
# alpha is the risk level; ask for 95% and 68% intervals at once
y_pred, y_pi = mapie_regressor.predict(X_test, alpha=[0.05, 0.32])

# Shape of y_pi is n_rows x {low, high} x alphas
y_pi_05 = y_pi[:, :, 0]

low_05, high_05 = y_pi_05[:, 0], y_pi_05[:, 1]

# Predicted vs actual, with a vertical bar for each 95% prediction interval
plt.scatter(y_test, y_pred)
plt.vlines(y_test, low_05, high_05)

# Dotted y=x line: a perfect point prediction would land exactly on it
plt.plot([np.min(y_test), np.max(y_test)], [np.min(y_test), np.max(y_test)], linestyle='dotted')

# Coverage should be at least 1 - alpha = 0.95, width as small as possible
print('Coverage:', regression_coverage_score(y_test, low_05, high_05))
print('Average interval width:', regression_mean_width_score(low_05, high_05))
# Pick the model with acceptable coverage + lowest width

The keyword arguments to MapieRegressor map directly onto the user choices above: estimator is any scikit-learn-compatible regressor, method selects the conformal strategy (‘plus’ is the CV+/jackknife+ family), and cv controls how the out-of-sample predictions are generated (here, 5-fold cross-validation). The risk level $\alpha$ is supplied at prediction time, and you can ask for several levels at once, as in the example.

CQR for heteroskedasticity

The methods above produce intervals of roughly constant width across the input space. If the noise level varies with $X$ (heteroskedasticity), conformalized quantile regression (CQR) is a better fit: instead of conformalizing a mean-regression model, it conformalizes the outputs of a quantile-regression model, so the intervals widen where the data is noisier. The original paper and the MAPIE tutorial:

https://proceedings.neurips.cc/paper_files/paper/2019/file/5103c3584b063c431bd1268e9b5e76fb-Paper.pdf

https://mapie.readthedocs.io/en/stable/examples_regression/4-tutorials/plot_cqr_tutorial.html
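To show the mechanics, here is a hand-rolled sketch of split CQR built on scikit-learn’s quantile loss (MAPIE packages this up properly; see the tutorial above):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit quantile regressors for the lower and upper tails (2.5% and 97.5%)
risk = 0.05
lo_model = GradientBoostingRegressor(loss='quantile', alpha=risk / 2).fit(X_train, y_train)
hi_model = GradientBoostingRegressor(loss='quantile', alpha=1 - risk / 2).fit(X_train, y_train)

# Conformity score: how far the true value falls outside the quantile band
# (negative when it is safely inside)
lo_cal, hi_cal = lo_model.predict(X_cal), hi_model.predict(X_cal)
scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
correction = np.quantile(scores, 1 - risk)

# Widen (or shrink) the band by the calibrated correction. Because the band
# itself varies with X, the final interval adapts to heteroskedasticity.
low = lo_model.predict(X_cal[:5]) - correction
high = hi_model.predict(X_cal[:5]) + correction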

Evaluating PI models

The two numbers to track are the hit rate, or coverage (how often the interval actually contains the true value), and the interval width. We want coverage of at least $1 - \alpha$ and width as low as possible. A simple heuristic: plot coverage against width for each candidate model, and pick the model with the lowest width among those whose coverage beats the target. A sketch of that selection is below.
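Here is the heuristic in code, with invented numbers (the model names and scores are hypothetical):

# (coverage, width) per candidate model; the numbers are made up
results = {'linear': (0.93, 1.2), 'hgb': (0.95, 1.5), 'rf': (0.96, 1.8)}
target = 0.95

# Keep the models that hit the target coverage, then take the narrowest
covering = {name: cw for name, cw in results.items() if cw[0] >= target}
best = min(covering, key=lambda name: covering[name][1])
print(best)  # 'hgb': lowest width among models with coverage >= 0.95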

For a single number that trades coverage off against width, there is the interval score, a proper scoring rule for prediction intervals. It comes up in this discussion: https://stats.stackexchange.com/questions/465799/testing-for-clairvoyance-or-performance-of-a-model-where-the-predictions-are-i/465808#465808

and is introduced in section 6.2 of Gneiting and Raftery (2007): https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf
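Here is a short implementation of that score, written from the formula in the paper:

import numpy as np

# Interval score from section 6.2 of Gneiting and Raftery (2007).
# Lower is better: it charges the width of the interval, plus a penalty of
# 2/alpha times the distance by which the true value misses the interval.
def interval_score(y_true, low, high, alpha):
    width = high - low
    below = (2 / alpha) * np.maximum(low - y_true, 0)
    above = (2 / alpha) * np.maximum(y_true - high, 0)
    return np.mean(width + below + above)

# e.g. for the MAPIE example above:
# interval_score(y_test, low_05, high_05, alpha=0.05)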

What does a non-regression version look like?

Classification: the analogue of the prediction interval is a prediction set, a set of labels that should contain the true label at least a fraction $1 - \alpha$ of the time; MAPIE implements this in MapieClassifier.

Time series: the analogue is a prediction band around the forecast; here MAPIE implements the EnbPI method. See the MAPIE docs for both.
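For classification, here is a sketch of what the prediction-set version looks like with MapieClassifier (the method name and output shape below are from my reading of the MAPIE docs, so double-check them against your installed version):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from mapie.classification import MapieClassifier

X, y = load_iris(return_X_y=True)

# 'lac' scores each label using the model's predicted probability
mapie_clf = MapieClassifier(estimator=LogisticRegression(max_iter=1000), method='lac', cv=5)
mapie_clf = mapie_clf.fit(X, y)

# y_ps is a boolean array of shape (n_samples, n_classes, n_alphas):
# True where a label is included in the prediction set
y_pred, y_ps = mapie_clf.predict(X[:5], alpha=0.05)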

Outro

Once you have calibrated prediction intervals instead of bare point predictions, you can do more than report a range: you can simulate plausible outcomes and feed them into actual decisions. For the house sale, that means asking not “what is the expected price?” but “what should I do given the whole range of prices I might plausibly get?”. Taking off the point prediction blinders is what makes that question answerable.

Appendix - relevant papers

Gentle introduction to conformal prediction (Angelopoulos and Bates) - https://arxiv.org/pdf/2107.07511.pdf

The MAPIE paper - https://arxiv.org/abs/2207.12274