Take off your point prediction blinders: Conformal inference in Python with MAPIE
All exact science is dominated by the idea of approximation. When a man tells you that he knows the exact truth about anything, you are safe in inferring that he is an inexact man.
Bertrand Russell
Let’s say you want to sell your house (also, you own a house - congratulations!), and you decide to build a model to predict what kind of price you might get for it. You do all the usual steps: you collect historical data, try a wide range of features, and perform model selection with a holdout set. Armed with a model that can tell you the expected value of your house, you run model.predict(my_house_features)
and find out that your model predicts a value of…drum roll please…one million dollars! Nice, you’re going to be a millionaire! Well, maybe a sub-millionaire after all applicable taxes and legal fees, but still not bad.
Of course, you know enough about models to realize that your house won’t sell for exactly one million dollars and zero cents. It’ll be a little more or a little less. How much more or less, though? If the answer is “a few dollars” more or less, great. If the answer is “a million dollars” more or less, then your forecast isn’t actually saying much. Recall from basic machine learning that our model is really giving us the expected house price given the house’s features. That is, the model is a model of $\mathbb{E}[y \mid X]$, where $y$ is the house price and $X$ are the house features.
Quantifying the range of possible outcomes around the prediction is really important. Despite that, people frequently treat the point prediction made by a model as gospel. I sometimes think of this as point prediction tunnel vision: if we only consider the point prediction, we miss all the other possible outcomes we should be factoring into our decision.
So what is a house seller to do? This is the familiar problem of producing a prediction interval - a range that should contain the actual sale price with some stated probability. A couple of options present themselves:

- Switch to a linear model, which comes with textbook prediction intervals.
- Start over and model the full conditional distribution of the price directly, e.g. with a Bayesian model or quantile regression.
Both of those are a little unsatisfying. Plenty of good models are not linear, and after all, we went through all this work to build a model of $\mathbb{E}[y \mid X]$. Can’t we use that somehow?
One tempting option is to build many models on bootstrap resamples of the data and look at the spread of their predictions. However, that’s not quite what we want: the variation across bootstrap samples doesn’t tell us what range of actual values this particular house might sell for. It tells us how uncertain we should be about the model’s estimate of the expected value, $\mathbb{E}[y \mid X]$. In other words, bootstrapping gives us a confidence interval for the conditional mean, not a prediction interval for the values we might actually observe.
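To see the distinction concretely, here’s an illustrative sketch (synthetic data and names of my own choosing): the spread of the bootstrap predictions is far tighter than the spread of the actual observations.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X[:, 0] + rng.normal(0, 3, size=500)  # observation noise sd = 3

# Refit the model on bootstrap resamples and predict the same house each time
preds = [
    LinearRegression().fit(*resample(X, y)).predict([[5.0]])[0]
    for _ in range(1000)
]

# The bootstrap spread reflects uncertainty about E[y|X=5] - on the order of
# 3 / sqrt(500) - while actual observations scatter with sd = 3 around that mean
print(np.std(preds))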
Let’s go back to the way we construct prediction intervals for linear models. In a linear regression, we assume that the observations are distributed according to:

\[y \sim \text{Normal}(\alpha + \beta X, \sigma)\]

We can make a point prediction by computing the expected value of $y$, i.e.:

\[\mathbb{E}[y \mid X] = \alpha + \beta X\]

If we actually knew the parameter values $\alpha$ and $\beta$, then the observed value of $y$ would fall within $2\sigma$ of the point prediction about 95% of the time. If we didn’t know $\sigma$, we could estimate it from the data via the standard deviation of the residuals, and use that to generate prediction intervals (technically the residual standard deviation is only proportional to $\sigma$, but let’s ignore that for the moment).
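As a quick sketch of that recipe on synthetic data (the variable names and data are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(1000, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 2.0, size=1000)  # true sigma = 2

model = LinearRegression().fit(X, y)
sigma_hat = np.std(y - model.predict(X))  # estimate sigma from the residuals

# An approximate 95% prediction interval: point prediction +/- 2 * sigma_hat
y_hat = model.predict([[5.0]])[0]
low, high = y_hat - 2 * sigma_hat, y_hat + 2 * sigma_hat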
The key intuition from the linear model is that we can look at the shape of the residual distribution, i.e., at how far predictions typically fall from the real values, and use that to calibrate our prediction interval. There is a generalization of this idea to non-linear regression models and beyond, and it is called conformal inference.
As far as I can tell, the core idea is this: if we can define a distance in the output space, we can do conformal prediction. We hold out some calibration data that the model never saw during training, and measure how far the model’s predictions land from the true values. These distances are called conformity scores, and their empirical distribution plays exactly the role the residual distribution played for the linear model: it tells us how wide an interval has to be to capture, say, 95% of actual outcomes. For a new input, the prediction interval is simply all the outputs within that calibrated distance of the point prediction.

Framed this way, conformal inference genuinely generalizes the linear-model prediction interval. Instead of assuming Gaussian residuals and reaching for $2\sigma$, we read the required quantile directly off the observed conformity scores, with no distributional assumptions about the model or the data. And because the recipe only needs a distance on the output space, it yields prediction intervals in arbitrary output spaces, not just real-valued regression targets.

In practice (following section 2.1 of the MAPIE paper, linked at the end), the user has to make a few choices: the underlying predictive model, the conformity score that plays the role of the distance, the target coverage level $1 - \alpha$, and the resampling scheme used to estimate the score distribution (a single held-out split, cross-validation, jackknife+, and so on).
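To make the distance framing concrete, here is split conformal prediction done by hand - a minimal sketch, with my own variable names, using absolute residuals as the conformity score:

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, test_size=0.25)

model = HistGradientBoostingRegressor().fit(X_train, y_train)

# Conformity score: absolute residual on data the model never saw
scores = np.abs(y_calib - model.predict(X_calib))

# The 95th percentile of the scores calibrates the interval width
# (standard treatments apply a small finite-sample correction to the level)
alpha = 0.05
q = np.quantile(scores, 1 - alpha)

# Prediction interval for new points: point prediction +/- q
y_new = model.predict(X_calib[:5])
intervals = np.stack([y_new - q, y_new + q], axis=1)

MAPIE packages this recipe (and several smarter variants) behind a scikit-learn-style interface. Here it is on the California housing data: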
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from matplotlib import pyplot as plt
from mapie.regression import MapieRegressor
from mapie.metrics import regression_mean_width_score, regression_coverage_score
from sklearn.datasets import fetch_california_housing
# Load the California housing data as a dataframe
california_housing = fetch_california_housing(as_frame=True)
data = california_housing.frame

input_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
target_variable = 'MedHouseVal'

X = data[input_features]
y = data[target_variable]

# Hold out a small test set (0.5% of ~20k rows, about 100 points)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.005)
# Wrap the base regressor in MAPIE: 5-fold CV+ intervals (method='plus')
regressor = HistGradientBoostingRegressor()
mapie_regressor = MapieRegressor(estimator=regressor, method='plus', cv=5)
mapie_regressor = mapie_regressor.fit(X_train, y_train)

# Request two interval levels at once: 95% (alpha=0.05) and 68% (alpha=0.32)
y_pred, y_pi = mapie_regressor.predict(X_test, alpha=[0.05, 0.32])
# y_pi has shape (n_rows, 2, n_alphas); axis 1 holds the (lower, upper) bounds
y_pi_05 = y_pi[:, :, 0]  # the alpha=0.05 intervals
low_05, high_05 = y_pi_05[:, 0], y_pi_05[:, 1]
# Predicted vs. actual, with one vertical bar per point showing its 95% interval
plt.scatter(y_test, y_pred)
plt.vlines(y_test, low_05, high_05)
plt.plot([np.min(y_test), np.max(y_test)], [np.min(y_test), np.max(y_test)], linestyle='dotted')  # y = x reference
plt.xlabel('Actual value')
plt.ylabel('Predicted value')

# Coverage: the fraction of test points whose true value lands inside the interval
print('Coverage:', regression_coverage_score(y_test, low_05, high_05))
# Width: the average size of the intervals - narrower is better at equal coverage
print('Average interval width:', regression_mean_width_score(low_05, high_05))
# Pick the model with acceptable coverage + lowest width
A quick note on the arguments to MapieRegressor. The estimator is any scikit-learn-compatible regressor - the base model whose errors get conformalized. The cv argument controls how conformity scores are estimated: cv=5 fits the estimator in 5-fold cross-validation, so every training point is scored by a model that never saw it. The method argument controls how the out-of-fold residuals and predictions are turned into intervals; method='plus' gives the CV+ / jackknife+ style intervals, which is the usual recommendation. Finally, alpha in predict is the target miscoverage rate: alpha=0.05 asks for intervals that should contain the true value 95% of the time, and passing a list produces one set of intervals per level.
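Other configurations trade compute for statistical efficiency. For example - assuming the same pre-1.0 MapieRegressor API used above, where (as I understand the docs) cv=-1 requests leave-one-out and cv='prefit' skips refitting:

# Jackknife+: cv=-1 fits one model per training row - tighter guarantees, much more compute
jackknife_mapie = MapieRegressor(estimator=HistGradientBoostingRegressor(), method='plus', cv=-1)

# Split conformal around a model you've already trained: with cv='prefit', MAPIE
# only computes conformity scores on the data passed to fit, so pass a held-out
# calibration set there rather than the training data
fitted_regressor = HistGradientBoostingRegressor().fit(X_train, y_train)
prefit_mapie = MapieRegressor(estimator=fitted_regressor, cv='prefit')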
One weakness of everything above: the intervals all have essentially the same width, because a single residual quantile gets applied to every prediction. That’s a poor fit when the noise is heteroskedastic - when some houses are simply harder to price than others. Conformalized quantile regression (CQR) addresses this by conformalizing the output of a quantile regression, so interval widths adapt to the input:

The CQR paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/5103c3584b063c431bd1268e9b5e76fb-Paper.pdf
MAPIE’s CQR tutorial: https://mapie.readthedocs.io/en/stable/examples_regression/4-tutorials/plot_cqr_tutorial.html
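A minimal sketch of CQR in MAPIE - I’m following the MapieQuantileRegressor usage from the tutorial above, so treat the exact signature as an assumption and defer to the tutorial:

from sklearn.ensemble import GradientBoostingRegressor
from mapie.quantile_regression import MapieQuantileRegressor

# CQR wants a separate calibration split and a base model with a quantile loss
X_tr, X_calib, y_tr, y_calib = train_test_split(X_train, y_train, test_size=0.3)
quantile_model = GradientBoostingRegressor(loss='quantile')

# MAPIE fits lower, upper, and median quantile models, then conformalizes them
mapie_cqr = MapieQuantileRegressor(estimator=quantile_model, alpha=0.05)
mapie_cqr.fit(X_tr, y_tr, X_calib=X_calib, y_calib=y_calib)
y_pred_cqr, y_pi_cqr = mapie_cqr.predict(X_test)  # interval widths now vary with X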
How should we choose between candidate models once intervals are the product? We want coverage at least as good as the nominal level $(1 - \alpha)$, and, subject to that, intervals as narrow as possible. A practical heuristic: compute (or plot) coverage and average width for each candidate, then pick the model with the lowest width among those whose coverage clears the target. This is the principle of maximizing sharpness subject to calibration - see this answer for a good discussion:

https://stats.stackexchange.com/questions/465799/testing-for-clairvoyance-or-performance-of-a-model-where-the-predictions-are-i/465808#465808

a principle which is introduced in section 6.2 of Gneiting et al.:

https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf
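A sketch of that selection loop, reusing the names from the worked example above (the candidate list is my own):

candidates = {
    'linear': LinearRegression(),
    'random_forest': RandomForestRegressor(),
    'hist_gbm': HistGradientBoostingRegressor(),
}

results = {}
for name, est in candidates.items():
    m = MapieRegressor(estimator=est, method='plus', cv=5).fit(X_train, y_train)
    _, pis = m.predict(X_test, alpha=[0.05])
    low, high = pis[:, 0, 0], pis[:, 1, 0]
    results[name] = (regression_coverage_score(y_test, low, high),
                     regression_mean_width_score(low, high))

# Among the models that clear the 95% coverage target, take the narrowest intervals
good = {name: vals for name, vals in results.items() if vals[0] >= 0.95}
best = min(good, key=lambda name: good[name][1])
print(best, good[best])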
Regression isn’t the only setting - the same machinery produces other kinds of calibrated outputs:

Classification: prediction set. Method = ???. Doc link = ???.
Time series: prediction band. Method = EnbPI. Doc link = ???.
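For classification, the output is a set of plausible labels rather than an interval. A rough sketch of what this looks like - I’m assuming MAPIE’s MapieClassifier with its defaults, so check the MAPIE docs for the canonical usage:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from mapie.classification import MapieClassifier

X_iris, y_iris = load_iris(return_X_y=True)
mapie_clf = MapieClassifier(estimator=LogisticRegression(max_iter=1000), cv=5)
mapie_clf = mapie_clf.fit(X_iris, y_iris)

y_label, y_set = mapie_clf.predict(X_iris[:5], alpha=0.1)
# y_set[i, k, 0] is True when class k is in the 90% prediction set for row i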
There’s also a natural next step I’ll only gesture at here: once you have calibrated intervals or sets, you can simulate from them and make decisions against the whole range of outcomes - deciding, say, whether the low end of your house’s interval still clears your mortgage - instead of acting on the point prediction alone.
Further reading:

A gentle introduction to conformal prediction: https://arxiv.org/pdf/2107.07511.pdf
The MAPIE paper: https://arxiv.org/abs/2207.12274