Casual Inference: Data analysis and other apocrypha

Most of the time, we assume our data is an IID sample from the population. That assumption does a lot of work for regression tools, and a lot of causal inference is regression-y.

Why should we believe the sampling process is IID? Often it isn't. It happens all the time that sampling wasn't done on individual units but on clusters. For example, if you want to poll students in a school district, it might be easier to first pick schools and then poll all the students within each selected school. Cluster A/B tests are the same story. Or maybe you do random sampling of units from the database, but those units belong to clusters and are correlated within each cluster.

Some examples I’ve seen:

Difference-in-differences! https://www.theeffectbook.net/ch-DifferenceinDifference.html#how-is-it-performed-2

Your standard errors are probably wrong: https://theeffectbook.net/ch-StatisticalAdjustment.html#your-standard-errors-are-probably-wrong

“The use of clustered standard errors can account for any sort of correlation between errors within each grouping.” (The Effect)

Intraclass correlation: https://en.wikipedia.org/wiki/Intraclass_correlation

Adding a term to the regression deals with bias; clustering your standard errors gets you the correct variance.
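
How wrong can the IID variance be? The intraclass correlation from the link above feeds into a back-of-the-envelope design effect: with clusters of (roughly equal) size m and intraclass correlation rho, the variance of the sample mean is inflated by about 1 + (m - 1) * rho relative to the IID formula. A minimal sketch (the design_effect helper and the school-size numbers are just made up for illustration):

# design effect (Kish/Moulton factor): variance inflation of the sample mean
# when observations come in clusters of size m with intraclass correlation rho
def design_effect(m, rho):
    return 1 + (m - 1) * rho

# e.g. 50 students per school; even a small ICC blows up the variance
for rho in [0.01, 0.05, 0.2]:
    print(f"rho={rho}: the IID variance formula is off by a factor of {design_effect(50, rho):.2f}")

Even a modest ICC of 0.05 with 50 students per school means the naive variance is off by a factor of about 3.5.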

When in doubt, think about the sampling process.
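
To make it concrete, here's a small simulation: 20 clusters with very unequal sizes, cluster means that are far apart, and near-identical observations within each cluster. Pooled together that's about 1,300 rows, but they carry something much closer to 20 clusters' worth of independent information, so an IID standard error will be far too small.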

import numpy as np
import pandas as pd

# 20 clusters with very unequal sizes: cluster means are spread far apart (sd 100)
# while observations within a cluster are nearly identical (sd 1)
cluster_sample_sizes = [100, 1, 1, 1, 1, 1, 10,
                        10, 10, 10, 10, 10, 100,
                        1000, 1, 1, 2, 5, 8, 10]
cluster_means = np.random.normal(0, 100, len(cluster_sample_sizes))
cluster_samples = [np.random.normal(m, 1, size=s)
                   for m, s in zip(cluster_means, cluster_sample_sizes)]
pooled_samples = np.concatenate(cluster_samples)
cluster_labels = np.concatenate([[i] * s for i, s in enumerate(cluster_sample_sizes)])

df = pd.DataFrame({'cluster': cluster_labels, 'y': pooled_samples})

import statsmodels.formula.api as smf

# intercept-only regression: the estimated intercept is just the mean of y
model = smf.ols('y ~ 1', df).fit()

# recompute the covariance matrix allowing arbitrary correlation within clusters
clustered_results = model.get_robustcov_results(cov_type='cluster',
                                                groups=df['cluster'])

print(model.summary())

print(clustered_results.summary())
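
Pulling out just the standard error of the intercept (which is the mean of y here) makes the gap easier to see. This is a sketch reusing the model, clustered_results, and df objects from above; the last two lines show an equivalent way to request the clustered covariance directly at fit time:

# compare the intercept's standard error under the two covariance estimates
print("naive (IID) SE:      ", float(np.asarray(model.bse)[0]))
print("clustered SE:        ", float(np.asarray(clustered_results.bse)[0]))

# equivalent: ask for the clustered covariance when fitting
alt = smf.ols('y ~ 1', df).fit(cov_type='cluster', cov_kwds={'groups': df['cluster']})
print("clustered, via fit():", float(np.asarray(alt.bse)[0]))

Another route to the same uncertainty, without leaning on the sandwich formula, is the cluster bootstrap: resample whole clusters with replacement, recompute the statistic each time, and take the spread of the results.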

from sklearn.utils import resample

def cluster_bootstrap(data, cluster_col, statistic, n_iterations=1000):
    """Bootstrap SE of `statistic`, resampling whole clusters with replacement."""
    np.random.seed(42)
    clusters = data[cluster_col].unique()
    stats = []

    for _ in range(n_iterations):
        # sample clusters with replacement
        sampled_clusters = resample(clusters, replace=True)
        # build the bootstrap sample from the sampled clusters
        sampled_data = pd.concat([data[data[cluster_col] == cluster]
                                  for cluster in sampled_clusters])

        # apply the statistic to the bootstrap sample
        stats.append(statistic(sampled_data))

    return np.std(stats)

print(cluster_bootstrap(df, 'cluster', lambda d: d['y'].mean()))
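
For contrast, a naive bootstrap that resamples individual rows and ignores the clusters should land near the (too small) IID standard error, while the cluster bootstrap above should land near the clustered one. A quick sketch reusing df from above (naive_bootstrap is just an illustrative name):

def naive_bootstrap(data, statistic, n_iterations=1000):
    """Bootstrap SE that resamples rows, pretending they're independent."""
    np.random.seed(42)
    return np.std([statistic(resample(data, replace=True))
                   for _ in range(n_iterations)])

print(naive_bootstrap(df, lambda d: d['y'].mean()))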