Casual Inference: Data analysis and other apocrypha

n non-technical papers that made me a better Data Scientist

Statistical Modeling: The Two Cultures - Leo Breiman

Statistical Inference: The Big Picture - Robert E. Kass

An interesting response to this is Bayesian Statistical Pragmatism - Andrew Gelman

Statistical Models and Shoe Leather - David A. Freedman

Is regression actually a causal inference success story?

According to Freedman, the widespread adoption of regression has produced unimpressive results for causal inference. Practitioners without great statistical depth (like John Snow) can achieve good results by reasoning carefully about causes, combining many lines of evidence, and putting in the effort to understand how the data was collected. Those who attempt causal inference from regression should take note of the rigor and success of such examples, despite their lack of technical sophistication.

What is the difference between Kanarek et al.’s study and Snow’s? Kanarek et al. ignored the ecological fallacy. Snow dealt with it. Kanarek et al. tried to control for covariates by modeling, using socioeconomic status as a proxy for smoking. Snow found a natural experiment and collected the data he needed. Kanarek et al.’s argument for causation rides on the statistical significance of a coefficient. Snow’s argument used logic and shoe leather. Regression models make it all too easy to substitute technique for work.

Making causal claims requires deep knowledge of the data-generating process. Snow put in the shoe leather to get that knowledge by collecting the data himself, but practitioners who try to get causality from regression often expect statistical wizardry to make this work unnecessary.
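The ecological fallacy mentioned in the Freedman quote is easy to reproduce in a toy simulation. The sketch below is not based on Kanarek et al.'s or Snow's data; every number in it (the number of areas, the effect sizes, the noise levels) is an invented assumption chosen only to make the point visible. Each hypothetical area has a baseline level that raises both exposure and outcome, while within any single area the individual-level relationship runs the other way.

```python
import numpy as np

def simulate_ecological_fallacy(n_groups=20, n_per_group=200, seed=0):
    """Return (area-level slope, within-area slope) for simulated data.

    All parameters are illustrative assumptions, not estimates from any study.
    """
    rng = np.random.default_rng(seed)

    # Each area has a baseline level that pushes exposure and outcome up together.
    area_level = rng.normal(0.0, 3.0, size=n_groups)
    group = np.repeat(np.arange(n_groups), n_per_group)

    # Individual exposure varies around the area baseline.
    x = area_level[group] + rng.normal(0.0, 1.0, size=group.size)

    # Within an area, more exposure means a LOWER outcome (slope -1),
    # but the area baseline raises the outcome (coefficient +2).
    y = (2.0 * area_level[group]
         - 1.0 * (x - area_level[group])
         + rng.normal(0.0, 1.0, size=group.size))

    # Aggregate analysis: regress area means on area means.
    x_mean = np.bincount(group, weights=x) / n_per_group
    y_mean = np.bincount(group, weights=y) / n_per_group
    area_slope = np.polyfit(x_mean, y_mean, 1)[0]

    # Individual analysis: remove area means, then fit one pooled slope.
    x_within = x - x_mean[group]
    y_within = y - y_mean[group]
    within_slope = np.sum(x_within * y_within) / np.sum(x_within ** 2)

    return area_slope, within_slope

area_slope, within_slope = simulate_ecological_fallacy()
print(f"slope across area averages: {area_slope:+.2f}")
print(f"slope within areas:         {within_slope:+.2f}")
```

The two slopes have opposite signs, so a conclusion about individuals drawn from the area averages alone would point in the wrong direction. Nothing in the aggregate regression output warns you about this; you only find out by knowing how the data came to be.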

To Explain or Predict? - Galit Shmueli

What You Can and Can’t Properly Do with Regression - Richard Berk

Philosophy and the practice of Bayesian Statistics - Andrew Gelman, Cosma Rohilla Shalizi

Confessions of a Pragmatic Statistician - Chris Chatfield

Some themes

Be mindful of what your p-values are actually telling you. We often treat a significant result as confirmation of our hypothesis, but what it shows is usually much more limited than that.

You need to understand how the data was generated if you want to make causal claims.
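One concrete way the p-value theme can bite, sketched under invented assumptions: with enough data, an effect too small to matter still produces a tiny p-value. The true mean of 0.02, the sample size, and the helper name `two_sided_p_value` are all hypothetical, and the test is a plain large-sample z-test rather than anything taken from the papers above.

```python
import math
import numpy as np

def two_sided_p_value(sample, null_mean=0.0):
    """Large-sample z-test p-value for H0: the population mean equals null_mean."""
    n = sample.size
    standard_error = sample.std(ddof=1) / math.sqrt(n)
    z = (sample.mean() - null_mean) / standard_error
    # Two-sided tail probability of a standard normal beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2))

rng = np.random.default_rng(0)

# Assumed scenario: a true effect of 0.02 standard deviations, which is
# negligible for most practical purposes, measured on 200,000 observations.
sample = rng.normal(loc=0.02, scale=1.0, size=200_000)

p_value = two_sided_p_value(sample)
print(f"estimated effect: {sample.mean():.4f}")
print(f"p-value:          {p_value:.2e}")
```

The p-value here only says that a mean of exactly zero is hard to reconcile with the data. It says nothing about whether the effect is large enough to care about, or about why the effect exists.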