Casual Inference Data analysis and other apocrypha

A sprinkle of time series analysis to make your next metric review easier

Metric reviews

Certain experiences seem to show up in data science jobs regardless of company size or industry. If you’re a working data scientist, at some point a very nervous, very tired product manager will contact you and ask you a question along the lines of this one: okay I looked at the dashboard, our favorite metric just moved, does that mean something? should we do something?? WHAT DOES IT MEAN WHAT DO WE DO OH GOD OUR VP JUST PINGED ME ABOUT THE QUARTERLY MEETING WHAT DO I TELL

Okay, okay, relax. You and your friend are going to be alright, even if the VP’s chat window is showing an ominous little hey and some dots. You have a couple of goals right now, once you’ve calmed down and run a query to look at the metric:

  1. First, look at the recent trajectory of the metric. Has it been changing in the direction we want it to move? Is it really volatile?
  2. Given that perspective, is the most recent observation actually all that interesting? Is it far out of the ordinary compared to its historical values?

Let’s see how we might answer these questions on some real data.

An example: Trips per day KPI on a ride-sharing app

Imagine, for example, that you’re responsible for keeping an eye on how many trips drivers are performing on a ride-sharing app[1]. If your goal is to monetize driver time, you want this number going up, since it will mean you’re able to increase your ROI on each driver. (In real life you might worry about other parts of the trip funnel like total active drivers, but we’ll put that aside for now.)

Let’s take a look at our data, which is a monthly time series measuring drivers’ trips per day.

trips_per_day.png

Hm, that’s not a totally unambiguous picture. It went down, it went up, it’s up in October…is that what we expected?

We’re pretty sure that in our business, there is a yearly cycle. So we’ll look at the year-over-year (YoY) change: each month’s value compared with the same month a year earlier.

yoy_change.png
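As a quick illustration of the YoY calculation, here is a minimal sketch on made-up numbers (the series `metric` and its values are hypothetical, not the taxi data). It divides each month by the value 12 months earlier and subtracts 1.

```python
import pandas as pd

# Hypothetical monthly metric: 24 months, where the second year is
# exactly 10% above the first year in every month.
dates = pd.date_range("2017-01-01", periods=24, freq="MS")
first_year = [100.0, 90.0, 95.0, 110.0, 120.0, 130.0,
              125.0, 115.0, 105.0, 100.0, 95.0, 90.0]
values = first_year + [v * 1.10 for v in first_year]
metric = pd.Series(values, index=dates)

# YoY growth: this month's value divided by the value 12 months earlier, minus 1.
yoy_growth = metric / metric.shift(12) - 1

# The first 12 months have no prior-year comparison, so they come out as NaN.
print(yoy_growth.dropna().round(3))
```

Note the use of `shift(12)` rather than `diff(12)`: the denominator should be last year's level, not the change since last year. pandas' `pct_change(12)` computes the same quantity.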

It looks like year-over-year growth has been positive, so that’s good. Has it been decelerating? From the plot alone, that’s not totally clear.

How about the most recent month’s observation: is it surprising? Some months call for a closer look or some action. Is this month one of those times?

Let’s get a deeper look by going further into this YoY data set.

What’s the recent growth of the metric? Is it positive, is it negative?

Let’s start by getting the context. Before this month, what was growth like?

avg growth = 0?

Let’s imagine a simple model that might have generated this data, in which growth is constant. In this “constant growth model”, our observed YoY values are generated by

$y_t = c + \epsilon_t$

where $c$ is the constant growth rate and $\epsilon_t$ is noise. We fit this model and check the estimate of $c$. Note that we fit it as an AR model to get correct standard errors.

regression output goes here

Okay, not a big surprise. The average month has a YoY growth of …, and the confidence interval is narrow and well above zero.
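To make the constant growth model concrete, here is a rough sketch on made-up YoY numbers (the array `yoy` is hypothetical). With only a constant in the model, the least-squares estimate of $c$ is just the sample mean. The interval below assumes independent noise, which is a strong assumption for a time series.

```python
import numpy as np

# Hypothetical YoY growth rates for the last 13 months (made-up numbers).
yoy = np.array([0.052, 0.048, 0.061, 0.055, 0.043, 0.050, 0.047,
                0.058, 0.049, 0.044, 0.053, 0.046, 0.051])

# Under y_t = c + noise, the least-squares estimate of c is the sample mean.
c_hat = yoy.mean()

# Standard error of the mean, assuming independent noise. Autocorrelated
# noise would generally make the true uncertainty larger than this.
se = yoy.std(ddof=1) / np.sqrt(len(yoy))

# Approximate 95% interval using the normal quantile 1.96.
ci_low, ci_high = c_hat - 1.96 * se, c_hat + 1.96 * se
print(f"c_hat={c_hat:.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```

If the interval sits comfortably above zero, "growth has been positive" is a fair summary of the recent past.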

Is growth accelerating? Decelerating?

Growth is positive, that’s good. Which direction has it been moving in? Has it been going up, down, staying in the same spot? We can look at the acceleration of the metric in each month; how much growth has been changing month-to-month.

acceleration = 0?

Imagine a “constant acceleration model”, in which each month’s growth is a fixed multiple of the previous month’s growth. We fit

$y_t = \alpha y_{t-1} + \epsilon_t$

where the acceleration is $\alpha - 1$. If that’s zero, growth isn’t really changing month-to-month.
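To see why $\alpha - 1$ plays the role of acceleration, subtract $y_{t-1}$ from both sides of the model. Writing the noise term as $\epsilon_t$ is an assumption carried over from the constant growth model.

$y_t - y_{t-1} = (\alpha - 1) y_{t-1} + \epsilon_t$

So, when last month’s growth $y_{t-1}$ is positive, the expected month-to-month change in growth is negative when $\alpha < 1$ (deceleration), zero when $\alpha = 1$, and positive when $\alpha > 1$ (acceleration).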

Okay, let’s fit this model now.

regression output goes here

Hm, alright. On average, it looks like we saw some deceleration: the average month has X% less growth than the month before. But the amount is small, and $\alpha$ isn’t statistically significantly different from 1, i.e. there’s no clear acceleration or deceleration.
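One thing to watch: regression summaries typically report a test of whether a coefficient differs from 0, while the question here is whether $\alpha$ differs from 1. Below is a minimal sketch of recentring the test, on made-up numbers. The helper `alpha_vs_one` and the array `yoy` are hypothetical, and it fits the no-intercept regression directly with NumPy rather than with `AutoReg`.

```python
import numpy as np

def alpha_vs_one(y):
    """Fit y_t = alpha * y_{t-1} + noise by least squares; compare alpha to 1."""
    y = np.asarray(y, dtype=float)
    prev, curr = y[:-1], y[1:]
    # Least-squares slope for a regression through the origin.
    alpha_hat = (prev @ curr) / (prev @ prev)
    resid = curr - alpha_hat * prev
    # Residual variance, with one degree of freedom used by alpha_hat.
    sigma2 = (resid @ resid) / (len(curr) - 1)
    se = np.sqrt(sigma2 / (prev @ prev))
    # Test statistic centred on 1 (no acceleration), not on 0.
    z = (alpha_hat - 1.0) / se
    return alpha_hat, se, z

# Hypothetical YoY growth rates for the last 13 months (made-up numbers).
yoy = [0.052, 0.048, 0.061, 0.055, 0.043, 0.050, 0.047,
       0.058, 0.049, 0.044, 0.053, 0.046, 0.051]

alpha_hat, se, z = alpha_vs_one(yoy)
print(f"alpha_hat={alpha_hat:.3f}, se={se:.3f}, z vs 1 = {z:.2f}")
```

Treat the usual ±1.96 cutoff as a rough guide only: when $\alpha$ is near 1, the statistic does not follow a standard normal distribution (this is the unit-root setting behind Dickey-Fuller tests), and 12 data points is a small sample either way.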

Is the most recent observation anomalous?

So we know that recently growth has been positive, and hasn’t really been accelerating or decelerating. Given that context, how do this month’s numbers look? We’ll answer that with a simple test:

$H_0$ is that we’ll see the same acceleration as the previous X months.

Is $y_{k+1}$ unusual compared to the set of similar points (i.e., the last 12 months)?

Simulate the forecast’s prediction interval (PI) distribution, and look at the percentile of the observed value.
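The simulation step can be sketched roughly as follows, again on made-up numbers (`history` and `this_month` are hypothetical). It fits the constant acceleration model on the past months, simulates next month’s value from a normal distribution around the one-step forecast, and checks where the observed value lands.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical YoY growth: 12 past months, plus the month under review.
history = np.array([0.052, 0.048, 0.061, 0.055, 0.043, 0.050,
                    0.047, 0.058, 0.049, 0.044, 0.053, 0.046])
this_month = 0.051

# Fit y_t = alpha * y_{t-1} + noise on the history only.
prev, curr = history[:-1], history[1:]
alpha_hat = (prev @ curr) / (prev @ prev)
resid = curr - alpha_hat * prev
sigma = resid.std()

# One-step-ahead forecast and simulated draws around it. This ignores the
# uncertainty in alpha_hat, so the spread is likely a little too narrow.
forecast = alpha_hat * history[-1]
draws = rng.normal(loc=forecast, scale=sigma, size=10_000)

# Where does the observed value fall in the simulated distribution?
percentile = (draws <= this_month).mean()
low, high = np.quantile(draws, [0.025, 0.975])
print(f"forecast={forecast:.4f}, 95% PI=({low:.4f}, {high:.4f}), "
      f"observed percentile={percentile:.2f}")
```

A percentile very close to 0 or 1 suggests this month is out of line with recent history; something in the middle suggests business as usual.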

What should I do next?

These methods are a handy first pass at understanding whether something interesting is happening.

Are you on track to hit your goals for the month/quarter/half/year? If growth remains in line with the X month average? If acceleration remains in line with the X month average?
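For the "on track" question, one rough approach is to project forward under the assumption that YoY growth stays at its recent average. This is a sketch only: the values in `last_year`, the growth rate `avg_yoy_growth`, and the target `goal_q1_average` are all made up for illustration.

```python
import pandas as pd

# Hypothetical monthly metric values for the most recent 12 months.
dates = pd.date_range("2018-01-01", periods=12, freq="MS")
last_year = pd.Series([100.0, 95.0, 110.0, 120.0, 125.0, 130.0,
                       128.0, 122.0, 115.0, 118.0, 112.0, 105.0], index=dates)

avg_yoy_growth = 0.05    # assumed: average YoY growth over recent months
goal_q1_average = 110.0  # assumed: target for the Jan-Mar 2019 average

# Projection: each month next year is the same month this year,
# scaled up by the average YoY growth rate.
projected = last_year * (1 + avg_yoy_growth)
projected.index = projected.index + pd.DateOffset(years=1)

# Compare the projected quarterly average with the goal.
q1_average = projected.loc["2019-01-01":"2019-03-01"].mean()
on_track = q1_average >= goal_q1_average
print(f"projected Q1 average={q1_average:.2f}, on track: {on_track}")
```

The same projection can be repeated with the acceleration estimate, by letting growth shrink or expand by a factor of $\alpha$ each month instead of holding it fixed.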

What group is causing the change? (link to decomposition post)

Plug in a forecast model instead of the AR(1) or constant model

import pandas as pd
from tqdm import tqdm
from matplotlib import pyplot as plt
import seaborn as sns
from datetime import datetime
from statsmodels import api as sm
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from scipy.stats import norm

df = pd.read_csv('https://www1.nyc.gov/assets/tlc/downloads/csv/data_reports_monthly.csv')
df['trips_per_day'] = df['Trips Per Day'].str.replace(',', '').astype(float)
df['date'] = df['Month/Year'].apply(lambda s: datetime.strptime(s + '-01', '%Y-%m-%d'))

# df = df[(df['date'] >= '2015-12-01') & (df['date'] <= '2018-09-01')]

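# Total trips per day in each month; select the numeric column before summing
# so the string columns are not included in the aggregation
monthly_trip_series = df.groupby('date')['trips_per_day'].sum()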

THIS_MONTH = '2018-10-01'
LAST_MONTH = '2018-09-01'

monthly_trip_series = monthly_trip_series.loc[:THIS_MONTH]

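# Calculate YoY growth: this month's value relative to the same month last year
# (shift(12) gives last year's level; diff(12) would give the change instead)
monthly_trip_yoy_growth = monthly_trip_series / monthly_trip_series.shift(12) - 1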

monthly_trip_series = monthly_trip_series[-13:]
monthly_trip_yoy_growth = monthly_trip_yoy_growth[-13:]

# Plot the monthly value of the metric
sns.lineplot(monthly_trip_series)
plt.show()

# Calculate and plot the YoY growth in the metric
sns.lineplot(x=monthly_trip_series.index, y=monthly_trip_yoy_growth, marker='o')
plt.xticks(rotation=90)
plt.show()

# Fit an AR(0) model, ie the "constant growth plus noise" model
ar0_model_last_13 = AutoReg(endog=monthly_trip_yoy_growth[:THIS_MONTH], lags=0, trend='c')
ar0_model_last_13_fit = ar0_model_last_13.fit()
print(ar0_model_last_13_fit.summary())

# Fit an AR(1) model, ie the "constant acceleration plus noise" model
ar1_model_last_13 = AutoReg(endog=monthly_trip_yoy_growth[:THIS_MONTH], lags=1, trend='n')
ar1_model_last_13_fit = ar1_model_last_13.fit()
print(ar1_model_last_13_fit.summary())

# Fit the model on all data through last month, 
# then compare this month with its predicted value
ar1_model_last_month = AutoReg(endog=monthly_trip_yoy_growth[:LAST_MONTH], 
                               lags=1, trend='n')
ar1_model_last_month_fit = ar1_model_last_month.fit()
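# Use the in-sample residual variance as the forecast noise variance
residual_variance = np.var(ar1_model_last_month_fit.resid)
# forecast(1) returns a length-1 result; take the scalar for a univariate normal
forecast_mean = np.asarray(ar1_model_last_month_fit.forecast(1))[0]
simulation_gen = norm(forecast_mean, np.sqrt(residual_variance))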

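# distplot is deprecated in recent seaborn; histplot with a KDE overlay replaces it
sns.histplot(simulation_gen.rvs(10000), kde=True, stat='density')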
plt.axvline(monthly_trip_yoy_growth[THIS_MONTH])
quantile_of_prediction_interval = simulation_gen.cdf(monthly_trip_yoy_growth[THIS_MONTH])
plt.title('q={}'.format(quantile_of_prediction_interval))
plt.show()

low, high = simulation_gen.interval(.95)
sns.lineplot(monthly_trip_yoy_growth)
# Use a Timestamp so the interval endpoints land on the datetime x-axis
plt.scatter([pd.Timestamp(THIS_MONTH)] * 2, [low, high])
plt.show()

1. I don’t actually have a ride-share app whose data I can show you, but we’ll use some monthly numbers of NYC Taxi Usage, which should be sufficiently realistic. I pulled the time series we’re going to examine using

code