Description
Hi there. I'm a little bit confused by how the cohort variable is defined and later used within the ATTgt class. Let me provide an example here to determine whether this is an actual bug or I'm just misunderstanding something. Let's start by using the simulate_data function to simulate some (panel) data.
panel_data = simulate_data(n_cohorts=5, intensity_by=1, tau=0.0, low=2.0, high=2.0)
panel_data.head()

| | y | x0 | w | cat.0 | cat.1 | effect | cohort | strata | intensity |
|---|---|---|---|---|---|---|---|---|---|
| ('e0', 1900) | 0.069472 | 0.160245 | 3.72983 | 0 | 0 | 0 | 1903 | 0 | 2 |
| ('e0', 1901) | 8.95894 | -1.9334 | 1.6912 | 0 | 0 | 0 | 1903 | 0 | 2 |
| ('e0', 1902) | 8.24309 | -0.0568131 | 1.78633 | 0 | 0 | 0 | 1903 | 0 | 2 |
| ('e0', 1903) | 5.05815 | 0.810118 | 1.74161 | 0 | 0 | 0 | 1903 | 0 | 2 |
| ('e0', 1904) | 15.2352 | 0.101205 | 0.806004 | 0 | 0 | 20 | 1903 | 0 | 2 |
The simulated data set defines the cohort variable as the last time step before the intervention (e.g. 1903 for unit e0, whose treatment effect kicks in in 1904). This definition is non-standard: cohort usually denotes the first treated period. I then fit the model and plot the estimated ATTs.
att = ATTgt(data=panel_data, cohort_name="cohort")
att.fit(formula="y", est_method="reg")
att.plot(color_by="cohort", shape_by="post")
The resulting plot looks correct: the marker shapes correctly encode pre- and post-intervention periods (e.g. for cohort = 1903, time step 1903 is labeled as pre, while 1904 is labeled as post).
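For reference, the cohort convention can be checked directly on the panel by comparing each unit's cohort label with the first period in which its simulated effect is non-zero. This is only a minimal sketch on a hand-built toy frame that mirrors the e0 rows shown earlier; the flat `unit` and `time` columns and the `cohort_std` name are assumptions for illustration, not the package's actual data layout.

```python
import pandas as pd

# Toy panel mirroring the e0 rows from the table above (illustrative layout,
# not the real output of simulate_data).
toy = pd.DataFrame(
    {
        "unit": ["e0"] * 5,
        "time": [1900, 1901, 1902, 1903, 1904],
        "effect": [0, 0, 0, 0, 20],
        "cohort": [1903] * 5,
    }
)

# First period in which each unit has a non-zero simulated effect.
first_treated = toy.loc[toy["effect"] != 0].groupby("unit")["time"].min()

# Cohort label recorded for each unit.
cohort = toy.groupby("unit")["cohort"].first()

# An offset of 1 means cohort marks the last pre-treatment period;
# an offset of 0 would be the standard "first treated period" convention.
offset = first_treated - cohort
print(offset)

# Hypothetical relabelling to the standard convention (assumed column name).
toy["cohort_std"] = toy["cohort"] + 1
```

Whether passing such a shifted column as cohort_name would make the plot and the aggregation agree is not something this sketch verifies.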
att.aggregate("event")
att.plot("event")

However, there seems to be an inconsistency with the above cohort definition when aggregating the ATTs. For instance, when I run the aggregate method at the "cohort" level, I get the following results:
att.aggregate("cohort")

| cohort | ATT | std_error | lower | upper | zero_not_in_cband |
|---|---|---|---|---|---|
| 1902 | 16.8192 | 1.18693 | 14.4929 | 19.1456 | * |
| 1903 | 15.6355 | 1.18963 | 13.3039 | 17.9671 | * |
| 1904 | 15.5679 | 1.14973 | 13.3145 | 17.8213 | * |
| 1905 | 13.5871 | 1.17521 | 11.2837 | 15.8904 | * |
The average ATTs presented here follow the standard cohort definition (i.e. cohort is the start of the intervention), as they include the time == cohort data points. Under the definition used by the simulated data, this biases the estimated ATTs, because time == cohort is a pre-intervention period. This is easy to see by manually aggregating the ATTs in two ways. The first result (time >= cohort) reproduces the aggregate output exactly, while the second (time > cohort), which excludes the time == cohort estimates, is the right one.
att.results().query("time >= cohort").iloc[:, 0].groupby(level=0).mean()
cohort
1902 16.819243
1903 15.635493
1904 15.567925
1905 13.587051
Name: (ATTgtResult, , ATT), dtype: float64
att.results().query("time > cohort").iloc[:, 0].groupby(level=0).mean()
cohort
1902 20.142418
1903 19.561096
1904 20.113236
1905 19.812459
Name: (ATTgtResult, , ATT), dtype: float64
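The mechanism behind the gap between the two aggregations can be reproduced without the package. The sketch below uses made-up group-time estimates, indexed by (cohort, time), in which the estimate at time == cohort is zero (a pre-treatment period under the simulated data's convention) and 20 afterwards. The index layout and the `toy_att`, `incl`, and `excl` names are assumptions for illustration and may not match what att.results() actually returns.

```python
import pandas as pd

# Hypothetical group-time estimates, indexed by (cohort, time).
# time == cohort is a pre-treatment period here, so its ATT is 0.
idx = pd.MultiIndex.from_tuples(
    [(1903, 1903), (1903, 1904), (1903, 1905), (1904, 1904), (1904, 1905)],
    names=["cohort", "time"],
)
toy_att = pd.DataFrame({"ATT": [0.0, 20.0, 20.0, 0.0, 20.0]}, index=idx)

# Aggregation that treats time == cohort as post-treatment (standard convention).
incl = toy_att.query("time >= cohort")["ATT"].groupby(level="cohort").mean()

# Aggregation that treats time == cohort as pre-treatment (simulated-data convention).
excl = toy_att.query("time > cohort")["ATT"].groupby(level="cohort").mean()

print(incl)
print(excl)
```

In this toy case the inclusive average is pulled toward zero by the single pre-treatment estimate per cohort, which is the same direction of bias as in the real output above.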
Hence, it seems like both the simulate_data function and the (disaggregated) plot method generate/expect a non-standard cohort definition which is inconsistent with other functions/methods in the package. Thoughts? Sorry if I'm missing something.