Inconsistent cohort definition? #3

@aayala15

Hi there. I'm a bit confused by how the cohort variable is defined and later used within the ATTgt class. Here is an example to help determine whether this is an actual bug or I'm just misunderstanding something. Let's start by using the simulate_data function to simulate some panel data.

panel_data = simulate_data(n_cohorts=5, intensity_by=1, tau=0.0, low=2.0, high=2.0)
panel_data.head()
              y         x0          w         cat.0  cat.1  effect  cohort  strata  intensity
('e0', 1900)  0.069472  0.160245    3.72983   0      0      0       1903    0       2
('e0', 1901)  8.95894   -1.9334     1.6912    0      0      0       1903    0       2
('e0', 1902)  8.24309   -0.0568131  1.78633   0      0      0       1903    0       2
('e0', 1903)  5.05815   0.810118    1.74161   0      0      0       1903    0       2
('e0', 1904)  15.2352   0.101205    0.806004  0      0      20      1903    0       2

The simulated data set the cohort variable to the last time step before the intervention (e.g. 1903 for unit e0, whose treatment effect kicks in in 1904). This cohort definition is non-standard: the cohort is usually defined as the period in which treatment starts. I then fit the model and plot the estimated ATTs.

att = ATTgt(data=panel_data, cohort_name="cohort")
att.fit(formula="y", est_method="reg")
att.plot(color_by="cohort", shape_by="post")
[image: plot produced by att.plot(color_by="cohort", shape_by="post")]

The resulting plot looks correct, with marker shapes correctly encoding pre- and post-intervention periods (e.g. for cohort = 1903, time step 1903 is labeled as pre, while 1904 is labeled as post).
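
To spell out the two conventions, here is a minimal sketch on a toy frame that mirrors the e0 rows above. It does not call the package; the column names post_last_pre, post_standard and treated are made up for illustration, and only time, effect and cohort come from the simulated data.

```python
import pandas as pd

# Toy single-unit panel mirroring the e0 rows shown above
# (hand-typed, not actual simulate_data output).
toy = pd.DataFrame(
    {
        "time": [1900, 1901, 1902, 1903, 1904],
        "effect": [0, 0, 0, 0, 20],
        "cohort": [1903] * 5,
    }
)

# Ground truth: a period is treated when the simulated effect is non-zero.
toy["treated"] = toy["effect"] != 0

# Reading cohort as the last pre-treatment period: post means time > cohort.
toy["post_last_pre"] = toy["time"] > toy["cohort"]

# Reading cohort as the first treated period (standard): post means time >= cohort.
toy["post_standard"] = toy["time"] >= toy["cohort"]

print(toy[["time", "treated", "post_last_pre", "post_standard"]])
```

With cohort = 1903, only the time > cohort reading matches the periods where the effect is non-zero; the time >= cohort reading also marks 1903 as post.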

att.aggregate("event")
att.plot("event")

However, the aggregated ATTs seem to be inconsistent with the cohort definition above. For instance, when I run the aggregate method at the "cohort" level, I get the following results:

att.aggregate("cohort")
cohort  ATT      std_error  lower    upper    zero_not_in_cband
1902    16.8192  1.18693    14.4929  19.1456  *
1903    15.6355  1.18963    13.3039  17.9671  *
1904    15.5679  1.14973    13.3145  17.8213  *
1905    13.5871  1.17521    11.2837  15.8904  *

The average ATTs presented here follow the standard cohort definition (i.e. start of the intervention), since they include the time == cohort data points. With the simulated data's definition, this biases the estimated ATTs, because time == cohort is a pre-intervention period. This is easy to see by manually aggregating the ATTs: the first result below (time >= cohort) reproduces the aggregate output, while the second, which excludes the time == cohort estimates, is the right one.

att.results().query("time >= cohort").iloc[:, 0].groupby(level=0).mean()
cohort
1902    16.819243
1903    15.635493
1904    15.567925
1905    13.587051
Name: (ATTgtResult, , ATT), dtype: float64
att.results().query("time > cohort").iloc[:, 0].groupby(level=0).mean()
cohort
1902    20.142418
1903    19.561096
1904    20.113236
1905    19.812459
Name: (ATTgtResult, , ATT), dtype: float64
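
The size of the gap can be reproduced without the package. Below is a minimal sketch with made-up ATT(g, t) values for a single cohort (0 before treatment, 20 after); the att_gt frame and its ATT column are hypothetical stand-ins, not the structure returned by att.results().

```python
import pandas as pd

# Made-up ATT(g, t) estimates for one cohort, where cohort = 1903 is the
# last pre-treatment period and the true effect of 20 starts in 1904.
index = pd.MultiIndex.from_product(
    [[1903], range(1901, 1907)], names=["cohort", "time"]
)
att_gt = pd.DataFrame({"ATT": [0.0, 0.0, 0.0, 20.0, 20.0, 20.0]}, index=index)

# Averaging over time >= cohort pulls in the untreated 1903 estimate.
with_boundary = att_gt.query("time >= cohort")["ATT"].groupby(level="cohort").mean()

# Averaging over time > cohort keeps only treated periods.
without_boundary = att_gt.query("time > cohort")["ATT"].groupby(level="cohort").mean()

print(with_boundary.loc[1903], without_boundary.loc[1903])
```

One zero-effect period averaged in with three treated periods drags the cohort mean from 20 down to 15, the same direction as the gap between the two manual aggregations above.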

Hence, it seems that both the simulate_data function and the (disaggregated) plot method generate/expect a non-standard cohort definition that is inconsistent with other functions/methods in the package. Thoughts? Sorry if I'm missing something.
