Inconsistent cohort definition?

Hi there. I'm a bit confused by how the cohort variable is defined and later used within the `ATTgt` class. Here is an example to help determine whether this is an actual bug or a misunderstanding on my part. Let's start by using the `simulate_data` function to simulate some panel data.

```python
from differences import ATTgt, simulate_data

panel_data = simulate_data(n_cohorts=5, intensity_by=1, tau=0.0, low=2.0, high=2.0)
panel_data.head()
```

|              |         y |         x0 |        w |   cat.0 |   cat.1 |   effect |   cohort |   strata |   intensity |
|:-------------|----------:|-----------:|---------:|--------:|--------:|---------:|---------:|---------:|------------:|
| ('e0', 1900) |  0.069472 |  0.160245  | 3.72983  |       0 |       0 |        0 |     1903 |        0 |           2 |
| ('e0', 1901) |  8.95894  | -1.9334    | 1.6912   |       0 |       0 |        0 |     1903 |        0 |           2 |
| ('e0', 1902) |  8.24309  | -0.0568131 | 1.78633  |       0 |       0 |        0 |     1903 |        0 |           2 |
| ('e0', 1903) |  5.05815  |  0.810118  | 1.74161  |       0 |       0 |        0 |     1903 |        0 |           2 |
| ('e0', 1904) | 15.2352   |  0.101205  | 0.806004 |       0 |       0 |       20 |     1903 |        0 |           2 |

The simulated data set the cohort variable to the last time step before the intervention (e.g. 1903 for unit e0, whose treatment effect kicks in in 1904). This cohort definition is non-standard, as the cohort is usually defined as the start of treatment. I then proceed to fit the model and plot the estimated ATTs.

```python
att = ATTgt(data=panel_data, cohort_name="cohort")
att.fit(formula="y", est_method="reg")
att.plot(color_by="cohort", shape_by="post")
```

<img width="990" alt="image" src="https://github.com/bernardodionisi/differences/assets/13224679/4271a14e-b396-4342-987d-5014d623b514">

The resulting plot looks correct, with marker shapes correctly encoding pre- and post-intervention periods (e.g. for cohort = 1903, time step 1903 is labeled as pre, while 1904 is labeled as post).
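To make the two cohort conventions concrete, here is a minimal sketch on a toy frame (not the package's output). It assumes the same layout as the simulated data: `cohort` holds the last pre-treatment period and a nonzero `effect` marks treated periods. The `entity` and `time` columns and the `cohort_std` name are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Toy panel for a single unit, mimicking the layout of the simulated data.
toy = pd.DataFrame(
    {
        "entity": ["e0"] * 5,
        "time": [1900, 1901, 1902, 1903, 1904],
        "effect": [0, 0, 0, 0, 20],
        "cohort": [1903] * 5,  # last pre-treatment period, as in the simulated data
    }
)

# Standard definition: the first period with a nonzero effect, per unit.
first_treated = toy.loc[toy["effect"] != 0].groupby("entity")["time"].min()
toy["cohort_std"] = toy["entity"].map(first_treated)  # hypothetical column name

print(toy[["time", "effect", "cohort", "cohort_std"]])
```

In this toy example the two definitions differ by exactly one period (1903 versus 1904), which is the offset at issue here.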

```python
att.aggregate("event")
att.plot("event")
```

However, there seems to be an inconsistency with the above cohort definition when aggregating the ATTs. For instance, when I run the `aggregate` method at the "cohort" level, I get the following results:

```python
att.aggregate("cohort")
```

|   cohort |     ATT |   std_error |   lower |   upper | zero_not_in_cband   |
|---------:|--------:|------------:|--------:|--------:|:--------------------|
|     1902 | 16.8192 |     1.18693 | 14.4929 | 19.1456 | *                   |
|     1903 | 15.6355 |     1.18963 | 13.3039 | 17.9671 | *                   |
|     1904 | 15.5679 |     1.14973 | 13.3145 | 17.8213 | *                   |
|     1905 | 13.5871 |     1.17521 | 11.2837 | 15.8904 | *                   |

The average ATTs presented here follow the standard cohort definition (i.e. cohort = start of the intervention), as they include the time == cohort data points. Given how the data were simulated, this biases the estimated ATTs, because time == cohort is still a pre-intervention period. This is easy to see by manually aggregating the ATTs in two ways: first including the time == cohort estimates, which reproduces the table above, and then excluding them. The second result, which does not include the time == cohort estimates, is the right one.

```python
att.results().query("time >= cohort").iloc[:, 0].groupby(level=0).mean()
```

```
cohort
1902    16.819243
1903    15.635493
1904    15.567925
1905    13.587051
Name: (ATTgtResult, , ATT), dtype: float64
```

```python
att.results().query("time > cohort").iloc[:, 0].groupby(level=0).mean()
```

```
cohort
1902    20.142418
1903    19.561096
1904    20.113236
1905    19.812459
Name: (ATTgtResult, , ATT), dtype: float64
```
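The mechanics of the bias can be sketched with made-up numbers. The ATT(g, t) values below are hypothetical, not package output: they assume a zero effect in the time == cohort period and an effect of 20 in every later period, with index levels named `cohort` and `time` as in the queries above.

```python
import pandas as pd

# Hypothetical ATT(g, t) values for one cohort; not output from the package.
toy_att = pd.DataFrame(
    {
        "cohort": [1903] * 5,
        "time": [1903, 1904, 1905, 1906, 1907],
        "ATT": [0.0, 20.0, 20.0, 20.0, 20.0],  # 1903 is still pre-treatment
    }
).set_index(["cohort", "time"])

# Including the time == cohort period averages in one zero-effect estimate.
with_cohort_period = (
    toy_att.query("time >= cohort")["ATT"].groupby(level="cohort").mean()
)

# Excluding it averages only the truly treated periods.
without_cohort_period = (
    toy_att.query("time > cohort")["ATT"].groupby(level="cohort").mean()
)

print(with_cohort_period.loc[1903])     # 16.0
print(without_cohort_period.loc[1903])  # 20.0
```

With one untreated period out of five, the average drops from 20 to 16, which is the same kind of gap as between the two manual aggregations above.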

Hence, it seems that both the `simulate_data` function and the (disaggregated) `plot` method generate or expect a non-standard cohort definition, which is inconsistent with other functions and methods in the package, such as `aggregate`. Thoughts? Sorry if I'm missing something.
