
Conversation

@gabrielwol (Collaborator) commented Mar 3, 2025

What this pull request accomplishes:

This PR contains two aggregation processes: one for the congestion network and one for corridors (an arbitrary to/from node pair). The process for each is as follows (review one at a time).

  • congestion_cache_tt_results and congestion_network_segment_agg both perform dynamic binning: one for a single corridor (<1s) and one for the entire congestion network (10 min). A hypothetical usage sketch follows the diagram below.
%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart BT 
    corridors[TABLE congestion_corridors]
    raw_corridors[TABLE congestion_raw_corridors]
    raw_segments[TABLE congestion_raw_segments]
    time_grps[VIEW congestion_time_grps]
    cache_corridor{FUNCTION congestion_cache_corridor}
    cache_tt_results{FUNCTION congestion_cache_tt_results}
    network_segment_agg{FUNCTION congestion_network_segment_agg}
    Result

subgraph "FUNCTION congestion_dynamic_bin_avg"
    cache_corridor --> |Cache corridor or return pre-cached corridor| corridors
    corridors --> cache_tt_results -->  |Aggregate a corridor for a specific time period| raw_corridors --> |Returns Average| Result
end

subgraph Congestion Network Aggregation
    network_segment_agg -->|Aggregates congestion network| raw_segments
    time_grps --> |Defines time periods to aggregate| network_segment_agg
end
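
For a concrete picture of the review flow, here is a purely hypothetical call sketch for the corridor path. The real signature of congestion_dynamic_bin_avg is defined in this PR; every parameter name and value below is an illustrative assumption, not the actual interface.

-- Hypothetical sketch only: the actual argument list of congestion_dynamic_bin_avg
-- is defined in this PR and may differ. All values are placeholders.
SELECT congestion_dynamic_bin_avg(
    123456,              -- assumed "from" node_id
    654321,              -- assumed "to" node_id
    '2024-05-01'::date,  -- assumed analysis start date
    '2024-06-01'::date,  -- assumed analysis end date
    '07:00'::time,       -- assumed time-of-day window start
    '10:00'::time        -- assumed time-of-day window end
);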

Issue(s) this solves:

What, in particular, needs to be reviewed:

What needs to be done by a sysadmin after this PR is merged:

E.g.: these tables need to be migrated/created in the production schema.

@gabrielwol requested a review from Nate-Wessel March 3, 2025 19:49
@gabrielwol self-assigned this Mar 3, 2025
@gabrielwol linked an issue Mar 3, 2025 that may be closed by this pull request
@gabrielwol changed the title from "#1132 HERE dynamic binning aggregation (DRAFT)" to "#1132 HERE dynamic binning aggregation" Apr 8, 2025
Comment on lines 10 to 18
hr double precision,
avg_tt numeric,
stdev numeric,
percentile_05 numeric,
percentile_15 numeric,
percentile_50 numeric,
percentile_85 numeric,
percentile_95 numeric,
num_quasi_obs bigint
Contributor

I would suggest dropping the precision for all of these fields, if only to save space (a sketch of the suggested types follows this list):

  • hr should be a smallint
  • all of the fields holding travel times (avg_tt through percentile_95) can safely be reals. They should take less space to store and result in faster calculations, since numeric types are sometimes much less efficient.
  • num_quasi_obs should always fit in a smallint too (max of 32,767)
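
For concreteness, a sketch of the quoted column list with those types; the rest of the table definition is assumed to stay as in the diff:

-- Sketch only: narrower types for the columns quoted above.
hr smallint,
avg_tt real,
stdev real,
percentile_05 real,
percentile_15 real,
percentile_50 real,
percentile_85 real,
percentile_95 real,
num_quasi_obs smallint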

Contributor

Or, if you wanted to keep numeric, you could define the precision for the column rather than relying on rounding values before inserting them. I'm pretty sure that lets Postgres be smarter about how it stores them, since it knows ahead of time how big they will be.
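
For the record, that alternative would look something like this; the precision and scale here are illustrative guesses, not values from the PR:

-- Sketch only: constrain numeric instead of rounding before insert.
-- numeric(7, 1) is an assumed choice (values up to 999999.9, one decimal place).
avg_tt numeric(7, 1),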

Contributor

I did some quick work this morning to just sketch out what seems like a pretty efficient way of estimating confidence intervals for this aggregation. What follows is a comparison of the bootstrap and jackknife methods for a single aggregation which I chose more or less randomly:

/* 
--A random selection from the current monthly aggregation
SELECT *
FROM gwolofs.congestion_segments_monthy_summary
WHERE
    mnth = '2024-05-01'
    AND is_wkdy IS TRUE
    AND segment_id = 2
    AND hr = 7;
*/

-- required for the bootstrap to be deterministic
SELECT setseed(0.12345);

WITH obs AS (
    -- get all the pseudo-observations for this aggregation
    SELECT
        ROW_NUMBER() OVER () AS i,
        tt::real
    FROM gwolofs.congestion_raw_segments_2024
    WHERE -- same params as the above aggregation
        dt >= '2024-05-01' AND dt < '2024-06-01'
        AND EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5)
        AND segment_id = 2
        AND EXTRACT('HOUR' FROM hr) = 7
),

jackknife AS (
    SELECT AVG(tt)::real AS avg_tt
    FROM obs
    CROSS JOIN LATERAL (
        SELECT generate_series(
            1,
            (SELECT MAX(i) FROM obs) -- number of re-samples == sample size
        ) AS i
    ) AS sample
    WHERE obs.i != sample.i -- leave out one obs per sample group
    GROUP BY sample.i
),

bootstrap_samples AS (
    SELECT
        sample_group,
        ceiling(random() * (SELECT MAX(i) FROM obs)) AS random_i -- uniform draw from 1..n, with replacement
    FROM generate_series(1, (SELECT MAX(i) FROM obs)) AS row_num
    -- 200 resamples (could be any number)
    CROSS JOIN generate_series(1, 200) AS sample_group
),

bootstrap AS (
    SELECT AVG(obs.tt)::real AS avg_tt
    FROM bootstrap_samples
    JOIN obs ON bootstrap_samples.random_i = obs.i
    GROUP BY sample_group
)

SELECT
    'Bootstrap' AS method,
    percentile_disc(0.025) WITHIN GROUP (ORDER BY avg_tt) AS ci_lower,
    percentile_disc(0.975) WITHIN GROUP (ORDER BY avg_tt) AS ci_upper
FROM bootstrap

UNION

SELECT
    'Jackknife' AS method,
    percentile_disc(0.025) WITHIN GROUP (ORDER BY avg_tt) AS ci_lower,
    percentile_disc(0.975) WITHIN GROUP (ORDER BY avg_tt) AS ci_upper
FROM jackknife

The jackknife method, to my mind, is performing poorly here. I think it may perform better for groups with fewer observations, but really, I'm just looking at the bootstrap method now. Those confidence intervals seem quite reasonable to me, and the method is consistent with how confidence intervals are already calculated in the travel time app.
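
A side note on why those jackknife percentiles come out so tight: each leave-one-out mean differs from the full-sample mean by only (x̄ − x_i) / (n − 1), so the raw spread of the replicates shrinks as n grows. The textbook jackknife interval rescales that spread instead. A sketch of that variant, added as one more branch to the UNION above (my suggestion, not something in this PR):

-- Sketch only: normal-approximation jackknife CI, appended to the UNIONs above.
-- Var_jack = (n - 1)/n * SUM((avg_tt_i - mean)^2), which equals (n - 1) * var_pop(avg_tt).
UNION

SELECT
    'Jackknife (rescaled)' AS method,
    AVG(avg_tt) - 1.96 * sqrt((COUNT(*) - 1) * var_pop(avg_tt)) AS ci_lower,
    AVG(avg_tt) + 1.96 * sqrt((COUNT(*) - 1) * var_pop(avg_tt)) AS ci_upper
FROM jackknife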

Contributor

I also put together a version of the above, just for the bootstrap, to see how it scales to more than one aggregation at a time. This just looks at the 24 weekday hours for a single segment and month, but we could easily tweak it to aggregate over the other dimensions too. It runs quicker than I thought it would, but who knows whether the resampling will scale badly at some point.

/*
--A random subset from the current monthly aggregation
SELECT *
FROM gwolofs.congestion_segments_monthy_summary
WHERE
    mnth = '2024-05-01'
    AND is_wkdy IS TRUE
    AND segment_id = 2;
*/

WITH obs_raw AS (
    -- get all the pseudo-observations for this aggregation
    SELECT
        segment_id,
        EXTRACT('MONTH' FROM dt) AS mnth,
        EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5) AS is_wkdy,
        EXTRACT('HOUR' FROM hr) AS hr,
        tt::real
    FROM gwolofs.congestion_raw_segments_2024
    WHERE -- same params as the above aggregation
        dt >= '2024-05-01' AND dt < '2024-06-01'
        AND EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5) -- weekdays, as in the reference aggregation above
        AND segment_id = 2
),

obs AS (
    SELECT
        segment_id,
        mnth,
        is_wkdy,
        hr,
        tt,
        ROW_NUMBER() OVER (PARTITION BY segment_id, mnth, is_wkdy, hr) AS i
    FROM obs_raw
),

obs_counts AS (
    SELECT
        segment_id,
        mnth,
        is_wkdy,
        hr,
        AVG(tt) AS avg_tt,
        COUNT(*) AS n
    FROM obs
    GROUP BY
        segment_id,
        mnth,
        is_wkdy,
        hr
),

random_selections AS (
    SELECT
        obs_counts.segment_id,
        obs_counts.mnth,
        obs_counts.is_wkdy,
        obs_counts.hr,
        sample_group,
        ceiling(random() * obs_counts.n) AS random_i -- uniform draw from 1..n, with replacement
    FROM obs_counts
    CROSS JOIN generate_series(1, obs_counts.n) AS rid
    -- 300 resamples (could be any number)
    CROSS JOIN generate_series(1, 300) AS sample_group
),

resampled_averages AS (
    SELECT
        rs.segment_id,
        rs.mnth,
        rs.is_wkdy,
        rs.hr,
        rs.sample_group,
        AVG(tt) AS avg_tt
    FROM random_selections AS rs
    JOIN obs
        ON rs.segment_id = obs.segment_id
        AND rs.mnth = obs.mnth
        AND rs.is_wkdy = obs.is_wkdy
        AND rs.hr = obs.hr
        AND rs.random_i = obs.i
    GROUP BY
        rs.segment_id,
        rs.mnth,
        rs.is_wkdy,
        rs.hr,
        rs.sample_group
)

SELECT
    segment_id,
    mnth,
    is_wkdy,
    hr,
    obs_counts.avg_tt::real,
    obs_counts.n,
    percentile_disc(0.025) WITHIN GROUP (ORDER BY resampled_averages.avg_tt)::real AS ci_lower,
    percentile_disc(0.975) WITHIN GROUP (ORDER BY resampled_averages.avg_tt)::real AS ci_upper
FROM resampled_averages
JOIN obs_counts USING (segment_id, mnth, is_wkdy, hr)
GROUP BY
    segment_id,
    mnth,
    is_wkdy,
    hr,
    obs_counts.avg_tt,
    obs_counts.n

Anyway, this is a lot. I don't expect you to digest it all from this comment; I mostly wanted a place to drop that code until we can chat when you're back in the office.

corridor_id smallint,
time_grp timerange NOT NULL,
bin_range tsrange NOT NULL,
tt numeric,
Contributor

[screenshot: sample tt values showing the stored precision]

The precision on these is wiiild. I'd suggest reals here too to save space without losing any meaningful precision.
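
A minimal sketch of the suggested change for the quoted column list; everything else in the definition is assumed to stay as in the diff:

corridor_id smallint,
time_grp timerange NOT NULL,
bin_range tsrange NOT NULL,
tt real,  -- sketch: real instead of unconstrained numeric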

num_obs integer,
uri_string text COLLATE pg_catalog."default",
dt date,
hr timestamp without time zone,
Contributor

Since we have date as its own field already, I might suggest letting this be a smallint. Otherwise, the current datetime hr field could serve both purposes, hour and date.
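
A small sketch of the smallint option; the source expression for populating it is an illustrative assumption, not something from this PR:

-- Sketch only: keep dt as the date and store just the hour of day.
dt date,
hr smallint,  -- 0..23

-- when populating, something along these lines (tx is a hypothetical source timestamp column):
-- EXTRACT(HOUR FROM tx)::smallint AS hr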

Successfully merging this pull request may close these issues:

  • HERE Aggregation Proposal
