
Conversation

@gabrielwol (Collaborator) commented Mar 3, 2025

What this pull request accomplishes:

This PR contains two aggregation processes: one for the congestion network and one for corridors (an arbitrary to/from node pair). The process for each is as follows (review one at a time).

  • congestion_cache_tt_results and congestion_network_segment_agg both perform dynamic binning: one for a single corridor (<1s) and one for the entire congestion network (10 min). A hypothetical usage sketch follows the diagram below.
%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart BT 
    corridors[TABLE congestion_corridors]
    raw_corridors[TABLE congestion_raw_corridors]
    raw_segments[TABLE congestion_raw_segments]
    time_grps[VIEW congestion_time_grps]
    cache_corridor{FUNCTION congestion_cache_corridor}
    cache_tt_results{FUNCTION congestion_cache_tt_results}
    network_segment_agg{FUNCTION congestion_network_segment_agg}
    Result

subgraph "FUNCTION congestion_dynamic_bin_avg"
    cache_corridor --> |Cache corridor or return pre-cached corridor| corridors
    corridors --> cache_tt_results -->  |Aggregate a corridor for a specific time period| raw_corridors --> |Returns Average| Result
end

subgraph Congestion Network Aggregation
    network_segment_agg -->|Aggregates congestion network| raw_segments
    time_grps --> |Defines time periods to aggregate| network_segment_agg
end
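
For a concrete picture of the review flow, here is a purely hypothetical call sketch for the corridor path. The real signature of congestion_dynamic_bin_avg is defined in this PR; every parameter name and value below is an illustrative assumption, not the actual interface.

-- Hypothetical sketch only: the actual argument list of congestion_dynamic_bin_avg
-- is defined in this PR and may differ. All values are placeholders.
SELECT congestion_dynamic_bin_avg(
    123456,              -- assumed "from" node_id
    654321,              -- assumed "to" node_id
    '2024-05-01'::date,  -- assumed analysis start date
    '2024-06-01'::date,  -- assumed analysis end date
    '07:00'::time,       -- assumed time-of-day window start
    '10:00'::time        -- assumed time-of-day window end
);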

Issue(s) this solves:

What, in particular, needs to be reviewed:

What needs to be done by a sysadmin after this PR is merged:

E.g.: these tables need to be migrated/created in the production schema.

@gabrielwol requested a review from Nate-Wessel March 3, 2025 19:49
@gabrielwol self-assigned this Mar 3, 2025
@gabrielwol linked an issue Mar 3, 2025 that may be closed by this pull request
@gabrielwol changed the title from "#1132 HERE dynamic binning aggregation (DRAFT)" to "#1132 HERE dynamic binning aggregation" Apr 8, 2025
Comment on lines 10 to 18
hr double precision,
avg_tt numeric,
stdev numeric,
percentile_05 numeric,
percentile_15 numeric,
percentile_50 numeric,
percentile_85 numeric,
percentile_95 numeric,
num_quasi_obs bigint
Contributor

I would suggest dropping the precision for all of these fields, if only to save space (a sketch of the suggested types follows this list):

  • hr should be a smallint
  • all of the fields holding travel times (avg_tt through percentile_95) can safely be reals. They should take less space to store and result in faster calculations, since numeric types are sometimes much less efficient.
  • num_quasi_obs should always fit in a smallint too (max of 32,767)
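
For concreteness, a sketch of the quoted column list with those types; the rest of the table definition is assumed to stay as in the diff:

-- Sketch only: narrower types for the columns quoted above.
hr smallint,
avg_tt real,
stdev real,
percentile_05 real,
percentile_15 real,
percentile_50 real,
percentile_85 real,
percentile_95 real,
num_quasi_obs smallint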

Contributor

Or, if you wanted to keep numeric, you could define the precision for the column rather than relying on rounding values before inserting them. I'm pretty sure that lets Postgres be smarter about how it stores them, since it knows ahead of time how big they will be.
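
For the record, that alternative would look something like this; the precision and scale here are illustrative guesses, not values from the PR:

-- Sketch only: constrain numeric instead of rounding before insert.
-- numeric(7, 1) is an assumed choice (values up to 999999.9, one decimal place).
avg_tt numeric(7, 1),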

Contributor

I did some quick work this morning to just sketch out what seems like a pretty efficient way of estimating confidence intervals for this aggregation. What follows is a comparison of the bootstrap and jackknife methods for a single aggregation which I chose more or less randomly:

/* 
--A random selection from the current monthly aggregation
SELECT *
FROM gwolofs.congestion_segments_monthy_summary
WHERE
    mnth = '2024-05-01'
    AND is_wkdy IS TRUE
    AND segment_id = 2
    AND hr = 7;
*/

-- required for the bootstrap to be deterministic
SELECT setseed(0.12345);

WITH obs AS (
    -- get all the pseudo-observations for this aggregation
    SELECT
        ROW_NUMBER() OVER () AS i,
        tt::real
    FROM gwolofs.congestion_raw_segments_2024
    WHERE -- same params as the above aggregation
        dt >= '2024-05-01' AND dt < '2024-06-01'
        AND EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5)
        AND segment_id = 2
        AND EXTRACT('HOUR' FROM hr) = 7
),

jackknife AS (
    SELECT AVG(tt)::real AS avg_tt
    FROM obs
    CROSS JOIN LATERAL (
        SELECT generate_series(
            1,
            (SELECT MAX(i) FROM obs) -- number of re-samples == sample size
        ) AS i
    ) AS sample
    WHERE obs.i != sample.i -- leave out one obs per sample group
    GROUP BY sample.i
),

bootstrap_samples AS (
    SELECT
        sample_group,
        ceiling(random() * (SELECT MAX(i) FROM obs)) AS random_i -- uniform draw from 1..n, with replacement
    FROM generate_series(1, (SELECT MAX(i) FROM obs)) AS row_num
    -- 200 resamples (could be any number)
    CROSS JOIN generate_series(1, 200) AS sample_group
),

bootstrap AS (
    SELECT AVG(obs.tt)::real AS avg_tt
    FROM bootstrap_samples
    JOIN obs ON bootstrap_samples.random_i = obs.i
    GROUP BY sample_group
)

SELECT
    'Bootstrap' AS method,
    percentile_disc(0.025) WITHIN GROUP (ORDER BY avg_tt) AS ci_lower,
    percentile_disc(0.975) WITHIN GROUP (ORDER BY avg_tt) AS ci_upper
FROM bootstrap

UNION

SELECT
    'Jackknife' AS method,
    percentile_disc(0.025) WITHIN GROUP (ORDER BY avg_tt) AS ci_lower,
    percentile_disc(0.975) WITHIN GROUP (ORDER BY avg_tt) AS ci_upper
FROM jackknife

The jackknife method, to my mind, is performing poorly here. I think it may perform better for groups with fewer observations, but really, I'm just looking at the bootstrap method now. Those confidence intervals seem quite reasonable to me, and the method is consistent with how confidence intervals are already calculated in the travel time app.
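
A side note on why those jackknife percentiles come out so tight: each leave-one-out mean differs from the full-sample mean by only (x̄ − x_i) / (n − 1), so the raw spread of the replicates shrinks as n grows. The textbook jackknife interval rescales that spread instead. A sketch of that variant, added as one more branch to the UNION above (my suggestion, not something in this PR):

-- Sketch only: normal-approximation jackknife CI, appended to the UNIONs above.
-- Var_jack = (n - 1)/n * SUM((avg_tt_i - mean)^2), which equals (n - 1) * var_pop(avg_tt).
UNION

SELECT
    'Jackknife (rescaled)' AS method,
    AVG(avg_tt) - 1.96 * sqrt((COUNT(*) - 1) * var_pop(avg_tt)) AS ci_lower,
    AVG(avg_tt) + 1.96 * sqrt((COUNT(*) - 1) * var_pop(avg_tt)) AS ci_upper
FROM jackknife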

Contributor

I also put together a version of the above, just for the bootstrap, to see how it scales to more than one aggregation at a time. This just looks at the 24 weekday hours for a single segment and month, but we could easily tweak it to aggregate over the other dimensions too. It runs quicker than I thought it would, but who knows whether the resampling will scale badly at some point.

/*
--A random subset from the current monthly aggregation
SELECT *
FROM gwolofs.congestion_segments_monthy_summary
WHERE
    mnth = '2024-05-01'
    AND is_wkdy IS TRUE
    AND segment_id = 2;
*/

WITH obs_raw AS (
    -- get all the pseudo-observations for this aggregation
    SELECT
        segment_id,
        EXTRACT('MONTH' FROM dt) AS mnth,
        EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5) AS is_wkdy,
        EXTRACT('HOUR' FROM hr) AS hr,
        tt::real
    FROM gwolofs.congestion_raw_segments_2024
    WHERE -- same params as the above aggregation
        dt >= '2024-05-01' AND dt < '2024-06-01'
        AND EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5) -- weekdays, as in the reference aggregation above
        AND segment_id = 2
),

obs AS (
    SELECT
        segment_id,
        mnth,
        is_wkdy,
        hr,
        tt,
        ROW_NUMBER() OVER (PARTITION BY segment_id, mnth, is_wkdy, hr) AS i
    FROM obs_raw
),

obs_counts AS (
    SELECT
        segment_id,
        mnth,
        is_wkdy,
        hr,
        AVG(tt) AS avg_tt,
        COUNT(*) AS n
    FROM obs
    GROUP BY
        segment_id,
        mnth,
        is_wkdy,
        hr
),

random_selections AS (
    SELECT
        obs_counts.segment_id,
        obs_counts.mnth,
        obs_counts.is_wkdy,
        obs_counts.hr,
        sample_group,
        ceiling(random() * obs_counts.n) AS random_i -- uniform draw from 1..n, with replacement
    FROM obs_counts
    CROSS JOIN generate_series(1, obs_counts.n) AS rid
    -- 300 resamples (could be any number)
    CROSS JOIN generate_series(1, 300) AS sample_group
),

resampled_averages AS (
    SELECT
        rs.segment_id,
        rs.mnth,
        rs.is_wkdy,
        rs.hr,
        rs.sample_group,
        AVG(tt) AS avg_tt
    FROM random_selections AS rs
    JOIN obs
        ON rs.segment_id = obs.segment_id
        AND rs.mnth = obs.mnth
        AND rs.is_wkdy = obs.is_wkdy
        AND rs.hr = obs.hr
        AND rs.random_i = obs.i
    GROUP BY
        rs.segment_id,
        rs.mnth,
        rs.is_wkdy,
        rs.hr,
        rs.sample_group
)

SELECT
    segment_id,
    mnth,
    is_wkdy,
    hr,
    obs_counts.avg_tt::real,
    obs_counts.n,
    percentile_disc(0.025) WITHIN GROUP (ORDER BY resampled_averages.avg_tt)::real AS ci_lower,
    percentile_disc(0.975) WITHIN GROUP (ORDER BY resampled_averages.avg_tt)::real AS ci_upper
FROM resampled_averages
JOIN obs_counts USING (segment_id, mnth, is_wkdy, hr)
GROUP BY
    segment_id,
    mnth,
    is_wkdy,
    hr,
    obs_counts.avg_tt,
    obs_counts.n

Anyway, this is a lot. I don't expect you to digest it all from this comment; I mostly wanted a place to drop that code until we can chat when you're back in the office.

corridor_id smallint,
time_grp timerange NOT NULL,
bin_range tsrange NOT NULL,
tt numeric,
Contributor

[screenshot: sample tt values showing the stored precision]

The precision on these is wiiild. I'd suggest reals here too to save space without losing any meaningful precision.
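
A minimal sketch of the suggested change for the quoted column list; everything else in the definition is assumed to stay as in the diff:

corridor_id smallint,
time_grp timerange NOT NULL,
bin_range tsrange NOT NULL,
tt real,  -- sketch: real instead of unconstrained numeric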

num_obs integer,
uri_string text COLLATE pg_catalog."default",
dt date,
hr timestamp without time zone,
Contributor

Since we have date as its own field already, I might suggest letting this be a smallint. Otherwise, the current datetime hr field could serve both purposes, hour and date.
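
A small sketch of the smallint option; the source expression for populating it is an illustrative assumption, not something from this PR:

-- Sketch only: keep dt as the date and store just the hour of day.
dt date,
hr smallint,  -- 0..23

-- when populating, something along these lines (tx is a hypothetical source timestamp column):
-- EXTRACT(HOUR FROM tx)::smallint AS hr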

Successfully merging this pull request may close these issues:

  • HERE Aggregation Proposal
