(DRAFT) #1132 HERE dynamic binning aggregation #1165
Conversation
hr double precision,
avg_tt numeric,
stdev numeric,
percentile_05 numeric,
percentile_15 numeric,
percentile_50 numeric,
percentile_85 numeric,
percentile_95 numeric,
num_quasi_obs bigint
I would suggest dropping the precision for all of these fields, if only to save space. hr should be a smallint, and all of the fields holding travel times (avg_tt through percentile_95) can safely be reals. They should take less space to store and result in faster calculations, since numeric types are sometimes much less efficient. num_quasi_obs should always fit in a smallint too (max of 32,767).
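Concretely, that would change the column list above to something like this (a sketch only, reusing the column names from the diff):
hr smallint,
avg_tt real,
stdev real,
percentile_05 real,
percentile_15 real,
percentile_50 real,
percentile_85 real,
percentile_95 real,
num_quasi_obs smallint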
Or, if you wanted to keep numeric, you could define the precision for the column rather than relying on rounding values before inserting them. I'm pretty sure that lets Postgres be smarter about how it stores them, since it knows ahead of time how big they will be.
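For example (the precision and scale below are placeholders for illustration, not values taken from the PR):
avg_tt numeric(6, 1),
stdev numeric(6, 1),
percentile_95 numeric(6, 1),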
I did some quick work this morning to just sketch out what seems like a pretty efficient way of estimating confidence intervals for this aggregation. What follows is a comparison of the bootstrap and jackknife methods for a single aggregation which I chose more or less randomly:
/*
--A random selection from the current monthly aggregation
SELECT *
FROM gwolofs.congestion_segments_monthy_summary
WHERE
mnth = '2024-05-01'
AND is_wkdy IS TRUE
AND segment_id = 2
AND hr = 7;
*/
-- required for the bootstrap to be deterministic
SELECT setseed(0.12345);
WITH obs AS (
-- get all the pseudo-observations for this aggregation
SELECT
ROW_NUMBER() OVER () AS i,
tt::real
FROM gwolofs.congestion_raw_segments_2024
WHERE -- same params as the above aggregation
dt >= '2024-05-01' AND dt < '2024-06-01'
AND EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5)
AND segment_id = 2
AND EXTRACT('HOUR' FROM hr) = 7
),
jackknife AS (
SELECT AVG(tt)::real AS avg_tt
FROM obs
CROSS JOIN LATERAL (
SELECT generate_series(
1,
(SELECT MAX(i) FROM obs) -- number of re-samples == sample size
) AS i
) AS sample
WHERE obs.i != sample.i -- leave out one obs per sample group
GROUP BY sample.i
),
bootstrap_samples AS (
SELECT
sample_group,
ceiling(random() * (SELECT MAX(i) FROM obs)) AS random_i -- random row index in 1..n
FROM generate_series(1, (SELECT MAX(i) FROM obs)) AS row_num
-- 200 resamples (could be any number)
CROSS JOIN generate_series(1, 200) AS sample_group
),
bootstrap AS (
SELECT AVG(obs.tt)::real AS avg_tt
FROM bootstrap_samples
JOIN obs ON bootstrap_samples.random_i = obs.i
GROUP BY sample_group
)
SELECT
'Bootstrap' AS method,
percentile_disc(0.025) WITHIN GROUP (ORDER BY avg_tt) AS ci_lower,
percentile_disc(0.975) WITHIN GROUP (ORDER BY avg_tt) AS ci_upper
FROM bootstrap
UNION
SELECT
'Jackknife' AS method,
percentile_disc(0.025) WITHIN GROUP (ORDER BY avg_tt) AS ci_lower,
percentile_disc(0.975) WITHIN GROUP (ORDER BY avg_tt) AS ci_upper
FROM jackknife
The jackknife method, to my mind, is performing poorly here. I think it may perform better for groups with fewer observations, but really, I'm just looking at the bootstrap method now. Those confidence intervals seem quite reasonable to me, and the method is consistent with how confidence intervals are already calculated in the travel time app.
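One likely reason the jackknife looks too narrow here: the leave-one-out means barely differ from one another, so taking their raw 2.5/97.5 percentiles understates the interval. The standard jackknife approach instead scales their spread into a standard error. A rough sketch of that idea, reusing the jackknife CTE from the query above (the 1.96 multiplier assumes a normal approximation, and the interval is centered on the mean of the replicates for simplicity):
-- jackknife SE = sqrt((n - 1) / n * SUM((theta_i - mean)^2)) = sqrt(n - 1) * stddev_pop(theta_i)
SELECT
'Jackknife (SE-based)' AS method,
AVG(avg_tt) - 1.96 * sqrt(COUNT(*) - 1) * stddev_pop(avg_tt) AS ci_lower,
AVG(avg_tt) + 1.96 * sqrt(COUNT(*) - 1) * stddev_pop(avg_tt) AS ci_upper
FROM jackknife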
I also put together a version of the above, just for the bootstrap, to see how it scales to more than one aggregation at a time. This just looks at the 24 weekday hours for a single segment and month, but we could easily tweak it to aggregate over the other dimensions too. It runs quicker than I thought it would, but who knows whether the resampling might scale badly at some point.
/*
--A random subset from the current monthly aggregation
SELECT *
FROM gwolofs.congestion_segments_monthy_summary
WHERE
mnth = '2024-05-01'
AND is_wkdy IS TRUE
AND segment_id = 2;
*/
WITH obs_raw AS (
-- get all the pseudo-observations for this aggregation
SELECT
segment_id,
EXTRACT('MONTH' FROM dt) AS mnth,
EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5) AS is_wkdy,
EXTRACT('HOUR' FROM hr) AS hr,
tt::real
FROM gwolofs.congestion_raw_segments_2024
WHERE -- same params as the above aggregation
dt >= '2024-05-01' AND dt < '2024-06-01'
AND EXTRACT('ISODOW' FROM dt) IN (1, 2, 3, 4, 5)
AND segment_id = 2
),
obs AS (
SELECT
segment_id,
mnth,
is_wkdy,
hr,
tt,
ROW_NUMBER() OVER (PARTITION BY segment_id, mnth, is_wkdy, hr) AS i
FROM obs_raw
),
obs_counts AS (
SELECT
segment_id,
mnth,
is_wkdy,
hr,
AVG(tt) AS avg_tt,
COUNT(*) AS n
FROM obs
GROUP BY
segment_id,
mnth,
is_wkdy,
hr
),
random_selections AS (
SELECT
obs_counts.segment_id,
obs_counts.mnth,
obs_counts.is_wkdy,
obs_counts.hr,
sample_group,
ceiling(random() * obs_counts.n) AS random_i -- random row index in 1..n
FROM obs_counts
CROSS JOIN generate_series(1, obs_counts.n) AS rid
-- 300 resamples (could be any number)
CROSS JOIN generate_series(1, 300) AS sample_group
),
resampled_averages AS (
SELECT
rs.segment_id,
rs.mnth,
rs.is_wkdy,
rs.hr,
rs.sample_group,
AVG(tt) AS avg_tt
FROM random_selections AS rs
JOIN obs
ON rs.segment_id = obs.segment_id
AND rs.mnth = obs.mnth
AND rs.is_wkdy = obs.is_wkdy
AND rs.hr = obs.hr
AND rs.random_i = obs.i
GROUP BY
rs.segment_id,
rs.mnth,
rs.is_wkdy,
rs.hr,
rs.sample_group
)
SELECT
segment_id,
mnth,
is_wkdy,
hr,
obs_counts.avg_tt::real,
obs_counts.n,
percentile_disc(0.025) WITHIN GROUP (ORDER BY resampled_averages.avg_tt)::real AS ci_lower,
percentile_disc(0.975) WITHIN GROUP (ORDER BY resampled_averages.avg_tt)::real AS ci_upper
FROM resampled_averages
JOIN obs_counts USING (segment_id, mnth, is_wkdy, hr)
GROUP BY
segment_id,
mnth,
is_wkdy,
hr,
obs_counts.avg_tt,
obs_counts.n
Anyway, this is a lot. I don't expect you to digest it all from this comment; I mostly wanted a place to drop that code until we can chat when you're back in the office.
corridor_id smallint,
time_grp timerange NOT NULL,
bin_range tsrange NOT NULL,
tt numeric,
num_obs integer,
uri_string text COLLATE pg_catalog."default",
dt date,
hr timestamp without time zone,
Since we have the date as its own field already, I might suggest letting this be a smallint. Otherwise, the current datetime hr field could serve both purposes, hour and date.
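For example, the smallint option could look like this (a sketch; the expression in the comment assumes the hour is derived from the raw bin timestamp at load time):
dt date,
hr smallint, -- e.g. EXTRACT('HOUR' FROM <raw bin timestamp>)::smallint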
What this pull request accomplishes:
This PR contains two aggregation processes: one for the congestion network and one for corridors (an arbitrary to/from node pair). The process for each is as follows (review one at a time).
congestion_cache_tt_results and congestion_network_segment_agg both perform dynamic binning, one for a single corridor (<1 s) and one for the entire congestion network (10 min).
Issue(s) this solves:
What, in particular, needs to be reviewed:
What needs to be done by a sysadmin after this PR is merged
E.g.: these tables need to be migrated/created in the production schema.