Skip to content

Conversation

@sylvansson
Copy link
Contributor

Summary:

  • Fix an issue where we shift the service window's start and end dates incorrectly based on exceptions, even if they only apply to a subset of the services (5f001f7).
  • Fix an issue where we always calculate the service window based on both calendar.txt and calendar_dates.txt even when one or both are missing (de23caf).
  • A few miscellaneous changes (small typos, etc.).

Closes #2017

Expected behavior:

Before:
image

After:
image

I've only tested with the mdb-2138 feed for now. If I understand correctly the acceptance tests don't validate the feed metadata so I still need to check a few datasets manually ⚠️

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification(https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@sylvansson sylvansson marked this pull request as draft May 8, 2025 11:47
@sylvansson
Copy link
Contributor Author

I see we're getting a stack overflow for this dataset. I'll have to take a closer look.

@sylvansson sylvansson force-pushed the 2017-inverted-service-dates branch from ce96443 to bd4b3ff Compare May 9, 2025 03:23
@sylvansson sylvansson marked this pull request as ready for review May 9, 2025 03:24
@sylvansson
Copy link
Contributor Author

@davidgamez @jcpitre This is ready for review :-)

@davidgamez
Copy link
Member

Adding the acceptance test manually due to #1960

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 057ab10
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

35 out of 1921 sources (~2 %) are corrupted.
Dataset Ref Report Exists Ref Report Readable Latest Report Exists Latest Report Readable
ca-saskatchewan-moose-jaw-transit-gtfs-2602
de-baden-wuttemberg-nvbw-gtfs-2393.json
de-bayern-munchner-verkehrs--und-tarifverbund-gmbh-mvv-gtfs-2252
de-sachsen-mitteldeutscher-verkehrsverbund-gmbh-mdv-gtfs-2360
fr-eurostar-gtfs-2431.json
il-ministry-of-transport-and-road-safety-gtfs-2519
jp-aomori-aomori-city-bus-gtfs-2607
ro-buzau-transbus-buzau-gtfs-2106
ro-dambovita-servicii-publice-municipale-targoviste-gtfs-2107
ro-prahova-transport-calatori-express-ploiesti-gtfs-2108
us-california-alameda-contra-costa-transit-district-ac-transit-gtfs-2455
us-california-county-connection-gtfs-2421
us-california-regional-transportation-commission-of-southern-nevada-rtc-gtfs-110
us-california-santa-maria-area-transit-gtfs-26
us-california-south-county-transit-link-gtfs-2203
us-california-taft-maricopa-area-transit-gtfs-821
us-california-tri-delta-transit-gtfs-1974
us-florida-gainesville-regional-transit-system-gtfs-2412
us-hawaii-hawaii-mass-transit-agency-hele-on-bus-gtfs-2608
us-illinois-danville-mass-transit-gtfs-2363
us-kansas-salina-gtfs-1867
us-massachusetts-pioneer-valley-transit-authority-pvta-gtfs-2416
us-utah-cache-valley-gtfs-1906
us-virginia-arlington-transit-gtfs-485
us-virginia-fairfax-cue-bus-cue-gtfs-2609
us-virginia-fredericksburg-regional-transit-gtfs-2430
us-washington-coast-transportation-gtfs-2162
us-washington-columbia-county-public-transportation-gtfs-2168
us-washington-community-in-motion-gtfs-2163
us-washington-eastside-friends-of-seniors-gtfs-2166
us-washington-hopelink-transportation-gtfs-2167
us-washington-intercity-transit-gtfs-2289
us-washington-paratransit-services-gtfs-2176
us-washington-puget-sound-educational-service-district-gtfs-2177
us-washington-sound-generations-hyde-shuttle-gtfs-2183

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 3.66 3.81 ⬆️+0.15
Median -- 1.37 1.47 ⬆️+0.11
Standard Deviation -- 9.70 10.35 ⬆️+0.66
Minimum in References Reports us-california-flex-v2-developer-test-feed-3-gtfs-1819 0.47 0.63 ⬆️+0.16
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 202.12 257.99 ⬆️+55.87
Minimum in Latest Reports us-california-city-of-wasco-gtfs-1788 0.49 0.49 ⬆️+0.00
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 202.12 257.99 ⬆️+55.87
📜 Memory Consumption
Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 467.18 MiB 461.92 MiB ⬇️-5.27 MiB
Median -- 333.47 MiB 331.93 MiB ⬇️-1.54 MiB
Standard Deviation -- 796.63 MiB 740.51 MiB ⬇️-56.12 MiB
Minimum in References Reports ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 39.75 MiB 39.78 MiB ⬆️+30.23 KiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 11.81 GiB 5.68 GiB ⬇️-6.12 GiB
Minimum in Latest Reports us-new-mexico-santa-fe-trails-gtfs-839 403.93 MiB 39.11 MiB ⬇️-364.82 MiB
Maximum in Latest Reports ch-unknown-swiss-federal-railways-sbb-gtfs-2144 7.99 GiB 8.72 GiB ⬆️+750.86 MiB

Copy link
Member

@davidgamez davidgamez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I added a few comments before final approval.

@davidgamez
Copy link
Member

@skalexch can you take a look at this PR for product approval? Thanks!

@davidgamez
Copy link
Member

Adding the acceptance test manually due to #1960

The test is failing due to a high rate of corrupted files, with no concerns regarding notices. This should be fixed after #2024

📝 Acceptance Test Report

📋 Summary

❌ The rule acceptance test has failed for commit cdd15be
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (0 out of 1886 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

45 out of 1931 sources (~2 %) are corrupted.
Dataset Ref Report Exists Ref Report Readable Latest Report Exists Latest Report Readable
br-rio-de-janeiro-angra-dos-reis-viacao-senhor-do-bonfirm-2632
ca-london-hamilton-street-railway-gtfs-2618
ca-saskatchewan-moose-jaw-transit-gtfs-2602
de-baden-wuttemberg-nvbw-gtfs-2393.json
de-bayern-munchner-verkehrs--und-tarifverbund-gmbh-mvv-gtfs-2252
de-sachsen-mitteldeutscher-verkehrsverbund-gmbh-mdv-gtfs-2360
es-renfe-alta-velocidad-larga-distancia-media-distancia-gtfs-2620
fr-eurostar-gtfs-2431.json
fr-grand-est-troyes-champagne-metropole-tcat-gtfs-2627
fr-grand-est-troyes-champagne-metropole-tcat-gtfs-2628
fr-nouvelle-aquitaine-bordeaux-tbm-gtfs-2622
fr-sens-intercom-gtfs-2626
gg-unknown-sarkshipping-gtfs-2615
il-ministry-of-transport-and-road-safety-gtfs-2519
jp-aomori-aomori-city-bus-gtfs-2607
ro-buzau-transbus-buzau-gtfs-2106
ro-dambovita-servicii-publice-municipale-targoviste-gtfs-2107
ro-prahova-transport-calatori-express-ploiesti-gtfs-2108
us-california-alameda-contra-costa-transit-district-ac-transit-gtfs-2455
us-california-county-connection-gtfs-2421
us-california-regional-transportation-commission-of-southern-nevada-rtc-gtfs-110
us-california-santa-maria-area-transit-gtfs-26
us-california-south-county-transit-link-gtfs-2203
us-california-taft-maricopa-area-transit-gtfs-821
us-california-tri-delta-transit-gtfs-1974
us-colorado-vail-blueskylimo-gtfs-2616
us-florida-gainesville-regional-transit-system-gtfs-2412
us-hawaii-hawaii-mass-transit-agency-hele-on-bus-gtfs-2608
us-illinois-danville-mass-transit-gtfs-2363
us-kansas-salina-gtfs-1867
us-massachusetts-pioneer-valley-transit-authority-pvta-gtfs-2416
us-oregon-burns-paiute-tribal-transit-gtfs-2621
us-utah-cache-valley-gtfs-1906
us-virginia-arlington-transit-gtfs-485
us-virginia-fairfax-cue-bus-cue-gtfs-2609
us-virginia-fredericksburg-regional-transit-gtfs-2430
us-washington-coast-transportation-gtfs-2162
us-washington-columbia-county-public-transportation-gtfs-2168
us-washington-community-in-motion-gtfs-2163
us-washington-eastside-friends-of-seniors-gtfs-2166
us-washington-hopelink-transportation-gtfs-2167
us-washington-intercity-transit-gtfs-2289
us-washington-paratransit-services-gtfs-2176
us-washington-puget-sound-educational-service-district-gtfs-2177
us-washington-sound-generations-hyde-shuttle-gtfs-2183

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 3.73 3.82 ⬆️+0.09
Median -- 1.38 1.50 ⬆️+0.12
Standard Deviation -- 10.68 10.33 ⬇️-0.35
Minimum in References Reports us-california-san-juan-capistrano-free-weekend-trolley-gtfs-2235 0.48 0.57 ⬆️+0.10
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 285.52 263.50 ⬇️-22.02
Minimum in Latest Reports us-california-taft-area-transit-gtfs-2236 0.48 0.57 ⬆️+0.09
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 285.52 263.50 ⬇️-22.02
📜 Memory Consumption
Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 459.73 MiB 460.65 MiB ⬆️+941.43 KiB
Median -- 335.93 MiB 332.04 MiB ⬇️-3.88 MiB
Standard Deviation -- 732.24 MiB 728.43 MiB ⬇️-3.80 MiB
Minimum in References Reports us-colorado-boulder-county-gtfs-2191 38.79 MiB 411.93 MiB ⬆️+373.13 MiB
Maximum in Reference Reports ch-unknown-swiss-federal-railways-sbb-gtfs-2144 8.37 GiB 8.03 GiB ⬇️-345.00 MiB
Minimum in Latest Reports ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 39.77 MiB 37.99 MiB ⬇️-1.78 MiB
Maximum in Latest Reports ch-unknown-swiss-federal-railways-sbb-gtfs-2144 8.37 GiB 8.03 GiB ⬇️-345.00 MiB

@sylvansson
Copy link
Contributor Author

I see a large increase in memory consumption for the "Minimum in References Reports" metric but other PRs (this one, for example) have the same thing, so I assume it's nothing to worry about?

@davidgamez
Copy link
Member

I see a large increase in memory consumption for the "Minimum in References Reports" metric but other PRs (this one, for example) have the same thing, so I assume it's nothing to worry about?

I agree; this increase doesn't look like a side effect of this PR.

Copy link
Member

@davidgamez davidgamez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@skalexch skalexch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not having submitted this earlier, I wrote the comment on Wednesday and did not submit the review. I just wrote one question regarding the logic.

return Optional.empty();
}

Map<String, Set<LocalDate>> removedDaysByServiceId =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sylvansson @davidgamez one question, the function seems to create removedDaysByServiceId by using only GtfsCalendarDate (calendar_dates.txt). However, there might be services in calendar.txt that don't have a removed date in calendar_dates.txt.

So I think the way to check this is by making sure that for each removed date, it is removed for every service_id that figures either in calendar.txt or in calendar_dates.txt (as ADDED SERVICE: exception_type=1).

Eg:
calendar.txt: service_id s_1 from 1-5-2025 to 30-5-2025.

calendar_dates.txt:
service_id s_1 removed on 1-5-2025
service_id s_3 added on 24-5-2025
service_id s_3 removed on 31-5-2025

The service window is 2-5-2025 to 31-5-2025 because:

  • May 1st is removed for all services that exist on May 1st (s_1 only)
  • s_3 was added for May 24th and removed on May 31st. So no service operates on May 31st. If s_1 were to end on May 31st then the service window would include that date.

I suggest that for each date, we count the total unique removed service_ids, then compare it with the count of unique service_ids that operate on that date (both from calendar.txt and the added services in calendar_dates.txt), but there might be better ways to do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @skalexch. You are right. The spec doesn't force all service IDs to be defined in the calendar.txt and calendar_dates.txt when both files are defined. In this case, we can have service IDs from calendar.txt that are not present in the calendar_dates and will have a service on a day removed by a calendar_date entry.

Following this logic, I agree that we need to keep a total of "all unique service IDs" per day, so we make sure that all services are not present to mark the date as "removed." The "all unique service IDs" can be computed by looking at the trips.txt file.
In addition, we need to ensure that the added service "expands" the start and end dates of the service window. So, we must iterate over all additions and set the start and end dates accordingly. This should be done after computing the start and end dates from all calendar ranges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll try to get this fixed sometime next week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll try to get this fixed sometime next week.

Thanks!

return Optional.empty();
}

Map<String, Set<LocalDate>> removedDaysByServiceId =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @skalexch. You are right. The spec doesn't force all service IDs to be defined in the calendar.txt and calendar_dates.txt when both files are defined. In this case, we can have service IDs from calendar.txt that are not present in the calendar_dates and will have a service on a day removed by a calendar_date entry.

Following this logic, I agree that we need to keep a total of "all unique service IDs" per day, so we make sure that all services are not present to mark the date as "removed." The "all unique service IDs" can be computed by looking at the trips.txt file.
In addition, we need to ensure that the added service "expands" the start and end dates of the service window. So, we must iterate over all additions and set the start and end dates accordingly. This should be done after computing the start and end dates from all calendar ranges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Inverted service dates in validation report (start later than end)

3 participants