-
Notifications
You must be signed in to change notification settings - Fork 33
Speedup STAC collections endpoint #871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #871 +/- ##
===========================================
+ Coverage 84.32% 84.33% +0.01%
===========================================
Files 35 35
Lines 4230 4214 -16
Branches 537 535 -2
===========================================
- Hits 3567 3554 -13
+ Misses 467 465 -2
+ Partials 196 195 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
else [-180.0, -90.0, 180.0, 90.0] | ||
] | ||
# TODO: Find a nicer way to make the typechecker happier | ||
# pystac is too specific in wanting a list[float | int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not going to help right now, but it's possible pystac would accept a PR that changes the type annotation from list to Sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I was thinking that's the best solution too.
I started looking at the code but there's a lot of code being changed, and there are still test failures. I will take a look again when the tests are not failing. |
It's not too much code... But there is a massive amount of duplication between the postgis and postgres index modules unfortunately, but using different sqlalchemy table definition styles which makes all the code that's otherwise identical, slightly different. It's unfortunately been added to my mental tidy up list. |
d3ebb8e
to
a9e3d97
Compare
indexed=False, | ||
offset=("metadata", "title"), | ||
) | ||
|
||
description = SimpleDocField( | ||
name="description", | ||
description="product description", | ||
alchemy_column=collection.c.definition, | ||
alchemy_column=ODC_PRODUCT.definition, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The corresponding postgres change also fixes the type of the offset=
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that SimpleDocField
turns a plain string into a length 1 tuple... but I'm not sure.
This is ugly though, I'll make them match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed type annotations around that (and a bunch of usages) in datacube-core, I'm quite sure you need the tuple. A bit surprised the type checker wasn't complaining about these lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
pyright was complaining, but mypy wasn't for some reason.
It's fixed now regardless.
) -> list[Row]: | ||
# STAC Collections only hold a bounding box in EPSG:4326, no geometry | ||
# Calculate the bounding box on the server, it's far more efficient | ||
collection_bbox = func.Box2D( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RHS takes its data from TimeOverview
, shouldn't this be called t_o_bbox
(or overview_bbox
, but that is longer)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. I might add some comments.
The Cubedash Product (which maps to a STAC Collection) doesn't have any bounding box or geometry attached, all the geometries are in the TimeOverview table, grouped by different period_types
. In this case we use the period_type=="all"
to get the one that covers all of time.
|
||
if bbox: | ||
query = query.where( | ||
collection.c.footprint_geometry.intersects(func.ST_MakeEnvelope(*bbox)) | ||
TimeOverview.footprint_geometry.intersects(func.ST_MakeEnvelope(*bbox)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no test coverage for most of these if-branches, but I'll trust you know what you are doing with these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there's a least one bug that I found in these if-branches. The 'time' filter was crashing by trying to apply a python datetime function to a SQLAlchemy Column.
I only added one test in this PR. There's room for more...
In this case, collection
was a subquery that held the required columns. I've removed the sub-query since it was unnecessary obfuscation, and apply the WHERE clause against the table.
# supports the full 'Features' specification. | ||
|
||
|
||
@pytest.mark.parametrize("env_name", ("default",), indirect=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a performance issue, shouldn't this also be enabled for the postgis backend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not using the PostGIS backend yet, and, there's a LOT of integration tests here that aren't testing against the PostGIS index either. I was afraid I'd fall down another rabbit hole if I started expanding the test coverage.
The table structure and query structure are identical between the two, with the exception of the naming of the String name
field that is used to connect the cubedash product table with the odc product/agdc dataset_type table.
- Explain at a high level why we're using the TimeOverview able - Fix a couple more typing errors, - Can't apply datetime_utc to SQLAlchemy Columns - SimpleDocField expects `offset` to be a tuple - Inline a confusingly name variable in the main query
8cb8757
to
950a435
Compare
As reported in #802, we were experiencing timeouts or extremely slow responses to the
/stac/collections
endpoint. The implementation was doing a lot of extra data transfer and work, by transferring full geometries from PostGIS into Python before simplifying them down to a bbox.This moves the processing up into PostGIS, and in my testing gives a speedup between 10x and 121x, and potentially more in different environments. I only tested on my laptop.
Running with local PostGIS server, M4 MacBook Pro.
pytest-benchmark
Loading
/stac/collections
with the data loaded intest_stac.py
Small dataset, but about 10x faster.
Running datacube-explorer locally
Using a copy of our database with the horrendously complex geometries
Run the server:
Run benchmark with hey
Before this change:
After this change:
121x faster.
I haven't added any request caching discussed in the issue, that can be done separately.
Fixes #802
📚 Documentation preview 📚: https://datacube-explorer--871.org.readthedocs.build/en/871/