Conversation

@adk-swisstopo

We were holding releases due to PB-1619. It looks increasingly likely that this is an issue specific to how the DEV environment is set up. Let's cut a new release and push it to INT to verify. This is a major release because at least #572 qualifies.

adk-swisstopo and others added 30 commits April 16, 2025 15:42
In #558 we added the ability to
log stack traces upon exit. However we did not distinguish between a normal
graceful exit and an abnormal exit of a deadlocked service. This means that on
every exit/restart, we clutter the logs with stack traces.

With this change, upon exit we only attempt to dump the stack if the worker is
still running after a delay. This means that graceful exits that complete within
that delay do not clutter the logs with stack traces.

This change removes the `GUNICORN_DUMP_STACKS_ON_EXIT` variable and adds
`GUNICORN_STACK_DUMP_DELAY` and `GUNICORN_GRACEFUL_TIMEOUT`. The former defaults
to one second less than the latter. We may need to tweak
`GUNICORN_GRACEFUL_TIMEOUT` to keep it coherent with the Kubernetes
`terminationGracePeriodSeconds`.

Another way to resolve this might be implemented upstream later in benoitc/gunicorn#3385
PB-1531: only dump stacks when exit is taking too long.
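As a rough sketch, the delayed dump could be wired into a `gunicorn.conf.py` hook along these lines; the hook body and defaults below are an illustration of the description above, not the actual implementation:

```python
# gunicorn.conf.py -- illustrative sketch only, not the project's real config.
import faulthandler
import os
import threading

graceful_timeout = int(os.environ.get("GUNICORN_GRACEFUL_TIMEOUT", "30"))
# By default, dump one second before the graceful timeout expires.
STACK_DUMP_DELAY = float(
    os.environ.get("GUNICORN_STACK_DUMP_DELAY", graceful_timeout - 1)
)


def worker_int(worker):
    # Called when a worker receives SIGINT/SIGQUIT. Schedule a stack dump on a
    # daemon timer: if the worker exits gracefully before the delay elapses,
    # the timer dies with the process and nothing clutters the logs.
    timer = threading.Timer(
        STACK_DUMP_DELAY, faulthandler.dump_traceback, kwargs={"all_threads": True}
    )
    timer.daemon = True
    timer.start()
```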
PB-1575: make remove_expired_items more memory-efficient, probably.
Previously: Response 200
Now: Response 200 GET /checker?
PB-1619: make response logging more useful.
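For illustration, logging the method and path next to the status code could be done with a middleware along these lines (the class name and logger setup are assumptions, not the project's actual code):

```python
import logging

logger = logging.getLogger(__name__)


class ResponseLoggingMiddleware:
    """Log the status code together with the request method and path."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        # Previously only "Response 200" was logged; adding the method and the
        # full path gives lines like "Response 200 GET /checker?".
        logger.info(
            "Response %s %s %s",
            response.status_code,
            request.method,
            request.get_full_path(),
        )
        return response
```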
This was fixed in gevent 25.4.1.
Upgrade gevent and remove our custom bug workaround.
This refactors the `RemoveExpiredItems` test case to make it easier to
modify the code being tested and to add more tests in the future.

Specifically:
* move the fixtures into `setUp` and `tearDown` as that's where they belong
* split `test_remove_item` into two as it's really two tests
This will be used to add more test cases in a later change.
Specifically:
* verify non-expiring items are not removed
* allow defining arbitrary number of expiring and non-expiring items
* test the default min_age_hours
* print the correct class name on object existence assertion failure
* deduplicate expected output
* add test case for many items (currently 20)
That was only useful when testing for an exact match on stdout. We now only look
for patterns, so it's less cumbersome to write new tests or to modify the output.
Specifically:
* check that non-numerical values trigger an error
* check that passing a value lower than the default works
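For illustration, the first of these checks could look roughly like this (the `--min-age-hours` option is the one mentioned elsewhere in these commits; the test class name is an assumption):

```python
from django.core.management import CommandError, call_command
from django.test import TestCase


class RemoveExpiredItemsArgumentTest(TestCase):

    def test_non_numerical_min_age_hours_fails(self):
        # When options are passed as strings, call_command runs them through
        # the command's argparse parser, which raises CommandError in tests.
        with self.assertRaises(CommandError):
            call_command('remove_expired_items', '--min-age-hours=not-a-number')
```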
We track memory usage and wallclock.
Also increase the number of Items from 20 to 110.
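A sketch of the kind of standard-library profiling harness this could use (the helper name is an assumption; the output format mirrors the numbers quoted in the commits below):

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def profiling(item_count):
    # Track peak memory via tracemalloc and wallclock via a monotonic clock.
    tracemalloc.start()
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f'memory consumed: {peak} ({peak / item_count} per item)')
        print(f'time consumed: {elapsed}s ({elapsed / item_count} per item)')
```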
The data_factory creates Django objects individually, which is really slow when
many objects are involved as that incurs one database round-trip per object.

With this change we use the `bulk_create` method instead. This dramatically
speeds up the test setup and allows much more practical performance testing
of larger datasets.

Also bump the number of Items in RemoveExpiredItemsManyWithProfiling to 1010.

Profiling output on my laptop:

	memory consumed: 6857881 (6857.881 per item)
	time consumed: 15.571250406s (0.015571250406000001 per item)
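In rough terms, the setup change looks like this (the import path and the `Item` constructor arguments are placeholders; the real test builds its objects through the data_factory):

```python
from stac_api.models.item import Item  # assumption: actual import path may differ

# Before: one INSERT per object, because each instance is created and saved
# individually. After: build unsaved instances in memory and insert them in bulk.
items = [Item(name=f'item-{i}') for i in range(1010)]
Item.objects.bulk_create(items, batch_size=500)
```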
We now only perform a fixed number of queries instead of multiple queries for
each Item.

Performance increase for RemoveExpiredItemsManyWithProfiling:

	memory improved by ~57%	memory consumed: 2890362 (2890.362 per item)
	wallclock improved by ~69%	time consumed: 4.712163328s (0.004712163328 per item)

However this shifts the load to the database. A later commit will add safety
features to reduce risks of overloading it. That said, I am not too concerned
about [conflicting locks or transactions](https://pglocks.org/?pgcommand=DELETE).
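Schematically, the difference is between deleting row by row and letting one queryset-level call do the work (the expiry field name and cutoff computation are assumptions):

```python
from datetime import timedelta

from django.utils import timezone

min_age_hours = 24  # illustrative value
cutoff = timezone.now() - timedelta(hours=min_age_hours)

# Before: one or more queries per expired Item.
# for item in Item.objects.filter(properties_expires__lte=cutoff):
#     item.delete()

# After: a fixed number of queries regardless of how many Items match;
# cascades to related objects are still handled by Django.
Item.objects.filter(properties_expires__lte=cutoff).delete()
```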
This adds two flags: --max-deletions (an absolute value) and
--max-deletions-percentage (a percentage). If we detect an attempt to remove
more Items than either limit allows, we fail with an error.

This is useful to guard against bugs that would accidentally wipe out the
whole database and to reduce risks of overloading the database with a request
that might consume a lot of resources.

Default values are based on the daily usage we observe in PROD. Given the
backlog of expired Items currently in the database, the safety check will
trigger until we clean things up manually.

This also forbids negative values for --min-age-hours.
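A sketch of how these guard rails can be wired into the management command (`Item` and the `expired_items` helper are assumed to exist; flag defaults and messages are illustrative, not the PROD-based values):

```python
from django.core.management.base import BaseCommand, CommandError


class Command(BaseCommand):
    help = 'Remove expired items, with limits on how much may be deleted.'

    def add_arguments(self, parser):
        parser.add_argument('--min-age-hours', type=int, default=24)
        parser.add_argument('--max-deletions', type=int, default=1000)
        parser.add_argument('--max-deletions-percentage', type=int, default=10)

    def handle(self, *args, **options):
        if options['min_age_hours'] < 0:
            raise CommandError('--min-age-hours must not be negative')

        total = Item.objects.count()
        to_delete = expired_items(options['min_age_hours']).count()

        # Refuse to run if the deletion looks suspiciously large, either in
        # absolute terms or as a share of the whole table.
        if to_delete > options['max_deletions']:
            raise CommandError(
                f'would delete {to_delete} items, '
                f'more than --max-deletions={options["max_deletions"]}'
            )
        if total and to_delete * 100 / total > options['max_deletions_percentage']:
            raise CommandError(
                f'would delete {to_delete} of {total} items, more than '
                f'{options["max_deletions_percentage"]}% of the table'
            )
```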
While remove_expired_items only changes the status of AssetUploads from IN_PROGRESS to ABORTED,
there is an [ON_DELETE=CASCADE](https://github.com/geoadmin/service-stac/blob/1219a97f35dc296a44328e710a23c858a99aa3aa/app/stac_api/models/item.py#L290)
that removes them implicitly. There is also a
[Django pre_delete constraint](https://github.com/geoadmin/service-stac/blob/1219a97f35dc296a44328e710a23c858a99aa3aa/app/stac_api/signals.py#L35)
preventing us from deleting AssetUploads that are IN_PROGRESS. So we do need to
update their status for them to be deleted afterwards.

This change ensures that the correct AssetUploads are deleted.
PB-1575: many improvements to remove_expired_items and its tests.
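Conceptually, the AssetUpload handling described above boils down to something like this (the model relations, field and status names are assumptions based on the linked code):

```python
# Mark in-progress uploads of the expired items as aborted first, so the
# pre_delete signal no longer blocks them and the ON_DELETE=CASCADE removes
# the right rows once the items themselves are deleted.
AssetUpload.objects.filter(
    asset__item__in=expired_items,
    status=AssetUpload.Status.IN_PROGRESS,
).update(status=AssetUpload.Status.ABORTED)

expired_items.delete()
```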
PB-1615: Fix change form template variable lookup error
Every time we delete an `Asset` or `CollectionAsset`, we make a call to the
corresponding S3 `DeleteObjects` method. For objects that are hosted externally
(i.e. not in an S3 bucket we control), this call always fails silently and we
can just as well not do it.

With this change, we stop attempting to call `DeleteObjects` for externally
hosted `Assets` and `CollectionAssets`, thus speeding up deletion.
PB-1575: don't attempt to delete externally hosted S3 objects.
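In rough terms, the change amounts to guarding the S3 call (the attribute and helper names are assumptions; only the boto3 call itself is real API):

```python
def delete_asset_object(s3_client, bucket_name, asset):
    # Externally hosted assets point at a URL outside our buckets; calling
    # DeleteObjects for them always fails silently, so skip the call entirely.
    if asset.is_external:
        return
    s3_client.delete_objects(
        Bucket=bucket_name,
        Delete={'Objects': [{'Key': asset.file.name}]},
    )
```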
asteiner-swisstopo and others added 24 commits May 8, 2025 14:09
This is the same as in the other endpoints that hide expired items:
- GET /collections/{collectionId}/items: Fetch features
- GET /collections/{collectionId}/items/{featureId}: Fetch a single feature

There is no difference in the result but it reduces confusion.
To reduce duplication. Note that we cannot extract the whole filtering operation  (`Item.objects.filter(...)`) as the different usages add additional fields.
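The extracted piece might look roughly like this (the helper name and the expiry field are assumptions), with each endpoint still adding its own fields on top:

```python
from django.db.models import Q
from django.utils import timezone


def non_expired_filter():
    # Shared by the search and item endpoints: keep items whose expiry is
    # unset or still in the future.
    return Q(properties_expires__isnull=True) | Q(properties_expires__gte=timezone.now())


# Example usage (illustrative):
# Item.objects.filter(non_expired_filter(), collection__name=collection_id)
```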
One `delete()` per object is very slow. One `delete()` for all the objects
consumes a lot of memory and fails in an all-or-nothing mode (i.e. when it
fails mid-way, nothing gets deleted).

With this change, we split the objects into batches of configurable size and
run `delete()` over each batch. It consumes a ~constant amount of memory, is
faster than having a batch size of 1, and is more robust: when it fails
mid-way, it will most likely have deleted at least some objects, and the next
run has that much less work to do.

There will be cases where expired items are partially deleted (i.e. some of
their assets are deleted but not all). I don't expect that to be a problem for
our use case.
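As a sketch, the batching can be as simple as chunking the ids and deleting one chunk at a time (names are illustrative; the next commit simplifies the id selection further):

```python
def delete_in_batches(queryset, batch_size=100):
    # Delete the matching rows in chunks of batch_size so each DELETE stays
    # small: memory use stays roughly constant, and a mid-way failure still
    # leaves the already-processed batches deleted.
    ids = list(queryset.values_list('id', flat=True))
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        queryset.model.objects.filter(id__in=batch).delete()
```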
Instead of extracting the ids of the objects to DELETE with SELECT/LIMIT/OFFSET,
we can just use SELECT/LIMIT. This is supposed to be
[faster](https://www.postgresql.org/docs/current/queries-limit.html)
but mostly, it's simpler as we don't need to paginate.

Idea from https://forum.djangoproject.com/t/incremental-bulk-deletion/40833/2
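Revising the sketch above accordingly: just keep re-selecting the first `batch_size` ids until nothing is left (again illustrative; the real command also applies the safety limits described earlier):

```python
def delete_in_batches(queryset, batch_size=100):
    while True:
        # SELECT ... LIMIT batch_size, no OFFSET needed: each deleted batch
        # removes itself from the next SELECT's result.
        ids = list(queryset.values_list('id', flat=True)[:batch_size])
        if not ids:
            break
        queryset.model.objects.filter(id__in=ids).delete()
```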
Update develop with master hot fixes
PB-1669:  Render asset hrefs as links in the admin UI
@adk-swisstopo adk-swisstopo requested a review from ltshb June 10, 2025 14:24
@github-actions github-actions bot changed the title #major release New Release v2.0.0 - #major Jun 10, 2025
@adk-swisstopo adk-swisstopo requested a review from boecklic June 10, 2025 14:24
@adk-swisstopo adk-swisstopo merged commit c240b8d into master Jun 11, 2025
5 checks passed