Conversation

@vipvap vipvap commented Aug 1, 2025

No description provided.

conradludgate and others added 30 commits July 25, 2025 09:03
Follow up to #12701, which introduced a new regression. When profiling
locally I noticed that writes have the tendency to always reallocate. On
investigation I found that even if the `Connection`'s write buffer is
empty, as long as it still shares the same data pointer as the `Client`'s
write buffer, the client cannot reclaim it.

The best way I found to fix this is to just drop the `Connection`'s
write buffer each time we fully flush it.

Additionally, I remembered that `BytesMut` has an `unsplit` method, which
allows even better sharing than the previous optimisation I had when
encoding.
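A rough sketch of both ideas, assuming the `bytes` crate; the buffer names and the flow are illustrative, not the actual `Client`/`Connection` code:

```rust
use bytes::BytesMut;

fn main() {
    let mut client_buf = BytesMut::with_capacity(1024);
    client_buf.extend_from_slice(b"encoded frame");

    // Hand the encoded prefix to the connection without copying.
    let mut sent = client_buf.split();
    // ... the connection writes `sent` to the socket and fully flushes ...
    sent.clear();

    // `unsplit` re-joins the two contiguous halves of the same allocation,
    // so the client gets its full capacity back without reallocating.
    sent.unsplit(client_buf);
    client_buf = sent;
    assert!(client_buf.is_empty() && client_buf.capacity() >= 1024);

    // If the connection instead kept an empty-but-shared `BytesMut` around,
    // the shared data pointer would block reclamation; dropping the
    // connection's buffer after a full flush (as this PR does) avoids that.
}
```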
We have a dedicated libs folder for proxy related libraries. Let's move
the subzero_core stub there.
## Problem

LKB-2502 The garbage collection of the project info cache is garbage. 

What we observed: If we get unlucky, we might throw away a very hot
entry if the cache is full. The GC loop is dependent on getting a lucky
shard of the projects2ep table that clears a lot of cold entries. The GC
does not take into account active use, and the interval it runs at is
too sparse to do any good.

Can we switch to a proper cache implementation?

Complications:
1. We need to invalidate by project/account.
2. We need to expire based on `retry_delay_ms`.

## Summary of changes

1. Replace `retry_delay_ms: Duration` with `retry_at: Instant` when
deserializing.
2. Split the EndpointControls from the RoleControls into two different
caches.
3. Introduce an expiry policy based on error retry info.
4. Introduce `moka` as a dependency, replacing our `TimedLru`.

See the follow up PR for changing all TimedLru instances to use moka:
#12726.
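A rough sketch of what the retry-based expiry could look like with moka's `Expiry` trait; the key/value types and field names here are illustrative, not the actual proxy types:

```rust
use std::time::{Duration, Instant};

// Illustrative cache value: endpoint controls plus the deadline derived
// from `retry_delay_ms` at deserialization time (`retry_at: Instant`).
#[derive(Clone)]
struct EndpointControls {
    retry_at: Option<Instant>,
    // ... actual control data ...
}

struct RetryExpiry;

impl moka::Expiry<String, EndpointControls> for RetryExpiry {
    fn expire_after_create(
        &self,
        _key: &String,
        value: &EndpointControls,
        created_at: Instant,
    ) -> Option<Duration> {
        // Entries carrying retry info expire once the retry deadline passes;
        // other entries are left to size-based eviction in this sketch.
        value.retry_at.map(|at| at.saturating_duration_since(created_at))
    }
}

fn build_cache() -> moka::sync::Cache<String, EndpointControls> {
    moka::sync::Cache::builder()
        .max_capacity(10_000)
        .expire_after(RetryExpiry)
        .build()
}
```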
## Problem

The Dockerfile for build tools has some small issues that are easy to
fix to make it follow some Docker best practices.

## Summary of changes

Apply some small quick wins on the Dockerfile for build tools

- Use `apt-get` instead of `apt`
- Use `--no-cache-dir` for `pip install`
Document that the Pageserver gRPC port is accessible by computes, and
should not provide internal services.

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
Verify that a gRPC `GetPageRequest` has been sent to the shard that owns
the pages. This avoids spurious `NotFound` errors when a compute misroutes
a request, which can appear scarier than they are (e.g. suggesting data loss).

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
…ciliations (#12745)

## Problem

We saw the following in the field:

Context and observations:
* The storage controller keeps track of the latest generations and the
pageserver that issued the latest generation in the database
* When the storage controller needs to proxy a request (e.g. timeline
creation) to the pageservers, it will use the pageserver that
issued the latest generation according to the db (generation_pageserver).
* pageserver-2.cell-2 got into a bad state and wasn't able to apply
location_config (e.g. detach a shard)

What happened:
1. pageserver-2.cell-2 was a secondary for our shard since we were not
able to detach it
2. control plane asked to detach a tenant (presumably because it was
idle)
a. In response storcon clears the generation_pageserver from the db and
attempts to detach all locations
b. it tries to detach pageserver-2.cell-2 first, but fails, which fails
the entire reconciliation, leaving the good attached location in place
c. returns success to cplane

3. control plane asks to re-attach the tenant
a. In response storcon performs a reconciliation
b. it finds that the observed state matches the intent (remember we did
not detach the primary at step(2))
c. skips incrementing the generation and setting the
generation_pageserver column

Now any requests that need to be proxied to pageservers and rely on the
generation_pageserver db column fail because it is not set.

## Summary of changes

1. We do all non-essential location config calls (setting up
secondaries,
detaches) at the end of the reconciliation. Previously, we bailed out
of the reconciliation on the first failure. With this patch we attempt
all of the RPCs.
This allows the observed state to update even if another RPC failed for
unrelated reasons.

2. If the overall reconciliation failed, we don't want to remove nodes
from the
observed state as a safe-guard. With the previous patch, we'll get a
deletion delta to process, which would be ignored. Ignoring it is not
the right thing to do since it's out of sync with the db state.
Hence, on reconciliation failures, map deletions from the observed state
to the uncertain state. Future reconciliations will query the node to
refresh its observed state.

Closes LKB-204
## Problem

Copy certificate and key from secret mount directory to `pgdata`
directory where `postgres` is the owner and we can set the key
permission to 0600.

## Summary of changes

- Added new pgparam `pg_compute_tls_settings` to specify where k8s
secret for certificate and key are mounted.
- Added a new field to `ComputeSpec` called `databricks_settings`. This
is a struct that will be used to store any other settings that need to
be propagated to Compute but should not be persisted to `ComputeSpec` in
the database.
- Then, when the compute container starts up, the `prepare_pgdata`
function copies `server.key` and `server.crt` from the k8s mounted
directory to the `pgdata` directory (see the sketch below).
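A minimal sketch of that copy step using std file APIs; the paths and the function name are illustrative, not the actual `prepare_pgdata` code:

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// Copy the TLS material from the k8s secret mount into pgdata, where
// postgres owns the files and the key can be restricted to 0600.
fn copy_tls_files(secret_dir: &Path, pgdata: &Path) -> std::io::Result<()> {
    fs::copy(secret_dir.join("server.crt"), pgdata.join("server.crt"))?;
    fs::copy(secret_dir.join("server.key"), pgdata.join("server.key"))?;
    fs::set_permissions(
        pgdata.join("server.key"),
        fs::Permissions::from_mode(0o600),
    )?;
    Ok(())
}
```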

## How is this tested?

Add unit tests.
Manual test via KIND

Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>
… compute instance (#12732)

## Problem

We need to set the following Postgres GUCs to the correct values before
starting Postgres in the compute instance:

```
databricks.workspace_url
databricks.enable_databricks_identity_login
databricks.enable_sql_restrictions
```

## Summary of changes

Plumbed through `workspace_url` and other GUC settings via
`DatabricksSettings` in `ComputeSpec`. The spec is sent to the compute
instance when it starts up and the GUCs are written to `postgresql.conf`
before the postgres process is launched.
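As an illustration, the GUC rendering could be as simple as the following sketch; the struct fields are assumptions based on the GUC names above, not the actual `DatabricksSettings` definition:

```rust
use std::fmt::Write as _;

// Assumed subset of DatabricksSettings as it might arrive in ComputeSpec.
struct DatabricksSettings {
    workspace_url: String,
    enable_databricks_identity_login: bool,
    enable_sql_restrictions: bool,
}

// Render the GUC lines that get appended to postgresql.conf before launch.
fn render_gucs(s: &DatabricksSettings) -> String {
    let mut conf = String::new();
    writeln!(conf, "databricks.workspace_url = '{}'", s.workspace_url).unwrap();
    writeln!(
        conf,
        "databricks.enable_databricks_identity_login = {}",
        s.enable_databricks_identity_login
    )
    .unwrap();
    writeln!(
        conf,
        "databricks.enable_sql_restrictions = {}",
        s.enable_sql_restrictions
    )
    .unwrap();
    conf
}
```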

---------

Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
neondatabase/cloud#19011

- Prewarm config changes are not publicly available.
  Correct the test by using a pre-filled 50 GB project on staging
- Create extension neon with schema neon to fix read performance tests
on staging, error example in
https://neon-github-public-dev.s3.amazonaws.com/reports/main/16483462789/index.html#suites/3d632da6dda4a70f5b4bd24904ab444c/919841e331089fc4/
- Don't create extra endpoint in LFC prewarm performance tests
LKB-2536 TimedLru is hard to maintain. Let's use moka instead. Stacked
on top of #12710.
## Problem
test_ps_unavailable_after_delete is flaky. All test failures I've looked
at are because of ERROR log messages in the pageserver, which happen because
the storage controller tries to run reconciliations during the graceful
shutdown of the pageserver.

I wasn't able to reproduce it locally, but I think stopping PS
immediately instead of gracefully should help. If not, we might just
silence those errors.

- Closes: https://databricks.atlassian.net/browse/LKB-745
## Problem

Monitoring dashboards show aggregates of all proxy instances, including
terminating ones. This can skew the results or make graphs less
readable. Also, alerts must be tuned to ignore certain signals from
terminating proxies.

## Summary of changes

Add a `service_info` metric currently with one label, `state`, showing
if an instance is in state `init`, `running`, or `terminating`. The
metric can be joined with other metrics to filter the presented time
series.
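For illustration, a sketch of such a metric using the `prometheus` crate; the proxy's actual metrics library and registration code may differ:

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

fn register_service_info(registry: &Registry) -> IntGaugeVec {
    let info = IntGaugeVec::new(
        Opts::new("service_info", "Lifecycle state of this proxy instance"),
        &["state"],
    )
    .unwrap();
    registry.register(Box::new(info.clone())).unwrap();
    info.with_label_values(&["init"]).set(1);
    info
}

// Exactly one state label carries the value 1 at any time.
fn set_state(info: &IntGaugeVec, state: &str) {
    for s in ["init", "running", "terminating"] {
        info.with_label_values(&[s]).set((s == state) as i64);
    }
}
```

Dashboards can then join on `service_info{state="running"}` to drop terminating instances from aggregates.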
…` guard (#12743)

Before this PR, getpage requests wouldn't hold the
`applied_gc_cutoff_lsn` guard until they were done.

Theoretical impact: if we’re not holding the `RcuReadGuard`, gc can
theoretically concurrently delete reconstruct data that we need to
reconstruct the page.

I don't think this practically occurs in production because the odds of
it happening are quite low, especially for primary read_write computes.
But RO replicas / standby_horizon rely on a correct
`applied_gc_cutoff_lsn`, so I'm fixing this as part of the work on
replacing the standby_horizon propagation mechanism with leases (LKB-88).

The change is gated behind a feature flag, which is evaluated once when
entering `handle_pagestream` to avoid performance impact.

For observability, we add a field to the `handle_pagestream` span, and a
slow-log to the place in `gc_loop` where it waits for the in-flight
RcuReadGuards to drain.

refs
- fixes https://databricks.atlassian.net/browse/LKB-2572
- standby_horizon leases epic:
https://databricks.atlassian.net/browse/LKB-2572

---------

Co-authored-by: Christian Schwarz <Christian Schwarz>
## Problem

For certificate auth, we need to configure pg_hba and pg_ident for it to
work.

HCC needs to mount this config map to all pg compute pods.

## Summary of changes

Create `databricks_pg_hba` and `databricks_pg_ident` to configure where
the files are located on the pod. These configs are passed down to
`compute_ctl`, which uses them to update the `pg_hba.conf` and
`pg_ident.conf` files.

We append `include_if_exists {databricks_pg_hba}` to `pg_hba.conf`, and
similarly to `pg_ident.conf`, so that they refer to the databricks config
files without much change to the existing pg default config files.

---------

Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: HaoyuHuang <haoyu.huang.68@gmail.com>
Exposes metrics for caches. LKB-2594

This exposes a high-level namespace, `cache`, that all cache metrics can
be added to; as I understand it, this makes it easier to build library
panels for the caches.

To calculate the current cache fill ratio, you could use the following
query:

```
(
    cache_inserted_total{cache="node_info"}
  - sum (cache_evicted_total{cache="node_info"}) without (cause)
)
  / cache_capacity{cache="node_info"}
```

To calculate the cache hit ratio, you could use the following query:

```
  cache_request_total{cache="node_info", outcome="hit"}
/ sum (cache_request_total{cache="node_info"}) without (outcome)
```
## Problem
The test for logical replication used year-old versions of
ClickHouse and Debezium, so we might miss problems related to
up-to-date versions.
## Summary of changes
The ClickHouse version has been updated to 24.8.
The Debezium version has been updated to the latest stable one,
3.1.3Final.
Some problems with locally running the Debezium test have been fixed.

---------

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem

When tenants have a lot of timelines, the number of tenants that a
pageserver can comfortably handle goes down. Branching is much more
widely used in practice now than it was when this code was written, and
we generally run pageservers with a few thousand tenants (where each
tenant has many timelines), rather than the 10k-20k we might have done
historically.

This should really be something configurable, or a more direct proxy for
resource utilization (such as non-archived timeline count), but this
change should be a low effort improvement.

## Summary of changes

* Change the target shard count (MAX_SHARDS) to 2500 from 5000 when
calculating pageserver utilization (i.e. a 200% overcommit now
corresponds to 5000 shards, not 10000 shards)

Co-authored-by: John Spray <john.spray@databricks.com>
There were a few uses of these already, so collect them to the
compatibility header to avoid the repetition and scattered #ifdefs.

The definition of MyProcNumber is a little different from what was used
before, but the end result is the same. (PGPROC->pgprocno values were
just assigned sequentially to all PGPROC array members, see
InitProcGlobal(). That's a bit silly, which is why it was removed in
v17.)
## Problem

close LKB-753. `test_pageserver_metrics_removed_after_offload` is
unstable and sometimes leaves the metrics behind after tenant
offloading. It turns out that we triggered an image compaction before
the offload and the job was stopped after the offload request was
completed.

## Summary of changes

Wait for all background tasks to finish before checking the metrics.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
- Remove some unused code
- Use `is_multiple_of()` instead of '%'
- Collapse consecutive "if let" statements
- Elided lifetime fixes

It is enough to review just your team's code.
## Problem

Password hashing for sql-over-http takes up a lot of CPU. Perhaps we can
get away with temporarily caching some steps so we need fewer
rounds, which will save some CPU time.

## Summary of changes

The output of pbkdf2 is the XOR of the outputs of each iteration round,
e.g. `U1 ^ U2 ^ ... U15 ^ U16 ^ U17 ^ ... ^ Un`. We cache the suffix of
the expression, `U16 ^ U17 ^ ... ^ Un`. To compute the result from the
cached suffix, we only need to compute the prefix `U1 ^ U2 ^ ... U15`.
The suffix by itself is useless, which prevents its use in brute-force
attacks should this cached memory leak.
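A simplified sketch of that split, assuming PBKDF2-HMAC-SHA-256 via the `hmac` and `sha2` crates; the split point and the caching plumbing are illustrative:

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

fn prf(key: &[u8], data: &[u8]) -> [u8; 32] {
    let mut mac = Hmac::<Sha256>::new_from_slice(key).expect("any key length is accepted");
    mac.update(data);
    let out = mac.finalize().into_bytes();
    let mut block = [0u8; 32];
    block.copy_from_slice(out.as_slice());
    block
}

fn xor_into(acc: &mut [u8; 32], block: &[u8; 32]) {
    for (a, b) in acc.iter_mut().zip(block) {
        *a ^= *b;
    }
}

// Compute the prefix `U1 ^ ... ^ U_split`, returning it together with
// U_split, the seed the suffix chain continues from.
fn pbkdf2_prefix(password: &[u8], salt: &[u8], split: u32) -> ([u8; 32], [u8; 32]) {
    // U1 = PRF(password, salt || INT(1)) for the first derived block.
    let mut salt1 = salt.to_vec();
    salt1.extend_from_slice(&1u32.to_be_bytes());
    let mut u = prf(password, &salt1);
    let mut acc = u;
    for _ in 1..split {
        u = prf(password, &u);
        xor_into(&mut acc, &u);
    }
    (acc, u)
}

// With `suffix = U_{split+1} ^ ... ^ Un` cached, only `split` HMAC rounds
// are needed per request: the full hash is simply `prefix ^ suffix`.
fn hash_from_cached_suffix(prefix: &[u8; 32], suffix: &[u8; 32]) -> [u8; 32] {
    let mut out = *prefix;
    xor_into(&mut out, suffix);
    out
}
```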

We are also caching the full 4096 round hash in memory, which can be
used for brute-force attacks, where this suffix could be used to speed
it up. My hope/expectation is that since these will be in different
allocations, it makes any such memory exploitation much much harder.
Since the full hash cache might be invalidated while the suffix is
cached, I'm storing the timestamp of the computation as a way to
identify the match.

I also added `zeroize()` to clear the sensitive state from the
stack/heap.

For the most security conscious customers, we hope to roll out OIDC
soon, so they can disable passwords entirely.

---

The numbers for the threadpool were pretty random, but according to our
busiest region for sql-over-http, we only see about 150 unique endpoints
every minute. So storing ~100 of the most common endpoints for that
minute should cover the vast majority of requests.

1 minute was chosen so we don't keep data in memory for too long.
The `std::mem::offset_of` macro was introduced in Rust 1.77.0.

In passing, mark the function as `const`, as suggested in the
comment. Not sure which compiler version that requires, but it works
with what we have currently.
…2770)

## Problem

`test_lsn_lease_storcon` might fail in debug builds due to slow
ShardSplit

## Summary of changes
- Make the `test_lsn_lease_storcon` test ignore the `.*Exclusive lock by
ShardSplit was held.*` warning in debug builds

Ref: https://databricks.slack.com/archives/C09254R641L/p1753777051481029
…12767)

## Problem
We used ClickHouse v. 24.8, which is outdated, for logical replication
testing. We could miss some problems.
## Summary of changes
The version was updated to 25.6, with a workaround using the environment
variable `PGSSLCERT`.

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
… loop (#12754)

## Problem
We have seen some errors in staging when a shard migration was
triggered by optimizations and was still ongoing while the node it was
migrating from was being drained. This happens because the node draining
loop only waits for the migrations started by the drain loop itself;
ongoing migrations are ignored.

Closes: https://databricks.atlassian.net/browse/LKB-1625

## Summary of changes
- Wait for the shard reconciliation during the drain if it is being
migrated from the drained node.
## Problem
Given a container image, it is difficult to figure out its dependencies,
and this doesn't work automatically.

## Summary of changes
- Build all rust binaries with `cargo auditable`, to allow sbom scanners
to find their dependencies.
- Adjust `attests` for `docker/build-push-action`, so that buildkit
creates sbom and provenance attestations.
- Drop `--locked` for `rustfilt`, because `rustfilt` can't build
with locked dependencies[^5]

## Further details
Building with `cargo auditable`[^1] embeds a dependency list into Linux,
Windows, MacOS and WebAssembly artifacts. A bunch of tools support
discovering dependencies from this, among them `syft`[^2], which is used
by the BuildKit Syft scanner[^3] plugin. This BuildKit plugin is the
default[^4] used in docker for generating sbom attestations, but we're
making that default explicit by referencing the container image.
[^1]: https://github.com/rust-secure-code/cargo-auditable
[^2]: https://github.com/anchore/syft
[^3]: https://github.com/docker/buildkit-syft-scanner
[^4]:
https://docs.docker.com/build/metadata/attestations/sbom/#sbom-generator
[^5]: luser/rustfilt#23
…#12772)

## Problem

In #12467, timeouts and retries
were added to `Cache::get` tenant shard resolution to paper over an
issue with read unavailability during shard splits. However, this
retries _all_ errors, including irrecoverable errors like `NotFound`.

This causes problems with gRPC child shard routing in #12702, which
targets specific shards with `ShardSelector::Known` and relies on prompt
`NotFound` errors to reroute requests to child shards. These retries
introduce a 1s delay for all reads during child routing.

The broader problem of read unavailability during shard splits is left
as future work, see https://databricks.atlassian.net/browse/LKB-672.

Touches #12702.
Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).

## Summary of changes

* Change `TenantManager` to always return a concrete
`GetActiveTimelineError`.
* Only retry `WaitForActiveTimeout` errors.
* Lots of code unindentation due to the simplified error handling.

Out of caution, we do not gate the retries on `ShardSelector`, since
this can trigger other races. Improvements here are left as future work.
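A minimal sketch of the resulting retry shape; the error variants follow the PR text, but the loop and helper are illustrative:

```rust
use std::time::Duration;

// Illustrative stand-in for the concrete error returned by TenantManager.
enum GetActiveTimelineError {
    WaitForActiveTimeout,
    NotFound,
    Other(String),
}

async fn resolve_with_retries<F, Fut, T>(mut attempt: F) -> Result<T, GetActiveTimelineError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, GetActiveTimelineError>>,
{
    let mut retries = 0;
    loop {
        match attempt().await {
            // Only transient activation timeouts are worth retrying.
            Err(GetActiveTimelineError::WaitForActiveTimeout) if retries < 3 => {
                retries += 1;
                tokio::time::sleep(Duration::from_millis(100)).await;
            }
            // NotFound and everything else propagate immediately, so child
            // shard routing can reroute without sitting in a retry loop.
            other => return other,
        }
    }
}
```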
DimasKovas and others added 9 commits July 31, 2025 09:29
…sk in pull timeline (#12778)

## Problem
I discovered two bugs related to safekeeper migration, which
together might lead to data loss during the migration. The second bug
comes from a hadron patch and might lead to data loss during the
safekeeper restore in hadron as well.

1. `switch_membership` returns the current `term` instead of
`last_log_term`. It is used to choose the `sync_position` in the
algorithm, so we might choose the wrong one and break the correctness
guarantees.
2. The current `term` is used to choose the most advanced SK in
`pull_timeline` with higher priority than `flush_lsn`. It is incorrect
because the most advanced safekeeper is the one with the highest
`(last_log_term, flush_lsn)` pair (see the sketch below). The compute might
bump the term on the least advanced sk, making it the best choice to pull
from, and thus making committed log entries "uncommitted" after
`pull_timeline`.
Part of https://databricks.atlassian.net/browse/LKB-1017

## Summary of changes
- Return `last_log_term` in `switch_membership`
- Use `(last_log_term, flush_lsn)` as a primary key for choosing the
most advanced sk in `pull_timeline` and deny pulling if the `max_term`
is higher than on the most advanced sk (hadron only)
- Write tests for both cases
- Retry `sync_safekeepers` in `compute_ctl`
- Take into account the quorum size when calculating `sync_position`
As is, e.g. quota errors on wake compute are logged as "compute" errors.
It's an anon v1 failed launch artifact, I suppose.
Clap automatically uses doc comments as help/about texts. Doc comments
are strictly better, since they're also used e.g. for IDE documentation,
and are better formatted.

This patch updates all `neon_local` commands to use doc comments
(courtesy of GPT-o3).
)

Commit 1dce2a9 changed how the `neon.pageserver_connstring` setting
is formed, but it messed up setting the `neon.stripe_size` setting so
that it was set twice. That got mixed up during development of the
patch, as commit 7fef443 landed first and was merged incorrectly.
- Return sub-actions time spans for prewarm, prewarm offload, and
promotion in http handlers.
- Set `synchronous_standby_names=walproposer` for promoted endpoints.
Otherwise, the walproposer on the promoted standby ignores replies from
the safekeeper and stays stuck waiting for the commit LSN forever.
## Problem
Right now, if we commit a joint configuration to the DB, there is no way
back. The only way to get a clean mconf is to continue the migration.
The RFC also described an abort mechanism, which allows aborting the current
migration and reverting the mconf change. It might be needed if the migration
is stuck and cannot make any progress, e.g. if the sk we are migrating
to went down during the migration. This PR implements that abort
algorithm.

- Closes: https://databricks.atlassian.net/browse/LKB-899
- Closes: #12549

## Summary of changes
- Implement `safekeeper_migrate_abort` handler with the algorithm
described in the RFC
- Add `timeline-safekeeper-migrate-abort` subcommand to `storcon_cli`
- Add test for the migration abort algorithm.
## Problem
The rest broker needs to respond with the correct CORS headers for the API
to be usable from other domains.

## Summary of changes
Added a code path in the rest broker to handle CORS preflight (OPTIONS) requests.
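A sketch of the kind of OPTIONS short-circuit meant here, using the `http` crate types for illustration; the rest broker's real handler and allowed header set may differ:

```rust
use http::{header, Method, Request, Response, StatusCode};

// Answer CORS preflight requests before normal routing kicks in.
fn handle_preflight<B>(req: &Request<B>) -> Option<Response<String>> {
    if req.method() != Method::OPTIONS {
        return None;
    }
    let resp = Response::builder()
        .status(StatusCode::NO_CONTENT)
        .header(header::ACCESS_CONTROL_ALLOW_ORIGIN, "*")
        .header(header::ACCESS_CONTROL_ALLOW_METHODS, "GET, POST, PATCH, DELETE, OPTIONS")
        .header(header::ACCESS_CONTROL_ALLOW_HEADERS, "authorization, content-type")
        .header(header::ACCESS_CONTROL_MAX_AGE, "86400")
        .body(String::new())
        .expect("static response is well-formed");
    Some(resp)
}
```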

---------

Co-authored-by: Ruslan Talpa <ruslan.talpa@databricks.com>
@vipvap vipvap requested review from a team as code owners August 1, 2025 06:13
@vipvap vipvap requested review from hlinnaka, iddm, DimasKovas, milosv-db, andrewdoss-db, awarus and fedordikarev and removed request for a team August 1, 2025 06:13

github-actions bot commented Aug 1, 2025

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.


github-actions bot commented Aug 1, 2025

9119 tests run: 8465 passed, 0 failed, 654 skipped (full report)


Flaky tests (1): Postgres 17

Code coverage* (full report)

  • functions: 34.7% (8841 of 25482 functions)
  • lines: 45.7% (71641 of 156719 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
f5a4a01 at 2025-08-01T07:18:53.567Z :recycle:

@bayandin bayandin closed this Aug 1, 2025