
Conversation

@knizhnik (Contributor) commented Jun 8, 2025

Problem

See #12073
and https://neondb.slack.com/archives/C04DGM6SMTM/p1748620049660389

There is a race condition in the current unlogged-build scheme: in neon_write, when smgr_relpersistence == 0, we first check whether the local file exists and, if so, assume that an unlogged build is in progress and call mdwrite to perform a local write. But many things can happen between mdexists and mdwritev. For example, another backend can complete the unlogged build and unlink these files.
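
A minimal sketch of the racy pattern, under the assumption that neon_write looks roughly like this (mdexists/mdwritev are real md.c entry points; the surrounding control flow is illustrative, not the literal source):

```c
if (reln->smgr_relpersistence == 0 && mdexists(reln, forknum))
{
    /*
     * Race window: between mdexists() and mdwritev(), another backend
     * can complete the unlogged build and unlink the local file, so
     * the write below fails with "could not open file".
     */
    mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
}
```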

Summary of changes

Add a cache for relation kind, which avoids mdexists calls and eliminates the race condition at the end of an unlogged build.

@knizhnik knizhnik requested review from a team as code owners June 8, 2025 15:29
@knizhnik knizhnik requested review from dimitri, bizwark and problame June 8, 2025 15:29

github-actions bot commented Jun 8, 2025

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.


github-actions bot commented Jun 8, 2025

9130 tests run: 8475 passed, 0 failed, 655 skipped (full report)


Flaky tests (1)

Postgres 15

Code coverage* (full report)

  • functions: 34.7% (8842 of 25482 functions)
  • lines: 45.7% (71646 of 156719 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
fbd668e at 2025-07-31T16:31:46.640Z ♻️

@myrrc myrrc self-requested a review June 9, 2025 15:04
@myrrc (Contributor) left a comment


I'd like a second opinion about the cache implementation

@alexanderlaw (Contributor) commented

@knizhnik, please pay attention to the new test failures produced for 6bcd46b:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12166/15589546198/index.html#/testresult/55c01525c6a0084d

regress/regression.diffs

diff -U3 /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out
--- /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out	2025-06-11 15:57:37.168439715 +0000
+++ /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out	2025-06-11 16:05:34.917725056 +0000
@@ -376,15 +376,8 @@
 -- This case isn't supported, but it should at least EXPLAIN correctly.
 explain (verbose, costs off)
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
-                                     QUERY PLAN                                     
-------------------------------------------------------------------------------------
- Limit
-   Output: p, ((circle(p, '1'::double precision) <-> '(0,0)'::point))
-   ->  Index Only Scan using gist_tbl_multi_index on public.gist_tbl
-         Output: p, (circle(p, '1'::double precision) <-> '(0,0)'::point)
-         Order By: ((circle(gist_tbl.p, '1'::double precision)) <-> '(0,0)'::point)
-(5 rows)
-
+ERROR:  lock neon_relkind is not held
+CONTEXT:  writing block 0 of relation base/16384/26093
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
 ERROR:  lossy distance functions are not supported in index-only scans
 -- Clean up

@knizhnik (Contributor, Author) commented

> @knizhnik, please pay attention to the new test failures produced for 6bcd46b: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12166/15589546198/index.html#/testresult/55c01525c6a0084d

Thank you for reporting the problem.
I hope b81baef will fix it.

@problame (Contributor) commented

Removing myself from review: speaking as a storage person, I don't feel qualified to review this. If there's anything in the design that you want reviewed or cross-checked with the storage team, please write a mini-RFC explaining what is being cached here, how invalidation works, etc.

@hlinnaka (Contributor) left a comment


The locking, with the pinning, spinlock, and lwlock, is a bit hard to grasp. I think it's correct, but I wonder if it could be simplified somehow?

In get_cached_relkind(), you do quite a lot of work while holding a spinlock. In particular, there's a call to LWLockAcquire. Acquiring an lwlock while holding a spinlock seems really bad.

@hlinnaka (Contributor) commented

I think the locking gets more straightforward if you move the LWLockAcquire/Release calls out of relkind_cache.c into pagestore_smgr.c:

  • When starting unlogged build, look up the entry with get_cached_relkind(), and set the relkind to RELKIND_UNLOGGED_BUILD.

  • When ending unlogged build, acquire lock, update the relkind in the entry to RELKIND_PERMANENT, release lock, then release the pin on the entry.

  • In neon_read, call get_cached_relkind() to look up and pin the entry. If relkind is RELKIND_UNLOGGED_BUILD, acquire the LWLock too.

Let's rename relkind_lock to something like finish_unlogged_build_lock. That's really the only thing it protects: the moment when a relation goes from RELKIND_UNLOGGED_BUILD to RELKIND_PERMANENT. A sketch of this scheme follows.
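
A hedged sketch of that scheme; get_cached_relkind() and finish_unlogged_build_lock come from this thread, while set_cached_relkind() and unpin_relkind_entry() are hypothetical helper names:

```c
/* Ending an unlogged build, now directly in pagestore_smgr.c: */
LWLockAcquire(finish_unlogged_build_lock, LW_EXCLUSIVE);
set_cached_relkind(entry, RELKIND_PERMANENT);   /* hypothetical setter */
LWLockRelease(finish_unlogged_build_lock);
unpin_relkind_entry(entry);                     /* hypothetical unpin */

/* In the smgr I/O path: */
entry = get_cached_relkind(rinfo);              /* looks up and pins the entry */
if (entry->relkind == RELKIND_UNLOGGED_BUILD)
    LWLockAcquire(finish_unlogged_build_lock, LW_SHARED);
```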

@hlinnaka (Contributor) commented Jul 11, 2025

During an unlogged build, I assume most of the writes happen in the process that's building the index. All those writes will also call get_cached_relkind() and acquire/release the lwlock. Could we easily bypass that for the process building the index? Or is it a premature optimization?

Never mind, we only call get_cached_relkind() if the SMgrRelation entry doesn't have the relpersistence. In the process building the index, that should be up to date.

@knizhnik (Contributor, Author) commented

> I think the locking gets more straightforward if you move the LWLockAcquire/Release calls out of relkind_cache.c into pagestore_smgr.c:

I want to think more about it tomorrow, but right now I just want to share my concerns.

  1. I am not sure that it is correct to have a gap between releasing the spinlock protecting relkind_hash and obtaining the shared lock. I think it can cause a race, but I will think more about it tomorrow.
  2. I think that setting the relkind to RELKIND_UNLOGGED_BUILD should be done under the spinlock. Otherwise it can, once again, cause a race.
  3. neon_read doesn't need to care about unlogged builds; I think you mixed it up with neon_write.

I am still not sure that moving the lock to pagestore_smgr.c really makes the logic more straightforward. My intention was to hide all implementation details in relkind_cache. Certainly the API is not so simple, but it still seems better than delegating some locking responsibilities to pagestore_smgr.

@knizhnik (Contributor, Author) commented

> I think the locking gets more straightforward if you move the LWLockAcquire/Release calls out of relkind_cache.c into pagestore_smgr.c:

It is not correct to allow a gap between releasing the spinlock and acquiring the shared lock in neon_write, because during this gap the backend performing the unlogged build can complete the build, update the status under the exclusive lock, and then remove the relation files. In that case the write to the file will fail.

@knizhnik knizhnik requested a review from hlinnaka July 12, 2025 13:07
@knizhnik (Contributor, Author) commented

I addressed all your comments except two.
The main one is your suggestion to move the lwlock to pagestore_smgr.c.
Why doesn't it work?

  1. Backend 1 evicts a page from shared buffers and calls neon_write, which in turn calls get_cached_relkind() to get the relkind. Assume that it returns UNLOGGED_BUILD because an unlogged build is being performed by backend 2. No locks are held at this moment.
  2. Backend 2 completes the unlogged build. It obtains the exclusive lock and changes relkind to PERMANENT. Then it releases the lock and removes the relation file on local disk.
  3. Backend 1 obtains the shared lock and starts writing the file. The write may fail because the file was removed by backend 2.

How is this problem handled now?
get_cached_relkind() rechecks relkind after obtaining the shared lwlock (see the sketch below). If it is still UNLOGGED_BUILD, then backend 1 may write the file, and the lwlock protects the files from being removed by backend 2.
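
A hedged sketch of that recheck as described (the interaction with the hash-table spinlock is elided, and everything except get_cached_relkind() and finish_unlogged_build_lock is an assumption):

```c
if (get_cached_relkind(rinfo) == RELKIND_UNLOGGED_BUILD)
{
    LWLockAcquire(finish_unlogged_build_lock, LW_SHARED);
    if (get_cached_relkind(rinfo) == RELKIND_UNLOGGED_BUILD)
    {
        /*
         * Still an unlogged build: the shared lock now prevents the
         * builder from switching the relkind to PERMANENT and
         * unlinking the file, so the local write is safe.
         */
        mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
    }
    /*
     * Otherwise the build finished while we waited for the lock, and
     * the write must go through the regular pageserver path instead.
     */
    LWLockRelease(finish_unlogged_build_lock);
}
```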

Can it be reimplemented in a different way? Certainly it can. But I think it is good to keep all locking logic in one place (relkind_cache) rather than split it between relkind_cache and pagestore_smgr.
There are two main aspects which IMHO should be addressed:

  1. A write to the file should not block concurrent backends, so no exclusive lock is acceptable here.
  2. Since there are two locks (a spinlock protecting the hash table and an lwlock protecting writes to the file), we should be careful to avoid deadlock. Deadlock is currently impossible because we may request the lwlock while holding the spinlock, but never the opposite.

In principle, obtaining the lwlock under the spinlock is not necessary, because we can never try to evict a page of an unlogged-build index before the unlogged build has started. So no race is possible at the beginning of an unlogged build. But I do not want to rely on that.

So I have renamed the lock to finish_unlogged_build_lock as you proposed, but I do not want to change the current locking scheme. I do not see how an alternative implementation could simplify it or improve performance.

Concerning eviction from relkind_cache and relying on HASH_ENTER_NULL: this is less fundamental. But once again, I do not want to rely on dynahash behaviour (see the sketch below): when does it report hash overflow, and can that affect other hashes? (Apparently not, for a shared hash.) And if we someday decide to switch to a partitioned hash, the behaviour of HASH_ENTER_NULL is even more obscure. The current implementation may be a little paranoid and redundant, but it seems more predictable and reliable. And it is the same as in relsize_cache. If we want to change it, it makes sense to create a separate PR that changes it in both caches.
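
For context, a minimal sketch of the dynahash behaviour in question (relkind_hash and the RelKindEntry type are assumed names): with HASH_ENTER_NULL, hash_search() returns NULL when a shared hash is full instead of raising an error.

```c
bool          found;
RelKindEntry *entry = (RelKindEntry *)
    hash_search(relkind_hash, &key, HASH_ENTER_NULL, &found);

if (entry == NULL)
{
    /*
     * The shared hash is out of memory. With HASH_ENTER this would
     * raise an error; with HASH_ENTER_NULL the caller must recover,
     * e.g. by evicting an unpinned entry and retrying.
     */
}
```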

Konstantin Knizhnik and others added 18 commits July 29, 2025 08:04
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
- Make it faster by using GIN instead of SP-GiST

- Make it more robust by checking some of the assumptions, like that
  the index is larger than 1 GB

- Improve comments
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
@hlinnaka (Contributor) left a comment


LGTM now.

cc @MMeent, you had plans for changes in this area as part of the v18 merge; this might become obsolete with those. But I think we can proceed with this now, because we have a concrete bug to fix, and possibly revert later if those other changes land.

knizhnik and others added 4 commits July 31, 2025 18:19
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Successfully merging this pull request may close these issues.

Creating large spgist indexes leads to "could not open file" error