
Conversation

@knizhnik (Contributor) commented Jun 8, 2025

Problem

See #12073
and https://neondb.slack.com/archives/C04DGM6SMTM/p1748620049660389

There is a race condition in the current unlogged-build scheme: in neon_write, when smgr_relpersistence == 0, we first check whether the local file exists and, if so, assume that an unlogged build is in progress and call mdwrite to perform a local write. But many things can happen between mdexists and mdwritev. For example, another backend can complete the unlogged build and unlink these files.
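
A minimal sketch of the racy pattern, under the assumption that neon_write looks roughly like this (mdexists/mdwritev are real md.c entry points; the surrounding control flow is illustrative, not the literal source):

```c
if (reln->smgr_relpersistence == 0 && mdexists(reln, forknum))
{
    /*
     * Race window: between mdexists() and mdwritev(), another backend
     * can complete the unlogged build and unlink the local file, so
     * the write below fails with "could not open file".
     */
    mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
}
```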

Summary of changes

Add a cache for relation kind, which avoids mdexists calls and eliminates the race condition at the end of an unlogged build.

@knizhnik knizhnik requested review from a team as code owners June 8, 2025 15:29
@knizhnik knizhnik requested review from dimitri, bizwark and problame June 8, 2025 15:29

github-actions bot commented Jun 8, 2025

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.


github-actions bot commented Jun 8, 2025

9130 tests run: 8475 passed, 0 failed, 655 skipped (full report)


Flaky tests (1)

Postgres 15

Code coverage* (full report)

  • functions: 34.7% (8842 of 25482 functions)
  • lines: 45.7% (71646 of 156719 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
fbd668e at 2025-07-31T16:31:46.640Z ♻️

@myrrc myrrc self-requested a review June 9, 2025 15:04
@myrrc (Contributor) left a comment


I'd like a second opinion about the cache implementation

@alexanderlaw (Contributor) commented

@knizhnik, please pay attention to the new test failures produced for 6bcd46b:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12166/15589546198/index.html#/testresult/55c01525c6a0084d

regress/regression.diffs

diff -U3 /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out
--- /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out	2025-06-11 15:57:37.168439715 +0000
+++ /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out	2025-06-11 16:05:34.917725056 +0000
@@ -376,15 +376,8 @@
 -- This case isn't supported, but it should at least EXPLAIN correctly.
 explain (verbose, costs off)
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
-                                     QUERY PLAN                                     
-------------------------------------------------------------------------------------
- Limit
-   Output: p, ((circle(p, '1'::double precision) <-> '(0,0)'::point))
-   ->  Index Only Scan using gist_tbl_multi_index on public.gist_tbl
-         Output: p, (circle(p, '1'::double precision) <-> '(0,0)'::point)
-         Order By: ((circle(gist_tbl.p, '1'::double precision)) <-> '(0,0)'::point)
-(5 rows)
-
+ERROR:  lock neon_relkind is not held
+CONTEXT:  writing block 0 of relation base/16384/26093
 select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
 ERROR:  lossy distance functions are not supported in index-only scans
 -- Clean up

@knizhnik (Contributor, Author) commented

> @knizhnik, please pay attention to the new test failures produced for 6bcd46b: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12166/15589546198/index.html#/testresult/55c01525c6a0084d

Thank you for reporting the problem.
I hope b81baef will fix it.

@problame (Contributor) commented

Removing myself from review: speaking as a storage person, I don't feel qualified to review this. If there's anything in the design that you want reviewed or cross-checked with the storage team, please write a mini-RFC explaining what is being cached here, how invalidation works, etc.

@hlinnaka (Contributor) left a comment


The locking, with the pinning, spinlock, and lwlock, is a bit hard to grasp. I think it's correct, but I wonder if it could be simplified somehow?

In get_cached_relkind(), you do quite a lot of work while holding a spinlock. In particular, there's a call to LWLockAcquire. Acquiring an lwlock while holding a spinlock seems really bad.

@hlinnaka (Contributor) commented

I think the locking gets more straightforward if you move the LWLockAcquire/Release calls out of relkind_cache.c into pagestore_smgr.c:

  • When starting unlogged build, look up the entry with get_cached_relkind(), and set the relkind to RELKIND_UNLOGGED_BUILD.

  • When ending unlogged build, acquire lock, update the relkind in the entry to RELKIND_PERMANENT, release lock, then release the pin on the entry.

  • In neon_read, call get_cached_relkind() to look up and pin the entry. If relkind is RELKIND_UNLOGGED_BUILD, acquire the LWLock too.

Let's rename relkind_lock to something like finish_unlogged_build_lock. That's really the only thing it protects: the moment when a relation goes from RELKIND_UNLOGGED_BUILD to RELKIND_PERMANENT. A sketch of this scheme follows.
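
A hedged sketch of that scheme; get_cached_relkind() and finish_unlogged_build_lock come from this thread, while set_cached_relkind() and unpin_relkind_entry() are hypothetical helper names:

```c
/* Ending an unlogged build, now directly in pagestore_smgr.c: */
LWLockAcquire(finish_unlogged_build_lock, LW_EXCLUSIVE);
set_cached_relkind(entry, RELKIND_PERMANENT);   /* hypothetical setter */
LWLockRelease(finish_unlogged_build_lock);
unpin_relkind_entry(entry);                     /* hypothetical unpin */

/* In the smgr I/O path: */
entry = get_cached_relkind(rinfo);              /* looks up and pins the entry */
if (entry->relkind == RELKIND_UNLOGGED_BUILD)
    LWLockAcquire(finish_unlogged_build_lock, LW_SHARED);
```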

@hlinnaka (Contributor) commented Jul 11, 2025

During an unlogged build, I assume most of the writes happen in the process that's building the index. All those writes will also call get_cached_relkind() and acquire/release the lwlock. Could we easily bypass that for the process building the index? Or is it a premature optimization?

Never mind, we only call get_cached_relkind() if the SMgrRelation entry doesn't have the relpersistence. In the process building the index, that should be up to date.

@knizhnik (Contributor, Author) commented

> I think the locking gets more straightforward if you move the LWLockAcquire/Release calls out of relkind_cache.c into pagestore_smgr.c:

I want to think more about it tomorrow, but right now I just want to share my concerns.

  1. I am not sure that it is correct to have a gap between releasing the spinlock protecting relkind_hash and obtaining the shared lock. I think it can cause a race, but I will think more about it tomorrow.
  2. I think that setting the relkind to RELKIND_UNLOGGED_BUILD should be done under the spinlock. Otherwise it can, once again, cause a race.
  3. neon_read doesn't need to care about unlogged builds; I think you mixed it up with neon_write.

I am still not sure that moving the lock to pagestore_smgr.c really makes the logic more straightforward. My intention was to hide all implementation details in relkind_cache. Certainly the API is not so simple, but it still seems better than delegating some locking responsibilities to pagestore_smgr.

@knizhnik (Contributor, Author) commented

> I think the locking gets more straightforward if you move the LWLockAcquire/Release calls out of relkind_cache.c into pagestore_smgr.c:

It is not correct to allow a gap between releasing the spinlock and acquiring the shared lock in neon_write, because during this gap the backend performing the unlogged build can complete the build, update the status under the exclusive lock, and then remove the relation files. In that case the write to the file will fail.

@knizhnik knizhnik requested a review from hlinnaka July 12, 2025 13:07
@knizhnik (Contributor, Author) commented

I addressed all your comments except two.
The main one is your suggestion to move the lwlock to pagestore_smgr.c.
Why doesn't it work?

  1. Backend 1 evicts a page from shared buffers and calls neon_write, which in turn calls get_cached_relkind() to get the relkind. Assume that it returns UNLOGGED_BUILD because an unlogged build is being performed by backend 2. No locks are held at this moment.
  2. Backend 2 completes the unlogged build. It obtains the exclusive lock and changes relkind to PERMANENT. Then it releases the lock and removes the relation file on local disk.
  3. Backend 1 obtains the shared lock and starts writing the file. The write may fail because the file was removed by backend 2.

How is this problem handled now?
get_cached_relkind() rechecks relkind after obtaining the shared lwlock (see the sketch below). If it is still UNLOGGED_BUILD, then backend 1 may write the file, and the lwlock protects the files from being removed by backend 2.
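
A hedged sketch of that recheck as described (the interaction with the hash-table spinlock is elided, and everything except get_cached_relkind() and finish_unlogged_build_lock is an assumption):

```c
if (get_cached_relkind(rinfo) == RELKIND_UNLOGGED_BUILD)
{
    LWLockAcquire(finish_unlogged_build_lock, LW_SHARED);
    if (get_cached_relkind(rinfo) == RELKIND_UNLOGGED_BUILD)
    {
        /*
         * Still an unlogged build: the shared lock now prevents the
         * builder from switching the relkind to PERMANENT and
         * unlinking the file, so the local write is safe.
         */
        mdwritev(reln, forknum, blocknum, buffers, nblocks, skipFsync);
    }
    /*
     * Otherwise the build finished while we waited for the lock, and
     * the write must go through the regular pageserver path instead.
     */
    LWLockRelease(finish_unlogged_build_lock);
}
```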

Can it be reimplemented in a different way? Certainly it can. But I think it is good to keep all locking logic in one place (relkind_cache) rather than split it between relkind_cache and pagestore_smgr.
There are two main aspects which IMHO should be addressed:

  1. A write to the file should not block concurrent backends, so no exclusive lock is acceptable here.
  2. Since there are two locks (a spinlock protecting the hash table and an lwlock protecting writes to the file), we should be careful to avoid deadlock. Deadlock is currently impossible because we may request the lwlock while holding the spinlock, but never the opposite.

In principle, obtaining the lwlock under the spinlock is not necessary, because we can never try to evict a page of an unlogged-build index before the unlogged build has started. So no race is possible at the beginning of an unlogged build. But I do not want to rely on that.

So I have renamed the lock to finish_unlogged_build_lock as you proposed, but I do not want to change the current locking scheme. I do not see how an alternative implementation could simplify it or improve performance.

Concerning eviction from relkind_cache and relying on HASH_ENTER_NULL: this is less fundamental. But once again, I do not want to rely on dynahash behaviour (see the sketch below): when does it report hash overflow, and can that affect other hashes? (Apparently not, for a shared hash.) And if we someday decide to switch to a partitioned hash, the behaviour of HASH_ENTER_NULL is even more obscure. The current implementation may be a little paranoid and redundant, but it seems more predictable and reliable. And it is the same as in relsize_cache. If we want to change it, it makes sense to create a separate PR that changes it in both caches.
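
For context, a minimal sketch of the dynahash behaviour in question (relkind_hash and the RelKindEntry type are assumed names): with HASH_ENTER_NULL, hash_search() returns NULL when a shared hash is full instead of raising an error.

```c
bool          found;
RelKindEntry *entry = (RelKindEntry *)
    hash_search(relkind_hash, &key, HASH_ENTER_NULL, &found);

if (entry == NULL)
{
    /*
     * The shared hash is out of memory. With HASH_ENTER this would
     * raise an error; with HASH_ENTER_NULL the caller must recover,
     * e.g. by evicting an unpinned entry and retrying.
     */
}
```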

Konstantin Knizhnik and others added 18 commits July 29, 2025 08:04
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
- Make it faster by using GIN instead of SP-GiST

- Make it more robust by checking some of the assumptions, like that
  the index is larger than 1 GB

- Improve comments
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
@hlinnaka (Contributor) left a comment


LGTM now.

cc @MMeent, you had plans for changes in this area as part of the v18 merge; this might become obsolete with those. But I think we can proceed with this now, because we have a concrete bug to fix, and possibly revert later if those other changes land.

knizhnik and others added 4 commits July 31, 2025 18:19
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Successfully merging this pull request may close these issues.

Creating large spgist indexes leads to "could not open file" error