feat: change JWT cache to limited LRU based cache #4008


Draft · wants to merge 1 commit into base: main

Conversation


@taimoorzaeem (Collaborator) commented Apr 12, 2025

BREAKING CHANGE

Closes #4003.

  • Implementation
  • Tests
  • Docs

@taimoorzaeem taimoorzaeem marked this pull request as draft April 12, 2025 06:02
@taimoorzaeem (Collaborator Author)

  • How should the two configs jwt-cache-max-lifetime and jwt-cache-max-entries behave relative to each other?
  • What should be the default value for jwt-cache-max-entries config?
  • How should we handle the case where the config jwt-cache-max-entries is set to 0? The library throws an exception:
-- | Make an LRU.  If a size limit is specified, the LRU is guaranteed
-- to not grow above the specified number of entries.
newLRU :: (Ord key) => Maybe Integer -- ^ the optional maximum size of the LRU
       -> LRU key val
newLRU (Just s)  | s <= 0 = error "non-positive size LRU"
newLRU s = LRU Nothing Nothing Map.empty

Should we fail at startup?

@wolfgangwalther (Member)

What exactly is the breaking part here? Just the different config options?

@taimoorzaeem (Collaborator Author)

What exactly is the breaking part here? Just the different config options?

I think the breaking part is not the interface here (because we didn't remove the old config) but the part where we changed the implementation. Is that considered breaking?

@wolfgangwalther (Member) commented Apr 12, 2025

I think the breaking part is not the interface here (because we didn't remove the old config) but the part where we changed the implementation.

Not changing the interface, but only the implementation is pretty much the definition of "not breaking", I think :D

@taimoorzaeem (Collaborator Author) commented Apr 12, 2025

Not changing the interface, but only the implementation is pretty much the definition of "not breaking", I think :D

Thanks 👍. Then I guess this PR would only be breaking if we decided to remove the jwt-cache-max-lifetime config, which is unlikely. If we are only adding a new config, then I guess this is just another feature and hence "not breaking".

Btw, having two configs for one feature seems excessive. Maybe there is a way to use only one config for this.

@steve-chavez (Member) commented Apr 12, 2025

jwt-cache-max-lifetime and jwt-cache-max-entries

Does it still make sense to have jwt-cache-max-lifetime? The main motivation for that config was to keep memory usage at bay. I believe jwt-cache-max-entries should be enough now, so we could remove the other config (to keep things simple for users).

What should be the default value for jwt-cache-max-entries config?

I think it should start at 1000. I don't have the reference handy, but I previously commented on the average JWT size.

How should we handle the case where the config jwt-cache-max-entries is set to 0? The library throws an exception:
Should we fail at startup?

I believe that 0 should mean the JWT cache is disabled.
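A startup guard along these lines would implement that semantic, keeping a non-positive size away from the LRU constructor entirely. This is a base-plus-containers sketch with hypothetical names (initJwtCache, JwtCache), not the PR's code:

```haskell
import Data.IORef (IORef, newIORef, readIORef)
import Data.Maybe (isJust, isNothing)
import qualified Data.Map.Strict as Map

-- Hypothetical cache state: Nothing means the JWT cache is disabled.
type JwtCache = Maybe (Map.Map String String)

-- Guard the configured size before it ever reaches the LRU
-- constructor, which errors out on non-positive sizes.
initJwtCache :: Int -> IO (IORef JwtCache)
initJwtCache maxEntries
  | maxEntries <= 0 = newIORef Nothing          -- 0 disables caching
  | otherwise       = newIORef (Just Map.empty) -- bounding done elsewhere

main :: IO ()
main = do
  disabled <- readIORef =<< initJwtCache 0
  enabled  <- readIORef =<< initJwtCache 1000
  print (isNothing disabled, isJust enabled)  -- prints (True,True)
```

With this shape there is no startup failure to handle: a zero simply means requests skip the cache and every JWT is validated directly.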

@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 4da76a0 to d3395b8 Compare April 14, 2025 07:58
@taimoorzaeem (Collaborator Author) commented Apr 14, 2025

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

One spec test is failing when all spec tests are run, but if I run it individually, like postgrest-test-spec --match "/Feature.Auth.NoJwtSecretSpec/server started without JWT secret/responds with error on attempted auth/", it passes. Is this also related to reloading?

@wolfgangwalther (Member)

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

There can also be cases where the jwt-secret is set to a JWKS, i.e. multiple keys. This is especially important with key rotation, where multiple keys can be considered valid at a given time.

Not sure whether the ability to keep the cache for keys which are still valid would complicate things too much, though...

One spec test is failing when all spec tests are run, but if I run it individually, like postgrest-test-spec --match "/Feature.Auth.NoJwtSecretSpec/server started without JWT secret/responds with error on attempted auth/", it passes. Is this also related to reloading?

I think the spec tests need to be run without jwt cache, because they should be independent. The JWT cache will affect state between tests, so one test depends on which other tests ran before - as you are seeing here. This will also be a problem when running the tests in parallel.

@taimoorzaeem (Collaborator Author)

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

We are currently not doing this in v12. Does that mean we have a bug in v12 that should be fixed? Not sure if this is a severe issue from a security POV.

@wolfgangwalther (Member)

We are currently not doing this in v12. Does that mean we have a bug in v12 that should be fixed? Not sure if this is a severe issue from a security POV.

I think it is a bug that needs to be fixed, yes. Old tokens would continue to work despite the secret being changed, as long as PostgREST is not restarted.

@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 4452e4a to f62b31d Compare April 16, 2025 16:54
@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 0eec86e to e8a19b1 Compare April 17, 2025 07:44
@taimoorzaeem taimoorzaeem marked this pull request as ready for review April 17, 2025 07:45
@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from df3befa to c675a83 Compare April 18, 2025 09:32
@taimoorzaeem taimoorzaeem added the breaking change A bug fix or enhancement that would cause a breaking change label Apr 18, 2025
@steve-chavez (Member)

@taimoorzaeem Please rebase 🙏 , I'll review then.

@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 8fb3df5 to 3c29af1 Compare April 19, 2025 03:41
@taimoorzaeem taimoorzaeem marked this pull request as draft April 19, 2025 16:25
@taimoorzaeem (Collaborator Author)

Waiting for #4023 to be fixed.

@wolfgangwalther (Member) left a comment


How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

Could we have a similar case with the jwt-aud setting?

Also, since we're caching AuthResult, I guess we have the same thing with jwt-role-claim-key, too. If that changes, the AuthResult for a given JWT must change as well - so we need to invalidate the cache.

(those comments are probably more about the existing state, than this PR specifically)

Comment on lines 84 to 88
else do
-- update cache afte lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result
Member


Suggested change
else do
-- update cache afte lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result
else do
-- update cache after lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result

Why do we update the cache after a cache hit?

@taimoorzaeem (Collaborator Author) commented Apr 21, 2025


When we look up, the state of the cache changes (a cache hit updates that entry's recency, because it is a pure LRU cache and that entry is now used "recently"). So we need to write the new cache state back (I should mention this in the comment).
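This is inherent to a pure LRU API: a lookup returns the updated cache alongside the result, and dropping that cache loses the recency update. A toy base-plus-containers sketch of the pattern (not the real lrucache library, which uses a priority queue internally):

```haskell
import qualified Data.Map.Strict as Map

-- Toy pure "LRU": each entry remembers the tick at which it was last used.
data Lru k v = Lru { tick :: Int, entries :: Map.Map k (Int, v) }

emptyLru :: Lru k v
emptyLru = Lru 0 Map.empty

insertLru :: Ord k => k -> v -> Lru k v -> Lru k v
insertLru k v (Lru t m) = Lru (t + 1) (Map.insert k (t, v) m)

-- Even a hit yields a *new* cache value, because the entry's recency
-- changes; the caller must store the returned cache or the LRU order
-- silently stops tracking real usage.
lookupLru :: Ord k => k -> Lru k v -> (Lru k v, Maybe v)
lookupLru k c@(Lru t m) = case Map.lookup k m of
  Nothing     -> (c, Nothing)
  Just (_, v) -> (Lru (t + 1) (Map.insert k (t, v) m), Just v)

main :: IO ()
main = do
  let c0        = insertLru "jwt-a" "roleA" emptyLru
      (c1, hit) = lookupLru "jwt-a" c0
  print (hit, tick c0, tick c1)  -- prints (Just "roleA",1,2)
```

The hit advances the tick, which is exactly why the IORef write after a cache hit in the PR cannot simply be dropped.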

@mkleczek (Contributor) commented Apr 21, 2025


See my comments about the lack of thread safety here and the possible lost updates - these mean we do not have up-to-date information about cache usage, so the whole idea of an LRU cache no longer makes much sense.

OTOH updating a shared variable upon each cache hit is going to have a penalty because of contention under load.

To be honest - looking at this I am not convinced an LRU cache is a good idea - at least not with the current immutable LRU cache implementation.

Not really sure what to do with this but Nikita Volkov implemented https://hackage.haskell.org/package/stm-containers which should provide a map implementation that minimizes contention.

Or we could stick to current implementation but use a better cache implementation: https://hackage.haskell.org/package/psqueues comes to mind as it should provide efficient purging.

Comment on lines +92 to +103
-- | Check if exp claim is expired when looked up from cache
isExpClaimExpired :: AuthResult -> UTCTime -> Expired
isExpClaimExpired result utc =
case expireJSON of
Nothing -> False -- if exp not present then it is valid
Just (JSON.Number expiredAt) -> (sciToInt expiredAt - now) < 0
Just _ -> False -- if exp is not a number then valid
where
expireJSON = KM.lookup "exp" (authClaims result)
now = (floor . nominalDiffTimeToSeconds . utcTimeToPOSIXSeconds) utc :: Int
sciToInt = fromMaybe 0 . Sci.toBoundedInteger
Member


Does this take the 30 seconds clock skew we allow into account?
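For reference, taking skew into account would mean comparing exp against the current time minus the allowed leeway. A sketch with hypothetical names, where the 30-second figure is taken from the question above:

```haskell
-- Allowed clock skew, per the 30-second leeway mentioned above.
clockSkewSeconds :: Int
clockSkewSeconds = 30

-- A token whose exp falls within the skew window is still treated as
-- valid, mirroring the leeway applied during JWT validation itself.
isExpiredWithSkew :: Int -> Int -> Bool
isExpiredWithSkew expiredAt now = expiredAt + clockSkewSeconds < now

main :: IO ()
main =
  -- exp=100: still valid at t=120 (within skew), expired at t=131
  print (isExpiredWithSkew 100 120, isExpiredWithSkew 100 131)  -- prints (False,True)
```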


BREAKING CHANGE

Our JWT cache implementation had no upper bound for number of cache
entries. This caused OOM errors. Additionally, the purge mechanism for
expired entries was quite slow. This changes our implementation to an
LRU-based cache which limits the number of cached entries.
@taimoorzaeem (Collaborator Author) commented Apr 21, 2025

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

Could we have a similar case with the jwt-aud setting?

Also, since we're caching AuthResult, I guess we have the same thing with jwt-role-claim-key, too. If that changes, the AuthResult for a given JWT must change as well - so we need to invalidate the cache.

(those comments are probably more about the existing state, than this PR specifically)

@wolfgangwalther Yes, this is important. We should discuss config reloading in more detail in the context of authentication. Could you open an issue with this?

-- update cache after lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result

@mkleczek (Contributor) commented Apr 21, 2025


The above code is not thread safe - it is a read-modify-write and causes lost updates.
Granted, in the case of caching, lost updates might not be that big of a deal (although see my comment about possible performance implications), but maybe a TVar instead of an IORef would be better here.
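For the IORef variant specifically, base's atomicModifyIORef' already closes the lost-update window by performing the read-modify-write as a single atomic step. A sketch of the pattern (a plain counter standing in for the cache state, not the PR's code):

```haskell
import Control.Monad (replicateM_)
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)

-- Lost-update-prone shape (three separate steps):
--   c <- readIORef ref; let c' = touch key c; writeIORef ref c'
-- A concurrent update between the read and the write is silently overwritten.
--
-- Atomic shape: the modification function runs inside one atomic update,
-- so no interleaved write can be lost.
bumpAtomically :: IORef Int -> IO Int
bumpAtomically ref = atomicModifyIORef' ref (\n -> (n + 1, n))

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  replicateM_ 1000 (bumpAtomically ref)
  print =<< readIORef ref  -- prints 1000
```

Atomicity alone does not remove the contention concern raised above: every hit still serializes on the same reference, which is what motivates the striping discussion later in the thread.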

@taimoorzaeem (Collaborator Author)


I think we should run the entire lookupJwtCache function atomically. This would make it more thread safe.

@mkleczek (Contributor)


If lookupJwtCache is atomic then you introduce very high contention as all threads wait for each other to just perform lookup. The problem is even worse because with LRU cache lookups are not read-only anymore, so the shared variable has to be updated upon every access.

I don't think there is a good solution to this without a proper mutable concurrent high performance LRU cache implementation.

@steve-chavez @wolfgangwalther - what do you think?

Member


This sounds very plausible, @mkleczek.

I suggest we use real numbers to see the effect. Making the whole lookupJwtCache function atomic should not be hard (?) - and then we can run the JWT loadtest on it. Then compare that to running with https://hackage.haskell.org/package/psqueues. Ideally, we'd have both branches available, so we can run a comparison in one go.

@taimoorzaeem (Collaborator Author)


If lookupJwtCache is atomic then you introduce very high contention as all threads wait for each other to just perform lookup. The problem is even worse because with LRU cache lookups are not read-only anymore, so the shared variable has to be updated upon every access.

With many threads, performance for any shared resource will take a hit no matter what because of mutual exclusion.

I don't think there is a good solution to this without a proper mutable concurrent high performance LRU cache implementation.

So is it even possible to have "high" performance with concurrency when waiting is involved? I think the only way to maintain performance here would be to decrease waiting, which means we would need a distributed cache, which would get complex. 🤔

Member


Maintaining a separate library would have been possible if we only used hackage+cabal, but we don't. It takes months for a new version of a library to be available on Stack and NixOS. I think our own implementation in postgrest would be our best bet.

I don't think any of that is a blocker for a separate lib, all of that can easily be done immediately via overrides etc.

Still, building our own library for that is too big of a task to pursue anyway.

@mkleczek (Contributor)


@taimoorzaeem @wolfgangwalther
After some research I found the right library: https://hackage.haskell.org/package/lrucaching-0.3.4

Its implementation is based on the blog post https://jaspervdj.be/posts/2015-02-24-lru-cache.html and uses psqueues.

It provides a version of the cache that minimises contention by splitting the cache into stripes:
https://hackage.haskell.org/package/lrucaching-0.3.4/docs/Data-LruCache-IO.html#v:stripedCached

I guess we could set the number of stripes to getNumCapabilities (https://hackage.haskell.org/package/base-4.21.0.0/docs/GHC-Conc.html#v:getNumCapabilities) and not bother users with configuring it.

It does not support expiration though. It would be good to add it, otherwise the cache is vulnerable to cache-thrashing attacks from continuously sending requests with expired JWTs.
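The striping trick can be illustrated with base and containers alone: keep several independent IORefs and route each key to one by hashing, so writers only contend when they land on the same stripe. This is an illustrative sketch, not the lrucaching implementation, and the character-sum hash is a crude stand-in for a real hash function:

```haskell
import Data.Char (ord)
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)
import qualified Data.Map.Strict as Map

-- One independent map per stripe; contention only happens within a stripe.
newtype Striped v = Striped [IORef (Map.Map String v)]

newStriped :: Int -> IO (Striped v)
newStriped n = Striped <$> mapM (const (newIORef Map.empty)) [1 .. n]

-- Crude stand-in for a real hash function (e.g. Hashable's hash).
stripeFor :: Striped v -> String -> IORef (Map.Map String v)
stripeFor (Striped refs) k = refs !! (sum (map ord k) `mod` length refs)

insertStriped :: Striped v -> String -> v -> IO ()
insertStriped s k v =
  atomicModifyIORef' (stripeFor s k) (\m -> (Map.insert k v m, ()))

lookupStriped :: Striped v -> String -> IO (Maybe v)
lookupStriped s k = Map.lookup k <$> readIORef (stripeFor s k)

main :: IO ()
main = do
  s <- newStriped 8  -- e.g. one stripe per capability, as suggested above
  insertStriped s "jwt-a" "roleA"
  print =<< lookupStriped s "jwt-a"  -- prints Just "roleA"
```

Each stripe here is a plain map for brevity; in the library each stripe is itself a bounded LRU, so the capacity bound still holds per stripe.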

@taimoorzaeem (Collaborator Author) commented May 1, 2025


This implementation is pure as well, so we'd still need to write the new cache on a lookup 😢.

It does not support expiration though. It would be good to add it otherwise the cache is vulnerable to cache thrashing attacks by continuously sending requests with expired JWTs.

That is not an issue at all. We don't need built-in expiration support. We can just check the exp claim on lookup; it would just be an O(1) task.

@mkleczek (Contributor)


This implementation is pure as well, so we'd still need to write the new cache on a lookup 😢.

It is not - see this link: https://hackage.haskell.org/package/lrucaching-0.3.4/docs/Data-LruCache-IO.html#v:stripedCached

@taimoorzaeem (Collaborator Author)


stripedCached does not have any interface for insert, update, delete, etc. Internally, it is still an LruCache.

-- | JWT Cache
, stateJwtCache :: JwtCache.JwtCacheState
-- | JWT Cache, disabled when config jwt-cache-max-entries is set to 0
, stateJwtCache :: IORef JwtCache.JwtCacheState
@mkleczek (Contributor)


I am not sure an IORef is needed here - we already have a mutable variable inside JwtCacheState. Why do we need another one?
A cleaner solution would be to export a reset :: JwtCacheState -> IO () function from the JwtCache module and call it upon config reload.
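The suggested reset could be as small as swapping the cache's own reference back to empty. A sketch assuming JwtCacheState wraps an IORef; the type shape and constructor here are hypothetical, not the PR's actual definition:

```haskell
import Data.IORef (IORef, newIORef, atomicWriteIORef, readIORef, modifyIORef')
import qualified Data.Map.Strict as Map

-- Hypothetical: the cache state owns its mutable reference, so the
-- AppState does not need a second IORef wrapped around it.
newtype JwtCacheState = JwtCacheState (IORef (Map.Map String String))

newJwtCacheState :: IO JwtCacheState
newJwtCacheState = JwtCacheState <$> newIORef Map.empty

-- Called from the config-reload path to invalidate every entry,
-- e.g. when jwt-secret, jwt-aud, or jwt-role-claim-key change.
reset :: JwtCacheState -> IO ()
reset (JwtCacheState ref) = atomicWriteIORef ref Map.empty

main :: IO ()
main = do
  st@(JwtCacheState ref) <- newJwtCacheState
  modifyIORef' ref (Map.insert "jwt-a" "roleA")
  reset st
  print . Map.null =<< readIORef ref  -- prints True
```

Because the reference never leaves the module, callers cannot observe or replace the cache wholesale; they can only ask for a reset.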

loggerState <- Logger.init
metricsState <- Metrics.init configDbPoolSize
let observer = liftA2 (>>) (Logger.observationLogger loggerState configLogLevel) (Metrics.observationMetrics metricsState)
metricsState <- Metrics.init (configDbPoolSize conf)
@mkleczek (Contributor)


Does this change have anything to do with JWT cache?

@taimoorzaeem (Collaborator Author)

Concurrency and atomic operations are important for correctness and consistency, but we also need a mutable reference for the LRU cache type. So I think we would need to use both TVar and IORef, for transactional operations and mutable state respectively. Considering this, I think this would need quite a bit of refactoring.

@taimoorzaeem (Collaborator Author)

Because the existing libraries don't satisfy our requirements, I guess our best bet is to come up with our own cache implementation. It might take a while, and it would probably delay the v13 release. @steve-chavez WDYT?


Successfully merging this pull request may close these issues.

LRU based cache for JWT