feat: change JWT cache to limited LRU based cache #4008


Draft · wants to merge 1 commit into base: main

Conversation


@taimoorzaeem (Collaborator) commented Apr 12, 2025

BREAKING CHANGE

Closes #4003.

  • Implementation
  • Tests
  • Docs

@taimoorzaeem taimoorzaeem marked this pull request as draft April 12, 2025 06:02
@taimoorzaeem (Collaborator Author)

  • How should the two configs jwt-cache-max-lifetime and jwt-cache-max-entries behave relative to each other?
  • What should be the default value for jwt-cache-max-entries config?
  • How should we handle the case where the config jwt-cache-max-entries is set to 0? The library throws an exception:
-- | Make an LRU.  If a size limit is specified, the LRU is guaranteed
-- to not grow above the specified number of entries.
newLRU :: (Ord key) => Maybe Integer -- ^ the optional maximum size of the LRU
       -> LRU key val
newLRU (Just s)  | s <= 0 = error "non-positive size LRU"
newLRU s = LRU Nothing Nothing Map.empty

Should we fail at startup?

@wolfgangwalther (Member)

What exactly is the breaking part here? Just the different config options?

@taimoorzaeem (Collaborator Author)

What exactly is the breaking part here? Just the different config options?

I think the breaking part is not the interface here (because we didn't remove the old config) but the part where we changed the implementation. Is that considered breaking?

@wolfgangwalther (Member) commented Apr 12, 2025

I think the breaking part is not the interface here (because we didn't remove the old config) but the part where we changed the implementation.

Not changing the interface, but only the implementation is pretty much the definition of "not breaking", I think :D

@taimoorzaeem (Collaborator Author) commented Apr 12, 2025

Not changing the interface, but only the implementation is pretty much the definition of "not breaking", I think :D

Thanks 👍. Then I guess this PR would only be breaking if we decided to remove the jwt-cache-max-lifetime config, which is unlikely. If we are only adding a new config, then I guess this is just another feature and hence "not breaking".

Btw, having two configs for one feature seems excessive. Maybe there is a way to use only one config for this.

@steve-chavez (Member) commented Apr 12, 2025

jwt-cache-max-lifetime and jwt-cache-max-entries

Does it still make sense to have jwt-cache-max-lifetime? The main motivation for that config was to keep memory usage at bay. I believe jwt-cache-max-entries should be enough now, so we could remove the other config (to keep things simple for users).

What should be the default value for jwt-cache-max-entries config?

I think it should start at 1000. I don't have the reference handy, but I previously commented on the average JWT size.

How should we handle the case where the config jwt-cache-max-entries is set to 0? The library throws an exception:
Should we fail at startup?

I believe that 0 should mean the JWT cache is disabled.
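A startup guard along these lines would implement that semantic, keeping a non-positive size away from the LRU constructor entirely. This is a base-plus-containers sketch with hypothetical names (initJwtCache, JwtCache), not the PR's code:

```haskell
import Data.IORef (IORef, newIORef, readIORef)
import Data.Maybe (isJust, isNothing)
import qualified Data.Map.Strict as Map

-- Hypothetical cache state: Nothing means the JWT cache is disabled.
type JwtCache = Maybe (Map.Map String String)

-- Guard the configured size before it ever reaches the LRU
-- constructor, which errors out on non-positive sizes.
initJwtCache :: Int -> IO (IORef JwtCache)
initJwtCache maxEntries
  | maxEntries <= 0 = newIORef Nothing          -- 0 disables caching
  | otherwise       = newIORef (Just Map.empty) -- bounding done elsewhere

main :: IO ()
main = do
  disabled <- readIORef =<< initJwtCache 0
  enabled  <- readIORef =<< initJwtCache 1000
  print (isNothing disabled, isJust enabled)  -- prints (True,True)
```

With this shape there is no startup failure to handle: a zero simply means requests skip the cache and every JWT is validated directly.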

@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 4da76a0 to d3395b8 Compare April 14, 2025 07:58
@taimoorzaeem (Collaborator Author) commented Apr 14, 2025

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

One spec test is failing when all spec tests are run, but if I run it individually, like postgrest-test-spec --match "/Feature.Auth.NoJwtSecretSpec/server started without JWT secret/responds with error on attempted auth/", it passes. Is this also related to reloading?

@wolfgangwalther (Member)

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

There can also be cases where the jwt-secret is set to a JWKS, i.e. multiple keys. This is especially important with key rotation, where multiple keys can be considered valid at a given time.

Not sure whether the ability to keep the cache for keys which are still valid would complicate things too much, though...

One spec test is failing when all spec tests are run, but if I run it individually, like postgrest-test-spec --match "/Feature.Auth.NoJwtSecretSpec/server started without JWT secret/responds with error on attempted auth/", it passes. Is this also related to reloading?

I think the spec tests need to be run without jwt cache, because they should be independent. The JWT cache will affect state between tests, so one test depends on which other tests ran before - as you are seeing here. This will also be a problem when running the tests in parallel.

@taimoorzaeem (Collaborator Author)

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

We are currently not doing this in v12. Does that mean we have a bug in v12 that should be fixed? Not sure if this is a severe issue from a security POV.

@wolfgangwalther (Member)

We are currently not doing this in v12. Does that mean we have a bug in v12 that should be fixed? Not sure if this is a severe issue from a security POV.

I think it is a bug that needs to be fixed, yes. Old tokens would continue to work despite the secret being changed, as long as PostgREST is not restarted.

@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 4452e4a to f62b31d Compare April 16, 2025 16:54
@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 0eec86e to e8a19b1 Compare April 17, 2025 07:44
@taimoorzaeem taimoorzaeem marked this pull request as ready for review April 17, 2025 07:45
@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from df3befa to c675a83 Compare April 18, 2025 09:32
@taimoorzaeem taimoorzaeem added the breaking change A bug fix or enhancement that would cause a breaking change label Apr 18, 2025
@steve-chavez (Member)

@taimoorzaeem Please rebase 🙏 , I'll review then.

@taimoorzaeem taimoorzaeem force-pushed the jwt/lrucache branch 2 times, most recently from 8fb3df5 to 3c29af1 Compare April 19, 2025 03:41
@taimoorzaeem taimoorzaeem marked this pull request as draft April 19, 2025 16:25
@taimoorzaeem (Collaborator Author)

Waiting for #4023 to be fixed.

@wolfgangwalther (Member) left a comment


How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

Could we have a similar case with the jwt-aud setting?

Also, since we're caching AuthResult, I guess we have the same thing with jwt-role-claim-key, too. If that changes, the AuthResult for a given JWT must change as well - so we need to invalidate the cache.

(those comments are probably more about the existing state, than this PR specifically)

Comment on lines 84 to 88
else do
-- update cache afte lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result
Member


Suggested change
else do
-- update cache afte lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result
else do
-- update cache after lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result

Why do we update the cache after a cache hit?

@taimoorzaeem (Collaborator Author) commented Apr 21, 2025


When we look up, the state of the cache changes (a cache hit updates that entry's recency, because it is a pure LRU cache and that entry is now used "recently"). So we need to write the new cache state back (I should mention this in the comment).
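This is inherent to a pure LRU API: a lookup returns the updated cache alongside the result, and dropping that cache loses the recency update. A toy base-plus-containers sketch of the pattern (not the real lrucache library, which uses a priority queue internally):

```haskell
import qualified Data.Map.Strict as Map

-- Toy pure "LRU": each entry remembers the tick at which it was last used.
data Lru k v = Lru { tick :: Int, entries :: Map.Map k (Int, v) }

emptyLru :: Lru k v
emptyLru = Lru 0 Map.empty

insertLru :: Ord k => k -> v -> Lru k v -> Lru k v
insertLru k v (Lru t m) = Lru (t + 1) (Map.insert k (t, v) m)

-- Even a hit yields a *new* cache value, because the entry's recency
-- changes; the caller must store the returned cache or the LRU order
-- silently stops tracking real usage.
lookupLru :: Ord k => k -> Lru k v -> (Lru k v, Maybe v)
lookupLru k c@(Lru t m) = case Map.lookup k m of
  Nothing     -> (c, Nothing)
  Just (_, v) -> (Lru (t + 1) (Map.insert k (t, v) m), Just v)

main :: IO ()
main = do
  let c0        = insertLru "jwt-a" "roleA" emptyLru
      (c1, hit) = lookupLru "jwt-a" c0
  print (hit, tick c0, tick c1)  -- prints (Just "roleA",1,2)
```

The hit advances the tick, which is exactly why the IORef write after a cache hit in the PR cannot simply be dropped.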

@mkleczek (Contributor) commented Apr 21, 2025


See my comments about the lack of thread safety here and the possible lost updates - these mean we do not have up-to-date information about cache usage, so the whole idea of an LRU cache no longer makes much sense.

OTOH updating a shared variable upon each cache hit is going to have a penalty because of contention under load.

To be honest - looking at this I am not convinced an LRU cache is a good idea - at least not with the current immutable LRU cache implementation.

Not really sure what to do with this but Nikita Volkov implemented https://hackage.haskell.org/package/stm-containers which should provide a map implementation that minimizes contention.

Or we could stick to current implementation but use a better cache implementation: https://hackage.haskell.org/package/psqueues comes to mind as it should provide efficient purging.

Comment on lines +92 to +103
-- | Check if exp claim is expired when looked up from cache
isExpClaimExpired :: AuthResult -> UTCTime -> Expired
isExpClaimExpired result utc =
case expireJSON of
Nothing -> False -- if exp not present then it is valid
Just (JSON.Number expiredAt) -> (sciToInt expiredAt - now) < 0
Just _ -> False -- if exp is not a number then valid
where
expireJSON = KM.lookup "exp" (authClaims result)
now = (floor . nominalDiffTimeToSeconds . utcTimeToPOSIXSeconds) utc :: Int
sciToInt = fromMaybe 0 . Sci.toBoundedInteger
Member


Does this take the 30 seconds clock skew we allow into account?
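For reference, taking skew into account would mean comparing exp against the current time minus the allowed leeway. A sketch with hypothetical names, where the 30-second figure is taken from the question above:

```haskell
-- Allowed clock skew, per the 30-second leeway mentioned above.
clockSkewSeconds :: Int
clockSkewSeconds = 30

-- A token whose exp falls within the skew window is still treated as
-- valid, mirroring the leeway applied during JWT validation itself.
isExpiredWithSkew :: Int -> Int -> Bool
isExpiredWithSkew expiredAt now = expiredAt + clockSkewSeconds < now

main :: IO ()
main =
  -- exp=100: still valid at t=120 (within skew), expired at t=131
  print (isExpiredWithSkew 100 120, isExpiredWithSkew 100 131)  -- prints (False,True)
```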


BREAKING CHANGE

Our JWT cache implementation had no upper bound for number of cache
entries. This caused OOM errors. Additionally, the purge mechanism for
expired entries was quite slow. This changes our implementation to an
LRU-based cache which limits the number of cached entries.
@taimoorzaeem (Collaborator Author) commented Apr 21, 2025

How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

Could we have a similar case with the jwt-aud setting?

Also, since we're caching AuthResult, I guess we have the same thing with jwt-role-claim-key, too. If that changes, the AuthResult for a given JWT must change as well - so we need to invalidate the cache.

(those comments are probably more about the existing state, than this PR specifically)

@wolfgangwalther Yes, this is important. We should discuss config reloading in more detail in the context of authentication. Could you open an issue with this?

-- update cache after lookup -> return result
I.writeIORef jwtCacheIORef jwtCache'
return $ Right result

@mkleczek (Contributor) commented Apr 21, 2025


The above code is not thread safe - it is a read-modify-write and causes lost updates.
Granted, in the case of caching, lost updates might not be that big of a deal (although see my comment about possible performance implications), but maybe a TVar instead of an IORef would be better here.
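For the IORef variant specifically, base's atomicModifyIORef' already closes the lost-update window by performing the read-modify-write as a single atomic step. A sketch of the pattern (a plain counter standing in for the cache state, not the PR's code):

```haskell
import Control.Monad (replicateM_)
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)

-- Lost-update-prone shape (three separate steps):
--   c <- readIORef ref; let c' = touch key c; writeIORef ref c'
-- A concurrent update between the read and the write is silently overwritten.
--
-- Atomic shape: the modification function runs inside one atomic update,
-- so no interleaved write can be lost.
bumpAtomically :: IORef Int -> IO Int
bumpAtomically ref = atomicModifyIORef' ref (\n -> (n + 1, n))

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  replicateM_ 1000 (bumpAtomically ref)
  print =<< readIORef ref  -- prints 1000
```

Atomicity alone does not remove the contention concern raised above: every hit still serializes on the same reference, which is what motivates the striping discussion later in the thread.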

@taimoorzaeem (Collaborator Author)


I think we should run the entire lookupJwtCache function atomically. This would make it more thread safe.

@mkleczek (Contributor)


If lookupJwtCache is atomic then you introduce very high contention as all threads wait for each other to just perform lookup. The problem is even worse because with LRU cache lookups are not read-only anymore, so the shared variable has to be updated upon every access.

I don't think there is a good solution to this without a proper mutable concurrent high performance LRU cache implementation.

@steve-chavez @wolfgangwalther - what do you think?

Member


This sounds very plausible, @mkleczek.

I suggest we use real numbers to see the effect. Making the whole lookupJwtCache function atomic should not be hard (?) - and then we can run the JWT loadtest on it. Then compare that to running with https://hackage.haskell.org/package/psqueues. Ideally, we'd have both branches available, so we can run a comparison in one go.

@taimoorzaeem (Collaborator Author)


If lookupJwtCache is atomic then you introduce very high contention as all threads wait for each other to just perform lookup. The problem is even worse because with LRU cache lookups are not read-only anymore, so the shared variable has to be updated upon every access.

With many threads, performance for any shared resource will take a hit no matter what because of mutual exclusion.

I don't think there is a good solution to this without a proper mutable concurrent high performance LRU cache implementation.

So is it even possible to have "high" performance with concurrency when waiting is involved? I think the only way to maintain performance here would be to decrease waiting, which means we would need a distributed cache, which would get complex. 🤔

Member


Maintaining a separate library would have been possible if we only used hackage+cabal, but we don't. It takes months for a new version of a library to be available on Stack and NixOS. I think our own implementation in postgrest would be our best bet.

I don't think any of that is a blocker for a separate lib, all of that can easily be done immediately via overrides etc.

Still, building our own library for that is too big of a task to pursue anyway.

@mkleczek (Contributor)


@taimoorzaeem @wolfgangwalther
After some research I found the right library: https://hackage.haskell.org/package/lrucaching-0.3.4

Its implementation is based on the blog post https://jaspervdj.be/posts/2015-02-24-lru-cache.html and uses psqueues.

It provides a version of the cache that minimises contention by splitting the cache into stripes:
https://hackage.haskell.org/package/lrucaching-0.3.4/docs/Data-LruCache-IO.html#v:stripedCached

I guess we could set the number of stripes to getNumCapabilities (https://hackage.haskell.org/package/base-4.21.0.0/docs/GHC-Conc.html#v:getNumCapabilities) and not bother users with configuring it.

It does not support expiration though. It would be good to add it, otherwise the cache is vulnerable to cache-thrashing attacks from continuously sending requests with expired JWTs.
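The striping trick can be illustrated with base and containers alone: keep several independent IORefs and route each key to one by hashing, so writers only contend when they land on the same stripe. This is an illustrative sketch, not the lrucaching implementation, and the character-sum hash is a crude stand-in for a real hash function:

```haskell
import Data.Char (ord)
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)
import qualified Data.Map.Strict as Map

-- One independent map per stripe; contention only happens within a stripe.
newtype Striped v = Striped [IORef (Map.Map String v)]

newStriped :: Int -> IO (Striped v)
newStriped n = Striped <$> mapM (const (newIORef Map.empty)) [1 .. n]

-- Crude stand-in for a real hash function (e.g. Hashable's hash).
stripeFor :: Striped v -> String -> IORef (Map.Map String v)
stripeFor (Striped refs) k = refs !! (sum (map ord k) `mod` length refs)

insertStriped :: Striped v -> String -> v -> IO ()
insertStriped s k v =
  atomicModifyIORef' (stripeFor s k) (\m -> (Map.insert k v m, ()))

lookupStriped :: Striped v -> String -> IO (Maybe v)
lookupStriped s k = Map.lookup k <$> readIORef (stripeFor s k)

main :: IO ()
main = do
  s <- newStriped 8  -- e.g. one stripe per capability, as suggested above
  insertStriped s "jwt-a" "roleA"
  print =<< lookupStriped s "jwt-a"  -- prints Just "roleA"
```

Each stripe here is a plain map for brevity; in the library each stripe is itself a bounded LRU, so the capacity bound still holds per stripe.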

@taimoorzaeem (Collaborator Author) commented May 1, 2025


This implementation is pure as well, so we'd still need to write the new cache on a lookup 😢.

It does not support expiration though. It would be good to add it otherwise the cache is vulnerable to cache thrashing attacks by continuously sending requests with expired JWTs.

That is not an issue at all. We don't need built-in expiration support. We can just check the exp claim on lookup; it would just be an O(1) task.

@mkleczek (Contributor)


This implementation is pure as well, so we'd still need to write the new cache on a lookup 😢.

It is not - see this link: https://hackage.haskell.org/package/lrucaching-0.3.4/docs/Data-LruCache-IO.html#v:stripedCached

@taimoorzaeem (Collaborator Author)


stripedCached does not have any interface for insert, update, delete, etc. Internally, it is still an LruCache.

-- | JWT Cache
, stateJwtCache :: JwtCache.JwtCacheState
-- | JWT Cache, disabled when config jwt-cache-max-entries is set to 0
, stateJwtCache :: IORef JwtCache.JwtCacheState
@mkleczek (Contributor)


I am not sure an IORef is needed here - we already have a mutable variable inside JwtCacheState. Why do we need another one?
A cleaner solution would be to export a reset :: JwtCacheState -> IO () function from the JwtCache module and call it upon config reload.
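The suggested reset could be as small as swapping the cache's own reference back to empty. A sketch assuming JwtCacheState wraps an IORef; the type shape and constructor here are hypothetical, not the PR's actual definition:

```haskell
import Data.IORef (IORef, newIORef, atomicWriteIORef, readIORef, modifyIORef')
import qualified Data.Map.Strict as Map

-- Hypothetical: the cache state owns its mutable reference, so the
-- AppState does not need a second IORef wrapped around it.
newtype JwtCacheState = JwtCacheState (IORef (Map.Map String String))

newJwtCacheState :: IO JwtCacheState
newJwtCacheState = JwtCacheState <$> newIORef Map.empty

-- Called from the config-reload path to invalidate every entry,
-- e.g. when jwt-secret, jwt-aud, or jwt-role-claim-key change.
reset :: JwtCacheState -> IO ()
reset (JwtCacheState ref) = atomicWriteIORef ref Map.empty

main :: IO ()
main = do
  st@(JwtCacheState ref) <- newJwtCacheState
  modifyIORef' ref (Map.insert "jwt-a" "roleA")
  reset st
  print . Map.null =<< readIORef ref  -- prints True
```

Because the reference never leaves the module, callers cannot observe or replace the cache wholesale; they can only ask for a reset.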

loggerState <- Logger.init
metricsState <- Metrics.init configDbPoolSize
let observer = liftA2 (>>) (Logger.observationLogger loggerState configLogLevel) (Metrics.observationMetrics metricsState)
metricsState <- Metrics.init (configDbPoolSize conf)
@mkleczek (Contributor)


Does this change have anything to do with JWT cache?

@taimoorzaeem (Collaborator Author)

Concurrency and atomic operations are important for correctness and consistency, but we also need a mutable reference for the LRU cache type. So I think we would need to use both TVar and IORef, for transactional operations and mutable state respectively. Considering this, I think this would need quite a bit of refactoring.

@taimoorzaeem (Collaborator Author)

Because the existing libraries don't satisfy our requirements, I guess our best bet is to come up with our own cache implementation. It might take a while, and it would probably delay the v13 release. @steve-chavez WDYT?


Successfully merging this pull request may close these issues.

LRU based cache for JWT