feat: change JWT cache to limited LRU based cache #4008
Conversation
-- | Make an LRU. If a size limit is specified, the LRU is guaranteed
-- to not grow above the specified number of entries.
newLRU :: (Ord key) => Maybe Integer -- ^ the optional maximum size of the LRU
       -> LRU key val
newLRU (Just s) | s <= 0 = error "non-positive size LRU"
newLRU s = LRU Nothing Nothing Map.empty

Should we fail at startup?
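If failing at startup is preferred, one option is to validate the limit while the configuration is parsed, instead of letting error fire on first use. A minimal sketch, with a hypothetical function name and message:

-- Hypothetical config-time validation: reject a negative limit at startup
-- with a clear message (0 stays valid and disables the cache, per the
-- discussion below).
parseJwtCacheMaxEntries :: Integer -> Either String Integer
parseJwtCacheMaxEntries n
  | n < 0     = Left "jwt-cache-max-entries must be non-negative"
  | otherwise = Right n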
What exactly is the breaking part here? Just the different config options?

I think the breaking part is not the interface here (because we didn't remove the old config) but the part where we changed the implementation. Is that considered breaking?

Not changing the interface, but only the implementation is pretty much the definition of "not breaking", I think :D
Thanks 👍, then I guess this PR could be breaking if we decide to remove the config. Btw, having two configs for one feature seems extra. Maybe there is some way we can use only one config for this.
Does it make sense to have
I think it should start at 1000; I don't have the reference handy, but I commented before about the average JWT size.

I believe that 0 should mean the JWT cache is disabled.
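To make those semantics concrete, a tiny sketch (names illustrative, not the PR's actual code):

-- Illustrative only: whether to consult the JWT cache at all.
data CacheDecision = UseCache | SkipCache deriving (Eq, Show)

decideCache :: Int -> CacheDecision
decideCache maxEntries
  | maxEntries <= 0 = SkipCache -- 0 disables the JWT cache
  | otherwise       = UseCache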
One spec test is failing when all spec tests are run, but not when I run it individually.
There can also be cases where the jwt-secret is set to a JWKS, i.e. multiple keys. This is especially important with key rotation, where multiple keys can be considered valid at a given time. Not sure whether the ability to keep the cache for keys which are still valid would complicate things too much, though...
I think the spec tests need to be run without the JWT cache, because they should be independent. The JWT cache will affect state between tests, so one test depends on which other tests ran before - as you are seeing here. This will also be a problem when running the tests in parallel.
We are currently not doing this in v12. Does that mean we have a bug in v12 that should be fixed? Not sure if this is a severe issue from a security POV.
I think it is a bug that needs to be fixed, yes. Old tokens would continue to work despite the secret being changed, as long as PostgREST is not restarted. |
@taimoorzaeem Please rebase 🙏, I'll review then.
Waiting for #4023 to be fixed.
How should we handle this when config-jwt-secret is changed in a reload? I think we should invalidate all cache entries because they were cached with the previous secret.

Could we have a similar case with the jwt-aud setting?

Also, since we're caching AuthResult, I guess we have the same thing with jwt-role-claim-key, too. If that changes, the AuthResult for a given JWT must change as well - so we need to invalidate the cache.

(those comments are probably more about the existing state, than this PR specifically)
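One hedged way to realize that invalidation, as a sketch only - the AppConfig field names are assumptions, and reset is the helper proposed later in this thread:

import Control.Monad (when)

-- Sketch: on a config reload, drop every cache entry if any auth-relevant
-- setting changed, since cached AuthResults were computed under the old
-- settings. Field names here are assumed, not PostgREST's actual ones.
invalidateOnReload :: AppConfig -> AppConfig -> JwtCacheState -> IO ()
invalidateOnReload oldConf newConf cache =
  when authConfigChanged (reset cache)
  where
    authConfigChanged =
      configJwtSecret oldConf       /= configJwtSecret newConf ||
      configJwtAud oldConf          /= configJwtAud newConf    ||
      configJwtRoleClaimKey oldConf /= configJwtRoleClaimKey newConf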
src/PostgREST/Auth/JwtCache.hs
else do
  -- update cache afte lookup -> return result
  I.writeIORef jwtCacheIORef jwtCache'
  return $ Right result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested change (fixing the typo in the comment):

else do
  -- update cache after lookup -> return result
  I.writeIORef jwtCacheIORef jwtCache'
  return $ Right result
Why do we update the cache after a cache hit?
When we look up, the state of the cache changes (a cache hit updates that entry's recency, since it is a pure LRU cache and the entry is now used "recently"). So we need to write the new cache state (I should mention this in the comment).
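To illustrate, a minimal sketch of the hit path, assuming a pure lookup lookupLRU :: Ord k => k -> LRU k v -> Maybe (v, LRU k v) (the PR's actual signature may differ):

import Data.IORef (IORef, readIORef, writeIORef)

-- Sketch: even a hit must write back, because the pure LRU returns a new
-- cache value with the entry's recency bumped.
cachedLookup :: Ord k => IORef (LRU k v) -> k -> IO (Maybe v)
cachedLookup ref key = do
  cache <- readIORef ref
  case lookupLRU key cache of
    Nothing            -> return Nothing
    Just (val, cache') -> do
      writeIORef ref cache' -- persist the updated recency order
      return (Just val)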
See my comments about the lack of thread safety here and possible lost updates - which means we do not have up-to-date information about cache usage, and the whole idea of an LRU cache does not make much sense any more.

OTOH, updating a shared variable upon each cache hit is going to have a penalty because of contention under load.

To be honest - looking at this I am not convinced an LRU cache is a good idea, at least with the current immutable LRU cache implementation.

Not really sure what to do with this, but Nikita Volkov implemented https://hackage.haskell.org/package/stm-containers which should provide a map implementation that minimizes contention.

Or we could stick to the current implementation but use a better cache implementation: https://hackage.haskell.org/package/psqueues comes to mind, as it should provide efficient purging.
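For reference, a rough sketch of an LRU on top of psqueues, in the spirit of the jaspervdj blog post cited later in this thread - a sketch under assumptions, not a drop-in implementation. Priorities are logical ticks, so the least recently used entry is always the minimum and eviction is O(log n):

import qualified Data.OrdPSQ as PSQ

-- Sketch of an LRU using a priority search queue: each access stamps the
-- entry with a fresh logical tick, so the minimum priority is always the
-- least recently used entry.
data Lru k v = Lru
  { lruTick     :: !Int -- logical clock, bumped on every access
  , lruCapacity :: !Int
  , lruQueue    :: !(PSQ.OrdPSQ k Int v)
  }

lruEmpty :: Int -> Lru k v
lruEmpty capacity = Lru 0 capacity PSQ.empty

lruInsert :: Ord k => k -> v -> Lru k v -> Lru k v
lruInsert k v (Lru tick cap q) =
  let q'  = PSQ.insert k tick v q
      q'' = if PSQ.size q' > cap then PSQ.deleteMin q' else q' -- evict LRU
  in Lru (tick + 1) cap q''

lruLookup :: Ord k => k -> Lru k v -> Maybe (v, Lru k v)
lruLookup k (Lru tick cap q) =
  case PSQ.lookup k q of
    Nothing     -> Nothing
    -- a hit re-inserts with the current tick to mark it most recently used
    Just (_, v) -> Just (v, Lru (tick + 1) cap (PSQ.insert k tick v q))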
-- | Check if exp claim is expired when looked up from cache
isExpClaimExpired :: AuthResult -> UTCTime -> Expired
isExpClaimExpired result utc =
  case expireJSON of
    Nothing                      -> False -- if exp not present then it is valid
    Just (JSON.Number expiredAt) -> (sciToInt expiredAt - now) < 0
    Just _                       -> False -- if exp is not a number then valid
  where
    expireJSON = KM.lookup "exp" (authClaims result)
    now = (floor . nominalDiffTimeToSeconds . utcTimeToPOSIXSeconds) utc :: Int
    sciToInt = fromMaybe 0 . Sci.toBoundedInteger
Does this take the 30 seconds clock skew we allow into account?
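If it should, one hedged option is to fold the allowed skew into the comparison - the 30-second figure comes from this comment, everything else is illustrative:

-- Sketch: only treat a cached entry as expired once it is past exp plus
-- the allowed clock skew, mirroring the leeway used during validation.
-- Both arguments are seconds since epoch, as in the snippet above.
allowedSkewSeconds :: Int
allowedSkewSeconds = 30

isExpiredWithSkew :: Int -> Int -> Bool
isExpiredWithSkew expiredAt now = expiredAt + allowedSkewSeconds < now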
BREAKING CHANGE Our JWT cache implementation had no upper bound on the number of cache entries. This caused OOM errors. Additionally, the purge mechanism for expired entries was quite slow. This changes our implementation to an LRU-based cache which limits the number of cached entries.
@wolfgangwalther Yes, this is important. We should discuss config reloading in more detail in the context of authentication. Could you open an issue with this?
  -- update cache after lookup -> return result
  I.writeIORef jwtCacheIORef jwtCache'
  return $ Right result
The above code is not thread safe - it is a read-modify-write and causes lost updates.

Granted - in the case of caching, lost updates might not be that big of a deal (although see my comment about possible performance implications), but maybe a TVar instead of an IORef would be better here.
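For comparison, another option short of STM, using the Lru/lruLookup shape sketched above (assumptions throughout): atomicModifyIORef' turns the read-modify-write into a single atomic step, so concurrent hits cannot lose each other's recency updates, at the cost of serializing writers:

import Data.IORef (IORef, atomicModifyIORef')

-- Sketch: perform lookup-and-bump as one atomic modification, avoiding
-- the lost updates of a separate readIORef/writeIORef pair.
cachedLookupAtomic :: Ord k => IORef (Lru k v) -> k -> IO (Maybe v)
cachedLookupAtomic ref key =
  atomicModifyIORef' ref $ \cache ->
    case lruLookup key cache of
      Nothing            -> (cache,  Nothing)
      Just (val, cache') -> (cache', Just val)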
I think we should run the entire lookupJwtCache function atomically. This would make it more thread safe.
If lookupJwtCache is atomic then you introduce very high contention, as all threads wait for each other just to perform a lookup. The problem is even worse because with an LRU cache, lookups are not read-only anymore, so the shared variable has to be updated upon every access.
I don't think there is a good solution to this without a proper mutable, concurrent, high-performance LRU cache implementation.
@steve-chavez @wolfgangwalther - what do you think?
This sounds very plausible, @mkleczek.
I suggest we use real numbers to see the effect. Making the whole lookupJwtCache function atomic should not be hard (?) - and then we can run the JWT loadtest on it. Then compare that to running with https://hackage.haskell.org/package/psqueues. Ideally, we'd have both branches available, so we can run a comparison in one go.
If lookupJwtCache is atomic then you introduce very high contention, as all threads wait for each other just to perform a lookup. The problem is even worse because with an LRU cache, lookups are not read-only anymore, so the shared variable has to be updated upon every access.

With many threads, performance for any shared resource will take a hit no matter what, because of mutual exclusion.

I don't think there is a good solution to this without a proper mutable, concurrent, high-performance LRU cache implementation.

So is it even possible to have "high" performance with concurrency when waiting will be involved? I think the only way to maintain performance here would be to decrease waiting, which means we need to use a distributed cache, which will get complex. 🤔
Maintaining a separate library would have been possible if we only used hackage+cabal, but we don't. It takes months for a new version of a library to be available on Stack and NixOS. I think our own implementation in postgrest would be our best bet.
I don't think any of that is a blocker for a separate lib, all of that can easily be done immediately via overrides etc.
Still, building our own library for that is too big of a task to pursue anyway.
@taimoorzaeem @wolfgangwalther
After some research I found the right library: https://hackage.haskell.org/package/lrucaching-0.3.4
It is an implementation based on the blog post https://jaspervdj.be/posts/2015-02-24-lru-cache.html and uses psqueues.
It provides a version of the cache that minimises contention by splitting the cache into stripes:
https://hackage.haskell.org/package/lrucaching-0.3.4/docs/Data-LruCache-IO.html#v:stripedCached
I guess we could configure it to use getNumCapabilities (https://hackage.haskell.org/package/base-4.21.0.0/docs/GHC-Conc.html#v:getNumCapabilities) as the number of stripes and not bother users with configuring it (see the sketch below).
It does not support expiration though. It would be good to add it, otherwise the cache is vulnerable to cache-thrashing attacks by continuously sending requests with expired JWTs.
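For the striping part, a small sketch of how this could look, assuming lrucaching's documented Data.LruCache.IO API; the split of total capacity across stripes is a naive assumption:

import Control.Concurrent (getNumCapabilities)
import Data.ByteString (ByteString)
import Data.LruCache.IO (StripedLruHandle, newStripedLruHandle, stripedCached)

-- Sketch: one stripe per capability, so requests with different tokens
-- rarely contend on the same lock.
newJwtCache :: Int -> IO (StripedLruHandle ByteString v)
newJwtCache totalCapacity = do
  caps <- getNumCapabilities
  newStripedLruHandle caps (max 1 (totalCapacity `div` caps))

-- On a miss, stripedCached runs the given action (e.g. full JWT
-- validation) and stores its result.
lookupJwt :: StripedLruHandle ByteString v -> ByteString -> IO v -> IO v
lookupJwt = stripedCached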
This implementation is pure as well, so we'd still need to write the new cache on a lookup 😢.
It does not support expiration though. It would be good to add it otherwise the cache is vulnerable to cache thrashing attacks by continuously sending requests with expired JWTs.
That is not an issue at all. We don't need built-in expiration support. We can just check the exp claim on lookup. It would just be an O(1) task.
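A minimal sketch of that idea, with illustrative names: cache the expiry next to the result and treat a stale hit as a miss.

import Data.Time.Clock (UTCTime, getCurrentTime)

-- Sketch: O(1) expiry check at lookup time, no built-in TTL needed.
lookupUnlessExpired :: (k -> IO (Maybe (UTCTime, v))) -- underlying cache lookup
                    -> k -> IO (Maybe v)
lookupUnlessExpired cacheLookup key = do
  now <- getCurrentTime
  hit <- cacheLookup key
  return $ case hit of
    Just (expiresAt, v) | expiresAt > now -> Just v  -- still fresh
    _                                     -> Nothing -- miss or expired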
This implementation is pure as well, so we'd still need to write the new cache on a lookup 😢.
It is not - see this link: https://hackage.haskell.org/package/lrucaching-0.3.4/docs/Data-LruCache-IO.html#v:stripedCached
The stripedCached does not have any interface for insert, update, delete etc. Internally, it is still LruCache.
-- | JWT Cache
, stateJwtCache :: JwtCache.JwtCacheState

was changed to:

-- | JWT Cache, disabled when config jwt-cache-max-entries is set to 0
, stateJwtCache :: IORef JwtCache.JwtCacheState
I am not sure if an IORef is needed here - we already have a mutable variable in JwtCacheState. Why do we need another one here?

A cleaner solution would be to export a reset :: JwtCacheState -> IO () function from the JwtCache module and call it upon config reload.
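A hedged sketch of that reset, assuming JwtCacheState pairs the IORef holding the pure LRU with its configured size (the constructor shape is an assumption):

import Data.IORef (atomicWriteIORef)

-- Sketch: drop all entries by swapping in a fresh, empty LRU, using the
-- newLRU from this PR.
reset :: JwtCacheState -> IO ()
reset (JwtCacheState ref maxEntries) =
  atomicWriteIORef ref (newLRU maxEntries)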
loggerState <- Logger.init
metricsState <- Metrics.init configDbPoolSize
let observer = liftA2 (>>) (Logger.observationLogger loggerState configLogLevel) (Metrics.observationMetrics metricsState)

where the Metrics.init line was changed to:

metricsState <- Metrics.init (configDbPoolSize conf)
Does this change have anything to do with JWT cache?
Concurrency and atomic operations are important for correctness and consistency, but we also need a mutable reference for …
Because the already existing libraries don't satisfy our requirements yet, I guess our best bet is to come up with our own cache implementation. It might take a while and it would probably delay the v13 release. @steve-chavez WDYT?
BREAKING CHANGE
Closes #4003.