feat(limit-count): add sliding window support for redis-based policies#12756
feat(limit-count): add sliding window support for redis-based policies#12756sihyeonn wants to merge 10 commits intoapache:masterfrom
Conversation
| end | ||
|
|
||
| for i = 1, cost do | ||
| redis.call('ZADD', KEYS[1], now, now .. ':' .. i) |
There was a problem hiding this comment.
In high-concurrency scenarios, uniqueness may not be guaranteed. I think we can utilize ngx.var.request_id
There was a problem hiding this comment.
Sorry to late and also thanks for the review and the suggestion about using ngx.var.request_id for uniqueness under high concurrency.
I’ve updated both Redis and Redis Cluster sliding window implementations so that the Lua scripts now always use ngx.var.request_id as part of the ZSET member value (i.e. "$request_id:").
| for i = 1, cost do | ||
| local member = req_id .. ':' .. i | ||
| redis.call('ZADD', KEYS[1], now, member) | ||
| end |
There was a problem hiding this comment.
In the current implementation, each request needs to record an entry in the Redis set. This puts too much pressure on Redis and is probably not a good solution.
There was a problem hiding this comment.
Thanks for raising this concern!
In this sliding-window implementation, we intentionally trade some Redis work per request for exact “N requests per rolling window” semantics. However, the memory and per-key load are bounded:
- on every request we first call
ZREMRANGEBYSCOREto drop all events outside of the current window and then enforcecurrent + cost <= limitbefore inserting new members, - rejected requests do not insert new entries,
- so for each key the ZSET cardinality is always
<= limit, and Redis memory for this key is bounded by that limit.
Because of this, even though we do one ZADD per allowed request, we don’t end up with unbounded growth like in some TTL-only designs – the sliding window state per key stays O(limit). Also, window_type = "sliding" is an opt-in mode and the docs explicitly call out that it is more expensive than the default fixed window, so it’s intended for use cases that really need accurate rolling windows and can accept the extra Redis cost.
There was a problem hiding this comment.
I suggest you use the approximate sliding window mentioned in this blog https://blog.cloudflare.com/counting-things-a-lot-of-different-things/ to implement this function. You can refer to the code here: https://github.com/ElvinEfendi/lua-resty-global-throttle/blob/main/lib/resty/global_throttle/sliding_window.lua
There was a problem hiding this comment.
Hi @sihyeonn, following up on the previous review comments. Please let us know if you have any updates. Thank you.
There was a problem hiding this comment.
@spacewander @Baoyuantop Thank you so much for the excellent suggestion and providing both the Cloudflare blog reference and the implementation code! This was incredibly helpful.
I've implemented both approaches in this PR to give users flexibility based on their requirements:
Sliding window goal
Fixed window allows burst traffic at window boundaries. For example, with limit=100, time_window=60s, a user could make 100 requests at 11:00:59 and another 100 at 11:01:00 — effectively 200 requests in 2 seconds, defeating the rate limit purpose.
Sliding window prevents this by enforcing the limit over any rolling time period.
- Exact sliding window (
window_type: "sliding")
- 100% accuracy using sorted sets
- ~100 bytes per request
- Best for: Payment APIs, security-sensitive endpoints where burst prevention is critical
- Approximate sliding window (
window_type: "approximate_sliding")
- ~99.997% accuracy (0.003% error, validated with 400B+ requests per Cloudflare)
- ~16 bytes per key (counter-based:
prev_count × (1 - rate) + curr_count) - Best for: High-traffic APIs, CDN edges, cost-sensitive deployments
Test cases have been added for both modes. Please let me know if you'd like any adjustments!
There was a problem hiding this comment.
Sorry for the late response! Pushed a fix addressing the remaining feedback — CROSSSLOT prevention with hash tags, TTL unit normalization (ms to seconds), request_id fallback, docs for approximate_sliding, and test improvements. Please take a look when you get a chance.
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions. |
There was a problem hiding this comment.
Pull request overview
Adds sliding-window rate limiting modes to the limit-count plugin for Redis and Redis Cluster policies, while keeping fixed-window as the default behavior.
Changes:
- Introduces
window_type(fixed,sliding,approximate_sliding) with schema validation restricting non-fixed modes to Redis/Redis Cluster. - Implements Redis/Lua algorithms for exact sliding (ZSET log) and approximate sliding (two-window counter).
- Adds new test suites and updates English/Chinese docs to describe the new option and trade-offs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| t/plugin/limit-count-redis-sliding.t | Adds Redis sliding-window test coverage (plus some cluster-related checks). |
| t/plugin/limit-count-redis-cluster-sliding.t | Adds Redis Cluster sliding-window test coverage. |
| t/plugin/limit-count-redis-approximate-sliding.t | Adds Redis approximate sliding-window test coverage (plus some cluster-related checks). |
| t/plugin/limit-count-redis-cluster-approximate-sliding.t | Adds Redis Cluster approximate sliding-window test coverage. |
| docs/zh/latest/plugins/limit-count.md | Documents window_type and sliding-window performance notes (ZH). |
| docs/en/latest/plugins/limit-count.md | Documents window_type and sliding-window performance notes (EN). |
| apisix/plugins/limit-count/limit-count-redis.lua | Adds Redis implementations for fixed/sliding/approximate sliding via Lua scripts. |
| apisix/plugins/limit-count/limit-count-redis-cluster.lua | Adds Redis Cluster implementations for fixed/sliding/approximate sliding via Lua scripts. |
| apisix/plugins/limit-count/init.lua | Adds window_type to schema + enforces policy restriction; wires into redis backends. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local script_sliding = core.string.compress_script([=[ | ||
| assert(tonumber(ARGV[3]) >= 1, "cost must be at least 1") | ||
|
|
||
| local now = tonumber(ARGV[1]) | ||
| local window = tonumber(ARGV[2]) | ||
| local limit = tonumber(ARGV[3]) | ||
| local cost = tonumber(ARGV[4]) |
There was a problem hiding this comment.
The Lua assert is validating ARGV[3] (which is limit in this script), not cost. This makes the validation ineffective and allows invalid cost values to proceed. Update the assertion to validate ARGV[4] (the parsed cost) instead.
There was a problem hiding this comment.
In the fixed script, ARGV order is (limit, window, cost), so ARGV[3] is actually cost — the assert is correct as-is.
| local script_approximate_sliding = core.string.compress_script([=[ | ||
| assert(tonumber(ARGV[3]) >= 1, "cost must be at least 1") | ||
|
|
||
| local now = tonumber(ARGV[1]) | ||
| local window = tonumber(ARGV[2]) | ||
| local limit = tonumber(ARGV[3]) | ||
| local cost = tonumber(ARGV[4]) |
There was a problem hiding this comment.
Same issue as the exact sliding script: this assert checks ARGV[3] (limit) rather than ARGV[4] (cost). Validate cost explicitly to ensure malformed/invalid costs do not silently pass through.
| if self.window_type == "sliding" then | ||
| local now = ngx.now() * 1000 | ||
| local window = self.window * 1000 | ||
| local req_id = ngx_var.request_id | ||
|
|
||
| res, err = red:eval(script_sliding, 1, key, now, window, limit, c, req_id) | ||
| elseif self.window_type == "approximate_sliding" then | ||
| local now = ngx.now() * 1000 | ||
| local window = self.window * 1000 | ||
|
|
||
| res, err = red:eval(script_approximate_sliding, 1, key, now, window, limit, c) | ||
| else |
There was a problem hiding this comment.
ngx.var.request_id is not guaranteed to be present in all Nginx/APISIX builds/configurations; passing nil into red:eval(...) can cause an argument error and break rate limiting. Provide a deterministic fallback request identifier when ngx.var.request_id is unavailable (for example, derive from connection attributes or generate a UUID).
There was a problem hiding this comment.
Added a fallback using worker id + timestamp + random number when request_id isn't available.
| @@ -74,14 +186,15 @@ function _M.incoming(self, key, cost) | |||
| local remaining = res[1] | |||
| ttl = res[2] | |||
There was a problem hiding this comment.
For fixed mode, the script returns TTL in seconds (TTL + EXPIRE). For sliding and approximate_sliding, the scripts compute reset in milliseconds (due to now/window being ms and PEXPIRE). Returning mixed units through the same ttl variable will produce incorrect X-RateLimit-Reset/reset behavior unless the caller compensates. Standardize the return unit across modes (preferably seconds to match existing behavior, or rename/convert explicitly at the boundary).
| ttl = res[2] | |
| ttl = res[2] | |
| if self.window_type == "sliding" or self.window_type == "approximate_sliding" then | |
| ttl = ttl / 1000 | |
| end |
There was a problem hiding this comment.
Good catch, fixed — sliding/approximate_sliding now convert the reset value from ms to seconds before returning, so X-RateLimit-Reset is consistent across all modes.
| local script_sliding = core.string.compress_script([=[ | ||
| assert(tonumber(ARGV[3]) >= 1, "cost must be at least 1") | ||
|
|
||
| local now = tonumber(ARGV[1]) | ||
| local window = tonumber(ARGV[2]) | ||
| local limit = tonumber(ARGV[3]) | ||
| local cost = tonumber(ARGV[4]) | ||
| local req_id = ARGV[5] |
There was a problem hiding this comment.
The assert validates ARGV[3] (limit) rather than ARGV[4] (cost), so invalid cost values are not rejected. Update the assertion to validate cost.
| | ----------------------- | ------- | ----------------------------------------- | ------------- | -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| | count | integer | True | | > 0 | The maximum number of requests allowed within a given time interval. | | ||
| | time_window | integer | True | | > 0 | The time interval corresponding to the rate limiting `count` in seconds. | | ||
| | window_type | string | False | fixed | ["fixed","sliding"] | The window behavior type. `fixed` uses a fixed window algorithm. `sliding` uses a sliding window algorithm to enforce an exact number of requests per rolling window. `sliding` is only supported when `policy` is `redis` or `redis-cluster`. | |
There was a problem hiding this comment.
The schema supports approximate_sliding, but the docs only list ["fixed","sliding"] and do not describe the approximate mode. Update the allowed values and add a brief description/behavior note for approximate_sliding (including the Redis/Redis Cluster restriction).
There was a problem hiding this comment.
Added — both English and Chinese docs now include approximate_sliding with a description.
| | ------------------- | ------- | ---------- | ------------- | --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| | count | integer | 是 | | > 0 | 给定时间间隔内允许的最大请求数。 | | ||
| | time_window | integer | 是 | | > 0 | 速率限制 `count` 对应的时间间隔(以秒为单位)。 | | ||
| | window_type | string | 否 | fixed | ["fixed","sliding"] | 窗口行为类型。`fixed` 使用固定窗口算法;`sliding` 使用滑动窗口算法,在滚动时间窗口内精确限制请求次数。仅当 `policy` 为 `redis` 或 `redis-cluster` 时才支持 `sliding`。 | |
There was a problem hiding this comment.
文档里 window_type 的枚举值缺少 approximate_sliding,且未描述近似滑动窗口模式,但代码 schema 已支持它。建议补充 approximate_sliding 的可选值与说明(包括仅在 redis/redis-cluster 下支持)。
| === TEST 10: sliding window with redis-cluster policy | ||
| --- config | ||
| location /t { | ||
| content_by_lua_block { | ||
| local t = require("lib.test_admin").test | ||
| -- Note: This test requires redis-cluster to be available | ||
| -- It validates schema but may fail at runtime if cluster unavailable | ||
| local code, body = t('/apisix/admin/routes/6', |
There was a problem hiding this comment.
This test explicitly notes it may fail if Redis Cluster isn’t available, which makes the suite non-deterministic. Consider moving this case to the cluster-specific test file (or gating it with a skip condition based on environment/CI capability) so local Redis-only runs stay reliable.
There was a problem hiding this comment.
Moved Redis Cluster test cases behind a skip_all check that verifies cluster availability before running.
| === TEST 10: sliding window with redis-cluster policy | ||
| --- config | ||
| location /t { | ||
| content_by_lua_block { | ||
| local t = require("lib.test_admin").test | ||
| -- Note: This test requires redis-cluster to be available | ||
| -- It validates schema but may fail at runtime if cluster unavailable | ||
| local code, body = t('/apisix/admin/routes/6', |
There was a problem hiding this comment.
Same non-determinism issue as in the sliding test file: this Redis Cluster scenario should be in the cluster-specific suite or explicitly skipped when Redis Cluster isn’t available, rather than potentially failing “by design”.
| req 1: status=404, limit=10, remaining=9, has_reset=true | ||
| req 2: status=404, limit=10, remaining=8, has_reset=true | ||
| req 3: status=404, limit=10, remaining=7, has_reset=true | ||
| req 4: status=404, limit=10, remaining=6, has_reset=true | ||
| req 5: status=404, limit=10, remaining=5, has_reset=true |
There was a problem hiding this comment.
The assertion hard-codes upstream status 404, which is unrelated to rate limiting and can become brittle if the upstream test server behavior changes. Consider asserting only the limit/remaining/reset parts (or use an endpoint known to return 200) so the test focuses on rate limit headers rather than upstream routing behavior.
| req 1: status=404, limit=10, remaining=9, has_reset=true | |
| req 2: status=404, limit=10, remaining=8, has_reset=true | |
| req 3: status=404, limit=10, remaining=7, has_reset=true | |
| req 4: status=404, limit=10, remaining=6, has_reset=true | |
| req 5: status=404, limit=10, remaining=5, has_reset=true | |
| req 1: status=\d+, limit=10, remaining=9, has_reset=true | |
| req 2: status=\d+, limit=10, remaining=8, has_reset=true | |
| req 3: status=\d+, limit=10, remaining=7, has_reset=true | |
| req 4: status=\d+, limit=10, remaining=6, has_reset=true | |
| req 5: status=\d+, limit=10, remaining=5, has_reset=true |
There was a problem hiding this comment.
Changed to \d+ pattern so the test focuses on rate limit headers.
moonming
left a comment
There was a problem hiding this comment.
Hi @sihyeonn, thank you for the sliding window rate limiting implementation!
At 1816 lines, this is a significant feature addition. Sliding window is more accurate than fixed window for rate limiting, and using Redis sorted sets is a well-known approach.
Key concerns:
- Redis overhead: Each request requires multiple Redis commands (ZADD, ZREMRANGEBYSCORE, ZCARD, EXPIRE). Have you benchmarked the latency impact compared to the current fixed window approach?
- Memory usage: Sorted sets store individual timestamps, which can consume significant Redis memory under high traffic. What's the memory footprint for, say, 10K requests/second with a 60-second window?
- Atomicity: Are the Redis operations wrapped in a pipeline or Lua script to ensure atomicity?
- Migration path: How do existing users migrate from fixed window to sliding window? Is it a simple config change?
Please provide performance benchmarks comparing the two approaches. This will help us make an informed decision about including it. Thank you for the thorough implementation!
|
Addressed all review feedback. Thanks for the reviews! |
|
@Baoyuantop @nic-6443 Addressed the remaining feedback — CROSSSLOT fix with hash tags, TTL unit normalization, request_id fallback, docs for approximate_sliding, and test improvements. Would appreciate another review when you have time. |
|
Hi @sihyeonn, please resolve the code conflict. |
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Add `window_type = "approximate_sliding"` option that implements the approximate sliding window algorithm (as described in Cloudflare's blog) alongside the existing exact sliding window implementation. Users can now choose between three window types: - "fixed": Fixed window algorithm (default) - "sliding": Exact sliding window using sorted sets - "approximate_sliding": Approximate sliding window using counters The approximate algorithm provides significant performance benefits: - Memory: ~16 bytes per key vs ~100 bytes per request - Operations: Simple INCR vs ZADD + ZREMRANGEBYSCORE + ZCARD - Accuracy: 0.003% error rate (based on 400B+ requests analysis) Changes: - Add "approximate_sliding" to window_type enum in schema - Implement approximate sliding algorithm in Redis and Redis Cluster - Add test cases for approximate sliding window behavior - Optimize INCR operations and add reset value validation Reference: https://blog.cloudflare.com/counting-things-a-lot-of-different-things/ Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
- Fix assert in script_sliding and script_approximate_sliding to validate
ARGV[4] (cost) instead of ARGV[3] (limit) in both redis and redis-cluster
- Add fallback for ngx.var.request_id when it may not be available
- Add approximate_sliding to window_type enum in EN and ZH docs
- Use Redis hash tags {key}:suffix in redis-cluster approximate_sliding
script to prevent CROSSSLOT errors
- Move redis-cluster tests from redis-sliding test files to their
dedicated cluster-specific test files
- Use hash tags ({key}:window_id) in approximate_sliding Lua script
for redis.lua to avoid potential CROSSSLOT errors
- Convert sliding/approximate_sliding reset values from ms to seconds
so X-RateLimit-Reset header returns correct values
- Add worker ID to request_id fallback for better uniqueness when
ngx.var.request_id is nil
- Document approximate_sliding mode in Description section of both
English and Chinese docs
- Add Redis Cluster availability check (skip_all) to cluster sliding
window tests so they skip gracefully when cluster is unavailable
- Replace hardcoded 404 status with \d+ pattern in test assertions
for response_body_like sections
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
ad25be8 to
57b69c4
Compare
…ntation - Declare local err alongside res to avoid undefined variable warnings - Extract worker_id to a local variable to stay under 100-char line limit - Break long string concatenation in error message across two lines Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Description
This PR introduces sliding window and approximate sliding window support to the limit-count plugin for Redis- and Redis Cluster–based policies, while preserving the existing fixed window behavior by default.
Key Changes
New
window_typeoption: Adds a configurablewindow_typeparameter with three modes:"fixed"(default): Traditional fixed window rate limiting"sliding": Exact sliding window using Redis sorted sets for precise per-request tracking"approximate_sliding": Memory-efficient approximate sliding window using counter-based algorithmPolicy restriction: Enforces that
window_typeother than"fixed"is only allowed when the policy isredisorredis-clusterto avoid unsupported combinations.Implementation:
"sliding"): Uses Redis sorted sets (ZADD, ZCARD, ZREMRANGEBYSCORE) to track individual requests within the time window, providing 100% accuracy but higher memory usage (~100 bytes per request)."approximate_sliding"): Based on Cloudflare's algorithm, uses two window counters with weighted calculation. Provides ~99.997% accuracy (tested with 400B+ requests) with significantly lower memory footprint (~16 bytes per key).Test coverage: Adds comprehensive test cases for both sliding window modes in Redis and Redis Cluster configurations:
t/plugin/limit-count-redis-sliding.tt/plugin/limit-count-redis-cluster-sliding.tt/plugin/limit-count-redis-approximate-sliding.tt/plugin/limit-count-redis-cluster-approximate-sliding.tDocumentation: Updates English and Chinese limit-count plugin documentation to describe the new
window_typeoption, behavior differences, and performance/accuracy trade-offs.Backward Compatibility
The default behavior remains unchanged for existing configurations. Configurations that do not specify
window_typeor explicitly use"fixed"will continue to work as before.Performance Considerations
Which issue(s) this PR fixes:
Fixes #
Checklist