bug: /status/ready endpoint always returns 503 in file-driven standalone mode #12662

@Falven

Current Behavior

When running APISIX 3.13.0 in file-driven standalone mode (deployment.role=data_plane, config_provider=yaml), the /status/ready health check endpoint always returns HTTP 503 with error "worker id: X has not received configuration", despite:

  • Routes working correctly
  • Configuration being successfully loaded from apisix.yaml
  • All workers functioning normally

Example error response:

{"error":"worker id: 0 has not received configuration","status":"error"}

Expected Behavior

The /status/ready endpoint should return HTTP 200 with {"status":"ok"} when all workers have successfully loaded the configuration from the YAML file.

Error Logs

2025/01/10 00:41:47 [warn] 33#33: *3 [lua] init.lua:1003: status_ready(): worker id: 0 has not received configuration, context: ngx.timer

Steps to Reproduce

  1. Configure APISIX in file-driven standalone mode:
    # config.yaml
    deployment:
      role: data_plane
      role_data_plane:
        config_provider: yaml
    apisix:
      enable_admin: false
  2. Create a valid apisix.yaml with routes
  3. Start APISIX
  4. Test the health check endpoint:
    curl http://127.0.0.1:7085/status/ready
  5. Observe HTTP 503 error despite routes working correctly

Environment

  • APISIX version: 3.13.0
  • Operating System: Docker (apache/apisix:3.13.0-debian)
  • OpenResty / Nginx version: From official image
  • Deployment mode: data_plane with yaml config_provider

Root Cause Analysis (UPDATED)

After extensive debugging with added logging, I've identified the actual root cause. The issue occurs when the configuration file is rendered before APISIX starts (common in container environments):

Timing Issue:

  1. Configuration file (apisix.yaml) is created by an entrypoint script before APISIX starts
  2. Master process reads the file during startup, setting the apisix_yaml_mtime variable; workers forked afterwards inherit that value
  3. Workers initialize and call sync_status_to_shdict(false) marking themselves as unhealthy
  4. Workers create timers that call read_apisix_config() every second
  5. Critical bug: read_apisix_config() checks if file mtime has changed:
    if apisix_yaml_mtime == last_modification_time then
        return  -- File hasn't changed, return early
    end
  6. Because the file was rendered before startup, the mtime never changes
  7. update_config() is never called by workers
  8. Workers remain marked as unhealthy forever
  9. /status/ready endpoint fails perpetually
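
The sequence above can be modelled in plain Lua without any OpenResty APIs. This is only a sketch: new_model, master_init, tick and the healthy table are hypothetical stand-ins, not the real internals of config_yaml.lua, and a single worker is modelled for brevity. It shows the mtime gate keeping a worker unhealthy until the file's mtime changes.

```lua
-- Plain-Lua model of the startup sequence; no OpenResty APIs are used.
-- All names below are stand-ins, not the real config_yaml.lua internals.
local function new_model(file_mtime)
    local m = {
        file_mtime = file_mtime,  -- mtime of apisix.yaml on disk
        apisix_yaml_mtime = nil,  -- mtime recorded at the last load
        healthy = {},             -- stand-in for the status shared dict
    }

    -- Master process: loads the file once; its status write is ignored.
    function m.master_init()
        m.apisix_yaml_mtime = m.file_mtime
    end

    -- Worker: starts out unhealthy, as sync_status_to_shdict(false) does.
    function m.init_worker(id)
        m.healthy[id] = false
    end

    -- One timer tick in a worker, mirroring the mtime gate.
    function m.tick(id)
        if m.apisix_yaml_mtime == m.file_mtime then
            return false  -- unchanged file: the worker stays unhealthy
        end
        m.apisix_yaml_mtime = m.file_mtime
        m.healthy[id] = true  -- models what update_config() ends up doing
        return true
    end

    return m
end

local model = new_model(1000)  -- file rendered before startup
model.master_init()
model.init_worker(0)
for _ = 1, 5 do model.tick(0) end
print("after 5 ticks:", model.healthy[0])  -- false

model.file_mtime = 1001  -- file modified after startup
model.tick(0)
print("after touch:", model.healthy[0])  -- true
```

In this model, the only way out of the unhealthy state is a change to the file's mtime after the workers have started, which matches the behaviour described in steps 5 to 8.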

Debug Evidence:
Adding logging to config_yaml.lua confirmed:

  • update_config() is only called once by the master process (PID 1) during startup
  • Master's call to sync_status_to_shdict(true) is a no-op because the function returns early when process.type() ~= "worker"
  • All 12 workers successfully create timers
  • Timers fire every second but return early due to unchanged mtime
  • Workers never call update_config(), thus never call sync_status_to_shdict(true)

Relevant Code

apisix/core/config_yaml.lua - Lines ~565-585:

function _M.init_worker()
    sync_status_to_shdict(false)  -- Mark worker as unhealthy
    
    if is_use_admin_api() then
        apisix_yaml = {}
        apisix_yaml_mtime = 0
        return true
    end

    -- sync data in each non-master process
    ngx.timer.every(1, read_apisix_config)  -- Timer created but never calls update_config
    
    return true
end

apisix/core/config_yaml.lua - Lines ~150-165:

local function read_apisix_config(premature, pre_mtime)
    if premature then
        return
    end
    
    local attributes, err = lfs.attributes(config_file.path)
    if not attributes then
        log.error("failed to fetch ", config_file.path, " attributes: ", err)
        return
    end

    local last_modification_time = attributes.modification
    if apisix_yaml_mtime == last_modification_time then
        return  -- BUG: Returns early, never calls update_config()
    end
    
    -- This code is never reached if file hasn't changed since startup
    local config_new, err = config_file:parse()
    if err then
        log.error("failed to parse the content of file ", config_file.path, ": ", err)
        return
    end

    update_config(config_new, last_modification_time)
    log.warn("config file ", config_file.path, " reloaded.")
end

apisix/core/config_yaml.lua - Lines ~136-148:

local function sync_status_to_shdict(status)
    if process.type() ~= "worker" then
        return  -- Master process calls are ignored
    end

    local dict_name = "status-report"
    local key = worker_id()
    local shdict = ngx.shared[dict_name]
    local _, err = shdict:set(key, status)
    if err then
        log.error("failed to ", status and "set" or "clear",
                  " shdict " .. dict_name .. ", key=" .. key, ", err: ", err)
    end
end
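
For context, here is a rough sketch of how a readiness check could aggregate these per-worker flags. It is an assumption based on the error message seen in the logs, not a copy of the status_ready() code in init.lua. The statuses table is a plain Lua table standing in for the status-report shared dict, and worker_count is a hypothetical parameter.

```lua
-- Hypothetical readiness aggregation over per-worker flags.
-- `statuses` is a plain table standing in for the shared dict.
local function status_ready(statuses, worker_count)
    for id = 0, worker_count - 1 do
        -- A missing key or a false value both mean "not ready".
        if not statuses[id] then
            return 503, {
                status = "error",
                error = "worker id: " .. id .. " has not received configuration",
            }
        end
    end
    return 200, { status = "ok" }
end

local code, body = status_ready({ [0] = true, [1] = false }, 2)
print(code, body.error)

code, body = status_ready({ [0] = true, [1] = true }, 2)
print(code, body.status)
```

Under this model a single worker whose flag is never set to true is enough to keep the endpoint at 503, which is consistent with the response shown under Current Behavior.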

Proposed Solution

In init_worker(), immediately call update_config() after creating the timer to mark the worker as healthy:

function _M.init_worker()
    sync_status_to_shdict(false)
    
    if is_use_admin_api() then
        apisix_yaml = {}
        apisix_yaml_mtime = 0
        return true
    end

    -- sync data in each non-master process
    ngx.timer.every(1, read_apisix_config)
    
    -- FIX: Mark worker as healthy immediately if config already loaded
    if apisix_yaml then
        update_config(apisix_yaml, apisix_yaml_mtime)
    end
    
    return true
end

This ensures workers are marked healthy on initialization, before the timer even fires. The timer will still update configuration when the file changes.
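
The effect of the change can be checked with a small plain-Lua model. This is a sketch under assumptions: the healthy table stands in for the shared dict, and update_config here takes an extra worker id argument purely for illustration, which the real function does not.

```lua
-- Plain-Lua model of the proposed init_worker() change.
local healthy = {}                    -- stand-in for the status shared dict
local apisix_yaml, apisix_yaml_mtime  -- state inherited from the master

local function update_config(id, conf, mtime)
    apisix_yaml, apisix_yaml_mtime = conf, mtime
    healthy[id] = true                -- models sync_status_to_shdict(true)
end

local function init_worker(id)
    healthy[id] = false               -- models sync_status_to_shdict(false)
    -- Proposed fix: reuse the config the master already loaded.
    if apisix_yaml then
        update_config(id, apisix_yaml, apisix_yaml_mtime)
    end
end

-- Case 1: nothing loaded by the master yet, so the worker stays unhealthy.
init_worker(0)
print(healthy[0])  -- false

-- Case 2: the master loaded the file before forking, so the worker
-- is marked healthy during initialization.
apisix_yaml, apisix_yaml_mtime = { routes = {} }, 1000
init_worker(1)
print(healthy[1])  -- true
```

The guard on apisix_yaml matters: a worker that starts without any loaded configuration should still report as not ready until the timer picks up a file.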

Verified Fix

I patched the code in a running container and confirmed:

  • All 12 workers call update_config() in init_worker_by_lua* context
  • /status/ready returns {"status":"ok"} with HTTP 200
  • Docker health check passes (container shows "healthy" status)
  • Routes continue working correctly

Impact

This bug affects production deployments using:

  • Kubernetes readiness probes with file-driven standalone mode
  • Docker health checks
  • Load balancers that depend on /status/ready endpoint
  • Any container orchestration that renders config files before starting APISIX

The health check always fails, preventing proper deployment orchestration, even though APISIX is functioning correctly and serving traffic.

Additional Context

The bug is specific to the timing of when the configuration file is created relative to APISIX startup. If the file is created and never modified, workers never get marked as healthy. This is a common pattern in containerized deployments where entrypoint scripts render configuration from environment variables before starting the main process.
