bug: /status/ready endpoint always returns 503 in file-driven standalone mode #12662

### Current Behavior

When running APISIX 3.13.0 in file-driven standalone mode (`deployment.role: data_plane`, `role_data_plane.config_provider: yaml`), the `/status/ready` health check endpoint always returns HTTP 503 with the error "worker id: X has not received configuration", even though:
- Routes working correctly
- Configuration being successfully loaded from apisix.yaml
- All workers functioning normally

Example error response:
```json
{"error":"worker id: 0 has not received configuration","status":"error"}
```

### Expected Behavior

The `/status/ready` endpoint should return HTTP 200 with `{"status":"ok"}` when all workers have successfully loaded the configuration from the YAML file.

### Error Logs

```
2025/01/10 00:41:47 [warn] 33#33: *3 [lua] init.lua:1003: status_ready(): worker id: 0 has not received configuration, context: ngx.timer
```

### Steps to Reproduce

1. Configure APISIX in file-driven standalone mode:
```yaml
# config.yaml
deployment:
  role: data_plane
  role_data_plane:
    config_provider: yaml
apisix:
  enable_admin: false
```

2. Create a valid apisix.yaml with routes
3. Start APISIX
4. Test the health check endpoint:
```bash
curl http://127.0.0.1:7085/status/ready
```

5. Observe HTTP 503 error despite routes working correctly
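
To make the mismatch in step 5 easier to see, a small helper can print the readiness status code next to the status code of a proxied route. This is a minimal sketch: the proxy address `http://127.0.0.1:9080` and the `/get` route are assumptions standing in for whatever your apisix.yaml defines, and `http_code` and `compare_ready_and_route` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: compare the readiness endpoint with a real route.
# ROUTE_URL is an assumption; point it at a route from your own apisix.yaml.
READY_URL="${READY_URL:-http://127.0.0.1:7085/status/ready}"
ROUTE_URL="${ROUTE_URL:-http://127.0.0.1:9080/get}"

# Print only the HTTP status code for a URL ("000" if the connection fails).
http_code() {
    curl -s -o /dev/null -w '%{http_code}' "$1" || true
}

# Print both status codes on one line.
compare_ready_and_route() {
    echo "ready=$(http_code "$READY_URL") route=$(http_code "$ROUTE_URL")"
}
```

Calling `compare_ready_and_route` against an affected instance should show `ready=503` while the route code reflects a normal upstream response.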

### Environment

- APISIX version: 3.13.0
- Operating System: Docker (apache/apisix:3.13.0-debian)
- OpenResty / Nginx version: From official image
- Deployment mode: data_plane with yaml config_provider

### Root Cause Analysis (UPDATED)

After extensive debugging with added logging, I've identified the actual root cause. The issue occurs when the configuration file is rendered **before** APISIX starts (common in container environments):

**Timing Issue:**
1. Configuration file (`apisix.yaml`) is created by an entrypoint script before APISIX starts
2. Master process reads the file during startup, setting `apisix_yaml_mtime` global variable
3. Workers initialize and call `sync_status_to_shdict(false)` marking themselves as **unhealthy**
4. Workers create timers that call `read_apisix_config()` every second
5. **Critical bug**: `read_apisix_config()` checks if file mtime has changed:
   ```lua
   if apisix_yaml_mtime == last_modification_time then
       return  -- File hasn't changed, return early
   end
   ```
6. Because the file was rendered before startup and is not modified afterwards, its mtime always equals the value the master recorded
7. `update_config()` is **never called** by workers
8. Workers remain marked as unhealthy forever
9. `/status/ready` endpoint fails perpetually
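
The sequence above can be reproduced outside APISIX with a small simulation. This is only a sketch of the mtime comparison, not APISIX code: `file_mtime`, `poll_once`, `cached_mtime`, and `worker_healthy` are hypothetical names, and a temporary file stands in for `apisix.yaml`.

```bash
#!/usr/bin/env bash
# Simulation only: mirrors the mtime comparison described above.

# Portable mtime lookup (GNU stat first, BSD stat as a fallback).
file_mtime() {
    stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

config_file="$(mktemp)"                      # stands in for apisix.yaml, rendered before startup
cached_mtime="$(file_mtime "$config_file")"  # value recorded by the master at startup
worker_healthy=false                         # state after sync_status_to_shdict(false)

# One timer tick: return early when the mtime is unchanged.
poll_once() {
    local current
    current="$(file_mtime "$config_file")"
    if [ "$current" = "$cached_mtime" ]; then
        return 0                             # early return: health flag is never touched
    fi
    cached_mtime="$current"
    worker_healthy=true                      # only reached after the file changes
}

for tick in 1 2 3; do poll_once; done
echo "after 3 ticks, healthy=$worker_healthy"
rm -f "$config_file"
```

Because nothing modifies the file between ticks, the script prints `after 3 ticks, healthy=false`, which matches the behavior reported for the workers.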

**Debug Evidence:**
Adding logging to `config_yaml.lua` confirmed:
- `update_config()` is only called once by the master process (PID 1) during startup
- The master's call to `sync_status_to_shdict(true)` has no effect because the function returns early when `process.type() ~= "worker"`
- All 12 workers successfully create timers
- Timers fire every second but return early due to unchanged mtime
- Workers never call `update_config()`, thus never call `sync_status_to_shdict(true)`

### Relevant Code

**apisix/core/config_yaml.lua** - Lines ~565-585:
```lua
function _M.init_worker()
    sync_status_to_shdict(false)  -- Mark worker as unhealthy
    
    if is_use_admin_api() then
        apisix_yaml = {}
        apisix_yaml_mtime = 0
        return true
    end

    -- sync data in each non-master process
    ngx.timer.every(1, read_apisix_config)  -- Timer created but never calls update_config
    
    return true
end
```

**apisix/core/config_yaml.lua** - Lines ~150-165:
```lua
local function read_apisix_config(premature, pre_mtime)
    if premature then
        return
    end
    
    local attributes, err = lfs.attributes(config_file.path)
    if not attributes then
        log.error("failed to fetch ", config_file.path, " attributes: ", err)
        return
    end

    local last_modification_time = attributes.modification
    if apisix_yaml_mtime == last_modification_time then
        return  -- BUG: Returns early, never calls update_config()
    end
    
    -- This code is never reached if file hasn't changed since startup
    local config_new, err = config_file:parse()
    if err then
        log.error("failed to parse the content of file ", config_file.path, ": ", err)
        return
    end

    update_config(config_new, last_modification_time)
    log.warn("config file ", config_file.path, " reloaded.")
end
```

**apisix/core/config_yaml.lua** - Lines ~136-148:
```lua
local function sync_status_to_shdict(status)
    if process.type() ~= "worker" then
        return  -- Master process calls are ignored
    end

    local dict_name = "status-report"
    local key = worker_id()
    local shdict = ngx.shared[dict_name]
    local _, err = shdict:set(key, status)
    if err then
        log.error("failed to ", status and "set" or "clear",
                  " shdict " .. dict_name .. ", key=" .. key, ", err: ", err)
    end
end
```

### Proposed Solution

In `init_worker()`, immediately call `update_config()` after creating the timer to mark the worker as healthy:

```lua
function _M.init_worker()
    sync_status_to_shdict(false)
    
    if is_use_admin_api() then
        apisix_yaml = {}
        apisix_yaml_mtime = 0
        return true
    end

    -- sync data in each non-master process
    ngx.timer.every(1, read_apisix_config)
    
    -- FIX: Mark worker as healthy immediately if config already loaded
    if apisix_yaml then
        update_config(apisix_yaml, apisix_yaml_mtime)
    end
    
    return true
end
```

This ensures workers are marked healthy on initialization, before the timer even fires. The timer will still update configuration when the file changes.

### Verified Fix

I patched the code in a running container and confirmed:
- All 12 workers call `update_config()` in `init_worker_by_lua*` context
- `/status/ready` returns `{"status":"ok"}` with HTTP 200
- Docker health check passes (container shows "healthy" status)
- Routes continue working correctly
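
For anyone who wants to repeat this check against a patched container, a polling helper along the following lines may be useful. It is a sketch under assumptions: `wait_ready` is a hypothetical helper, not part of APISIX, and the attempt count and delay are arbitrary defaults.

```bash
#!/usr/bin/env bash
# Sketch: poll a readiness URL until it returns 200 or the attempts run out.
wait_ready() {
    local url="$1" attempts="${2:-30}" delay="${3:-1}" code="" i=1
    while [ "$i" -le "$attempts" ]; do
        # "000" is reported when the connection itself fails.
        code="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
        if [ "$code" = "200" ]; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "not ready after $attempts attempt(s), last status: $code" >&2
    return 1
}
```

For example, `wait_ready http://127.0.0.1:7085/status/ready 10 1` exits with status 0 once the endpoint answers with HTTP 200 and with status 1 if it is still failing after ten attempts.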

### Impact

This bug affects production deployments using:
- Kubernetes readiness probes with file-driven standalone mode
- Docker health checks
- Load balancers that depend on the `/status/ready` endpoint
- Any container orchestration that renders config files before starting APISIX

The health check always fails, preventing proper deployment orchestration, even though APISIX is functioning correctly and serving traffic.

### Additional Context

The bug is specific to the timing of when the configuration file is created relative to APISIX startup. If the file is created and never modified, workers never get marked as healthy. This is a common pattern in containerized deployments where entrypoint scripts render configuration from environment variables before starting the main process.
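
As an illustration of that pattern, the render step of such an entrypoint might look like the sketch below. It is hypothetical and not the official image's entrypoint: `render_apisix_yaml`, the `UPSTREAM_HOST` variable, the `/get` route, and the target path in the comments are all assumptions.

```bash
#!/usr/bin/env bash
# Sketch of a render-then-start entrypoint (hypothetical).

# Write a one-route apisix.yaml to path $1 using upstream host $2.
render_apisix_yaml() {
    cat > "$1" <<EOF
routes:
  - uri: /get
    upstream:
      type: roundrobin
      nodes:
        "$2:80": 1
#END
EOF
}

# In a real entrypoint the file is rendered once and APISIX is started afterwards, e.g.:
#   render_apisix_yaml /usr/local/apisix/conf/apisix.yaml "$UPSTREAM_HOST"
#   exec <your APISIX start command>
# Nothing touches the file after that, so its mtime stays fixed for the
# lifetime of the container.
```

Since the file is written exactly once before the main process starts, the workers' periodic mtime check never sees a change.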
