Skip to content

Commit 97c0495

Browse files
committed
Made various improvements to the AutostartedAgent._ensure_agents method (Issue #7612, PR #7612)
Fixes bug in autostarted agent manager that occasionally causes agent's sessions to time out, dropping calls, resulting in a "stuck" orchestrator until a redeploy is triggered. ~~The bug has been there forever, but the conditions for it to trigger happen to be set up by #7278.~~ ~~The core of the bug is the following. When we need a certain agent to be up, we call `AutostartedAgentManager._ensure_agents`. This makes sure an agent process is running for these agents and waits for them to be up. However, instead of starting a process to track the autostarted agent map and to trust that it would keep up with changes to it, we instead start a process with an explicit list of agent endpoints.~~ If a new call comes in to `_ensure_agents`, and the agent is not yet up, we would kill the current process and start a new one. In killing the process, we would not expire its sessions, letting them time out on the 30s heartbeat timeout, losing any calls made to it in that window. EDIT: Scratched some of the above: I wrote a test case for the first bullet point below, which I thought to be the root cause of the reason for killing the old agent process. However, turns out the test also succeeds on master. The agent's `on_reconnect` actually updates the process' agent map. So, I think my root cause analysis may have been wrong (at least less likely), but the second and third bullet points should fix the issue anyway (doesn't matter exactly *how* we got in the situation where only part of the agent endpoints were up, as long as we handle it correctly), and even if part of the issue persists, the logging improvements should help future investigation. And the first bullet point is still a good consistency fix imo. This PR makes the following changes: - An agent process for autostarted agents (`use_autostart_agent_map` in the config file) now ignores any agent endpoints in the config file. Instead it runs purely in autostarted agent mode, starting instances for each of the endpoints in the agent map. It then tracks any changes made to the agent map. This last part was already in place, and resulted in a small inconsistency where the process would respect agent names in the config at start, and then suddenly move to consider the agent map the authority as soon as a change comes in. This inconsistency is now resolved by considering the agent map the authority for the entire lifetime of the agent process. - The autostarted agent manager now trusts in its processes to track the agent map. If a process is already running for an environment, but the agent endpoints are not yet up, it waits for them rather than to kill the process and start fresh. - When an explicit restart is requested, the autostarted agent manager now expires all sessions for the agents in the agent map, i.e. the endpoints for the killed process. - Improved robustness of wait condition for an agent to be up, specifically make sure we don't consider expired sessions to be up. - Made some log messages a bit more informative, no major changes. Strike through any lines that are not applicable (`~~line~~`) then check the box - [x] ~Attached issue to pull request~ - [x] Changelog entry - [x] Type annotations are present - [x] Code is clear and sufficiently documented - [x] No (preventable) type errors (check using make mypy or make mypy-diff) - [x] Sufficient test cases (reproduces the bug/tests the requested feature) - [x] Correct, in line with design - [x] ~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~ - [x] ~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~
1 parent 33bc2cb commit 97c0495

File tree

7 files changed

+370
-189
lines changed

7 files changed

+370
-189
lines changed
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
description: "Made various improvements to the AutostartedAgent._ensure_agents method"
2+
sections:
3+
bugfix: "Fixed a race condition where autostarted agents might become unresponsive for 30s when restarted"
4+
issue-nr: 7612
5+
change-type: patch
6+
destination-branches:
7+
# master and iso7 merged in #7612
8+
- iso6
9+

src/inmanta/agent/agent.py

Lines changed: 19 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1184,16 +1184,18 @@ async def _init_agent_map(self) -> None:
11841184
self.agent_map = cfg.agent_map.get()
11851185

11861186
async def _init_endpoint_names(self) -> None:
1187-
if self.hostname is not None:
1188-
await self.add_end_point_name(self.hostname)
1189-
else:
1190-
# load agent names from the config file
1191-
agent_names = cfg.agent_names.get()
1192-
if agent_names is not None:
1193-
for name in agent_names:
1194-
if "$" in name:
1195-
name = name.replace("$node-name", self.node_name)
1196-
await self.add_end_point_name(name)
1187+
assert self.agent_map is not None
1188+
endpoints: Iterable[str] = (
1189+
[self.hostname]
1190+
if self.hostname is not None
1191+
else (
1192+
self.agent_map.keys()
1193+
if cfg.use_autostart_agent_map.get()
1194+
else (name if "$" not in name else name.replace("$node-name", self.node_name) for name in cfg.agent_names.get())
1195+
)
1196+
)
1197+
for endpoint in endpoints:
1198+
await self.add_end_point_name(endpoint)
11971199

11981200
async def stop(self) -> None:
11991201
await super().stop()
@@ -1271,6 +1273,13 @@ async def update_agent_map(self, agent_map: dict[str, str]) -> None:
12711273
await self._update_agent_map(agent_map)
12721274

12731275
async def _update_agent_map(self, agent_map: dict[str, str]) -> None:
1276+
if "internal" not in agent_map:
1277+
LOGGER.warning(
1278+
"Agent received an update_agent_map() trigger without internal agent in the agent_map %s",
1279+
agent_map,
1280+
)
1281+
agent_map = {"internal": "local:", **agent_map}
1282+
12741283
async with self._instances_lock:
12751284
self.agent_map = agent_map
12761285
# Add missing agents

src/inmanta/data/__init__.py

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -28,12 +28,12 @@
2828
import warnings
2929
from abc import ABC, abstractmethod
3030
from collections import abc, defaultdict
31-
from collections.abc import Awaitable, Iterable, Sequence
31+
from collections.abc import Awaitable, Callable, Iterable, Sequence, Set
3232
from configparser import RawConfigParser
3333
from contextlib import AbstractAsyncContextManager
3434
from itertools import chain
3535
from re import Pattern
36-
from typing import Any, Callable, Generic, NewType, Optional, TypeVar, Union, cast, overload
36+
from typing import Any, Generic, NewType, Optional, TypeVar, Union, cast, overload
3737
from uuid import UUID
3838

3939
import asyncpg
@@ -1274,7 +1274,7 @@ def get_connection(
12741274
"""
12751275
if connection is not None:
12761276
return util.nullcontext(connection)
1277-
# Make pypi happy
1277+
# Make mypy happy
12781278
assert cls._connection_pool is not None
12791279
return cls._connection_pool.acquire()
12801280

@@ -3421,10 +3421,12 @@ def get_valid_field_names(cls) -> list[str]:
34213421
return super().get_valid_field_names() + ["process_name", "status"]
34223422

34233423
@classmethod
3424-
async def get_statuses(cls, env_id: uuid.UUID, agent_names: set[str]) -> dict[str, Optional[AgentStatus]]:
3424+
async def get_statuses(
3425+
cls, env_id: uuid.UUID, agent_names: Set[str], *, connection: Optional[asyncpg.connection.Connection] = None
3426+
) -> dict[str, Optional[AgentStatus]]:
34253427
result: dict[str, Optional[AgentStatus]] = {}
34263428
for agent_name in agent_names:
3427-
agent = await cls.get_one(environment=env_id, name=agent_name)
3429+
agent = await cls.get_one(environment=env_id, name=agent_name, connection=connection)
34283430
if agent:
34293431
result[agent_name] = agent.get_status()
34303432
else:

0 commit comments

Comments
 (0)