Description
This is just my opinion based on my experience. Thank you for the wonderful product :)
Problem
The code that currently monitors the containers checks only the container's status, so the resulting error messages are not user-friendly.
I believe that if get_container_status returned more information than just pod_status, we could show users more appropriate errors.
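As a rough illustration of what the extra information would enable, here is a minimal sketch that turns a pod's status into a readable message. The helper name `describe_pod_problem` and the sample data are hypothetical, and the sketch assumes the snake_case dictionary layout produced by the Kubernetes Python client's `to_dict()`, where unset fields are present but set to None.

```python
from typing import Any, Dict, Optional


def describe_pod_problem(pod_status: Dict[str, Any]) -> Optional[str]:
    """Return a readable explanation if a container is stuck waiting, else None.

    Hypothetical helper. Assumes the snake_case keys of V1Pod.to_dict().
    """
    for cs in pod_status.get("container_statuses") or []:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        reason = waiting.get("reason")
        # ContainerCreating is normal progress, not something to report.
        if reason and reason != "ContainerCreating":
            message = waiting.get("message") or "no further details"
            return f"Container '{cs.get('name')}' is waiting ({reason}): {message}"
    return None


# Made-up sample data shaped like the "status" part of a pod dictionary.
example = {
    "phase": "Pending",
    "container_statuses": [
        {
            "name": "kernel",
            "state": {
                "running": None,
                "terminated": None,
                "waiting": {
                    "reason": "ImagePullBackOff",
                    "message": "Back-off pulling image",
                },
            },
        }
    ],
}
print(describe_pod_problem(example))
```

With only the pod phase available, the same situation would be reported as a plain "Pending" state.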
Proposed Solution
This is quite simplified, but here's the idea. I serialize the pod data to a JSON string so that the overridden method can keep its str return type, but there might be a better way.
k8s.py
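import datetime
import json

# json cannot encode datetime values, so define a custom encoder and use it.
class DatetimeJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        # Defer to the base class, which raises TypeError for unsupported types.
        return super().default(obj)
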
@overrides
def get_container_status(self, iteration: Optional[str]) -> str:
    # Locates the kernel pod using the kernel_id selector. Note that we also include 'component=kernel'
    # in the selector so that executor pods (when Spark is in use) are not considered.
    # If the phase indicates Running, the pod's IP is used for the assigned_ip.
    kernel_label_selector = f"kernel_id={self.kernel_id},component=kernel"
    ret = client.CoreV1Api().list_namespaced_pod(
        namespace=self.kernel_namespace, label_selector=kernel_label_selector
    )
    if ret and ret.items:
        # If ret.items is not empty, return the pod data as a JSON string.
        pod_dict = ret.items[0].to_dict()
        dump_json = json.dumps(pod_dict, cls=DatetimeJSONEncoder)
        return dump_json
    else:
        self.log.warning(f"kernel server pod not found in namespace '{self.kernel_namespace}'")
        return ""
Additional context
This might be specific to my environment, but once I made the startup check wait while the k8s pod is in the ContainerCreating state, or while no error has occurred and ContainersReady is false, kernels started to work properly even without a kernel image puller.
Here is a quite simplified example:
@overrides
async def confirm_remote_startup(self):
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    pod_info = self.get_container_status(str(i))
    # An empty string or None means the pod was not found.
    if pod_info:
        pod_info_json = json.loads(pod_info)
        status = pod_info_json["status"]
        pod_phase = status["phase"].lower()
        if pod_phase == "running":
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        else:
            # to_dict() emits snake_case keys and sets unset fields to None,
            # so look up "container_statuses" and guard against None values.
            container_statuses = status.get("container_statuses") or []
            for condition in status.get("conditions") or []:
                if container_statuses:
                    # Check whether the first container is still being created.
                    state = container_statuses[0].get("state") or {}
                    waiting = state.get("waiting") or {}
                    if waiting.get("reason") == "ContainerCreating":
                        self.log.info("Container is creating ...")
                        continue
                if (
                    condition["type"] == "ContainersReady"
                    and condition["status"] != "True"
                ):
                    self.log.warning("Containers are not ready; waiting 1 second.")
                    await asyncio.sleep(1)
                    continue
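
The waiting rule described under Additional context can also be isolated into a small pure function, which makes it easy to test without a cluster. This is only a sketch: the function name `classify_pod_status` and its return labels are hypothetical, treating the "failed" phase as the only error is an assumption, and the keys again assume the snake_case output of the Kubernetes Python client's `to_dict()`.

```python
from typing import Any, Dict


def classify_pod_status(status: Dict[str, Any]) -> str:
    """Map a pod status dict to 'running', 'failed', 'creating', 'not_ready' or 'unknown'.

    Hypothetical helper. Fields may be missing or None, as in V1Pod.to_dict().
    """
    phase = (status.get("phase") or "").lower()
    if phase == "running":
        return "running"
    if phase == "failed":  # assumption: the only phase treated as an error here
        return "failed"

    # A container still being created is normal progress, so keep waiting.
    for cs in status.get("container_statuses") or []:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "ContainerCreating":
            return "creating"

    # No error so far, but the containers have not reported ready yet.
    for condition in status.get("conditions") or []:
        if condition.get("type") == "ContainersReady" and condition.get("status") != "True":
            return "not_ready"

    return "unknown"


creating = {
    "phase": "Pending",
    "container_statuses": [{"state": {"waiting": {"reason": "ContainerCreating"}}}],
}
not_ready = {
    "phase": "Pending",
    "container_statuses": None,
    "conditions": [{"type": "ContainersReady", "status": "False"}],
}
print(classify_pod_status(creating))
print(classify_pod_status(not_ready))
```

A caller such as confirm_remote_startup could then sleep and retry on "creating" or "not_ready", and report an error only on "failed".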