Enhance k8s container status #132

@mmmommm

Description

This is just my opinion based on my experience. Thank you for the wonderful product :)

Problem

The code that monitors container startup currently looks only at the pod's status, so the resulting error messages are not user-friendly.

I believe that by having get_container_status return information other than pod_status, we can display more appropriate errors to the users.
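
As a rough illustration of the kind of message this would make possible, here is a minimal sketch. `describe_pod_problem` is a hypothetical helper, not part of the project, and the snake_case keys assume the dict comes from the Kubernetes Python client's `to_dict()`.

```python
from typing import Any, Dict


def describe_pod_problem(pod: Dict[str, Any]) -> str:
    """Build a short, user-facing description from a pod dict (hypothetical helper)."""
    status = pod.get("status") or {}
    phase = status.get("phase") or "Unknown"
    # Unset fields come back as None, so fall back to empty containers.
    for cs in status.get("container_statuses") or []:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason:
            detail = waiting.get("message") or "no further detail"
            return (
                f"Pod phase '{phase}': container '{cs.get('name')}' "
                f"is waiting ({reason}): {detail}"
            )
    return f"Pod phase '{phase}'"


# Made-up sample data for illustration only
sample_pod = {
    "status": {
        "phase": "Pending",
        "container_statuses": [
            {
                "name": "kernel",
                "state": {
                    "waiting": {
                        "reason": "ImagePullBackOff",
                        "message": "Back-off pulling image",
                    }
                },
            }
        ],
    }
}
print(describe_pod_problem(sample_pod))
```

With only the phase available, the same situation could be reported as nothing more than "Pending".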

Proposed Solution

This is quite simplified, but here's the idea. I serialize the pod data to a JSON string so that the overridden method keeps its str return type, but there might be a better way.

k8s.py

import datetime
import json

# json cannot encode datetime values, so define a custom encoder and use it
class DatetimeJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        # Defer to the base class, which raises TypeError for unsupported types
        return super().default(obj)

@overrides
def get_container_status(self, iteration: Optional[str]) -> str:
    # Locates the kernel pod using the kernel_id selector.  Note that we also include 'component=kernel'
    # in the selector so that executor pods (when Spark is in use) are not considered.
    # Returns the pod data as a JSON string, or "" if no pod is found.
    kernel_label_selector = f"kernel_id={self.kernel_id},component=kernel"
    ret = client.CoreV1Api().list_namespaced_pod(
        namespace=self.kernel_namespace, label_selector=kernel_label_selector
    )
    if ret and ret.items:
        # ret.items is not empty, so return the pod data as a JSON string
        pod_dict = ret.items[0].to_dict()
        return json.dumps(pod_dict, cls=DatetimeJSONEncoder)
    else:
        self.log.warning(f"kernel server pod not found in namespace '{self.kernel_namespace}'")
        return ""
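
To check the encoder in isolation, without a cluster, here is a small self-contained sketch. The `pod_dict` below is a made-up, heavily trimmed stand-in for `ret.items[0].to_dict()`. It also shows why the fallback should call `super().default(obj)`: returning `obj` unchanged hands the same unsupported value back to the encoder instead of raising a clear `TypeError`.

```python
import datetime
import json


class DatetimeJSONEncoder(json.JSONEncoder):
    """Encode datetime values as ISO 8601 strings and defer everything else."""

    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        # The base implementation raises TypeError for unsupported types.
        return super().default(obj)


# Hypothetical sample data, not real cluster output
pod_dict = {
    "metadata": {"name": "kernel-pod"},
    "status": {
        "phase": "Pending",
        "start_time": datetime.datetime(2024, 1, 1, 12, 0, 0),
    },
}

dumped = json.dumps(pod_dict, cls=DatetimeJSONEncoder)
restored = json.loads(dumped)
print(restored["status"]["start_time"])  # 2024-01-01T12:00:00
```

Note that the datetime comes back as a plain string after `json.loads`, so the caller has to parse it again if it needs a datetime object.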

Additional context

This might be specific to my environment, but I changed the startup check to keep waiting while the k8s pod is in the ContainerCreating state, or while no error has occurred and ContainersReady is false. With that change, kernels started working properly even without a kernel image puller.

This is quite simplified example code

@overrides
async def confirm_remote_startup(self):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    pod_info = self.get_container_status(str(i))
    # An empty string or None means the pod was not found
    if pod_info:
        pod_info_json = json.loads(pod_info)
        status = pod_info_json["status"]
        pod_phase = status["phase"].lower()
        if pod_phase == "running":
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        else:
            # to_dict() produces snake_case keys, and unset fields are None
            container_statuses = status.get("container_statuses") or []
            state = (container_statuses[0].get("state") or {}) if container_statuses else {}
            waiting = state.get("waiting") or {}
            for condition in status.get("conditions") or []:
                # Check whether the container is still being created
                if waiting.get("reason") == "ContainerCreating":
                    self.log.info("Container is creating ...")
                    continue
                if (
                    condition["type"] == "ContainersReady"
                    and condition["status"] != "True"
                ):
                    self.log.warning("Containers are not ready, waiting 1 second.")
                    await asyncio.sleep(1)
                    continue
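
The wait decision described above can also be pulled out into a pure function, which makes it easy to test without a cluster. This is only a sketch: `waiting_reasons` and `should_keep_waiting` are hypothetical names, the snake_case keys assume the dict comes from the Kubernetes Python client's `to_dict()`, and treating any waiting reason other than ContainerCreating as an error is my assumption.

```python
from typing import Any, Dict, List


def waiting_reasons(status: Dict[str, Any]) -> List[str]:
    """Collect the 'waiting' reasons of all containers in a pod status dict."""
    reasons = []
    for cs in status.get("container_statuses") or []:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason"):
            reasons.append(waiting["reason"])
    return reasons


def should_keep_waiting(status: Dict[str, Any]) -> bool:
    """Return True while the pod looks like it is still starting normally."""
    reasons = waiting_reasons(status)
    if "ContainerCreating" in reasons:
        return True
    if reasons:
        # Assumption: any other waiting reason (e.g. ErrImagePull) is an error.
        return False
    for condition in status.get("conditions") or []:
        if condition.get("type") == "ContainersReady" and condition.get("status") != "True":
            return True
    return False


# Made-up status dicts for illustration only
creating = {"container_statuses": [{"state": {"waiting": {"reason": "ContainerCreating"}}}]}
not_ready = {"container_statuses": None, "conditions": [{"type": "ContainersReady", "status": "False"}]}
print(should_keep_waiting(creating), should_keep_waiting(not_ready))  # True True
```

The caller would then sleep and poll again while this returns True, and report an error otherwise.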

Labels

enhancement: An improvement to an existing feature
