errors with the fuseTS CropSAR UDP #118

Open
Patrick1G opened this issue Dec 1, 2023 · 5 comments
@Patrick1G

hi @JanssenBrm,

I ran the CropSAR UDP from a Jupyter notebook; the job ID is below:
vito-j-231201cffeaa4c34a36b4eb3dd2cdb0f

It seems to have run for a good hour or so but then errored.
I can't read anything meaningful from the error logs. Could you take a look?

Also, as you can see below, this consumed 1585 credits.
For mature UDP execution in the future, we should probably only charge credits if the process completes. TBD also with @jdries.

(screenshot: job overview showing the 1585 credits consumed)
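
As a side note for readers following along: a batch job's status and error logs can also be pulled programmatically with the openeo Python client. A minimal sketch, assuming the job was submitted through openEO Platform (the back-end URL and the exact job ID form are assumptions; use whichever endpoint the job was actually submitted to):

import openeo

# Connect to the back-end the job was submitted to (URL is an assumption here).
connection = openeo.connect("openeo.cloud").authenticate_oidc()

# Look up the batch job by its ID and inspect its status and error logs.
job = connection.job("vito-j-231201cffeaa4c34a36b4eb3dd2cdb0f")
print(job.status())
for record in job.logs():
    if record.get("level") == "error":
        print(record.get("message"))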

@JanssenBrm
Collaborator

Thank you for the feedback, Patrick!

After taking a quick look, I noticed the following:

  • The job (j-231201cffeaa4c34a36b4eb3dd2cdb0f) is calculating CropSAR for a surface area of 8.30 square kilometers using the following AOI:
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              12.24193750206419,
              42.16891379340433
            ],
            [
              12.24193750206419,
              41.94976187473506
            ],
            [
              12.654130750909985,
              41.94976187473506
            ],
            [
              12.654130750909985,
              42.16891379340433
            ],
            [
              12.24193750206419,
              42.16891379340433
            ]
          ]
        ]
      }
    }
  ]
}
  • It looks like the job was reaching the default memory limits:

Stage error: Job aborted due to stage failure: Task 366 in stage 40.0 failed 4 times, most recent failure: Lost task 366.3 in stage 40.0 (TID 2352) (epod131.vgt.vito.be executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding physical memory limits. 4.1 GB of 4 GB physical memory used. Consider boosting spark.executor.memoryOverhead.

@patrick-griffiths - To support the processing of larger areas, a bigger memory allocation is required. This can be achieved via the job_options parameter when launching the batch job (an example is available in https://github.com/Open-EO/FuseTS/blob/main/notebooks/OpenEO/FuseTS%20-%20CropSAR.ipynb):

job = datacube.execute_batch(
    title="FuseTS - CropSAR",
    out_format="GTIFF",
    job_options={
        "executor-cores": "8",
        "task-cpus": "8",
        "executor-memoryOverhead": "6g",
    }
)
  • We see that the job was logged as FINISHED at the accounting service, which means that credits were deducted. @soxofaan, can we check why the final status was not set to FAILED for this job?

As the credits were wrongly deducted from your account, we've refunded the lost credits.
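
As a side note on the job_options above: the "executor-memoryOverhead": "6g" setting is what addresses the YARN error quoted earlier ("4.1 GB of 4 GB physical memory used ... Consider boosting spark.executor.memoryOverhead"). The 6g value shown above is a starting point; for larger AOIs the required headroom may differ, so it may need further tuning.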

@soxofaan
Member

soxofaan commented Dec 4, 2023

> We see that the job was logged as FINISHED at the accounting service, which means that credits were deducted. @soxofaan, can we check why the final status was not set to FAILED for this job?

I find this state/status handling quite confusing in the current ETL reporting implementation.
We track both "app_state" and "status". In the YARN variant we use the YARN (ApplicationMaster) "state" as "app_state" and the (YARN Application) "final status" as "status".
It turns out that with YARN in practice the former "(app) state" can be "FINISHED" while the latter "final status" is "FAILED". In that case we will report to ETL: state="FINISHED" (and status="UNDEFINED"). I think that is what is going on here (the history of j-231201cffeaa4c34a36b4eb3dd2cdb0f is not available anymore to verify this).

The confusing set of multiple state indicators in YARN is something we cannot change, but I'm not sure it has to leak into the ETL reporting part. Is there a reason that, for example, POST /resources has both a state and a status parameter (both required)?
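
To make the two indicators concrete, here is a minimal illustrative sketch of the mapping described above (the helper and its field names are assumptions for illustration, not the actual ETL reporting code):

def to_etl_report(yarn_app_state: str, yarn_final_status: str) -> dict:
    """Map YARN's two indicators onto the ETL reporting fields.

    As described above, the YARN (ApplicationMaster) "state" is used as
    "app_state" and the YARN Application "final status" as "status".
    For the job in this issue the reported combination ended up as
    state="FINISHED" with status="UNDEFINED", even though the underlying
    run had failed, which is exactly the confusion being discussed.
    """
    return {"app_state": yarn_app_state, "status": yarn_final_status}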

@bossie

bossie commented Dec 5, 2023

As discussed:

  • in the short term, a YARN status of FAILED (and not only a state of FAILED) should also be reported as FAILED to the ETL API (a minimal sketch of this rule follows below);
  • in the long term, the decision whether or not the usage should be charged will lie with OpenEO rather than with the ETL API. This removes logic from the ETL API and would allow OpenEO to e.g. still charge the usage if a job fails because of a user error, and not charge it if a job fails because of an error within OpenEO.
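
A minimal sketch of the short-term rule, assuming a simple helper that decides the status to report to the ETL API from YARN's two indicators (the function name is illustrative and the non-failure case is simplified; this is not the actual implementation):

def status_to_report(yarn_state: str, yarn_final_status: str) -> str:
    """Decide the job status reported to the ETL API.

    Short-term rule from the discussion above: a YARN final status of
    FAILED should be reported as FAILED, not only a YARN state of FAILED.
    The non-failure branch is deliberately simplified here.
    """
    if yarn_state == "FAILED" or yarn_final_status == "FAILED":
        return "FAILED"
    return "FINISHED"


# The problematic combination from this issue would now be reported as FAILED:
assert status_to_report("FINISHED", "FAILED") == "FAILED"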

@Patrick1G
Author


Thanks for the feedback, re-running it with more memory now.

@JanssenBrm
Collaborator

@patrick-griffiths - Were you able to re-run the CropSAR service with the increased memory allocation?
