errors with the fuseTS CropSAR UDP #118

Open
Patrick1G opened this issue Dec 1, 2023 · 5 comments
@Patrick1G

hi @JanssenBrm,

I ran the CropSAR UDP from a Jupyter notebook; the job ID is below:
vito-j-231201cffeaa4c34a36b4eb3dd2cdb0f

It seems to have run for a good hour or so but then errored.
I can't read anything meaningful from the error logs. Could you take a look?

Also, as you can see below, this consumed 1585 credits.
For mature UDP execution in the future, we should probably only charge credits if the process completes. TBD also with @jdries.

(screenshot: job overview showing the 1585 credits consumed)
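
As a side note for readers following along: a batch job's status and error logs can also be pulled programmatically with the openeo Python client. A minimal sketch, assuming the job was submitted through openEO Platform (the back-end URL and the exact job ID form are assumptions; use whichever endpoint the job was actually submitted to):

import openeo

# Connect to the back-end the job was submitted to (URL is an assumption here).
connection = openeo.connect("openeo.cloud").authenticate_oidc()

# Look up the batch job by its ID and inspect its status and error logs.
job = connection.job("vito-j-231201cffeaa4c34a36b4eb3dd2cdb0f")
print(job.status())
for record in job.logs():
    if record.get("level") == "error":
        print(record.get("message"))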

@JanssenBrm
Collaborator

Thank you for the feedback, Patrick!

After taking a quick look, I noticed the following:

  • The job (j-231201cffeaa4c34a36b4eb3dd2cdb0f) is calculating CropSAR for a surface area of 8.30 square kilometers using the following AOI:
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {},
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              12.24193750206419,
              42.16891379340433
            ],
            [
              12.24193750206419,
              41.94976187473506
            ],
            [
              12.654130750909985,
              41.94976187473506
            ],
            [
              12.654130750909985,
              42.16891379340433
            ],
            [
              12.24193750206419,
              42.16891379340433
            ]
          ]
        ]
      }
    }
  ]
}
  • It looks like the job was reaching the default memory limits:

Stage error: Job aborted due to stage failure: Task 366 in stage 40.0 failed 4 times, most recent failure: Lost task 366.3 in stage 40.0 (TID 2352) (epod131.vgt.vito.be executor 21): ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding physical memory limits. 4.1 GB of 4 GB physical memory used. Consider boosting spark.executor.memoryOverhead.

@patrick-griffiths - To support the processing of larger areas, a bigger memory allocation is required. This can be achieved via the job_options parameter when launching the batch job (an example is available in https://github.com/Open-EO/FuseTS/blob/main/notebooks/OpenEO/FuseTS%20-%20CropSAR.ipynb):

job = datacube.execute_batch(
    title="FuseTS - CropSAR",
    out_format="GTIFF",
    job_options={
        "executor-cores": "8",
        "task-cpus": "8",
        "executor-memoryOverhead": "6g",
    }
)
  • We see that the job was logged as FINISHED at the accounting service, which means that credits were deducted. @soxofaan, can we check why the final status was not set to FAILED for this job?

As the credits were wrongly deducted from your account, we've refunded the lost credits.
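
As a side note on the job_options above: the "executor-memoryOverhead": "6g" setting is what addresses the YARN error quoted earlier ("4.1 GB of 4 GB physical memory used ... Consider boosting spark.executor.memoryOverhead"). The 6g value shown above is a starting point; for larger AOIs the required headroom may differ, so it may need further tuning.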

@soxofaan
Member

soxofaan commented Dec 4, 2023

> We see that the job was logged as FINISHED at the accounting service, which means that credits were deducted. @soxofaan, can we check why the final status was not set to FAILED for this job?

I find this state/status handling quite confusing in the current ETL reporting implementation.
We track both "app_state" and "status". In the YARN variant we use the YARN (ApplicationMaster) "state" as "app_state" and the (YARN Application) "final status" as "status".
It turns out that with YARN in practice the former "(app) state" can be "FINISHED" while the latter "final status" is "FAILED". In that case we will report to ETL: state="FINISHED" (and status="UNDEFINED"). I think that is what is going on here (the history of j-231201cffeaa4c34a36b4eb3dd2cdb0f is not available anymore to verify this).

The confusing set of multiple state indicators in YARN is something we cannot change, but I'm not sure it has to leak into the ETL reporting part. Is there a reason that, for example, POST /resources has both a state and a status parameter (both required)?
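
To make the two indicators concrete, here is a minimal illustrative sketch of the mapping described above (the helper and its field names are assumptions for illustration, not the actual ETL reporting code):

def to_etl_report(yarn_app_state: str, yarn_final_status: str) -> dict:
    """Map YARN's two indicators onto the ETL reporting fields.

    As described above, the YARN (ApplicationMaster) "state" is used as
    "app_state" and the YARN Application "final status" as "status".
    For the job in this issue the reported combination ended up as
    state="FINISHED" with status="UNDEFINED", even though the underlying
    run had failed, which is exactly the confusion being discussed.
    """
    return {"app_state": yarn_app_state, "status": yarn_final_status}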

@bossie

bossie commented Dec 5, 2023

As discussed:

  • in the short term, a YARN status of FAILED (and not only a state of FAILED) should also be reported as FAILED to the ETL API (a minimal sketch of this rule follows below);
  • in the long term, the decision whether or not the usage should be charged will lie with OpenEO rather than with the ETL API. This removes logic from the ETL API and would allow OpenEO to e.g. still charge the usage if a job fails because of a user error, and not charge it if a job fails because of an error within OpenEO.
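
A minimal sketch of the short-term rule, assuming a simple helper that decides the status to report to the ETL API from YARN's two indicators (the function name is illustrative and the non-failure case is simplified; this is not the actual implementation):

def status_to_report(yarn_state: str, yarn_final_status: str) -> str:
    """Decide the job status reported to the ETL API.

    Short-term rule from the discussion above: a YARN final status of
    FAILED should be reported as FAILED, not only a YARN state of FAILED.
    The non-failure branch is deliberately simplified here.
    """
    if yarn_state == "FAILED" or yarn_final_status == "FAILED":
        return "FAILED"
    return "FINISHED"


# The problematic combination from this issue would now be reported as FAILED:
assert status_to_report("FINISHED", "FAILED") == "FAILED"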

@Patrick1G
Author


Thanks for the feedback, re-running it with more memory now.

@JanssenBrm
Collaborator

@patrick-griffiths - Were you able to re-run the CropSAR service with the increased memory allocation?
