Recommended limits for created/running jobs #559

Open
soxofaan opened this issue Feb 25, 2025 · 3 comments

@soxofaan
Member

This is something that pops up regularly while working on client-side job managers: how many jobs can a user create, how many jobs can run in parallel, ... ?

At the moment, VITO projects rely on ad-hoc, per-user configs in the backend, plus user scripts to steer job managers that create and start tens/hundreds of jobs. But that involves poorly documented, non-standard alignment of various tools.

I think it makes sense to add something to the openEO API that allows backends to expose global or per-user capacity/limits for the number of created jobs, the number of concurrently running jobs, etc. That would allow clients to handle this in a cleaner and more transparent way. With the current API, the only official "interface" is basically trial and error: keep starting jobs until you get an error, and make sure to back off and retry properly.
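To illustrate the "keep starting jobs until you get an error, then back off" pattern described above, here is a minimal sketch; `JobLimitError` and `start_job` are hypothetical stand-ins, not part of any openEO client API:

```python
import time


class JobLimitError(Exception):
    """Stand-in for the HTTP error a backend returns when job limits are hit."""


def start_with_backoff(start_job, max_attempts=5, base_delay=0.01):
    """Try to start a job, backing off exponentially on limit errors.

    `start_job` is any callable that raises JobLimitError while the
    backend is at capacity (names here are invented for illustration).
    """
    for attempt in range(max_attempts):
        try:
            return start_job()
        except JobLimitError:
            # Exponential backoff before retrying.
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Note that without advertised limits, every client has to guess sensible values for `max_attempts` and `base_delay` on its own.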

To give an idea of what could be covered here, a non-exhaustive list of things that could be included:

  • maximum number of concurrently running batch jobs
  • currently remaining capacity for concurrently running batch jobs
  • maximum number and currently remaining capacity for number of created (not started) batch jobs
  • maximum number and currently remaining capacity of concurrent sync requests

These numbers would just be recommendations for clients/tools that support them. Going over the limits would simply trigger the errors we already have.
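For illustration, the kind of limits document listed above could look like the sketch below, consumed as a recommendation by a client; all field names are invented here and are not part of the openEO API:

```python
# Hypothetical shape of per-user limits a backend could advertise
# (field names are purely illustrative, not a proposed schema).
limits = {
    "batch_jobs": {
        "running": {"max": 10, "remaining": 4},
        "created": {"max": 100, "remaining": 57},
    },
    "sync_requests": {
        "running": {"max": 2, "remaining": 2},
    },
}


def can_start_batch_job(limits):
    """Treat advertised capacity as a recommendation: only start another
    batch job if the backend reports remaining running capacity.
    Defaults to True when the backend does not advertise this limit."""
    return limits.get("batch_jobs", {}).get("running", {}).get("remaining", 1) > 0
```

A job manager could consult such a helper before each submission, while still handling the usual errors as a fallback.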

I'm not sure yet what would be a good place to expose this:

  • new endpoint
  • main capabilities doc GET /
  • part of response on GET /jobs and related?
  • ...?

Note that this would also be interesting in a federation context to steer job distribution.

@m-mohr
Member

m-mohr commented Feb 26, 2025

I don't think this is a good idea. (But I'm also not a fan of the client-side approach of creating hundreds of jobs in the job manager. Shouldn't that be one job? It seems like a back-end limitation that is exposed to the user.)

For example, an implementation in the Web Editor that blocks a submission due to capacity limits would often not be up-to-date due to the request interval. So you could in theory already submit something; the UI just hasn't received up-to-date data. Additionally, if pagination is active, the Editor may not even know how many jobs are active (assuming we just iterate through jobs). Otherwise, you'd probably need separate statistics on active jobs as part of GET /jobs etc., but then how do you expose how many sync requests are running?

I think the trial-and-error approach here is okay. To ensure up-to-date limits you need to make a request anyway; we'd just move it to another endpoint. So I'm not sure what we gain.

Users could also just be informed about limits in other ways, e.g. the backend description, and then configure the job manager manually with those limits. Generally, we have tried to avoid defining limits too specifically, because backends can have limits in so many different ways that we probably can't think of all of them, and it could end up being an endless list of options. For example, someone may have combined limits for sync requests and batch jobs: yet another property to add...

@soxofaan
Member Author

> But I'm also not a fan of the client-side approach of creating hundreds of jobs in the job manager. Shouldn't that be one job? It seems like a back-end limitation that is exposed to the user.

We handle a lot of use cases where multi-job management is an important requirement. These users don't want a single giant job that takes weeks/months to finish; they want multiple, more manageable jobs that finish within a reasonable time, and whose results can easily be inspected on the go. They want to scale their load up/down to manage credit consumption, re-run where necessary, etc. It's true that it would be nice if this kind of functionality were provided by openEO, but that's not the case yet.

And we already experimented with "large area" processing and automatic job splitting at the level of the aggregator, but there are so many aspects and details to that, that it is just easier and more flexible to do the whole management from the client side. In the long term these ideas could/should certainly be ported to a backend component, but it's just too early while we are still exploring this space.

> that blocks a submission due to capacity limits would often not be up-to-date due to the request interval

It's true that you can run into race conditions when client and server are a bit out of sync, but that doesn't mean this information is worthless. That's like saying an email client should not report the number of unread messages because it could be off from time to time. And part of the proposal is about numbers that almost never change, like the maximum number of concurrently running batch/sync jobs.

> Additionally, if pagination is active, the Editor may not even know how many jobs are active (assuming we just iterate through jobs).

I'm not sure what you mean here, because this proposal is about the backend stating numbers/limits, not about clients guessing them.

> I think the trial-and-error approach here is okay. To ensure up-to-date limits you need to make a request anyway; we'd just move it to another endpoint. So I'm not sure what we gain.

It's true that you would still have to try and catch errors in the end. But users can get pretty panicky when errors pop up (even when shown as warnings). Likewise, as a backend, we also monitor 400/500 HTTP responses to get an idea of our service health. It's not ideal that these error stats would be polluted by clients that are just pushing their luck because there is no other way to detect limits.

> Users could also just be informed about limits in other ways, e.g. the backend description, and then configure the job manager manually with those limits. Generally, we have tried to avoid defining limits too specifically, because backends can have limits in so many different ways that we probably can't think of all of them, and it could end up being an endless list of options. For example, someone may have combined limits for sync requests and batch jobs: yet another property to add...

OK, that makes sense, and I understand that we don't want to predefine all possible limits at the level of the openEO API. But I still think it's valuable to at least standardize the place/endpoint where they can be found and consumed programmatically.

@m-mohr
Member

m-mohr commented Feb 27, 2025

> I'm not sure what you mean here

To make something useful with the limit, you'd also need to know the number of active operations. Where do you get that information from?
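To illustrate the status quo behind this question: without advertised numbers, a client can only approximate the active-operation count by paging through GET /jobs and counting statuses itself. A sketch, where `pages` stands in for successive paginated responses (the page structure is hypothetical):

```python
def count_active_jobs(pages):
    """Approximate the active-job count by iterating a paginated job listing.

    `pages` stands in for successive GET /jobs responses; each page is a
    list of job dicts with a `status` field. This is the client-side
    counting that pagination and extra requests make expensive.
    """
    active = {"queued", "running"}
    return sum(1 for page in pages for job in page if job["status"] in active)
```

For a user with many jobs this means walking the full listing on every check, which is exactly the cost the discussion is weighing against a dedicated limits/capacity response.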
