Recommended limits for created/running jobs #559

Open
soxofaan opened this issue Feb 25, 2025 · 3 comments

@soxofaan
Member

This is something that pops up regularly while working on client-side job managers: how many jobs can a user create, how many jobs can run in parallel, ... ?

At the moment, VITO projects rely on ad-hoc, per-user configs in the backend, plus user scripts to steer job managers that create and start tens/hundreds of jobs. But that involves poorly documented, non-standard alignment of various tools.

I think it makes sense to add something to the openEO API that allows backends to expose global or per-user capacity/limits for the number of created jobs, the number of concurrently running jobs, etc. That would allow clients to handle this in a cleaner and more transparent way. With the current API, the only official "interface" is basically trial and error: keep starting jobs until you get an error, and make sure to back off and retry properly.
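To illustrate the "keep starting jobs until you get an error, then back off" pattern described above, here is a minimal sketch; `JobLimitError` and `start_job` are hypothetical stand-ins, not part of any openEO client API:

```python
import time


class JobLimitError(Exception):
    """Stand-in for the HTTP error a backend returns when job limits are hit."""


def start_with_backoff(start_job, max_attempts=5, base_delay=0.01):
    """Try to start a job, backing off exponentially on limit errors.

    `start_job` is any callable that raises JobLimitError while the
    backend is at capacity (names here are invented for illustration).
    """
    for attempt in range(max_attempts):
        try:
            return start_job()
        except JobLimitError:
            # Exponential backoff before retrying.
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Note that without advertised limits, every client has to guess sensible values for `max_attempts` and `base_delay` on its own.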

To give an idea of what could be covered here, a non-exhaustive list of things that could be included:

  • maximum number of concurrently running batch jobs
  • currently remaining capacity for concurrently running batch jobs
  • maximum number and currently remaining capacity for number of created (not started) batch jobs
  • maximum number and currently remaining capacity of concurrent sync requests

These numbers would just be recommendations for clients/tools that support them. Going over the limits would simply trigger the errors we already have.
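For illustration, the kind of limits document listed above could look like the sketch below, consumed as a recommendation by a client; all field names are invented here and are not part of the openEO API:

```python
# Hypothetical shape of per-user limits a backend could advertise
# (field names are purely illustrative, not a proposed schema).
limits = {
    "batch_jobs": {
        "running": {"max": 10, "remaining": 4},
        "created": {"max": 100, "remaining": 57},
    },
    "sync_requests": {
        "running": {"max": 2, "remaining": 2},
    },
}


def can_start_batch_job(limits):
    """Treat advertised capacity as a recommendation: only start another
    batch job if the backend reports remaining running capacity.
    Defaults to True when the backend does not advertise this limit."""
    return limits.get("batch_jobs", {}).get("running", {}).get("remaining", 1) > 0
```

A job manager could consult such a helper before each submission, while still handling the usual errors as a fallback.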

I'm not sure yet what would be a good place to expose this:

  • new endpoint
  • main capabilities doc GET /
  • part of response on GET /jobs and related?
  • ...?

Note that this would also be interesting in a federation context to steer job distribution.

@m-mohr
Member

m-mohr commented Feb 26, 2025

I don't think this is a good idea. (But I'm also not a fan of the client-side approach of creating hundreds of jobs in the job manager. Shouldn't that be one job? It seems like a back-end limitation that is exposed to the user.)

For example, an implementation in the Web Editor that blocks a submission due to capacity limits would often not be up-to-date due to the request interval. So you could in theory already submit something; the UI just hasn't received up-to-date data. Additionally, if pagination is active, the Editor may not even know how many jobs are active (assuming we just iterate through jobs). Otherwise, you'd probably need separate statistics on active jobs as part of GET /jobs etc., but then how do you expose how many sync requests are running?

I think the trial-and-error approach here is okay. To ensure up-to-date limits you need to make a request anyway; we'd just move it to another endpoint. So I'm not sure what we gain.

Users could also just be informed about limits in other ways, e.g. the backend description, and then configure the job manager manually with those limits. Generally, we have tried to avoid defining limits too specifically, because backends can have limits in so many different ways that we probably can't think of all of them, and it could end up being an endless list of options. For example, someone may have combined limits for sync requests and batch jobs: yet another property to add...

@soxofaan
Member Author

> But I'm also not a fan of the client-side approach of creating hundreds of jobs in the job manager. Shouldn't that be one job? It seems like a back-end limitation that is exposed to the user.

We handle a lot of use cases where multi-job management is an important requirement. These users don't want a single giant job that takes weeks/months to finish; they want multiple, more manageable jobs that finish within a reasonable time, and whose results can easily be inspected on the go. They want to scale their load up/down to manage credit consumption, re-run where necessary, etc. It's true that it would be nice if this kind of functionality were provided by openEO, but that's not the case yet.

And we already experimented with "large area" processing and automatic job splitting at the level of the aggregator, but there are so many aspects and details to that, that it is just easier and more flexible to do the whole management from the client side. In the long term these ideas could/should certainly be ported to a backend component, but it's just too early while we are still exploring this space.

> that blocks a submission due to capacity limits would often not be up-to-date due to the request interval

It's true that you can run into race conditions when client and server are a bit out of sync, but that doesn't mean this information is worthless. That's like saying an email client should not report the number of unread messages because it could be off from time to time. And part of the proposal is about numbers that almost never change, like the maximum number of concurrently running batch/sync jobs.

> Additionally, if pagination is active, the Editor may not even know how many jobs are active (assuming we just iterate through jobs).

I'm not sure what you mean here, because this proposal is about the backend stating numbers/limits, not about clients guessing them.

> I think the trial-and-error approach here is okay. To ensure up-to-date limits you need to make a request anyway; we'd just move it to another endpoint. So I'm not sure what we gain.

It's true that you would still have to try and catch errors in the end. But users can get pretty panicky when errors pop up (even when shown as warnings). Likewise, as a backend, we also monitor 400/500 HTTP responses to get an idea of our service health. It's not ideal that these error stats would be polluted by clients that are just pushing their luck because there is no other way to detect limits.

> Users could also just be informed about limits in other ways, e.g. the backend description, and then configure the job manager manually with those limits. Generally, we have tried to avoid defining limits too specifically, because backends can have limits in so many different ways that we probably can't think of all of them, and it could end up being an endless list of options. For example, someone may have combined limits for sync requests and batch jobs: yet another property to add...

OK, that makes sense, and I understand that we don't want to predefine all possible limits at the level of the openEO API. But I still think it's valuable to at least standardize the place/endpoint where they can be found and consumed programmatically.

@m-mohr
Member

m-mohr commented Feb 27, 2025

> I'm not sure what you mean here

To make something useful with the limit, you'd also need to know the number of active operations. Where do you get that information from?
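To illustrate the status quo behind this question: without advertised numbers, a client can only approximate the active-operation count by paging through GET /jobs and counting statuses itself. A sketch, where `pages` stands in for successive paginated responses (the page structure is hypothetical):

```python
def count_active_jobs(pages):
    """Approximate the active-job count by iterating a paginated job listing.

    `pages` stands in for successive GET /jobs responses; each page is a
    list of job dicts with a `status` field. This is the client-side
    counting that pagination and extra requests make expensive.
    """
    active = {"queued", "running"}
    return sum(1 for page in pages for job in page if job["status"] in active)
```

For a user with many jobs this means walking the full listing on every check, which is exactly the cost the discussion is weighing against a dedicated limits/capacity response.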
