more flexible job manager end state #763

Open

HansVRP opened this issue Apr 19, 2025 · 4 comments

Comments

@HansVRP
Contributor

HansVRP commented Apr 19, 2025

With the new internal queue, jobs are automatically retried in case more jobs are created than the number of allowed parallel jobs.

Since the job manager runs until all jobs end in finalized, start failed or error, it does not support the internal queueing.

Ideally we would build in some flexibility that allows the user to submit and track more parallel jobs than their standard account supports.
Can we make the 'end condition' on start_failed more flexible without risking an endless loop?
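
A minimal sketch of the kind of end condition meant here (illustrative only, not the actual `MultiBackendJobManager` code; the terminal status names are assumptions based on the statuses mentioned above):

```python
# Illustrative sketch, not the actual MultiBackendJobManager implementation.
# Terminal statuses assumed from the issue description.
TERMINAL_STATUSES = {"finished", "error", "start_failed"}


def all_jobs_done(job_rows) -> bool:
    """Return True once every tracked job has reached a terminal status."""
    return all(row["status"] in TERMINAL_STATUSES for row in job_rows)

# The manager loop keeps starting/polling jobs until all_jobs_done() is True.
# Marking a rate-limited start attempt as "start_failed" therefore ends tracking
# for that job, even though it could still be retried later.
```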

@soxofaan
Member

> Since the job manager runs until all jobs end in finalized, start failed or error, it does not support the internal queueing.

I'm not sure I understand what you mean. The "internal queuing" feature is just an internal backend thing by design; I don't think there is anything required client-side to support that.

Something that might be possible, however, is to have a standard API to discover and leverage job submission limits, as discussed at

@HansVRP
Contributor Author

HansVRP commented Apr 22, 2025

Will create a minimal example to reproduce the issue

@HansVRP
Contributor Author

HansVRP commented Apr 22, 2025

Narrowed down the issue:

It comes from the try/except block in PR #736:

```python
def execute(self) -> _TaskResult:
    """
    Executes the job start process using the OpenEO connection.

    Authenticates if a bearer token is provided, retrieves the job by ID,
    and attempts to start it.

    :returns:
        A `_TaskResult` with status and statistics metadata, indicating
        success or failure of the job start.
    """
    try:
        conn = openeo.connect(self.root_url)
        if self.bearer_token:
            conn.authenticate_bearer_token(self.bearer_token)
        job = conn.job(self.job_id)
        job.start()
        _log.info(f"Job {self.job_id} started successfully")
        return _TaskResult(
            job_id=self.job_id,
            db_update={"status": "queued"},
            stats_update={"job start": 1},
        )
    except Exception as e:
        _log.error(f"Failed to start job {self.job_id}: {e}")
        return _TaskResult(
            job_id=self.job_id,
            db_update={"status": "start_failed"},
            stats_update={"start_job error": 1},
        )
```

Failed to start job j-2504220752104722b90406957695f315: [429] Too Many Requests

--> We need to avoid labeling 'Too Many Requests' (429) errors as start_failed and instead keep those jobs in the 'created' state so they get retried.
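
One possible direction, as a minimal sketch rather than a tested fix: handle the openEO API error separately and, when the backend answers with HTTP 429, keep the job in the `created` state so the manager retries the start on a later iteration. This assumes `openeo.rest.OpenEoApiError` is the exception raised here and that it exposes the HTTP status code as `http_status_code`; the helper name and stats keys below are made up for illustration.

```python
import logging

import openeo
from openeo.rest import OpenEoApiError

_log = logging.getLogger(__name__)


def start_job_or_retry_later(connection: openeo.Connection, job_id: str) -> dict:
    """Hypothetical helper: start a job and map the outcome to a job manager status."""
    try:
        connection.job(job_id).start()
        _log.info(f"Job {job_id} started successfully")
        return {"status": "queued", "stats": {"job start": 1}}
    except OpenEoApiError as e:
        # Assumption: OpenEoApiError carries the HTTP status as `http_status_code`.
        if e.http_status_code == 429:
            # Rate limited by the backend: keep the job as "created" so the
            # manager attempts the start again later instead of giving up on it.
            _log.warning(f"Job {job_id} got '429 Too Many Requests', keeping it as 'created'")
            return {"status": "created", "stats": {"start_job rate limited": 1}}
        _log.error(f"Failed to start job {job_id}: {e}")
        return {"status": "start_failed", "stats": {"start_job error": 1}}
```

The same branching could be applied inside `execute()` from the snippet above, returning a `_TaskResult` with `db_update={"status": "created"}` for the 429 case.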
