Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workers stuck in terminating state when 'job_completion_wait' is specified and 'handle_sig_wait_for_completion' is triggered #369

Closed
mernmic opened this issue Dec 5, 2022 · 4 comments · Fixed by #370

Comments

@mernmic
Copy link
Contributor

mernmic commented Dec 5, 2022

Currently the workers will be stuck in a termination state when the handle_sig_wait_for_completion is activated, causing ineffective k8s pod lifecycles.

This appears to be happening since the '_sleep_until_tasks_complete' function is checking len() of self.tasks, but the deletion of done task ids is nested under the if self.allow_pick_jobs. reference: https://github.com/samuelcolvin/arq/blob/main/arq/worker.py#L388

Have tried a couple things:

  1. tabbing over the task deletion out from the if self.allow_pick_jobs check
  2. having '_sleep_until_tasks_complete' read from self.job_tasks, which are being deleted during finally block in 'run_job'. reference: https://github.com/samuelcolvin/arq/blob/main/arq/worker.py#L583

For both attempted fixes, have been met with asyncio.exceptions.CancelledError errors.

A fix that seems to be working, iterate tasks during '_sleep_until_tasks_complete' while loop and search for not done.

Proposed fix:

Existing code:

async def _sleep_until_tasks_complete(self) -> None:  
    """
    Sleeps until all tasks are done. Used together with asyncio.wait_for()
    """
    while len(self.tasks):
        await asyncio.sleep(0.1)

Recommended fix:

async def _sleep_until_tasks_complete(self) -> None:  
    """
    Sleeps until all tasks are done. Used together with asyncio.wait_for()
    """
    while len([task for task in self.tasks.values() if not task.done()]):
        await asyncio.sleep(0.1)

Is this a change you would support if a PR is made, or is there a better way around this issue?

@mernmic mernmic changed the title Workers stuck in terminating state when 'job_completion_wait' is specified Workers stuck in terminating state when 'job_completion_wait' is specified and 'handle_sig_wait_for_completion' is triggered Dec 5, 2022
@mernmic
Copy link
Contributor Author

mernmic commented Dec 5, 2022

@JonasKs nice work on getting this feature implemented!
It would be nice to not have to wait for the configured job_completion_wait time in order for the workers to shutdown, if you had any ideas.

@JonasKs
Copy link
Collaborator

JonasKs commented Dec 6, 2022

Hey @mernmic! Thanks for the good bug report!

I was unsure how I missed that, but I actually used your suggested implementation when I was looking at my logs in my k8s lab, but acted on self.tasks() since it was used in the other signal handler.

I'm sorry I didn't catch this. PR is very welcome - we deployed this to QA yesterday, so I assume I'll see your findings when I get to work in a few hours..
(PR world preferably come together with a test that proves it's a bug, so we don't implement it again. I'll look into helping with this ASAP.)

JonasKs added a commit to JonasKs/arq that referenced this issue Dec 6, 2022
JonasKs added a commit to JonasKs/arq that referenced this issue Dec 6, 2022
@JonasKs
Copy link
Collaborator

JonasKs commented Dec 6, 2022

I'd love if you could check out #370 and test out the fix before we merge @mernmic 😊

@mernmic
Copy link
Contributor Author

mernmic commented Dec 6, 2022

I'd love if you could check out #370 and test out the fix before we merge @mernmic 😊

Things are working on my end! Thanks for getting those changes in so quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants