Scaling delay for pod provisioning with higher job spikes #3276
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
@ventsislav-georgiev we're observing similar behaviour. It seems like it's batching 50 jobs from the queue every 60s, and while the min runners are picking up and completing jobs, only 50 new queued jobs are brought in within that window, so you'll always see 50 or fewer runners start within a minute. We have yet to completely understand why this is happening, but we have raised it with GitHub.
@ventsislav-georgiev we've been looking internally at how this can be resolved, with no real success - I'm hoping that GitHub can provide more feedback here. We've gone as far as ensuring this isn't network bandwidth, compute, or storage related. We do have a forward proxy that we use to reach GitHub, which I'll double-check this week; however, it should be able to handle much more load than 50 runner instances, and I haven't seen any error logs. @nikola-jokic any review from GitHub's side? I know we internally escalated a support ticket, which was unfortunately closed with an unsatisfactory answer: https://support.github.com/ticket/enterprise/1617/2658092
We are experiencing the same behavior in version 0.9.2. How are these issues being handled?
I tried splitting the runners as below, and it slightly reduced the startup time of the runners. For our repository, 400+ jobs can be running at once. Workflow changes like:
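(The original snippet is not shown above. As a minimal sketch of this kind of split, assuming two hypothetical runner scale sets named arc-runners-a and arc-runners-b installed from the same chart, the workflow changes could look like:)

```yaml
# Before: every job targeted a single scale set
#   runs-on: arc-runners
# After: jobs are spread across two scale sets so that each listener
# handles a smaller share of the spike (scale set names are hypothetical)
jobs:
  build:
    runs-on: arc-runners-a
    steps:
      - run: make build
  test:
    runs-on: arc-runners-b
    steps:
      - run: make test
```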
Startup time changes:
I suggest checking whether you are affected by actions/runner-container-hooks#167.
I discussed this issue with @mumoshu. He wrote a patch f58dd76 to reduce the total time to reconcile an EphemeralRunner. I tested the patch in our organization. According to the listener metrics, the job startup duration has been improved by the patch. Here are the distribution graphs of job startup duration.
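(For anyone who wants to compare the same numbers, a minimal sketch of exposing the listener metrics via the gha-runner-scale-set-controller chart values follows; the keys and ports mirror the chart's documented metrics block but should be verified against your chart version:)

```yaml
# values.yaml for the gha-runner-scale-set-controller Helm chart
# (assumed keys/ports; check against the chart version you deploy)
metrics:
  controllerManagerAddr: ":8080"   # controller-manager metrics address
  listenerAddr: ":8080"            # listener metrics address
  listenerEndpoint: "/metrics"     # path to scrape with Prometheus
```

The listener's job startup duration histogram (gha_job_startup_duration_seconds in recent releases) can be used for this kind of before/after comparison.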
Hey everyone! In addition to what @int128 has kindly shared, JFYI, I created another patch a4876c5 to make this behaviour customizable. In theory it should also alleviate the issue originally reported by @ventsislav-georgiev. To summarize:
I'd appreciate it if you all could help test these in your prod-like environments if possible.
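(One way to try the patches, assuming you build and publish your own controller image containing them, is to point the gha-runner-scale-set-controller chart at that image; the repository and tag below are placeholders:)

```yaml
# values.yaml override for the gha-runner-scale-set-controller chart
# (hypothetical image reference; build and push your own patched image)
image:
  repository: ghcr.io/your-org/gha-runner-scale-set-controller
  tag: "dev-patched"
```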
Hello, we are using the runner-scale-set version (0.9.3) with 300-400 runners at peak times and are experiencing the same issue. We've been investigating bottlenecks and reached a similar conclusion to @mumoshu's. These are the bottlenecks we found in the EphemeralRunnerReconciler by collecting trace data and measuring the time spent in suspicious parts.
I'd appreciate it if the maintainers could check my PRs and leave any comments. Thank you.
Checks
Controller Version
0.8.2
Deployment Method
Helm
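(For context, a gha-runner-scale-set values file of roughly this shape is what such a Helm deployment typically looks like; the URL, secret name, and runner counts below are placeholders rather than the reporter's actual configuration:)

```yaml
# values.yaml for the gha-runner-scale-set chart
# (placeholder values; not the reporter's actual configuration)
githubConfigUrl: "https://github.com/your-org"
githubConfigSecret: gha-runner-scale-set-secret
minRunners: 0
maxRunners: 400
runnerScaleSetName: arc-runners
```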
Checks
To Reproduce
Describe the bug
We are experiencing scaling issues during periods of higher demand. If our CI triggers a large number of jobs, the gha-runner-scale-set has a hard time spinning up pods, and they appear to be stuck in a pending state.
Here are some screen captures of scaling from 0 to 30, to 100, and to 400 runners (a sketch of a workflow that generates such a spike follows the recordings):
0 to 30 (took 15s to create 30 pods)
30target_15s_to_30pod.mov
0 to 100 (took 40s to create 30 pods)
100target_40s_to_30pod.mov
0 to 400 (took 2m 8s to create 30 pods)
400target_2m8s_to_30pod.mov
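(A workflow of roughly this shape can generate such a spike; the runs-on label and matrix size are illustrative, not the exact CI that produced the recordings above:)

```yaml
# spike.yaml: queues a burst of jobs against the runner scale set
# (illustrative only; the label and job count are placeholders)
name: runner-spike
on: workflow_dispatch
jobs:
  spike:
    strategy:
      matrix:
        shard: [1, 2, 3, 4, 5]   # widen the matrix to queue hundreds of jobs
    runs-on: arc-runners          # the runner scale set's label
    steps:
      - run: sleep 60
```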
Describe the expected behavior
The time to create the first pods should be the same regardless of the scaling target. Otherwise, the CI slows down at the moment it is needed most.
Additional Context
Controller Logs
Runner Pod Logs