
Scaling delay for pod provisioning with higher job spikes #3276

Open · 4 tasks done · ventsislav-georgiev opened this issue Feb 8, 2024 · 9 comments
Labels: bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)

Comments

@ventsislav-georgiev

Controller Version

0.8.2

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Reproducible by modifying the minimum runners of an AutoscalingRunnerSet: from 0 to 30, then to 100, then to 400.
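
For anyone who wants to reproduce the spike quickly, here is a minimal Go sketch that patches spec.minRunners on the AutoscalingRunnerSet via client-go's dynamic client. The group/version/resource (actions.github.com/v1alpha1, autoscalingrunnersets), namespace, and resource name are assumptions; adjust them for your install.

```go
// Minimal sketch: bump minRunners on an AutoscalingRunnerSet to simulate a job spike.
// The GVR, namespace, and name below are assumptions; adjust for your setup.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (same credentials kubectl would use).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR of the gha-runner-scale-set CRD.
	gvr := schema.GroupVersionResource{
		Group:    "actions.github.com",
		Version:  "v1alpha1",
		Resource: "autoscalingrunnersets",
	}

	// Jump minRunners from 0 to 400 in one step, as in the repro above.
	patch := []byte(`{"spec":{"minRunners":400}}`)
	_, err = client.Resource(gvr).Namespace("arc-runners").Patch(
		context.Background(), "arc-runner-set", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("minRunners patched to 400; watch how long the first pods take to appear")
}
```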

Describe the bug

We are experiencing scaling issues during periods of higher demand. When our CI triggers a large number of jobs, the gha-runner-scale-set has a hard time spinning up pods, and they appear to be stuck in the Pending state.

Here are some screen captures of scaling from 0 to 30, to 100 and to 400 runners:

0 to 30 (took 15s to create 30 pods)

30target_15s_to_30pod.mov

0 to 100 (took 40s to create 30 pods)

100target_40s_to_30pod.mov

0 to 400 (took 2m 8s to create 30 pods)

400target_2m8s_to_30pod.mov

Describe the expected behavior

The time to get the first pods running should be the same regardless of the target count. Otherwise, CI slows down at the moment it is needed most.

Additional Context

-

Controller Logs

https://gist.github.com/ventsislav-georgiev/f318f84b6bc6e801d733907087ce287c

Runner Pod Logs

[Irrelevant]
ventsislav-georgiev added the bug, gha-runner-scale-set, and needs triage labels on Feb 8, 2024
github-actions bot commented Feb 8, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

xunholy commented Feb 14, 2024

@ventsislav-georgiev we're observing similar behaviour. It appears to batch 50 jobs from the queue every 60s, and while the min runners are picking up and completing work, it only brings in 50 newly queued jobs within that window, so you'll always see 50 or fewer runners start within a minute.

We have yet to completely understand why this is happening, but we have raised it with GitHub.

nikola-jokic removed the needs triage label on Feb 19, 2024
xunholy commented Mar 24, 2024

@ventsislav-georgiev we've been looking internally at what/how this can be resolved, with no real success - I'm hoping that GitHub can provide more feedback here...

We've gone as far as ensuring this isn't network-bandwidth, compute, or storage related. We do have a forward proxy that we use to reach GitHub, which I'll double-check this week; however, it should be able to handle much more load than 50 runner instances, and I haven't seen any error logs.

@nikola-jokic any review from GitHub's side? I know we internally escalated a support ticket, which was unfortunately closed with an unsatisfactory answer: https://support.github.com/ticket/enterprise/1617/2658092

@diegotecbr
We are experiencing the same behavior in version 0.9.2. How are these issues being handled?

int128 commented Oct 1, 2024

I tried splitting the runners as shown below, and it slightly reduced their startup time. In our repository, up to 400+ jobs run at once.

Workflow changes like:

  • Before
    • Test jobs run on runner A
    • Deploy jobs run on runner A
    • Other jobs run on runner A
  • After
    • Test jobs run on runner A
    • Deploy jobs run on runner B
    • Other jobs run on runner C

Startup time changes:

  • Before: 90s (75th percentile), 136s (90th percentile)
  • After: 71s (75th percentile), 122s (90th percentile)

@krzysztof-magosa
I suggest checking whether you are affected by actions/runner-container-hooks#167.
In our case, ARC waits for files to be copied before spawning new containers, and that is the main delaying factor.

int128 commented Nov 2, 2024

I discussed this issue with @mumoshu. He wrote a patch, f58dd76, to reduce the total time needed to reconcile an EphemeralRunner object.

I tested the patch in our organization. According to the listener metrics, the job startup duration improved with the patch. Here are the distribution graphs of job startup duration.

[Image: distribution graphs of job startup duration before and after the patch]

mumoshu commented Nov 25, 2024

Hey everyone!

In addition to what @int128 has kindly shared, JFYI, I created another patch a4876c5 to enable customizing MaxConcurrentReconciles.
Please see #3021 for more context. The original issue was about the legacy ARC but it should also apply to the runnerscalesets version.

In theory, it should also alleviate the issue originally reported by @ventsislav-georgiev. To summarize:

  • f58dd76 should keep runner pod startup latency from growing in proportion to the number of ephemeral runners. It does so by removing unnecessary "requeues" (back-and-forth trips through the reconciler) on the happy path.
  • a4876c5 should improve runner pod startup times by using more goroutines (effectively "more CPUs", assuming you are NOT K8s API or network bound) for EphemeralRunner reconciliation; see the sketch below.
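
For context, here is a minimal controller-runtime sketch (not ARC's actual code) of where MaxConcurrentReconciles is normally wired in; a4876c5 makes this value configurable for the ARC reconcilers. The reconciler type, watched object, and value below are placeholders.

```go
// Minimal controller-runtime sketch (not ARC's code) showing where
// MaxConcurrentReconciles is configured. With the default of 1, objects are
// reconciled one at a time; raising it runs several Reconcile calls in parallel.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

type ExampleReconciler struct{}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... reconcile a single object; ARC would handle an EphemeralRunner here ...
	return ctrl.Result{}, nil
}

func (r *ExampleReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}). // stand-in object type for this sketch
		WithOptions(controller.Options{
			// Placeholder value; a4876c5 exposes this as a setting.
			MaxConcurrentReconciles: 8,
		}).
		Complete(r)
}
```

As noted above, raising the concurrency mainly helps when reconciliation is goroutine/CPU bound rather than K8s API or network bound.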

I'd appreciate it if you all could help test these in your prod-like environments if possible.

tfujiwar commented Dec 3, 2024

Hello, we are using the runner-scale-set version (0.9.3) with 300-400 runners at peak times and experiencing the same issue. We've been investigating bottlenecks and reached a similar conclusion to @mumoshu's.

These are the bottlenecks we found in the EphemeralRunnerReconciler by collecting trace data and measuring the time spent in suspicious parts of the code.

  • Workqueue rate limiter
    • The EphemeralRunnerReconciler returns ctrl.Result{Requeue: true} after modifying secrets. Such retries are rate limited by the workqueue rate limiter, which delays subsequent reconciles (see the sketch after this list).
    • This will be fixed by f58dd76
  • Concurrency of reconcilers
    • MaxConcurrentReconciles is set to 1 for all reconcilers. It can become a bottleneck once the other bottlenecks are resolved.
    • This will be fixed by a4876c5
  • K8s API client rate limiter
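
To illustrate the first bullet, here is a minimal sketch (not ARC code) of the default workqueue rate limiter that sits behind ctrl.Result{Requeue: true}: the delay is the max of a per-item exponential backoff (5ms base) and a shared token bucket (10 qps, burst 100), so once many EphemeralRunners requeue around the same time, each later requeue waits progressively longer.

```go
// Minimal sketch (not ARC code) of the default controller workqueue rate limiter.
// Requeued items go through AddRateLimited; with hundreds of runners requeuing
// at once, the shared token bucket is what stretches the delays.
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	limiter := workqueue.DefaultControllerRateLimiter()

	// Simulate many distinct EphemeralRunners each being requeued once around
	// the same time. The per-item backoff stays tiny (first failure = 5ms), but
	// once the bucket's burst of 100 is used up, each further requeue is pushed
	// roughly 100ms later than the previous one.
	for i := 0; i < 300; i++ {
		item := fmt.Sprintf("ephemeralrunner-%d", i)
		delay := limiter.When(item)
		if i%100 == 0 || i == 299 {
			fmt.Printf("requeue of item %d delayed by %v\n", i, delay)
		}
	}
}
```

This matches the observation that startup latency grows with the number of ephemeral runners instead of staying constant.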

I'd appreciate it if the maintainers could review my PRs and share any comments. Thank you.
