Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Httpcache is causing excessive runner creation #205

Open
alfred-stokespace opened this issue May 13, 2024 · 1 comment
Open

Httpcache is causing excessive runner creation #205

alfred-stokespace opened this issue May 13, 2024 · 1 comment

Comments

@alfred-stokespace
Copy link
Contributor

Took quite a bit of digging to get a reliable test case that got me to a smoking gun...

First, the symptom....

For very short jobs, as in jobs that only ran for a 20 - 50 seconds (for example a job that errors out super fast on tag check), these jobs would produce a runner, great! The job would complete, good! Buutt... I'd end up eventually getting another runner that would replace that runner. When I'd check in the db i'd find a job in the jobs table for that same job which already ran.

As I turned debug on I kept finding these over and over...

2024/05/13 15:30:36 [DEBUG] myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... (second: 70s)

and these were getting produced despite the fact that I was watch with a gh api ... /repos/$org/$repo/actions/runners cli script the runner get registered, be online, then be removed.
yet still I kept seeing that message saying myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry...
So I threw a bunch of logging in and finally nailed down to a test with to a minimum of this call

func Test(t *testing.T) {
	config.Load()
	config.Config.Debug = true

	client, err := NewClient(os.Getenv("TEST_PAT"))
	if err != nil {
		t.Fatalf("%v", err)
	}
	count := 0
	for count < 300 {
		count++
		ctx := context.Background()
		var opts = &github.ListOptions{
			Page:    0,
			PerPage: 100,
		}
		runners, _, err := client.Actions.ListRunners(ctx, "REDACTED", "REDACTEDREPO", opts)
		if err != nil {
			t.Fatalf("failed to list repository runners %v", err)
		}
		for _, r := range runners.Runners {
			fmt.Println("API Response:" + *r.Name + "\n")
		}
		time.Sleep(400 * time.Millisecond)
	}
}

Running against my shell script...

until [ "" == "!" ]; do
    echo "### SSSTTTAAARRRTTT"
    date
    gh api \
      -H "Accept: application/vnd.github+json" \
      -H "X-GitHub-Api-Version: 2022-11-28" \
      --hostname REDACTED \
      -q '.runners[] | "\(.name, .status)"' \
      /repos/$org/$repo/actions/runners
    echo "### EEENNNDDD"
    sleep 2
done

and sure enough! I saw that the bash process was accurate but the client you are producing was reporting stale information.

So, I dug deeper! and found this...

func NewClient(token string) (*github.Client, error) {
	oauth2Transport := &oauth2.Transport{
		Source: oauth2.StaticTokenSource(
			&oauth2.Token{AccessToken: token},
		),
	}
	transport := &httpcache.Transport{
		Transport:           oauth2Transport,
		Cache:               httpCache,
		MarkCachedResponses: true,
	}

So, I decide, hey lets deactivate that and see what happens...

this

return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: transport})

became this

return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: oauth2Transport})

difference is, I'm not using the cache transport, I'm just using the outh2Transport now.

and when I put bash against the go test, they matched up! Good.

So I'm thinking is happening is the following...

Fast Job Race Condition
GHE -> MS (new job)
MS -> AWS (new EC2) 2024/05/13 14:02:50 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-0a3d5b43ed7e398ad)
MS -> GHE (poll) is not found in GitHub, will retry... 
EC2 -> GHE (register)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (schedules on EC2)
MS -> GHE (poll) is not found in GitHub, will retry... 
GHE -> GHE (unregisters worker)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (starts working on job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (finishes the job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
... 5 minutes of this ...
... myshoes now thiks the job never got picked up and tries to create a new runner ---
MS -> AWS (new EC2) 2024/05/13 14:07:56 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-03b972b4d9ec4ca14)
... if your shoes-aws implementation doesn't catch the return of "80813ad7-dbad-4f4f-aaba-147bbe4057fa" you will get a new ec2 instance that sits idle.
MS -> MS (polling through function sanitizeGitHubRunner) "{...} is idle and not running 6h0m0s, so not will delete ..."

So eventually, after paying for a runner for 6 hours this gets cleaned up.

@alfred-stokespace
Copy link
Contributor Author

As a mitigation for this, I'm going to do the following in my private fork.

I'm creating a new uncached client method


func NewUncachedClient(token string) (*github.Client, error) {
	oauth2Transport := &oauth2.Transport{
		Source: oauth2.StaticTokenSource(
			&oauth2.Token{AccessToken: token},
		),
	}

	if !config.Config.IsGHES() {
		return github.NewClient(&http.Client{Transport: oauth2Transport}), nil
	}

	return github.NewEnterpriseClient(config.Config.GitHubURL, config.Config.GitHubURL, &http.Client{Transport: oauth2Transport})
}

which will get called from starter.go's checkRegisteredRunner function

func (s *Starter) checkRegisteredRunner(ctx context.Context, runnerName string, target datastore.Target) error {
	client, err := gh.NewUncachedClient(target.GitHubToken)

which will then be used lower down ...

if _, err := gh.ExistGitHubRunner(cctx, client, owner, repo, runnerName); err == nil {
      // success to register runner to GitHub
      return nil
} else if !errors.Is(err, gh.ErrNotFound) {

and we'll see whether that blows everything up or not :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant