Took quite a bit of digging to get a reliable test case that got me to a smoking gun...
First, the symptom....
For very short jobs, ones that only ran for 20 to 50 seconds (for example, a job that errors out almost immediately on a tag check), a runner would be created, great! The job would complete, good! But... I'd eventually end up with another runner replacing the first one, and when I checked the database I'd find a row in the jobs table for that same job, even though it had already run.
Once I turned debug on, I kept finding these over and over...
2024/05/13 15:30:36 [DEBUG] myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... (second: 70s)
and these kept being produced even though, watching with a gh api ... /repos/$org/$repo/actions/runners CLI script, I could see the runner get registered, come online, and then be removed.
Yet I still kept seeing that message saying myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry...
So I threw in a bunch of logging and finally narrowed it down to a minimal test around this call:
package gh // assumed: this test sits next to NewClient in the same package

import (
	"context"
	"fmt"
	"os"
	"testing"
	"time"

	"github.com/google/go-github/v50/github" // match the major version pinned in go.mod
	"github.com/whywaita/myshoes/pkg/config" // assumed repo path for the config package
)

func Test(t *testing.T) {
	config.Load()
	config.Config.Debug = true

	client, err := NewClient(os.Getenv("TEST_PAT"))
	if err != nil {
		t.Fatalf("%v", err)
	}

	// Poll ListRunners every 400ms for ~2 minutes and print what the client sees.
	count := 0
	for count < 300 {
		count++
		ctx := context.Background()
		opts := &github.ListOptions{
			Page:    0,
			PerPage: 100,
		}
		runners, _, err := client.Actions.ListRunners(ctx, "REDACTED", "REDACTEDREPO", opts)
		if err != nil {
			t.Fatalf("failed to list repository runners %v", err)
		}
		for _, r := range runners.Runners {
			fmt.Println("API Response:" + *r.Name + "\n")
		}
		time.Sleep(400 * time.Millisecond)
	}
}
Running against my shell script...
until [ "" == "!" ]; do
echo "### SSSTTTAAARRRTTT"
date
gh api \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
--hostname REDACTED \
-q '.runners[] | "\(.name, .status)"' \
/repos/$org/$repo/actions/runners
echo "### EEENNNDDD"
sleep 2
done
And sure enough! The bash loop was accurate, but the client you are producing was reporting stale information.
So I dug deeper and found the cache transport sitting in front of the GitHub client, and decided, hey, let's deactivate that and see what happens. The difference now is that I'm not using the cache transport, just the oauth2 transport.
And when I put bash up against the Go test again, they matched up! Good.
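For reference, a minimal sketch of that change using go-github and golang.org/x/oauth2. This is not the exact myshoes code: the package name, function name, and go-github version are placeholders, and the commented-out lines are my assumption that the cache transport is an in-memory, httpcache-style response cache.

package gh // placeholder package name

import (
	"context"

	"github.com/google/go-github/v50/github" // match the major version pinned in go.mod
	"golang.org/x/oauth2"
)

// newUncachedClient sketches the change: build the go-github client directly
// on the oauth2 transport instead of on a response-caching transport.
func newUncachedClient(token string) *github.Client {
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: token})

	// Before (assumed): something like
	//   cached := httpcache.NewMemoryCacheTransport()
	//   cached.Transport = &oauth2.Transport{Source: ts}
	//   httpClient := cached.Client()
	// which can serve ListRunners responses out of the in-memory cache.

	// After: every request goes straight to the GitHub API.
	httpClient := oauth2.NewClient(context.Background(), ts)
	return github.NewClient(httpClient)
}

If that is what's going on, a runner that only exists for a few tens of seconds can register and be removed entirely within the cached window, so the client never sees it at all.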
So what I think is happening is the following...
Fast Job Race Condition
GHE -> MS (new job)
MS -> AWS (new EC2) 2024/05/13 14:02:50 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-0a3d5b43ed7e398ad)
MS -> GHE (poll) is not found in GitHub, will retry...
EC2 -> GHE (register)
MS -> GHE (poll) is not found in GitHub, will retry...
GHE -> GHE (schedules on EC2)
MS -> GHE (poll) is not found in GitHub, will retry...
GHE -> GHE (unregisters worker)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (starts working on job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (finishes the job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
... 5 minutes of this ...
... myshoes now thinks the job never got picked up and tries to create a new runner ...
MS -> AWS (new EC2) 2024/05/13 14:07:56 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-03b972b4d9ec4ca14)
... if your shoes-aws implementation doesn't catch the repeated request for "80813ad7-dbad-4f4f-aaba-147bbe4057fa", you will get a new EC2 instance that sits idle.
MS -> MS (polling through function sanitizeGitHubRunner) "{...} is idle and not running 6h0m0s, so not will delete ..."
So eventually, after paying for an idle runner for 6 hours, it gets cleaned up.
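To make the "MS -> GHE (poll)" steps above concrete, here's a hypothetical sketch of the kind of name lookup the retry loop is doing. The function name and signature are mine, not myshoes', but the point holds regardless of the exact code: the check can only see whatever ListRunners returns, so a cached response taken before the runner registered (or after it was removed) makes a short-lived runner look like it never existed, and the retry loop runs out its ~5 minutes.

package gh // placeholder, as above

import (
	"context"

	"github.com/google/go-github/v50/github" // match the major version pinned in go.mod
)

// runnerExists illustrates the poll step: look for the runner by name in the
// repository's runner list. With a caching transport in the HTTP client, this
// can keep evaluating a stale snapshot of the list and never observe a runner
// that only lives for 20-50 seconds.
func runnerExists(ctx context.Context, client *github.Client, owner, repo, name string) (bool, error) {
	opts := &github.ListOptions{Page: 0, PerPage: 100}
	runners, _, err := client.Actions.ListRunners(ctx, owner, repo, opts)
	if err != nil {
		return false, err
	}
	for _, r := range runners.Runners {
		if r.GetName() == name {
			return true, nil
		}
	}
	return false, nil
}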