Took quite a bit of digging to get a reliable test case that got me to a smoking gun...
First, the symptom....
For very short jobs, ones that only ran for 20 to 50 seconds (for example, a job that errors out almost immediately on a tag check), a runner would be created, great! The job would complete, good! But... I'd eventually end up with another runner replacing the first one, and when I checked the database I'd find a row in the jobs table for that same job, even though it had already run.
Once I turned debug on, I kept finding these over and over...
2024/05/13 15:30:36 [DEBUG] myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry... (second: 70s)
and these kept being produced even though, watching with a gh api ... /repos/$org/$repo/actions/runners CLI script, I could see the runner get registered, come online, and then be removed.
Yet I still kept seeing that message saying myshoes-e06dbeee-2566-46af-86f1-6ea1f933aca3 is not found in GitHub, will retry...
So I threw in a bunch of logging and finally narrowed it down to a minimal test around this call:
package gh // assumed: this test sits next to NewClient in the same package

import (
	"context"
	"fmt"
	"os"
	"testing"
	"time"

	"github.com/google/go-github/v50/github" // match the major version pinned in go.mod
	"github.com/whywaita/myshoes/pkg/config" // assumed repo path for the config package
)

func Test(t *testing.T) {
	config.Load()
	config.Config.Debug = true

	client, err := NewClient(os.Getenv("TEST_PAT"))
	if err != nil {
		t.Fatalf("%v", err)
	}

	// Poll ListRunners every 400ms for ~2 minutes and print what the client sees.
	count := 0
	for count < 300 {
		count++
		ctx := context.Background()
		opts := &github.ListOptions{
			Page:    0,
			PerPage: 100,
		}
		runners, _, err := client.Actions.ListRunners(ctx, "REDACTED", "REDACTEDREPO", opts)
		if err != nil {
			t.Fatalf("failed to list repository runners %v", err)
		}
		for _, r := range runners.Runners {
			fmt.Println("API Response:" + *r.Name + "\n")
		}
		time.Sleep(400 * time.Millisecond)
	}
}
Running against my shell script...
until [ "" == "!" ]; do
echo "### SSSTTTAAARRRTTT"
date
gh api \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
--hostname REDACTED \
-q '.runners[] | "\(.name, .status)"' \
/repos/$org/$repo/actions/runners
echo "### EEENNNDDD"
sleep 2
done
And sure enough! The bash loop was accurate, but the client you are producing was reporting stale information.
So I dug deeper and found the cache transport sitting in front of the GitHub client, and decided, hey, let's deactivate that and see what happens. The difference now is that I'm not using the cache transport, just the oauth2 transport.
And when I put bash up against the Go test again, they matched up! Good.
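For reference, a minimal sketch of that change using go-github and golang.org/x/oauth2. This is not the exact myshoes code: the package name, function name, and go-github version are placeholders, and the commented-out lines are my assumption that the cache transport is an in-memory, httpcache-style response cache.

package gh // placeholder package name

import (
	"context"

	"github.com/google/go-github/v50/github" // match the major version pinned in go.mod
	"golang.org/x/oauth2"
)

// newUncachedClient sketches the change: build the go-github client directly
// on the oauth2 transport instead of on a response-caching transport.
func newUncachedClient(token string) *github.Client {
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: token})

	// Before (assumed): something like
	//   cached := httpcache.NewMemoryCacheTransport()
	//   cached.Transport = &oauth2.Transport{Source: ts}
	//   httpClient := cached.Client()
	// which can serve ListRunners responses out of the in-memory cache.

	// After: every request goes straight to the GitHub API.
	httpClient := oauth2.NewClient(context.Background(), ts)
	return github.NewClient(httpClient)
}

If that is what's going on, a runner that only exists for a few tens of seconds can register and be removed entirely within the cached window, so the client never sees it at all.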
So what I think is happening is the following...
Fast Job Race Condition
GHE -> MS (new job)
MS -> AWS (new EC2) 2024/05/13 14:02:50 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-0a3d5b43ed7e398ad)
MS -> GHE (poll) is not found in GitHub, will retry...
EC2 -> GHE (register)
MS -> GHE (poll) is not found in GitHub, will retry...
GHE -> GHE (schedules on EC2)
MS -> GHE (poll) is not found in GitHub, will retry...
GHE -> GHE (unregisters worker)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (starts working on job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
EC2 -> GHE (finishes the job)
MS -> GHE (poll) is not found in GitHub, will retry... (second: {increases})
... 5 minutes of this ...
... myshoes now thinks the job never got picked up and tries to create a new runner ...
MS -> AWS (new EC2) 2024/05/13 14:07:56 instance create successfully! (job: 80813ad7-dbad-4f4f-aaba-147bbe4057fa, cloud ID: i-03b972b4d9ec4ca14)
... if your shoes-aws implementation doesn't catch the repeated request for "80813ad7-dbad-4f4f-aaba-147bbe4057fa", you will get a new EC2 instance that sits idle.
MS -> MS (polling through function sanitizeGitHubRunner) "{...} is idle and not running 6h0m0s, so not will delete ..."
So eventually, after paying for an idle runner for 6 hours, it gets cleaned up.
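To make the "MS -> GHE (poll)" steps above concrete, here's a hypothetical sketch of the kind of name lookup the retry loop is doing. The function name and signature are mine, not myshoes', but the point holds regardless of the exact code: the check can only see whatever ListRunners returns, so a cached response taken before the runner registered (or after it was removed) makes a short-lived runner look like it never existed, and the retry loop runs out its ~5 minutes.

package gh // placeholder, as above

import (
	"context"

	"github.com/google/go-github/v50/github" // match the major version pinned in go.mod
)

// runnerExists illustrates the poll step: look for the runner by name in the
// repository's runner list. With a caching transport in the HTTP client, this
// can keep evaluating a stale snapshot of the list and never observe a runner
// that only lives for 20-50 seconds.
func runnerExists(ctx context.Context, client *github.Client, owner, repo, name string) (bool, error) {
	opts := &github.ListOptions{Page: 0, PerPage: 100}
	runners, _, err := client.Actions.ListRunners(ctx, owner, repo, opts)
	if err != nil {
		return false, err
	}
	for _, r := range runners.Runners {
		if r.GetName() == name {
			return true, nil
		}
	}
	return false, nil
}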