
Worker logs show the same jobId for different workflows/jobs, causing runner registration to fail and the job to become orphaned #3621

Open
NorseGaud opened this issue Dec 12, 2024 · 1 comment
Labels
bug Something isn't working

Comments


NorseGaud commented Dec 12, 2024

Describe the bug

We install the latest runner into a macOS VM and register it at the org level so that jobs in any repo under the org can use it. This works 99% of the time, but occasionally the actions-runner inside the VM doesn't register fully and the job runs indefinitely.
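For context on the registration flow: anklet generates a fresh org-level registration token for every VM before running config.sh. A minimal sketch of that step (the curl/jq invocation and GITHUB_TOKEN variable here are illustrative, not the exact anklet code):

#!/usr/bin/env bash
# Illustrative: request a fresh org-level registration token for one VM.
# GITHUB_TOKEN is a PAT with admin:org scope; ORG matches our org.
ORG="veertuinc"
REG_TOKEN=$(curl -sf -X POST \
  -H "Authorization: Bearer ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/orgs/${ORG}/actions/runners/registration-token" | jq -r '.token')
echo "${REG_TOKEN}"  # passed into the VM and on to config.sh --token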

The runner logs show

[2024-12-12 16:30:47Z INFO ProcessInvokerWrapper] Finished process 729 with exit code 102, and elapsed time 00:00:11.4918642.
[2024-12-12 16:30:47Z INFO JobDispatcher] Worker finished for job a4127350-4039-54ee-311c-0227a78ac3cb. Code: 102
[2024-12-12 16:30:47Z INFO JobDispatcher] finish job request for job a4127350-4039-54ee-311c-0227a78ac3cb with result: Failed
[2024-12-12 16:30:47Z INFO Terminal] WRITE LINE: 2024-12-12 16:30:47Z: Job testJob completed with result: Failed
[2024-12-12 16:30:47Z INFO JobDispatcher] Stop renew job request for job a4127350-4039-54ee-311c-0227a78ac3cb.
[2024-12-12 16:30:47Z INFO JobDispatcher] job renew has been cancelled, stop renew job request 7662.
[2024-12-12 16:30:47Z INFO JobNotification] Entering JobCompleted Notification
[2024-12-12 16:30:47Z INFO JobNotification] Entering EndMonitor
[2024-12-12 16:30:47Z INFO MessageListener] Received job status event. JobState: Online
[2024-12-12 16:30:47Z INFO JobDispatcher] Fire signal for one time used runner.
[2024-12-12 16:30:47Z INFO Runner] Job has finished at backend, the runner will exit since it is running under onetime use mode.
[2024-12-12 16:30:47Z INFO Runner] Stop message queue looping.
[2024-12-12 16:30:47Z WARN GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=90e53455-5052-4e98-8c94-92d824f3f5f0&status=Busy&runnerVersion=2.321.0&os=macOS&architecture=ARM64&disableUpdate=true has been cancelled.
[2024-12-12 16:30:47Z ERR  BrokerServer] Catch exception during request
[2024-12-12 16:30:47Z ERR  BrokerServer] System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
 ---> System.Net.Sockets.SocketException (89): Operation canceled
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
   at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken, Int32 estimatedSize)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.AuthenticationHelper.SendWithNtAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean isProxyAuth, HttpConnection connection, HttpConnectionPool connectionPool, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at GitHub.Services.Common.RawHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync[T](HttpRequestMessage message, Boolean readErrorBody, Object userState, CancellationToken cancellationToken)
   at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Uri requestUri, HttpContent content, IEnumerable`1 queryParameters, Boolean readErrorBody, Object userState, CancellationToken cancellationToken)
   at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)
   at GitHub.Runner.Common.BrokerServer.<>c__DisplayClass7_0.<<GetRunnerMessageAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at GitHub.Runner.Common.RunnerService.RetryRequest[T](Func`1 func, CancellationToken cancellationToken, Int32 maxRetryAttemptsCount, Func`2 shouldRetry)

Per https://github.com/orgs/community/discussions/146348#discussioncomment-11546552, I noticed that the actions-runner Worker logs, even for unrelated jobs, all show the same jobId being used:

Every 2.0s: grep -r '"jobId":' /tmp/anklet-vm-*                                        Nathans-MacBook-Pro.local: Thu Dec 12 10:34:08 2024

/tmp/anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a/Worker_20241212-163036-utc.log:  "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-46d212e2-48c1-4c16-9bd3-24af3c0420d4/Worker_20241212-163225-utc.log:  "jobId": "b5f34950-b701-5746-fee6-8f2a0b68b3f1",
/tmp/anklet-vm-9e1ab1c2-3601-4ca8-ab5e-77f3fd2350f6/Worker_20241212-163314-utc.log:  "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-ad54e62a-054f-484f-8053-1c15a7d7a4c8/Worker_20241212-163344-utc.log:  "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84/Worker_20241212-163041-utc.log:  "jobId": "a0de1d4e-2247-5d6c-d7e9-3e3aaa8d4597",
/tmp/anklet-vm-e7b1e26a-947b-469a-a305-934aa0cb5a85/Worker_20241212-163149-utc.log:  "jobId": "23d0dcb9-468e-5c44-e20e-7d90f1a0de55",

It looks like the jobId is unique for a while, then starts being reused. You can see that a4127350-4039-54ee-311c-0227a78ac3cb is reused above.
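To see how widespread the reuse is, a quick one-liner over the same Worker logs (paths match the grep above; adjust the glob as needed) that counts how many logs share each jobId:

# Any count greater than 1 means the jobId was handed to more than one runner.
grep -rh '"jobId":' /tmp/anklet-vm-*/Worker_*.log \
  | sed -E 's/.*"jobId": *"([^"]+)".*/\1/' \
  | sort | uniq -c | sort -rn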

I configure the runners inside the VMs as follows. I made sure the token is unique for each runner, as are the name, labels, etc.

./config.sh --name anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a --token XXXXX --url https://github.com/veertuinc --no-default-labels --labels anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git,unique-id:12300647430-2 --disableupdate --unattended --ephemeral
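The register-runner.bash that shows up in the anklet log lines below is just a thin wrapper around config.sh and run.sh inside the VM; roughly (a simplified sketch with the argument order assumed from those log lines, not the exact script):

#!/usr/bin/env bash
# Simplified sketch of register-runner.bash:
#   $1 = runner name, $2 = registration token, $3 = org URL, $4 = comma-separated labels
set -eo pipefail
NAME="$1"; TOKEN="$2"; URL="$3"; LABELS="$4"
cd "$HOME/actions-runner"
./config.sh --name "${NAME}" --token "${TOKEN}" --url "${URL}" \
  --no-default-labels --labels "${LABELS}" \
  --disableupdate --unattended --ephemeral
./run.sh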

Here are the two jobs that are using the same jobId

THE JOB THAT NEVER GOT REGISTERED
{
  "time": "2024-12-12T10:30:15.555763-06:00",
  "level": "DEBUG",
  "msg": "executing",
  "command": "anka -j run anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a bash -c ./register-runner.bash anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a BJ7F3UI2A5Z3EWK5B3R... https://github.com/veertuinc anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git,unique-id:12300647430-2 ",
  "ankletVersion": "dev",
  "pluginName": "RUNNER1",
  "plugin": "github",
  "owner": "veertuinc",
  "repo": "anklet",
  "workflowJobID": 34329323592,
  "workflowJobName": "t1-with-tag-1-matrix-nodes-2 (2)",
  "workflowJobRunID": 12300647430,
  "workflowName": "t1-with-tag-1-matrix-nodes-2",
  "jobURL": "https://github.com/veertuinc/anklet/actions/runs/12300647430/job/34329323592",
  "ankaTemplate": "d792c6f6-198c-470f-9526-9c998efe7ab4",
  "ankaTemplateTag": "vanilla+port-forward-22+brew-git",
  "vmName": "anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a"
}
THE NEXT REGISTRATION THAT HAD THE SAME jobId (and worked)
{
  "time": "2024-12-12T10:30:20.453382-06:00",
  "level": "DEBUG",
  "msg": "executing",
  "command": "anka -j run anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84 bash -c ./register-runner.bash anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84 BJ7F3UMDJHL2MEQV6... https://github.com/veertuinc anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git ",
  "ankletVersion": "dev",
  "pluginName": "RUNNER2",
  "plugin": "github",
  "owner": "veertuinc",
  "repo": "anklet",
  "workflowJobID": 34329321785,
  "workflowJobName": "testJob",
  "workflowJobRunID": 12300647002,
  "workflowName": "t1-failure-tag-1-in-vm",
  "jobURL": "https://github.com/veertuinc/anklet/actions/runs/12300647002/job/34329321785",
  "ankaTemplate": "d792c6f6-198c-470f-9526-9c998efe7ab4",
  "ankaTemplateTag": "vanilla+port-forward-22+brew-git",
  "vmName": "anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84"
}

Here are the _diags for those runners

anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84.zip
anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a.zip

I'm really unsure where to go with this one. Why is the jobId being reused? Is this something I can control, or is it a bug in the actions-runner?

To Reproduce

Run a large number of jobs in the same repo with ephemeral self-hosted runners that are registered only after the queued webhook for each job arrives.
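A rough way to generate that load (hypothetical; assumes the gh CLI and a workflow in the repo that can be dispatched):

# Kick off many runs back-to-back so several ephemeral runners register within seconds of each other.
# The workflow file name and repo are examples only.
for i in $(seq 1 20); do
  gh workflow run t1-with-tag-1-matrix-nodes-2.yml --repo veertuinc/anklet
done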

Runner Version and Platform

v2.321.0
macOS

@NorseGaud
Author

bump
