Self hosted runner occasionally crashes on "SessionConflictException" after multiple "A session for this runner already exists" #3624

liriID · 2024-12-15T12:59:33Z

Describe the bug
We are using self-hosted runners deployed on EKS, and we recently started experiencing repeated crashes of our runners. This doesn't seem to be related to any change other than the previously long-running containers having been restarted.

Our Dockerfile:

FROM debian:bookworm

ARG RUNNER_VERSION

ENV GITHUB_PERSONAL_TOKEN=""

RUN apt-get update && // ... install lots of background dependencies and packages

USER github
WORKDIR /home/github

RUN curl -O -L https://github.com/actions/runner/releases/download/v$RUNNER_VERSION/actions-runner-linux-x64-$RUNNER_VERSION.tar.gz && \
    tar xzf ./actions-runner-linux-x64-$RUNNER_VERSION.tar.gz && \
    sudo ./bin/installdependencies.sh

COPY --chown=github:github entrypoint.sh ./entrypoint.sh
RUN sudo chmod u+x ./entrypoint.sh

ENTRYPOINT ["/home/github/entrypoint.sh"]

With entrypoint.sh being:

#!/bin/sh

registration_url="https://api.github.com/orgs/acme/actions/runners/registration-token"
echo "Requesting registration URL at '${registration_url}'"

payload=$(curl --http1.1 -sX POST -H "Authorization: token ${GITHUB_PERSONAL_TOKEN}" "$registration_url")
export RUNNER_TOKEN=$(echo $payload | jq .token --raw-output)

./config.sh \
    --name $(hostname) \
    --token ${RUNNER_TOKEN} \
    --url https://github.com/acme \
    --unattended \
    --replace

remove() {
    ./config.sh remove --unattended --token "${RUNNER_TOKEN}"
}

trap 'remove; exit 130' INT
trap 'remove; exit 143' TERM

./run.sh "$*" &

wait $!

To Reproduce
After running for some amount of time, the runner container will crash and restart. Logs typically look similar to the following:

Requesting registration URL at 'https://api.github.com/orgs/acme/actions/runners/registration-token'

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration




A runner exists with the same name
√ Successfully replaced the runner
√ Runner connection is good

# Runner settings


√ Settings Saved.


√ Connected to GitHub

A session for this runner already exists.
2024-12-14 22:04:06Z: Runner connect error: The actions runner runner-5969d445d5-7vx8v already has an active session.. Retrying until reconnected.

√ Connected to GitHub

A session for this runner already exists.

√ Connected to GitHub

A session for this runner already exists.

√ Connected to GitHub

A session for this runner already exists.

√ Connected to GitHub

A session for this runner already exists.

√ Connected to GitHub

A session for this runner already exists.

√ Connected to GitHub

A session for this runner already exists.

√ Connected to GitHub

A session for this runner already exists.
Stop retry on SessionConflictException after retried for 240 seconds.
Failed to create session. The actions runner runner-5969d445d5-7vx8v already has an active session.
Runner listener exit with Session Conflict error, stop the service, no retry needed.
Exiting runner...

Expected behavior
Runner should not crashloop.

Runner Version and Platform

Runner version 2.320.0, running on debian:bookworm based Docker, on top of EKS version v1.30.6-eks-7f9249a.

Runner and Worker's Diagnostic Logs

We don't have diagnostic logs of crashed runners because this information is not persisted. If relevant, we could persist it for debugging.
As far as I can see in running containers, nothing besides INFO logs appears.
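
If persisting the logs turns out to be useful, one possible approach is sketched below. It assumes the runner writes its diagnostic files to the `_diag` directory under the runner root, and that some persistent volume is mounted in the container. The function name `persist_diag`, the variable `DIAG_ARCHIVE_DIR`, and the default path `/var/log/runner-diag` are hypothetical and not part of the runner.

```shell
#!/bin/sh
# Sketch only: copy the runner's _diag directory to a persistent location.
# persist_diag, DIAG_ARCHIVE_DIR and /var/log/runner-diag are hypothetical
# names chosen for illustration.

persist_diag() {
    # $1: source directory (defaults to ./_diag in the runner root)
    # $2: archive root (defaults to $DIAG_ARCHIVE_DIR, else /var/log/runner-diag)
    src="${1:-./_diag}"
    dest_root="${2:-${DIAG_ARCHIVE_DIR:-/var/log/runner-diag}}"

    # Nothing to archive if the runner never produced diagnostic logs.
    [ -d "$src" ] || return 0

    # One subdirectory per container and timestamp, so restarts don't overwrite each other.
    dest="${dest_root}/$(hostname)-$(date -u +%Y%m%dT%H%M%SZ)"
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}
```

In the entrypoint above, `persist_diag` could be called right after `wait $!` and inside the `INT`/`TERM` handlers, so the logs are copied whether the runner exits on its own or is stopped.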


ChristopherHX · 2024-12-17T16:01:14Z

Some unofficial hints to improve reliability of your container.

# Don't use a runner name that persists across restarts,
# e.g. don't use the hostname alone; append a UUID.
# A crashed runner doesn't close its session.
# The session stays blocked until the Actions service cleans it up.
# This takes more than 5 minutes; an instant container restart with the same runner name might reset the timer!
# I assume `--replace` doesn't kill a stale session of a given runner name.
./config.sh \
    --name "$(hostname)-$(uuidgen)" \
    --token "${RUNNER_TOKEN}" \
    --url https://github.com/acme \
    --unattended \
    --replace

remove() {
    # You need to regenerate RUNNER_TOKEN if it is about 1 hour old, or you get permission denied.
    # At least the configure command has a `--pat` option that lets the runner make the REST API call itself, but caching this token saves rate-limited API calls if it is reused for removal in less than 1 hour.
    ./config.sh remove --unattended --token "${RUNNER_TOKEN}"
}
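
The two hints above could be combined roughly as follows. This is a minimal sketch, not a tested fix: `ORG`, `unique_runner_name`, and `fetch_runner_token` are hypothetical names, and it assumes `curl` and `jq` are available in the image as in the original entrypoint. It requests a fresh token from the `remove-token` endpoint of the GitHub REST API at removal time instead of reusing the token obtained at startup.

```shell
#!/bin/sh
# Sketch only: ORG and the helper function names are illustrative assumptions.

ORG="acme"

# Build a runner name that differs on every container start.
unique_runner_name() {
    if command -v uuidgen >/dev/null 2>&1; then
        suffix=$(uuidgen)
    elif [ -r /proc/sys/kernel/random/uuid ]; then
        # Linux fallback when uuidgen is not installed.
        suffix=$(cat /proc/sys/kernel/random/uuid)
    else
        # Last resort: only unique per second and PID.
        suffix="$(date +%s)-$$"
    fi
    printf '%s-%s\n' "$(hostname)" "$suffix"
}

# Request a short-lived token; $1 is "registration-token" or "remove-token".
fetch_runner_token() {
    curl --http1.1 -fsS -X POST \
        -H "Authorization: token ${GITHUB_PERSONAL_TOKEN}" \
        "https://api.github.com/orgs/${ORG}/actions/runners/$1" | jq -r .token
}

remove() {
    # Fetch a new token at removal time, so the age of the startup token doesn't matter.
    ./config.sh remove --unattended --token "$(fetch_runner_token remove-token)"
}
```

With these helpers, the configure step would pass `--name "$(unique_runner_name)"` and `--token "$(fetch_runner_token registration-token)"`. The trade-off is one extra API call per shutdown, in exchange for not depending on a token that may have expired.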

liriID added the "bug" (Something isn't working) label on Dec 15, 2024
