
fallback_ips: Gather facts from reachable hosts only #11006

Closed
Rickkwa wants to merge 169 commits

Conversation

Rickkwa
Contributor

@Rickkwa Rickkwa commented Mar 13, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

When working with a large fleet of nodes, some nodes will inevitably be unreachable. The linked issue describes how, when kubespray-defaults runs and an unreachable node exists in the inventory, the play unexpectedly exits early.

This PR fixes fallback_ips.yml so that the rest of the role can finish when unreachable node(s) exist.
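The approach is to probe connectivity first, then gather facts only from hosts that answered. A minimal sketch of the two tasks (the `register` name and the exact `loop` filter expression are my assumptions, not the literal diff):

```yaml
# Sketch only: probe hosts first, then gather facts from the reachable subset.
- name: Determine reachable hosts
  ping:
  delegate_to: "{{ item }}"
  loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
  run_once: yes
  register: fallback_ips_ping        # assumed variable name
  ignore_unreachable: true
  ignore_errors: true                # keep the play alive past the aggregate loop failure

- name: Gather ansible_default_ipv4 from reachable hosts
  setup:
    gather_subset: '!all,network'
    filter: "ansible_default_ipv4"
  delegate_to: "{{ item }}"
  delegate_facts: yes
  when: hostvars[item].ansible_default_ipv4 is not defined
  # Reachable results lack the "unreachable" key, so reject items that have it.
  loop: "{{ fallback_ips_ping.results | rejectattr('unreachable', 'defined') | map(attribute='item') | list }}"
  run_once: yes
```

With this split, the `setup` loop never touches a dead host, so its gathered facts persist and only the genuinely unreachable hosts fall back to 127.0.0.1.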

Which issue(s) this PR fixes:

Fixes #10993

Special notes for your reviewer:

Local setup (inventory and test playbook):

Inventory:
[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

[kube_control_plane]
k8s1.local

[etcd]
k8s1.local

[kube_node]
k8s3.local  # unreachable
k8s2.local

[calico_rr]

Playbook:

---
- name: Prepare nodes for upgrade
  hosts: k8s_cluster:etcd:calico_rr
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }

  post_tasks:
    - name: Task that runs after
      debug:
        var: fallback_ips

Showing the original output before applying my PR:

PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************************************

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1

After my PR:

PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Determine reachable hosts] ***********************************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s1.local", "ping": "pong"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s2.local", "ping": "pong"}]}
...ignoring

TASK [kubespray-defaults : Gather ansible_default_ipv4 from reachable hosts] ************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)

TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]

TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s2.local]
ok: [k8s3.local]

TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=5    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0

Also of note: I originally tried fixing this by just adding ignore_errors: true:

# roles/kubespray-defaults/tasks/fallback_ips.yml
 ---
 # Set 127.0.0.1 as fallback IP if we do not have host facts for host
 # ansible_default_ipv4 isn't what you think.
 # Thanks https://medium.com/opsops/ansible-default-ipv4-is-not-what-you-think-edb8ab154b10

 - name: Gather ansible_default_ipv4 from all hosts
   setup:
     gather_subset: '!all,network'
     filter: "ansible_default_ipv4"
   delegate_to: "{{ item }}"
   delegate_facts: yes
   when: hostvars[item].ansible_default_ipv4 is not defined
   loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
   run_once: yes
   ignore_unreachable: true
+  ignore_errors: true
   tags: always

 - name: Create fallback_ips_base
   set_fact:
     fallback_ips_base: |
       ---
       {% for item in (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique %}
-      {% set found = hostvars[item].get('ansible_default_ipv4') %}
+      {% set found = hostvars[item].get('ansible_default_ipv4', {}) %}
       {{ item }}: "{{ found.get('address', '127.0.0.1') }}"
       {% endfor %}
   delegate_to: localhost
   connection: local
   delegate_facts: yes
   become: no
   run_once: yes

 - name: Set fallback_ips
   set_fact:
     fallback_ips: "{{ hostvars.localhost.fallback_ips_base | from_yaml }}"
PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring

TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]

TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s3.local]
ok: [k8s2.local]

TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=4    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0

But as you can see, while this gets past the early exit, the discovered facts are not saved, so every host falls back to 127.0.0.1 and the output is wrong. That's why I decided to first filter out the unreachable hosts in a separate task before running setup.
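The defaulting behaviour of the Create fallback_ips_base Jinja template can be sketched in plain Python (the helper name and the stand-in `hostvars` dict are mine, not from the role):

```python
# Sketch of the fallback_ips_base defaulting logic: any host whose
# ansible_default_ipv4 fact is missing gets 127.0.0.1, mirroring
# found.get('address', '127.0.0.1') in the Jinja template.
def build_fallback_ips(hosts, hostvars):
    """Map each host to its discovered address, or 127.0.0.1 if facts are missing."""
    fallback = {}
    for host in hosts:
        found = hostvars.get(host, {}).get("ansible_default_ipv4") or {}
        fallback[host] = found.get("address", "127.0.0.1")
    return fallback

hostvars = {
    "k8s1.local": {"ansible_default_ipv4": {"address": "10.88.111.29"}},
    "k8s2.local": {"ansible_default_ipv4": {"address": "10.88.111.30"}},
    "k8s3.local": {},  # unreachable: no facts gathered
}
print(build_fallback_ips(["k8s1.local", "k8s2.local", "k8s3.local"], hostvars))
# → {'k8s1.local': '10.88.111.29', 'k8s2.local': '10.88.111.30', 'k8s3.local': '127.0.0.1'}
```

This also shows why the ignore_errors attempt produced all-127.0.0.1 output: when setup runs against an unreachable host mid-loop, none of the delegated facts survive, so every lookup takes the default branch.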

Does this PR introduce a user-facing change?:

None

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2024
@k8s-ci-robot
Contributor

Hi @Rickkwa. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 13, 2024
@cyclinder
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2024
@Rickkwa
Contributor Author

Rickkwa commented Apr 12, 2024

Do you need any more info from me to move this forward? I'm hoping this can get into the 2.25 release, whenever that is.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Rickkwa
Once this PR has been reviewed and has the lgtm label, please assign mzaian for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

VannTen and others added 16 commits June 26, 2024 18:42
* upgrade ansible version

Needed for with_first_found to work correctly:
ansible/ansible#70772 fixed in 2.16

* Remove unused google cloud cloud_playbook

* Fix dpkg_selection on non-existing packages

Needed since ansible-core>2.16, see:
ansible/ansible@f10d11b
Should make it easier to understand what's going on when testing locally
and in CI.
* Remove leftover files for Coreos

Coreos was replaced by flatcar in 058438a but the file was copied
instead of moved.

* Remove workarounds for resolved ansible issues

* bootstrap: Use first_found to include per distro

Using ID and VARIANT_ID directly with first_found allows for fewer manual
includes.
Distro "families" are simply handled by symlinks.

* bootstrap: don't set ansible_python_interpreter

- Allows users to override the chosen python_interpreter with group_vars
  easily (group_vars have lesser precedence than facts)
- Allows us to use vars at the task scope to use a virtual env

Ansible python discovery has improved, so those workarounds should not
be necessary anymore.
Special workaround for Flatcar, due to upstream ansible not willing to
support it.
* Move fedora ansible python install to bootstrap-os

* /bin/dir is set in bootstrap-os

* Removing ansible_os_family workarounds

Support for these distributions was merged in Ansible, no need to
override it ourselves now.
ansible/ansible#69324 openEuler
ansible/ansible#77275 UnionTech OS Server 20
ansible/ansible#78232 Kylin

* Don't unconditionally set VARIANT_ID=coreos in os-release

WTF, this is so wrong.
Furthermore, is_fedora_coreos is already handled in bootstrap-os

* Handle Clearlinux generically

Followup of 4eec302 (since we're using
package module anyway, let's get rid of the custom task)
ant31 and others added 20 commits June 26, 2024 18:42
kubernetes-sigs#11169)

* add the ability to configure extra args to the different cinder-csi-plugin containers

* endfor block added to be syntactically correct jinja
The pull request adding the pre-commit hook config was merged.
@Rickkwa Rickkwa force-pushed the 10993-fallback-ips branch from f8c42aa to 499b395 on June 26, 2024 22:44
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jun 26, 2024
@k8s-ci-robot
Contributor

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed do-not-merge/contains-merge-commits size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 26, 2024
@Rickkwa
Contributor Author

Rickkwa commented Jun 26, 2024

Gah, I'm bad at rebasing. Let me re-open a new one.

@Rickkwa Rickkwa closed this Jun 26, 2024
@k8s-ci-robot
Contributor

@Rickkwa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubespray-yamllint 499b395 link true /test pull-kubespray-yamllint

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 26, 2024

Successfully merging this pull request may close these issues.

fallback_ips.yml exits early when there is an unreachable host in the inventory