Migrate existing Google Cloud alerts from click-ops to git-ops model #1624

spiffxp · 2021-02-08T16:36:18Z

Discussed in k8s-infra meeting 2020-02-03

We have some Slack alerting set up today, but it was configured by humans clicking around in the Google Cloud console (aka "click-ops"). It would be ideal if we could drive that configuration automatically from files checked into git (aka "git-ops").

This is likely similar to, or overlaps with, making a gitops-driven workflow for Google Cloud Monitoring dashboards (#1376).

/wg k8s-infra
/sig release
/area release-eng
FYI @kubernetes/release-engineering since #k8s-infra-alerts contains container image promoter alerts
/priority important-longterm

spiffxp · 2021-02-08T17:18:44Z

/help

k8s-ci-robot · 2021-02-08T17:18:45Z

@spiffxp:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rikatz · 2021-02-17T16:48:34Z

I'm low on bandwidth right now, but if there's some time (this isn't urgent) I can look into how to manage the alerts and dashboards with GitOps :)

rikatz · 2021-02-17T16:48:38Z

/assign

rikatz · 2021-03-23T00:05:15Z

So far:

Pulumi needs some coding and also an account, so I've discarded it for now.
I made a simple program that creates an alert and also fetches it and the channel, so @hasheddan can look into how viable this is on Crossplane -> https://gist.github.com/rikatz/d9dc76d27b8371d590924508bd2be6c0
I will make some attempts with Terraform as well, tomorrow.

My thoughts on this specific part: I really like the idea of using Crossplane (k8s objects) to manage our cloud env, but I guess a lot of folks are already familiar with Terraform (although I agree with Justin that migrating between versions is sometimes...annoying...)

I will create some simple .tf files tomorrow with the same approach, trying to create notification channels and alert policies and seeing how this is reflected in Stackdriver.
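To make the "notification channels and alert policies" part concrete, here is a minimal Python sketch of what such definitions could look like as data kept in git. It only builds request bodies in the shape of the Cloud Monitoring API v3 `NotificationChannel` and `AlertPolicy` resources and does not call any API. The helper names, the display names, the channel resource name, and the threshold/aggregation values are all assumptions for illustration; they are not taken from the gist or from the existing alerts.

```python
import json

# Filter for the built-in uptime check metric (assumed target of the alert).
UPTIME_FILTER = (
    'metric.type="monitoring.googleapis.com/uptime_check/check_passed" '
    'AND resource.type="uptime_url"'
)


def slack_channel(display_name, channel_name):
    """Body for a Slack notification channel. The auth token is deliberately
    left out: it is a secret and should not live in git."""
    return {
        "type": "slack",
        "displayName": display_name,
        "labels": {"channel_name": channel_name},
        "enabled": True,
    }


def uptime_alert_policy(display_name, channel_resource_names, duration_s=300):
    """Body for a policy that fires when more than one checker location
    reports a failed uptime check for `duration_s` seconds."""
    return {
        "displayName": display_name,
        "combiner": "OR",
        "enabled": True,
        "conditions": [{
            "displayName": f"{display_name}: uptime check failing",
            "conditionThreshold": {
                "filter": UPTIME_FILTER,
                "comparison": "COMPARISON_GT",
                "thresholdValue": 1,
                "duration": f"{duration_s}s",
                "aggregations": [{
                    "alignmentPeriod": "1200s",
                    "perSeriesAligner": "ALIGN_NEXT_OLDER",
                    "crossSeriesReducer": "REDUCE_COUNT_FALSE",
                    "groupByFields": ["resource.label.*"],
                }],
            },
        }],
        "notificationChannels": list(channel_resource_names),
    }


# Hypothetical values: the project and channel ID below are placeholders.
channel = slack_channel("k8s-infra-alerts", "#k8s-infra-alerts")
policy = uptime_alert_policy(
    "example uptime alert",
    ["projects/example-project/notificationChannels/123"],
)
print(json.dumps(policy, indent=2, sort_keys=True))
```

Whatever tool ends up applying these (Terraform, Crossplane, or a small program), the point is that the reviewed artifact is a small, diffable definition rather than console state.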

rikatz · 2021-04-06T12:50:56Z

@ameukam will work on this, using @thockin's tests for monitoring certificate renewal and expiration as an example.

rikatz · 2021-04-06T19:33:25Z

#1877 <- Created a PR with a really simple Terraform configuration that adds an uptime check and the current alert policy.

We can improve this, for example by adding latency/uptime alerting (for cs.k8s.io and others).
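As a rough illustration of extending uptime checks to more hosts, the sketch below generates one check definition per host in the shape of the Cloud Monitoring API v3 `UptimeCheckConfig` resource. It is not the content of #1877. The function name, the `example-project` project ID, and the default period and timeout are assumptions; only `cs.k8s.io` comes from this thread.

```python
import json

# Check periods the API accepts, in seconds.
ALLOWED_PERIODS_S = (60, 300, 600, 900)


def uptime_check(host, project_id, path="/", period_s=300, timeout_s=10):
    """Body for an HTTPS uptime check against `host`."""
    if period_s not in ALLOWED_PERIODS_S:
        raise ValueError(f"period_s must be one of {ALLOWED_PERIODS_S}")
    return {
        "displayName": f"uptime-{host}",
        "monitoredResource": {
            "type": "uptime_url",
            "labels": {"host": host, "project_id": project_id},
        },
        "httpCheck": {
            "path": path,
            "port": 443,
            "useSsl": True,
            "validateSsl": True,
        },
        "period": f"{period_s}s",
        "timeout": f"{timeout_s}s",
    }


# The host list would live in git; the project ID is a placeholder.
HOSTS = ["cs.k8s.io"]
checks = [uptime_check(h, "example-project") for h in HOSTS]
print(json.dumps(checks, indent=2, sort_keys=True))
```

Adding a host then becomes a one-line change to the list, which is easy to review in a PR.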

fejta-bot · 2021-07-05T20:04:53Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

ameukam · 2021-07-05T20:44:03Z

/remove-lifecycle stale

spiffxp · 2021-07-16T18:24:48Z

A good first step would be understanding how to export whatever existing alerts we have as part of audit/audit-gcp.sh.
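One possible shape for that export step, sketched in Python under stated assumptions: the input is a JSON list of alert policies (for example, what a command along the lines of `gcloud alpha monitoring policies list --format=json` might print), and the output is one normalized file per policy so that diffs in the audit directory stay small. The function names and the sample policy are hypothetical, and audit/audit-gcp.sh itself is a shell script, so this would be a helper it calls rather than a description of how it works today.

```python
import json
import pathlib
import tempfile

# Server-maintained metadata that changes without anyone editing the policy.
VOLATILE_FIELDS = ("creationRecord", "mutationRecord")


def normalize_policy(policy):
    """Drop volatile fields so repeated exports of an unchanged policy match."""
    return {k: v for k, v in policy.items() if k not in VOLATILE_FIELDS}


def export_policies(policies, out_dir):
    """Write each policy to <out_dir>/<policy id>.json with stable formatting."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for policy in policies:
        # Resource names look like projects/<project>/alertPolicies/<id>.
        policy_id = policy["name"].rsplit("/", 1)[-1]
        path = out / f"{policy_id}.json"
        body = json.dumps(normalize_policy(policy), indent=2, sort_keys=True)
        path.write_text(body + "\n")
        written.append(path)
    return sorted(written)


# Hypothetical sample standing in for real exported data.
sample = [{
    "name": "projects/example-project/alertPolicies/456",
    "displayName": "example policy",
    "mutationRecord": {"mutateTime": "2021-07-16T00:00:00Z"},
}]
with tempfile.TemporaryDirectory() as tmp:
    files = export_policies(sample, tmp)
    exported = json.loads(files[0].read_text())
    print(files[0].name, sorted(exported))
```

Sorting keys and stripping the mutation metadata means a rerun of the audit only produces a diff when a policy was actually changed.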

spiffxp · 2021-08-17T21:00:47Z

https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/oss/terraform/modules/alerts good prior art to start from

spiffxp · 2021-08-17T21:02:18Z

/milestone v1.23
I think it would be really handy to use this at a bare minimum for uptime checks on the apps we run on aaa

ameukam · 2021-12-14T22:12:05Z

/milestone clear

k8s-triage-robot · 2022-03-14T22:12:18Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ameukam · 2022-03-15T05:15:13Z

/remove-lifecycle stale
/lifecycle frozen
/milestone clear

k8s-ci-robot added the help wanted label Feb 8, 2021

k8s-ci-robot assigned rikatz Feb 17, 2021

k8s-ci-robot added the lifecycle/stale label Jul 5, 2021

k8s-ci-robot removed the lifecycle/stale label Jul 5, 2021

spiffxp mentioned this issue Jul 19, 2021

cip-auditor: alerts are noisy #2364

Open

k8s-ci-robot added this to the v1.23 milestone Aug 17, 2021

k8s-ci-robot added sig/k8s-infra and removed wg/k8s-infra labels Sep 29, 2021

ameukam mentioned this issue Oct 8, 2021

[Umbrella issue] How we monitor k8s-infra ? #2588

Closed

k8s-ci-robot removed this from the v1.23 milestone Dec 14, 2021

k8s-ci-robot added the lifecycle/stale label Mar 14, 2022

k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Mar 15, 2022

riaankleinhans moved this to Migrate away issues in registry.k8s.io (SIG K8S Infra) Jan 4, 2023

riaankleinhans added this to registry.k8s.io (SIG K8S Infra) Jan 4, 2023
