[Feature]: <Operator/Administrator> can understand the health of Korifi components as well as have custom metrics for CF-specific CRDs #3665

vipinvkmenon · 2024-12-16T00:08:55Z

Blockers/Dependencies

Currently, no metrics are exposed by the Korifi API. Some metrics are exposed by the controllers, but not by the Korifi API pod. I did not see a /metrics endpoint for the Korifi API. I could be wrong and be missing it; if so, could you point me to the port and endpoint? :)

While I agree that the CF-Korifi architecture is completely different from CF on BOSH, there would be custom metrics (just like in a CF BOSH deployment) that overlap, for example total LRPs (equivalent to total pods), jobs, etc.
It would be ideal for these to be converted to CF-specific metrics, rather than getting them directly from the Kube controller. Some metrics are

Background

As a CF-Operator
I want custom metrics that are specific to CF, rather than manually mapping or using all the generic Kubernetes metrics
So that I can understand the overall health of my CF as a platform.

Acceptance Criteria

GIVEN Korifi Deployment
WHEN I query /metrics of the Korifi API Pod
THEN I see the custom metrics that are specific to CF.

Dev Notes

No response


danail-branekov · 2024-12-17T09:30:32Z

Hi @vipinvkmenon

We believe that in the k8s world there are solutions (such as OpenTelemetry) that would be far superior to, and more flexible than, whatever we could come up with in Korifi. That is why we have always considered observability and telemetry out of scope for Korifi.

Of course, Korifi should implement metrics endpoints as defined by the CF API (such as getting process stats), but anything outside of the specification should probably be achieved via superior, k8s-native tools.

We are open to a discussion, of course. If you are willing to spend some time on this yourself, you could come up with a proposal, and why not PRs. You could also consider building a separate component that provides the metrics you find useful, and if you decide to open-source it, the community could benefit from your work.

What do you think?

cc @georgethebeatle @zabanov-lab

vipinvkmenon · 2024-12-17T13:18:06Z

> We believe that in the k8s world there are solutions (such as OpenTelemetry) that would be far superior to, and more flexible than, whatever we could come up with in Korifi. That is why we have always considered observability and telemetry out of scope for Korifi.

I completely agree with this and there is no confusion or question on that aspect.

> Of course, Korifi should implement metrics endpoints as defined by the CF API (such as getting process stats), but anything outside of the specification should probably be achieved via superior, k8s-native tools.

Exactly. The Korifi API needs a metrics endpoint that gives specific metrics like the CF API does. I believe many of the metrics emitted by the Cloud Controller, for example, would make sense in korifi-api as well.

Routing metrics are another such example. These could most likely be mapped to the metrics coming from Contour and Envoy, but there would be custom metrics, for example route_registration_latency, which would probably need to be generated.

Many of these metrics are already available in the metrics server, and probably in Envoy, under their own terms and conventions. So another aspect will probably be to map them to the equivalent metrics that operators are used to seeing in a traditional CF deployment.

chombium · 2024-12-17T13:43:11Z

> Many of these metrics are already available in the metrics server, and probably in Envoy, under their own terms and conventions. So another aspect will probably be to map them to the equivalent metrics that operators are used to seeing in a traditional CF deployment.

I don't think we need everything that cf-on-VMs users are used to, but we do need a monitoring and operations guide for Korifi. I guess most of what we need is buried somewhere deep down in Kubernetes, but we need to describe it and give it context and meaning. That way, when we have proper documentation of what the metrics mean for the Korifi components, we can talk about monitoring and operational procedures.

danail-branekov · 2024-12-17T13:43:21Z

> The Korifi API needs a metrics endpoint that gives specific metrics like the CF API

Could you point us to the metrics you are referring to? Reading "Accessing metrics" in the Cloud Foundry documentation, I understand that the CLI talks to Log Cache. Log Cache is a completely different API, which Korifi is not intended to implement. As a matter of fact, Korifi does implement a couple of the Log Cache endpoints in a very naive way, so that pushing apps works without a dependency on a Log Cache implementation. However, this is just a naive and temporary solution.

Maybe the correct solution here is to implement the Log Cache API for k8s in a separate component (as the cf CLI currently assumes that it is there) and just make Korifi's /v3/info and/or /v3 endpoint advertise it.

danail-branekov · 2024-12-17T13:53:31Z

> when we have proper documentation of what the metrics mean for the Korifi components, we can talk about monitoring and operational procedures

Honestly, as of today we do not have a clear idea of how to implement observability properly, and we (the Korifi maintainers) do not have the capacity to explore it right now. However, any thoughts and proposals are welcome.

vipinvkmenon · 2024-12-17T17:03:19Z

What I meant was from the perspective of component and operational metrics, like the ones for Cloud Controller and routing: https://docs.cloudfoundry.org/running/all_metrics.html#cc

Yes, most CF component metrics are no longer relevant here, as those components are going to be replaced by a bunch of controllers and CRDs, but many of the metrics from these components were used for the operational aspects of the landscapes.

For example, the Diego metric about the total amount would have helped to understand whether the current number of Diego cells is enough. I guess the analogy here would be the worker nodes in the data plane. But that analogy needs to be built up and mapped. That's what I meant from an operational aspect.

This will be an evolving topic, I understand that. It's probably not the focus now; I added the ticket for future reference.

chombium · 2024-12-18T11:08:48Z

@vipinvkmenon I guess we'll have to combine what we get from the k8s monitoring tools with what we get from the workloads themselves (either Korifi CRDs or CF apps), and give it meaning in the context of Korifi.

I've compared the cf logs output in CF-for-VMs and Korifi, and we have to follow the same route there as well. I've documented my findings in the Log Cache API feature issue.

korifi-bot added this to Korifi - Backlog Dec 16, 2024

github-project-automation bot moved this to 🧊 Icebox in Korifi - Backlog Dec 16, 2024
