[Feature]: <Operator/Administrator> can understand the health of Korifi Components as well as have custom metrics for CF Specific CRDs #3665

Open
vipinvkmenon opened this issue Dec 16, 2024 · 7 comments

Comments

@vipinvkmenon

Blockers/Dependencies

Currently, no metrics are exposed from the Korifi API. Some metrics come from the controllers, but none from the Korifi API Pod. I did not see a /metrics endpoint for the Korifi API. I could be wrong and missing it; if so, could you share the port and endpoint, please? :)

While I agree that the CF-Korifi architecture is completely different from CF on BOSH, there would be custom metrics (just like in the BOSH deployment of CF) that overlap, e.g. total LRPs (equivalent to total Pods), jobs, etc.
It would be ideal for these to be converted to CF-specific metrics, rather than getting them directly from the Kube Controller. Some metrics are

Background

As a CF-Operator
I want custom metrics that are specific to CF, rather than manually mapping or using all the generic metrics of Kubernetes
So that I can understand the overall health of my CF as a Platform.

Acceptance Criteria

GIVEN Korifi Deployment
WHEN I query /metrics of the Korifi API Pod
THEN I see the custom metrics that are specific to CF.
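To make the acceptance criteria more concrete, here is a minimal sketch of what a /metrics response with CF-specific gauges could look like in the Prometheus text exposition format. The metric names `korifi_cf_apps` and `korifi_cf_processes` and their values are assumptions made up for illustration; nothing like this is exposed by the Korifi API today.

```python
def render_metrics(gauges):
    """Render gauges in the Prometheus text exposition format.

    `gauges` maps a metric name to a (help text, value) pair.
    """
    lines = []
    for name in sorted(gauges):
        help_text, value = gauges[name]
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical CF-specific gauges; names and values are illustrative only.
SAMPLE = {
    "korifi_cf_apps": ("Number of CFApp resources.", 12),
    "korifi_cf_processes": ("Number of CFProcess resources.", 15),
}

if __name__ == "__main__":
    print(render_metrics(SAMPLE), end="")
```

A real endpoint would derive the values from the CRDs in the cluster instead of a hard-coded dictionary.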

Dev Notes

No response

@danail-branekov
Member

Hi @vipinvkmenon

We believe that in the k8s world there are solutions (such as OpenTelemetry) that would be far superior to, and more flexible than, whatever we come up with in Korifi. That is why we have always considered observability and telemetry out of scope for Korifi.

Of course, Korifi should implement metrics endpoints as defined by the CF API (such as getting process stats), but anything outside of the specification should probably be achieved via k8s-native and superior tools.

We are open to discussion, of course. If you are willing to spend some time yourself, you could come up with a proposal and, why not, PRs. You could also consider building a separate component that provides the metrics you find useful, and if you decide to open source it, the community could benefit from your work.

What do you think?

cc @georgethebeatle @zabanov-lab

@vipinvkmenon
Author

We believe that in the k8s world there are solutions (such as OpenTelemetry) that would be far superior to, and more flexible than, whatever we come up with in Korifi. That is why we have always considered observability and telemetry out of scope for Korifi.

I completely agree with this and there is no confusion or question on that aspect.

Of course, Korifi should implement metrics endpoints as defined by the CF API (such as getting process stats), but anything outside of the specification should probably be achieved via k8s-native and superior tools.

Exactly. The Korifi API needs a metrics endpoint that gives specific metrics like the CF API. Primarily, I believe many of the metrics that are emitted by the Cloud Controller, for example, would make sense in the korifi-api as well.

Routing metrics are another such example. These could most likely be mapped against the metrics coming from Contour and Envoy, but there would be custom metrics, for example route_registration_latency, which would probably need to be generated.

Many of these metrics are already present in the metrics server, and probably in Envoy, in their own terms and conventions. So another aspect will probably be to map them against the equivalent metrics that operators are used to seeing in a traditional CF deployment.
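As a rough sketch of the mapping idea above: a lookup table from the name an operator expects to the metric that already exists on the Kubernetes side, with a `None` entry for metrics that have no source and would have to be generated. Every metric name below except route_registration_latency is a placeholder invented for this example, not a real Envoy, Contour, or CF metric name.

```python
# CF-style name -> existing source metric, or None if it has to be generated.
# All names except route_registration_latency are placeholders.
CF_TO_SOURCE = {
    "cf_example_total_requests": "example_proxy_requests_total",
    "cf_example_running_instances": "example_pods_running",
    "route_registration_latency": None,
}


def translate(samples, cf_to_source=CF_TO_SOURCE):
    """Rename scraped samples to CF-style names.

    Returns the renamed samples plus a sorted list of CF-style metrics
    that have no available source and would need to be generated.
    """
    mapped, missing = {}, []
    for cf_name, source in cf_to_source.items():
        if source is not None and source in samples:
            mapped[cf_name] = samples[source]
        else:
            missing.append(cf_name)
    return mapped, sorted(missing)
```

The `missing` list is the interesting part for this discussion: it is the set of metrics that cannot simply be renamed and would need new instrumentation or at least documentation.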

@chombium

Many of these metrics are already present in the metrics server, and probably in Envoy, in their own terms and conventions. So another aspect will probably be to map them against the equivalent metrics that operators are used to seeing in a traditional CF deployment.

I don't think we need everything that cf-on-VMs users are used to, but we need a monitoring and operations guide for Korifi. I guess most of the things we need are buried somewhere deep down in Kubernetes, but we need to describe them and add context and meaning to them. That way, when we have proper documentation of what the metrics mean for the Korifi components, we can talk about monitoring and operational procedures.

@danail-branekov
Member

The Korifi API needs a metrics endpoint that gives specific metrics like the CF API

Could you point us to the metrics you refer to? Reading "Accessing metrics" from the Cloud Foundry documentation, I understand that the CLI talks to the Log Cache. Log Cache is a completely different API, which Korifi is not intended to implement. As a matter of fact, Korifi does implement a couple of the Log Cache endpoints in a very naive way, in order to make pushing apps work without a dependency on a Log Cache implementation. However, this is just a very naive and temporary solution.

Maybe the correct solution here is to implement the Log Cache API for k8s in a separate component (as the cf CLI currently assumes that it is there) and just make Korifi's /v3/info and/or /v3 endpoint advertise it.

@danail-branekov
Member

when we have proper documentation of what the metrics mean for the Korifi components, we can talk about monitoring and operational procedures

Honestly, as of today we do not have a clear idea of how to implement observability properly, and we (the Korifi maintainers) do not have the capacity to explore it right now. However, any thoughts and proposals are welcome.

@vipinvkmenon
Author

What I meant was from the perspective of component and operational metrics, like the ones for the Cloud Controller and routing: https://docs.cloudfoundry.org/running/all_metrics.html#cc

Yes, most of the CF component metrics are no longer relevant here, as those components are going to be replaced by a bunch of controllers and CRDs. However, many of the metrics from these components were used for the operational aspects of the landscapes.

For example, the Diego metric about the total amount would have helped to understand whether the current number of Diego cells is enough. I guess a similar analogy here would be the worker nodes in the data plane, but that analogy needs to be built up and mapped. That's what I meant from an operational aspect.
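One possible reading of that analogy, as a minimal sketch: treat each worker node's allocatable memory minus the memory requested by the pods scheduled on it as the remaining capacity, and check whether one more app instance of a given size would still fit on some node. The function names, the MiB unit, and the sample data are assumptions for illustration; in practice the inputs would come from the nodes' allocatable resources and the pods' resource requests.

```python
def remaining_memory_mib(nodes, pods):
    """Per-node free memory: allocatable minus the sum of pod requests.

    `nodes` maps node name -> allocatable memory in MiB.
    `pods` is a list of (node name, requested memory in MiB) pairs.
    """
    free = dict(nodes)
    for node_name, requested in pods:
        free[node_name] -= requested
    return free


def can_place(free, instance_mib):
    """True if at least one node still has room for an instance of this size."""
    return any(available >= instance_mib for available in free.values())


# Illustrative data only.
nodes = {"node-a": 4096, "node-b": 2048}
pods = [("node-a", 3072), ("node-b", 1024), ("node-a", 512)]
free = remaining_memory_mib(nodes, pods)
```

With the sample data, a 1024 MiB instance still fits on node-b while a 2048 MiB instance fits nowhere, which is the kind of signal an operator would use to decide whether to add worker nodes.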

This will be an evolving topic, I understand that. It's probably not the focus now; I added the ticket for future reference.

@chombium

@vipinvkmenon I guess we'll have to combine the things we get from the k8s monitoring tools with the things we get from the workloads themselves (either Korifi CRDs or CF apps) and give them meaning in the context of Korifi.

I've done a comparison of cf logs output in CF-for-VMs and Korifi and we have to follow the same route there as well. I've documented my findings in the Log Cache API feature issue.

Labels
None yet
Projects
Status: 🧊 Icebox
Development

No branches or pull requests

3 participants