Add tlscommon and httpcommon diagnostics hooks #3587
This pull request is now in conflicts. Could you fix it @michel-laterman? 🙏
Branch updated from 72f99da to 4d3110a.
@lucabelluccini, this is an example of the hooks in use. Do you think we should add one for APM instrumentation so that fleet-server would produce another request trace?
Two review threads (now outdated and resolved) on changelog/fragments/1718034684-Add-tlscommon-and-httpcommon-hooks.yaml.
Sorry for the silly question, but what do you mean by "add one for APM instrumentation"? Having a trace header might be helpful in some environments to cross-reference the request, or to find it more easily when troubleshooting.
In this case, the trace request would use the httpcommon diagnostics hook to send a request to the APM hosts to test connectivity.
Co-authored-by: Blake Rouse <[email protected]>
@@ -557,6 +557,7 @@ components:
          type: string
          enum:
            - CPU
+           - CONN
From the discussion in elastic/elastic-agent#4880, we want the HTTP connection request diagnostics to be optional and enabled by default.
buildkite test this
Quality Gate failed. Failed conditions
@michel-laterman let me know once this is ready to be merged and I'll bypass the SonarCloud check.
Looks good.
Thanks for the fixes to the context timeout inside the diagnostic hook.
What is the problem this PR solves?
TLS information provided by fleet-server in diagnostics bundles is lacking.
How does this PR solve the problem?
Adds custom hooks that use the diagnostics hooks added in elastic/elastic-agent-libs#207 to provide additional files containing information about the TLS certificates used by the server's API, the TLS information used when connecting to Elasticsearch, and a full request trace to each configured Elasticsearch host.
For example, when enrolling into a cloud deployment with no explicit cert/key for fleet-server, the hooks add these new files:
fleet-server-output-request.txt
fleet-server-api-tls.txt
fleet-server-output-tls.txt
Testing
DEV=true SNAPSHOT=true make release-linux/amd64
sudo elastic-agent diagnostics
Bundle should contain the fleet-server-api-tls.txt, fleet-server-output-tls.txt, and fleet-server-output-request.txt files.
Use the --skip-conn flag to skip collecting output request diagnostics (the fleet-server-output-request.txt file).
Design Checklist
I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
I have or intend to scale test my changes, ensuring they will work reliably with 100K+ agents connected.
I have included fail-safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.
Checklist
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added an entry in ./changelog/fragments using the changelog tool