Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tlscommon and httpcommon diagnostics hooks #3587

Merged
merged 13 commits into from
Jun 24, 2024

Conversation

michel-laterman
Copy link
Contributor

@michel-laterman michel-laterman commented May 29, 2024

What is the problem this PR solves?

TLS information provided by fleet-server in diagnostics bundles is lacking.

How does this PR solve the problem?

Add custom hooks to use diag hooks added in elastic/elastic-agent-libs#207 to provide additional files that contain information about the TLS certs used by the server's API, TLS infomation used when connecting to elasticsearch, and a full trace to each specified elasticsearch host.

For example, when enrolling into a cloud deployment with no explicit cert/key for fleet-server, the new files the hooks add are :

fleet-server-output-request.txt

2024/05/29 00:18:41.571489 No TLS settings
2024/05/29 00:18:41.571492 Proxy disable=false url=<nil>
2024/05/29 00:18:41.571541 Request 0 to https://1df578882a2d4d0190aabbbe4ac77268.us-central1.gcp.qa.cld.elstc.co:443 starting
2024/05/29 00:18:41.571576 GetConn called for "1df578882a2d4d0190aabbbe4ac77268.us-central1.gcp.qa.cld.elstc.co:443"
2024/05/29 00:18:41.710378 Connection started to tcp:34.71.9.202:443
2024/05/29 00:18:41.834875 Connection to tcp:34.71.9.202:443 done, err: <nil>
2024/05/29 00:18:41.984139 TLS handshake starting
2024/05/29 00:18:41.984148 TLS handshake done. state={Version:772 HandshakeComplete:true DidResume:false CipherSuite:4865 NegotiatedProtocol: NegotiatedProtocolIsMutual:true ServerName:1df578882a2d4d0190aabbbe4ac77268.us-central1.gcp.qa.cld.elstc.co PeerCertificates:[0xc00021b600 0xc00021bb80 0xc00021c100] VerifiedChains:[] SignedCertificateTimestamps:[] OCSPResponse:[] TLSUnique:[] ekm:0x5557d33c1f20} err=<nil>
2024/05/29 00:18:41.984267 GotConn for "34.71.9.202:443"
2024/05/29 00:18:41.984371 Wrote request headers
2024/05/29 00:18:41.984385 Wrote request err=<nil>
2024/05/29 00:18:42.111030 Response started
2024/05/29 00:18:42.111184 request 0 successful.

fleet-server-api-tls.txt

tlscommon.ServerConfig: Start diagnostics 2024-05-29 00:18:41.569890423 +0000 UTC
tlscommon.ServerConfig: verification_mode=full
tlscommon.ServerConfig: client_auth=none
tlscommon.ServerConfig: ca_sha256=[]
tlscommon.ServerConfig: CertificateSettings: checking certificate keypair
tlscommon.ServerConfig: CertificateSettings: certificate keypair OK.
tlscommon.ServerConfig: CertificateSettings: cert 0 - Subject: CN=fleet-server-dev,O=elastic-fleet
	Issuer: CN=localhost,O=elastic-fleet
	NotBefore: 2024-05-29 00:18:15 +0000 UTC
	NotAfter: 2034-05-29 00:18:15 +0000 UTC
	Fingerprint: EvAr8erNJZJ0BoXor1jnCl5X2WCU79ISMgwMITOEJBs=
	SAN IP: []
	SAN DNS: [fleet-server-dev]
tlscommon.ServerConfig: CertificateAuthorities: certificate_authorities provided.
tlscommon.ServerConfig: CertificateAuthorities: - cert 0 Subject: CN=localhost,O=elastic-fleet
	IsCa: true
	BasicConstraintsValid: true
	NotBefore: 2024-05-29 00:18:15 +0000 UTC
	NotAfter: 2034-05-29 00:18:15 +0000 UTC
	Fingerprint: cSawoNEa2cSPV4Mwce4mRysrcBGtHBGcSeoBL+Cr48Q=

fleet-server-output-tls.txt

error: nil tlscommon.Config

Testing

  1. Use elastic-agent built with the changes in: Add conn param and conn-skip flag to diagnostics command elastic-agent#4946
  2. Compile the corresponding fleet-server binary from this pr with make, i.e. DEV=true SNAPSHOT=true make release-linux/amd64
  3. install/enroll the elastic-agent with a policy that has the fleet-server integration
  4. collect a diagnostics bundle from the agent sudo elastic-agent diagnostics
    Bundle should contain fleet-server-api-tls.txt, fleet-server-output-tls.txt, and fleet-server-output-request.txt files
    Use the --skip-conn flag to skip collecting output request diagnostics (the fleet-server-output-request.txt file)

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Copy link
Contributor

mergify bot commented Jun 10, 2024

This pull request is now in conflicts. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-tls-diags upstream/add-tls-diags
git merge upstream/main
git push upstream add-tls-diags

@michel-laterman michel-laterman changed the title WIP adding TLS diagnostics from elastic-agent-libs Add tlscommon and httpcommon diagnostics hooks Jun 10, 2024
@michel-laterman michel-laterman marked this pull request as ready for review June 10, 2024 16:01
@michel-laterman michel-laterman requested a review from a team as a code owner June 10, 2024 16:01
@michel-laterman
Copy link
Contributor Author

@lucabelluccini, this is an example of the hooks in use; do you think we should add one for apm instrumentation so fleet-server would produce another request-trace?

internal/pkg/config/output.go Show resolved Hide resolved
internal/pkg/config/output.go Show resolved Hide resolved
internal/pkg/server/agent.go Show resolved Hide resolved
internal/pkg/server/agent.go Show resolved Hide resolved
internal/pkg/server/agent.go Show resolved Hide resolved
internal/pkg/server/agent.go Show resolved Hide resolved
internal/pkg/server/agent.go Show resolved Hide resolved
internal/pkg/server/agent.go Show resolved Hide resolved
@lucabelluccini
Copy link
Contributor

@lucabelluccini, this is an example of the hooks in use; do you think we should add one for apm instrumentation so fleet-server would produce another request-trace?

Sorry for the silly question, but what do you mean "add one for APM Instrumentation"?
Do you mean instrumenting the test request with APM so the request is sent with a specific trace header which would not be present otherwise?

Having a trace header might be helpful on some environments to cross reference / find more easily the request in case of troubleshooting.

@michel-laterman
Copy link
Contributor Author

@lucabelluccini, this is an example of the hooks in use; do you think we should add one for apm instrumentation so fleet-server would produce another request-trace?

Sorry for the silly question, but what do you mean "add one for APM Instrumentation"? Do you mean instrumenting the test request with APM so the request is sent with a specific trace header which would not be present otherwise?

Having a trace header might be helpful on some environments to cross reference / find more easily the request in case of troubleshooting.

In this case the trace request would be using the httpcommon diagnostics hook to make a request to the APM hosts to test connectivity.
There is no explicit http header that is a part of that hook

Copy link
Contributor

mergify bot commented Jun 10, 2024

This pull request is now in conflicts. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-tls-diags upstream/add-tls-diags
git merge upstream/main
git push upstream add-tls-diags

@ycombinator ycombinator requested review from pchila and removed request for blakerouse, andrzej-stencel and pchila June 12, 2024 18:14
internal/pkg/config/output.go Outdated Show resolved Hide resolved
@@ -557,6 +557,7 @@ components:
type: string
enum:
- CPU
- CONN
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the discussion in elastic/elastic-agent#4880, we want to have the HTTP connection request diagnostics as an optional value that is enabled by default.

internal/pkg/config/output.go Outdated Show resolved Hide resolved
internal/pkg/server/agent.go Show resolved Hide resolved
Copy link
Contributor

mergify bot commented Jun 20, 2024

This pull request is now in conflicts. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-tls-diags upstream/add-tls-diags
git merge upstream/main
git push upstream add-tls-diags

@michel-laterman
Copy link
Contributor Author

buildkite test this

Copy link

Quality Gate failed Quality Gate failed

Failed conditions
39.7% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@jlind23
Copy link
Contributor

jlind23 commented Jun 21, 2024

@michel-laterman let me know once this is ready to be merged and i'll bypass the sonarcloud check.

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Thanks for the fixes on the context timeout inside of the diagnostic hook.

@michel-laterman michel-laterman merged commit 3c5b375 into elastic:main Jun 24, 2024
7 of 8 checks passed
@michel-laterman michel-laterman deleted the add-tls-diags branch June 24, 2024 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants