Postgres read/write primary node seeming timeouts led to runtime traffic failures on existing runtime proxies #13597
Comments
Hi @bungle, could you take a look at this issue? Thanks.
@bungle, did you have a chance to look at this issue?
@bungle @nowNick @Water-Melon Had this happen again. I'm not sure how much progress you were able to make in refactoring how Kong uses the DB to improve resiliency to DB outages, but we hit an odd case on 11/30 and I wanted to describe what we observed in case it points to anything in Kong's logic. Maybe your existing PR improvements would have fixed this too, but I wanted to bring it up in case it isn't a situation you think has been addressed. Still on Kong 3.7.1 here.

Env setup for the Postgres topology (prod_dmz): Kong runtimes in dc1 and dc2, where both point to the primary node for admin API reads/writes, and each dc1/dc2 Kong instance also points to a _RO secondary DB so read-only queries can be distributed to the local DC when extra lookups are needed. I'm not totally sure of the nuance of how Kong decides whether to read from the primary or from a configured _RO secondary node; I just know the setting is meant to take some pressure off a single primary PG node.

This happened on 11/30 in a production environment. We were studying two proxy paths taking active traffic with 200 successes. Then server2.company.com had some OS patching done that caused a server reboot lasting 11 seconds in that DC, and during that window active traffic to those proxies started throwing the typical 500s, with Kong reporting -1 reverse proxy latency, indicating failures at the gateway running in dc1. The same thing occurred a few hours later when dc2's server3.company.com was patched and rebooted: Kong's runtime in dc2 also threw some 500s. We now capture Kong's runtime pod logs to Splunk, so we can look back at what Kong was emitting during that time. The logs looked like this over the secondary's reboot window, aligning with Kong's stdout critical error logs:
Now it makes sense for Postgres errors like this to show up when a client tries to use a DB server going through a reboot, right? The server goes down, so TCP connections to that port are refused while the host is unreachable. Then we get errors about the DB starting back up as the host comes back online and the process kicks off. Then there are brief errors where the PG process has started but isn't yet ready to handle inbound connections. Then the errors all stop and things in Kong go back to normal.

My question is: why does active traffic get impacted when an RO node goes down? Why doesn't the cache serve the existing route/service/plugin configs already saved to cache? Why does Kong's inability to fetch upstreams cause traffic failure? Is "upstreams" here the Kong Admin API resource, or Kong's internal knowledge of how a service resource's FQDN maps to an IP? We don't even use the upstream resource in Kong in this deployment, other than one dummy upstream we PATCH as a write health check to make sure the Admin API is functional (I mentioned that in the original post, though).

One thing that differs between this 11/30 event and my original post is that we weren't dealing with a primary PG node outage at all this time; it was only the respective _RO PG DB hosts in each DC getting patched that led to these blips of Kong-generated proxy failures in each DC at their separate times.
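For context, the topology above is wired up roughly like this. A minimal sketch of the pg_* / pg_ro_* settings in their environment-variable form; hostnames and credentials are placeholders, not our real values:

```sh
# Primary Postgres node: used for Admin API writes and, when no RO replica
# is configured, for reads as well.
export KONG_PG_HOST=pg-primary.company.com    # placeholder hostname
export KONG_PG_PORT=5432
export KONG_PG_DATABASE=kong
export KONG_PG_USER=kong
export KONG_PG_PASSWORD=********

# Optional read-only replica local to the DC: when set, Kong sends
# read-only queries here instead of to the primary.
export KONG_PG_RO_HOST=pg-ro.dc1.company.com  # placeholder hostname
export KONG_PG_RO_PORT=5432
export KONG_PG_RO_USER=kong_ro
export KONG_PG_RO_PASSWORD=********
```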
@jeremyjpj0916 Are you using Kong Enterprise Edition or OSS (Open Source Software)?
Is there an existing issue for this?
Kong version ($ kong version)
3.7.1
Current Behavior
On 8/8/2024, something in our network or the Postgres cluster itself caused Kong in that network zone to fail some calls with HTTP 500-level errors and fail to reverse proxy on the service path (/api/cel/myservice/member/v1). Pod logs showed this:
Oddly, all tx logs for the HTTP 500s with no proxy routing showed Kong's internal time taking about 5000 ms, but the PG timeout field was set to 8 s, so I wonder how the 5 s timeout comes into play. Is there some default Kong PG connection timeout under the hood that isn't configurable? Or is the DB update frequency, set to 5 s, somehow playing a role?
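For reference, Kong's shipped default for pg_timeout is 5000 ms, which would line up with the 5 s mark if the 8 s override isn't actually being picked up. A minimal sketch of the two settings in their environment-variable form, with values assumed from the description above rather than pulled from the real config:

```sh
# pg_timeout: timeout (in ms) for establishing, reading from, and writing to
# the Postgres connection. The shipped default is 5000 ms.
export KONG_PG_TIMEOUT=8000        # the 8 s value described above

# db_update_frequency: how often (in seconds) each node polls the DB for
# invalidation events; it is a polling interval, not a connection timeout.
export KONG_DB_UPDATE_FREQUENCY=5  # the 5 s value described above
```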
It seems all worker processes in the Kong node were involved during the time of impact too. Note that we set the cache TTL to never expire, so services/routes etc. should all have been cached. The proxy throwing the 500s was running a healthy 200 tx flow right up until impact.
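The never-expiring cache mentioned above maps to db_cache_ttl; assuming it's set via environment variable, that would look like:

```sh
# db_cache_ttl = 0 means cached entities (services, routes, plugins, consumers)
# never expire on their own and are only replaced via invalidation events.
export KONG_DB_CACHE_TTL=0
```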
The proxy was a service+route with the ACL plugin and a variant of Kong's OAuth2 client-credentials flow plugin for auth, and the client was using an existing valid token from prior calls.
We run a sh script in the background from the Kong node that keeps track of Kong's ability to write to the DB, since Kong doesn't otherwise expose anything to track that (hence the uuid you see in the logs, b6e87b44-f251-4ede-b277-030c58f48c1b; that's the uuid of the single healthcheck_db_put upstream in this env):
So while we don't use upstream resources in these Kong nodes (services are all configured against an existing hostname), we do keep one dummy record in upstreams so there is a configured upstream resource whose writes show that the primary node is functional.
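The script itself isn't reproduced above, but the idea is roughly the following hypothetical sketch. It assumes a local Admin API on port 8001 and uses the dummy upstream uuid from the logs; the tag value and the 30 s interval are made up for illustration:

```sh
#!/bin/sh
# Periodically PATCH the dummy upstream via the Admin API so that a successful
# write proves the Kong -> Postgres primary (R/W) path is healthy.
ADMIN_API="http://localhost:8001"                   # assumed local Admin API
UPSTREAM_ID="b6e87b44-f251-4ede-b277-030c58f48c1b"  # healthcheck_db_put uuid from the logs

while true; do
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -X PATCH "$ADMIN_API/upstreams/$UPSTREAM_ID" \
    --data "tags[]=db_write_check_$(date +%s)")
  if [ "$status" = "200" ]; then
    echo "$(date -u) db write healthcheck OK"
  else
    echo "$(date -u) db write healthcheck FAILED with HTTP $status"
  fi
  sleep 30   # made-up interval
done
```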
Expected Behavior
My expected behavior is that even when the primary Postgres write node goes down, an existing OAuth2 client-credentials-authorized consumer who keeps sending their token to call a proxy will be served from the cached resources, and traffic will keep flowing even if the primary R/W node has a brief moment of network connectivity issues.
It seems that sometimes a given Kong node can get into a state where that doesn't happen: it has either lost its cached records or starts doing something unexpected that fails the proxy call at the 5 s mark with a response size of 616 bytes. I'm guessing that's the usual Kong default thrown back at the client, the "unexpected error occurred" JSON body it normally returns.
What makes this more interesting is that during this time only 1 of the 6 Kong nodes talking to this 3-node Postgres cluster (1 primary, 2 secondary nodes), all with the same configuration, started showing this 500-status behavior. That's why I'm wondering whether, in traditional mode, Kong has resiliency issues these days when running against a Postgres cluster where the cache isn't stable, or, better yet, whether Kong could at least try the other secondary read-only nodes for the data it wants before failing the proxy tx when it does need to read from a PG node.
The image above shows that the node with the 3k tx was the one throwing the 500s, over the same interval as the other nodes (the others were not handling most of the traffic though; as you can see, almost all traffic was routing through the one node that was producing the 500s).
Furthermore, I'm willing to speculate that the PG cluster itself was healthy during this window, but something between the two network zones caused connectivity issues that prevented Kong from reaching the PG primary node during the 8-9 minute span when the errors were seen.
Could the PUT calls to the upstreams resource, used as a type of write health check, sometimes cause other Kong resources to be cache-evicted or rebuilt in some odd race condition that doesn't always happen? I really don't know how you could even reproduce or debug this situation, but it's best to make sure Kong is as resilient as possible to DB connectivity issues in traditional mode as well. I'm also still curious what your thoughts are on the 5-second mark being the sweet spot on the failed txs when my PG timeout setting is 8 seconds (maybe that only applies to reads/writes, and 5 s is some internal connection default?).
Steps To Reproduce
Anything else?
No response