chunkservers get disconnected #230
I'd suggest trying to reproduce the problem on a single-node setup with two chunk servers and one meta server. The top-level makefile has a test target that creates and runs such a setup, and the test fails if any chunk server disconnects during the run. This test runs successfully on Travis. I've run a few tests with a few different settings and was not able to reproduce the problem you're describing.

Another possible experiment that comes to mind is to run iperf between the client and chunk server nodes to see whether iperf causes chunk server disconnects.

The 2.0 code is presently deployed on the Quantcast production cluster. I have not heard any reports of spurious chunk server disconnects, or disconnects that appear to be correlated with IO activity. The 2.0 code might be more sensitive to temporary network outages / latencies with the default configuration. This is needed to minimize meta server primary / backup failover time. The meta server CPU cost of a chunk server re-connect with the 2.0 code is a small fraction of that in prior QFS versions, as typically only a partial chunk server inventory synchronization with the meta server is required on re-connect.

— Mike.
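For reference, a rough sketch of the two experiments suggested above (assuming a QFS source checkout and iperf installed on both ends; host names are placeholders):

# Single-node test setup: one meta server plus two chunk servers.
# The target fails if any chunk server disconnects during the run.
cd qfs
make test

# Bandwidth check between a client node and a chunk server node.
iperf -s                             # run on the chunk server node
iperf -c <chunkserver-host> -t 60    # run on the client node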
Thanks for your suggestions.
We also tried iperf. We've got 10G fiber links between our nodes, so we ran it between them. Is there a way to set the metaserver/chunkserver configuration to be less sensitive to network latency? There are no dropped packets, so it doesn't seem like an outage, maybe just some minor latency.
I'm experimenting with timeout settings in MetaServer using this reference: https://github.com/quantcast/qfs/blob/master/conf/MetaServer.prp
but still getting the Connection reset by peer errors which look like this on the ChunkServers:
I'd suggest adding the following parameter to the meta server configuration file: metaServer.chunkServer.minInactivityInterval = 16, and reverting the timeout parameters to their defaults by changing their values in the meta server configuration file to the following: metaServer.chunkServer.chunkAllocTimeout = 40, then sending a HUP signal to the meta server process, or removing these parameters and restarting the meta server.
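A minimal sketch of applying that suggestion, assuming the stock conf/MetaServer.prp file name and that pgrep can locate the meta server process (both assumptions; adjust to your deployment):

# In MetaServer.prp, set (or restore) the following values:
metaServer.chunkServer.minInactivityInterval = 16
metaServer.chunkServer.chunkAllocTimeout = 40

# Then ask the running meta server to re-read its configuration:
kill -HUP "$(pgrep -f metaserver)"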
Thanks, these settings kind of work. Now there are no chunkserver disconnect errors, and nodes don't end up in the dead nodes history. But there are some write errors that complain about an invalid write id:
Does this mean some ChunkServers are still temporarily lost? I've done several different tests so far. One of them was setting up a qfs 2.0 test cluster on servers that we had been using with qfs 1.2 with no problems (no chunkserver disconnects, no write errors). The write errors occurred on those servers as well. For the record, this is our full qfs 2.0 MetaServer config.prp:
and this is what our ChunkServer config.prp files look like:
Can this also be related to qfs 2.0 being more sensitive to latency? Can we adjust our settings to avoid the write errors?
The qfscat message log time stamps suggest that the 1MB transfer took more than 30 minutes, i.e., the average transfer rate was less than 1048576 * 8 / (30 * 60) ≈ 4.66 Kbps. Given that the chunk server and client network inactivity timeouts are less than a few minutes, this suggests that the data was slowly "trickling" from that client to the chunk server.

The QFS chunk [write] lease timeout is 5 minutes. The chunk server renews the chunk lease after it successfully receives a chunk write RPC from the QFS client. [At 10 Mbps, a 1MB RPC network transfer would take roughly one second.] In this case the chunk write RPC network transfer took more than 30 minutes, the chunk write lease timed out, and the client had to "re-allocate" the chunk and re-issue the RPC.

While it is possible to change the chunk lease timeout and rebuild the code, I think more changes would be required to make the code work with very low bandwidth network transfers. The present code assumes that the available bandwidth between client and chunk servers is at least on the order of 10 Mbps. Assuming a typical modern data center deployment with at least a 1Gbps network, the results of the qfscat experiment might suggest a network HW problem.

At the moment I can only think of the following QFS client change that could be relevant. In theory (and in my testing) the change would increase network utilization and write throughput by pipelining write RPCs. The change isn't 2.0 specific, but it is only in 2.x releases. It was deployed in production at Quantcast more than a year ago, and so far I have not heard any reports that would suggest it is causing problems. If the underlying problem in this case is a HW problem, then a change in transfer bandwidth and / or timing could have an effect on how frequently the problem manifests (abnormally low transfer rate in this case).
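As a quick sanity check of the rate estimate above (the numbers are taken from the comment; the bc invocation is just one way to do the arithmetic):

# 1 MB = 1048576 bytes = 8388608 bits, transferred over ~30 minutes (1800 s):
echo "scale=2; 1048576 * 8 / (30 * 60)" | bc    # ≈ 4660 bits/s ≈ 4.66 Kbps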
We have deployed qfs 2.0 on a cluster of 200 dedicated servers running Debian 9 x86-64. Each has 10 hard drives, and we're running one chunkserver per hard drive. We use a single metaserver on a separate server.
These servers have nothing else running on them. They don't suffer from high load or any resource bottlenecks.
Now there's a problem: every time we start writing to the cluster, chunkservers get temporarily disconnected and end up in the dead nodes history. It doesn't matter how much data we write; a few megabytes are enough to trigger the problem.
This is how it looks from the metaserver:
08-07-2018 01:08:34.061 DEBUG - (NetConnection.cc:108) netconn: 1262 read: Connection reset by peer 104
08-07-2018 01:08:34.061 ERROR - (ChunkServer.cc:1304) 10.10.1.5 21014 / 10.10.1.5:42806 chunk server down reason: communication error socket: good: 0 status: Connection reset by peer 104 -104
08-07-2018 01:08:34.061 INFO - (LayoutManager.cc:6019) server down: 10.10.1.5 21014 block count: 36595 master: 0 replay: 0 reason: communication error; Connection reset by peer 104 chunk-server-bye: 10.10.1.5 21014 logseq: 0 0 893823745 chunks: 36595 checksum: 0 2030060475277 36752 log: in flight: 0 repl delay: 7200 completion: no
and this is how it looks on a chunkserver:
08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:592) 10.10.1.3 20100 meta server inactivity timeout, last request received: 17 secs ago timeout: inactivity: 40 receive: 16
08-07-2018 01:08:34.036 ERROR - (MetaServerSM.cc:1043) 10.10.1.3 20100 closing meta server connection due to receive timeout
The issue is easy to reproduce, a simple
qfscat | qfsput
triggers it. When we stop all write activity, there are no disconnect errors as long as there are no writes to the cluster, i.e. if we don't write for days, there are no chunkserver errors for days.

One thing to note: when the disconnect error occurs, only a single chunkserver on a given node is disconnected; the remaining 9 chunkservers running on that node are fine. There's no pattern to which chunkservers get disconnected; it appears to be random. On one node it's chunkserver01, on another node it's chunkserver09, and so on.
When there's write activity, all nodes are affected by the disconnect errors. The issue is spread almost evenly across nodes and doesn't depend on rack location. No node has a strangely low or high number of errors compared to the others.
None of the nodes have any lost TX/RX packets. There are no hardware issues and networking works properly.
This is a tcpdump packet capture of the issue:
891162 2018-08-14 13:57:21.006309 10.10.1.3 → 10.10.1.100 TCP 122 20100 → 46630 [PSH, ACK] Seq=53247 Ack=2318595 Win=32038 Len=52 TSval=873627156 TSecr=3858378234
891163 2018-08-14 13:57:21.006469 10.10.1.100 → 10.10.1.3 TCP 70 46630 → 20100 [RST, ACK] Seq=2318595 Ack=53299 Win=18888 Len=0 TSval=3858387169 TSecr=873627156
891164 2018-08-14 13:57:21.006540 10.10.1.100 → 10.10.1.3 TCP 78 45238 → 20100 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3858387169 TSecr=0 WS=2
891165 2018-08-14 13:57:21.006608 10.10.1.3 → 10.10.1.100 TCP 78 20100 → 45238 [SYN, ACK, ECN] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=873627157 TSecr=3858387169 WS=2
We have been using qfs 1.2 on a similarly sized cluster with no problems; no such chunkserver disconnect errors occurred there.
Is this a regression in qfs 2.0? What can we do to help troubleshoot and get this fixed?