re-balancing chunks bug #237
Re-replication for utilization re-balancing does not check whether the destination chunk server already holds a replica of the chunk. If the chunk initially had two replicas, one on the re-balance source chunk server and one on the re-replication destination chunk server, then after re-balancing the file ends up with only one replica, on the destination server.
I can provide additional logs from the environment I'm using to prove it, if needed.
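For illustration, a minimal sketch of the kind of destination check the report describes as missing. The names and types below (ChunkServer, ChooseRebalanceDestination, hostedChunks) are hypothetical, not the actual LayoutManager API; the point is only that a re-balance destination that already holds a replica has to be skipped, since otherwise the later extra-replica cleanup can leave the file with a single copy.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

using ChunkId = int64_t;

struct ChunkServer {
    std::string       host;
    int               port = 0;
    double            spaceUtilization = 0.0;  // fraction of disk space in use
    std::set<ChunkId> hostedChunks;            // replicas this server currently holds

    bool HasChunk(ChunkId id) const { return hostedChunks.count(id) != 0; }
};

// Pick the least-utilized candidate that does NOT already hold a replica of
// the chunk. Returning nullptr means "no valid destination, skip this chunk"
// rather than creating a duplicate replica that the extra-replica cleanup
// would later delete, dropping the file below its target replication.
ChunkServer* ChooseRebalanceDestination(ChunkId chunkId,
                                        const std::vector<ChunkServer*>& candidates)
{
    ChunkServer* best = nullptr;
    for (ChunkServer* srv : candidates) {
        if (srv->HasChunk(chunkId)) {
            continue;  // the check the report describes as missing
        }
        if (best == nullptr || srv->spaceUtilization < best->spaceUtilization) {
            best = srv;
        }
    }
    return best;
}
```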
I'd guess the implicit assumption here is that the file replication is greater than 1, possibly 2. If replication is 1 then, if I read the description correctly, the system works as expected. If replication is greater than 1, then it isn't. Though in both cases re-replication would not occur; only the replication check might need to delete extra replicas when the actual replication is greater than the target.
Our problem is that we lose redundancy, in the case described above, for files with replication > 1.
A similar issue happens for evacuation. We evacuated our entire cluster to another one that temporarily joined in order to upgrade disks, and we lost several hundred files just from the move.
I will add some more details to the issue above:
In general it might not be trivial to interpret trace messages and draw the correct conclusion. If I remember right, there are multiple reasons for chunk deletion to be issued and for the corresponding trace messages to appear in the meta server debug trace log. For example, chunk replica deletion could be issued due to a replication or recovery failure; in such a case the deletion ensures that a possibly partial replica gets removed.

I'd recommend starting by inspecting the chunk server log, looking for "media / disk" IO failures. Inspecting the chunk and meta server counters available in the meta server web UI might also help to diagnose the problem.

I do not recall encountering a problem similar to the one described here in the last few years on any of the actively used file systems with a few petabytes per day of read and write load (append with replication 2 and RS 6+3), or in end-to-end testing with and without failure injection. It is possible that the problem is due to the specifics of the system configuration or use pattern. For example, the file system presently has no continuous background verification mechanism that verifies all chunk replicas. The existing "chunk directory" / "disk" health check only includes periodically creating, writing, reading back, and deleting a small file. It is conceivable that the lack of such a mechanism might manifest itself as noticeable replication / recovery failures due to latent, undetected media read failures. Another possibility that comes to mind is that the replication / recovery parameters do not match the HW configuration capabilities / IO bandwidth, resulting in IO failures / timeouts. Though in such a case, after replication / recovery activity stops, recent enough code should be able to re-discover replicas that became unavailable due to IO load.

If the problem is due to a meta server replica / layout management bug, then perhaps the ideal way would be to try to reproduce it in a simple, small (3 chunk server nodes, for example), controlled test setup. Existing test scripts can be used to create such a setup, for example src/test-scripts/run_endurance_mc.sh.
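To make the health-check description above concrete, here is a minimal sketch of a create/write/read-back/delete probe of the kind mentioned; it is not the QFS implementation, and the probe path, pattern size, and function name are assumptions.

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Returns true if the chunk directory passed a basic write/read-back probe.
bool ProbeChunkDirectory(const std::string& chunkDir)
{
    const std::string probePath = chunkDir + "/health.probe.tmp";
    const std::vector<char> pattern(64 * 1024, '\x5A');  // 64 KiB test pattern

    // Create the probe file and write the pattern (a real check would also
    // fsync before declaring the write successful).
    std::FILE* f = std::fopen(probePath.c_str(), "wb");
    if (!f) {
        return false;
    }
    const bool wrote =
        std::fwrite(pattern.data(), 1, pattern.size(), f) == pattern.size();
    std::fclose(f);
    if (!wrote) {
        std::remove(probePath.c_str());
        return false;
    }

    // Read the pattern back and compare, then delete the probe file.
    std::vector<char> readBack(pattern.size());
    f = std::fopen(probePath.c_str(), "rb");
    if (!f) {
        std::remove(probePath.c_str());
        return false;
    }
    const bool readOk =
        std::fread(readBack.data(), 1, readBack.size(), f) == readBack.size();
    std::fclose(f);
    std::remove(probePath.c_str());

    return readOk &&
           std::memcmp(pattern.data(), readBack.data(), pattern.size()) == 0;
}
```

As the comment points out, such a probe only exercises the directory itself; it does not verify existing chunk replicas, so latent media read errors can stay undetected until replication or recovery touches the affected chunks.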
I created a separate repository based on 2.2.1 (https://github.com/hagrid-the-developer/qfs/) that tries to simulate the issue and also to fix it (but probably only partly). It contains a Dockerfile that runs the build and part of the tests. It runs recoverytest.sh for 3 chunk servers, not 2. Finally it runs test-lost-chunks.sh, which:
Creates 25 files.
Runs evacuation of 127.0.0.1:20400.
Waits some time (100 s).
Prints the files in the chunk directories.
I tried to run docker build . -t qfstest:v2.2.1 with the commit that tries to fix the issue (avodaniel/qfs@3528d7e) disabled.
I was investigating chunk 131217 from the file abc-010.xyz. The related logs are in the subdirectory failure (https://github.com/hagrid-the-developer/qfs/tree/fix-rebalancing-issue/failure); from failure/metaserver-drf.err:
07-29-2020 17:51:39.752 INFO - (LayoutManager.cc:11084) starting re-replication: chunk: 131217 from: 127.0.0.1 20400 to: 127.0.0.1 20401 reason: evacuation
...
07-29-2020 17:51:39.774 INFO - (LayoutManager.cc:11920) replication done: chunk: 131217 version: 1 status: 0 server: 127.0.0.1 20401 OK replications in flight: 9
...
07-29-2020 17:51:39.774 DEBUG - (LayoutManager.cc:11503) re-replicate: chunk: <32,131217> version: 1 offset: 0 eof: 10485760 replicas: 2 retiring: 0 target: 1 rlease: 0 hibernated: 0 needed: -1
07-29-2020 17:51:39.774 INFO - (LayoutManager.cc:12470) <32,131217> excludes: srv: other: 3 all: 3 rack: other: 2 all: 2 keeping: 127.0.0.1 20400 20400 0.983188 discarding: 127.0.0.1 20401 20401 0.98691
...
07-29-2020 17:51:39.774 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20400 chunk: 131217 version: 1 removed: 1
...
07-29-2020 17:51:39.775 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20401 chunk: 131217 version: 1 removed: 1
It seems that the chunk was evacuated, but after the evacuation finished and the chunk was removed from the evacuated server, LayoutManager::CanReplicateChunkNow() was called and decided that the chunk had too many replicas and that the replica on the new destination should be removed, because the evacuated server had a lot of free space. Unfortunately, there is a short time window when it is difficult to distinguish an evacuated server from a normal server. That's why I tried to introduce another set of chunks that should be or are evacuated but haven't been removed yet, in commit avodaniel/qfs@3528d7e.
The issue doesn't happen every time; on my laptop with macOS it happened about 50% of the time.
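A minimal sketch of the bookkeeping idea described in the comment above, under assumed names (PendingEvacuationTracker, MarkPending, IsPendingRemoval); this is not the actual LayoutManager change or the referenced commit, only an illustration of tracking replicas whose evacuation has finished but whose copy on the evacuated server has not been removed yet, so the over-replication check does not count that copy and delete the freshly created one.

```cpp
#include <cstdint>
#include <set>
#include <utility>

using ChunkId  = int64_t;
using ServerId = int;  // stand-in for a chunk server handle

// Tracks chunks that were evacuated but whose replica on the evacuated
// server has not been deleted yet, so the over-replication check can
// ignore that replica instead of deleting the newly created one.
class PendingEvacuationTracker {
public:
    // Replication for an evacuation finished: the new replica exists, but
    // the old one on 'evacuatedServer' is still waiting to be removed.
    void MarkPending(ChunkId chunk, ServerId evacuatedServer) {
        mPending.insert(std::make_pair(chunk, evacuatedServer));
    }
    // The stale replica on the evacuated server is really gone.
    void ClearPending(ChunkId chunk, ServerId evacuatedServer) {
        mPending.erase(std::make_pair(chunk, evacuatedServer));
    }
    // Asked by the extra-replica check before counting a replica: a copy
    // that is about to disappear must not be treated as a keeper.
    bool IsPendingRemoval(ChunkId chunk, ServerId server) const {
        return mPending.count(std::make_pair(chunk, server)) != 0;
    }

private:
    std::set<std::pair<ChunkId, ServerId>> mPending;
};
```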
@mikeov I tried to apply the change 1b5a3f5 to my branch avodaniel/qfs@b3963db and it seems that the issue is still present. It also seems that wait is always 0. The related logs from one test run are in https://github.com/hagrid-the-developer/qfs/tree/fix-rebalancing-issue/failure.002; the important part of metaserver-drf.err:
08-05-2020 20:26:43.563 INFO - (LayoutManager.cc:11084) starting re-replication: chunk: 131217 from: 127.0.0.1 20400 to: 127.0.0.1 20402 reason: evacuation
...
08-05-2020 20:26:43.592 DEBUG - (ChunkServer.cc:2199) 127.0.0.1 20402 cs-reply: -seq: 1805093697390353841 log: 0 0 3891 status: 0 replicate chunk: 131217 version: 1 file: 32 fileSize: -1 path: recovStripes: 0 seq: 1805093697390353841 from: 127.0.0.1 20400 to: 127.0.0.1 20402
...
08-05-2020 20:26:43.593 INFO - (LayoutManager.cc:11936) replication done: chunk: 131217 version: 1 status: 0 server: 127.0.0.1 20402 OK replications in flight: 9
08-05-2020 20:26:43.593 DEBUG - (LayoutManager.cc:11517) re-replicate: chunk: <32,131217> version: 1 offset: 0 eof: 10485760 replicas: 2 retiring: 0 target: 1 rlease: 0 wait: 0 hibernated: 0 needed: -1
08-05-2020 20:26:43.593 INFO - (LayoutManager.cc:12486) <32,131217> excludes: srv: other: 3 all: 3 rack: other: 3 all: 3 keeping: 127.0.0.1 20400 20400 0.889967 discarding: 127.0.0.1 20402 20402 0.894062
...
08-05-2020 20:26:43.594 DEBUG - (LayoutManager.cc:3741) CLIF done: status: 0 down: 0 log-chunk-in-flight: 127.0.0.1 20400 logseq: 0 0 3954 type: STL chunk: -1 version: 0 remove: 1 chunk-stale-notify: sseq: -1 size: 1 ids: 131217
08-05-2020 20:26:43.594 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20400 chunk: 131217 version: 1 removed: 1
...
08-05-2020 20:26:43.595 DEBUG - (LayoutManager.cc:3741) CLIF done: status: 0 down: 0 log-chunk-in-flight: 127.0.0.1 20402 logseq: 0 0 3957 type: DEL chunk: 131217 version: 0 remove: 1 meta-chunk-delete: chunk: 131217 version: 0 staleId: 0
08-05-2020 20:26:43.595 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20402 chunk: 131217 version: 1 removed: 1
Thanks @mikeov, it seems it really helped. We will do more tests on a larger cluster.
Great. Thank you for the update.
— Mike.
It seems that chunks really don't get lost now. But sometimes evacuation is really slow. Usually all chunks are moved quickly, and then it takes a very long time to transfer the last chunk: there are long delays between copies of its pieces (like 5 minutes). And sometimes it takes a long time until the file …