Add MemberDowngrade failpoint #19038
base: main
Signed-off-by: Siyuan Zhang <[email protected]>
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: siyuanfoundation
I am getting the following error sometimes. @serathius do you know what this error usually comes from?
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files: see 24 files with indirect coverage changes.
@@            Coverage Diff             @@
##             main   #19038      +/-   ##
==========================================
- Coverage   68.86%   68.78%   -0.09%
==========================================
  Files         420      420
  Lines       35623    35623
==========================================
- Hits        24532    24502      -30
- Misses       9668     9699      +31
+ Partials     1423     1422       -1
}
v3_6 := semver.Version{Major: 3, Minor: 6}
// only current version cluster can be downgraded.
return config.ClusterSize > 1 && v.Compare(v3_6) >= 0 && (config.Version == e2e.CurrentVersion && member.Config().ExecPath == e2e.BinPath.Etcd)
Why only cluster size > 1?
+1 the same question.
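For readers following the thread, the condition under review can be restated as a standalone predicate. This is only a sketch: `version`, `atLeast`, and `downgradeAvailable` are hypothetical names, the local `version` type stands in for `semver.Version` so the example has no external dependencies, and the two booleans stand in for the `config.Version` and `ExecPath` comparisons in the diff.

```go
package main

// version is a hypothetical stand-in for semver.Version, kept local so the
// sketch compiles without external dependencies.
type version struct{ Major, Minor int64 }

// atLeast reports whether v is the same as or newer than min,
// comparing the major version first and then the minor version.
func (v version) atLeast(min version) bool {
	if v.Major != min.Major {
		return v.Major > min.Major
	}
	return v.Minor >= min.Minor
}

// downgradeAvailable mirrors the shape of the condition in the diff:
// more than one member, a 3.6 or newer version, and a cluster that runs
// the current version from the current binary path.
func downgradeAvailable(clusterSize int, v version, isCurrentVersion, usesCurrentBinary bool) bool {
	v3_6 := version{Major: 3, Minor: 6}
	return clusterSize > 1 && v.atLeast(v3_6) && isCurrentVersion && usesCurrentBinary
}
```

Written this way, the `clusterSize > 1` term is the only part that excludes single-member clusters, which is the part the question above is about.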
return nil, err
}
targetVersion := semver.Version{Major: v.Major, Minor: v.Minor - 1}
numberOfMembersToDowngrade := rand.Int()%len(clus.Procs) + 1
Why not downgrade all members? I'm trying to think whether there is any benefit in testing a partial downgrade. Technically, every partial downgrade is just a subset of the full downgrade procedure.
I think the benefit is verifying that correctness is never broken, no matter how many members are downgraded.
I am thinking that we should explicitly verify both cases: full downgrade and partial downgrade.
Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.
Overall looks good to me, please mark this PR as ready to review when you feel comfortable.
Updated the error message. I tried |
I did not get time to dig into how the robustness test reads the WAL files, but my immediate feeling is that the cause might be that an inappropriate snapshot {Index: 0, Term: 0} is always set for the WAL (etcd/tests/robustness/report/wal.go, line 110 in e0bbea9).
If the v2 snapshot files have already been rotated, in other words the very first snapshot files have been purged, then there is data loss from the v2store perspective (you are not reading the WAL records that immediately follow the v2 snapshot index), although there is no data loss from the v3store perspective. In that case you will definitely see this error (etcd/server/storage/wal/wal.go, lines 488 to 493 in e0bbea9).
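The reasoning above can be illustrated with a small sketch. It assumes the check at the referenced lines is a bounds check on the gap between the snapshot index and the first entry read; that is an assumption, not a copy of the etcd code. `entry`, `replay`, and `errSliceOutOfRange` are hypothetical names.

```go
package main

import "errors"

// errSliceOutOfRange is a hypothetical error for an entry whose index does
// not directly follow the entries collected so far.
var errSliceOutOfRange = errors.New("wal: slice bounds out of range")

// entry is a simplified stand-in for a WAL entry record.
type entry struct{ Index uint64 }

// replay collects the entries that follow startIndex (the snapshot index).
// Entries at or before startIndex are skipped. Each remaining entry must
// land at position Index-startIndex-1; a larger position means records
// between the snapshot and this entry are missing.
func replay(startIndex uint64, recs []entry) ([]entry, error) {
	var ents []entry
	for _, e := range recs {
		if e.Index <= startIndex {
			continue // already covered by the snapshot
		}
		up := e.Index - startIndex - 1
		if up > uint64(len(ents)) {
			return nil, errSliceOutOfRange // gap after the snapshot index
		}
		// Truncating to up lets a later record overwrite an earlier one
		// that carries the same index.
		ents = append(ents[:up], e)
	}
	return ents, nil
}
```

Under these assumptions, `replay(0, ...)` over records that start at index 5 fails, while `replay(4, ...)` over the same records succeeds. That matches the scenario described: a fixed start of {Index: 0, Term: 0} only works while the earliest records are still on disk.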
A thought to double check this... set a big value for both |
@ahrtr Thanks for the suggestion.
The WAL files should not have been rotated; otherwise it would fail to find the WAL files that match the snap index (etcd/server/storage/wal/wal.go, lines 405 to 408 in 0966b4d).
Please use the
I also tried to test it locally, but couldn't reproduce it. Please provide detailed steps; I may try again when I get bandwidth.
Refer to |
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
#17118