
Add MemberDowngrade failpoint #19038

Draft · wants to merge 1 commit into main
Conversation

siyuanfoundation (Contributor)

Signed-off-by: Siyuan Zhang <[email protected]>
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: siyuanfoundation
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@siyuanfoundation
Contributor Author

siyuanfoundation commented Dec 10, 2024

I am getting the following error sometimes. @serathius do you know what this error usually comes from?

Error: Received unexpected error:
       failed to read WAL, cannot be repaired, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 8193, Term: 4], len(ents): 7458


codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.78%. Comparing base (854bdd6) to head (a809747).
Report is 2 commits behind head on main.

Additional details and impacted files

see 24 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19038      +/-   ##
==========================================
- Coverage   68.86%   68.78%   -0.09%     
==========================================
  Files         420      420              
  Lines       35623    35623              
==========================================
- Hits        24532    24502      -30     
- Misses       9668     9699      +31     
+ Partials     1423     1422       -1     


}
v3_6 := semver.Version{Major: 3, Minor: 6}
// Only a cluster at the current version can be downgraded.
return config.ClusterSize > 1 && v.Compare(v3_6) >= 0 && (config.Version == e2e.CurrentVersion && member.Config().ExecPath == e2e.BinPath.Etcd)
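For illustration, the gate above can be sketched as a minimal, self-contained program. This is not etcd code: `version` is a hand-rolled Major.Minor stand-in for the go-semver dependency, and `downgradeAvailable`/`isCurrentVersion` are illustrative names.

```go
package main

import "fmt"

// version is a minimal stand-in for semver.Version (Major.Minor only).
type version struct{ Major, Minor int64 }

// compare mirrors semver's Compare for the Major.Minor fields:
// it returns -1, 0, or 1.
func (v version) compare(o version) int {
	switch {
	case v.Major != o.Major:
		if v.Major < o.Major {
			return -1
		}
		return 1
	case v.Minor != o.Minor:
		if v.Minor < o.Minor {
			return -1
		}
		return 1
	}
	return 0
}

// downgradeAvailable mirrors the gate in the snippet: only a multi-member
// cluster running at least v3.6 on the current binary can be downgraded.
func downgradeAvailable(clusterSize int, v version, isCurrentVersion bool) bool {
	v3_6 := version{Major: 3, Minor: 6}
	return clusterSize > 1 && v.compare(v3_6) >= 0 && isCurrentVersion
}

func main() {
	fmt.Println(downgradeAvailable(3, version{3, 6}, true)) // true
	fmt.Println(downgradeAvailable(1, version{3, 6}, true)) // false: single member
	fmt.Println(downgradeAvailable(3, version{3, 5}, true)) // false: below 3.6
}
```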
Member

Why only cluster size > 1?

Member

+1 the same question.

return nil, err
}
targetVersion := semver.Version{Major: v.Major, Minor: v.Minor - 1}
numberOfMembersToDowngrade := rand.Int()%len(clus.Procs) + 1
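The selection expression above can be sketched in isolation (a 3-member cluster is assumed here as a stand-in for `len(clus.Procs)`; `pickDowngradeCount` is an illustrative name): it draws uniformly from 1 up to the full cluster size, so both partial and full downgrades occur.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickDowngradeCount uses the same expression as the failpoint snippet:
// rand.Int() is non-negative, so the result is uniform over 1..clusterSize.
func pickDowngradeCount(clusterSize int) int {
	return rand.Int()%clusterSize + 1
}

func main() {
	seen := map[int]bool{}
	for i := 0; i < 1000; i++ {
		n := pickDowngradeCount(3) // assumed 3-member cluster
		seen[n] = true
	}
	// In practice all of 1, 2, and 3 appear, i.e. partial (1 or 2 members)
	// and full (3 members) downgrades are both exercised.
	fmt.Println(len(seen))
}
```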
@serathius (Member) commented Dec 11, 2024

Why not downgrade all members? Trying to think if there are any benefits on testing partial downgrade. Technically all partial upgrades are just subset of procedure of full downgrade.

Member

I think the benefit is to verify that the correctness should never be broken no matter how many members are downgraded.

I am thinking that we should explicitly verify the two cases: full downgrade and partial downgrade.

@ahrtr
Member

ahrtr commented Dec 16, 2024

> I am getting the following error sometimes. @serathius do you know what this error usually comes from?
>
> Error: Received unexpected error:
> failed to read WAL, cannot be repaired, err: wal: slice bounds out of range

Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.

@ahrtr
Member

ahrtr commented Dec 16, 2024

Overall looks good to me; please mark this PR as ready for review when you feel comfortable.

@siyuanfoundation
Contributor Author

siyuanfoundation commented Dec 16, 2024

>> I am getting the following error sometimes. @serathius do you know what this error usually comes from?
>>
>> Error: Received unexpected error:
>> failed to read WAL, cannot be repaired, err: wal: slice bounds out of range
>
> Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.

Updated the error message.
I get this error almost always when I run the robustness test 20 times.
Does that mean the downgrade likely corrupts the WAL?

I tried member.Stop() instead of member.Kill(), and it did not change the results.

@serathius @ahrtr

@ahrtr
Member

ahrtr commented Dec 16, 2024

> Updated the error message. I get this error almost always when I run the robustness test for 20 times. Does that mean the downgrade would likely corrupt WAL?
>
> I tried member.Stop() instead of member.Kill() and did not change the results.

I did not get time to dig into the robustness test's use of reading the WAL file, but my immediate feeling is that the cause might be that an inappropriate snapshot {Index: 0, Term: 0} is always set when opening the WAL.

w, err := wal.OpenForRead(lg, walDir, walpb.Snapshot{Index: 0})

If the v2 snapshot files have already been rotated (in other words, the very first snapshot files have been purged), then from the v2store perspective there is data loss, because you are not reading the WAL records that immediately follow the v2 snapshot index (even though there is no data loss from the v3store perspective). In that case you will definitely see this error.

offset := e.Index - w.start.Index - 1
if offset > uint64(len(ents)) {
	// return error before append call causes runtime panic
	return nil, state, nil, fmt.Errorf("%w, snapshot[Index: %d, Term: %d], current entry[Index: %d, Term: %d], len(ents): %d",
		ErrSliceOutOfRange, w.start.Index, w.start.Term, e.Index, e.Term, len(ents))
}
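Plugging the values from the failure message above into this check (plain arithmetic, not etcd code) shows exactly why it trips: with snapshot index 0 and entry index 8193, the computed offset is 8192, which exceeds len(ents) = 7458.

```go
package main

import "fmt"

func main() {
	// Values reported in the test failure above.
	var snapIndex uint64 = 0     // w.start.Index (snapshot {Index: 0, Term: 0})
	var entryIndex uint64 = 8193 // current entry index
	entsLen := 7458              // len(ents)

	// Same formula as the WAL check: offset := e.Index - w.start.Index - 1.
	offset := entryIndex - snapIndex - 1
	fmt.Println(offset, offset > uint64(entsLen)) // 8192 true -> ErrSliceOutOfRange
}
```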

A thought to double-check this: set a big value for both --max-snapshots and --max-wals to ensure no snapshot file or WAL file rotation happens.

@siyuanfoundation
Contributor Author

@ahrtr Thanks for the suggestion.
I tried "--max-wals=1000", "--max-snapshots=1000", but I am still seeing the same failure. So it is probably not due to rotation?

@ahrtr
Member

ahrtr commented Dec 18, 2024

> @ahrtr Thanks for the suggestion. I tried "--max-wals=1000", "--max-snapshots=1000", still seeing the same failure. So it is probably not due to rotation?

The WAL files shouldn't have rotated; otherwise it would fail to find the WAL files that match the snapshot index.

nameIndex, ok := searchIndex(lg, names, snap.Index)
if !ok {
	return nil, -1, fmt.Errorf("wal: file not found which matches the snapshot index '%d'", snap.Index)
}

Please use etcd-dump-logs to analyse the WAL files and double-check whether the WAL record indexes increase by 1. Run the command below:

./etcd-dump-logs --start-index=0 path/2/data-dir
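The "incremental by 1" property can be checked mechanically on the index column of the etcd-dump-logs output. A small sketch of that check (`consecutive` is an illustrative helper, not part of the tool):

```go
package main

import "fmt"

// consecutive reports whether each index is exactly one above its
// predecessor, i.e. the property to verify in the dumped WAL records.
// On failure it also returns the position of the first gap.
func consecutive(indexes []uint64) (bool, int) {
	for i := 1; i < len(indexes); i++ {
		if indexes[i] != indexes[i-1]+1 {
			return false, i
		}
	}
	return true, -1
}

func main() {
	fmt.Println(consecutive([]uint64{1, 2, 3, 4})) // true -1: no gap
	fmt.Println(consecutive([]uint64{1, 2, 4, 5})) // false 2: gap before index 4
}
```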

Also, I tried to test it locally but couldn't reproduce it. Please provide detailed steps; I may try it when I get bandwidth.

go test -run TestRobustnessExploratory -v --count 100 --failfast --timeout 1h

@ahrtr
Member

ahrtr commented Dec 18, 2024

> Please use the etcd-dump-logs to analyse the WAL files, to double check whether the WAL records are incremental by 1. Run command below,
>
> ./etcd-dump-logs --start-index=0 path/2/data-dir

Refer to #19082.

4 participants