
Add MemberDowngrade failpoint #19038

Draft · wants to merge 1 commit into main
Conversation

siyuanfoundation (Contributor)

Signed-off-by: Siyuan Zhang <[email protected]>
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: siyuanfoundation
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@siyuanfoundation
Contributor Author

siyuanfoundation commented Dec 10, 2024

I am getting the following error sometimes. @serathius do you know what this error usually comes from?

Error: Received unexpected error:
       failed to read WAL, cannot be repaired, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 8193, Term: 4], len(ents): 7458


codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.78%. Comparing base (854bdd6) to head (a809747).
Report is 2 commits behind head on main.

Additional details and impacted files

see 24 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19038      +/-   ##
==========================================
- Coverage   68.86%   68.78%   -0.09%     
==========================================
  Files         420      420              
  Lines       35623    35623              
==========================================
- Hits        24532    24502      -30     
- Misses       9668     9699      +31     
+ Partials     1423     1422       -1     


}
v3_6 := semver.Version{Major: 3, Minor: 6}
// Only a cluster at the current version can be downgraded.
return config.ClusterSize > 1 && v.Compare(v3_6) >= 0 && (config.Version == e2e.CurrentVersion && member.Config().ExecPath == e2e.BinPath.Etcd)
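For illustration, the gate above can be sketched as a minimal, self-contained program. This is not etcd code: `version` is a hand-rolled Major.Minor stand-in for the go-semver dependency, and `downgradeAvailable`/`isCurrentVersion` are illustrative names.

```go
package main

import "fmt"

// version is a minimal stand-in for semver.Version (Major.Minor only).
type version struct{ Major, Minor int64 }

// compare mirrors semver's Compare for the Major.Minor fields:
// it returns -1, 0, or 1.
func (v version) compare(o version) int {
	switch {
	case v.Major != o.Major:
		if v.Major < o.Major {
			return -1
		}
		return 1
	case v.Minor != o.Minor:
		if v.Minor < o.Minor {
			return -1
		}
		return 1
	}
	return 0
}

// downgradeAvailable mirrors the gate in the snippet: only a multi-member
// cluster running at least v3.6 on the current binary can be downgraded.
func downgradeAvailable(clusterSize int, v version, isCurrentVersion bool) bool {
	v3_6 := version{Major: 3, Minor: 6}
	return clusterSize > 1 && v.compare(v3_6) >= 0 && isCurrentVersion
}

func main() {
	fmt.Println(downgradeAvailable(3, version{3, 6}, true)) // true
	fmt.Println(downgradeAvailable(1, version{3, 6}, true)) // false: single member
	fmt.Println(downgradeAvailable(3, version{3, 5}, true)) // false: below 3.6
}
```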
Member

Why only cluster size > 1?

Member

+1 the same question.

return nil, err
}
targetVersion := semver.Version{Major: v.Major, Minor: v.Minor - 1}
numberOfMembersToDowngrade := rand.Int()%len(clus.Procs) + 1
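The selection expression above can be sketched in isolation (a 3-member cluster is assumed here as a stand-in for `len(clus.Procs)`; `pickDowngradeCount` is an illustrative name): it draws uniformly from 1 up to the full cluster size, so both partial and full downgrades occur.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickDowngradeCount uses the same expression as the failpoint snippet:
// rand.Int() is non-negative, so the result is uniform over 1..clusterSize.
func pickDowngradeCount(clusterSize int) int {
	return rand.Int()%clusterSize + 1
}

func main() {
	seen := map[int]bool{}
	for i := 0; i < 1000; i++ {
		n := pickDowngradeCount(3) // assumed 3-member cluster
		seen[n] = true
	}
	// In practice all of 1, 2, and 3 appear, i.e. partial (1 or 2 members)
	// and full (3 members) downgrades are both exercised.
	fmt.Println(len(seen))
}
```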
@serathius (Member) commented Dec 11, 2024

Why not downgrade all members? Trying to think if there are any benefits on testing partial downgrade. Technically all partial upgrades are just subset of procedure of full downgrade.

Member

I think the benefit is to verify that the correctness should never be broken no matter how many members are downgraded.

I am thinking that we should explicitly verify the two cases: full downgrade and partial downgrade.

@ahrtr
Member

ahrtr commented Dec 16, 2024

> I am getting the following error sometimes. @serathius do you know what this error usually comes from?
>
> Error: Received unexpected error:
> failed to read WAL, cannot be repaired, err: wal: slice bounds out of range

Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.

@ahrtr
Member

ahrtr commented Dec 16, 2024

Overall looks good to me; please mark this PR as ready for review when you feel comfortable.

@siyuanfoundation
Contributor Author

siyuanfoundation commented Dec 16, 2024

>> I am getting the following error sometimes. @serathius do you know what this error usually comes from?
>>
>> Error: Received unexpected error:
>> failed to read WAL, cannot be repaired, err: wal: slice bounds out of range
>
> Most likely the WAL record is somehow corrupted again. Added more debug info in #19067.

Updated the error message.
I get this error almost always when I run the robustness test 20 times.
Does that mean the downgrade likely corrupts the WAL?

I tried member.Stop() instead of member.Kill(), and it did not change the results.

@serathius @ahrtr

@ahrtr
Member

ahrtr commented Dec 16, 2024

> Updated the error message. I get this error almost always when I run the robustness test for 20 times. Does that mean the downgrade would likely corrupt WAL?
>
> I tried member.Stop() instead of member.Kill() and did not change the results.

I did not get time to dig into the robustness test's use of reading the WAL file, but my immediate feeling is that the cause might be that an inappropriate snapshot {Index: 0, Term: 0} is always set when opening the WAL.

w, err := wal.OpenForRead(lg, walDir, walpb.Snapshot{Index: 0})

If the v2 snapshot files have already been rotated (in other words, the very first snapshot files have been purged), then from the v2store perspective there is data loss, because you are not reading the WAL records that immediately follow the v2 snapshot index (even though there is no data loss from the v3store perspective). In that case you will definitely see this error.

offset := e.Index - w.start.Index - 1
if offset > uint64(len(ents)) {
	// return error before append call causes runtime panic
	return nil, state, nil, fmt.Errorf("%w, snapshot[Index: %d, Term: %d], current entry[Index: %d, Term: %d], len(ents): %d",
		ErrSliceOutOfRange, w.start.Index, w.start.Term, e.Index, e.Term, len(ents))
}
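Plugging the values from the failure message above into this check (plain arithmetic, not etcd code) shows exactly why it trips: with snapshot index 0 and entry index 8193, the computed offset is 8192, which exceeds len(ents) = 7458.

```go
package main

import "fmt"

func main() {
	// Values reported in the test failure above.
	var snapIndex uint64 = 0     // w.start.Index (snapshot {Index: 0, Term: 0})
	var entryIndex uint64 = 8193 // current entry index
	entsLen := 7458              // len(ents)

	// Same formula as the WAL check: offset := e.Index - w.start.Index - 1.
	offset := entryIndex - snapIndex - 1
	fmt.Println(offset, offset > uint64(entsLen)) // 8192 true -> ErrSliceOutOfRange
}
```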

A thought to double-check this: set a big value for both --max-snapshots and --max-wals to ensure no snapshot file or WAL file rotation happens.

@siyuanfoundation
Contributor Author

@ahrtr Thanks for the suggestion.
I tried "--max-wals=1000", "--max-snapshots=1000", but I am still seeing the same failure. So it is probably not due to rotation?

@ahrtr
Member

ahrtr commented Dec 18, 2024

> @ahrtr Thanks for the suggestion. I tried "--max-wals=1000", "--max-snapshots=1000", still seeing the same failure. So it is probably not due to rotation?

The WAL files shouldn't have rotated; otherwise it would fail to find the WAL files that match the snapshot index.

nameIndex, ok := searchIndex(lg, names, snap.Index)
if !ok {
	return nil, -1, fmt.Errorf("wal: file not found which matches the snapshot index '%d'", snap.Index)
}

Please use etcd-dump-logs to analyse the WAL files and double-check whether the WAL record indexes increase by 1. Run the command below:

./etcd-dump-logs --start-index=0 path/2/data-dir
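The "incremental by 1" property can be checked mechanically on the index column of the etcd-dump-logs output. A small sketch of that check (`consecutive` is an illustrative helper, not part of the tool):

```go
package main

import "fmt"

// consecutive reports whether each index is exactly one above its
// predecessor, i.e. the property to verify in the dumped WAL records.
// On failure it also returns the position of the first gap.
func consecutive(indexes []uint64) (bool, int) {
	for i := 1; i < len(indexes); i++ {
		if indexes[i] != indexes[i-1]+1 {
			return false, i
		}
	}
	return true, -1
}

func main() {
	fmt.Println(consecutive([]uint64{1, 2, 3, 4})) // true -1: no gap
	fmt.Println(consecutive([]uint64{1, 2, 4, 5})) // false 2: gap before index 4
}
```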

Also, I tried to test it locally but couldn't reproduce it. Please provide detailed steps; I may try it when I get bandwidth.

go test -run TestRobustnessExploratory -v --count 100 --failfast --timeout 1h

@ahrtr
Member

ahrtr commented Dec 18, 2024

> Please use the etcd-dump-logs to analyse the WAL files, to double check whether the WAL records are incremental by 1. Run command below,
>
> ./etcd-dump-logs --start-index=0 path/2/data-dir

Refer to #19082.

4 participants