[chore][tracker]: save most recent (archive) write index to disk #36799

Open, wants to merge 8 commits into main

VihasMakwana
Contributor

This PR stores the most recent archive write index to disk, much like the persistent queue does. It also adds Batch methods to operator.Persister, because saving the metadata and saving the index should be a single transaction, and that can only be achieved via Batch.

For example, if the user has configured archiving to store 100 poll cycles, assume:

  • On the first collector run, it stores 10 cycles and archiveIndex is 11 (pointing to the next index).
  • When the collector is restarted, we restore archiveIndex from disk and continue from index 11.
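
The restart scenario above can be sketched as follows. This is an illustration under assumptions: `encodeIndex`, `decodeIndex`, `nextIndex`, the `indexKey` value, and the in-memory `store` map are hypothetical, and the real tracker may encode and store the index differently.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// indexKey is a made-up storage key for this sketch.
const indexKey = "archiveIndex"

// encodeIndex serializes the write index as 8 big-endian bytes.
func encodeIndex(idx int) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, uint64(idx))
	return buf
}

// decodeIndex reverses encodeIndex and rejects malformed input.
func decodeIndex(b []byte) (int, error) {
	if len(b) != 8 {
		return 0, errors.New("unexpected index encoding")
	}
	return int(binary.BigEndian.Uint64(b)), nil
}

// nextIndex advances the write position and wraps around at
// pollsToArchive, so the oldest entries are overwritten first.
func nextIndex(idx, pollsToArchive int) int {
	return (idx + 1) % pollsToArchive
}

func main() {
	store := map[string][]byte{} // stands in for on-disk storage

	// First run: the tracker has advanced to write index 11 and saves it.
	store[indexKey] = encodeIndex(11)

	// After a restart: read the index back instead of starting from 0.
	restored, err := decodeIndex(store[indexKey])
	if err != nil {
		fmt.Println("could not restore index:", err)
		return
	}
	fmt.Println("resuming at index", restored)
	fmt.Println("index after the next poll:", nextIndex(restored, 100))
}
```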

Link to tracking issue

Related #32727

Testing

Added a unit test that checks the index.


- if err := persister.Set(ctx, key, buf.Bytes()); err != nil {
+ ops = append(ops, storage.SetOperation(key, buf.Bytes()))
+ if err := persister.Batch(ctx, ops...); err != nil {
Contributor Author

For existing usage, this will be a no-op.

pkg/stanza/fileconsumer/internal/tracker/tracker.go (outdated)
// It's best if we reset the index or else we might end up writing invalid keys
t.set.Logger.Warn("the read index was found, but it exceeds the bounds. Starting from 0")
t.archiveIndex = 0
}
Member

Good idea to check for this case.

However, I wonder if we can handle it better than restarting from zero. What would it take to search the archive for the most recently updated index?

I think we could maintain some kind of data structure which notes the time each archive was written. Maybe just map[index]time.Time. Then when we first create the tracker, we can load this up and find the most recent timestamp. We can also check for the case where pollsToArchive has changed and then rewrite the storage to align with the new value.

For example, if we previously saved 10 archives and find that pollsToArchive is now 5, we can find the 5 most recent indices based on the timestamp structure, then rewrite the archive files so that these are 0-4. We should probably even delete the extras from storage as well.

Contributor Author

@VihasMakwana, Dec 13, 2024

@djaglowski This solution does make sense to me, but it becomes tricky when we eventually overwrite old archive data, as it is a ring buffer.
We might need to load the filesets in memory.
I'll look into a few approaches.

Member

> it becomes tricky when we eventually overwrite old archive data, as it is a ring buffer.

Can you elaborate?

> We might need to load the filesets in memory.

If it's more than one at a time then it defeats the point of the archive.

@mx-psi removed their assignment on Dec 13, 2024