Handle sparse files #89
Comments
So this is a request for |
Closing for now, as it's unclear to me.
Sorry I never got to this. And, well, yes. All the major filesystems expose the concept of sparse files, where the file is fragmented on disk and entire sections are marked as being zeroes without actually being stored. For example, take an 8 GiB disk image freshly formatted from a zeroed-out disk, with one or two files of a few KiB written to it. The actual image will only take up a few KiB on disk, because the OS simply marks the file as being all zero except for a few sections. While hexyl naively reading through the file would take several minutes churning through the zero bytes, it would ultimately squeeze those sections out of the output anyway. Reading these files would become feasible if hexyl used the OS subroutines to query the file for its zero ranges, then used that information to skip reading them unnecessarily. I should also add that on filesystems that don't support sparse files, the system calls that query for zeroed-out sections will simply return nothing, so an implementation doesn't have to worry about those cases.
Unfortunately, due to the type signature of the |
@clarfonthey Is this really the case? Can you provide any benchmark/timing results? Ideally, in a reproducible setup.
I mean, it's pretty easy to just |
Sure. I was questioning whether it really takes several minutes or not. In fact, for an 8G sparse file,

But okay, the problem is real. If not for 8G, then for 800G. And I trust you that this is a valid use case.

So the next question is: how could we potentially implement this (ignoring for now what @sharifhsn talked about above)? Would the idea be to do something like the following?

If we are in squeezing mode (i.e. we have already detected a "full line" of zeros) and haven't read any data for X bytes: switch to a special "fast-forward" mode which would call

(Since that call could 'fail' silently and simply return the current position, we would have to make sure to only switch into fast-forward mode once, until we find non-zero bytes again.)

@clarfonthey Are you aware of any tools that do implement something similar?

Benchmark:
(xxd had to be excluded since it does not seem to have a squeezing mode, and is therefore extremely slow.)
xxd actually has a squeezing mode; it calls it autoskip and exposes it with |
Thank you. I did not know that.
Because hexyl truncates repeating sections, it would be nice if hexyl could quickly skip over these sections instead of scanning them byte by byte.