Handle sparse files #89
Comments
So this is a request for |
Closing for now, as it's unclear to me.
Sorry I never got to this. And, well, yes. All the major filesystems expose the concept of sparse files, where the file is fragmented on disk and entire sections are marked as being zeroes without actually being stored. For example, take an 8 GiB disk image freshly formatted from a zeroed-out disk, with one or two files of a few KiB written to it. The actual image will only take up a few KiB on disk, because the OS simply marks the file as being all zero except for a few sections. While hexyl naively reading through the file would take several minutes churning through the zero bytes, it would ultimately squeeze those sections out of the output anyway. Reading these files would become feasible if hexyl used the OS subroutines to query the file for its zero ranges, then used that information to skip reading them unnecessarily. I should also add that on filesystems that don't support sparse files, the system calls that query for zeroed-out sections will simply return nothing, so an implementation doesn't have to worry about those cases.
Unfortunately, due to the type signature of the |
@clarfonthey Is this really the case? Can you provide any benchmark/timing results? Ideally, in a reproducible setup.
I mean, it's pretty easy to just |
Sure. I was questioning whether it really takes several minutes or not. In fact, for an 8G sparse file,

But okay, the problem is real. If not for 8G, then for 800G. And I trust you that this is a valid use case.

So the next question is: how could we potentially implement this (ignoring for now what @sharifhsn talked about above)? Would the idea be to do something like the following?

If we are in squeezing mode (i.e. we have already detected a "full line" of zeros) and haven't read any data for X bytes: switch to a special "fast-forward" mode which would call

(Since that call could 'fail' silently and simply return the current position, we would have to make sure to only switch into fast-forward mode once, until we find non-zero bytes again.)

@clarfonthey Are you aware of any tools that do implement something similar?

Benchmark:
(xxd had to be excluded since it does not seem to have a squeezing mode, and is therefore extremely slow.)
xxd actually has a squeezing mode; it calls it autoskip and exposes it with |
Thank you. I did not know that.
Because hexyl truncates repeating sections, it would be nice if hexyl could quickly skip over these sections instead of scanning them byte by byte.