Much lower compression on new pool with largely same settings #16360
-
I have a pool that I have been using for a long time to store various kinds of data, and I recently started moving some of it to a new pool built from the disks I had used in the first pool before upgrading its drives. Both pools use ashift=12, lz4 compression, and recordsize=1M on all datasets. If I take a file that is well compressed on the origin pool and make a copy on that same pool, the copy is also well compressed, judging by the difference between its logical size and its size on disk.

I should note that I originally created the new pool with checksum=blake3 and recordsize=4M (with the limit increased), but when I noticed this difference I set these back to the values used on the origin pool, deleted all copied files, and started over. That made no difference. I have compared the dataset and pool properties; the configurable native properties are essentially the same. The only notable differences are the size of the pool, the vdev structure (the origin pool is a single raidz1 vdev, while the new pool has no parity and its sole vdev is a Device Mapper virtual block device), the name and description, and the enabled features. Namely, the following features are enabled in the new pool but not in the old one:

What is the cause of this difference? I would prefer to keep the compression behavior of my old pool, because it has a meaningful impact on how soon I will need to upgrade. This is what I'm using:
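For context, one way to spot-check the logical-vs-on-disk comparison described above is with `du` (the file path below is a placeholder, not from the thread):

```sh
# Logical (apparent) size vs. space actually allocated on disk
du -h --apparent-size /oldpool/data/somefile
du -h /oldpool/data/somefile
```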
-
I have attached exports of the pool properties and of one of the affected datasets: props.zip. Please ignore the size and free-space properties; I'm in the process of copying all files again to get a bigger picture of the current effectiveness of compression.
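For anyone who wants to produce the same kind of export, roughly (pool and dataset names are placeholders):

```sh
# Dump every pool and dataset property to text files for comparison
zpool get all newpool > newpool.pool.props
zfs get all newpool/data > newpool.dataset.props
```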
-
Are you sure the old pool really has ashift=12? You should not rely on the pool property, since it only affects newly added vdevs and changing it later does not matter. Look at the zdb output, which shows the real ashift of each individual leaf vdev.
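For reference, the per-vdev ashift can be read from the cached pool config like this (pool names are placeholders):

```sh
# zdb -C prints the cached config, including "ashift:" for each leaf vdev
zdb -C oldpool | grep -w ashift
zdb -C newpool | grep -w ashift
```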
What is the average file size? If files are very small, then recordsize may not matter if it is rarely/never reached.
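A rough way to get that file-size picture, if useful (GNU find/awk; the path is a placeholder):

```sh
# Count files and compute the average size in MiB under a dataset mountpoint
find /newpool/data -type f -printf '%s\n' |
    awk '{ s += $1; n++ } END { if (n) printf "%d files, avg %.1f MiB\n", n, s / n / 1048576 }'
```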
Pool topology may also matter. For example, RAIDZ1 rounds each allocation up to a multiple of 2 allocation units, which is only 1KB at ashift=9 but a much more significant 8KB at ashift=12, while a single vdev or a mirror rounds up only to 1 allocation unit (see the sketch below). The mention of Device Mapper does not tell me anything by itself; depending on its characteristics, ZFS may increase ashift up to 16KB if needed (again, check the zdb output). If you use Device Mapper to build some sort of RAID under ZFS -- please don't, since you lose ZFS's ability to recover data from multiple copies.
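To make the rounding concrete, here is a rough sketch of the raidz allocation math, assuming a 4-disk raidz1 at ashift=12 and a 32K compressed block (these numbers are illustrative, not taken from the thread):

```sh
# Mirrors the rounding done for raidz allocations: data sectors plus parity
# sectors, rounded up to a multiple of (nparity + 1) sectors.
awk -v psize=$((32 * 1024)) -v ashift=12 -v ndisks=4 -v nparity=1 'BEGIN {
    sect = 2 ^ ashift
    data = int((psize - 1) / sect) + 1                               # data sectors
    par  = nparity * int((data + ndisks - nparity - 1) / (ndisks - nparity))  # parity sectors
    tot  = data + par
    rup  = nparity + 1
    tot  = int((tot + rup - 1) / rup) * rup                          # round up
    printf "psize=%d B -> asize=%d B on raidz%d\n", psize, tot * sect, nparity
}'
# Prints: psize=32768 B -> asize=49152 B on raidz1
```

On a single-disk or mirror vdev the same 32K block would simply occupy 32K, which is why identical data can show different "space saved" numbers on the two pools.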
-
That's a good point, thanks!
The vdev with guid of
On the dataset for which I have included the properties export, most files range from 500 MB to 10+ (below 20) GB.
In case it matters, the Device Mapper virtual block device is a LUKS device on a full disk, no partition table.
I don't, this is only a single layer of full-disk encryption. This pool is intentionally without parity; it will only store replaceable data, backups, and other such things.
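It may be worth checking what sector sizes the LUKS mapping actually advertises, since that is what ZFS derives ashift from (the device/mapping name below is a placeholder):

```sh
# Logical and physical sector size reported by the mapped device
blockdev --getss --getpbsz /dev/mapper/backup-crypt
# LUKS2 can also be formatted with a larger sector size; cryptsetup shows it
cryptsetup status backup-crypt | grep -i sector
```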
-
Looking at the output of Do both of them agree the result is logically the same size, even if actually not? The awkward part about doing math on raidz is that all the numbers you get from things like
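For what it's worth, a quick way to put the logical and allocated numbers for both datasets side by side (dataset names are placeholders):

```sh
# -p prints exact byte values; compare logical vs. allocated space and ratios
zfs get -p logicalused,used,logicalreferenced,referenced,compressratio oldpool/data newpool/data
```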
-
I wonder how you copied the data to the new pool. If you used
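In case the copy went over zfs send/receive, these flags affect how blocks land on the destination; a sketch with placeholder snapshot and dataset names:

```sh
# -L keeps records larger than 128K intact, -c sends blocks still compressed,
# -e lets very small blocks stay embedded in the block pointer
zfs send -L -e -c oldpool/data@snap | zfs receive newpool/data
```

Without -L, records larger than 128K are split back down to 128K in the stream, which changes how the data is laid out and compressed on the receiving pool.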
Well, in the example dataset you included, assuming it's the same data on the old pool and the new one, the old said 744G including snapshots and 394G live, with a compression ratio of 1.00x, while the new says 144G and a compression ratio of 1.00x.
So I'm not sure the problem here is one of compression differences. I really think it's just raidz deflateratio surprising you.
Pick a particular file on the old and the new which differ in apparent space savings and examine them closely in zdb.
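A sketch of how one such file can be examined (dataset name and path are placeholders):

```sh
# -O resolves a path inside the dataset to its object and dumps it;
# enough -d levels then print each block's lsize/psize/asize
zdb -O newpool/data path/inside/dataset/to/file
zdb -ddddd newpool/data 12345   # 12345 = object number reported by -O above (placeholder)
```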