-
Environment:

dd if=/zpool/f64/nondedup of=/dev/null bs=64k count=100000 status=progress

The read speed on a deduped filesystem is about 1/5 of that on a non-deduped filesystem. I'm curious what causes such a large difference when the DDT is definitely in the ARC. I could not find any clue so far.

My analysis: I traced the ZFS read logic from the SPL down through the DMU layer, but still could not find any difference. Finally, I traced the ZIO pipeline. The only difference is in zio_ddt_read_start and zio_ddt_read_done: zio_ddt_read_start creates an extra zio. The rest of the zio logic is almost the same as a normal filesystem read.

I then suspected context switches caused by taskq, but after tracing I can say that both the dedup and non-dedup reads have the same taskq trigger, i.e. taskq is only triggered by physical IO completion, and all parent zio(s) run in the stack context in my test case.

The rough dedup read zio pipeline looks like:

The non-dedup read zio pipeline looks like:

Please let me know if there is anything I can dig into further.
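Since the two pipeline listings above did not survive, here is a toy model (not real ZFS code, just an illustration of the structure described in this post) of the one difference found: the dedup path runs the physical read through an extra child zio created in zio_ddt_read_start, while the normal path issues the physical read directly.

```python
# Toy model of the structural difference described above (NOT ZFS code):
# the dedup read path inserts one extra child zio between the logical
# read and the physical read.

def physical_read(bp):
    # stands in for the actual disk/ARC read of block pointer `bp`
    return f"data@{bp}"

def normal_read(bp):
    # non-dedup: the logical zio proceeds straight to the physical read
    return physical_read(bp)

def ddt_read(bp):
    # dedup: zio_ddt_read_start creates an extra zio that goes through
    # the DDT entry before the real read; the parent waits on it
    ddt_entry = {"bp": bp}                           # DDT lookup (in ARC)
    child = lambda: physical_read(ddt_entry["bp"])   # the extra child zio
    return child()                                   # completed in _done

print(normal_read(100))  # data@100
print(ddt_read(100))     # same result, one extra zio in the chain
```

Both paths return the same data; the question in this thread is why that one extra hop costs so much.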
Replies: 2 comments 2 replies
-
The read is synchronous: everything must wait for the DDT sync_read to finish before the async_reads can proceed. Does this align with your observations?
-
If you see no difference on disk (I would not expect any unless you hit an error that triggers recovery), have you looked at CPU usage? Maybe collect CPU profiles and compare them to one another and to your expectations? An alternative idea: something could happen to space allocation in the dedup case that makes reads non-sequential when dedup is enabled. With primarycache=metadata you get no prefetch on read, so you depend maximally on disk latency. You could compare disk read latency and/or read offsets.
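To compare read offsets as suggested, a small helper like the following can score how sequential an I/O stream is. It assumes you can export (offset, length) pairs in issue order from whatever tracer you use (e.g. blktrace output); the sample data below is made up for illustration.

```python
def sequential_fraction(reads):
    """Fraction of reads that start exactly where the previous one ended.
    `reads` is a list of (offset, length) pairs in issue order."""
    seq = 0
    for (prev_off, prev_len), (off, _) in zip(reads, reads[1:]):
        if off == prev_off + prev_len:
            seq += 1
    return seq / (len(reads) - 1)

# Hypothetical samples: 64k reads, one perfectly sequential stream and
# one with a scattered block in the middle.
sequential = [(0, 65536), (65536, 65536), (131072, 65536)]
scattered  = [(0, 65536), (9437184, 65536), (131072, 65536)]
print(sequential_fraction(sequential))  # 1.0
print(sequential_fraction(scattered))   # 0.0
```

Running this over traces from the dedup and non-dedup datasets would show directly whether dedup-era allocation scattered the blocks.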