
Conversation

@tchaikov (Contributor) commented Oct 11, 2025

Enhance block device initialization to query additional device characteristics beyond logical block size:

  • use physical_block_size for write alignment
  • use logical_block_size for read alignment
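
For reference, a minimal sketch of how these characteristics can be queried from a block device (BLKSSZGET, BLKPBSZGET and BLKIOMIN are the standard Linux ioctls; the device path is just an example, error handling is omitted, and this is not the PR's actual code):

#include <fcntl.h>
#include <linux/fs.h>    // BLKSSZGET, BLKPBSZGET, BLKIOMIN
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/dev/nvme0n1", O_RDONLY | O_CLOEXEC);  // example device
    int logical = 0;
    unsigned physical = 0, min_io = 0;
    ioctl(fd, BLKSSZGET, &logical);    // logical block size -> read alignment
    ioctl(fd, BLKPBSZGET, &physical);  // physical block size -> write alignment
    ioctl(fd, BLKIOMIN, &min_io);      // minimum_io_size hint (chunk size on RAID)
    std::printf("logical=%d physical=%u min_io=%u\n", logical, physical, min_io);
    close(fd);
}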

@tchaikov (Contributor, Author)

This table shows the parameters resulting from various device types being registered with the Linux kernel.

Table 9: Common device types and their resulting parameters (all values in bytes)

| Device | logical | physical | min_io | opt_io | align_off |
|---|---|---|---|---|---|
| Disk 512/512 | 512 | 512 | 512 | 0 | 0 |
| Disk 512/4 KiB | 512 | 4096 | 4096 | 0 | 0 |
| Disk 512/4 KiB, 1-aligned | 512 | 4096 | 4096 | 0 | 3584 |
| Disk 4 KiB/4 KiB | 4096 | 4096 | 4096 | 0 | 0 |
| RAID0, 64 KiB × 4 drives | 512 | 512 | 65536 | 262144 | 0 |
| RAID1, 16 KiB | 512 | 512 | 16384 | 0 | 0 |
| RAID5, 8 KiB × 3 drives, 1-aligned | 512 | 512 | 8192 | 16384 | 3584 |

(For the 1-aligned rows, align_off = 4096 − 512 = 3584: the start is one 512-byte sector past a 4096-byte physical boundary.)

quoted from https://people.redhat.com/msnitzer/docs/linux-advanced-storage-6.1.pdf, section 1.5

src/core/file.cc Outdated
// - minimum_io_size: preferred minimum I/O size the device can perform without incurring a read-modify-write
// - physical_block_size: smallest unit a physical storage device can write atomically
// - logical_block_size: smallest unit the storage device can address (typically 512 bytes)
size_t block_size = std::ranges::max({logical_block_size, physical_block_size, minimum_io_size});
Member

physical_block_size should be the write alignment, but not the read alignment. There's no downside to reading a 512 byte logical sector from a 4096 byte physical sector disk.

Wrt write alignment, even there it's iffy. Writing 4096 avoids RMW but can generate space amplification. With the currently exposed parameters, physical_block_size for writes is the best match. We may want to expose another write alignment (choosing a name will be hard) to indicate a non-optimal write block size that is smaller than the other one.

Contributor Author

Thanks for pointing out the read/write differentiation! I've updated the implementation:

  • Read alignment: Now uses logical_block_size only (as you suggested - no downside to reading 512-byte sectors)
  • Write alignment: Now uses physical_block_size (not max(logical, physical, min_io))

You're right about the space amplification issue. I verified this in the kernel source - the Linux kernel only enforces logical_block_size alignment for O_DIRECT (see block/fops.c:blkdev_dio_invalid()):

  return (iocb->ki_pos | iov_iter_count(iter)) &
          (bdev_logical_block_size(bdev) - 1);

This confirms that physical_block_size and min_io are optimization hints, not requirements. Using physical_block_size provides the best balance:

  • Avoids hardware-level RMW (4K physical sectors)
  • Prevents space amplification from RAID stripe alignment (min_io can be 64 KiB+)

For context, I found that RAID devices set min_io to the chunk/stripe size (see drivers/md/raid0.c:386, raid5.c:7748, raid10.c:4003), which would cause massive space waste if used for write alignment.

Regarding exposing another write alignment: This is an interesting idea. We could expose something like:

  • disk_write_dma_alignment = physical_block_size (current, safe default)
  • disk_write_dma_alignment_optimal = max(physical_block_size, min_io) (for apps willing to trade space for throughput)

However, I'm inclined to defer this until we have a concrete use case. Most Seastar applications probably want "do the right thing" rather than having to choose between alignment strategies.

@avikivity (Member)

> This table shows the parameters resulting from various device types being registered with the Linux kernel.
>
> Table 9: Common device types and their resulting parameters
>
> Device | logical | physical | min_io | opt_io | align_off
> Disk 512/512 | 512 | 512 | 512 | 0 | 0
> Disk 512/4 KiB | 512 | 4096 | 4096 | 0 | 0
> Disk 512/4 KiB, 1-aligned | 512 | 4096 | 4096 | 0 | 3584
> Disk 4 KiB/4 KiB | 4096 | 4096 | 4096 | 0 | 0
> RAID0, 64 KiB × 4 drives | 512 | 512 | 65536 | 262144 | 0

I don't understand min_io and opt_io for RAID0. The disks could just as easily read and write 512-byte blocks here.

> RAID1, 16 KiB | 512 | 512 | 16384 | 0 | 0

Nor this. Why is 16k more optimal than anything else?

> RAID5, 8 KiB × 3 drives, 1-aligned | 512 | 512 | 8192 | 16384 | 3584

16k makes sense here because it avoids a RMW. But I don't understand 8k.

> quoted from https://people.redhat.com/msnitzer/docs/linux-advanced-storage-6.1.pdf, section 1.5

@tchaikov (Contributor, Author)

>> This table shows the parameters resulting from various device types being registered with the Linux kernel.
>> Table 9: Common device types and their resulting parameters
>> Device | logical | physical | min_io | opt_io | align_off
>> Disk 512/512 | 512 | 512 | 512 | 0 | 0
>> Disk 512/4 KiB | 512 | 4096 | 4096 | 0 | 0
>> Disk 512/4 KiB, 1-aligned | 512 | 4096 | 4096 | 0 | 3584
>> Disk 4 KiB/4 KiB | 4096 | 4096 | 4096 | 0 | 0
>> RAID0, 64 KiB × 4 drives | 512 | 512 | 65536 | 262144 | 0
>
> I don't understand min_io and opt_io for RAID0. The disks could just as easily read and write 512-byte blocks here.
>
>> RAID1, 16 KiB | 512 | 512 | 16384 | 0 | 0
>
> Nor this. Why is 16k more optimal than anything else?
>
>> RAID5, 8 KiB × 3 drives, 1-aligned | 512 | 512 | 8192 | 16384 | 3584
>
> 16k makes sense here because it avoids a RMW. But I don't understand 8k.
>
>> quoted from https://people.redhat.com/msnitzer/docs/linux-advanced-storage-6.1.pdf, section 1.5

I looked into the Linux kernel source code to understand how these values are set. Here's what I found:

TL;DR: For RAID devices, min_io is set to the chunk/stripe unit size (the amount written to each individual disk), not the minimum physically possible I/O size.

In the Linux kernel, the striped RAID levels (RAID0/5/10) set io_min to the chunk size (see the drivers/md references below).

Answering Your Questions

RAID0 (64 KiB × 4 drives): "The disks could just as easily read and write 512 byte blocks here."

Yes, physically they can, but:

  • min_io = 65536 (64 KiB) is the chunk size - aligning I/O to chunk boundaries avoids fragmenting operations across multiple drives unnecessarily
  • opt_io = 262144 (256 KiB) is the full stripe width (64 KiB × 4) - this maximizes parallelism by utilizing all drives

You're right that smaller I/Os work, but they defeat the purpose of RAID0's striping.

RAID1 (16 KiB): "Why is 16k more optimal than anything else?"

I couldn't find this in the kernel code. RAID1 doesn't set io_min to a chunk size (it has no striping concept). This 16k value is likely one of the following:

  • Inherited from the underlying disk's characteristics
  • Controller/configuration specific
  • Or possibly an artifact of how that specific example was set up

RAID1 should typically just mirror the underlying device topology.

RAID5 (8 KiB × 3 drives): "I don't understand 8k"

From the kernel code:

  • min_io = 8192 (8 KiB) is the chunk size - smallest alignment unit
  • opt_io = 16384 (16 KiB) is the data stripe width (8 KiB × 2 data drives, excluding parity)

The 8k ensures I/O aligns with chunk boundaries. The 16k (which you understood) is optimal because it writes a full data stripe.
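
To make the arithmetic concrete, here is a tiny illustrative helper (not kernel code; the chunk sizes and drive counts are simply the table's examples):

#include <cstdio>

// opt_io as described above: the full data-stripe width.
constexpr unsigned raid0_opt_io(unsigned chunk, unsigned drives) {
    return chunk * drives;          // every drive carries data
}
constexpr unsigned raid5_opt_io(unsigned chunk, unsigned drives) {
    return chunk * (drives - 1);    // one drive's worth of each stripe holds parity
}

int main() {
    std::printf("RAID0 opt_io: %u\n", raid0_opt_io(64 * 1024, 4));  // 262144 (256 KiB)
    std::printf("RAID5 opt_io: %u\n", raid5_opt_io(8 * 1024, 3));   // 16384 (16 KiB)
}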

@tchaikov force-pushed the file-block-size branch 3 times, most recently from d5dda7a to b1deb82 on October 20, 2025 07:23
@tchaikov (Contributor, Author)

v2:

  • use physical_block_size for write alignment
  • use logical_block_size for read alignment

@tchaikov requested a review from avikivity on October 20, 2025 07:29
@mykaul commented Oct 20, 2025

> quoted from https://people.redhat.com/msnitzer/docs/linux-advanced-storage-6.1.pdf, section 1.5

This is a great document, but it's also ~15 years old. It's less relevant for SSD/NVMe, for example.

@tchaikov force-pushed the file-block-size branch 2 times, most recently from e6157f5 to 05785ef on October 20, 2025 12:18
@tchaikov (Contributor, Author) commented Oct 20, 2025

> This is a great document, but it's also ~15 years old. It's less relevant for SSD/NVMe, for example.

Good point. While the document is dated, the kernel topology API it describes is still current and works for modern devices.

For modern SSDs/NVMe (verified from Linux kernel source drivers/nvme/host/core.c):

/* NOWS = Namespace Optimal Write Size */
if (id->nows)
    io_opt = bs * (1 + le16_to_cpu(id->nows));
  • logical_block_size and physical_block_size: Usually 512 or 4096 bytes
  • minimum_io_size: Set to physical_block_size (typically 512 or 4096)
  • optimal_io_size: Derived from NOWS (Namespace Optimal Write Size) field in NVMe spec
    • If NOWS = 0 (not specified): optimal_io_size = 0
    • If NOWS is set (e.g., 7): optimal_io_size = block_size * (1 + NOWS) = 512 * 8 = 4096
    • Many consumer NVMe drives don't set NOWS, so optimal_io_size = 0
    • Enterprise/datacenter NVMe drives often set NOWS to indicate write granularity

The key difference from the RAID examples is that RAID arrays have complex stripe/chunk geometry (64 KiB chunks × multiple drives) leading to much larger min_io/opt_io values. NVMe drives, when they set NOWS, typically indicate page-size granularity (4-16 KiB).

Important: Our implementation in this PR deliberately uses only physical_block_size for write alignment, not optimal_io_size or minimum_io_size. This avoids write amplification that would occur with large optimal_io_size values (e.g., forcing all writes to 256 KiB alignment for RAID0 would be wasteful). We only need to avoid hardware-level RMW, which physical_block_size handles.

So while the document's RAID examples show legacy use cases with their stripe-based optimal sizes, we use the simpler physical_block_size approach that works well for both modern SSDs and RAID without write amplification issues.

@avikivity (Member) commented Oct 21, 2025

> v2:
>
>   • use physical_block_size for write alignment
>   • use logical_block_size for read alignment

We recently discovered that some disks lie about physical_block_size.

@robertbindar (or @tchaikov if you want), suggest adjusting iotune to detect the physical block size and write it in io_properties.yaml. Then the reactor can pick it up and use it to override what it detects from the disk.
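
For illustration only, such an override might end up looking like the sketch below in io_properties.yaml. The existing keys (mountpoint, read_iops, read_bandwidth, write_iops, write_bandwidth) are the ones iotune already emits; physical_block_size is a hypothetical name for the new field, not an agreed API:

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 500000
    read_bandwidth: 2000000000
    write_iops: 400000
    write_bandwidth: 1500000000
    physical_block_size: 4096   # hypothetical key: overrides what BLKPBSZGET reports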

src/core/file.cc Outdated
// Configure DMA alignment requirements based on block device characteristics
// - Read alignment: logical_block_size (no performance penalty for reading 512-byte sectors)
// - Write alignment: physical_block_size (avoids hardware-level RMW)
_memory_dma_alignment = write_block_size;
Member

memory_dma_alignment is fixed at 512 IIRC, regardless of logical/physical sector sizes.

Member

I see it can be obtained via statx stx_dio_mem_align.

Member

Interestingly it returns 4 for my nvme (and 512 for xfs files)

Contributor Author

> Interestingly it returns 4 for my nvme (and 512 for xfs files)

stx_dio_mem_align is always 4 for NVMe devices.

  1. The NVMe driver sets dma_alignment:

File: drivers/nvme/host/core.c

static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
              struct queue_limits *lim)
{
      lim->max_hw_sectors = ctrl->max_hw_sectors;
      lim->max_segments = min_t(u32, USHRT_MAX,
              min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments));
      lim->max_integrity_segments = ctrl->max_integrity_segments;
      lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
      lim->max_segment_size = UINT_MAX;
      lim->dma_alignment = 3;  // <-- LINE 2080
}
  2. The block layer reads it and sets stx_dio_mem_align:

File: block/bdev.c

void bdev_statx(const struct path *path, struct kstat *stat, u32 request_mask)
{
      ...
      if (request_mask & STATX_DIOALIGN) {
              stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;  // 3 + 1 = 4
              stat->dio_offset_align = bdev_logical_block_size(bdev);
              stat->result_mask |= STATX_DIOALIGN;
      }
      ...
}

So it's indeed 4 bytes. I will prepare a new revision of this pull request that uses statx() to query the memory buffer alignment for block devices, falling back to physical_block_size if statx() is unavailable or unsupported.
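
A minimal sketch of that plan (illustrative only, not the revision itself; it assumes glibc's statx() wrapper and a kernel new enough to support STATX_DIOALIGN, i.e. 6.1+, and the helper name is made up):

#include <fcntl.h>
#include <sys/stat.h>   // statx(), struct statx, STATX_DIOALIGN (glibc >= 2.28)

// Hypothetical helper: DMA buffer alignment for an already-open block device,
// falling back to a caller-supplied value (e.g. physical_block_size) when
// STATX_DIOALIGN is not supported by the running kernel.
unsigned dio_mem_align(int fd, unsigned fallback) {
    struct statx stx = {};
    if (statx(fd, "", AT_EMPTY_PATH, STATX_DIOALIGN, &stx) == 0 &&
        (stx.stx_mask & STATX_DIOALIGN) && stx.stx_dio_mem_align != 0) {
        return stx.stx_dio_mem_align;   // 4 on the NVMe device discussed above
    }
    return fallback;                    // older kernel: no STATX_DIOALIGN
}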

@tchaikov (Contributor, Author)

>> v2:
>>
>>   • use physical_block_size for write alignment
>>   • use logical_block_size for read alignment
>
> We recently discovered that some disks lie about physical_block_size.
>
> @robertbindar (or @tchaikov if you want), suggest adjusting iotune to detect the physical block size and write it in io_properties.yaml. Then the reactor can pick it up and use it to override what it detects from the disk.

Will see what I can do. I/O performance is critical to the Crimson project, so this is important to us.

tchaikov added a commit to tchaikov/seastar that referenced this pull request Dec 5, 2025
Implement empirical detection of physical block size by testing write
performance at different alignments (512, 1K, 2K, 4K, 8K bytes).

This addresses the issue raised in PR scylladb#3046 where some disks lie about
their physical_block_size via sysfs. The detection algorithm:

1. Performs random write tests at each alignment
2. Measures IOPS for each alignment with detailed logging
3. Finds the smallest alignment where performance plateaus (when the next
   larger alignment doesn't improve IOPS by more than 5%)
4. Compares detected value with driver-reported value and warns if they differ
5. Writes the detected value to io_properties.yaml

The reactor can then use this empirically-determined value to override
what the disk reports, preventing write amplification due to hardware-level
read-modify-write cycles.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
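
A rough sketch of the plateau-detection step described in the commit message above (names, types and the exact threshold are illustrative; this is not the actual iotune code):

#include <cstddef>
#include <vector>

struct alignment_sample {
    size_t alignment;   // 512, 1024, 2048, 4096, 8192 bytes
    double iops;        // measured random-write IOPS at this alignment
};

// Pick the smallest alignment whose IOPS the next larger alignment does not
// improve by more than 5%, i.e. where write performance has plateaued.
size_t detect_physical_block_size(const std::vector<alignment_sample>& samples) {
    for (size_t i = 0; i + 1 < samples.size(); ++i) {
        if (samples[i + 1].iops <= samples[i].iops * 1.05) {
            return samples[i].alignment;
        }
    }
    return samples.back().alignment;   // no plateau found: report the largest tested
}
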
tchaikov added a commit to tchaikov/seastar that referenced this pull request Dec 8, 2025
Implement empirical detection of physical block size by testing write
performance at different alignments (512, 1K, 2K, 4K, 8K bytes).

This addresses the issue raised in PR scylladb#3046 where some disks lie about
their physical_block_size via sysfs. The detection algorithm:

1. Performs random write tests at each alignment
2. Measures IOPS for each alignment with detailed logging
3. Finds the smallest alignment where performance plateaus (when the next
   larger alignment doesn't improve IOPS by more than 5%)
4. Compares detected value with driver-reported value and warns if they differ
5. Writes the detected value to io_properties.yaml

The reactor can then use this empirically-determined value to override
what the disk reports, preventing write amplification due to hardware-level
read-modify-write cycles.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
Enhance block device initialization to properly differentiate between
memory buffer alignment and disk I/O alignment requirements.

Memory alignment (via statx):
Query DIO memory alignment using statx() with STATX_DIOALIGN flag
(kernel 6.1+). This returns the actual DMA buffer address alignment
requirement from the device's queue->limits.dma_alignment.

- For NVMe: typically 4 bytes (much less restrictive than block size)
- Fallback: physical_block_size if statx unavailable or unsupported
- Uses syscall(__NR_statx, ...) to avoid naming conflicts

Disk I/O alignment:
Query both logical and physical block sizes via ioctl to optimize
for both performance and space efficiency:

- Read alignment: logical_block_size (typically 512 bytes)
  * No performance penalty for reading 512-byte sectors from 4K disks
  * Allows fine-grained reads without forced alignment overhead

- Write alignment: physical_block_size (typically 512 or 4096 bytes)
  * Avoids read-modify-write at the hardware level
  * For Advanced Format disks with 4K physical sectors
  * Prevents space amplification from over-alignment

Kernel verification:
The Linux kernel only enforces logical_block_size alignment for
O_DIRECT operations (see block/fops.c:blkdev_dio_invalid()):

    return (iocb->ki_pos | iov_iter_count(iter)) &
            (bdev_logical_block_size(bdev) - 1);

This confirms physical_block_size is not kernel-enforced but is an
optimization hint to avoid hardware-level RMW.

Testing on NVMe devices confirms:
- stx_dio_mem_align: 4 bytes (vs. previous 512 bytes)
- stx_dio_offset_align: 512 bytes
- physical_block_size: 512 bytes

This provides optimal alignment requirements:
- Minimal memory allocation constraints (4-byte alignment)
- Correct I/O offset alignment (512 bytes)
- Optimal write performance (matches physical sector size)

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
…operties.yaml

Some block devices report incorrect physical block sizes through the
kernel's BLKPBSZGET ioctl. This can lead to suboptimal performance due
to unnecessary read-modify-write operations at the hardware level.

This commit implements support for overriding physical_block_size via
io_properties.yaml configuration. The flow is:

1. iotune detects the actual physical block size and writes it to
   io_properties.yaml
2. At startup, Seastar parses the configuration into disk_params
3. During reactor initialization, _physical_block_size_overrides map
   is populated from disk_config
4. When opening block device files, make_file_impl() checks for
   overrides and uses them instead of kernel-reported values

This allows users to specify the correct physical block size for
devices that misreport their capabilities, ensuring optimal I/O
alignment.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
Implement empirical detection of physical block size by testing write
performance at different alignments (512, 1K, 2K, 4K, 8K bytes).

This addresses the issue raised in PR scylladb#3046 where some disks lie about
their physical_block_size via sysfs. The detection algorithm:

1. Performs random write tests at each alignment
2. Measures IOPS for each alignment with detailed logging
3. Finds the smallest alignment where performance plateaus (when the next
   larger alignment doesn't improve IOPS by more than 5%)
4. Compares detected value with driver-reported value and warns if they differ
5. Writes the detected value to io_properties.yaml

The reactor can then use this empirically-determined value to override
what the disk reports, preventing write amplification due to hardware-level
read-modify-write cycles.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>