[CI-NO-BUILD] [vioscsi] Increase PHYS_SEGMENTS_LIMIT value #1316

Draft: wants to merge 1 commit into master

Conversation

benyamin-codez
Contributor

Changes PHYS_SEGMENTS_LIMIT to 1,022, equal to QEMU's VIRTQUEUE_MAX_SIZE less 2 (for the CONTROL and EVENT queues).

This results in:

  1. adaptExt->max_segments = 1022
  2. ConfigInfo->NumberOfPhysicalBreaks = 1023
  3. ConfigInfo->MaximumTransferLength = 4,088KiB (1,022 * PAGE_SIZE = 4,088KiB)


Signed-off-by: benyamin-codez <115509179+benyamin-codez@users.noreply.github.com>
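To make the arithmetic above easy to check, here is a minimal, self-contained sketch (standalone C, not driver code; the names simply mirror the PR text and the StorPort fields it mentions):

```c
/* Sketch only: how the numbers in this PR relate to each other.
 * PAGE_SIZE is 4KiB on x86/x64 Windows; the names mirror the PR text. */
#include <stdio.h>

#define VIRTQUEUE_MAX_SIZE  1024u                      /* QEMU's maximum ring size          */
#define PHYS_SEGMENTS_LIMIT (VIRTQUEUE_MAX_SIZE - 2u)  /* less the CONTROL and EVENT queues */

int main(void)
{
    const unsigned page_size        = 4096u;                     /* PAGE_SIZE                                 */
    const unsigned max_segments     = PHYS_SEGMENTS_LIMIT;       /* adaptExt->max_segments = 1022             */
    const unsigned physical_breaks  = max_segments + 1u;         /* ConfigInfo->NumberOfPhysicalBreaks = 1023 */
    const unsigned max_transfer_len = max_segments * page_size;  /* ConfigInfo->MaximumTransferLength         */

    printf("max_segments=%u  NumberOfPhysicalBreaks=%u  MaximumTransferLength=%u KiB\n",
           max_segments, physical_breaks, max_transfer_len / 1024u);   /* 1022, 1023, 4088 KiB */
    return 0;
}
```
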
@benyamin-codez
Contributor Author

benyamin-codez commented Feb 26, 2025

I wouldn't much care to attempt merging this until after the following PRs merge:

#1228
#1293 / #1294
#1296
#1214
#1215
#1216
#1306
#1313
#1309

Here is a pic showing the VQ Buffer Length and SRB DataTransferLength at the tail end of a 4MiB sequential read:

[Image: vioscsi_4088K_MaxXferLen_tailend]

You will observe that the transfer is divided into 4,088KiB and 8KiB parts (4,088KiB + 8KiB = 4,096KiB = 4MiB).
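For clarity, the split seen in the trace is just the 4MiB request being carved up against the 4,088KiB MaximumTransferLength. A throwaway sketch of that arithmetic (the chunking itself is done by the storage stack, not by anything in this PR):

```c
/* Sketch: why a 4MiB sequential read shows up as 4,088KiB + 8KiB pieces
 * when MaximumTransferLength is 4,088KiB. */
#include <stdio.h>

int main(void)
{
    const unsigned request_kib      = 4096u;  /* 4MiB read                   */
    const unsigned max_xfer_len_kib = 4088u;  /* 1,022 segments * 4KiB pages */

    unsigned remaining = request_kib;
    while (remaining != 0u) {
        unsigned chunk = remaining < max_xfer_len_kib ? remaining : max_xfer_len_kib;
        printf("chunk: %u KiB\n", chunk);     /* prints 4088, then 8 */
        remaining -= chunk;
    }
    return 0;
}
```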

@benyamin-codez
Contributor Author

> You will observe the transfer is divided into 4,088KiB and 8KiB (4,096KiB = 4MiB) parts.

Here we can see how viostor will have an advantage, with 1,024 segments and an aligned MaxXferLen of 4MiB, per PR #1315.

@benyamin-codez
Contributor Author

benyamin-codez commented Mar 4, 2025

@vrozenfe

From PR #1216... I thought it best to continue this part here.

> ...proper support for adaptExt->scsi_config.seg_max which IMHO is a proper way to limit the number of SG elements for DMA.

The fix really needs to come via QEMU. We need it to report the backing's max_segments queue value, which would address breaking cases such as the one in RHEL-56722. This would then be safe, if not ideal. It would then be up to sys engineers or admins during implementation / commissioning to tune the value if necessary or desired.

There are no doubt many things that QEMU devs would need to consider when coming up with a fix. One consideration needs to be the non-pass-through case, where multiple backings have different max_segments queue values. I would suggest the HBA would need to be set to the lowest of those values.
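Purely to illustrate the "lowest value wins" suggestion, a hypothetical host-side sketch; the helper name and the way per-backing limits arrive are made up, and a real fix would of course live in QEMU's own virtio-scsi code:

```c
#include <stddef.h>

/* Hypothetical illustration only: with several backings behind one virtio-scsi
 * HBA, advertise the smallest per-backing max_segments as seg_max, so that no
 * request exceeds what the most constrained backing can handle. */
static unsigned pick_hba_seg_max(const unsigned *backing_max_segments, size_t count,
                                 unsigned virtqueue_max_size)
{
    unsigned seg_max = virtqueue_max_size - 2u;   /* upper bound, per this PR's premise */

    for (size_t i = 0; i < count; ++i) {
        if (backing_max_segments[i] < seg_max) {
            seg_max = backing_max_segments[i];    /* lowest backing value wins */
        }
    }
    return seg_max;
}
```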

Vadim, I think it's worthwhile raising this as a bug / feature request at QEMU.

I could have a poke around and maybe come up with a patch, but I think perhaps the QEMU storage team would likely come up with a more timely, considered and permanent solution.

This, I imagine, would be something more elegant than parsing the output of a simple grep "" /sys/block/*/queue/max_*, mapping it to the right backing, making the necessary calculations and then setting the scsi_config.seg_max value.

Being aware of the VIRTQUEUE_MAX_SIZE and other limits (and maybe prepared to reconsider them), they may also choose to optimise the recommended value based on other considerations unknown to me.

There's also the pesky - 2, which interferes with viostor (cf. PR #1315), which maybe they can trim for virtio-blk...

In any case, *nix hosts (and maybe *bsd hosts too) presently have a clear advantage because they do not suffer from this limitation. This is not a limitation imposed by StorPort either, but rather by this project. It would be good to get this resolved and Windows hosts back on an "even" footing with the other platforms.

EDIT: by *nix, beastie and Windows hosts, I meant VMs...

@benyamin-codez
Contributor Author

@vrozenfe

From PR #1289... I thought it best to continue this part here (even though that one is viostor).

As I thought: 1,024 for viostor (for 4MiB xfers) and 1,022 for vioscsi (for 4,088KiB xfers), per PRs #1315 and #1316 respectively.

Perhaps in the meantime (while you work on IOMMU and DMA v3), those PRs might work well...?
I should also ask if there is another reason it shouldn't be done this way, i.e. the way I have it in this PR?

@JonKohler
Contributor

But that assumes that, let's say, 4MB transfers are even ideal, yeah? Depending on the backing media/storage system, they may want to chunk that down anyhow.

In our case, a little inside baseball, our storage system presents a recommended max transfer size of 1MB, and we align segments and such based on that. Furthermore, our vhost-user-scsi aligns preallocations around that, so trying to do something like this *without* the indirect descriptor "advertisements" virtio-spec enhancement that I called out in another review would potentially be a breaking change for non-QEMU-managed-datapath folks (us, and I'm sure others).

We get around that by wrapping our max breaks in our installer such that it just does the right thing for our platform (for now), but just an FYI on how others in the community use this code.

@benyamin-codez
Contributor Author

benyamin-codez commented Mar 5, 2025

Thanks, Jon. I appreciate the insights.

I think it's worth pointing out that this is just the maximum possible value and it does not need to be set this high. Whatever max_segments is set to will determine the MaxXferLen. To make the point: it could be set to just 16 segments and 64KiB transfers, like way back at the beginning of StorPort. Also, this maximum will not be realised if no block request is made for the requisite amount.

My proposed NoPB-coincident changes will - in the absence of the registry override - default to using whatever is in seg_max (when in indirect mode). A comparison is made with a max_segments value derived from max_sectors, and then whichever is the lesser is chosen. Without any QEMU changes, this is likely to default to 254 segments. If I'm not mistaken, the vioscsi default is presently 512 (0x200) segments with a MaxXferLen of 2MiB. My understanding is that we start to get problems beyond this. I presume this is mostly - if not entirely - for folks using non-qemu-managed-datapaths.
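A rough sketch of that selection logic (plain C, with approximate names; the actual code is in the NoPB PRs referenced above):

```c
/* Sketch of the "lesser of the two" selection described above: seg_max as
 * reported by the device versus a segment count derived from max_sectors.
 * The names only approximate the driver's own code. */
static unsigned select_max_segments(unsigned seg_max,           /* scsi_config.seg_max     */
                                    unsigned max_sectors,       /* scsi_config.max_sectors */
                                    unsigned registry_override) /* 0 if not set            */
{
    const unsigned sector_size = 512u;
    const unsigned page_size   = 4096u;

    if (registry_override != 0u) {
        return registry_override;                               /* explicit override wins  */
    }

    /* Segments needed to cover max_sectors worth of data, one page per segment. */
    unsigned from_max_sectors = (max_sectors * sector_size) / page_size;

    return (seg_max < from_max_sectors) ? seg_max : from_max_sectors;
}
```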

Usually the MaxXferLen should match the stripe size of the backing, but this would depend on where the host sits in the storage stack. If you have pass-through LUNs with large stripes, this would be an advantage (provided there were sufficiently fast data paths and cache). As for stripe size, the 1MiB you appear to be using is a common value, but I have seen values ranging from as low as 64KiB up to 8MiB. Notably, if using BusTypeRAID (0x08) and performing striping itself, having the capacity to use large stripes would be of some utility.

Presently, the only way I'm aware of to set MaxXferLen in a StorPort compliant manner is via max_segments * PAGE_SIZE. However, in our case and imho, this is really max_segments * split_ring_buffer_size, which just so happens to be the same size (4KiB). ~~A virtio-scsi PCI BAR exposes 1040 virtqueues of 8KiB each, 1024 of which are usable. We access them as a split ring, so 4KiB per transfer.~~ The virtio-scsi impl. uses 2 queues for CONTROL and EVENT, leaving 1022 queues for requests, whereas virtio-blk can use all 1024.

EDIT: The struck-through part is incorrect. I was obtaining the VQ info incorrectly. Please see the subsequent posts.

As for alternate strategies, I'm a little in the dark (along with most I would imagine) until Vadim releases his work on IOMMU and v3 DMA because AFAIK the DDI for these aren't open. I guess, Vadim's work might make a great deal of sense when we get to see it.

Then we have the virtio-spec issue you previously mentioned (oasis-tcs/virtio-spec#122). It was an interesting idea and to me the real value was in the proposed negotiation between device and host as to what the maximum supported values would be. This is much better than just setting it "on paper". I would probably need to thoroughly read the mailing list to get a better understanding of the nuances and fringe cases. Having said that, aren't we effectively already doing some of this without the extra tunables and flags or updates to the standard..?

We already regularly exceed the Queue Size whenever max_segments is greater than 126 or 254, e.g. the vioscsi defaults I mentioned above. These values appear to be based on a fallacy that a queue must be tied exclusively to a (v)CPU. So whilst the value was bumped up from 128 to 256 (in SeaBIOS) to reduce block requests for larger transfers in Linux (towards the 1MiB boundary), AFAIK there does not appear to be a rational basis for this limit - at least not anymore. Windows never had the kernel limit that Linux had up until about 7 years ago [BUG_ON(total_sg > vq->vring.num)] and the SeaBIOS change was about 5 years back.
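For what it's worth, the 126 and 254 figures line up with "queue size less 2". A tiny sketch under that assumption (my reading, not something stated here):

```c
#include <stdio.h>

/* Assumption for illustration: the 126/254 figures above are simply
 * (queue size - 2) for queue sizes of 128 and 256. */
int main(void)
{
    const unsigned queue_sizes[] = { 128u, 256u, 1024u };

    for (unsigned i = 0; i < 3u; ++i) {
        printf("queue size %4u -> max_segments %4u\n",
               queue_sizes[i], queue_sizes[i] - 2u);   /* 126, 254, 1022 */
    }
    return 0;
}
```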

Anyway, I'm probably not saying anything you don't already know... 8^d

I'm currently considering wording for the QEMU bug report and collating some info for that.
I'll be sure to copy you in so that you can add anything relevant to the report wrt your kit.

However, again I think it important to stress that having a MaxXferLen in vioscsi that you maybe cannot utilise doesn't mean that it will stop your kit from working. You just need to ensure you set max_segments to 256 - and perhaps this is what you do with your installer already..?

I'm caught up with some competing priorities over the next few days, so let me have a think about this along the way...
Having some time away is also usually helpful... 8^D

@benyamin-codez
Contributor Author

benyamin-codez commented Mar 9, 2025

> A virtio-scsi PCI BAR exposes 1040 virtqueues, each with a default Queue Size of 8KiB.

Per the standard, this will result in the following allocation:

| Virtqueue Part   | Formula              | Size                |
|------------------|----------------------|---------------------|
| Descriptor Table | 16 * (Queue Size)    | 131,072 B (128 KiB) |
| Available Ring   | 6 + 2 * (Queue Size) | 16,390 B            |
| Used Ring        | 6 + 8 * (Queue Size) | 65,542 B            |
| Total Size       |                      | 213,004 B           |
| Aligned Value    |                      | 217,088 B (212 KiB) |

Everyone is very kind here - I'm very rarely pulled up for comments made in error.
Better if I am..! 8^D
I can appreciate everyone is busy though.

EDIT: Not only the struck-through part, but all of this is incorrect. I was obtaining the VQ info incorrectly.

@benyamin-codez
Contributor Author

benyamin-codez commented Mar 10, 2025

However, it looks like virtio_query_queue_allocation() returns a Queue Size of 256 during allocation queries...
...which results in the following allocation:

| Virtqueue Part   | Formula              | Size              |
|------------------|----------------------|-------------------|
| Descriptor Table | 16 * (Queue Size)    | 4,096 B (4 KiB)   |
| Available Ring   | 6 + 2 * (Queue Size) | 518 B             |
| Used Ring        | 6 + 8 * (Queue Size) | 2,054 B           |
| Total Size       |                      | 6,668 B           |
| Aligned Value    |                      | 8,192 B (8 KiB)   |
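The same arithmetic as the table, as a checkable sketch. It follows the formulas exactly as given above, i.e. it sums the three parts and then aligns the total to a 4KiB boundary:

```c
#include <stdio.h>

/* Reproduces the table above: split-ring part sizes derived from the Queue
 * Size, summed and then rounded up to a 4KiB boundary. */
static unsigned vring_bytes(unsigned qsz)
{
    unsigned desc  = 16u * qsz;       /* Descriptor Table */
    unsigned avail = 6u + 2u * qsz;   /* Available Ring   */
    unsigned used  = 6u + 8u * qsz;   /* Used Ring        */
    unsigned total = desc + avail + used;

    return (total + 4095u) & ~4095u;  /* align the total up to PAGE_SIZE */
}

int main(void)
{
    printf("Queue Size 128 -> %u B\n", vring_bytes(128u));   /* 4,096 B */
    printf("Queue Size 256 -> %u B\n", vring_bytes(256u));   /* 8,192 B */
    return 0;
}
```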

So I presume my trace is using an incorrect VIRTIO_PCI_COMMON_CFG_CAP struct... Checking...
Anyone feeling dizzy...?

EDIT: I was using the incorrect method to get the VQ information in my InitHW refactor. I was relying on data returned by StorPortGetBusData(), which typically only returns 256 bytes in the buffer. This is unlikely to be enough to hold all the PCI information I was trying to glean. I will instead refactor the affected draft PRs to iterate through the BAR offsets to get the correct data for each capability. Then, after InitVirtIODevice(), I think I will also do the same with the data saved by the VirtIO routines into the VirtIODevice structure at adaptExt->vdev, and then perhaps verify with a comparison between the two...
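Not from the PR, just a hedged sketch of the limitation described in this edit: StorPortGetBusData() with PCIConfiguration only yields configuration space (typically the first 256 bytes). That is enough to walk the capability list and locate the virtio vendor-specific capabilities, but the structures those capabilities describe live inside a BAR and have to be read through the mapped BAR instead:

```c
#include <storport.h>

/* Sketch (not the PR's code): read PCI configuration space via
 * StorPortGetBusData() and walk the capability list.  A virtio vendor-specific
 * capability only *describes* a structure (BAR index, offset, length); the
 * structure itself must then be read from the mapped BAR, which is why a
 * 256-byte config-space buffer alone cannot supply the common_cfg contents. */
static VOID DumpVirtioCaps(PVOID DeviceExtension, ULONG BusNumber, ULONG SlotNumber)
{
    UCHAR cfg[256];
    ULONG got = StorPortGetBusData(DeviceExtension, PCIConfiguration,
                                   BusNumber, SlotNumber, cfg, sizeof(cfg));
    if (got < sizeof(cfg)) {
        return;                                /* config space not fully readable */
    }

    UCHAR capPtr = cfg[0x34];                  /* Capabilities Pointer            */
    while (capPtr != 0 && capPtr < sizeof(cfg) - 16) {
        UCHAR capId   = cfg[capPtr];
        UCHAR capNext = cfg[capPtr + 1];

        if (capId == 0x09) {                   /* vendor-specific (virtio) capability */
            /* Per the virtio spec's struct virtio_pci_cap layout:
             * +3 cfg_type, +4 bar, +8 offset (LE32), +12 length (LE32). */
            UCHAR cfgType = cfg[capPtr + 3];
            UCHAR bar     = cfg[capPtr + 4];
            StorPortDebugPrint(0, "virtio cap: cfg_type=%u bar=%u\n", cfgType, bar);
            /* The actual common/notify/device config lives at (bar, offset)
             * and has to be read through the mapped BAR, not config space. */
        }
        capPtr = capNext;
    }
}
```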
