8351623: VectorAPI: Add SVE implementation of subword gather load operation #26236

Open
wants to merge 1 commit into
base: master

@XiaohongGong XiaohongGong commented Jul 10, 2025

This is a follow-up to [1] that implements the subword gather load APIs for the AArch64 SVE platform.

Background

Vector gather load APIs load values from memory addresses calculated by adding integer indices to a base pointer. SVE provides native gather load instructions for byte/short types that use int vectors for the indices [2][3]. The vector size of a gather-load instruction is therefore determined by the index vector (i.e. by int elements): the total size is 32 * elem_num bits, where elem_num is the number of loaded elements in the vector register.
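To make the semantics above concrete, here is a minimal scalar sketch in plain Java. It is only a model of what a gather load computes, not the code in this patch; the class and method names (`GatherLoadModel`, `gather`, `indexVectorBits`) are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical scalar model of a gather load; not part of the patch.
final class GatherLoadModel {
    // Each result lane i is loaded from arr[offset + idx[i]].
    static byte[] gather(byte[] arr, int offset, int[] idx) {
        byte[] result = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) {
            result[i] = arr[offset + idx[i]];
        }
        return result;
    }

    // Size in bits of the int index vector needed for elemNum loaded elements.
    static int indexVectorBits(int elemNum) {
        return Integer.SIZE * elemNum;
    }

    public static void main(String[] args) {
        byte[] arr = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] idx = {7, 0, 3, 3};
        System.out.println(Arrays.toString(gather(arr, 0, idx)));
        System.out.println(indexVectorBits(16) + " bits of indices for 16 byte lanes");
    }
}
```

The point of `indexVectorBits` is that the register pressure comes from the indices, not from the loaded bytes: 16 byte lanes only produce 128 bits of data but need 512 bits of int indices.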

Implementation

Challenges

Due to size differences between int indices (32-bit) and byte/short data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.

On a 512-bit SVE machine, loading a byte vector requires a different approach for each vector species:

  • SPECIES_64: Single operation with mask (8 elements, 256-bit)
  • SPECIES_128: Single operation, full register (16 elements, 512-bit)
  • SPECIES_256: Two operations + merge (32 elements, 1024-bit)
  • SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
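The operation counts in the list above follow from dividing the index vector size by the register size. The sketch below reproduces that arithmetic in plain Java; `GatherSplitCount` and its methods are hypothetical names used only for this illustration, and the rounding rule is an assumption inferred from the four cases listed.

```java
// Hypothetical helper; reproduces the arithmetic behind the species list.
final class GatherSplitCount {
    // Number of gather-load operations needed when every loaded element
    // needs one 32-bit index and a register holds sveBits bits.
    static int gatherOps(int elemNum, int sveBits) {
        int indexBits = elemNum * Integer.SIZE;
        return Math.max(1, (indexBits + sveBits - 1) / sveBits); // round up
    }

    // True when the indices fill only part of one register, so the single
    // operation has to be limited to the active lanes with a mask.
    static boolean needsPartialMask(int elemNum, int sveBits) {
        return elemNum * Integer.SIZE < sveBits;
    }

    public static void main(String[] args) {
        int sveBits = 512;
        for (int elemNum : new int[] {8, 16, 32, 64}) {
            System.out.println(elemNum + " byte lanes: "
                    + gatherOps(elemNum, sveBits) + " op(s), partial mask = "
                    + needsPartialMask(elemNum, sveBits));
        }
    }
}
```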

Use ByteVector.SPECIES_512 as an example:

  • It contains 64 elements, so the index vector size is 64 * 32 bits, which is four times the SVE vector register size.
  • It therefore requires four vector gather-loads to finish the whole operation.
byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
int[] idx = [0, 1, 2, 3, ..., 63, ...]

4 gather-load:
idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]
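The split-and-merge shown above can be modelled in scalar Java as follows. This is a sketch under assumptions, not the generated code: `SplitGatherModel` is a hypothetical name, each chunk stands for one register-sized gather-load, and the merge is modelled as a plain copy into the right lane range.

```java
import java.util.Arrays;

// Hypothetical scalar model of splitting one gather into several operations.
final class SplitGatherModel {
    // Gathers idx.length bytes in chunks of lanesPerOp lanes and merges the
    // partial results so that chunk k lands in lanes [k*lanesPerOp, ...).
    static byte[] splitGather(byte[] arr, int[] idx, int lanesPerOp) {
        byte[] merged = new byte[idx.length];
        for (int start = 0; start < idx.length; start += lanesPerOp) {
            int count = Math.min(lanesPerOp, idx.length - start);
            byte[] part = new byte[count];          // one gather_vN
            for (int i = 0; i < count; i++) {
                part[i] = arr[idx[start + i]];
            }
            System.arraycopy(part, 0, merged, start, count); // merge step
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] arr = new byte[64];
        int[] idx = new int[64];
        for (int i = 0; i < 64; i++) {
            arr[i] = (byte) ('a' + i / 16);  // 16 x a, 16 x b, 16 x c, 16 x d
            idx[i] = i;
        }
        // 16 int indices fit in one 512-bit register, so four chunks are used.
        System.out.println(new String(splitGather(arr, idx, 16)));
    }
}
```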

Solution

The implementation simplifies backend complexity by defining each gather load IR to handle one vector gather-load operation, with multiple IRs generated in the compiler mid-end.

Here are the main changes:

  • Enhanced IR generation with architecture-specific patterns based on gather_scatter_needs_vector_index() matcher.
  • Added VectorSliceNode for result merging.
  • Added VectorMaskWidenNode for mask splitting and type conversion for masked gather-load.
  • Implemented SVE match rules for subword gather operations.
  • Added comprehensive IR tests for verification.
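As a rough illustration of the mask splitting mentioned above, the following scalar sketch splits a 64-lane byte mask into four 16-lane masks, one per gather-load. It is one plausible reading of the description, not the actual node semantics: the names `MaskWidenModel`, `widenLo`, and `widenHi` are hypothetical, and the assumption is that each widen step keeps the low or high half of the lanes while doubling the element width (byte to short, then short to int).

```java
import java.util.Arrays;

// Hypothetical scalar model of splitting a subword mask; not the patch code.
final class MaskWidenModel {
    // Keeps the low half of the lanes (each lane then covers a wider element).
    static boolean[] widenLo(boolean[] mask) {
        return Arrays.copyOfRange(mask, 0, mask.length / 2);
    }

    // Keeps the high half of the lanes.
    static boolean[] widenHi(boolean[] mask) {
        return Arrays.copyOfRange(mask, mask.length / 2, mask.length);
    }

    // Two widen steps turn one byte mask into four int-lane masks,
    // ordered from the lowest lanes to the highest.
    static boolean[][] splitForByteGather(boolean[] byteMask) {
        boolean[] lo = widenLo(byteMask);
        boolean[] hi = widenHi(byteMask);
        return new boolean[][] {widenLo(lo), widenHi(lo), widenLo(hi), widenHi(hi)};
    }

    public static void main(String[] args) {
        boolean[] mask = new boolean[64];
        for (int i = 0; i < mask.length; i += 3) {
            mask[i] = true;
        }
        boolean[][] parts = splitForByteGather(mask);
        System.out.println(parts.length + " masks of " + parts[0].length + " lanes each");
    }
}
```

In this model, part k covers byte lanes 16*k to 16*k + 15, which lines up with the idx_v1 to idx_v4 chunks in the SPECIES_512 example.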

Testing:

  • Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
  • No regressions found

Performance:

The corresponding JMH benchmarks improve by 3-11x on an NVIDIA Grace CPU, which implements 128-bit SVE2. The performance data follows:

Benchmark                                                 SIZE Mode   Cnt Unit   Before      After   Gain
GatherOperationsBenchmark.microByteGather128              64   thrpt  30  ops/ms 13500.891 46721.307 3.46
GatherOperationsBenchmark.microByteGather128              256  thrpt  30  ops/ms  3378.186 12321.847 3.64
GatherOperationsBenchmark.microByteGather128              1024 thrpt  30  ops/ms   844.871  3144.217 3.72
GatherOperationsBenchmark.microByteGather128              4096 thrpt  30  ops/ms   211.386   783.337 3.70
GatherOperationsBenchmark.microByteGather128_MASK         64   thrpt  30  ops/ms 10605.664 46124.957 4.34
GatherOperationsBenchmark.microByteGather128_MASK         256  thrpt  30  ops/ms  2668.531 12292.350 4.60
GatherOperationsBenchmark.microByteGather128_MASK         1024 thrpt  30  ops/ms   676.218  3074.224 4.54
GatherOperationsBenchmark.microByteGather128_MASK         4096 thrpt  30  ops/ms   169.402   817.227 4.82
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  64   thrpt  30  ops/ms 10615.723 46122.380 4.34
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  256  thrpt  30  ops/ms  2671.931 12222.473 4.57
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  1024 thrpt  30  ops/ms   678.437  3091.970 4.55
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  4096 thrpt  30  ops/ms   170.310   813.967 4.77
GatherOperationsBenchmark.microByteGather128_NZ_OFF       64   thrpt  30  ops/ms 13524.671 47223.082 3.49
GatherOperationsBenchmark.microByteGather128_NZ_OFF       256  thrpt  30  ops/ms  3411.813 12343.308 3.61
GatherOperationsBenchmark.microByteGather128_NZ_OFF       1024 thrpt  30  ops/ms   847.919  3129.065 3.69
GatherOperationsBenchmark.microByteGather128_NZ_OFF       4096 thrpt  30  ops/ms   212.790   787.953 3.70
GatherOperationsBenchmark.microByteGather64               64   thrpt  30  ops/ms  8717.294 48176.937 5.52
GatherOperationsBenchmark.microByteGather64               256  thrpt  30  ops/ms  2184.345 12347.113 5.65
GatherOperationsBenchmark.microByteGather64               1024 thrpt  30  ops/ms   546.093  3070.851 5.62
GatherOperationsBenchmark.microByteGather64               4096 thrpt  30  ops/ms   136.724   767.656 5.61
GatherOperationsBenchmark.microByteGather64_MASK          64   thrpt  30  ops/ms  6576.504 48588.806 7.38
GatherOperationsBenchmark.microByteGather64_MASK          256  thrpt  30  ops/ms  1653.073 12341.291 7.46
GatherOperationsBenchmark.microByteGather64_MASK          1024 thrpt  30  ops/ms   416.590  3070.680 7.37
GatherOperationsBenchmark.microByteGather64_MASK          4096 thrpt  30  ops/ms   105.743   767.790 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   64   thrpt  30  ops/ms  6628.974 48628.463 7.33
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   256  thrpt  30  ops/ms  1676.767 12338.116 7.35
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   1024 thrpt  30  ops/ms   422.612  3070.987 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   4096 thrpt  30  ops/ms   105.033   767.563 7.30
GatherOperationsBenchmark.microByteGather64_NZ_OFF        64   thrpt  30  ops/ms  8754.635 48525.395 5.54
GatherOperationsBenchmark.microByteGather64_NZ_OFF        256  thrpt  30  ops/ms  2182.044 12338.096 5.65
GatherOperationsBenchmark.microByteGather64_NZ_OFF        1024 thrpt  30  ops/ms   547.353  3071.666 5.61
GatherOperationsBenchmark.microByteGather64_NZ_OFF        4096 thrpt  30  ops/ms   137.853   767.745 5.56
GatherOperationsBenchmark.microShortGather128             64   thrpt  30  ops/ms  8713.480 37696.121 4.32
GatherOperationsBenchmark.microShortGather128             256  thrpt  30  ops/ms  2189.636  9479.710 4.32
GatherOperationsBenchmark.microShortGather128             1024 thrpt  30  ops/ms   545.435  2378.492 4.36
GatherOperationsBenchmark.microShortGather128             4096 thrpt  30  ops/ms   136.213   595.504 4.37
GatherOperationsBenchmark.microShortGather128_MASK        64   thrpt  30  ops/ms  6665.844 37765.315 5.66
GatherOperationsBenchmark.microShortGather128_MASK        256  thrpt  30  ops/ms  1673.950  9482.207 5.66
GatherOperationsBenchmark.microShortGather128_MASK        1024 thrpt  30  ops/ms   420.628  2378.813 5.65
GatherOperationsBenchmark.microShortGather128_MASK        4096 thrpt  30  ops/ms   105.128   595.412 5.66
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64   thrpt  30  ops/ms  6699.594 37698.398 5.62
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256  thrpt  30  ops/ms  1682.128  9480.355 5.63
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt  30  ops/ms   421.942  2380.449 5.64
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt  30  ops/ms   106.587   595.560 5.58
GatherOperationsBenchmark.microShortGather128_NZ_OFF      64   thrpt  30  ops/ms  8788.830 37709.493 4.29
GatherOperationsBenchmark.microShortGather128_NZ_OFF      256  thrpt  30  ops/ms  2199.706  9485.769 4.31
GatherOperationsBenchmark.microShortGather128_NZ_OFF      1024 thrpt  30  ops/ms   548.309  2380.494 4.34
GatherOperationsBenchmark.microShortGather128_NZ_OFF      4096 thrpt  30  ops/ms   137.434   595.448 4.33
GatherOperationsBenchmark.microShortGather64              64   thrpt  30  ops/ms  5296.860 37797.813 7.13
GatherOperationsBenchmark.microShortGather64              256  thrpt  30  ops/ms  1321.738  9602.510 7.26
GatherOperationsBenchmark.microShortGather64              1024 thrpt  30  ops/ms   330.520  2404.013 7.27
GatherOperationsBenchmark.microShortGather64              4096 thrpt  30  ops/ms    82.149   602.956 7.33
GatherOperationsBenchmark.microShortGather64_MASK         64   thrpt  30  ops/ms  3458.968 37851.452 10.94
GatherOperationsBenchmark.microShortGather64_MASK         256  thrpt  30  ops/ms   879.143  9616.554 10.93
GatherOperationsBenchmark.microShortGather64_MASK         1024 thrpt  30  ops/ms   220.256  2408.851 10.93
GatherOperationsBenchmark.microShortGather64_MASK         4096 thrpt  30  ops/ms    54.947   603.251 10.97
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  64   thrpt  30  ops/ms  3521.856 37736.119 10.71
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  256  thrpt  30  ops/ms   881.456  9602.649 10.89
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  1024 thrpt  30  ops/ms   220.122  2409.030 10.94
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  4096 thrpt  30  ops/ms    55.845   603.126 10.79
GatherOperationsBenchmark.microShortGather64_NZ_OFF       64   thrpt  30  ops/ms  5279.815 37698.023 7.14
GatherOperationsBenchmark.microShortGather64_NZ_OFF       256  thrpt  30  ops/ms  1307.935  9601.551 7.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF       1024 thrpt  30  ops/ms   329.707  2409.962 7.30
GatherOperationsBenchmark.microShortGather64_NZ_OFF       4096 thrpt  30  ops/ms    82.092   603.380 7.35

[1] https://bugs.openjdk.org/browse/JDK-8355563
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8351623: VectorAPI: Add SVE implementation of subword gather load operation (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236
$ git checkout pull/26236

Update a local copy of the PR:
$ git checkout pull/26236
$ git pull https://git.openjdk.org/jdk.git pull/26236/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26236

View PR using the GUI difftool:
$ git pr show -t 26236

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26236.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 10, 2025

👋 Welcome back xgong! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 10, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 10, 2025
@openjdk

openjdk bot commented Jul 10, 2025

@XiaohongGong The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jul 10, 2025
@mlbridge

mlbridge bot commented Jul 10, 2025

Webrevs

@XiaohongGong
Author

Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on an SVE machine with a larger vector size (e.g. 512-bit)? Thanks a lot in advance!

@Bhavana-Kilambi
Contributor

Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon.

@XiaohongGong
Author

> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon.

Testing on 256-bit SVE machines is fine with me. Thanks so much for your help!

ins_pipe(pipe_slow);
%}

instruct vmaskwiden_hi_sve(pReg dst, pReg src) %{

Can the hi and lo widen rules be combined into a single rule, since their arguments are the same? Or would that make it less understandable?

@@ -348,6 +347,12 @@ source %{
return false;
}

// SVE requires vector indices for gather-load/scatter-store operations
// on all data types.
bool Matcher::gather_scatter_needs_vector_index(BasicType bt) {

@Bhavana-Kilambi Bhavana-Kilambi Jul 14, 2025

There's already a function that tests for UseSVE > 0 here -

static bool supports_scalable_vector() {

Can it be reused?

match(Set dst (VectorSlice (Binary src1 src2) index));
format %{ "vslice_neon $dst, $src1, $src2, $index" %}
ins_encode %{
uint length_in_bytes = Matcher::vector_length_in_bytes(this);

nit: indentation. two spaces..

match(Set dst_src1 (VectorSlice (Binary dst_src1 src2) index));
format %{ "vslice_sve $dst_src1, $dst_src1, $src2, $index" %}
ins_encode %{
assert(UseSVE > 0, "must be sve");

nit: indentation. two spaces..

// ---------------------------- Vector Slice ------------------------

instruct vslice_neon(vReg dst, vReg src1, vReg src2, immI index) %{
predicate(VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n)));

nit: indentation. I think there're 3 spaces here.. Same with the SVE version below.
