
relaxed fused multiply-add and fused multiply-subtract #27

@ngzhian


Note: this instruction proposal is migrated from WebAssembly/simd#79

  1. What are the instructions being proposed?

  • relaxed f32x4.fma
  • relaxed f32x4.fms
  • relaxed f64x2.fma
  • relaxed f64x2.fms

  2. What are the semantics of these instructions?

All the instructions take 3 operands, a, b, c, and compute (a * b) + c or -(a * b) + c:

  • relaxed f32x4.fma(a, b, c) = (a * b) + c
  • relaxed f32x4.fms(a, b, c) = -(a * b) + c
  • relaxed f64x2.fma(a, b, c) = (a * b) + c
  • relaxed f64x2.fms(a, b, c) = -(a * b) + c

where:

  • the intermediate a * b is rounded first, and the final result rounded again (for a total of 2 roundings), or
  • the entire expression is evaluated with higher precision and then rounded only once.
  3. How will these instructions be implemented? Give examples for at least
     x86-64 and ARM64. Also provide a reference implementation in terms of
     128-bit Wasm SIMD.

Detailed implementation guidance is available at WebAssembly/simd#79; below is an overview.

x86/x86-64 with FMA3

  • relaxed f32x4.fma = VFMADD213PS
  • relaxed f32x4.fms = VFNMADD213PS
  • relaxed f64x2.fma = VFMADD213PD
  • relaxed f64x2.fms = VFNMADD213PD

ARM64

  • relaxed f32x4.fma = FMLA
  • relaxed f32x4.fms = FMLS
  • relaxed f64x2.fma = FMLA
  • relaxed f64x2.fms = FMLS

ARMv7 with FMA (Neon v2)

  • relaxed f32x4.fma = VFMA
  • relaxed f32x4.fms = VFMS
  • relaxed f64x2.fma = VFMA
  • relaxed f64x2.fms = VFMS

ARMv7 without FMA (2 roundings)

  • relaxed f32x4.fma = VMLA
  • relaxed f32x4.fms = VMLS
  • relaxed f64x2.fma = VMLA
  • relaxed f64x2.fms = VMLS

Note: Armv8-M will require MVE-F (floating point extension)

RISC-V V

  • relaxed f32x4.fma = vfmacc.vv
  • relaxed f32x4.fms = vfnmsac.vv
  • relaxed f64x2.fma = vfmadd.vv
  • relaxed f64x2.fms = vfnmsac.vv

simd128

  • relaxed f32x4.fma(a, b, c) = f32x4.add(f32x4.mul(a, b), c)
  • relaxed f32x4.fms(a, b, c) = f32x4.sub(c, f32x4.mul(a, b))
  • relaxed f64x2.fma(a, b, c) = f64x2.add(f64x2.mul(a, b), c)
  • relaxed f64x2.fms(a, b, c) = f64x2.sub(c, f64x2.mul(a, b))
  4. How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

The difference depends on whether the hardware supports FMA. The dividing line falls between newer and older hardware: newer parts (Intel Haswell from 2013 onwards, AMD Zen from 2017, ARM Cortex-A5 since 2011) tend to ship with hardware FMA support, so hardware without FMA should become increasingly rare.

  5. What use cases are there?

Many, especially machine learning (neural nets). Fused multiply-add improves accuracy in numerical algorithms, improves floating-point throughput, and reduces register pressure in some cases. An early prototype and evaluation also showed significant speedups on multiple neural-network models.
