
cross3, dot3, scale, bias benchmark (AOS) - scalar always faster than zmath on M1 #5

@dmurph

I'm consistently seeing the scalar version run faster than the zmath version on an M1 Mac when building with -Doptimize=ReleaseFast.

Example: cross3, dot3, scale, bias benchmark (AOS) - scalar version: 0.9780s, zmath version: 1.0045s

I noticed that calls to the 'swizzle' function generate extra CPU instructions. See the dot4Old function in this godbolt, and try switching between the commented-out line and the one next to it.

Changing cross3 to use @shuffle directly seems to help the benchmark:

pub inline fn cross3(v0: Vec, v1: Vec) Vec {
    // Reorder lanes to (y, z, x, w) and (z, x, y, w); w passes through unchanged.
    var xmm0 = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 });
    var xmm1 = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 });
    var result = xmm0 * xmm1;
    // Rotate once more to get (z, x, y, w) and (y, z, x, w) for the subtracted term.
    xmm0 = @shuffle(f32, xmm0, undefined, [4]i32{ 1, 2, 0, 3 });
    xmm1 = @shuffle(f32, xmm1, undefined, [4]i32{ 2, 0, 1, 3 });
    result = result - xmm0 * xmm1;
    // Zero the w lane.
    return andInt(result, f32x4_mask3);
}

I recommend making this change everywhere. dot2 also looks weird... there are a lot of potential perf improvements in the zmath area.
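
To sanity-check the @shuffle approach outside of zmath, here is a minimal standalone sketch that compares it against a plain scalar cross product. It is an illustration under a few assumptions, not zmath's actual code: Vec is declared locally as @Vector(4, f32), the names cross3Scalar and cross3Shuffle are made up for this example, the w lane is zeroed by direct assignment instead of andInt with f32x4_mask3, and it assumes a recent Zig version in which vector lanes can be assigned by index.

```zig
const std = @import("std");

// Local stand-in for zmath's Vec type (assumption: 4 x f32).
const Vec = @Vector(4, f32);

// Scalar reference: component-wise cross product, w forced to 0.
fn cross3Scalar(a: Vec, b: Vec) Vec {
    return Vec{
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
        0.0,
    };
}

// @shuffle-based variant mirroring the version proposed above.
fn cross3Shuffle(v0: Vec, v1: Vec) Vec {
    // (y, z, x, w) and (z, x, y, w)
    var xmm0 = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 });
    var xmm1 = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 });
    var result = xmm0 * xmm1;
    // Rotate again: (z, x, y, w) and (y, z, x, w)
    xmm0 = @shuffle(f32, xmm0, undefined, [4]i32{ 1, 2, 0, 3 });
    xmm1 = @shuffle(f32, xmm1, undefined, [4]i32{ 2, 0, 1, 3 });
    result = result - xmm0 * xmm1;
    // Zero the w lane directly instead of masking with andInt.
    result[3] = 0.0;
    return result;
}

test "shuffle-based cross3 agrees with the scalar reference" {
    const a = Vec{ 1.0, 2.0, 3.0, 9.0 };
    const b = Vec{ 4.0, 5.0, 6.0, 7.0 };
    const got = cross3Shuffle(a, b);
    const want = cross3Scalar(a, b);
    try std.testing.expect(@reduce(.And, got == want));
}
```

Saved to a file, this can be run with zig test. It only checks that the two versions agree on the result; it says nothing about which one is faster, so the benchmark numbers above still need to be measured separately.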
