Add Arm®v9-A architecture SME SGEMM kernels

Add implementation of SGEMM based on the Arm®v9-A architecture Scalable
Matrix Extension (SME) [1], using the Arm C Language Extensions (ACLE)
[2].
Add SME2 compute & packing kernels for SGEMM and enable them under the
ARMV9SME target.
The compute kernel performs outer products on panels of A and B,
accumulating into 2x2 inner blocks of C via the SME two-dimensional
architectural register, ZA.
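
As a rough illustration of the accumulation pattern described above, the following is a portable scalar model, not the SME2 kernel itself. It assumes that "2x2" refers to a 2x2 arrangement of ZA tiles, each `vl` x `vl` FP32 elements, that C is column-major, and that the packed panels store `2 * vl` contiguous floats per k-step. The names `outer_product_acc` and `sgemm_block_2x2_ref` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of one rank-1 (outer-product) update on a single tile:
 * tile[i][j] += a[i] * b[j], with the tile stored column-major. */
static void outer_product_acc(float *tile, size_t ld, const float *a,
                              const float *b, size_t vl)
{
    for (size_t j = 0; j < vl; j++)
        for (size_t i = 0; i < vl; i++)
            tile[i + j * ld] += a[i] * b[j];
}

/* Hypothetical reference micro-kernel: accumulates a (2*vl) x (2*vl)
 * block of C from k outer products. Each k-step reads 2*vl floats from
 * the packed A panel and 2*vl floats from the packed B panel, and
 * updates the four vl x vl sub-blocks (the assumed 2x2 tile layout). */
static void sgemm_block_2x2_ref(size_t k, size_t vl, const float *a_panel,
                                const float *b_panel, float *c, size_t ldc)
{
    assert(vl > 0 && ldc >= 2 * vl);
    for (size_t p = 0; p < k; p++) {
        const float *a = a_panel + p * 2 * vl;
        const float *b = b_panel + p * 2 * vl;
        for (size_t tj = 0; tj < 2; tj++)
            for (size_t ti = 0; ti < 2; ti++)
                outer_product_acc(c + ti * vl + tj * vl * ldc, ldc,
                                  a + ti * vl, b + tj * vl, vl);
    }
}
```

In this model each vector loaded from A and from B is reused for two tile updates, which is the usual motivation for a multi-tile inner block.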
The non-transpose packing kernel performs a copy into a contiguous
buffer using SVE loads & stores in Streaming SVE mode. Streaming SVE is
an execution mode introduced by SME that supports execution of SVE code
with the SME defined vector length, known as the Streaming SVE vector
length (SVL).
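
The copy performed by the non-transpose packing kernel can be sketched in scalar C as follows. This is a model under stated assumptions, not the kernel's code: A is taken to be column-major, `vl` stands in for the number of FP32 lanes in one streaming vector, and the inner `active` count models a predicated load/store on the tail. The name `pack_panel_n_ref` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model: copy an mr x k panel of column-major A
 * (leading dimension lda) into a contiguous buffer, mr floats per
 * k-step. Rows are processed in chunks of vl lanes. */
static void pack_panel_n_ref(size_t mr, size_t k, size_t vl,
                             const float *a, size_t lda, float *packed)
{
    assert(vl > 0 && lda >= mr);
    for (size_t p = 0; p < k; p++) {
        for (size_t i = 0; i < mr; i += vl) {
            /* Lanes active in this chunk; fewer than vl on the tail. */
            size_t active = (mr - i < vl) ? mr - i : vl;
            for (size_t lane = 0; lane < active; lane++)
                packed[p * mr + i + lane] = a[p * lda + i + lane];
        }
    }
}
```

Because the chunk width is a run-time value here, the same loop structure works for any vector length, which mirrors why vector-length-agnostic code is the natural fit for Streaming SVE.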
The transpose packing kernel performs on-the-fly transposition by
utilizing horizontal & vertical tile slice access to the SME ZA
register.
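
The idea behind transposing through ZA can be modelled in scalar C: write each source row into the tile as a horizontal slice, then read the tile back as vertical slices. The sketch below assumes a square `vl` x `vl` block and uses a caller-provided scratch array in place of the ZA tile; `transpose_block_via_tile` is a hypothetical name, and the real kernel's slice order and layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of on-the-fly transposition through a tile.
 * src and dst are row-major with leading dimensions lds and ldd;
 * tile is scratch storage of vl * vl floats standing in for ZA. */
static void transpose_block_via_tile(size_t vl, const float *src, size_t lds,
                                     float *dst, size_t ldd, float *tile)
{
    assert(vl > 0 && lds >= vl && ldd >= vl);

    /* Write phase: source row r becomes horizontal slice r of the tile. */
    for (size_t r = 0; r < vl; r++)
        for (size_t c = 0; c < vl; c++)
            tile[r * vl + c] = src[r * lds + c];

    /* Read phase: vertical slice c of the tile becomes destination row c. */
    for (size_t c = 0; c < vl; c++)
        for (size_t r = 0; r < vl; r++)
            dst[c * ldd + r] = tile[r * vl + c];
}
```

In hardware both phases move whole vectors, so the transposition costs no extra element-wise shuffles; the scalar loops above only show which element ends up where.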
Includes an update to the driver to account for the expanded inner block.
Note: this places the ARMV9SME target in a WIP state. It is functional
for SGEMM, and all GEMM tests pass. Other BLAS3 routines have not been
updated to match the larger kernel size, so SYMM/TRMM tests are
currently expected to fail.
[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
[2] https://arm-software.github.io/acle/main/acle.html