Accelerate SVE128 SBGEMM/BGEMM
This accelerates SBGEMM/BGEMM by extending the existing 8x4 kernel to 8x8 (unrolling N by 8).
Open question: I'm not sure whether the previous 8x4 kernel should be deleted or kept.
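To illustrate why a wider N tile can help, here is a minimal scalar sketch in C. It is not the SVE kernel from this change: `tile_update`, `sweep_columns`, the packing layout, and the use of `float` instead of bfloat16 are all assumptions made for illustration. It shows that with an 8-wide tile, each value loaded from the packed A panel is reused across 8 columns instead of 4, so the A panel is streamed half as many times for the same N.

```c
#include <stddef.h>

enum { MR = 8, NR_MAX = 8 };  /* hypothetical tile limits for this sketch */

/* C[MR x nr] += A[MR x k] * B[k x nr] for one register tile.
 * a: packed, MR values per k step.  b: packed, nr values per k step.
 * c: column-major with leading dimension ldc.  Requires nr <= NR_MAX. */
static void tile_update(size_t nr, size_t k, const float *a, const float *b,
                        float *c, size_t ldc)
{
    float acc[MR][NR_MAX] = {{0}};
    for (size_t p = 0; p < k; ++p) {
        for (size_t i = 0; i < MR; ++i) {
            float ai = a[p * MR + i];       /* one load of A ... */
            for (size_t j = 0; j < nr; ++j) /* ... reused for nr columns */
                acc[i][j] += ai * b[p * nr + j];
        }
    }
    for (size_t j = 0; j < nr; ++j)
        for (size_t i = 0; i < MR; ++i)
            c[j * ldc + i] += acc[i][j];
}

/* Cover n columns (n a multiple of nr) with tiles of width nr.
 * Returns how many times the packed A panel was streamed. */
static size_t sweep_columns(size_t nr, size_t n, size_t k, const float *a,
                            const float *b, float *c, size_t ldc)
{
    size_t passes = 0;
    for (size_t j0 = 0; j0 < n; j0 += nr) {
        /* each packed B tile holds nr * k values, so tile j0/nr starts at j0 * k */
        tile_update(nr, k, a, b + j0 * k, c + j0 * ldc, ldc);
        ++passes;
    }
    return passes;
}
```

For N=8, `sweep_columns(4, ...)` streams the A panel twice while `sweep_columns(8, ...)` streams it once. The trade-off is that the wider tile needs twice as many accumulators.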
Speedups on a single Neoverse-V2 core (SVE128), compared to the previous state:
Per-shape speedup:
M=N=K=64: SBGEMM 1.164x (16.42%), BGEMM 1.133x (13.30%)
M=N=K=128: SBGEMM 1.220x (22.02%), BGEMM 1.186x (18.56%)
M=N=K=256: SBGEMM 1.241x (24.08%), BGEMM 1.235x (23.54%)
M=N=K=512: SBGEMM 1.240x (23.95%), BGEMM 1.227x (22.75%)
M=N=K=1024: SBGEMM 1.251x (25.11%), BGEMM 1.232x (23.23%)
M=N=K=2048: SBGEMM 1.235x (23.47%), BGEMM 1.246x (24.64%)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>