OpenMathLib
OpenBLAS

Added ability to accumulate in FP16. Convert BF16 to FP32. For FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23)

#5640 (Merged)
Comparing ChipKerchner:RVV_Narrow_Accumulate_FP16_GEMM (efe63e7) with develop (1ef6319)
CodSpeed performance gauge: 0% (62 benchmarks untouched)

Benchmarks

62 total. Every benchmark listed here is defined in benchmark/pybench/benchmarks/bench_blas.py.

Benchmark           Change  Base       Head
test_gemm[100-d]    +2%     479.1 µs   471.1 µs
test_gemm[100-s]    +1%     275.6 µs   273.1 µs
test_syrk[100-d]    0%      340.5 µs   339.5 µs
test_daxpy[1000-d]  0%      32.3 µs    32.2 µs
test_gesdd[mn0-s]   0%      109.3 µs   109.1 µs
test_gemm[1000-s]   0%      117.5 ms   117.4 ms
test_dgemv[1000-s]  0%      7 ms       7 ms
test_gesv[1000-c]   0%      188.6 ms   188.6 ms
test_gemm[1000-c]   0%      426.1 ms   426 ms
test_gesdd[mn1-d]   0%      93.8 ms    93.8 ms
test_syrk[1000-s]   0%      65.4 ms    65.4 ms
test_dgemv[1000-z]  0%      26.3 ms    26.3 ms
test_gesv[1000-d]   0%      93.3 ms    93.3 ms
test_syrk[1000-z]   0%      476.4 ms   476.4 ms
test_gesv[1000-z]   0%      353.6 ms   353.6 ms
test_syev[50-s]     0%      1.3 ms     1.3 ms
test_syrk[1000-c]   0%      227.5 ms   227.5 ms
test_syrk[1000-d]   0%      130.3 ms   130.4 ms
test_daxpy[1000-c]  0%      32.7 µs    32.7 µs
test_gesv[1000-s]   0%      52.6 ms    52.6 ms
test_gesv[100-z]    0%      937 µs     937.1 µs
test_gemm[1000-z]   0%      875.5 ms   875.6 ms
test_syev[200-d]    0%      58.6 ms    58.6 ms
test_dgemv[1000-c]  0%      14.8 ms    14.8 ms
test_dgemv[1000-d]  0%      13.9 ms    13.9 ms
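The percentages in the benchmark list can be roughly reproduced from the two timings of each entry. The sketch below reads the first timing as the base and the second as the head, and assumes the change is computed as base / head - 1. Both points are assumptions that happen to match the rounded figures above; the page does not state the formula. The helper names `parse_time_us` and `relative_change_pct` are hypothetical.

```python
def parse_time_us(text: str) -> float:
    """Parse a timing such as '479.1 µs' or '117.5 ms' into microseconds."""
    value, unit = text.split()
    scale = {"ns": 1e-3, "µs": 1.0, "ms": 1e3, "s": 1e6}[unit]
    return float(value) * scale


def relative_change_pct(base: str, head: str) -> float:
    """Percent speedup of head over base (positive means head is faster).

    Assumption: change = base / head - 1. Not taken from CodSpeed's docs.
    """
    return (parse_time_us(base) / parse_time_us(head) - 1.0) * 100.0


# test_gemm[100-d]: about 1.7, shown as +2% after rounding
print(relative_change_pct("479.1 µs", "471.1 µs"))
```

Under this reading, a sub-0.5% difference such as test_syrk[100-d] (340.5 µs against 339.5 µs) rounds to the 0% shown.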

Commits

Base: develop (1ef6319)
-0.18%  b5f2a50  Added ability to accumulate in FP16 for GEMM. Widens once at the end of loops. (ChipKerchner, 1 month ago)
+0.19%  aa1cebd  128-bit versions. (ChipKerchner, 1 month ago)
-0.19%  74d9fe2  Forgot to add definition. (ChipKerchner, 1 month ago)
+0.19%  e3cb067  Fixed MADD to use float16 values. Use LMUL = 2 in main loop. Now 1.85X faster on BananaPi. (ChipKerchner, 1 month ago)
0%      3356043  Convert inputs from BF16 to FP32 and use FP32 vector madds. 18% faster. (ChipKerchner, 1 month ago)
-0.14%  4121a22  Convert BF16 values once (and vectorized). (ChipKerchner, 1 month ago)
0%      9701a80  One small change. (ChipKerchner, 1 month ago)
0%      1cc377e  Only convert B if M is greater than or equal to 4. (ChipKerchner, 1 month ago)
-0.05%  7a1d234  Add flag for not converting A & B - will be used in future to do conversion during packing. (ChipKerchner, 1 month ago)
+0.05%  1d6aa0d  Add dummy memsets - just in case. (ChipKerchner, 1 month ago)
0%      efe63e7  Add pre-RVA23 to BF16 GEMM. (ChipKerchner, 29 days ago)
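The commit messages mention three numeric ideas: accumulating in FP16 and widening once after the loop (b5f2a50), widening BF16 inputs to FP32 before FP32 multiply-adds (3356043), and converting each BF16 value only once (4121a22). The following is a minimal scalar sketch of those ideas in plain Python, not the PR's RISC-V vector kernels. All function names are hypothetical, the FP16 and FP32 rounding is emulated with the standard `struct` module, and the list comprehensions merely stand in for a vectorized conversion pass.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 binary16 (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 binary32 (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bf16_bits_to_fp32(bits: int) -> float:
    """Widen a BF16 bit pattern: BF16 is the upper 16 bits of an FP32 word."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def dot_fp16(a, b, narrow_accumulate: bool) -> float:
    """Dot product of FP16 inputs.

    narrow_accumulate=True keeps the running sum in FP16 and widens it once
    after the loop; False rounds the running sum to FP32 at every step.
    """
    round_acc = to_fp16 if narrow_accumulate else to_fp32
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return to_fp32(acc)  # the single widening step (exact for FP16 values)


def gemm_bf16(a_bits, b_bits, m: int, n: int, k: int):
    """C = A @ B for row-major BF16 bit patterns, widening every input once."""
    a = [bf16_bits_to_fp32(v) for v in a_bits]  # M*K conversions, up front
    b = [bf16_bits_to_fp32(v) for v in b_bits]  # K*N conversions, up front
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # FP32 multiply-add on already-widened values
                acc = to_fp32(acc + a[i * k + p] * b[p * n + j])
            c[i * n + j] = acc
    return c


ones = [1.0] * 4096
print(dot_fp16(ones, ones, narrow_accumulate=True))   # 2048.0
print(dot_fp16(ones, ones, narrow_accumulate=False))  # 4096.0

# A = [[1, 2], [3, 4]] and B = identity, written as BF16 bit patterns
a_bits = [0x3F80, 0x4000, 0x4040, 0x4080]
b_bits = [0x3F80, 0x0000, 0x0000, 0x3F80]
print(gemm_bf16(a_bits, b_bits, 2, 2, 2))  # [1.0, 2.0, 3.0, 4.0]
```

The first pair of results shows the trade-off that comes with narrow accumulation: an FP16 running sum stops growing at 2048 because 2049 is not representable in FP16, while the FP32 sum stays exact. In the GEMM sketch, widening up front costs M*K + K*N conversions, whereas widening inside the inner loop would repeat a conversion for every multiply-add.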