ChipKerchner:RVV_Narrow_Accumulate_FP16_GEMM - Branch - OpenMathLib/OpenBLAS

Adds the ability to accumulate in FP16 and converts BF16 inputs to FP32, for FP16 and BF16 GEMM on RISC-V (BF16 now also works on pre-RVA23 targets).
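The BF16-to-FP32 conversion mentioned above can be illustrated with a scalar sketch. BF16 is the upper 16 bits of an IEEE-754 binary32 value, so widening is a 16-bit left shift of the bit pattern. The helper names `bf16_to_fp32` and `bf16_dot_fp32` are hypothetical and do not come from the pull request, whose kernels are vectorized with RVV; this only shows the idea in portable C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen one BF16 value (stored as raw bits) to FP32.
 * BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
 * of a binary32 value, so the conversion is a shift into the high half. */
static float bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit copy, avoids aliasing problems */
    return f;
}

/* Hypothetical helper: dot product of two BF16 vectors, computed with
 * FP32 multiply-adds after widening each input. This corresponds to a
 * single output element of a GEMM. */
static float bf16_dot_fp32(const uint16_t *a, const uint16_t *b, size_t k)
{
    float acc = 0.0f;
    for (size_t i = 0; i < k; i++) {
        acc += bf16_to_fp32(a[i]) * bf16_to_fp32(b[i]);
    }
    return acc;
}
```

Because the conversion needs only integer shifts, it does not depend on dedicated BF16 hardware instructions, which is presumably why this approach can run on targets older than RVA23.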

#5640 Merged

Comparing ChipKerchner:RVV_Narrow_Accumulate_FP16_GEMM (efe63e7) with develop (1ef6319)

Untouched: 62

Benchmarks

62 total

All benchmarks listed below are defined in benchmark/pybench/benchmarks/bench_blas.py.

test_gemm[100-d]      +2%   479.1 µs   471.1 µs
test_gemm[100-s]      +1%   275.6 µs   273.1 µs
test_syrk[100-d]            340.5 µs   339.5 µs
test_daxpy[1000-d]          32.3 µs    32.2 µs
test_gesdd[mn0-s]           109.3 µs   109.1 µs
test_gemm[1000-s]           117.5 ms   117.4 ms
test_dgemv[1000-s]          7 ms       7 ms
test_gesv[1000-c]           188.6 ms   188.6 ms
test_gemm[1000-c]           426.1 ms   426 ms
test_gesdd[mn1-d]           93.8 ms    93.8 ms
test_syrk[1000-s]           65.4 ms    65.4 ms
test_dgemv[1000-z]          26.3 ms    26.3 ms
test_gesv[1000-d]           93.3 ms    93.3 ms
test_syrk[1000-z]           476.4 ms   476.4 ms
test_gesv[1000-z]           353.6 ms   353.6 ms
test_syev[50-s]             1.3 ms     1.3 ms
test_syrk[1000-c]           227.5 ms   227.5 ms
test_syrk[1000-d]           130.3 ms   130.4 ms
test_daxpy[1000-c]          32.7 µs    32.7 µs
test_gesv[1000-s]           52.6 ms    52.6 ms
test_gesv[100-z]            937 µs     937.1 µs
test_gemm[1000-z]           875.5 ms   875.6 ms
test_syev[200-d]            58.6 ms    58.6 ms
test_dgemv[1000-c]          14.8 ms    14.8 ms
test_dgemv[1000-d]          13.9 ms    13.9 ms

Commits

Base: develop (1ef6319)

b5f2a50   -0.18%   Added ability to accumulate in FP16 for GEMM. Widens once at the end of loops. (1 month ago, by ChipKerchner)
aa1cebd   +0.19%   128-bit versions. (1 month ago, by ChipKerchner)
74d9fe2   -0.19%   Forgot to add definition. (1 month ago, by ChipKerchner)
e3cb067   +0.19%   Fixed MADD to use float16 values. Use LMUL = 2 in main loop. Now 1.85X faster on BananaPi. (1 month ago, by ChipKerchner)
3356043            Convert inputs from BF16 to FP32 and use FP32 vector madds. 18% faster. (1 month ago, by ChipKerchner)
4121a22   -0.14%   Convert BF16 values once (and vectorized). (1 month ago, by ChipKerchner)
9701a80            One small change. (1 month ago, by ChipKerchner)
1cc377e            Only convert B if M is greater than or equal to 4. (1 month ago, by ChipKerchner)
7a1d234   -0.05%   Add flag for not converting A & B; it will be used in the future to do the conversion during packing. (1 month ago, by ChipKerchner)
1d6aa0d   +0.05%   Add dummy memsets, just in case. (1 month ago, by ChipKerchner)
efe63e7            Add pre-RVA23 to BF16 GEMM. (29 days ago, by ChipKerchner)
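The first commit in the list describes accumulating in the narrow type and widening once at the end of the loops. The sketch below shows that loop structure next to the alternative of widening on every multiply-add. It is only an illustration under stated assumptions: `float` stands in for FP16 and `double` for FP32 because `_Float16` is not available on every compiler, the function names are hypothetical, and the actual kernels use RVV vector code rather than scalar loops.

```c
#include <stddef.h>

/* Stand-in types (assumption): the pull request uses FP16 as the narrow
 * type and FP32 as the wide type; float/double keep this sketch portable. */
typedef float  narrow_t;
typedef double wide_t;

/* Hypothetical helper: accumulate in the narrow type, widen once at the
 * end. The inner loop has no per-step conversions, at the cost of
 * narrow-precision rounding in the running sum. */
static wide_t dot_narrow_accumulate(const narrow_t *a, const narrow_t *b,
                                    size_t k)
{
    narrow_t acc = 0;
    for (size_t i = 0; i < k; i++) {
        acc += a[i] * b[i];
    }
    return (wide_t)acc; /* single widening step after the loop */
}

/* Hypothetical helper: widen every operand before each multiply-add.
 * More accurate, but every iteration pays for the conversions. */
static wide_t dot_wide_accumulate(const narrow_t *a, const narrow_t *b,
                                  size_t k)
{
    wide_t acc = 0;
    for (size_t i = 0; i < k; i++) {
        acc += (wide_t)a[i] * (wide_t)b[i];
    }
    return acc;
}
```

The two functions agree whenever every intermediate sum is exactly representable in the narrow type. For long inner dimensions or large magnitudes the narrow accumulator can lose precision or overflow, which is the trade-off accepted in exchange for the speedups quoted in the commit messages.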
