vortex-data/vortex
Latest Results
remove vortex dogfooding

There's really no point to this and it also complicates concurrency control. Really what we want is an Iceberg or Spiral table!

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
ct/benchmark-website-redesign
7 hours ago
feat[fastlanes]: add 4-block VBMI transpose for 7% additional speedup

Add transpose_1024x4_vbmi that processes 4 independent 128-byte blocks simultaneously, using fully interleaved operations for maximum ILP.

Performance: 12.4 cycles/block (vs. 13.3 for dual-block; 300x faster than baseline)

Signed-off-by: Claude <noreply@anthropic.com>
claude/bitpacking-transpose-optimization-tM1U4
9 hours ago
feat[fastlanes]: add dual-block VBMI transpose for 10% more throughput

Add transpose_1024x2_vbmi and untranspose_1024x2_vbmi for batch processing of two 128-byte blocks simultaneously using interleaved VBMI operations.

Performance:
- vbmi_dual: 11.9 cycles/block (10.5% faster than single-block at 13.3)
- Useful for bulk transpose operations

The dual-block version achieves better throughput by:
- Loading 4 input ZMM registers upfront (2 per block)
- Interleaving gather/transpose/scatter operations
- Hiding latencies through better instruction-level parallelism

Signed-off-by: Claude <noreply@anthropic.com>
claude/bitpacking-transpose-optimization-tM1U4
9 hours ago
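As a rough illustration of the interleaving idea behind the dual- and quad-block kernels above, here is a minimal scalar sketch (not the actual VBMI code): two independent 8x8 bit-matrix transposes with their XOR/shift steps interleaved, giving the out-of-order core two dependency chains to overlap.

```rust
/// Minimal scalar sketch of dual-block interleaving (illustrative only,
/// not the VBMI kernel): each XOR/shift step of the classic 8x8 bit
/// transpose is issued for both blocks back to back, so the two
/// independent dependency chains can execute in parallel.
fn transpose8x8_x2(mut a: u64, mut b: u64) -> (u64, u64) {
    let ta = (a ^ (a >> 7)) & 0x00AA_00AA_00AA_00AA;
    let tb = (b ^ (b >> 7)) & 0x00AA_00AA_00AA_00AA;
    a ^= ta ^ (ta << 7);
    b ^= tb ^ (tb << 7);
    let ta = (a ^ (a >> 14)) & 0x0000_CCCC_0000_CCCC;
    let tb = (b ^ (b >> 14)) & 0x0000_CCCC_0000_CCCC;
    a ^= ta ^ (ta << 14);
    b ^= tb ^ (tb << 14);
    let ta = (a ^ (a >> 28)) & 0x0000_0000_F0F0_F0F0;
    let tb = (b ^ (b >> 28)) & 0x0000_0000_F0F0_F0F0;
    a ^= ta ^ (ta << 28);
    b ^= tb ^ (tb << 28);
    (a, b)
}
```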
feat[fastlanes]: add AVX-512 VBMI transpose with 7.5x speedup

Add an AVX-512 VBMI optimized transpose implementation using vpermi2b/vpermb for vectorized gather and scatter operations.

Performance improvements:
- VBMI: 13.6 cycles/call (7.5x faster than avx512_gfni at 102.6 cycles)
- VBMI: 240x faster than baseline (3276 cycles)

Key optimizations:
- Use vpermi2b to gather 8 bytes at stride 16 in parallel
- Use vpermb for the 8x8 byte transpose during the scatter phase
- Static permutation tables to avoid stack allocation

Also adds:
- Dual-block transpose_1024x2_avx512 for batch processing
- VBMI detection via a has_vbmi() function
- Updated dispatch to prefer VBMI when available

Signed-off-by: Claude <noreply@anthropic.com>
claude/bitpacking-transpose-optimization-tM1U4
9 hours ago
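A hedged sketch of the vpermi2b gather step described above, assuming the 1024-bit block is laid out as 128 contiguous bytes; the index layout and function name are illustrative assumptions, not the crate's actual kernel (which, per the commit, uses static permutation tables rather than building indices in a loop).

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Illustrative sketch (assumed layout, hypothetical name): one
/// vpermi2b (_mm512_permutex2var_epi8) gathers, for each of 8 lanes,
/// the 8 bytes at stride 16 from a 128-byte block held in two ZMM
/// registers. Requires an AVX-512 VBMI CPU.
#[target_feature(enable = "avx512f,avx512vbmi")]
unsafe fn gather_stride16_lo(block: &[u8; 128]) -> [u8; 64] {
    // Output byte k takes source byte (k / 8) + 16 * (k % 8):
    // lane k/8 collects one byte from each 16-byte group.
    let mut idx = [0u8; 64];
    for k in 0..64 {
        idx[k] = ((k / 8) + 16 * (k % 8)) as u8;
    }
    let lo = _mm512_loadu_si512(block.as_ptr() as *const _);
    let hi = _mm512_loadu_si512(block.as_ptr().add(64) as *const _);
    let sel = _mm512_loadu_si512(idx.as_ptr() as *const _);
    // Index bit 6 selects between lo and hi, so values 0..127 address
    // the full 128-byte pair.
    let gathered = _mm512_permutex2var_epi8(lo, sel, hi);
    let mut out = [0u8; 64];
    _mm512_storeu_si512(out.as_mut_ptr() as *mut _, gathered);
    out
}
```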
test[fastlanes]: add verification against fastlanes crate transpose

Add a test verifying that our transpose_index implementation exactly matches the fastlanes crate's transpose function for all 1024 indices.

Signed-off-by: Claude <noreply@anthropic.com>
claude/bitpacking-transpose-optimization-tM1U4
10 hours ago
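Alongside the exact-match test above, a cheap companion invariant is that the mapping is a bijection. A minimal sketch, assuming transpose_index takes and returns usize:

```rust
/// Minimal companion sketch (assumes `fn transpose_index(usize) -> usize`):
/// the FastLanes transposed order must be a permutation of 0..1024.
#[test]
fn transpose_index_is_a_permutation() {
    let mut seen = [false; 1024];
    for i in 0..1024 {
        let j = transpose_index(i);
        assert!(j < 1024, "index {i} maps out of range to {j}");
        assert!(!seen[j], "index {i} collides at {j}");
        seen[j] = true;
    }
}
```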
perf[fastlanes]: fully unroll BMI2 transpose for 12% performance gain

Testing showed that fully unrolling the BMI2 PEXT operations yields approximately 12% better performance than the looped version; the compiler doesn't fully optimize nested loops with PEXT intrinsics.

Signed-off-by: Claude <noreply@anthropic.com>
claude/bitpacking-transpose-optimization-tM1U4
10 hours ago
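For context, a minimal looped sketch of the extraction that commit unrolls: row i of the transposed 8x8 bit matrix is a PEXT of bit i from every byte. The shipped version replaces the loop with eight literal-mask PEXT calls, since the compiler does not unroll it well on its own.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::_pext_u64;

/// Looped sketch of the BMI2 8x8 bit transpose (the commit ships the
/// fully unrolled form): output byte i gathers bit i of every input
/// byte via PEXT.
#[target_feature(enable = "bmi2")]
unsafe fn transpose8x8_pext(x: u64) -> u64 {
    let mut out = 0u64;
    for i in 0..8 {
        // Mask selects bit i of each of the 8 bytes, LSB-first.
        let row = _pext_u64(x, 0x0101_0101_0101_0101 << i);
        out |= row << (8 * i);
    }
    out
}
```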
feat[fastlanes]: add scalar_fast and ARM64 NEON transpose implementations

Add highly optimized transpose implementations:

1. scalar_fast: Uses the 8x8 bit-matrix transpose algorithm with XOR/shift operations, achieving ~59 ns per 1024-bit transpose (25x faster than baseline). This is portable and works on all platforms.
2. ARM64 NEON: Uses NEON intrinsics for parallel bit transpose on AArch64, processing 2 groups at a time with 128-bit vector registers.

Performance results (median times, 1024-bit transpose on x86-64):
- baseline: 1.512 µs (bit-by-bit reference)
- scalar: 641.2 ns (2.4x faster)
- scalar_fast: 58.92 ns (25.7x faster) - NEW
- avx2: 212.7 ns (7.1x faster)
- avx2_gfni: 72.54 ns (20.8x faster)
- bmi2: 60.56 ns (25.0x faster)
- avx512_gfni: 44.38 ns (34.1x faster)

The scalar_fast implementation achieves near-SIMD performance by:
- Gathering 8 bytes at stride 16 into a u64
- Applying the 8x8 bit transpose in 3 XOR/shift steps
- Fully unrolling the loops for all 16 base patterns

Assembly verified to use:
- BMI2: PEXT instructions for bit extraction
- AVX-512: vpxord/vpsrlq/vpsllq for the parallel bit transpose

Signed-off-by: Claude <noreply@anthropic.com>
claude/bitpacking-transpose-optimization-tM1U4
10 hours ago
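A minimal sketch of one scalar_fast group, following the three steps listed above. The data layout is assumed, and the real kernel writes to the FastLanes transposed positions rather than back at the same stride:

```rust
/// One scalar_fast group, sketched under an assumed layout: gather 8
/// bytes at stride 16 into a u64, transpose the 8x8 bit matrix with
/// three XOR/shift delta swaps, then scatter the rows back out.
fn transpose_one_group(block: &[u8; 128], out: &mut [u8; 128], base: usize) {
    // Gather bytes base, base+16, ..., base+112 (row i lands in byte i).
    let mut x = 0u64;
    for i in 0..8 {
        x |= (block[base + 16 * i] as u64) << (8 * i);
    }
    // Classic 8x8 bit-matrix transpose in three XOR/shift steps.
    let t = (x ^ (x >> 7)) & 0x00AA_00AA_00AA_00AA;
    x ^= t ^ (t << 7);
    let t = (x ^ (x >> 14)) & 0x0000_CCCC_0000_CCCC;
    x ^= t ^ (t << 14);
    let t = (x ^ (x >> 28)) & 0x0000_0000_F0F0_F0F0;
    x ^= t ^ (t << 28);
    // Scatter the transposed rows; the real kernel uses the FastLanes
    // transposed index order here instead of the same stride.
    for i in 0..8 {
        out[base + 16 * i] = (x >> (8 * i)) as u8;
    }
}
```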
feat[fastlanes]: add BMI2 PEXT/PDEP transpose and fix GFNI implementations

Add a BMI2 implementation using PEXT/PDEP for efficient bit extraction/deposit, achieving a 32x speedup over baseline (~48 ns vs ~1.5 µs per 1024-bit transpose).

Fix the AVX2+GFNI and AVX-512+GFNI implementations to use the classic 8x8 bit-matrix transpose algorithm with XOR/shift operations, since GFNI's gf2p8affineqb operates per byte and cannot shuffle bits between bytes.

Performance summary (median times, 1024-bit transpose):
- baseline: 1.562 µs (bit-by-bit)
- scalar: 641.6 ns (2.4x faster)
- avx2: 218.8 ns (7x faster)
- avx2_gfni: 71.98 ns (22x faster)
- bmi2: 47.92 ns (33x faster)
- avx512_gfni: 44.38 ns (35x faster)

Also add BMI2 benchmarks for both transpose and untranspose operations.

Signed-off-by: Claude <noreply@anthropic.com>
claude/bitpacking-transpose-optimization-tM1U4
10 hours ago
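The deposit side mirrors the extraction: a hedged sketch of a PDEP-based untranspose step, inverting the PEXT gather by scattering each 8-bit row back to bit i of every output byte.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::_pdep_u64;

/// Sketch of the PDEP deposit that inverts the PEXT extraction: row i
/// of the transposed matrix is scattered back to bit i of each byte.
#[target_feature(enable = "bmi2")]
unsafe fn untranspose8x8_pdep(x: u64) -> u64 {
    let mut out = 0u64;
    for i in 0..8 {
        // Deposit the 8 bits of row i at positions i, i+8, ..., i+56.
        out |= _pdep_u64(x >> (8 * i), 0x0101_0101_0101_0101 << i);
    }
    out
}
```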
Active Branches
[DO NOT MERGE] Rewrite the Vortex benchmarks website
last run 7 hours ago · #6132 · performance: -40%
feat[fastlanes]: add optimized 1024-bit transpose implementations
last run 9 hours ago · #6135 · performance: -30%
feat[vortex-array]: add unified DisplayTreeNode for tree display and JSON serialization
last run 12 hours ago · #6134 · performance: -41%