feat(avx512): working, optimized AVX-512 escape kernel; fix escape_into overflow
The avx512 kernel + feature existed but had never run on AVX-512 hardware
and was a regression. Enabling only `avx512f` makes LLVM emulate the u8
byte compares (no native vpcmpub), so it ran ~1.75x slower than avx2, and
its <64B tail did a 64-byte speculative store that overflowed the
destination buffer on short/dense inputs (heap-buffer-overflow / SIGABRT).
Kernel (src/simd/avx512.rs):
- enable avx512f,avx512bw,avx512vl so LLVM emits vpcmpub + korq + kortestq
  natively instead of emulating the byte compares
- fast path: a single OR-combined mask branch over each 256B chunk (1 branch,
  the 4 load+mask chains pipeline) instead of 4 short-circuit `&&` branches
- tail (<64B): AVX-512 masked load/store (maskz_loadu_epi8 + mask_storeu_epi8,
  k = (1<<nb)-1): page-safe load, writes exactly nb bytes, no over-store
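A scalar model of the two kernel ideas above, for readers without AVX-512 hardware. This is a sketch, not the code in src/simd/avx512.rs: `needs_escape`, `lane_mask`, `chunk_is_clean`, `tail_mask` and `masked_copy` are hypothetical names, and the escape set (control bytes, `"` and `\`) is assumed from standard JSON string escaping.

```rust
/// Assumed escape set: control bytes, double quote, backslash.
fn needs_escape(b: u8) -> bool {
    b < 0x20 || b == b'"' || b == b'\\'
}

/// Scalar stand-in for one 64-byte compare: bit i is set when byte i
/// of the lane needs escaping.
fn lane_mask(lane: &[u8]) -> u64 {
    lane.iter()
        .enumerate()
        .fold(0u64, |m, (i, &b)| m | ((needs_escape(b) as u64) << i))
}

/// Fast path over a 256-byte chunk: compute the four lane masks
/// independently, OR them together, and branch once on the result.
fn chunk_is_clean(chunk: &[u8; 256]) -> bool {
    let combined = chunk
        .chunks_exact(64)
        .fold(0u64, |acc, lane| acc | lane_mask(lane));
    combined == 0
}

/// Tail mask k = (1 << nb) - 1, valid for nb < 64: the low nb bits are set.
fn tail_mask(nb: usize) -> u64 {
    debug_assert!(nb < 64);
    (1u64 << nb) - 1
}

/// Scalar model of a masked store: only positions whose bit is set in k
/// are written, so a tail of nb bytes never touches byte nb or beyond.
fn masked_copy(dst: &mut [u8], src: &[u8], k: u64) {
    for i in 0..64 {
        if ((k >> i) & 1) == 1 {
            dst[i] = src[i];
        }
    }
}
```

With `k = tail_mask(nb)`, `masked_copy` touches exactly `nb` bytes of `dst`, which is the property the masked tail relies on to avoid the 64-byte over-store.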
Dispatch (src/lib.rs): gate the avx512 kernel on runtime
`avx512bw && avx512vl` (not just `avx512f`) to match the enabled features.
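A minimal sketch of that runtime gate, assuming a dispatcher shaped roughly like this. The `Kernel` enum and `select_kernel` are hypothetical names, not the crate's API; only `is_x86_feature_detected!` is the real standard-library macro.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kernel {
    Avx512,
    Avx2,
    Scalar,
}

/// Pick the widest kernel whose required CPU features are all present.
fn select_kernel() -> Kernel {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // The AVX-512 kernel is compiled with avx512bw and avx512vl enabled,
        // so both must be detected before it is safe to call.
        if is_x86_feature_detected!("avx512bw") && is_x86_feature_detected!("avx512vl") {
            return Kernel::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return Kernel::Avx2;
        }
    }
    // Non-x86 targets, or x86 CPUs without AVX2.
    Kernel::Scalar
}
```

Checking only `avx512f` here would admit CPUs that lack the byte-granular instructions the kernel was compiled to use.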
Also fixes a pre-existing soundness hole unrelated to avx512: escape_into
never reserved capacity, so a caller buffer sized to the exact output could
overflow. Confirmed under ASAN on BOTH the default avx2 path (32-byte write)
and the avx512 path (8-byte write via escape_unchecked). Now reserves
len*6+32+3 up front (a no-op when the caller already sized dst large enough).
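The reservation can be sketched as follows, assuming `dst` is a `Vec<u8>`. `worst_case_extra` and `reserve_for_escape` are hypothetical helpers; the factor 6 is consistent with every input byte expanding to a `\u00XX` escape, and the `+32+3` slack is taken from the message as-is.

```rust
/// Upper bound on bytes appended for `input_len` input bytes,
/// using the len*6+32+3 formula from the commit message.
fn worst_case_extra(input_len: usize) -> usize {
    input_len * 6 + 32 + 3
}

/// Make sure `dst` has room for the worst case before any wide store runs.
/// `Vec::reserve` does nothing when spare capacity is already sufficient.
fn reserve_for_escape(dst: &mut Vec<u8>, input: &[u8]) {
    dst.reserve(worst_case_extra(input.len()));
}
```

Because `Vec::reserve` counts from the current length, bytes already in `dst` are preserved and the slack is added on top of them.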
Benchmarks on Intel Xeon 8581C (Emerald Rapids), --features avx512:
  rxjs      249 us    (avx2 289 us,   sonic-rs 311 us)
  fixtures  10.31 ms  (avx2 11.04 ms, sonic-rs 11.66 ms)
  short     80 ns     (avx2 84 ns,    sonic-rs 202 ns)
Adds tests/stress.rs: brute-force differential vs serde_json across all
lengths (0..=600) x escape densities, plus an escape_into exact-capacity
regression. Verified: 16 unit + 3 stress tests pass on both paths; ASAN clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>