fastlanes: bit-packed compare-constant fast path + bitpack_constant kernel
Speeds up the `bitpack_compare` bench from the parent commit with two
independent optimizations driven by the same observation: a bit-packed lane
holds values in `[0, 2^bit_width - 1]`, so a constant outside that range can
be answered analytically without touching the packed buffer.
**Compare-constant fast path (`compute/compare.rs`)**
Register a `CompareKernel` for `BitPacked` that short-circuits when the RHS
constant `c` is outside `[0, 2^bit_width - 1]`. For each operator the answer
is a constant boolean modulo patches and validity:
Eq/NotEq - false / true everywhere
Lt/Lte - true everywhere when `c` is above the range, false when it is below
Gt/Gte - true everywhere when `c` is below the range, false when it is above
Detecting the range is an `O(1)` `i128` check via the new
`BitPackedData::value_fits_bit_width` helper. With no patches and no nulls the
kernel returns a `ConstantArray<bool>` (also `O(1)`); otherwise it allocates a
`BitBuffer`, fills it with the constant result, and overlays the per-position
outcome at each patch index. In-range constants fall through to the canonical
decompress + Arrow compare path; tests exercise both fall-throughs.
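
The out-of-range decision above can be sketched as follows. This is a minimal illustration, not the kernel in `compute/compare.rs`: the `Operator` enum, the free-function form of `value_fits_bit_width`, and `constant_result` are hypothetical names, and patches and validity are deliberately ignored.

```rust
/// Comparison operators handled by this sketch (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Operator {
    Eq,
    NotEq,
    Lt,
    Lte,
    Gt,
    Gte,
}

/// O(1) range check: does `c` lie in `[0, 2^bit_width - 1]`?
/// Assumes `bit_width <= 64`, so the shift cannot overflow an `i128`.
pub fn value_fits_bit_width(c: i128, bit_width: u32) -> bool {
    c >= 0 && c < (1i128 << bit_width)
}

/// For `value <op> c`, returns `Some(answer)` when `c` is outside the packed
/// range (the answer is then the same for every unpatched, valid position),
/// or `None` when `c` is in range and the slow path must run.
pub fn constant_result(op: Operator, c: i128, bit_width: u32) -> Option<bool> {
    if value_fits_bit_width(c, bit_width) {
        return None;
    }
    // Out of range means either below zero or above the maximum packed value.
    let c_below_range = c < 0;
    Some(match op {
        Operator::Eq => false,
        Operator::NotEq => true,
        // value < c (or <=) holds for every value only when c is above the range.
        Operator::Lt | Operator::Lte => !c_below_range,
        // value > c (or >=) holds for every value only when c is below the range.
        Operator::Gt | Operator::Gte => c_below_range,
    })
}
```

With `bit_width = 4`, for instance, `constant_result(Operator::Lt, 16, 4)` yields `Some(true)`, while any `c` in `0..=15` yields `None` and falls through.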
**`bitpack_constant` analytical encoder (`array/bitpack_compress.rs`)**
Add a constant-only pack kernel that builds the FastLanes bit pattern for a
`[constant; len]` input without calling `BitPacking::pack`. For constant input
every lane produces the same `bit_width` output words; we compute those words
analytically - each output word's `j`-th bit is bit `(k * T_bits + j) mod
bit_width` of `c` - then `memset` each word `LANES` times into a stack chunk
template and `memcpy` the template into every full chunk. The standard packer
is only invoked for the partial tail (zero-padded past `len`).
`bitpack_encode_constant` wraps the buffer up as a `BitPackedArray`. A
bitwise-equivalence rstest covers byte-identity with `BitPacking::pack` across
lengths, widths, and constants.
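
As an illustration of the bit formula above, the sketch below computes the `bit_width` output words for a single lane of `u32` values and checks them against a naive sequential packer. It assumes a lane packs `T_bits` values back to back into one little-endian bit stream; the function names are hypothetical, and the sketch does not reproduce the `LANES` replication, the chunk template, or the tail handling of the real kernel.

```rust
/// Bits per output word for `T = u32`.
const T_BITS: usize = 32;

/// Analytically computes the `bit_width` words one lane produces when every
/// packed value equals `c`: bit `j` of word `k` is bit
/// `(k * T_BITS + j) % bit_width` of `c`.
pub fn constant_lane_words(c: u32, bit_width: usize) -> Vec<u32> {
    assert!(bit_width >= 1 && bit_width <= T_BITS);
    // `c` must fit in `bit_width` bits (trivially true at full width).
    assert!(bit_width == T_BITS || c >> bit_width == 0);
    (0..bit_width)
        .map(|k| {
            let mut word = 0u32;
            for j in 0..T_BITS {
                let src = (k * T_BITS + j) % bit_width;
                word |= ((c >> src) & 1) << j;
            }
            word
        })
        .collect()
}

/// Reference packer for one lane: writes each value's `bit_width` low bits
/// consecutively into the output bit stream. Expects `T_BITS` values.
pub fn pack_lane_reference(values: &[u32], bit_width: usize) -> Vec<u32> {
    assert_eq!(values.len(), T_BITS);
    let mut out = vec![0u32; bit_width];
    for (i, &v) in values.iter().enumerate() {
        for b in 0..bit_width {
            let pos = i * bit_width + b;
            out[pos / T_BITS] |= ((v >> b) & 1) << (pos % T_BITS);
        }
    }
    out
}
```

Because the bit pattern depends only on `c` and `bit_width`, the words can be computed once and reused for every lane and every full chunk, which is what makes the template-and-copy approach possible.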
**Benches**
* `bitpack_compare` (added in the parent commit) on this branch now exercises
the fast path; at `bit_width ∈ {4, 16}`, `len ∈ {1024, 65536}` it runs in
~1.4-1.5 µs vs 8-125 µs for the decompress + Arrow baseline.
* New `bitpack_constant` bench compares the analytical kernel against the
full `bitpack_encode` pipeline on uniform-constant input; at 64 K u32
elements the analytical kernel is roughly 23-62x faster.
**Plan doc (`docs/inrange_compare_plan.md`)**
Document the follow-up plan to accelerate *in-range* ordering comparisons:
compare the packed array against the packed constant via SWAR less-than per
supported bit width (Routes A/B/C, including Knuth broadword with rotation
tables for widths that straddle word boundaries), derive the four ordering
operators from one `Lt` primitive, and benchmark against the canonical SIMD
baseline before landing.
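
To make the idea of one `Lt` primitive concrete, here is a hedged sketch of a SWAR unsigned less-than for the simplest case only: eight 8-bit fields in a `u64`, where no field straddles a word boundary. It is not taken from the plan doc, does not cover the rotation-table routes, and all names are hypothetical. The other three ordering operators are derived from `swar_lt_u8` by swapping operands or flipping the result bits.

```rust
/// High bit of every 8-bit field in a u64.
const H: u64 = 0x8080_8080_8080_8080;

/// Replicates `c` into all eight byte fields.
pub fn broadcast_u8(c: u8) -> u64 {
    u64::from(c) * 0x0101_0101_0101_0101
}

/// Per-field unsigned `x < y`. The result has bit 7 of a field set where the
/// comparison holds and is zero elsewhere.
pub fn swar_lt_u8(x: u64, y: u64) -> u64 {
    // Forcing the high bit of x and clearing it in y keeps every per-field
    // difference in 1..=255, so no borrow crosses a field boundary. Bit 7 of
    // each field of `z` is then set iff low7(x) >= low7(y).
    let z = (x | H).wrapping_sub(y & !H);
    // x < y iff (x's high bit is 0 and y's is 1), or the high bits agree and
    // low7(x) < low7(y).
    ((!x & y) | (!(x ^ y) & !z)) & H
}

/// `x > y` is `y < x`.
pub fn swar_gt_u8(x: u64, y: u64) -> u64 {
    swar_lt_u8(y, x)
}

/// `x >= y` is the negation of `x < y`, restricted to the flag bits.
pub fn swar_gte_u8(x: u64, y: u64) -> u64 {
    swar_lt_u8(x, y) ^ H
}

/// `x <= y` is the negation of `y < x`, restricted to the flag bits.
pub fn swar_lte_u8(x: u64, y: u64) -> u64 {
    swar_lt_u8(y, x) ^ H
}
```

Comparing a packed word against a constant then amounts to `swar_lt_u8(word, broadcast_u8(c))`, which is the shape the planned benchmark against the canonical SIMD baseline would need to evaluate.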
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>