feat[vortex-cuda]: GPU FSST decompression kernel (#7776)
## Summary
This commit implements on-GPU decompression of the existing FSST
encoding. The kernel reaches ~42% of the maximum throughput measured by
the `throughput_cuda` benchmark on a DGX Spark. CPU work is still
required to compute the output offsets.
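
As an illustration of the CPU-side step, here is a minimal sketch of an offset computation. It assumes the per-string uncompressed lengths are already available as a slice; the function name `output_offsets` and the `u32`/`u64` widths are hypothetical and not taken from this PR.

```rust
/// Exclusive prefix sum over per-string uncompressed lengths.
/// Returns `n + 1` offsets: string `i` occupies `offsets[i]..offsets[i + 1]`
/// in the decompressed output buffer.
fn output_offsets(uncompressed_lengths: &[u32]) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(uncompressed_lengths.len() + 1);
    let mut total: u64 = 0;
    offsets.push(total);
    for &len in uncompressed_lengths {
        // Accumulate in u64 so the running total cannot wrap at 4 GiB.
        total += u64::from(len);
        offsets.push(total);
    }
    offsets
}
```

With offsets known up front, each GPU thread can write its string at a fixed position without coordinating with other threads.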
The core performance win is buffering up to 24 bytes of decompressed
data in three u64 registers and emitting the widest aligned stores
possible, up to u128 (`st.global.v2.u64`).
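
The alignment arithmetic behind "widest aligned stores" can be modeled on the CPU. The sketch below is not the kernel's flush logic, which the description does not spell out; `plan_stores` is a hypothetical helper that only shows which store widths a greedy policy would pick for a given destination address and byte count.

```rust
/// Widths (in bytes) of the stores needed to write `len` bytes starting at
/// byte address `addr`, greedily choosing the widest naturally aligned store
/// of at most 16 bytes (the u128 case).
fn plan_stores(mut addr: usize, mut len: usize) -> Vec<usize> {
    let mut widths = Vec::new();
    while len > 0 {
        let mut w = 16;
        // Halve the width until it is both aligned at `addr` and fits in `len`.
        while w > 1 && (addr % w != 0 || w > len) {
            w /= 2;
        }
        widths.push(w);
        addr += w;
        len -= w;
    }
    widths
}
```

For a 16-byte-aligned destination, 24 buffered bytes go out as one 16-byte and one 8-byte store. A destination at address 3 needs six narrower stores for the same 24 bytes.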
The 256-entry symbol table (≤ 2 KB) is read directly from global memory.
Staging it into shared memory measured ~3% slower at 10M rows and ~15%
slower at 1M rows. The hypothesis is that L1 already holds the table
after a few iterations and the explicit shared copy adds bank-conflict
latency on the warp-divergent `symbols[code]` reads; the gap is wider at
1M because the kernel is less bandwidth-bound there.
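
For readers unfamiliar with the per-code table lookup discussed above, here is a scalar reference decoder as a sketch. It assumes the standard FSST layout: symbols of up to 8 bytes packed little-endian into a u64 (256 × 8 bytes = 2 KB), a parallel length table, and code 255 as the escape marker. The names `decode_one`, `symbols`, and `lens` are illustrative and not the identifiers used in the kernel.

```rust
/// FSST reserves code 255 as an escape: the next byte is a literal.
const ESCAPE: u8 = 255;

/// Decode one compressed string into `out`.
/// `symbols[c]` packs up to 8 bytes little-endian; `lens[c]` is how many are valid.
fn decode_one(codes: &[u8], symbols: &[u64; 256], lens: &[u8; 256], out: &mut Vec<u8>) {
    let mut i = 0;
    while i < codes.len() {
        let code = codes[i];
        if code == ESCAPE {
            // Escaped literal: copy the following byte unchanged.
            if let Some(&literal) = codes.get(i + 1) {
                out.push(literal);
            }
            i += 2;
        } else {
            // Table lookup: this is the data-dependent `symbols[code]` read.
            let bytes = symbols[usize::from(code)].to_le_bytes();
            let n = usize::from(lens[usize::from(code)]).min(8);
            out.extend_from_slice(&bytes[..n]);
            i += 1;
        }
    }
}
```

Each iteration indexes the table with a value read from the input, so neighbouring GPU threads decoding different strings touch unrelated table entries.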
Further optimizations would require an encoding change. Splits-style
intra-string parallelism (one GPU thread per ~32-compressed-byte chunk
instead of one per string) was prototyped on top of this kernel and
improved kernel throughput by a further 30% at 1M ClickBench URLs, 26%
at 5M, and 12% at 10M.
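
The description does not give the prototype's layout, so the following is only a sketch of the kind of metadata such an encoding change might store. It assumes code 255 is the FSST escape and that each chunk needs both its start in the compressed codes and its start in the output; `Split`, `split_points`, and `target` are hypothetical names.

```rust
const ESC_CODE: u8 = 255;

/// Hypothetical split record: where a chunk begins in the codes and in the output.
#[derive(Debug, PartialEq)]
struct Split {
    code_start: usize,
    out_start: usize,
}

/// Place a split roughly every `target` compressed bytes, always on a code
/// boundary so an escape marker is never separated from its literal byte.
fn split_points(codes: &[u8], lens: &[u8; 256], target: usize) -> Vec<Split> {
    let step = target.max(1);
    let mut splits = vec![Split { code_start: 0, out_start: 0 }];
    let (mut i, mut out, mut next) = (0usize, 0usize, step);
    while i < codes.len() {
        if i >= next {
            splits.push(Split { code_start: i, out_start: out });
            next = i + step;
        }
        if codes[i] == ESC_CODE {
            i += 2; // escape marker plus its literal
            out += 1;
        } else {
            out += usize::from(lens[usize::from(codes[i])]);
            i += 1;
        }
    }
    splits
}
```

Because a chunk's output position depends on every code before it, these records have to be computed at encode time, which is why this approach needs a format change rather than a kernel-only change.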
Four kernel variants are generated for the unsigned widths of
`codes_offsets` (u8/u16/u32/u64); signed integer ptypes are reinterpreted
as their unsigned equivalent on the Rust side, so the bit pattern is
preserved without copying.
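
A minimal sketch of such a zero-copy reinterpretation, using plain standard-library slices rather than the Vortex buffer types (which are not shown in this description). The helper name `as_unsigned` is hypothetical, and only the i32 case is shown.

```rust
/// View a slice of signed offsets as unsigned without copying.
fn as_unsigned(offsets: &[i32]) -> &[u32] {
    // SAFETY: i32 and u32 have the same size and alignment, and every bit
    // pattern is a valid value of both types. The returned slice borrows
    // the same memory for the same lifetime.
    unsafe { std::slice::from_raw_parts(offsets.as_ptr().cast::<u32>(), offsets.len()) }
}
```

Offsets are non-negative, so their signed and unsigned readings agree and the same kernel variant can serve both ptypes of a given width.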
Addresses: #6538
## Testing
Unit tests compare the GPU output against the CPU implementation on small and larger datasets.
Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>