Avatar for the vortex-data user
vortex-data
vortex
BlogDocsChangelog

Performance History

Latest Results

feat: use cardinality estimator for distinct count stats Replace the exact `HashMap`/`HashSet` previously used to compute distinct-value counts during compression stats generation with Cloudflare's `cardinality-estimator` crate. The estimator gives us a bounded-memory approximation (exact up to ~128 distinct values, then HyperLogLog++) so high-cardinality arrays no longer require an O(n) auxiliary hash table to answer the single question "how many unique values does this have?". - Integer stats swap the hash map for a `CardinalityEstimator` and track the most frequent value via a Boyer-Moore majority candidate plus a second-pass exact count. Sparse/dict schemes only care about the heavy hitter (>= 90% threshold) or a rough distinct ratio, so this is behaviourally equivalent for the decisions they make. - Float and string stats likewise drop their hash sets in favor of the estimator. - The integer and float dictionary encoders now rebuild the exact set of distinct values from the source array at compress time, since they need the values themselves and the stats layer no longer retains them. - `SequenceScheme`'s fast-path check for "all values are distinct" now tolerates the estimator's small approximation error; the deferred callback still validates sequences exactly. Signed-off-by: Robert Kruszewski <github@robertk.io>
rk/cardinality-estimator
15 hours ago

Latest Branches

CodSpeed Performance Gauge
-25%
Use approximate cardinality to decide whether to use dict compression#7759
10 days ago
7fa9309
rk/cardinality-estimator
CodSpeed Performance Gauge
-61%
19 hours ago
f965259
claude/benchmarks-v3-refactor
CodSpeed Performance Gauge
-25%
1 day ago
10a45d4
rk/signature
© 2026 CodSpeed Technology
Home Terms Privacy Docs