Eventual-Inc / Daft

Performance History

Latest Results

Merge branch 'main' into jeev/benchmarking
jeev/benchmarking
53 minutes ago
perf: inline vectorized aggregation for grouped count/sum (#6345)

The standard grouped aggregation path in `agg_groupby` materializes per-group index lists (`Vec<Vec<u64>>`) during `make_groups()`, then iterates each group's indices to compute aggregates. At high cardinality this means millions of small heap allocations and scattered memory access during the aggregation phase. At low cardinality it's fine (the index lists are few and large), but the approach scales poorly as group count grows. This PR adds an alternative two-phase vectorized path for count and sum that eliminates the per-group index lists entirely.

## Algorithm

**Phase 1 (hash probe).** A single pass over the groupby columns builds a dense `group_ids: Vec<u32>` mapping each row to its group, plus `group_sizes: Vec<u64>` tracking per-group row counts. When the key is a single integer column, a fast path hashes the raw values with an `FnvHashMap`. Multi-column or non-integer keys fall back to the existing `IndexHash` + comparator path.

**Phase 2 (vectorized agg).** Each accumulator runs a tight loop over the dense `group_ids` array, e.g. `sums[gid] += vals[row]`. Count accumulators skip the scatter loop entirely, using the pre-computed `group_sizes` from Phase 1.

Separating accumulation from hash probing and looping through `group_ids` sequentially keeps each loop body minimal and cache-friendly. Accumulators use an `AggAccumulator` enum rather than trait objects to avoid vtable dispatch in the hot loops.

```
Old path (make_groups + eval_agg_expression):

  hash probe    per-group gather + aggregate
  ----------    ----------------------------
  row 0   ->    Group A: [0, 2, 4] -> vals[0], vals[2], vals[4] -> sum = 70
  row 1   ->    Group B: [1, 3]    -> vals[1], vals[3]          -> sum = 60
  row 2   ->    ...
  row 3   ->    ...
  row 4   ->    ...

  G heap allocations, scattered access

New path (inline two-phase):

  Phase 1: hash probe       Phase 2: vectorized agg
  -----------------------   --------------------------
  row 0 -> gid=0            for (gid, val) in group_ids.zip(vals):
  row 1 -> gid=1                sums[gid] += val
  row 2 -> gid=0
  row 3 -> gid=1            group_ids = [0, 1, 0, 1, 0]
  row 4 -> gid=0            sums = [70, 60]

  2 allocations, sequential access
```

## Benchmarks

### Rust-level benchmark (5M rows, single int64 key, no query planning)

| agg | groups | inline (ms) | fallback (ms) | speedup |
|-----|--------|-------------|---------------|---------|
| count | 10 | 31.0 | 31.4 | 1.01x |
| sum | 10 | 35.5 | 45.2 | **1.27x** |
| count+sum | 10 | 35.6 | 46.4 | **1.30x** |
| count | 1K | 31.6 | 38.1 | **1.21x** |
| sum | 1K | 35.1 | 87.3 | **2.49x** |
| count+sum | 1K | 35.7 | 90.1 | **2.52x** |
| count | 100K | 42.2 | 114.6 | **2.72x** |
| sum | 100K | 47.7 | 180.2 | **3.78x** |
| count+sum | 100K | 47.9 | 180.3 | **3.77x** |
| count | 5M | 192.4 | 689.9 | **3.59x** |
| sum | 5M | 204.4 | 869.6 | **4.25x** |
| count+sum | 5M | 221.8 | 972.1 | **4.38x** |

### Rust-level Q1-like benchmark (2 string keys, 6 groups, float64 sum+count)

| rows | inline (ms) | fallback (ms) | speedup |
|------|-------------|---------------|---------|
| 1.2M | 31.6 | 36.7 | **1.16x** |
| 5M | 135.5 | 156.7 | **1.16x** |

### Python end-to-end benchmark (5M rows, includes query planning)

| agg | groups | inline (ms) | fallback (ms) | speedup |
|-----|--------|-------------|---------------|---------|
| count | 10 | 119.1 | 122.6 | 1.03x |
| sum | 10 | 176.3 | 190.8 | 1.08x |
| count+sum | 10 | 175.9 | 170.5 | 0.97x |
| count | 1K | 170.9 | 176.3 | 1.03x |
| sum | 1K | 124.9 | 215.4 | **1.72x** |
| count+sum | 1K | 127.4 | 219.4 | **1.72x** |
| count | 100K | 179.4 | 292.0 | **1.63x** |
| sum | 100K | 176.8 | 357.9 | **2.02x** |
| count+sum | 100K | 183.6 | 372.5 | **2.03x** |
| count | 5M | 786.7 | 1316.5 | **1.67x** |
| sum | 5M | 828.5 | 1419.7 | **1.71x** |
| count+sum | 5M | 839.4 | 1426.7 | **1.70x** |
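To make the two-phase scheme from the Algorithm section concrete, here is a minimal self-contained sketch. All names (`hash_probe`, `eval`, the `AggAccumulator` variants) are illustrative, not Daft's actual code, and std's `HashMap` stands in for the `FnvHashMap` single-integer-key fast path:

```rust
use std::collections::HashMap;

/// Enum dispatch (no vtables): each variant is one vectorized aggregate.
/// Hypothetical names, not Daft's actual types.
enum AggAccumulator {
    Count,
    Sum,
}

/// Phase 1: one pass over the key column builds dense group ids plus
/// per-group row counts.
fn hash_probe(keys: &[i64]) -> (Vec<u32>, Vec<u64>) {
    let mut seen: HashMap<i64, u32> = HashMap::new();
    let mut group_ids = Vec::with_capacity(keys.len());
    let mut group_sizes: Vec<u64> = Vec::new();
    for &k in keys {
        let next = group_sizes.len() as u32;
        let gid = *seen.entry(k).or_insert_with(|| {
            group_sizes.push(0);
            next
        });
        group_sizes[gid as usize] += 1;
        group_ids.push(gid);
    }
    (group_ids, group_sizes)
}

/// Phase 2: a tight scatter loop per accumulator. Count reuses the
/// sizes already computed in Phase 1 and skips the loop entirely.
fn eval(
    acc: &AggAccumulator,
    group_ids: &[u32],
    group_sizes: &[u64],
    vals: &[i64],
) -> Vec<i64> {
    match acc {
        AggAccumulator::Count => group_sizes.iter().map(|&c| c as i64).collect(),
        AggAccumulator::Sum => {
            let mut sums = vec![0i64; group_sizes.len()];
            for (&gid, &v) in group_ids.iter().zip(vals) {
                sums[gid as usize] += v; // sequential reads, no per-group Vec
            }
            sums
        }
    }
}

fn main() {
    // Mirrors the diagram: rows 0, 2, 4 -> group 0; rows 1, 3 -> group 1.
    let keys = [7, 9, 7, 9, 7];
    let vals = [10, 20, 20, 40, 40];
    let (group_ids, group_sizes) = hash_probe(&keys);
    assert_eq!(group_ids, vec![0, 1, 0, 1, 0]);
    assert_eq!(
        eval(&AggAccumulator::Count, &group_ids, &group_sizes, &vals),
        vec![3, 2]
    );
    assert_eq!(
        eval(&AggAccumulator::Sum, &group_ids, &group_sizes, &vals),
        vec![70, 60]
    );
}
```

The enum keeps Phase 2 dispatch static: the `match` happens once per accumulator, outside the hot loop, rather than once per row through a vtable.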
### Multi-file parquet benchmark (50M rows, 10 files)

| agg | groups | inline (ms) | fallback (ms) | speedup |
|-----|--------|-------------|---------------|---------|
| count | 10 | 149.6 | 226.4 | **1.51x** |
| sum | 10 | 152.9 | 246.5 | **1.61x** |
| count+sum | 10 | 158.6 | 242.0 | **1.53x** |
| count | 1K | 163.2 | 485.7 | **2.98x** |
| sum | 1K | 174.0 | 506.5 | **2.91x** |
| count+sum | 1K | 181.1 | 486.9 | **2.69x** |
| count | 100K | 1126.0 | 2193.1 | **1.95x** |
| sum | 100K | 1157.3 | 2358.8 | **2.04x** |
| count+sum | 100K | 1565.7 | 2710.6 | **1.73x** |
| count | 50M | 3218.5 | 5995.2 | **1.86x** |
| sum | 50M | 3361.9 | 5450.5 | **1.62x** |
| count+sum | 50M | 4475.8 | 6632.4 | **1.48x** |

All benchmarks run on Apple M4 Max, macOS 15.4. The 0.97x for count+sum at 10 groups in the Python in-memory benchmark is within measurement noise (single-agg cases at the same cardinality show 1.03-1.08x). Multi-file parquet shows the largest gains because inline aggregation eliminates per-partition `Vec<Vec<u64>>` allocation overhead that compounds across 10 concurrent partition tasks.

### ClickBench (c6a.4xlarge, 500GB gp2, 100 partitioned parquet files, ~100M rows)

Tested with #6558 merged as the baseline, then this PR on top. Total is the sum of best-of-3 across all 43 ClickBench queries.
| Version | Total | vs v0.7.4 |
|---------|------:|----------:|
| v0.7.4 | 118.0s | baseline |
| v0.7.5 (regressed) | 139.6s | +18% |
| #6558 only | 118.4s | +0.3% |
| **#6558 + this PR** | **106.6s** | **-10%** |

Queries with the largest improvements from inline agg:

| Query | #6558 only | + this PR | Speedup | Pattern |
|------:|-----------:|----------:|--------:|---------|
| Q33 | 13.09s | 9.51s | **1.38x** | `GROUP BY WatchID, ClientIP` (high cardinality) |
| Q19 | 8.19s | 6.74s | **1.22x** | `GROUP BY UserID, minute, SearchPhrase` |
| Q34 | 9.99s | 7.92s | **1.26x** | `GROUP BY URL` (high cardinality string) |
| Q18 | 2.56s | 1.83s | **1.40x** | `GROUP BY UserID, SearchPhrase` |
| Q16 | 2.20s | 1.68s | **1.31x** | `GROUP BY UserID` |
| Q13 | 1.56s | 1.25s | **1.25x** | `GROUP BY SearchPhrase` + COUNT |
| Q32 | 2.09s | 1.71s | **1.22x** | `GROUP BY WatchID, ClientIP` + SUM/AVG |

No regressions observed. Queries without GROUP BY or using unsupported agg types (AVG, COUNT DISTINCT) are unchanged, as expected.

## Related Issues

Addresses #6343

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
main
1 hour ago
migrate other changes
slade/uuid-type
1 hour ago
fix file.rs
kvthr:write_metrics
3 hours ago

Latest Branches

Performance gauge: 0%
refactor: Use DAFT_REF_NAME and DAFT_SHA env vars for benchmark run metadata (#6610)
1 hour ago
6a4cb17
jeev/benchmarking
Performance gauge: 0%
24 hours ago
b40610f
slade/uuid-type
Performance gauge: 0%
3 hours ago
53c04ac
kvthr:write_metrics
© 2026 CodSpeed Technology