perf: inline vectorized aggregation for grouped count/sum (#6345)
The standard grouped aggregation path in `agg_groupby` materializes
per-group index lists (`Vec<Vec<u64>>`) during `make_groups()`, then
iterates each group's indices to compute aggregates. At high cardinality
this means millions of small heap allocations and scattered memory
access during the aggregation phase. At low cardinality this is acceptable,
since the index lists are few and large, but the approach scales poorly as
the group count grows.
This PR adds an alternative two-phase vectorized path for count and sum
that eliminates the per-group index lists entirely.
## Algorithm
**Phase 1 (hash probe)**
A single pass over the groupby columns builds a dense `group_ids:
Vec<u32>` mapping each row to its group, plus `group_sizes: Vec<u64>`
tracking per-group row counts.
As a special case, when the key is a single integer column, we use an
`FnvHashMap` keyed on raw values. Multi-column or non-integer keys fall back
to the existing `IndexHash` + comparator path.
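The single-key fast path can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: `probe_single_key` is an illustrative name, and `std::collections::HashMap` stands in for `FnvHashMap` to keep it self-contained.

```rust
use std::collections::HashMap;

/// Phase 1 sketch: one pass over a single integer key column produces a
/// dense group id per row plus per-group row counts.
fn probe_single_key(keys: &[i64]) -> (Vec<u32>, Vec<u64>) {
    // The PR uses FnvHashMap for speed; std HashMap keeps this runnable as-is.
    let mut key_to_gid: HashMap<i64, u32> = HashMap::new();
    let mut group_ids: Vec<u32> = Vec::with_capacity(keys.len());
    let mut group_sizes: Vec<u64> = Vec::new();

    for &k in keys {
        let next = key_to_gid.len() as u32;
        let gid = *key_to_gid.entry(k).or_insert(next);
        if gid as usize == group_sizes.len() {
            group_sizes.push(0); // first row seen for this group
        }
        group_sizes[gid as usize] += 1;
        group_ids.push(gid);
    }
    (group_ids, group_sizes)
}

fn main() {
    // Mirrors the 5-row example in the diagram below: two groups A/B.
    let (group_ids, group_sizes) = probe_single_key(&[10, 20, 10, 20, 10]);
    assert_eq!(group_ids, vec![0, 1, 0, 1, 0]);
    assert_eq!(group_sizes, vec![3, 2]);
    println!("ok");
}
```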
**Phase 2 (vectorized agg)**
Each accumulator runs a tight loop over the dense `group_ids` array,
e.g. `sums[gid] += vals[row]`.
Count accumulators skip the scatter loop entirely, using pre-computed
`group_sizes` from Phase 1.
Separating accumulation from hash probing and iterating `group_ids`
sequentially keeps each loop body minimal and the access pattern
cache-friendly.
Accumulators use an `AggAccumulator` enum rather than trait objects to
avoid vtable dispatch in the hot loops.
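The dispatch-free inner loops might look roughly like this. A sketch only: the variant names, fields, and `update` signature are assumptions for illustration, not the PR's actual `AggAccumulator` definition.

```rust
/// Sketch of an enum-based accumulator: the enum is matched once per batch,
/// so the per-row loop carries no vtable dispatch.
enum AggAccumulator {
    Sum { sums: Vec<f64> },
    Count, // counts come straight from Phase 1's group_sizes
}

impl AggAccumulator {
    fn update(&mut self, group_ids: &[u32], vals: &[f64], group_sizes: &[u64]) -> Vec<f64> {
        match self {
            AggAccumulator::Sum { sums } => {
                // Phase 2: tight scatter loop over the dense group_ids array.
                for (&gid, &v) in group_ids.iter().zip(vals) {
                    sums[gid as usize] += v;
                }
                sums.clone()
            }
            // Count skips the scatter loop entirely.
            AggAccumulator::Count => group_sizes.iter().map(|&c| c as f64).collect(),
        }
    }
}

fn main() {
    // Same 5-row, 2-group example as the diagram below.
    let group_ids = [0u32, 1, 0, 1, 0];
    let vals = [10.0, 20.0, 20.0, 40.0, 40.0];
    let group_sizes = [3u64, 2];

    let mut sum = AggAccumulator::Sum { sums: vec![0.0; 2] };
    assert_eq!(sum.update(&group_ids, &vals, &group_sizes), vec![70.0, 60.0]);

    let mut count = AggAccumulator::Count;
    assert_eq!(count.update(&group_ids, &vals, &group_sizes), vec![3.0, 2.0]);
    println!("ok");
}
```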
```
Old path (make_groups + eval_agg_expression):
hash probe per-group gather + aggregate
---------- ----------------------------
row 0 -> Group A: [0, 2, 4] -> vals[0], vals[2], vals[4] -> sum = 70
row 1 -> Group B: [1, 3] -> vals[1], vals[3] -> sum = 60
row 2 -> ...
row 3 -> ... G heap allocations, scattered access
row 4 -> ...
New path (inline two-phase):
Phase 1: hash probe Phase 2: vectorized agg
----------------------- --------------------------
row 0 -> gid=0 for (gid, val) in group_ids.zip(vals):
row 1 -> gid=1 sums[gid] += val
row 2 -> gid=0
row 3 -> gid=1 group_ids = [0, 1, 0, 1, 0]
row 4 -> gid=0 sums = [70, 60]
2 allocations, sequential access
```
## Benchmarks
### Rust-level benchmark (5M rows, single int64 key, no query planning)
| agg | groups | inline (ms) | fallback (ms) | speedup |
|-----|--------|-------------|---------------|---------|
| count | 10 | 31.0 | 31.4 | 1.01x |
| sum | 10 | 35.5 | 45.2 | **1.27x** |
| count+sum | 10 | 35.6 | 46.4 | **1.30x** |
| count | 1K | 31.6 | 38.1 | **1.21x** |
| sum | 1K | 35.1 | 87.3 | **2.49x** |
| count+sum | 1K | 35.7 | 90.1 | **2.52x** |
| count | 100K | 42.2 | 114.6 | **2.72x** |
| sum | 100K | 47.7 | 180.2 | **3.78x** |
| count+sum | 100K | 47.9 | 180.3 | **3.77x** |
| count | 5M | 192.4 | 689.9 | **3.59x** |
| sum | 5M | 204.4 | 869.6 | **4.25x** |
| count+sum | 5M | 221.8 | 972.1 | **4.38x** |
### Rust-level Q1-like benchmark (2 string keys, 6 groups, float64 sum+count)
| rows | inline (ms) | fallback (ms) | speedup |
|------|-------------|---------------|---------|
| 1.2M | 31.6 | 36.7 | **1.16x** |
| 5M | 135.5 | 156.7 | **1.16x** |
### Python end-to-end benchmark (5M rows, includes query planning)
| agg | groups | inline (ms) | fallback (ms) | speedup |
|-----|--------|-------------|---------------|---------|
| count | 10 | 119.1 | 122.6 | 1.03x |
| sum | 10 | 176.3 | 190.8 | 1.08x |
| count+sum | 10 | 175.9 | 170.5 | 0.97x |
| count | 1K | 170.9 | 176.3 | 1.03x |
| sum | 1K | 124.9 | 215.4 | **1.72x** |
| count+sum | 1K | 127.4 | 219.4 | **1.72x** |
| count | 100K | 179.4 | 292.0 | **1.63x** |
| sum | 100K | 176.8 | 357.9 | **2.02x** |
| count+sum | 100K | 183.6 | 372.5 | **2.03x** |
| count | 5M | 786.7 | 1316.5 | **1.67x** |
| sum | 5M | 828.5 | 1419.7 | **1.71x** |
| count+sum | 5M | 839.4 | 1426.7 | **1.70x** |
### Multi-file parquet benchmark (50M rows, 10 files)
| agg | groups | inline (ms) | fallback (ms) | speedup |
|-----|--------|-------------|---------------|---------|
| count | 10 | 149.6 | 226.4 | **1.51x** |
| sum | 10 | 152.9 | 246.5 | **1.61x** |
| count+sum | 10 | 158.6 | 242.0 | **1.53x** |
| count | 1K | 163.2 | 485.7 | **2.98x** |
| sum | 1K | 174.0 | 506.5 | **2.91x** |
| count+sum | 1K | 181.1 | 486.9 | **2.69x** |
| count | 100K | 1126.0 | 2193.1 | **1.95x** |
| sum | 100K | 1157.3 | 2358.8 | **2.04x** |
| count+sum | 100K | 1565.7 | 2710.6 | **1.73x** |
| count | 50M | 3218.5 | 5995.2 | **1.86x** |
| sum | 50M | 3361.9 | 5450.5 | **1.62x** |
| count+sum | 50M | 4475.8 | 6632.4 | **1.48x** |
All benchmarks run on Apple M4 Max, macOS 15.4. The 0.97x for count+sum
at 10 groups in the Python in-memory benchmark is within measurement
noise (single-agg cases at the same cardinality show 1.03-1.08x).
Multi-file parquet shows the largest gains because inline aggregation
eliminates per-partition `Vec<Vec<u64>>` allocation overhead that
compounds across 10 concurrent partition tasks.
### ClickBench (c6a.4xlarge, 500GB gp2, 100 partitioned parquet files, ~100M rows)
Tested with #6558 merged as the baseline, then this PR on top. Total is
sum of best-of-3 across all 43 ClickBench queries.
| Version | Total | vs v0.7.4 |
|---------|------:|----------:|
| v0.7.4 | 118.0s | baseline |
| v0.7.5 (regressed) | 139.6s | +18% |
| #6558 only | 118.4s | +0.3% |
| **#6558 + this PR** | **106.6s** | **-10%** |
Queries with the largest improvements from inline agg:
| Query | #6558 only | + this PR | Speedup | Pattern |
|------:|-----------:|----------:|--------:|---------|
| Q33 | 13.09s | 9.51s | **1.38x** | `GROUP BY WatchID, ClientIP` (high cardinality) |
| Q19 | 8.19s | 6.74s | **1.22x** | `GROUP BY UserID, minute, SearchPhrase` |
| Q34 | 9.99s | 7.92s | **1.26x** | `GROUP BY URL` (high cardinality string) |
| Q18 | 2.56s | 1.83s | **1.40x** | `GROUP BY UserID, SearchPhrase` |
| Q16 | 2.20s | 1.68s | **1.31x** | `GROUP BY UserID` |
| Q13 | 1.56s | 1.25s | **1.25x** | `GROUP BY SearchPhrase` + COUNT |
| Q32 | 2.09s | 1.71s | **1.22x** | `GROUP BY WatchID, ClientIP` + SUM/AVG |
No regressions observed. Queries without GROUP BY or using unsupported
agg types (AVG, COUNT DISTINCT) are unchanged, as expected.
## Related Issues
Addresses #6343
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>