# feat(temporal): add Spark-style add_months and months_between (#6913)
## Summary
Implements two missing functions from issue #3798 by adding Spark-style
`add_months` and `months_between` as native Daft temporal expressions.
This PR adds two new scalar UDFs in the temporal module, wires them
through the Python and SQL surfaces, and adds regression coverage. Both
functions match Spark's documented semantics, including end-of-month
clamping in `add_months` and the 8-decimal rounding in `months_between`.
## Why
The issue asks for parity with PySpark's temporal functions. This PR
focuses on two practical pieces:
- Calendar-month arithmetic (`add_months`) with correct end-of-month
clamping.
- Spark-compatible `months_between` including the same-day /
both-last-day fast paths and time-of-day fractional math.
## Changes Made
- Add `AddMonths` and `MonthsBetween` scalar UDFs in
`src/daft-functions-temporal/src/date_arithmetic.rs`:
- `AddMonths` uses `chrono::Months::checked_add_months` /
`checked_sub_months` for end-of-month clamping.
- `MonthsBetween` casts both inputs to `Timestamp(us, None)`, applies
same-day / both-last-day shortcuts, and rounds to 8 decimal places.
- Register both UDFs in `src/daft-functions-temporal/src/lib.rs`.
- Add SQL handlers `SQLAddMonths` and `SQLMonthsBetween` in
`src/daft-sql/src/modules/temporal.rs` with Spark argument order.
- Add Python wrappers `add_months` and `months_between` in
`daft/functions/datetime.py` and export them from
`daft/functions/__init__.py`.
- Add focused tests in `tests/dataframe/test_temporals.py`:
- `add_months` coverage: basic, EOM clamping (incl. leap-year
2024-02-29), negative months, year rollover, Timestamp input, null
propagation.
- `months_between` coverage: same day-of-month, both-last-day,
day-difference, time-of-day fraction, Spark doc example (`3.94959677`),
negative direction, null propagation.
- SQL integration test covering both functions.
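The end-of-month clamping that `chrono::Months::checked_add_months` / `checked_sub_months` provide can be sketched in plain Python. This `add_months` is an illustrative stand-in for the semantics, not Daft's Rust implementation:

```python
from datetime import date
import calendar

def add_months(d: date, n: int) -> date:
    # Shift by whole calendar months, then clamp the day to the last
    # valid day of the target month (Spark / chrono semantics).
    total = d.year * 12 + (d.month - 1) + n
    year, month = divmod(total, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))
```

With this sketch, `add_months(date(2023, 1, 31), 1)` clamps to `date(2023, 2, 28)`, while the 2024 leap year yields `date(2024, 2, 29)`.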
## Behavior
- `add_months('2023-01-31', 1)` returns `2023-02-28`;
`add_months('2024-01-31', 1)` returns `2024-02-29`.
- `add_months` always returns Date, even when the input is a Timestamp.
- `months_between('1997-02-28', '1996-10-30')` returns `3.93548387`
(pure-date inputs).
- `months_between('1997-02-28 10:30:00', '1996-10-30')` returns
`3.94959677` (Spark doc example).
- `months_between` returns a whole number (no fractional part) when both
inputs share the same day-of-month or are both the last day of their
respective months.
- Null in either input row propagates to null in the output.
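Spark's documented `months_between` math — whole-month delta plus a day/time fraction over a 31-day month, rounded to 8 decimals — can be reproduced in a short Python sketch (illustrative only; the PR's implementation is the Rust UDF described above):

```python
from datetime import datetime
import calendar

SECONDS_PER_DAY = 86400
SECONDS_PER_MONTH = 31 * SECONDS_PER_DAY  # Spark treats every month as 31 days here

def months_between(end: datetime, start: datetime) -> float:
    month_diff = (end.year - start.year) * 12 + (end.month - start.month)
    last_end = calendar.monthrange(end.year, end.month)[1]
    last_start = calendar.monthrange(start.year, start.month)[1]
    # Fast paths: same day-of-month, or both inputs on the last day of their month.
    if end.day == start.day or (end.day == last_end and start.day == last_start):
        return float(month_diff)
    secs_end = end.hour * 3600 + end.minute * 60 + end.second
    secs_start = start.hour * 3600 + start.minute * 60 + start.second
    seconds_diff = (end.day - start.day) * SECONDS_PER_DAY + secs_end - secs_start
    # Spark rounds the fractional result to 8 decimal places.
    return round(month_diff + seconds_diff / SECONDS_PER_MONTH, 8)
```

This reproduces both values above: `months_between(datetime(1997, 2, 28, 10, 30), datetime(1996, 10, 30))` gives `3.94959677`, and the pure-date variant gives `3.93548387`.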
## Test Plan
- `cargo check -p daft-functions-temporal -p daft-sql`
- `make build`
- `DAFT_RUNNER=native pytest -q tests/dataframe/test_temporals.py -k
"add_months or months_between"`
## Related Issues
- Part of #3798

# perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping
## Summary
Extends Item 3 of #6585 (and builds on PR #6748). For the
two-string-column groupby shape, this change packs the two `u32`
symbol IDs produced by `symbolize_column` into a single `u64` key and
groups against a typed `FnvHashMap<u64, u32>` in a tight integer loop
— no per-row comparator closure, no `IndexHash`, no dynamic-typed
`Series` equality dispatch.
PR #6748 already symbolizes Utf8/Binary group-by columns, but the
symbolized columns are still fed into the generic multi-column hash
path (`agg_generic_hash_path`), which keeps the comparator-closure
overhead. This PR captures the remaining benefit by routing the
two-string-column shape — including the TPC-H Q1 `l_returnflag` /
`l_linestatus` pattern — through a dedicated typed map. Grouping
semantics and final query results are unchanged.
## Conceptual example
```
Input rows: After symbolization: Packed u64 key:
key1 | key2 key1_sym | key2_sym (key1_sym << 32) | key2_sym
--------|-------- ---------|--------- -------------------------
"alice" | "red" 0 | 0 0x00000000_00000000
"bob" | "red" 1 | 0 0x00000001_00000000
"alice" | "blue" 0 | 1 0x00000000_00000001
"alice" | "red" 0 | 0 0x00000000_00000000
```
The two symbol spaces sit in disjoint 32-bit halves of the u64 key,
so distinct (sym0, sym1) pairs always yield distinct packed keys.
Null-equals-null is preserved: when a column has nulls,
`symbolize_column` reserves symbol ID 0 for null and starts non-null
IDs at 1, so both-null rows share a unique packed key and never collide
with non-null rows.
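The symbolize-then-pack scheme above, including the reserved null symbol, can be sketched in Python. The helper names here are illustrative; the real logic lives in `symbolize_column` and `agg_packed_u64_path`:

```python
def symbolize(col):
    # Map each value to a dense u32 symbol ID. When the column has nulls,
    # reserve ID 0 for null and start non-null IDs at 1 (null-equals-null).
    has_null = any(v is None for v in col)
    table = {}
    next_id = 1 if has_null else 0
    syms = []
    for v in col:
        if v is None:
            syms.append(0)
            continue
        if v not in table:
            table[v] = next_id
            next_id += 1
        syms.append(table[v])
    return syms

def pack(sym0, sym1):
    # Disjoint 32-bit halves: distinct (sym0, sym1) pairs always give
    # distinct packed u64 keys.
    return [(a << 32) | b for a, b in zip(sym0, sym1)]

k1 = symbolize(["alice", "bob", "alice", "alice"])
k2 = symbolize(["red", "red", "blue", "red"])
packed = pack(k1, k2)  # matches the conceptual example: [0x0, 0x1_00000000, 0x1, 0x0]
```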
## Key changes
- Add `agg_packed_u64_path` for exactly two Utf8/Binary group-by
columns: symbolize each column into a `Vec<u32>`, pack the pair into
a `u64`, group with `FnvHashMap<u64, u32>` using the same
Vacant/Occupied pattern as the existing single-column fast paths.
- Extract a small `symbolize_string_col` helper that returns
`Ok(Some(Vec<u32>))` for Utf8/Binary and `Ok(None)` otherwise — used
by the new path and isolates per-dtype null/value-accessor wiring.
- Dispatch in `agg_groupby_inline` tries the packed-u64 path first for
multi-column shapes, then falls through to the existing
`agg_symbolized_path`, then to `agg_generic_hash_path`. All other
multi-column shapes (3+ columns, int×string, pure int multi-col) are
unchanged.
- No avg-bytes-per-row gate: unlike `agg_symbolized_path`, the
packed-u64 path's cost is one symbolize pass plus a tight `u64`-keyed
loop, not a symbolized RecordBatch rebuild fed through the generic
comparator-closure hash path. Short-string shapes (TPC-H Q1) win
along with long-string shapes.
- New `bench_packed_u64_two_strings` Rust-level benchmark covering
two-Utf8-column shapes at varying cardinalities and across the full
count/sum/min/max/count+sum agg matrix.
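The tight `u64`-keyed grouping loop is the core of the path: each packed key is looked up in a typed map and assigned a dense group ID, with no per-row comparator. A Python sketch (dict `setdefault` standing in for the `FnvHashMap` Vacant/Occupied pattern):

```python
def group_ids(packed_keys):
    # Typed map from packed u64 key -> dense group id; mirrors the
    # FnvHashMap Vacant/Occupied pattern from the single-column fast paths.
    table = {}
    ids = []
    for k in packed_keys:
        ids.append(table.setdefault(k, len(table)))
    return ids, len(table)
```

Feeding the packed keys from the conceptual example through this loop yields group IDs `[0, 1, 2, 0]` over 3 distinct groups.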
## Benchmarks
### Rust-level Q1-like benchmark (2 short string keys, 6 groups, sum+count of two float64 cols)
| rows | inline (ms) | fallback (ms) | speedup |
|------|-------------|---------------|---------|
| 1.2M | 21.75 | 33.74 | 1.55x |
| 5M | 111.34 | 175.12 | 1.57x |
### Rust-level packed-u64 benchmark (2 long Utf8 keys, int64 val, full agg matrix)
| agg | rows | distinct | inline (ms) | fallback (ms) | speedup |
|-----------|------|-------------------|-------------|---------------|---------|
| count | 1.2M | 8 × 4 = 32 | 57.05 | 28.53 | 0.50x |
| sum | 1.2M | 8 × 4 = 32 | 56.05 | 33.73 | 0.60x |
| min | 1.2M | 8 × 4 = 32 | 57.66 | 33.31 | 0.58x |
| max | 1.2M | 8 × 4 = 32 | 56.23 | 33.97 | 0.60x |
| count+sum | 1.2M | 8 × 4 = 32 | 56.75 | 34.10 | 0.60x |
| count | 1.2M | 64 × 32 = 2048 | 55.56 | 27.64 | 0.50x |
| sum | 1.2M | 64 × 32 = 2048 | 62.70 | 39.52 | 0.63x |
| min | 1.2M | 64 × 32 = 2048 | 54.75 | 41.41 | 0.76x |
| max | 1.2M | 64 × 32 = 2048 | 56.90 | 41.01 | 0.72x |
| count+sum | 1.2M | 64 × 32 = 2048 | 55.77 | 40.71 | 0.73x |
| count | 5M | 8 × 4 = 32 | 246.37 | 127.56 | 0.52x |
| sum | 5M | 8 × 4 = 32 | 235.03 | 142.14 | 0.60x |
| min | 5M | 8 × 4 = 32 | 235.38 | 140.41 | 0.60x |
| max | 5M | 8 × 4 = 32 | 238.82 | 142.38 | 0.60x |
| count+sum | 5M | 8 × 4 = 32 | 240.04 | 142.84 | 0.60x |
| count | 5M | 1000 × 100 = 100k | 241.55 | 132.99 | 0.55x |
| sum | 5M | 1000 × 100 = 100k | 243.80 | 207.44 | 0.85x |
| min | 5M | 1000 × 100 = 100k | 246.93 | 210.96 | 0.85x |
| max | 5M | 1000 × 100 = 100k | 245.68 | 212.00 | 0.86x |
| count+sum | 5M | 1000 × 100 = 100k | 239.63 | 206.81 | 0.86x |
The long-string shapes are below 1.0x vs the Daft fallback, but that
gap is pre-existing in PR #6748's inline path — Daft's general groupby
machinery is currently faster for these specific shapes. The delta
this PR introduces is the comparison below.
### Inline-vs-inline: packed-u64 vs PR #6748 inline (same shape, same machine)
| shape | PR #6748 inline (ms) | packed-u64 (ms) | speedup |
|----------------------------------------|----------------------|-----------------|---------|
| Q1 1.2M × 6 short strings (sum+count) | 23.07 | 21.75 | 1.06x |
| Q1 5M × 6 short strings (sum+count) | 120.15 | 111.34 | 1.08x |
| 1.2M × 32 long strings (count) | 61.69 | 57.05 | 1.08x |
| 1.2M × 32 long strings (sum) | 65.88 | 56.05 | 1.18x |
| 1.2M × 32 long strings (min) | 62.78 | 57.66 | 1.09x |
| 1.2M × 32 long strings (max) | 63.70 | 56.23 | 1.13x |
| 1.2M × 32 long strings (count+sum) | 64.76 | 56.75 | 1.14x |
| 1.2M × 2048 long strings (count) | 62.43 | 55.56 | 1.12x |
| 1.2M × 2048 long strings (sum) | 62.62 | 62.70 | 1.00x |
| 1.2M × 2048 long strings (min) | 64.45 | 54.75 | 1.18x |
| 1.2M × 2048 long strings (max) | 63.78 | 56.90 | 1.12x |
| 1.2M × 2048 long strings (count+sum) | 66.10 | 55.77 | 1.19x |
| 5M × 32 long strings (count) | 293.04 | 246.37 | 1.19x |
| 5M × 32 long strings (sum) | 265.15 | 235.03 | 1.13x |
| 5M × 32 long strings (min) | 263.15 | 235.38 | 1.12x |
| 5M × 32 long strings (max) | 267.26 | 238.82 | 1.12x |
| 5M × 32 long strings (count+sum) | 262.31 | 240.04 | 1.09x |
| 5M × 100k long strings (count) | 268.71 | 241.55 | 1.11x |
| 5M × 100k long strings (sum) | 269.30 | 243.80 | 1.10x |
| 5M × 100k long strings (min) | 267.40 | 246.93 | 1.08x |
| 5M × 100k long strings (max) | 269.19 | 245.68 | 1.10x |
| 5M × 100k long strings (count+sum) | 270.75 | 239.63 | 1.13x |
Across all 22 measured shapes, packed-u64 is 1.06x to 1.19x faster
than PR #6748's inline path. This is the "missing benefit" the
comparator-closure / IndexHash dispatch was hiding.
All benchmarks were run on Linux (WSL) with Rust nightly `--release`,
warmup=3, iters=10, `--test-threads=1`. Commands:
- `cargo test -p daft-recordbatch --release -- bench_q1_like --nocapture --ignored --test-threads=1`
- `cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1`
## Test plan
- 5 new test_inline_packed_u64_* cases comparing inline output to
the fallback path:
- Utf8 × Utf8, no nulls.
- Utf8 × Utf8, nulls in both columns.
- Binary × Binary.
- Utf8 × Binary.
- Short-string CHAR(1) shape (TPC-H Q1).
- All existing inline_agg tests still pass (48 total, 0 failures).
- New bench_packed_u64_two_strings benchmark exercises the path
end-to-end at scale across the count/sum/min/max/count+sum matrix.
## Related issues
Part of #6585 (deeper specialization of Item 3, on top of PR #6748).