Eventual-Inc
Daft

Performance History

Latest Results

fix: revert "feat(daft-ext): scalar daft_func macro with overloading (#6844)" (#6925)

## Changes Made

This reverts commit afb30afd237dc8babb0a4269243db9bc157f362f. It is an immediate fix for an extension regression that is blocking the 0.7.11 release. Once the release is unblocked, this can be reintroduced with a longer-term fix. Alternatively, the overloading support could be removed for now and the proc macro kept, as an immediate follow-up.

## Related Issues

https://github.com/Eventual-Inc/Daft/issues/6922

This has not been verified, but a patch could look like the snippet below. That said, the preferred longer-term fix uses interior mutability with a scalar function factory, which would repair overload registration and allow overloads for both extensions and Daft internals.

```rs
// When exactly one variant is registered, skip overload resolution and
// return it directly as a synchronous builtin scalar function.
if self.variants.len() == 1 {
    return Ok(BuiltinScalarFnVariant::Sync(Arc::new(
        self.variants[0].as_ref().clone(),
    )));
}
```
main
46 minutes ago
feat(temporal): add Spark-style add_months and months_between (#6913)

## Summary

Implements two missing functions from issue #3798 by adding Spark-style `add_months` and `months_between` as native Daft temporal expressions. This PR adds two new scalar UDFs in the temporal module, wires them through the Python and SQL surfaces, and adds regression coverage. Both functions match Spark's documented semantics, including end-of-month clamping in `add_months` and the 8-decimal rounding in `months_between`.

## Why

The issue asks for parity with PySpark's temporal functions. This PR focuses on two practical pieces:

- Calendar-month arithmetic (`add_months`) with correct end-of-month clamping.
- Spark-compatible `months_between`, including the same-day / both-last-day fast paths and time-of-day fractional math.

## Changes Made

- Add `AddMonths` and `MonthsBetween` scalar UDFs in `src/daft-functions-temporal/src/date_arithmetic.rs`:
  - `AddMonths` uses `chrono::Months::checked_add_months` / `checked_sub_months` for end-of-month clamping.
  - `MonthsBetween` casts both inputs to `Timestamp(us, None)`, applies the same-day / both-last-day shortcuts, and rounds to 8 decimal places.
- Register both UDFs in `src/daft-functions-temporal/src/lib.rs`.
- Add SQL handlers `SQLAddMonths` and `SQLMonthsBetween` in `src/daft-sql/src/modules/temporal.rs` with Spark argument order.
- Add Python wrappers `add_months` and `months_between` in `daft/functions/datetime.py` and export them from `daft/functions/__init__.py`.
- Add focused tests in `tests/dataframe/test_temporals.py`:
  - `add_months` coverage: basic, EOM clamping (incl. leap-year 2024-02-29), negative months, year rollover, Timestamp input, null propagation.
  - `months_between` coverage: same day-of-month, both-last-day, day-difference, time-of-day fraction, Spark doc example (`3.94959677`), negative direction, null propagation.
  - SQL integration test covering both functions.

## Behavior

- `add_months('2023-01-31', 1)` returns `2023-02-28`; `add_months('2024-01-31', 1)` returns `2024-02-29`.
- `add_months` always returns Date, even when the input is a Timestamp.
- `months_between('1997-02-28', '1996-10-30')` returns `3.93548387` (pure-date inputs).
- `months_between('1997-02-28 10:30:00', '1996-10-30')` returns `3.94959677` (Spark doc example).
- `months_between` returns an integer when both inputs share the same day-of-month or are both the last day of their respective months.
- A null in either input row propagates to a null in the output.

## Test Plan

- `cargo check -p daft-functions-temporal -p daft-sql`
- `make build`
- `DAFT_RUNNER=native pytest -q tests/dataframe/test_temporals.py -k "add_months or months_between"`

## Related Issues

- Part of #3798
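The clamping and fractional-month rules above can be sketched in plain Python. This is a hedged reference illustration of the Spark-style semantics the PR describes, not Daft's actual Rust implementation; the function names `add_months_ref` and `months_between_ref` are invented for this sketch.

```python
import calendar
from datetime import date, datetime

def add_months_ref(d: date, n: int) -> date:
    # Shift by n calendar months, clamping to the last day of the target month.
    m = d.month - 1 + n
    year, month = d.year + m // 12, m % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def months_between_ref(d1: datetime, d2: datetime) -> float:
    # Integer result when both days-of-month match or both are month-end;
    # otherwise add a day/time-of-day fraction over a fixed 31-day month,
    # rounded to 8 decimal places.
    last1 = calendar.monthrange(d1.year, d1.month)[1]
    last2 = calendar.monthrange(d2.year, d2.month)[1]
    months = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    if d1.day == d2.day or (d1.day == last1 and d2.day == last2):
        return float(months)
    secs1 = d1.hour * 3600 + d1.minute * 60 + d1.second
    secs2 = d2.hour * 3600 + d2.minute * 60 + d2.second
    frac = (d1.day + secs1 / 86400) - (d2.day + secs2 / 86400)
    return round(months + frac / 31, 8)
```

With these definitions, `add_months_ref(date(2023, 1, 31), 1)` clamps to `2023-02-28`, and `months_between_ref(datetime(1997, 2, 28, 10, 30), datetime(1996, 10, 30))` reproduces the Spark doc example `3.94959677`.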
main
3 hours ago
perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping

## Summary

Extends Item 3 of #6585 (and builds on PR #6748). For the two-string-column groupby shape, this change packs the two `u32` symbol IDs produced by `symbolize_column` into a single `u64` key and groups against a typed `FnvHashMap<u64, u32>` in a tight integer loop: no per-row comparator closure, no `IndexHash`, no dynamic-typed `Series` equality dispatch.

PR #6748 already symbolizes Utf8/Binary group-by columns, but the symbolized columns are still fed into the generic multi-column hash path (`agg_generic_hash_path`), which keeps the comparator-closure overhead. This PR captures the remaining benefit by routing the two-string-column shape, including the TPC-H Q1 `l_returnflag` / `l_linestatus` pattern, through a dedicated typed map. Grouping semantics and final query results are unchanged.

## Conceptual example

```
Input rows:          After symbolization:    Packed u64 key:
key1    | key2       key1_sym | key2_sym     (key1_sym << 32) | key2_sym
--------|--------    ---------|---------     -------------------------
"alice" | "red"      0        | 0            0x00000000_00000000
"bob"   | "red"      1        | 0            0x00000001_00000000
"alice" | "blue"     0        | 1            0x00000000_00000001
"alice" | "red"      0        | 0            0x00000000_00000000
```

The two symbol spaces sit in disjoint 32-bit halves of the u64 key, so distinct (sym0, sym1) pairs always yield distinct packed keys. Null-equals-null is preserved: when a column has nulls, `symbolize_column` reserves symbol ID 0 for null and starts non-null IDs at 1, so both-null rows share a unique packed key and never collide with non-null rows.

## Key changes

- Add `agg_packed_u64_path` for exactly two Utf8/Binary group-by columns: symbolize each column into a `Vec<u32>`, pack the pair into a `u64`, and group with `FnvHashMap<u64, u32>` using the same Vacant/Occupied pattern as the existing single-column fast paths.
- Extract a small `symbolize_string_col` helper that returns `Ok(Some(Vec<u32>))` for Utf8/Binary and `Ok(None)` otherwise; it is used by the new path and isolates the per-dtype null/value-accessor wiring.
- Dispatch in `agg_groupby_inline` tries the packed-u64 path first for multi-column shapes, then falls through to the existing `agg_symbolized_path`, then to `agg_generic_hash_path`. All other multi-column shapes (3+ columns, int×string, pure int multi-col) are unchanged.
- No avg-bytes-per-row gate: unlike `agg_symbolized_path`, the packed-u64 path's cost is one symbolize pass plus a tight `u64`-keyed loop, not a symbolized RecordBatch rebuild fed through the generic comparator-closure hash path. Short-string shapes (TPC-H Q1) win along with long-string shapes.
- New `bench_packed_u64_two_strings` Rust-level benchmark covering two-Utf8-column shapes at varying cardinalities and across the full count/sum/min/max/count+sum agg matrix.

## Benchmarks

### Rust-level Q1-like benchmark (2 short string keys, 6 groups, sum+count of two float64 cols)

| rows | inline (ms) | fallback (ms) | speedup |
|------|-------------|---------------|---------|
| 1.2M | 21.75       | 33.74         | 1.55x   |
| 5M   | 111.34      | 175.12        | 1.57x   |

### Rust-level packed-u64 benchmark (2 long Utf8 keys, int64 val, full agg matrix)

| agg       | rows | distinct          | inline (ms) | fallback (ms) | speedup |
|-----------|------|-------------------|-------------|---------------|---------|
| count     | 1.2M | 8 × 4 = 32        | 57.05       | 28.53         | 0.50x   |
| sum       | 1.2M | 8 × 4 = 32        | 56.05       | 33.73         | 0.60x   |
| min       | 1.2M | 8 × 4 = 32        | 57.66       | 33.31         | 0.58x   |
| max       | 1.2M | 8 × 4 = 32        | 56.23       | 33.97         | 0.60x   |
| count+sum | 1.2M | 8 × 4 = 32        | 56.75       | 34.10         | 0.60x   |
| count     | 1.2M | 64 × 32 = 2048    | 55.56       | 27.64         | 0.50x   |
| sum       | 1.2M | 64 × 32 = 2048    | 62.70       | 39.52         | 0.63x   |
| min       | 1.2M | 64 × 32 = 2048    | 54.75       | 41.41         | 0.76x   |
| max       | 1.2M | 64 × 32 = 2048    | 56.90       | 41.01         | 0.72x   |
| count+sum | 1.2M | 64 × 32 = 2048    | 55.77       | 40.71         | 0.73x   |
| count     | 5M   | 8 × 4 = 32        | 246.37      | 127.56        | 0.52x   |
| sum       | 5M   | 8 × 4 = 32        | 235.03      | 142.14        | 0.60x   |
| min       | 5M   | 8 × 4 = 32        | 235.38      | 140.41        | 0.60x   |
| max       | 5M   | 8 × 4 = 32        | 238.82      | 142.38        | 0.60x   |
| count+sum | 5M   | 8 × 4 = 32        | 240.04      | 142.84        | 0.60x   |
| count     | 5M   | 1000 × 100 = 100k | 241.55      | 132.99        | 0.55x   |
| sum       | 5M   | 1000 × 100 = 100k | 243.80      | 207.44        | 0.85x   |
| min       | 5M   | 1000 × 100 = 100k | 246.93      | 210.96        | 0.85x   |
| max       | 5M   | 1000 × 100 = 100k | 245.68      | 212.00        | 0.86x   |
| count+sum | 5M   | 1000 × 100 = 100k | 239.63      | 206.81        | 0.86x   |

The long-string shapes are below 1.0x vs the Daft fallback, but that gap is pre-existing in PR #6748's inline path: Daft's general groupby machinery is currently faster for these specific shapes. The delta this PR introduces is the comparison below.

### Inline-vs-inline: packed-u64 vs PR #6748 inline (same shape, same machine)

| shape                                  | PR #6748 inline (ms) | packed-u64 (ms) | speedup |
|----------------------------------------|----------------------|-----------------|---------|
| Q1 1.2M × 6 short strings (sum+count)  | 23.07                | 21.75           | 1.06x   |
| Q1 5M × 6 short strings (sum+count)    | 120.15               | 111.34          | 1.08x   |
| 1.2M × 32 long strings (count)         | 61.69                | 57.05           | 1.08x   |
| 1.2M × 32 long strings (sum)           | 65.88                | 56.05           | 1.18x   |
| 1.2M × 32 long strings (min)           | 62.78                | 57.66           | 1.09x   |
| 1.2M × 32 long strings (max)           | 63.70                | 56.23           | 1.13x   |
| 1.2M × 32 long strings (count+sum)     | 64.76                | 56.75           | 1.14x   |
| 1.2M × 2048 long strings (count)       | 62.43                | 55.56           | 1.12x   |
| 1.2M × 2048 long strings (sum)         | 62.62                | 62.70           | 1.00x   |
| 1.2M × 2048 long strings (min)         | 64.45                | 54.75           | 1.18x   |
| 1.2M × 2048 long strings (max)         | 63.78                | 56.90           | 1.12x   |
| 1.2M × 2048 long strings (count+sum)   | 66.10                | 55.77           | 1.19x   |
| 5M × 32 long strings (count)           | 293.04               | 246.37          | 1.19x   |
| 5M × 32 long strings (sum)             | 265.15               | 235.03          | 1.13x   |
| 5M × 32 long strings (min)             | 263.15               | 235.38          | 1.12x   |
| 5M × 32 long strings (max)             | 267.26               | 238.82          | 1.12x   |
| 5M × 32 long strings (count+sum)       | 262.31               | 240.04          | 1.09x   |
| 5M × 100k long strings (count)         | 268.71               | 241.55          | 1.11x   |
| 5M × 100k long strings (sum)           | 269.30               | 243.80          | 1.10x   |
| 5M × 100k long strings (min)           | 267.40               | 246.93          | 1.08x   |
| 5M × 100k long strings (max)           | 269.19               | 245.68          | 1.10x   |
| 5M × 100k long strings (count+sum)     | 270.75               | 239.63          | 1.13x   |

Across all 22 measured shapes, packed-u64 is 1.06x to 1.19x faster than PR #6748's inline path. This is the "missing benefit" the comparator-closure / IndexHash dispatch was hiding.

All benchmarks were run on Linux (WSL), Rust nightly `--release`, warmup=3, iters=10, `--test-threads=1`. Commands:

```
cargo test -p daft-recordbatch --release -- bench_q1_like --nocapture --ignored --test-threads=1
cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1
```

## Test plan

- 5 new `test_inline_packed_u64_*` cases comparing inline output to the fallback path:
  - Utf8 × Utf8, no nulls.
  - Utf8 × Utf8, nulls in both columns.
  - Binary × Binary.
  - Utf8 × Binary.
  - Short-string CHAR(1) shape (TPC-H Q1).
- All existing inline_agg tests still pass (48 total, 0 failures).
- New `bench_packed_u64_two_strings` benchmark exercises the path end-to-end at scale across the count/sum/min/max/count+sum matrix.

## Related issues

Part of #6585 (deeper specialization of Item 3, on top of PR #6748).
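The symbolize-then-pack scheme described above can be sketched in plain Python. This is a hedged illustration of the idea, not the PR's Rust code: the helper names `symbolize` and `group_packed_u64` are invented here, and a plain `dict` stands in for the typed `FnvHashMap<u64, u32>`.

```python
def symbolize(col):
    # Assign dense symbol IDs per distinct value. When the column contains
    # nulls, ID 0 is reserved for null and non-null IDs start at 1, so
    # null-equals-null grouping is preserved.
    has_null = any(v is None for v in col)
    table, next_id, out = {}, 1 if has_null else 0, []
    for v in col:
        if v is None:
            out.append(0)
            continue
        sym = table.get(v)
        if sym is None:
            sym = next_id
            table[v] = sym
            next_id += 1
        out.append(sym)
    return out

def group_packed_u64(col_a, col_b):
    # Pack the two u32 symbol IDs into disjoint 32-bit halves of one u64
    # key, then group row indices with an integer-keyed map in a tight loop.
    groups = {}
    for i, (a, b) in enumerate(zip(symbolize(col_a), symbolize(col_b))):
        groups.setdefault((a << 32) | b, []).append(i)
    return groups
```

Running this on the conceptual-example rows ("alice"/"red", "bob"/"red", "alice"/"blue", "alice"/"red") yields three groups, with the two "alice"/"red" rows sharing the packed key `0x00000000_00000000`.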
BABTUNA:perf/packed-symbol-groupby
4 hours ago
style: format uuid function
everettVT/uuidv7-arrow-kernel
5 hours ago
ci: re-trigger checks
BABTUNA:feat/temporal-alias-batch4
6 hours ago

Latest Branches

fix: revert "feat(daft-ext): scalar daft_func macro with overloading (#6844)" #6925
2 hours ago
a354639
rchowell/ext-revert
perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping #6924
5 hours ago
3346a01
BABTUNA:perf/packed-symbol-groupby
6 hours ago
c2628ff
everettVT/uuidv7-arrow-kernel