Eventual-Inc / Daft

Performance History

Latest Results

feat(temporal): add Spark-style timezone conversions

Implements `from_utc_timestamp`, `to_utc_timestamp`, and `convert_timezone` for Spark parity (#3798). `FromUtcTimestamp` interprets the input as a UTC instant and returns the wall-clock time in the given timezone as a tz-naive Timestamp. `ToUtcTimestamp` does the inverse. `convert_timezone` is a Python and SQL alias over the existing `convert_time_zone` with Spark's reversed argument order (`target_tz`, `source_ts`).
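A minimal semantics sketch of the two conversions in plain Python (illustrative only, not the Daft API; assumes the stdlib `zoneinfo` timezone database):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def from_utc_timestamp(ts: datetime, tz: str) -> datetime:
    # Treat the naive input as a UTC instant; return the wall-clock
    # time in `tz` as a tz-naive timestamp.
    return ts.replace(tzinfo=ZoneInfo("UTC")).astimezone(ZoneInfo(tz)).replace(tzinfo=None)

def to_utc_timestamp(ts: datetime, tz: str) -> datetime:
    # Inverse: treat the naive input as wall-clock time in `tz`;
    # return the corresponding UTC instant as a tz-naive timestamp.
    return ts.replace(tzinfo=ZoneInfo(tz)).astimezone(ZoneInfo("UTC")).replace(tzinfo=None)

# Midnight UTC on Jan 1 is 7 p.m. the previous evening in New York (UTC-5).
assert from_utc_timestamp(datetime(2024, 1, 1), "America/New_York") == datetime(2023, 12, 31, 19)
assert to_utc_timestamp(datetime(2023, 12, 31, 19), "America/New_York") == datetime(2024, 1, 1)
```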
BABTUNA:feat/temporal-tz-conversions
15 minutes ago
style: apply ruff format and cargo fmt
BABTUNA:feat/temporal-add-months
5 hours ago
fix: Handle dtype mismatch error in join_asof join keys (#6904)

Currently asof joins with mismatched `by` keys would fail with `mismatch dtype error`. The fix is to:

1. Normalize and cast the keys to a shared supertype (e.g. Int64 and Float64 are normalized to Float64), which is the same methodology used for the `on` key, as well as for the join keys of equality joins.
2. Remove the computation of `right_cols_to_drop` in the local executor, because it does not drop the casted expressions computed during normalization, e.g. `Cast(Column("left_on_key"), Utf8)`, and led to duplicate columns in the output (the `left_on_key` column was duplicated in the result). Since we already computed the desired output schema in the logical plan, we can simply use it as the basis to prune columns during execution.

```
left = {"ts": [1, 3, 5], "v": [...]}    # ts is Int64
right = {"ts": [2.0, 4.0], "w": [...]}  # ts is Float64

# correct output
{"ts": [1, 3, 5], "v": [10, 30, 50], "w": [None, 20, 40]}

# without the fix: the second "ts" silently overwrites the first:
{"ts": [None, 2.0, 4.0], "v": [10, 30, 50], "w": [None, 20, 40]}
# ^^^ left ts [1, 3, 5] is gone — no error raised
```

**A more in-depth explanation of the second bug:**

The bug requires three conditions to trigger:

- a join key (meaning it should have been a candidate for `right_cols_to_drop`),
- that shares a name with the other side (explained later),
- mismatched types on that key (causing normalization to wrap it in a `Cast` expression, which prevents it from being caught in `right_cols_to_drop`).

Here is how the bug occurs:

1. At the logical plan layer, `right_cols_to_drop` is computed from bare unresolved column expressions — before any normalization has occurred. It is then passed to `deduplicate_asof_join_columns`, which uses it to determine which right-side columns need to be renamed with a `right.` prefix. Since the join keys are inside the dropped-cols set, the deduplication step skips them, as they are already being dropped.
2. After translation, at the physical plan layer, `AsofJoinOperator::new` ignored the `output_schema` and recomputed `right_cols_to_drop` from scratch — but by this point, translation had already wrapped bare `Column("g")` expressions in `Cast(Column("g"), Utf8)`. The `extract_name` closure used in the recomputation only handled bare column expressions, so it returned `None` for any cast-wrapped expression, silently omitting the right key column from `right_cols_to_drop`.
3. Without it in the drop set, `prune_right_batch` kept the column, producing a record batch with duplicate column names. When `to_pydict()` built a Python dict, the duplicate key caused the right-side values to silently overwrite the left-side values, corrupting the output (see the sketch below).
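A minimal sketch of the dict-overwrite mechanism in step 3 (plain Python, not Daft internals):

```python
# Duplicate column names are legal in a record batch, but a Python dict
# keeps only the last value per key, so to_pydict() silently discards
# the left-side "ts" column with no error raised.
columns = [("ts", [1, 3, 5]), ("v", [10, 30, 50]), ("ts", [None, 2.0, 4.0])]
as_dict = dict(columns)
assert as_dict["ts"] == [None, 2.0, 4.0]  # left ts [1, 3, 5] is gone
```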
main
6 hours ago
fix(distributed): emit empty downstream task for limit(0) (#6916)

## Changes Made

`LimitNode` in the Flotilla pipeline computes `max_concurrent_tasks = total_remaining().div_ceil(estimated_num_rows)`. For `limit=0` against any input source carrying row-count metadata (e.g. `InMemoryScan`), this collapses to 0 (sketched below), the submission loop never runs, and **no downstream task is ever emitted**. Blocking sinks downstream — notably ungrouped aggregates — then have 0 input tasks, never finalize, and produce 0 output rows.

The native runner doesn't have a Limit pipeline node, and its `AggregateSink::finalize` always runs once even with zero input states, which is why the native path was unaffected.

### Fix

Short-circuit on `is_take_done() && is_skip_done()` at the top of `limit_execution_loop` by emitting one empty scan task and returning — so downstream blocking sinks still finalize and produce their conventional one-row null result. Pulled the empty-scan construction into a private `build_empty_scan_task` helper to dedupe with the existing "all rows skipped" code path.

### Repro (before fix)

```python
import daft
from daft import col

df = daft.from_pydict({"values": [1.0, 2.0]}).limit(0)
df.agg(col("values").sum().alias("s")).collect().to_pydict()
# Ray:    {"s": []}      ← bug
# Native: {"s": [None]}
```

### Tests

- New unit test `test_limit_zero_emits_empty_task` in `src/daft-distributed/src/pipeline_node/limit.rs` covering the limit=0 path.
- Verified locally: `tests/dataframe/test_percentile.py` (all 14 tests) and `tests/dataframe/test_limit_offset.py` (all 199 tests) pass on Ray.

## Related Issues

Unblocks the v0.7.11 release — `test_percentile_empty_input` was failing on the Ray test job: https://github.com/Eventual-Inc/Daft/actions/runs/25575457827/job/75087986460

The test was added in #6153 (percentile/median ops). It exposed this latent `limit=0` bug in the distributed limit node; the bug also affects `sum`, `count`, `mean`, and any other ungrouped aggregation over `limit(0)`-empty input on the Ray runner.

## Test plan

- [ ] CI green
- [ ] `test_percentile_empty_input` passes on Ray
- [ ] `test_limit_offset.py` suite stays green on Ray

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Varun Madan <varun@Varuns-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
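A small sketch of the `div_ceil` collapse described under "Changes Made" above (Python for clarity; the actual pipeline code is Rust):

```python
# Ceiling division of a zero remaining-row count by any positive row
# estimate is 0, so the task-submission loop body never executes.
def div_ceil(a: int, b: int) -> int:
    return -(-a // b)  # mirrors Rust's integer div_ceil

assert div_ceil(0, 1000) == 0  # limit=0: max_concurrent_tasks = 0, no task emitted
assert div_ceil(5, 1000) == 1  # any nonzero limit emits at least one task
```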
main
6 hours ago
feat(lance): partitioned merge + address greptile review

Adds the fast-path partitioned column merge for benchmark enrichment workflows, and addresses the greptile-bot review comments on #6915.

### Merge feature

`df.write_lance(uri, mode="merge", partition_cols=[...])` now performs a fast-path column merge: new columns are written as raw `.lance` files alongside existing data and stitched into each leaf fragment's metadata without rewriting base data. Implementation in `daft/io/lance/lance_partitioned_merge.py` (algorithm cribbed from `daft-lance`'s `FastPathFragmentWriter` — inlined because `daft-lance` depends on `daft` and so cannot be a Daft dependency).

Read-side requirements wired through:

- `include_fragment_id=True` on `LanceNamespaceScanOperator` exposes per-leaf fragment ids.
- `default_scan_options` passthrough so callers can request `with_row_address=True` for `_rowaddr` per-leaf.

Workflow:

```python
df = daft.read_lance(
    uri,
    namespace_partitioning=True,
    include_fragment_id=True,
    default_scan_options={"with_row_address": True},
)
df = df.with_column("prod__caption", caption_udf(df["frame"]))
df.write_lance(uri, partition_cols=["robot_id"], mode="merge")
```

### Review fixes

- P1: `_group_by_partition` used `pa.compute.equal(col, null)`, which returns nulls; `table.filter()` treats null as False, silently dropping every row with a NULL partition value. Fixed via a `pa.compute.is_null()` branch (see the sketch below). Test: `test_null_partition_values_preserved`.
- P1: `_build_partition_keys` treated `source_ids[0]` as a positional Arrow schema index. Per spec, `source_ids` references the logical `lance:field_id` metadata; this happened to work for Daft-written namespaces (we assign field_id = position) but breaks reads of third-party-written namespaces with non-positional ids. Fixed by looking up the source field via `lance:field_id` metadata. Test: `test_source_ids_resolved_by_lance_field_id_not_position`.
- P2: Moved Daft-internal imports (`io_config_to_storage_options`, `ExpressionsProjection`) from inline to the top of their respective files, per project convention.
- P2: `read_lance(..., namespace_partitioning=True)` now warns when `version` / `asof` / `block_size` / `commit_lock` / `index_cache_size` / `metadata_cache_size_bytes` are passed but ignored (those options target a single Lance dataset and don't compose with partitioned namespaces). Tests: `test_read_lance_warns_on_ignored_args_under_namespace_partitioning` and `test_read_lance_no_warning_when_args_default`.
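A minimal PyArrow sketch of the first P1 fix (illustrative values; not the actual `_group_by_partition` code):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"part": ["a", None, "b"], "x": [1, 2, 3]})

# Buggy mask: equality against a null scalar is null for every element,
# and Table.filter() drops rows where the mask is null.
bad_mask = pc.equal(table["part"], pa.scalar(None, type=pa.string()))
assert table.filter(bad_mask).num_rows == 0  # NULL-partition rows silently lost

# Fixed mask: is_null() yields a proper boolean mask.
good_mask = pc.is_null(table["part"])
assert table.filter(good_mask).num_rows == 1  # the NULL-partition row survives
```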
universalmind303/lance-partition-take-2
10 hours ago

Latest Branches

Performance gauge: 0%
feat(temporal): add Spark-style timezone conversions (#6919)
41 minutes ago
73b0b72
BABTUNA:feat/temporal-tz-conversions
Performance gauge: 0%
3 hours ago
fc81d4b
BABTUNA:feat/temporal-add-months
Performance gauge: -1%
fix(distributed): emit empty downstream task for limit(0) (#6916)
9 hours ago
f82f46d
varun/fix-percentile-empty-ray