fix(udf): handle UDF expressions with no column references (#6805) (#6814)
Fixes #6805. Two commits: failing tests (red), then the fix (green).
Two related bugs surface when a UDF expression has no column references
after constant folding (e.g., `with_column("msg", lit("hello"))`
followed by a UDF whose only input is `msg` — the optimizer inlines the
literal and the UDF expression collapses to a literal-only form).
## Part 1 — `remap_used_cols` returns empty Vec instead of `vec![0]`
`daft_dsl::utils::remap_used_cols` previously returned `vec![0]` as a
"borrow column 0 to keep the row count alive" fallback. The downstream
UDF op (streaming sink for async UDFs, intermediate op for sync UDFs)
then indexed into a batch that projection pushdown had narrowed to zero
columns, panicking with `index out of bounds: the len is 0 but the index
is 0`. `RecordBatch::get_columns(&[])` already preserves `num_rows`, so
the fallback isn't needed.
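A minimal pure-Python sketch of this failure mode (the `RecordBatch` class below is an illustrative stand-in, not Daft's actual Rust type): an empty projection still carries the row count, so the "borrow column 0" fallback was never needed, and it panics once pushdown removes every column.

```python
# Illustrative stand-in for Daft's Rust RecordBatch; not the real API.
class RecordBatch:
    def __init__(self, columns, num_rows):
        self.columns = columns      # name -> list of values
        self.num_rows = num_rows

    def get_columns(self, names):
        # An empty projection still preserves num_rows, so no
        # placeholder column is required to keep the row count alive.
        return RecordBatch({n: self.columns[n] for n in names}, self.num_rows)

# Projection pushdown narrowed the batch to zero columns:
batch = RecordBatch({}, num_rows=4)

projected = batch.get_columns([])   # fine: 0 columns, 4 rows
assert projected.num_rows == 4

# The old vec![0] fallback amounted to indexing column 0 anyway:
try:
    list(batch.columns.values())[0]  # nothing to index
except IndexError:
    pass  # mirrors "index out of bounds: the len is 0 but the index is 0"
```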
## Part 2 — broadcast length-1 inputs in Python UDF eval branches
Even with the panic fixed, evaluating a literal-only input yields a
length-1 Series, so the UDF was invoked once and the result broadcast to
N rows. This is wrong for non-pure UDFs (random sampling, external API
calls, anything stateful) — the user wrote `with_column` expecting
per-row execution.
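A small pure-Python illustration of why run-once-broadcast breaks a non-pure UDF (the function and data here are made up, not Daft code): calling a sampling function once and copying its result yields N identical values, while per-row execution yields independent samples.

```python
import random

def sample_udf(_row):
    # Non-pure UDF: each call should draw a fresh sample.
    return random.random()

rows = ["a", "b", "c", "d"]
random.seed(0)

# Buggy semantics: invoke once, broadcast the single result to N rows.
once = sample_udf(rows[0])
broadcast = [once] * len(rows)

# Intended semantics: invoke once per row.
per_row = [sample_udf(r) for r in rows]

assert len(set(broadcast)) == 1   # every row got the same sample
assert len(set(per_row)) > 1      # rows got independent samples
```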
Fix: in both `Expr::ScalarFn(ScalarFn::Python)` branches (sync and
async) of `daft-recordbatch/src/lib.rs`, broadcast any length-1 input
Series to the upstream row count before invoking the UDF. Mirrors the
post-result broadcast already in `async_udf.rs` and
`intermediate_ops/udf.rs`.
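The pre-invocation broadcast can be sketched in pure Python (a model only; the real change operates on `Series` in the Rust eval branches):

```python
def broadcast_inputs(inputs, num_rows):
    """Stretch any length-1 input to num_rows before invoking the UDF.

    Pure-Python model of the pre-invocation broadcast; each input is a
    plain list standing in for a Daft Series.
    """
    out = []
    for series in inputs:
        if len(series) == 1 and num_rows != 1:
            out.append(series * num_rows)   # repeat the lone value
        elif len(series) == num_rows:
            out.append(series)
        else:
            raise ValueError(
                f"input of length {len(series)} cannot broadcast to {num_rows}"
            )
    return out

# A literal-only input evaluates to length 1; broadcast it to the
# upstream row count so the UDF is invoked once per row.
assert broadcast_inputs([["hello"]], 3) == [["hello", "hello", "hello"]]
```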
## Tests
Six failing repros land in the first commit, all green after the fix:
1. Verbatim repro from #6805 (panic on async batch UDF + select)
2. Async batch UDF property test (UDF must see N rows, not
run-once-broadcast)
3. Sync batch UDF (different eval branch)
4. Row-wise UDF (`@daft.func`)
5. Empty input batch (`row_count == 0`)
6. Ray integration: multi-partition input on the Ray runner exercises
the actor UDF path
## Scope
- **Native runner**: panic + broadcast bugs both fixed.
- **Ray runner**: panic fixed via Part 1 (flows through
`actor_udf.rs:156`'s call to `remap_used_cols`); the multi-partition Ray
integration test confirms the fix.
- **Legacy `@daft.udf` decorator**: NOT addressed — that decorator
routes through `Expr::Function` / `LegacyPythonUDF` and is being removed
in 0.8.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(paimon): enhance Paimon integration (#6635)
Major improvements to the Paimon integration:
- Filter pushdown: implement PredicateVisitor-based converter supporting
comparison, is_in, is_null, between, and string operations
- Write support: add pypaimon stats patch for complex types
(list/map/struct), schema conversion for large_string/large_binary
- Lazy imports: defer all pypaimon imports to function call time so
users can `import daft` without pypaimon installed
- Table metadata: expose primary_keys, partition_keys, bucket_count,
table_options via PaimonTable properties
- Catalog cleanup: simplify identifier conversion, add table options
passthrough on create_table
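The filter-pushdown conversion can be sketched as a recursive fold over the predicate tree (node shapes and names below are illustrative, not Daft's actual `PredicateVisitor` API): supported nodes map to a predicate description, and anything unsupported maps to `None` so that filter simply stays on the Daft side.

```python
def fold_predicate(expr):
    """Fold a tuple-encoded predicate tree into a pushdown-friendly dict.

    Illustrative sketch only. Returns None for unsupported shapes, which
    a caller would interpret as "do not push this filter down".
    """
    op = expr[0]
    if op in ("==", "!=", "<", "<=", ">", ">="):
        _, col, lit = expr
        return {"op": op, "field": col, "literal": lit}
    if op == "is_null":
        return {"op": "is_null", "field": expr[1]}
    if op == "is_in":
        return {"op": "is_in", "field": expr[1], "literals": list(expr[2])}
    if op == "and":
        children = [fold_predicate(c) for c in expr[1:]]
        # Conservative: drop the whole conjunction if any conjunct is
        # unsupported (a fuller version could push the supported ones).
        if any(c is None for c in children):
            return None
        return {"op": "and", "children": children}
    return None  # unsupported node -> no pushdown

pred = fold_predicate(("and", ("==", "id", 3), ("is_null", "name")))
```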
Tests: add comprehensive filter pushdown, nested types, decimal, null
values, and write roundtrip tests; reorganize catalog tests to
tests/catalog/paimon/.
## Changes Made
- Implement `PaimonPredicateVisitor` using tree fold pattern for filter
predicate pushdown
- Add `PaimonDataSink` with write support including complex type stats
patching
- Use lazy imports (`daft.dependencies`) for pyarrow and deferred
imports for pypaimon
- Expose Paimon table metadata properties
- Reorganize and expand test coverage
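The lazy-import pattern can be sketched with a small helper (illustrative only; the actual mechanism is `daft.dependencies` plus function-scope imports): the heavy module is resolved at call time, so merely importing the integration never requires pypaimon.

```python
import importlib

def lazy_import(module_name, feature):
    """Resolve a heavy optional dependency at call time.

    Illustrative helper, not Daft's actual daft.dependencies API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires '{module_name}'; install it to proceed"
        ) from err

def read_paimon_table(path):
    # pypaimon is only touched when this function is actually called,
    # so `import daft` works without it installed.
    pypaimon = lazy_import("pypaimon", "Paimon support")
    ...  # use pypaimon here
```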
## Related Issues
Closes #6735
Related to #4976
---------
Co-authored-by: biyan <biyan.by@alibaba-inc.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>