Eventual-Inc
Daft
Latest Results
test: verify strict-superset group-by over clustered source is correct

Adds an execution-level test (not just an explain check) that a group-by whose keys are a strict superset of the source clustering takes the single-stage local aggregation path and still computes complete, correct groups. Guards the soundness of clustering_is_covered_by for single-input aggregation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jay/clustering-spec-datasource
51 minutes ago
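The soundness claim in the commit above can be illustrated without Daft: if every row sharing a clustering-key value lives in one partition, then grouping by a superset of that key never splits a group across partitions. The following is a minimal pure-Python sketch under that assumption. The data, the `group_sum` helper, and the column names are all hypothetical and are not taken from the Daft test suite.

```python
from collections import defaultdict


def group_sum(rows, keys):
    """Sum the 'v' field of each row, grouped by the given key columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["v"]
    return dict(totals)


# Hypothetical source clustered by "a": all rows sharing a value of "a"
# live in the same partition.
partitions = [
    [{"a": 1, "b": "x", "v": 10}, {"a": 1, "b": "y", "v": 5}, {"a": 1, "b": "x", "v": 1}],
    [{"a": 2, "b": "x", "v": 7}, {"a": 3, "b": "y", "v": 2}, {"a": 2, "b": "x", "v": 3}],
]

keys = ["a", "b"]  # strict superset of the clustering key ["a"]

# Single-stage local aggregation: aggregate each partition independently
# and concatenate the results, with no shuffle in between.
local = {}
for part in partitions:
    result = group_sum(part, keys)
    # If the clustering claim holds, no group appears in two partitions.
    assert not (local.keys() & result.keys())
    local.update(result)

# Reference answer: aggregate all rows in one place.
all_rows = [row for part in partitions for row in part]
expected = group_sum(all_rows, keys)
```

Comparing `local` with `expected` is the execution-level check, as opposed to only inspecting the plan.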
test+docs: clustering-aware shuffle elision over custom DataSource

- Add tests/io/test_data_source_clustering.py exercising the distributed planner (ray runner): no-declaration baseline shuffles; exact-match and subset (refinement) for groupby/window/distinct skip the shuffle; the unsound inverse still shuffles; expression-valued clustering follows a projection; and the shuffle-free plan computes the correct result.
- Document get_clustering_spec() in the custom connectors guide, with column and expression-valued examples and the distributed-only / soundness caveats.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jay/clustering-spec-datasource
2 hours ago
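The cases listed in the commit above (baseline, exact match, subset, inverse) can be read as one set-containment rule. The sketch below is a simplified model of that rule over plain column names. The function name `can_skip_shuffle` is hypothetical, it is not Daft's planner code, and it does not model expression-valued clustering.

```python
def can_skip_shuffle(clustering_keys, op_keys):
    """Return True when a source clustered on clustering_keys already
    co-locates every group of an operation keyed on op_keys.

    Simplified model: keys are plain column names, compared as sets.
    """
    clustering = set(clustering_keys)
    # A source that declares no clustering guarantees nothing,
    # so the shuffle has to stay.
    if not clustering:
        return False
    # Rows that agree on op_keys also agree on any subset of op_keys,
    # so they already sit in the same partition.
    return clustering <= set(op_keys)


cases = {
    "no declaration": can_skip_shuffle([], ["a", "b"]),
    "exact match": can_skip_shuffle(["a", "b"], ["a", "b"]),
    "subset": can_skip_shuffle(["a"], ["a", "b"]),
    "inverse": can_skip_shuffle(["a", "b"], ["a"]),
}
```

The inverse case is unsound because rows with the same `a` but different `b` may sit in different partitions, so a group keyed on `a` alone could be split.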
Merge branch 'main' into patch-1
ARDA7787:patch-1
12 hours ago
docs(paimon): clarify object-store IO config usage
jackylee-ch:codex-doc-paimon-io-config
20 hours ago
fix(filesystem): fix pyarrow fs memory by caching by value, not identity (#7025)

## Changes Made

Fixes a memory leak in pyarrow fs. In long-running `write_iceberg` jobs this drained file descriptors and threads until the process OOM'd. This was triggered with a refresh-credentials S3 setup, but the cache is broken for every IOConfig. This PR keys the cache on `repr(io_config)`.

The audit found that `IOConfig.__hash__` returns equal values for semantically-equal configs, but `__eq__` is identity-based on the PyO3 wrapper. The dict-keyed cache at `daft/filesystem.py:35` therefore missed on **every** call when the Rust side handed a fresh Python wrapper to each writer, rebuilding a new PyArrow `S3FileSystem` (with its own thread pool and connection pool) per output file.

| | FD slope / iter | RSS slope MiB / iter |
|---|---|---|
| Before fix | +63.9 | +2.15 |
| After fix | **−0.05** | +0.39 |

## Related Issues

- N/A
main
1 day ago
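The failure mode described in the PR above follows from how Python dicts look up keys: a matching hash is not enough, because the stored key must also be the same object or compare equal. The sketch below reproduces that behaviour with a hypothetical `Config` class standing in for the wrapper. It does not use Daft or PyArrow, and the `object()` values are placeholders for real filesystem handles.

```python
class Config:
    """Hypothetical stand-in for a wrapper whose __hash__ is value-based
    but whose __eq__ is identity-based."""

    def __init__(self, region):
        self.region = region

    def __hash__(self):
        return hash(self.region)

    def __eq__(self, other):
        return self is other

    def __repr__(self):
        return f"Config(region={self.region!r})"


builds = {"identity": 0, "content": 0}
identity_cache = {}
content_cache = {}


def get_fs_by_identity(config):
    # Keyed on the object: a fresh wrapper never equals a stored key.
    if config not in identity_cache:
        builds["identity"] += 1
        identity_cache[config] = object()  # placeholder for a filesystem
    return identity_cache[config]


def get_fs_by_content(config):
    # Keyed on repr(): equal content yields the same string key.
    key = repr(config)
    if key not in content_cache:
        builds["content"] += 1
        content_cache[key] = object()  # placeholder for a filesystem
    return content_cache[key]


# Each call receives a fresh wrapper with identical content.
for _ in range(5):
    get_fs_by_identity(Config("us-west-2"))
    get_fs_by_content(Config("us-west-2"))
```

After the loop the identity-keyed cache has built one entry per call and retains all of them, while the content-keyed cache has built a single entry.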
feat(checkpoint): add distributed observability counters

Surface checkpoint progress on the dashboard for distributed (Flotilla) runs via worker->driver counter aggregation:

- keys_staged on the StageCheckpointKeys source operator
- files_staged and checkpoints_sealed on the write sink

Each worker's RuntimeStats builds a StatSnapshot; the distributed pipeline node's handle_worker_node_stats sums the new fields into driver-side meter counters and re-exports them. Non-checkpoint operators report zero via default no-op RuntimeStats methods.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rohit/feature/checkpoint-metrics
1 day ago
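The worker-to-driver aggregation in the commit above amounts to a field-wise sum over per-worker snapshots, with zero defaults for operators that do not checkpoint. The sketch below models that in Python. The three field names come from the commit message, but the `StatSnapshot` dataclass and the `aggregate` function are hypothetical stand-ins and do not reflect the actual implementation.

```python
from dataclasses import dataclass, fields


@dataclass
class StatSnapshot:
    """Hypothetical per-worker snapshot; counters default to zero so
    non-checkpoint operators contribute nothing."""

    keys_staged: int = 0
    files_staged: int = 0
    checkpoints_sealed: int = 0


def aggregate(snapshots):
    """Sum every counter field across the worker snapshots."""
    total = StatSnapshot()
    for snap in snapshots:
        for f in fields(StatSnapshot):
            setattr(total, f.name, getattr(total, f.name) + getattr(snap, f.name))
    return total


worker_snapshots = [
    StatSnapshot(keys_staged=120),                       # source operator
    StatSnapshot(files_staged=3, checkpoints_sealed=1),  # write sink
    StatSnapshot(),                                      # non-checkpoint operator
]
driver_totals = aggregate(worker_snapshots)
```

Iterating over `fields()` means a newly added counter is summed without further changes to `aggregate`.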
fix(filesystem): key pyarrow fs cache by IOConfig content, not identity

IOConfig.__hash__ returns equal values for semantically-equal configs, but __eq__ is identity-based on the PyO3 wrapper. The dict-keyed cache in _resolve_paths_and_filesystem therefore missed on every call when the Rust side handed a fresh Python wrapper to each writer, rebuilding a new PyArrow S3FileSystem (with its own thread + connection pool) per output file. In long-running write_iceberg jobs this drained file descriptors and threads until the process OOM'd.

Key on repr(io_config) instead; the cached entry's expiry field still drives refresh-credentials invalidation.

In a 30-iteration MRE (16 partitions = 480 writers), FD growth drops from +63.9/iter to -0.05/iter and RSS slope drops 5x.
rchowell/write-leak-fix
1 day ago
docs: add shuffle algorithms tuning guide (#7017)

Adds a user-facing page in the Optimization section covering Daft's four `shuffle_algorithm` options (`auto`, `map_reduce`, `pre_shuffle_merge`, and `flight_shuffle`), when each applies, and how to tune `flight_shuffle_dirs` and `flight_shuffle_compression`.

Adds cross-links between the new page and `partitioning.md` so the partition-count → shuffle-cost path is followable in both directions.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main
1 day ago
Latest Branches
CodSpeed Performance Gauge
-1%
feat: allow custom datasources to specify clustering spec
#7031
1 hour ago
9e017d3
jay/clustering-spec-datasource
CodSpeed Performance Gauge
0%
fix(local): replace unguarded unwrap() calls with recoverable error handling
#7003
13 hours ago
21dcb68
ARDA7787:patch-1
CodSpeed Performance Gauge
-1%
docs(paimon): clarify object-store IO config usage
#7029
21 hours ago
9b07d39
jackylee-ch:codex-doc-paimon-io-config
© 2026 CodSpeed Technology