Avatar for the Eventual-Inc user
Eventual-Inc
Daft
BlogDocsChangelog

Performance History

Latest Results

feat(checkpoint): wire checkpoint= into csv, json, warc, iceberg, hudi, lance (#6855) ## Summary Surfaces the `checkpoint=CheckpointConfig | None` kwarg on every non-streaming Daft reader. Previously only `read_parquet` supported it, which would have made any public checkpointing launch read as half-shipped. Stacked on top of #6853 (DF-1915) since it depends on the new `CheckpointConfig` shape. ## What changed **Per-reader kwarg.** Every reader now accepts `checkpoint=`: - `read_parquet`, `read_csv`, `read_json`, `read_warc` - `read_iceberg`, `read_hudi`, `read_lance` Streaming sources (Kafka) intentionally excluded β€” they have their own progress-tracking notion. **Shared helper.** Extracted the type-check + native-runner gate + `builder.with_checkpoint(checkpoint._inner)` sequence into `daft.io._checkpoint.attach_checkpoint(builder, checkpoint)`. Every reader is now a one-liner. Refactored `read_parquet` to use the helper too, replacing a ~12-line inline block. ```python config = daft.CheckpointConfig( store=daft.CheckpointStore("s3://bucket/ckpt"), on="file_id", ) df = daft.read_csv("...", checkpoint=config) # was unsupported df = daft.read_iceberg(table, checkpoint=config) # was unsupported # ... etc ``` ## Why nothing else changes The optimizer rule (`RewriteCheckpointSource`) operates on any `Source` node. The gap was purely API surface β€” no Rust changes needed. ## Test plan - [x] Native-runner gate test parametrizes parquet/csv/json with real file fixtures (3 cases). - [x] Helper-direct tests: gate fires + None-passthrough. - [x] Signature-introspection test parametrized across all 7 readers β€” pins `checkpoint` kwarg presence and `default=None` so a regression dropping the kwarg or changing the default would fail. - [x] All previously-passing checkpoint tests still pass (12 cases total in the gate test file). - [ ] CI Ray-runner integration: existing checkpoint integration tests already cover the optimizer-rule round-trip via parquet; no new round-trips for the other readers in this PR (DF-2007 is API surface only). πŸ€– Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main
9 minutes ago
feat(checkpoint): bundle source config into CheckpointConfig with tunable filter knobs (#6853) ## Summary Replaces `read_parquet(checkpoint=CheckpointStore, on=...)` with a single bundled `CheckpointConfig` argument. The new shape pairs the store, the source key column, and **per-strategy filter tuning** into one object passed via `checkpoint=`. The 5 anti-join knobs (`num_workers`, `cpus_per_worker`, `keys_load_batch_size`, `max_concurrency_per_worker`, `filter_batch_size`) were previously hardcoded in the optimizer rule with a TODO. They're now configurable per source. ## API ```python config = daft.CheckpointConfig( store=daft.CheckpointStore("s3://bucket/ckpt"), on="file_id", settings=daft.KeyFilteringSettings(num_workers=4, cpus_per_worker=1.0), ) df = daft.read_parquet("s3://input/", checkpoint=config) ``` `settings=` is optional. Each field of `KeyFilteringSettings` is independently optional; unset fields fall back to today's hardcoded defaults at the rule layer (`num_workers=2`, `cpus_per_worker=1.0`, `max_concurrency_per_worker=1`) or pass through as `None` to the executor for the other two. ## Why nest under a strategy enum `CheckpointConfig.settings: CheckpointSettings` is an enum with one variant today (`KeyFiltering(KeyFilteringSettings)`). Future filter strategies (bloom filter, hash partition) carry different tuning knobs, so the per-source config should already nest them under a strategy variant rather than flatten anti-join knobs onto the top-level config. ## Test plan - [x] Rust unit tests in `rewrite_checkpoint_source.rs`: defaults pinned, full overrides flow through, partial overrides keep per-field fallbacks. - [x] Python unit tests in `tests/checkpoint/test_checkpoint_config.py`: round-trip via the new `settings` getter, parametrized rejection of zero/negative inputs, default-construction path, optional `settings=` kwarg. - [x] All existing checkpoint integration tests rewritten for the new shape (30 call sites in `tests/integration/checkpoint/test_checkpoint_s3.py`). - [x] Native-runner gate test confirmed firing with the new shape. - [x] `make build` + Python tests pass locally (15/15). - [ ] CI Ray-runner integration tests. πŸ€– Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main
1 hour ago
feat: refactored out update_best_match
euan/optimize-asof
2 hours ago
feat:parallelized final probe
euan/optimize-asof
2 hours ago

Latest Branches

CodSpeed Performance Gauge
0%
feat(distributed): support into_batches under flight shuffle#6810
43 minutes ago
8817d8c
oh/into_batches_without_emitting_flight_refs
CodSpeed Performance Gauge
0%
feat(checkpoint): wire checkpoint= into csv, json, warc, iceberg, hudi, lance#6855
2 hours ago
2794ac5
rohit/feature/df-2007-checkpoint-other-sources
CodSpeed Performance Gauge
-1%
2 hours ago
d7e898b
euan/optimize-asof
Β© 2026 CodSpeed Technology
Home Terms Privacy Docs