Latest Results
feat(checkpoint): wire checkpoint= into csv, json, warc, iceberg, hudi, lance (#6855)
## Summary
Surfaces the `checkpoint=CheckpointConfig | None` kwarg on every
non-streaming Daft reader. Previously only `read_parquet` supported it,
which would have made any public checkpointing launch read as
half-shipped.
Stacked on top of #6853 (DF-1915) since it depends on the new
`CheckpointConfig` shape.
## What changed
**Per-reader kwarg.** Every reader now accepts `checkpoint=`:
- `read_parquet`, `read_csv`, `read_json`, `read_warc`
- `read_iceberg`, `read_hudi`, `read_lance`
Streaming sources (Kafka) intentionally excluded β they have their own
progress-tracking notion.
**Shared helper.** Extracted the type-check + native-runner gate +
`builder.with_checkpoint(checkpoint._inner)` sequence into
`daft.io._checkpoint.attach_checkpoint(builder, checkpoint)`. Every
reader is now a one-liner. Refactored `read_parquet` to use the helper
too, replacing a ~12-line inline block.
```python
config = daft.CheckpointConfig(
store=daft.CheckpointStore("s3://bucket/ckpt"),
on="file_id",
)
df = daft.read_csv("...", checkpoint=config) # was unsupported
df = daft.read_iceberg(table, checkpoint=config) # was unsupported
# ... etc
```
## Why nothing else changes
The optimizer rule (`RewriteCheckpointSource`) operates on any `Source`
node. The gap was purely API surface β no Rust changes needed.
## Test plan
- [x] Native-runner gate test parametrizes parquet/csv/json with real
file fixtures (3 cases).
- [x] Helper-direct tests: gate fires + None-passthrough.
- [x] Signature-introspection test parametrized across all 7 readers β
pins `checkpoint` kwarg presence and `default=None` so a regression
dropping the kwarg or changing the default would fail.
- [x] All previously-passing checkpoint tests still pass (12 cases total
in the gate test file).
- [ ] CI Ray-runner integration: existing checkpoint integration tests
already cover the optimizer-rule round-trip via parquet; no new
round-trips for the other readers in this PR (DF-2007 is API surface
only).
π€ Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> feat(checkpoint): bundle source config into CheckpointConfig with tunable filter knobs (#6853)
## Summary
Replaces `read_parquet(checkpoint=CheckpointStore, on=...)` with a
single bundled `CheckpointConfig` argument. The new shape pairs the
store, the source key column, and **per-strategy filter tuning** into
one object passed via `checkpoint=`.
The 5 anti-join knobs (`num_workers`, `cpus_per_worker`,
`keys_load_batch_size`, `max_concurrency_per_worker`,
`filter_batch_size`) were previously hardcoded in the optimizer rule
with a TODO. They're now configurable per source.
## API
```python
config = daft.CheckpointConfig(
store=daft.CheckpointStore("s3://bucket/ckpt"),
on="file_id",
settings=daft.KeyFilteringSettings(num_workers=4, cpus_per_worker=1.0),
)
df = daft.read_parquet("s3://input/", checkpoint=config)
```
`settings=` is optional. Each field of `KeyFilteringSettings` is
independently optional; unset fields fall back to today's hardcoded
defaults at the rule layer (`num_workers=2`, `cpus_per_worker=1.0`,
`max_concurrency_per_worker=1`) or pass through as `None` to the
executor for the other two.
## Why nest under a strategy enum
`CheckpointConfig.settings: CheckpointSettings` is an enum with one
variant today (`KeyFiltering(KeyFilteringSettings)`). Future filter
strategies (bloom filter, hash partition) carry different tuning knobs,
so the per-source config should already nest them under a strategy
variant rather than flatten anti-join knobs onto the top-level config.
## Test plan
- [x] Rust unit tests in `rewrite_checkpoint_source.rs`: defaults
pinned, full overrides flow through, partial overrides keep per-field
fallbacks.
- [x] Python unit tests in `tests/checkpoint/test_checkpoint_config.py`:
round-trip via the new `settings` getter, parametrized rejection of
zero/negative inputs, default-construction path, optional `settings=`
kwarg.
- [x] All existing checkpoint integration tests rewritten for the new
shape (30 call sites in
`tests/integration/checkpoint/test_checkpoint_s3.py`).
- [x] Native-runner gate test confirmed firing with the new shape.
- [x] `make build` + Python tests pass locally (15/15).
- [ ] CI Ray-runner integration tests.
π€ Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Latest Branches
0%
oh/into_batches_without_emitting_flight_refs 0%
rohit/feature/df-2007-checkpoint-other-sources -1%
Β© 2026 CodSpeed Technology