Eventual-Inc
Daft

feat(io): add filename provider support for parquet, csv and upload

#5895
Comparing Jay-ju:filename-provider (88378b3) with main (29ffd49)
CodSpeed Performance Gauge: 0%
Untouched: 24
Ignored: 4

Benchmarks

Passed

| Benchmark | File | Change | Base | Head |
| --- | --- | --- | --- | --- |
| test_iter_rows_first_row[1 Small File] | tests/benchmarks/test_interactive_reads.py | +3% | 35.7 ms | 34.5 ms |
| test_tpch[1-in-memory-5] | tests/benchmarks/test_local_tpch.py | +2% | 132.2 ms | 129.9 ms |
| test_tpch[1-in-memory-7] | tests/benchmarks/test_local_tpch.py | +2% | 127.6 ms | 125.5 ms |
| test_count[100 Small Files] | tests/benchmarks/test_interactive_reads.py | +1% | 57.4 ms | 56.7 ms |
| test_tpch_sql[1-in-memory-2] | tests/benchmarks/test_local_tpch.py | +1% | 165.5 ms | 164.6 ms |
| test_tpch_sql[1-in-memory-9] | tests/benchmarks/test_local_tpch.py | 0% | 264 ms | 262.9 ms |
| test_tpch_sql[1-in-memory-10] | tests/benchmarks/test_local_tpch.py | 0% | 188.9 ms | 188.2 ms |
| test_tpch_sql[1-in-memory-6] | tests/benchmarks/test_local_tpch.py | 0% | 29.2 ms | 29.1 ms |
| test_tpch[1-in-memory-6] | tests/benchmarks/test_local_tpch.py | 0% | 28.5 ms | 28.4 ms |
| test_tpch_sql[1-in-memory-3] | tests/benchmarks/test_local_tpch.py | 0% | 116.3 ms | 116 ms |
| test_tpch_sql[1-in-memory-8] | tests/benchmarks/test_local_tpch.py | 0% | 134.2 ms | 133.9 ms |
| test_tpch_sql[1-in-memory-5] | tests/benchmarks/test_local_tpch.py | 0% | 117.8 ms | 117.6 ms |
| test_tpch_sql[1-in-memory-4] | tests/benchmarks/test_local_tpch.py | 0% | 89 ms | 88.9 ms |
| test_tpch[1-in-memory-4] | tests/benchmarks/test_local_tpch.py | 0% | 88 ms | 88.1 ms |
| test_tpch[1-in-memory-8] | tests/benchmarks/test_local_tpch.py | 0% | 148.4 ms | 148.5 ms |
| test_tpch[1-in-memory-9] | tests/benchmarks/test_local_tpch.py | 0% | 275.9 ms | 275.9 ms |
| test_tpch[1-in-memory-1] | tests/benchmarks/test_local_tpch.py | 0% | 403.1 ms | 403.2 ms |
| test_tpch[1-in-memory-2] | tests/benchmarks/test_local_tpch.py | 0% | 56.3 ms | 56.4 ms |
| test_tpch_sql[1-in-memory-7] | tests/benchmarks/test_local_tpch.py | -1% | 117.7 ms | 118.3 ms |
| test_tpch[1-in-memory-10] | tests/benchmarks/test_local_tpch.py | -1% | 196.1 ms | 197.1 ms |
| test_tpch[1-in-memory-3] | tests/benchmarks/test_local_tpch.py | -1% | 124 ms | 124.7 ms |
| test_tpch_sql[1-in-memory-1] | tests/benchmarks/test_local_tpch.py | -1% | 399.8 ms | 403.4 ms |
| test_show[1 Small File] | tests/benchmarks/test_interactive_reads.py | -1% | 11.9 ms | 12.1 ms |
| test_explain[100 Small Files] | tests/benchmarks/test_interactive_reads.py | -1% | 12.3 ms | 12.4 ms |

Ignored

| Benchmark | File | Change | Base | Head |
| --- | --- | --- | --- | --- |
| test_explain[1 Small File] | tests/benchmarks/test_interactive_reads.py | 0% | 2.1 ms | 2.1 ms |
| test_count[1 Small File] | tests/benchmarks/test_interactive_reads.py | +2% | 3.5 ms | 3.4 ms |
| test_iter_rows_first_row[100 Small Files] | tests/benchmarks/test_interactive_reads.py | -7% | 153.7 ms | 164.4 ms |
| test_show[100 Small Files] | tests/benchmarks/test_interactive_reads.py | +2% | 23.2 ms | 22.6 ms |

Commits

Base: main (29ffd49)
+0.22%
feat(io): add filename provider support for parquet, csv and upload

Implement a cross-language FilenameProvider mechanism for Daft writes, covering block-based parquet/csv writers and row-based URL upload.

- Python
  - Extend `daft.io.filename_provider.FilenameProvider` and `_DefaultFilenameProvider` as the public strategy interface for filename generation.
  - Add optional `filename_provider` and internal `write_uuid` plumbing to `DataFrame.write_parquet` / `write_csv`, and route them via `LogicalPlanBuilder.write_tabular`.
  - Update `daft.logical.builder.LogicalPlanBuilder.write_tabular` to accept and forward `filename_provider` / `write_uuid` into the Rust logical plan.
  - Extend `daft.functions.url.upload` to accept `filename_provider` and generate a `write_uuid` for each logical upload.
  - Add `Expression.upload(..., filename_provider=...)` wrapper that passes through to `daft.functions.upload`.
  - Wire `daft.io.__init__` to expose `FilenameProvider` in the public API.
- Rust logical plan & DSL
  - Extend `daft-logical-plan::OutputFileInfo` with optional `filename_provider` and `write_uuid` fields.
  - Represent the Python provider as `common_py_serde::PyObjectWrapper` for safe serde round-tripping, and adjust `OutputFileInfo::new` accordingly.
  - Thread `filename_provider` / `write_uuid` through `LogicalPlanBuilder::table_write` (and its pyo3 binding) into `SinkInfo::OutputFileInfo`.
  - Add `RuntimePyObject` support for `Literal::Python` in the DSL runtime to allow passing Python objects (such as providers) as UDF args.
- Rust writers
  - Introduce `build_filename_with_provider` helper in `daft-writers::utils` that prefers calling a Python `FilenameProvider` hook when present, falling back to the previous UUID-based filename scheme.
  - Extend native parquet and csv writers (`create_native_parquet_writer` / `create_native_csv_writer`) to accept `filename_provider` and `write_uuid`, and use `build_filename_with_provider` with appropriate extensions ("parquet" / "csv").
  - Teach `PhysicalWriterFactory::create_writer` to unwrap the `PyObjectWrapper` stored in `OutputFileInfo` and pass the underlying `Arc<Py<PyAny>>` plus `write_uuid` into native writer creation.
- URL upload
  - Extend the `UrlUpload` UDF args to include `filename_provider: Option<RuntimePyObject>` and `write_uuid: Option<String>`.
  - When uploading into a single folder (`is_single_folder=True`), use the Python `FilenameProvider.get_filename_for_row(...)` hook (with ext="") to derive the basename; for row-specific full paths, keep the user-specified path untouched and do not call the provider.

This change brings Daft in line with Ray Data's `FilenameProvider` concept, giving users deterministic and customizable control over output filenames across different sinks, while preserving the existing default naming scheme when no provider is supplied.

88378b3, 4 days ago, by Jay-ju
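
The naming rules described in the commit message can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Daft's implementation: the class `PrefixFilenameProvider`, the helpers `build_filename` and `resolve_upload_path`, the method `get_filename_for_block`, every parameter list, and the fallback name format are hypothetical. Only the `get_filename_for_row` hook name, the `write_uuid` and `ext` concepts, and the `is_single_folder` rule are taken from the commit message.

```python
import uuid
from typing import Any, Dict, Optional


class PrefixFilenameProvider:
    """Hypothetical provider; method signatures are assumptions, not Daft's API."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def get_filename_for_block(self, write_uuid: str, file_idx: int, ext: str) -> str:
        # Deterministic name for a block-based (parquet/csv) output file.
        return f"{self.prefix}-{write_uuid}-{file_idx}.{ext}"

    def get_filename_for_row(
        self, row: Dict[str, Any], write_uuid: str, row_idx: int, ext: str
    ) -> str:
        # Deterministic basename for a row-based upload; ext may be empty.
        name = f"{self.prefix}-{write_uuid}-{row_idx}"
        return f"{name}.{ext}" if ext else name


def build_filename(
    provider: Optional[PrefixFilenameProvider],
    write_uuid: str,
    file_idx: int,
    ext: str,
) -> str:
    """Prefer the provider hook when present, else fall back to a UUID-based name."""
    if provider is not None:
        return provider.get_filename_for_block(write_uuid, file_idx, ext)
    # Assumed fallback format; the real default naming scheme is not shown here.
    return f"{uuid.uuid4()}-{file_idx}.{ext}"


def resolve_upload_path(
    location: str,
    is_single_folder: bool,
    provider: Optional[PrefixFilenameProvider],
    row: Dict[str, Any],
    write_uuid: str,
    row_idx: int,
) -> str:
    """Apply the provider only when all rows are uploaded into one folder."""
    if not is_single_folder:
        # Row-specific full path: keep it untouched and skip the provider.
        return location
    if provider is None:
        basename = str(uuid.uuid4())  # assumed default basename
    else:
        basename = provider.get_filename_for_row(row, write_uuid, row_idx, ext="")
    return f"{location.rstrip('/')}/{basename}"
```

With a provider supplied, the same `write_uuid` and index always yield the same filename, which is the deterministic behavior the commit message aims for; without one, the sketch falls back to a random UUID-based name.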