jay/split-all-files - Branch - Eventual-Inc/Daft

feat: Split all Parquet ScanTasks by default

#3454

Comparing jay/split-all-files (bba2bed) with main (063de4d)

-32%

Improvements: 0

Regressions: 1

Untouched: 26

New: 0

Dropped: 0

Ignored: 1

Benchmarks

Failed

test_show[100 Small Files] (Regression)

tests/benchmarks/test_interactive_reads.py::test_show[100 Small Files]

-32%

15.8 ms

23.3 ms
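
The reported change can be reproduced from the two timings. The sketch below assumes the first timing (15.8 ms) is the base measurement on main, the second (23.3 ms) is the head measurement on jay/split-all-files, and the percentage is base / head - 1. The page does not state this; it is inferred from the numbers, and the function name is hypothetical.

```python
def relative_change(base_ms: float, head_ms: float) -> float:
    """Return the change of head relative to base as a fraction.

    Assumption: change = base / head - 1, so a slower head gives a negative value.
    """
    return base_ms / head_ms - 1.0


# test_show[100 Small Files]: 15.8 ms (assumed base) vs 23.3 ms (assumed head)
change = relative_change(15.8, 23.3)
print(f"{change:+.0%}")  # prints -32%
```

Under the same assumption, test_iter_rows_first_row[100 Small Files] (304.4 ms and 327.3 ms) gives about -7%, which matches the figure listed for that benchmark.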

Passed

test_show[1 Small File]

tests/benchmarks/test_interactive_reads.py::test_show[1 Small File]

+2%

11.6 ms

11.5 ms

test_tpch[1-in-memory-native-8]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-8]

+1%

185.1 ms

183.5 ms

test_tpch_sql[1-in-memory-native-4]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-4]

+1%

158.5 ms

157.3 ms

test_tpch_sql[1-in-memory-native-10]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-10]

+1%

245.3 ms

243.4 ms

test_tpch_sql[1-in-memory-native-1]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-1]

+1%

493.3 ms

490.4 ms

test_tpch[1-in-memory-native-2]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-2]

+1%

114.3 ms

113.7 ms

test_tpch[1-in-memory-native-7]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-7]

+1%

153 ms

152.1 ms

test_tpch_sql[1-in-memory-native-2]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-2]

+1%

240.4 ms

239.1 ms

test_tpch[1-in-memory-native-4]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-4]

155.3 ms

154.7 ms

test_tpch[1-in-memory-native-1]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-1]

466.1 ms

465.1 ms

test_tpch[1-in-memory-native-6]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-6]

29.1 ms

29 ms

test_count[100 Small Files]

tests/benchmarks/test_interactive_reads.py::test_count[100 Small Files]

72.7 ms

72.6 ms

test_tpch[1-in-memory-native-5]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-5]

335.4 ms

334.8 ms

test_iter_rows_first_row[1 Small File]

tests/benchmarks/test_interactive_reads.py::test_iter_rows_first_row[1 Small File]

101.2 ms

101.1 ms

test_tpch_sql[1-in-memory-native-6]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-6]

29.8 ms

test_tpch_sql[1-in-memory-native-3]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-3]

153.8 ms

154 ms

test_tpch_sql[1-in-memory-native-5]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-5]

263.4 ms

263.9 ms

test_tpch[1-in-memory-native-3]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-3]

-1%

151.9 ms

152.9 ms

test_explain[100 Small Files]

tests/benchmarks/test_interactive_reads.py::test_explain[100 Small Files]

-1%

6 ms

test_tpch[1-in-memory-native-10]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-10]

-1%

231.7 ms

234.5 ms

test_tpch_sql[1-in-memory-native-7]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-7]

-2%

1.2 s

test_tpch[1-in-memory-native-9]

tests/benchmarks/test_local_tpch.py::test_tpch[1-in-memory-native-9]

-2%

485.5 ms

493.2 ms

test_tpch_sql[1-in-memory-native-9]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-9]

-2%

516.7 ms

527.3 ms

test_tpch_sql[1-in-memory-native-8]

tests/benchmarks/test_local_tpch.py::test_tpch_sql[1-in-memory-native-8]

-2%

197.4 ms

202.2 ms

test_count[1 Small File]

tests/benchmarks/test_interactive_reads.py::test_count[1 Small File]

-4%

3.5 ms

3.7 ms

test_iter_rows_first_row[100 Small Files]

tests/benchmarks/test_interactive_reads.py::test_iter_rows_first_row[100 Small Files]

-7%

304.4 ms

327.3 ms

Ignored

test_explain[1 Small File] (Ignored)

tests/benchmarks/test_interactive_reads.py::test_explain[1 Small File]

-3%

1.8 ms

Commits

Base

main

063de4d

-32%

Perform split on all files
Refactor into accumulator struct
Rename
Further simplification of accumulator logic
Cleanup into separate accumulator and accumulator context
Account for potentially null TableMetadata
Refactor into Iterator
Refactor into state machine
Convert Parquet file iterator to state machine as well
small cleanup
Reorganization into a separate module
Cleanup to extend this easier for using catalog information
Perform 16 Parquet metadata fetches in parallel
perf: reduce calls to ScanTask::estimate_in_memory_size
Adds unit test
Adds more unit tests
Add feature flag DAFT_ENABLE_AGGRESSIVE_SCANTASK_SPLITTING
Add a benchmarking script
Trigger data materialization in benchmark
Refactors to ParquetFileSplitter to not use state machine
Big refactor to split into multiple files and iterators
Add better docs
Refactor splitter code
nit naming
Refactor Fetchable
reordering for readability
Simplify State logic for FetchParquetMetadataByWindows
impl IntoIterator for SplittableScanTaskRef by propagating the config ref
docstrings
Removed advance_state for more explicit naming
Remove trait

bba2bed

1 month ago
