Latest Results
perf(parquet): rewrite reader with arrow-rs public decoder API (#6952)
## Summary
Replaces the parquet2-based reader (`arrowrs_reader`, `async_reader`,
`read_planner`) with one built on arrow-rs's `array_reader` API. New
module `daft-parquet/src/reader/` covering byte IO + decode +
per-row-group assembly, plus a sibling `helpers.rs` for predicate /
row-selection utilities.
## Layout
```
src/daft-parquet/src/
├── reader/
│ ├── mod.rs entry points (stream_parquet, read_parquet), phase-1 dispatch,
│ │ concurrent-but-ordered per-RG decode driver
│ ├── chunk_source.rs byte sources: local pread, remote coalesced range GETs
│ ├── field_reader.rs schema walk + per-column decode
│ ├── rg_processor.rs per-RG streaming + chunked-pred processors
│ └── util.rs pure helpers
└── helpers.rs predicate / row-selection utilities (pushable cols,
offset / delete-vector selections, RG pruning)
```
## Key behaviors
- **Local**: positioned coalesced `pread`s (64KB gap merge). No mmap, no
whole-file `Bytes`.
- **Remote**: per-RG range GETs with adjacency merge (≤1MB gap, split
runs >24MB at chunk boundaries into ~16MB pieces). Fetches run in
background; decoders await an assembled `HashMap<leaf, Bytes>` via
`Shared` — never park on per-byte IO. Duplicate RG indices in the same
scan are supported (no eviction on consume).
- **Concurrency / ordering**: per-RG decode tasks are spawned
concurrently onto the compute runtime via a `JoinSet`, each draining
into a per-RG bounded channel (capacity 1) which is read back in RG
order. Output is always file-ordered within a single file;
`maintain_order=false` at the scan layer only reorders BETWEEN scan
tasks.
- **Predicate pushdown** (two-phase): decode pred cols in parallel
across RGs, evaluate per-RG, reuse the decoded arrays during assembly so
cols in both predicate and projection (e.g. `where(col > 50)`) are
decoded once.
- **Chunked-pred path**: when projection is fully covered by predicate
cols, stream pred cols chunk-by-chunk and filter in place. Within-RG
early stop on `limit`.
- **Iceberg field-id mapping** applied before filtering active leaves by
name.
- **Nested types** decoded via lockstep parquet+arrow schema walk.
## Note: `experimental` feature on parquet dep
`array_reader` (with `PrimitiveArrayReader`, `make_byte_array_reader`,
`ListArrayReader`, …) is gated behind parquet's `experimental` feature
in arrow-rs — public but `#[doc(hidden)]` and no semver guarantees. Our
specific touchpoints have been stable in name for 1–2 years; future
parquet bumps may require minor adjustment.
## Test plan
- [x] `cargo test -p daft-parquet`
- [x] `pytest tests/io/test_parquet.py
tests/recordbatch/recordbatch_io/test_parquet.py` (62 passed)
- [x] `pytest tests/io` (861 passed, 1 unrelated moto fixture error)
- [x] Iceberg + projection + predicate paths
- [x] Local + remote benchmarks (below)
## Benchmarks
`daft.read_parquet(uri).to_arrow()` plus projection / limit / where /
combined variants. This PR built `maturin develop --release`; PyPI
versions installed from `pip install daft==X`. 2 warmups, 5–7 repeats,
best-of.
**Environment**: EC2 aarch64 (Graviton), us-west-2. Local fixtures on
EBS; remote reads target `s3://colinho-test/parquet_bench_fixtures/` in
the same region. (Earlier draft used macOS — re-running on EC2 because
Mac thermals were swinging numbers by 20–40% across reruns.)
**Comparison points**:
- **this PR** — HEAD of `colin/parquet-perf`.
- **0.7.13** — current PyPI latest (effectively `main`).
- **0.7.3** — older release pulled in by request for a broader baseline.
### Local (full read)
| rows | cols | rgs | 0.7.3 | 0.7.13 | this PR | vs 0.7.13 | vs 0.7.3 |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 10M | 1 | 1 | 54.0 ms | 20.7 ms | 20.9 ms | 0.99× | **2.58×** |
| 10M | 1 | 8 | 33.6 ms | 14.7 ms | 13.0 ms | 1.13× | **2.58×** |
| 10M | 1 | 64 | 23.3 ms | 20.0 ms | 18.4 ms | 1.09× | 1.27× |
| 1M | 32 | 1 | 40.4 ms | 110.8 ms | 31.3 ms | **3.54×** | 1.29× |
| 1M | 32 | 8 | 43.5 ms | 27.2 ms | 22.1 ms | 1.23× | **1.97×** |
| 1M | 32 | 64 | 67.8 ms | 46.0 ms | 46.9 ms | 0.98× | 1.45× |
| 10K | 1024 | 1 | 46.5 ms | 73.9 ms | 34.5 ms | **2.14×** | 1.35× |
| 10K | 1024 | 8 | 160.1 ms | 109.0 ms | 94.2 ms | 1.16× | **1.70×** |
| 10K | 1024 | 64 | 1190.9 ms | 685.5 ms | 566.5 ms | 1.21× | **2.10×**
|
Aggregate: **1.31× vs 0.7.13** (1.11 s → 0.85 s), **1.96× vs 0.7.3**
(1.66 s → 0.85 s).
### Remote (S3, all 5 ops)
| shape | op | 0.7.3 | 0.7.13 | this PR | vs 0.7.13 | vs 0.7.3 |
|---|---|---:|---:|---:|---:|---:|
| 10M×1×1 | full | 220.8 | 282.5 | 273.4 | 1.03× | 0.81× |
| 10M×1×1 | project_1col | 225.3 | 280.0 | 241.2 | 1.16× | 0.93× |
| 10M×1×1 | limit_100 | 198.2 | 277.5 | 241.0 | 1.15× | 0.82× |
| 10M×1×1 | filter_high | 231.2 | 371.0 | 242.0 | **1.53×** | 0.96× |
| 10M×1×1 | combined | 162.3 | 316.6 | 252.3 | 1.26× | 0.64× |
| 10M×1×8 | full | 177.2 | 537.1 | 195.0 | **2.75×** | 0.91× |
| 10M×1×8 | project_1col | 173.2 | 558.5 | 170.3 | **3.28×** | 1.02× |
| 10M×1×8 | limit_100 | 91.0 | 214.4 | 126.3 | 1.70× | 0.72× |
| 10M×1×8 | filter_high | 194.5 | 635.4 | 187.9 | **3.38×** | 1.04× |
| 10M×1×8 | combined | 159.6 | 190.8 | 174.3 | 1.09× | 0.92× |
| 10M×1×64 | full | 186.7 | 2990.1 | 171.8 | **17.40×** | 1.09× |
| 10M×1×64 | project_1col | 191.8 | 2877.0 | 225.0 | **12.79×** | 0.85×
|
| 10M×1×64 | limit_100 | 83.0 | 163.7 | 141.3 | 1.16× | 0.59× |
| 10M×1×64 | filter_high | 186.4 | 2849.2 | 229.0 | **12.44×** | 0.81× |
| 10M×1×64 | combined | 153.7 | 200.7 | 123.3 | 1.63× | 1.25× |
| 1M×32×1 | full | 271.6 | 481.1 | 294.8 | **1.63×** | 0.92× |
| 1M×32×1 | project_1col | 93.7 | 171.3 | 156.6 | 1.09× | 0.60× |
| 1M×32×1 | limit_100 | 238.0 | 335.8 | 314.5 | 1.07× | 0.76× |
| 1M×32×1 | filter_high | 325.5 | 448.8 | 421.0 | 1.07× | 0.77× |
| 1M×32×1 | combined | 99.0 | 192.0 | 209.3 | 0.92× | 0.47× |
| 1M×32×8 | full | 285.5 | 865.8 | 247.2 | **3.50×** | 1.16× |
| 1M×32×8 | project_1col | 101.6 | 533.6 | 211.8 | **2.52×** | 0.48× |
| 1M×32×8 | limit_100 | 145.9 | 250.0 | 201.4 | 1.24× | 0.72× |
| 1M×32×8 | filter_high | 337.7 | 1231.8 | 306.0 | **4.03×** | 1.10× |
| 1M×32×8 | combined | 97.7 | 228.3 | 149.3 | 1.53× | 0.65× |
| 1M×32×64 | full | 378.1 | 3600.1 | 330.6 | **10.89×** | 1.14× |
| 1M×32×64 | project_1col | 303.6 | 3060.5 | 257.1 | **11.90×** | 1.18×
|
| 1M×32×64 | limit_100 | 155.4 | 334.3 | 198.6 | 1.68× | 0.78× |
| 1M×32×64 | filter_high | 417.9 | 6402.3 | 389.6 | **16.43×** | 1.07× |
| 1M×32×64 | combined | 295.2 | 269.6 | 176.6 | 1.53× | **1.67×** |
| 10K×1024×1 | full | 332.0 | 493.2 | 402.6 | 1.22× | 0.82× |
| 10K×1024×1 | project_1col | 138.2 | 293.3 | 290.4 | 1.01× | 0.48× |
| 10K×1024×1 | limit_100 | 314.7 | 403.9 | 493.5 | 0.82× | 0.64× |
| 10K×1024×1 | filter_high | 358.6 | 599.1 | 507.3 | 1.18× | 0.71× |
| 10K×1024×1 | combined | 159.5 | 272.1 | 256.5 | 1.06× | 0.62× |
| 10K×1024×8 | full | 501.1 | 1021.2 | 377.2 | **2.71×** | 1.33× |
| 10K×1024×8 | project_1col | 188.6 | 626.0 | 272.2 | **2.30×** | 0.69×
|
| 10K×1024×8 | limit_100 | 244.4 | 390.2 | 337.8 | 1.15× | 0.72× |
| 10K×1024×8 | filter_high | 515.6 | 1291.5 | 481.7 | **2.68×** | 1.07×
|
| 10K×1024×8 | combined | 186.2 | 326.1 | 292.9 | 1.11× | 0.64× |
| 10K×1024×64 | full | 1766.0 | 4386.4 | 892.2 | **4.92×** | **1.98×** |
| 10K×1024×64 | project_1col | 506.1 | 3080.5 | 395.5 | **7.79×** |
1.28× |
| 10K×1024×64 | limit_100 | 523.4 | 493.0 | 421.7 | 1.17× | 1.24× |
| 10K×1024×64 | filter_high | 1964.7 | 7135.6 | 1071.6 | **6.66×** |
**1.83×** |
| 10K×1024×64 | combined | 473.8 | 493.7 | 374.5 | 1.32× | 1.27× |
(ms, all best-of.) Aggregate: **3.82× vs 0.7.13** (52.5 s → 13.7 s),
**1.05× vs 0.7.3** (14.4 s → 13.7 s).
### Headline
- **vs 0.7.13** (current PyPI latest): this PR wins everywhere — **1.31×
local**, **3.82× remote**, with up to **17× on many-RG remote
`full`/`filter` reads**. 0.7.13 has its own remote regressions vs 0.7.3
(especially 64-RG shapes) — this PR fixes them.
- **vs 0.7.3**: this PR is **1.96× local**, roughly flat in remote
aggregate but with notable wins on many-RG `full` / `filter_high` (the
byte-range-coalescing wins) and clear losses on small-payload
limit/projection ops where per-RG setup overhead doesn't recoup on tiny
GETs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Latest Branches
0%
XuQianJin-Stars:fix/lance-utils-redundant-num-groups -1%
BABTUNA:feat/temporal-unix-extractors 0%
BABTUNA:feat/temporal-tz-conversions © 2026 CodSpeed Technology