perf(parquet): rewrite reader with arrow-rs public decoder API

Replaces the parquet2-based stack (arrowrs_reader, async_reader,
read_planner) with a flat (row_group × column) parallel decoder built
on arrow-rs's public low-level APIs (SerializedPageReader,
PrimitiveArrayReader, make_byte_array_reader, …).

Key changes:
- ParquetSource enum dispatches Local vs Url at one entry point
  (parquet_stream_v2 / parquet_read_v2)
- LocalChunkSource: positioned coalesced preads, no mmap, no whole-file Bytes
- RemoteChunkSource: per-RG coalesced range GETs with adjacency merge
  (≤1MB gap, ≤16MB per run, split at chunk boundaries above 24MB); fetches
  run in the background while decode setup proceeds
- Two-phase predicate pushdown: decode pred cols in parallel across RGs,
  evaluate per-RG to get bool masks, reuse pred arrays during assembly
- Chunked-pred path for queries whose projection is fully covered by the
  predicate: streams pred cols chunk-by-chunk for within-RG limit early
  stop and better cache locality
- Iceberg field-id mapping applied BEFORE filtering active leaves by name
- Nested type decode via FieldReaderBuilder walking parquet+arrow schemas
  in lockstep (mirrors arrow-rs's private complex::Visitor)
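
The adjacency merge described for RemoteChunkSource can be illustrated with a minimal sketch. This is not the code from this commit: the function name `coalesce_ranges` and the constants `MAX_GAP` and `MAX_RUN` are hypothetical, "MB" is interpreted as MiB, and the additional split at chunk boundaries above 24MB is not modelled.

```rust
use std::ops::Range;

// Thresholds taken from the commit message, read here as MiB (an assumption).
pub const MAX_GAP: u64 = 1 << 20; // merge neighbours separated by at most 1 MiB
pub const MAX_RUN: u64 = 16 << 20; // never grow a merged run beyond 16 MiB

/// Hypothetical helper: merge byte ranges into larger runs so that one
/// range GET can serve several column chunks.
pub fn coalesce_ranges(
    mut ranges: Vec<Range<u64>>,
    max_gap: u64,
    max_run: u64,
) -> Vec<Range<u64>> {
    // Drop empty ranges and order the rest by file offset.
    ranges.retain(|r| r.start < r.end);
    ranges.sort_by_key(|r| (r.start, r.end));

    let mut runs: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = runs.last_mut() {
            // Overlapping ranges have a gap of zero.
            let gap = r.start.saturating_sub(last.end);
            let merged_end = last.end.max(r.end);
            // Extend the current run only if both limits still hold.
            if gap <= max_gap && merged_end - last.start <= max_run {
                last.end = merged_end;
                continue;
            }
        }
        // Otherwise start a new run with this range.
        runs.push(r);
    }
    runs
}
```

Under these assumptions, two column chunks a few hundred bytes apart collapse into one request, while a chunk 2 MB further into the file gets its own request. A single range larger than `max_run` is passed through unchanged rather than split.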

Module layout under arrowrs_v2/:
- mod.rs            entry points + phase-1 dispatch
- chunk_source.rs   byte sources (Local/Remote)
- field_reader.rs   schema walk + per-column decode
- rg_processor.rs   per-RG streaming + chunked-pred processors
- util.rs           pure helpers (RowSelection arithmetic, projection)
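
As a rough illustration of the kind of RowSelection arithmetic mentioned for util.rs, the sketch below turns a per-row-group bool mask (the output of predicate evaluation) into run-length select/skip instructions. To stay self-contained it uses a hypothetical `Selector` enum and a hypothetical `mask_to_selectors` function rather than the arrow-rs types; it is an assumption about the general technique, not the helper from this commit.

```rust
/// Hypothetical stand-in for a run-length row selector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Selector {
    /// Decode the next `n` rows.
    Select(usize),
    /// Skip the next `n` rows without decoding them.
    Skip(usize),
}

/// Collapse a per-row bool mask into alternating select/skip runs.
pub fn mask_to_selectors(mask: &[bool]) -> Vec<Selector> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < mask.len() {
        let keep = mask[i];
        // Advance `j` to the end of the run of identical mask values.
        let mut j = i + 1;
        while j < mask.len() && mask[j] == keep {
            j += 1;
        }
        let len = j - i;
        out.push(if keep {
            Selector::Select(len)
        } else {
            Selector::Skip(len)
        });
        i = j;
    }
    out
}
```

With a representation like this, the non-predicate columns only need to be decoded for the `Select` runs, which is where the second phase of the pushdown would save work.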

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>