Eventual-Inc
Daft

Performance History

Latest Results

ci: retrigger to land CodSpeed on a consistent runner
BABTUNA:perf/packed-symbol-groupby
6 hours ago
feat: ASOF join benchmarking scripts (#6940)

## Summary

Adds a self-contained benchmarking suite for Daft's `join_asof` operation:

- **`data_generation.py`**: generates reproducible left/right parquet datasets at three scales (`small`, `medium`, `large`) with clustered timestamps and Zipf-skewed entity distribution, written to `benchmarking/data/asof_join/`
- **`benchmark.py`**: runs a single asof-join using Daft's native or Ray runner, wrapped in a memray memory tracker, and prints a JSON result with wall time and memray output path
- **`deployment.yaml`**: Ray cluster config for AWS (1 `m7i.large` head + 4 `r7i.4xlarge` workers) for distributed runs against S3

## How to run

From inside `benchmarking/asof_join/`:

**1. Generate data (one-time)**

```bash
python data_generation.py --scale small # or --scale medium / --scale large / --all
```

**2. Run locally (native runner)**

```bash
python benchmark.py --scale small
# Output: asof_join_memray.bin + JSON result on stdout
```

**3. Inspect memory profile**

```bash
memray flamegraph asof_join_memray.bin
```

**Run on a Ray cluster**

> Before running: update `DATA_ROOT` to your S3 bucket and uncomment `daft.set_runner_ray()` in `benchmark.py`. Also update the S3 bucket and IAM settings in `deployment.yaml`.

Spin up the cluster:

```bash
ray up benchmarking/asof_join/deployment.yaml
```

Forward the dashboard in one terminal:

```bash
ray dashboard benchmarking/asof_join/deployment.yaml
```

Submit the job in another (after updating `DATA_ROOT` and `daft.set_runner_ray()` in `benchmark.py`):

```bash
ray job submit --address "http://localhost:8265" --working-dir benchmarking/asof_join -- python benchmark.py --scale small
```

Tear down when done:

```bash
ray down benchmarking/asof_join/deployment.yaml
```
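For readers unfamiliar with the operation being benchmarked, the following is a minimal pure-Python sketch of as-of join semantics. It assumes the common "backward" convention (each left row is matched to the latest right row with the same key and a timestamp at or before the left timestamp); the PR description does not say which strategy `join_asof` uses, and the function name `asof_join_backward` is hypothetical and unrelated to Daft's implementation.

```python
from bisect import bisect_right


def asof_join_backward(left, right):
    """Illustrative as-of join (hypothetical helper, not Daft's API).

    left:  iterable of (key, ts)
    right: iterable of (key, ts, value)
    Returns a list of (key, ts, matched_value_or_None).
    """
    # Index the right side by key, with timestamps in ascending order.
    by_key = {}
    for key, ts, value in sorted(right, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    out = []
    for key, ts in left:
        times, values = by_key.get(key, ([], []))
        # Position of the last right timestamp that is <= ts, or -1 if none.
        idx = bisect_right(times, ts) - 1
        out.append((key, ts, values[idx] if idx >= 0 else None))
    return out


left_rows = [("a", 5), ("a", 1), ("b", 10)]
right_rows = [("a", 2, "x"), ("a", 5, "y"), ("b", 11, "z")]
result = asof_join_backward(left_rows, right_rows)
```

With skewed keys, as in the Zipf-distributed benchmark data, a few keys hold most of the right-side rows, which is presumably why the suite tracks both wall time and memory.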
main
6 hours ago
feat(show): env defaults and auto alignment for preview output (#6856)

## Summary

Implements bullet points 1 and 5 of the linked issue by improving `.show()` preview defaults and adding meaningful `align="auto"` behavior.

This PR adds environment-variable defaults for `.show()` formatting options and makes `align="auto"` behave as expected: numeric/decimal columns are right-aligned and non-numeric columns are left-aligned. Auto-alignment is opt-in via the new env var or an explicit `align="auto"` argument; the hard-coded default stays `"left"` so existing docstring/doctest output (~40 examples) remains stable.

## Why

The issue asks for more formatting control in `.show()` output. This PR focuses on two practical pieces:

- Configurable default preview formatting without changing callsites.
- Correct automatic alignment for mixed numeric/text previews when the user opts in.

## Changes Made

- Add `.show()` default resolution helper in `daft/dataframe/preview.py`:
  - `DAFT_SHOW_FORMAT`
  - `DAFT_SHOW_VERBOSE`
  - `DAFT_SHOW_MAX_WIDTH`
  - `DAFT_SHOW_ALIGN`
- Wire default resolution into `DataFrame.show(...)` in `daft/dataframe/dataframe.py`.
- Update Rust preview alignment logic in `src/daft-recordbatch/src/preview.rs`:
  - `PreviewAlign::Auto` now maps numeric/decimal dtypes to right alignment.
  - Non-numeric dtypes remain left-aligned.
- Add focused tests:
  - Rust unit test for auto alignment behavior.
  - Python tests for env default resolution, sentinel `None` handling, explicit-arg precedence, and the unchanged `"left"` fallback default.
  - Python formatting test covering `align="auto"` behavior.

## Behavior

- Explicit `.show(...)` arguments keep precedence.
- When callers leave arguments at their defaults, env vars now control the show formatting defaults.
- `align="auto"` (whether set via env var or explicit argument) produces numeric-right / non-numeric-left alignment in preview rendering.
- The hard-coded default alignment remains `"left"` to keep existing examples stable; users opt into the new behavior with `DAFT_SHOW_ALIGN=auto` or `df.show(align="auto")`.

## Test Plan

- `python -m ruff check daft/dataframe/preview.py daft/dataframe/dataframe.py tests/dataframe/test_show.py`
- `cargo test -p daft-recordbatch test_auto_aligns_numeric_right_and_non_numeric_left -- --nocapture`
- `DAFT_RUNNER=native pytest -q tests/dataframe/test_show.py -k 'resolve_show_defaults'`

## Related Issues

- Part of #4114

<img width="887" height="702" alt="image" src="https://github.com/user-attachments/assets/eb81e69b-17b5-4992-83a9-e4de2b5805e0" />
<img width="968" height="618" alt="image" src="https://github.com/user-attachments/assets/45a98587-0910-400c-bc5c-f2655f4a5bb9" />
<img width="962" height="703" alt="image" src="https://github.com/user-attachments/assets/c8d5fe48-c896-4eb4-8b31-86fc5fefa983" />

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
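The precedence described above (explicit argument, then env var, then the hard-coded `"left"`) and the auto-alignment rule can be sketched roughly as follows. This is an illustration only: `resolve_align` and `align_cell` are hypothetical names, the `"right"` value is an assumption, and the actual helper in `daft/dataframe/preview.py` may be structured differently. Only `DAFT_SHOW_ALIGN`, `"left"`, and `"auto"` come from the PR description.

```python
import os


def resolve_align(explicit=None, env=None):
    """Hypothetical sketch: explicit argument > DAFT_SHOW_ALIGN > "left"."""
    # Allow a mapping to be injected for testing; default to the process env.
    env = os.environ if env is None else env
    if explicit is not None:  # None is the "caller did not set it" sentinel
        return explicit
    return env.get("DAFT_SHOW_ALIGN", "left")


def align_cell(text, width, is_numeric, align):
    """Pad one preview cell; "auto" right-aligns numeric cells only."""
    if align == "auto":
        align = "right" if is_numeric else "left"
    return text.rjust(width) if align == "right" else text.ljust(width)


# An explicit argument wins over the environment.
chosen = resolve_align("auto", {"DAFT_SHOW_ALIGN": "left"})
numeric_cell = align_cell("42", 5, True, chosen)
text_cell = align_cell("ab", 5, False, chosen)
```

Using `None` as the sentinel keeps the public default unchanged while still letting the function tell "not provided" apart from an explicit `"left"`.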
main
16 hours ago
refactor(inline-agg): fold packed-u64 fast path into agg_symbolized_path

Restructures the packed-u64 two-string-column optimization to live inside `agg_symbolized_path` rather than as a separate dispatch branch. Eliminates the double byte-tally that previous revisions of this PR paid on the TPC-H Q1 short-string path, and matches PR #6748's dispatch structure exactly: `agg_groupby_inline`'s multi-column branch is now identical to #6748's.

Changes:

- `agg_symbolized_path` now symbolizes each Utf8/Binary col into a `Vec<u32>` once, stored alongside a `None` slot for non-string cols. If exactly two string cols, pack the two symbol IDs into a u64 and group with `FnvHashMap<u64, u32>` via the new `agg_packed_u64_inner` helper. Otherwise (mixed shape or 3+ cols), rebuild the symbolized RecordBatch and call `agg_generic_hash_path` as before.
- Delete the separate `agg_packed_u64_path` function and the `symbolize_string_col` helper; their logic is now inline in `agg_symbolized_path`.
- Revert the dispatcher in `agg_groupby_inline` to PR #6748's structure (`match agg_symbolized_path → None: generic`).
- Update doc comments on three inherited tests to reflect that the packed-u64 fast path lives inside `agg_symbolized_path`, not as a separate dispatch step.

Tests: all 48 `inline_agg` tests pass, including the 5 `test_inline_packed_u64_*` cases.

Benchmarks: long-string two-col shapes (1.2M-5M rows x 32-100k groups) continue to show ~1.06x-1.18x speedup vs the equivalent no-fast-path baseline on the same restructured code, matching the earlier separate-function implementation's measurements. Short-string shapes (TPC-H Q1) now route through the exact same code path as PR #6748, with no overhead added.
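The packing idea in this commit can be illustrated with a short Python sketch. It is not the Rust implementation: `symbolize` and `group_counts` are hypothetical names, a plain `dict` stands in for `FnvHashMap<u64, u32>`, and the sketch assumes each column has fewer than 2^32 distinct values so that a symbol ID fits in 32 bits.

```python
def symbolize(column):
    """Map each distinct string to a dense integer ID (first-seen order)."""
    ids = {}
    return [ids.setdefault(value, len(ids)) for value in column]


def group_counts(col_a, col_b):
    """Count rows per (col_a, col_b) group using one packed integer key."""
    sym_a, sym_b = symbolize(col_a), symbolize(col_b)
    group_of = {}  # packed key -> group index
    counts = []    # per-group row count
    for a, b in zip(sym_a, sym_b):
        # High 32 bits hold the first symbol, low 32 bits hold the second.
        packed = (a << 32) | b
        g = group_of.setdefault(packed, len(counts))
        if g == len(counts):  # first time this key is seen
            counts.append(0)
        counts[g] += 1
    return group_of, counts


def unpack(packed):
    """Recover the two symbol IDs from a packed key."""
    return packed >> 32, packed & 0xFFFFFFFF


groups, counts = group_counts(["x", "x", "y", "x"], ["p", "q", "p", "p"])
```

Hashing a single fixed-width integer avoids hashing and comparing two variable-length strings per row, which is consistent with the speedup the commit reports only for long-string shapes.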
BABTUNA:perf/packed-symbol-groupby
19 hours ago

Latest Branches

CodSpeed Performance Gauge
0%
perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping (#6924)
6 hours ago
41312f4
BABTUNA:perf/packed-symbol-groupby
CodSpeed Performance Gauge
-2%
20 hours ago
c2314fc
QuakeWang:fix/paimon-column-order
CodSpeed Performance Gauge
-1%
feat(show): env defaults and auto alignment for preview output (#6856)
21 hours ago
ade52cb
BABTUNA:feat/show-preview-defaults
© 2026 CodSpeed Technology