Avatar for the Eventual-Inc user
Eventual-Inc
Daft
BlogDocsChangelog

Performance History

Latest Results

fix(extensions): handle zero-arg functions and add arg-count guard in call - Avoid indexing args[0] in return_field when param_count is 0 - Add arg-count validation in generated call() for clear error messages
daft_func-extension-macro
1 hour ago
feat(dashboard): Draft tasks view for Flotilla (#6783) ## Changes Made Adds a new collapsible Tasks sidebar to the Execution tab for Flotilla queries, providing worker- and task-level visibility for debugging stuck or slow queries. While running: <img width="1502" height="581" alt="image" src="https://github.com/user-attachments/assets/f6f21a99-dfa0-4a79-b4d7-ab70d03c3c7a"/> After finished: <img width="1459" height="463" alt="image" src="https://github.com/user-attachments/assets/3bb99782-73ab-462e-90cd-8ef169e056e2"/> ### Task view sidebar - **Running tasks "top" section** — flat table of the longest-running active tasks (top K=10), sorted by duration descending. Live-updating durations. Collapses away when no tasks are running. Quick "what's running and what's slow?" view without expanding groups. - **Groups table** — tasks grouped by their full **`node_ids` chain** (the distributed pipeline nodes that contributed local plan nodes to the task). Group rows show aggregate stats (task count, status breakdown, CPU time) and are expandable to individual task detail. Two columns surface the plan: - **Local Plan** — the local-pipeline display (e.g. `ScanTaskSource→Project`), parsed from `task.name`. - **Distributed Plan** — the chain of distributed pipeline nodes (e.g. `Repartition→GroupedAggregate→Project`) derived by looking up each `node_ids[i]` in the physical plan. Each node is a clickable chip. - Header summary shows live running / finished / failed counts. - **Click-to-filter**: click any chip in the Distributed Plan column (or the existing "View Tasks" affordance on a plan-tree node) to narrow the sidebar to tasks whose `node_ids` *contains* that node id. Deep-linkable via the `node` URL param. - **Hover-to-highlight**: hovering a task row glows every node in its chain in the physical plan tree (amber inner ring), distinct from the magenta sticky highlight set by the click filter. ### Bounded memory (`TaskStore`) The dashboard server uses a **bounded `TaskStore`** instead of storing every task individually, keeping memory at O(groups × K) instead of O(total_tasks): - **Per-group aggregate summaries** (`TaskGroupSummary`) — always accurate counts, totals, and timestamps, even after individual tasks are evicted. - **Retention policy for individual tasks:** - Active (in-flight) tasks: always retained for debugging stuck queries. - Failed/cancelled tasks: retained up to a global FIFO cap (default 100). - Completed tasks: **top-K longest by duration** per group (default K=10) — retains bottleneck tasks, not just the most recent. - Groups keyed by the full `node_ids` chain — tasks with identical chains belong to the same group regardless of how their `name` strings differ (e.g. parameter-driven variations like different limit values yielding `Limit(10)` vs `Limit(100)` are now collapsed into one group). - Frontend shows "Showing N of M tasks (longest by duration, plus active and failed)" when drill-down is partial. ### Wiring Backend wires `TaskSubmit`/`TaskEnd` events from the distributed scheduler (`TaskLifecycleEventSubscriber` in #6759) through `DaftContext` to the dashboard subscriber, which POSTs to new `/engine/query/{id}/task/submit` and `/task/end` endpoints. The endpoints update the per-query `TaskStore` and notify SSE subscribers via the existing `query_info` event. ### Requires - `DAFT_TASK_EVENTS_ENABLED=true` environment variable to emit task lifecycle events from the distributed scheduler (env gate added in #6782). - `DAFT_DASHBOARD_URL` set so the subscriber forwards events to the dashboard server. ### Known gaps relative to the original mock UI - **No worker IP** — only `worker_id` is available in `TaskEndEvent`. - **No per-task rows/bytes metrics** — these need further work to avoid double-counting (tracked in [DF-1992](https://linear.app/eventual/issue/DF-1992/add-per-task-rowsbytes-stats-to-the-flotilla-tasks-sidebar)). - **No per-task context** — filename / partition id / shuffle key range are not surfaced from the scheduler. The "Context" column from the mock has been dropped. - **No distinct Executing transition** — tasks go from `Pending` straight to `Finished/Failed/Cancelled`. The UI treats `Pending` with no `end_sec` as "Executing" as a best-effort proxy. `TaskEvent::Scheduled` is not forwarded. ## Related Issues Builds on #6750 / #6752 (mock Tasks UI) and #6759 (TaskSubmit/TaskEnd lifecycle events). Requires #6782 (env gate for task events). --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Chris Kellogg <cckellogg@gmail.com>
main
3 hours ago
Merge branch 'main' into sam/parquet-early-stats
sam/parquet-early-stats
3 hours ago
docs: add Tencent Cloud COS connector documentation (#6818) Add documentation for the COS (Cloud Object Storage) connector introduced in PR #6140. This includes: - New docs/connectors/cos.md with: - Overview of COS support (cos:// and cosn:// URL schemes) - Environment variable configuration (COS_* and TENCENTCLOUD_*) - Manual credential configuration examples - STS temporary credentials usage - Custom endpoint and anonymous access examples - Writing data examples - Full configuration options table - Region/endpoint auto-derivation explanation - Supported operations list - Updated docs/SUMMARY.md to include COS in navigation - Updated docs/connectors/index.md Cloud Storage table - Updated docs/api/config.md to include CosConfig API reference ## Changes Made | File | Change | |------|--------| | `docs/connectors/cos.md` | **New file** — Full documentation for the COS connector, including configuration via environment variables and `CosConfig`, reading/writing examples, STS temporary credentials, custom endpoint, anonymous access, configuration options table, and supported operations list (+199 lines) | | `docs/SUMMARY.md` | Added COS entry to the Data Connectors navigation section (+1 line) | | `docs/connectors/index.md` | Added Tencent Cloud COS row to the Cloud Storage table (+1 line) | | `docs/api/config.md` | Added `CosConfig` API reference under I/O Configurations section (+4 lines) | ## Related Issues - Follows up on #6140 (feat: Add Tencent Cloud COS support)
main
4 hours ago
feat(functions): add simhash and hamming_distance for near-duplicate detection (#6821) ## Changes Made This PR adds two new functions that together form a complete text near-duplicate detection pipeline, complementary to the existing MinHash approach: - `simhash(text, *, ngram_size=3, hash_function="xxhash3_64")` — Computes a 64-bit locality-sensitive fingerprint from text using character n-grams. Similar texts produce fingerprints that differ in only a few bits. - `hamming_distance(left, right)` — Computes the bitwise Hamming distance (popcount of XOR) between integer or FixedSizeBinary inputs. This is the standard metric for comparing hash fingerprints produced by **`simhash`**, **`image_hash`**, or any other bit-oriented hash. Daft already has MinHash for Jaccard-based LSH deduplication. SimHash is complementary: it produces a single UInt64 fingerprint (vs MinHash's FixedSizeList\[UInt32, N\] signature vector), making it significantly cheaper to store, index, and compare at scale. The tradeoff is that SimHash captures weighted feature similarity rather than set overlap, making it better suited for document-level near-duplicate detection. --- ### Rename: existing hamming\_distance → hamming\_distance\_str This PR also renames the character-level hamming\_distance function (introduced in #6797 yesterday, not yet released) to hamming\_distance\_str, and gives the hamming\_distance name to the bitwise version. Why: - "Hamming distance" in information theory and data engineering universally refers to the number of differing bits. The character-level variant is a valid but niche definition. - The internal Rust implementation already named this function hamming_distance_str, which is intuitive. - Since #6797 has not shipped in any release, renaming now is free; after a release it would be breaking.
main
5 hours ago

Latest Branches

CodSpeed Performance Gauge
0%
feat(extensions): add proc macro for making writing extension functio…#6837
2 hours ago
bff1b14
daft_func-extension-macro
CodSpeed Performance Gauge
0%
3 hours ago
5ed9da6
sam/parquet-early-stats
CodSpeed Performance Gauge
0%
feat(dashboard): Draft tasks view for Flotilla#6783
4 hours ago
16bde99
claude/integrate-task-events-xazrE
© 2026 CodSpeed Technology
Home Terms Privacy Docs