Latest Results
feat(dashboard): Draft tasks view for Flotilla (#6783)
## Changes Made
Adds a new collapsible Tasks sidebar to the Execution tab for Flotilla
queries, providing worker- and task-level visibility for debugging stuck
or slow queries.
While running:
<img width="1502" height="581" alt="image"
src="https://github.com/user-attachments/assets/f6f21a99-dfa0-4a79-b4d7-ab70d03c3c7a"/>
After finished:
<img width="1459" height="463" alt="image"
src="https://github.com/user-attachments/assets/3bb99782-73ab-462e-90cd-8ef169e056e2"/>
### Task view sidebar
- **Running tasks "top" section** — flat table of the longest-running
active tasks (top K=10), sorted by duration descending. Live-updating
durations. Collapses away when no tasks are running. Quick "what's
running and what's slow?" view without expanding groups.
- **Groups table** — tasks grouped by their full **`node_ids` chain**
(the distributed pipeline nodes that contributed local plan nodes to the
task). Group rows show aggregate stats (task count, status breakdown,
CPU time) and are expandable to individual task detail. Two columns
surface the plan:
- **Local Plan** — the local-pipeline display (e.g.
`ScanTaskSource→Project`), parsed from `task.name`.
- **Distributed Plan** — the chain of distributed pipeline nodes (e.g.
`Repartition→GroupedAggregate→Project`) derived by looking up each
`node_ids[i]` in the physical plan. Each node is a clickable chip.
- Header summary shows live running / finished / failed counts.
- **Click-to-filter**: click any chip in the Distributed Plan column (or
the existing "View Tasks" affordance on a plan-tree node) to narrow the
sidebar to tasks whose `node_ids` *contains* that node id. Deep-linkable
via the `node` URL param.
- **Hover-to-highlight**: hovering a task row glows every node in its
chain in the physical plan tree (amber inner ring), distinct from the
magenta sticky highlight set by the click filter.
### Bounded memory (`TaskStore`)
The dashboard server uses a **bounded `TaskStore`** instead of storing
every task individually, keeping memory at O(groups × K) instead of
O(total_tasks):
- **Per-group aggregate summaries** (`TaskGroupSummary`) — always
accurate counts, totals, and timestamps, even after individual tasks are
evicted.
- **Retention policy for individual tasks:**
- Active (in-flight) tasks: always retained for debugging stuck queries.
- Failed/cancelled tasks: retained up to a global FIFO cap (default
100).
- Completed tasks: **top-K longest by duration** per group (default
K=10) — retains bottleneck tasks, not just the most recent.
- Groups keyed by the full `node_ids` chain — tasks with identical
chains belong to the same group regardless of how their `name` strings
differ (e.g. parameter-driven variations like different limit values
yielding `Limit(10)` vs `Limit(100)` are now collapsed into one group).
- Frontend shows "Showing N of M tasks (longest by duration, plus active
and failed)" when drill-down is partial.
### Wiring
Backend wires `TaskSubmit`/`TaskEnd` events from the distributed
scheduler (`TaskLifecycleEventSubscriber` in #6759) through
`DaftContext` to the dashboard subscriber, which POSTs to new
`/engine/query/{id}/task/submit` and `/task/end` endpoints. The
endpoints update the per-query `TaskStore` and notify SSE subscribers
via the existing `query_info` event.
### Requires
- `DAFT_TASK_EVENTS_ENABLED=true` environment variable to emit task
lifecycle events from the distributed scheduler (env gate added in
#6782).
- `DAFT_DASHBOARD_URL` set so the subscriber forwards events to the
dashboard server.
### Known gaps relative to the original mock UI
- **No worker IP** — only `worker_id` is available in `TaskEndEvent`.
- **No per-task rows/bytes metrics** — these need further work to avoid
double-counting (tracked in
[DF-1992](https://linear.app/eventual/issue/DF-1992/add-per-task-rowsbytes-stats-to-the-flotilla-tasks-sidebar)).
- **No per-task context** — filename / partition id / shuffle key range
are not surfaced from the scheduler. The "Context" column from the mock
has been dropped.
- **No distinct Executing transition** — tasks go from `Pending`
straight to `Finished/Failed/Cancelled`. The UI treats `Pending` with no
`end_sec` as "Executing" as a best-effort proxy. `TaskEvent::Scheduled`
is not forwarded.
## Related Issues
Builds on #6750 / #6752 (mock Tasks UI) and #6759 (TaskSubmit/TaskEnd
lifecycle events). Requires #6782 (env gate for task events).
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Chris Kellogg <cckellogg@gmail.com> feat(functions): add simhash and hamming_distance for near-duplicate detection (#6821)
## Changes Made
This PR adds two new functions that together form a complete text
near-duplicate detection pipeline, complementary to the existing MinHash
approach:
- `simhash(text, *, ngram_size=3, hash_function="xxhash3_64")` —
Computes a 64-bit locality-sensitive fingerprint from text using
character n-grams. Similar texts produce fingerprints that differ in
only a few bits.
- `hamming_distance(left, right)` — Computes the bitwise Hamming
distance (popcount of XOR) between integer or FixedSizeBinary inputs.
This is the standard metric for comparing hash fingerprints produced by
**`simhash`**, **`image_hash`**, or any other bit-oriented hash.
Daft already has MinHash for Jaccard-based LSH deduplication. SimHash is
complementary: it produces a single UInt64 fingerprint (vs MinHash's
FixedSizeList\[UInt32, N\] signature vector), making it significantly
cheaper to store, index, and compare at scale. The tradeoff is that
SimHash captures weighted feature similarity rather than set overlap,
making it better suited for document-level near-duplicate detection.
---
### Rename: existing hamming\_distance → hamming\_distance\_str
This PR also renames the character-level hamming\_distance function
(introduced in #6797 yesterday, not yet released) to
hamming\_distance\_str, and gives the hamming\_distance name to the
bitwise version.
Why:
- "Hamming distance" in information theory and data engineering
universally refers to the number of differing bits. The character-level
variant is a valid but niche definition.
- The internal Rust implementation already named this function
hamming_distance_str, which is intuitive.
- Since #6797 has not shipped in any release, renaming now is free;
after a release it would be breaking. Latest Branches
0%
daft_func-extension-macro 0%
0%
claude/integrate-task-events-xazrE © 2026 CodSpeed Technology