feat: add image_hash() for image deduplication (#6485)
Adds native perceptual hashing to Daft, enabling large-scale image
similarity detection and deduplication workflows without requiring
custom UDFs.
---
#### Changes
New `daft.functions.image_hash()` function that accepts an Image column
and returns a `FixedSizeBinary` column. Supports 8 algorithms:

| Algorithm | Description |
|-----------|-------------|
| phash (default) | Full 2D DCT perceptual hash — most robust |
| phash_simple | Row-wise DCT only — faster variant |
| dhash | Horizontal difference hash — fast structural comparison |
| dhash_vertical | Vertical difference hash |
| ahash | Average hash — fastest |
| whash | Multi-level Haar wavelet hash |
| crop_resistant | Segment-based hash — robust against cropping |
| colorhash | HSV color distribution hash |
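
To make the two simplest table entries concrete, here is a minimal pure-Python sketch of an average hash and a horizontal difference hash. It is not the `daft-image` implementation: it assumes the image has already been converted to grayscale and resized to the hash grid, and the helper names (`pack_bits`, `ahash`, `dhash`) and the most-significant-bit-first packing are illustrative choices only.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale pixel values in 0..255 (assumed pre-resized)


def pack_bits(bits: List[bool]) -> bytes:
    """Pack booleans into bytes, most significant bit first (an illustrative choice)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def ahash(pixels: Gray) -> bytes:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return pack_bits([p > mean for p in flat])


def dhash(pixels: Gray) -> bytes:
    """Horizontal difference hash: one bit per adjacent pair, set when the right pixel is brighter."""
    return pack_bits(
        [row[x + 1] > row[x] for row in pixels for x in range(len(row) - 1)]
    )


# Tiny hand-made grids: 2x4 for ahash, 2x5 for dhash (each yields 8 bits).
print(ahash([[0, 255, 0, 255], [255, 255, 0, 0]]).hex())   # 5c
print(dhash([[0, 10, 5, 20, 20], [9, 8, 7, 6, 5]]).hex())  # a0
```

The difference hash needs one extra column per row because each bit compares two neighbouring pixels.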
---
#### Implementation Notes
- **Rust backend**: all algorithms are implemented natively in the
`daft-image` crate, including an FFT-based DCT and multi-level Haar DWT,
with results that are bit-exact against the Python `imagehash` library
- **Type system**: `HashMethod` enum implements `FromLiteral` +
`FromStr`; argument parsing uses `#[derive(FunctionArgs)]`, with no
redundant string round-trips
- **Performance**: batch hashing runs in parallel across all CPU cores
via `rayon`; the resize kernel operates on single-channel luma rather
than three-channel RGB, which cuts the dominant convolution cost roughly
in proportion to the channel count. Together these yield 5–25× speedups
over an equivalent Python UDF on the same Daft pipeline
- **Null propagation**: null images produce null hashes, consistent with
other Daft column operations
- **Input validation**: Python validates `method`, `hash_size`,
`binbits`, `segments`, and the power-of-2 constraint for `whash` early,
with clear error messages
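
The early input validation described in the last bullet can be sketched roughly as follows. This is a hypothetical helper, not the code in the PR: the name `validate_hash_args`, the default `hash_size` of 8, the positive-integer rule, and the error wording are all assumptions. Only the method names and the power-of-2 constraint for `whash` come from the description above, and `binbits` and `segments` are left out because their valid ranges are not stated.

```python
KNOWN_METHODS = {
    "phash", "phash_simple", "dhash", "dhash_vertical",
    "ahash", "whash", "crop_resistant", "colorhash",
}


def validate_hash_args(method: str = "phash", hash_size: int = 8) -> None:
    """Raise ValueError early for arguments that no algorithm could accept.

    Hypothetical helper: the default hash_size and the exact rules are assumptions.
    """
    if method not in KNOWN_METHODS:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(KNOWN_METHODS)}"
        )
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(hash_size, bool) or not isinstance(hash_size, int) or hash_size < 1:
        raise ValueError(f"hash_size must be a positive integer, got {hash_size!r}")
    # A positive integer is a power of 2 exactly when it has a single set bit.
    if method == "whash" and hash_size & (hash_size - 1) != 0:
        raise ValueError(f"whash requires hash_size to be a power of 2, got {hash_size}")


validate_hash_args("whash", 8)       # passes silently
try:
    validate_hash_args("whash", 12)  # 12 is not a power of 2
except ValueError as exc:
    print(exc)
```

Checking on the Python side before any work is scheduled means a bad argument fails immediately with a readable message rather than surfacing later from the execution backend.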
---
#### Tests
- `tests/cookbook/test_image_hash_compat.py`: bit-exact compatibility
tests against the `imagehash` Python library
- `tests/cookbook/test_image_hash.py`: standalone tests covering output
dtype/size, null propagation, identical-image zero distance,
discriminability, similarity ordering, and error handling — all 8
algorithms covered
- `tests/recordbatch/image/test_image_hash.py`: RecordBatch-level tests
- `src/daft-image/benches/image_ops.rs`: benchmarks for all algorithms
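
The "identical-image zero distance" and "similarity ordering" checks rest on comparing hashes by Hamming distance, the number of differing bits. A minimal sketch of that comparison on raw hash bytes is shown below; `hamming_distance` is an illustrative helper and the byte strings are made-up values, not output from any of the algorithms.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the inputs disagree; count those bits per byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Made-up 64-bit hashes standing in for an image, a near-duplicate, and an unrelated image.
original = bytes.fromhex("ff00ff00ff00ff00")
near_dup = bytes.fromhex("ff00ff00ff00ff01")
unrelated = bytes.fromhex("00ff00ff00ff00ff")

print(hamming_distance(original, original))   # 0
print(hamming_distance(original, near_dup))   # 1
print(hamming_distance(original, unrelated))  # 64
```

A deduplication pass would typically treat pairs below some small distance threshold as duplicates; the right threshold depends on the algorithm and hash size and is not specified here.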
---
Closes #4889