GEFF IO improvement (#411)
Hi @cmalinmayor
# Proposed Change
Optimizes zarr I/O in `core_io` by replacing boolean-mask indexing with
`oindex` integer indexing for reads, and adding explicit chunking with
zarr v3 sharding for writes.
While profiling `tracksdata`, I noticed two performance issues:
1. `_load_prop_to_memory` converts boolean masks to Python lists via
`mask.tolist()` before indexing zarr arrays. For sparse selections (e.g.
5 nodes out of 8674) this is ~7x slower than using integer indices with
`oindex`, because zarr has to process the entire boolean array instead
of just reading the relevant chunks.
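A minimal NumPy-only sketch of the index conversion this relies on (with a real zarr array `z`, the read then becomes `z.oindex[idx]`, which touches only the chunks containing those rows; the mask values here are illustrative):

```python
import numpy as np

# A sparse per-node selection: 5 of 8674 nodes.
mask = np.zeros(8674, dtype=bool)
mask[[3, 100, 2048, 5000, 8000]] = True

# Before: the full-length boolean mask is turned into a Python list and
# handed to zarr's fancy indexing, which must evaluate all 8674 entries.
as_list = mask.tolist()

# After: collapse the mask to integer indices once, up front.
idx = np.flatnonzero(mask)

print(len(as_list))   # 8674
print(idx.tolist())   # [3, 100, 2048, 5000, 8000]
```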
2. No explicit chunking when writing arrays — zarr's auto-chunking
splits trailing dimensions unnecessarily. For example, `edge_ids` with
shape `(189870, 2)` gets chunked as `(47468, 1)`, putting each column in
a separate chunk.
Also fixes a bug: the per-node boolean mask was applied to the varlength
`DATA` array, which is a flat 1D concatenation with a completely
different length. This crashes on zarr v3 with
`VindexInvalidSelectionError`. The DATA array must always be loaded in
full since deserialization uses byte offsets from VALUES.
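A toy illustration of why a per-node mask cannot be applied to DATA. It assumes VALUES holds per-node (offset, length) pairs into the flat buffer; the real geff layout may differ, but the shape mismatch is the same either way:

```python
import numpy as np

# Flat 1D concatenation of all variable-length payloads (the DATA array).
data = np.frombuffer(b"abcdefghij", dtype=np.uint8)

# Illustrative VALUES layout: one (offset, length) row per node.
values = np.array([[0, 3], [3, 2], [5, 5]])  # nodes 0..2

def payload(node: int) -> bytes:
    off, n = values[node]
    return data[off : off + n].tobytes()

# Selecting only node 2 still needs bytes 5..9 of the flat buffer, so
# DATA must be read in full; a 3-element node mask simply has the wrong
# length for the 10-element DATA array.
print(payload(2))  # b'fghij'
```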
### Benchmarks
Dataset: BF-C2DL-HSC/01 (8674 nodes, 189870 edges)
**`reader.build()` selecting 5 nodes:**
| Version | Time |
|---------|------|
| Before (`mask.tolist`) | ~155 ms |
| After (`oindex`) | ~21 ms |
**File count on zarr v3 write:**
| Version | Files on disk |
|---------|--------------|
| Before (auto-chunking) | 119 |
| After (sharding) | 67 |
**Chunking examples:**
| Array | Before | After |
|-------|--------|-------|
| `edge_ids` (189870, 2) uint64 | (47468, **1**) | (189870, **2**) |
| `bbox` (8674, 6) int64 | (4337, 6) | (8674, 6) |
| `mask/data` (4192807,) uint64 | (131026,) | (1048576,) |
# Types of Changes
- Bugfix (non-breaking change which fixes an issue)
- New feature or enhancement
Which topics does your change affect?
- Core io
# Checklist
- [x] I have read the
[developer/contributing](https://github.com/live-image-tracking-tools/geff/blob/main/CONTRIBUTING)
docs.
- [ ] I have added tests that prove that my feature works in various
situations or tests the bugfix (if appropriate).
- [x] I have checked that I maintained or improved code coverage.
- [x] I have written docstrings and checked that they render correctly
by looking at the docs preview (link left as a comment on the PR).
# Further Comments
The change is split into two commits:
1. **Reading**: adds `_load_zarr_subset()` using `oindex` and fixes the
varlength DATA bug. The boolean mask API of `_load_prop_to_memory` and
`build()` is unchanged — the conversion to integer indices happens
internally.
2. **Writing**: adds `_write_zarr_array()` that chunks only along the
first dimension (power-of-two size, ~8 MiB target). On zarr v3 it uses
sharding so each array becomes a single file with sub-chunks inside. On
zarr v2 it just uses the larger explicit chunks. Empty arrays fall back
to direct assignment.
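The chunk-size rule can be sketched as follows. This is an illustrative reimplementation (function name and exact rounding are assumptions, not the actual `_write_zarr_array` code), but it reproduces the "After" column of the chunking table above:

```python
import numpy as np

TARGET_BYTES = 8 * 2**20  # ~8 MiB per chunk

def first_dim_chunk(shape: tuple[int, ...], itemsize: int) -> tuple[int, ...]:
    """Chunk only along axis 0: the largest power-of-two row count whose
    chunk stays under ~8 MiB, capped at the full array length."""
    row_bytes = itemsize * int(np.prod(shape[1:], dtype=np.int64))
    rows = max(1, TARGET_BYTES // row_bytes)
    rows = 1 << (rows.bit_length() - 1)  # round down to a power of two
    return (min(rows, shape[0]), *shape[1:])

print(first_dim_chunk((189870, 2), 8))  # (189870, 2) -- fits in one chunk
print(first_dim_chunk((8674, 6), 8))    # (8674, 6)   -- fits in one chunk
print(first_dim_chunk((4192807,), 8))   # (1048576,)  -- 8 MiB sub-chunks
```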
All 116 `test_core_io` tests pass.
I haven't added new tests yet — the existing masked-read tests
(`test_build_w_masked_nodes`, `test_load_prop_into_memory`, etc.)
already exercise the new code paths since the public API is unchanged.
Happy to add targeted benchmarks or edge-case tests if needed.
---------
Co-authored-by: Caroline Malin-Mayor <malinmayorc@janelia.hhmi.org>
Co-authored-by: Caroline Malin-Mayor <cmalinmayor@gmail.com>