live-image-tracking-tools
geff

Performance History

Latest Results

GEFF IO improvement (#411)

Hi @cmalinmayor

# Proposed Change

Optimizes zarr I/O in `core_io` by replacing boolean-mask indexing with `oindex` integer indexing for reads, and adding explicit chunking with zarr v3 sharding for writes.

While profiling `tracksdata`, I noticed two performance issues:

1. `_load_prop_to_memory` converts boolean masks to Python lists via `mask.tolist()` before indexing zarr arrays. For sparse selections (e.g. 5 nodes out of 8674) this is ~7x slower than using integer indices with `oindex`, because zarr has to process the entire boolean array instead of just reading the relevant chunks.
2. No explicit chunking when writing arrays — zarr's auto-chunking splits trailing dimensions unnecessarily. For example, `edge_ids` with shape `(189870, 2)` gets chunked as `(47468, 1)`, putting each column in a separate chunk.

Also fixes a bug: the per-node boolean mask was applied to the varlength `DATA` array, which is a flat 1D concatenation with a completely different length. This crashes on zarr v3 with `VindexInvalidSelectionError`. The DATA array must always be loaded in full, since deserialization uses byte offsets from VALUES.

### Benchmarks

Dataset: BF-C2DL-HSC/01 (8674 nodes, 189870 edges)

**`reader.build()` selecting 5 nodes:**

| Version | Time |
|---------|------|
| Before (`mask.tolist`) | ~155 ms |
| After (`oindex`) | ~21 ms |

**File count on zarr v3 write:**

| Version | Files on disk |
|---------|---------------|
| Before (auto-chunking) | 119 |
| After (sharding) | 67 |

**Chunking examples:**

| Array | Before | After |
|-------|--------|-------|
| `edge_ids` (189870, 2) uint64 | (47468, **1**) | (189870, **2**) |
| `bbox` (8674, 6) int64 | (4337, 6) | (8674, 6) |
| `mask/data` (4192807,) uint64 | (131026,) | (1048576,) |

# Types of Changes

- Bugfix (non-breaking change which fixes an issue)
- New feature or enhancement

Which topics does your change affect?

- Core io

# Checklist

- [x] I have read the [developer/contributing](https://github.com/live-image-tracking-tools/geff/blob/main/CONTRIBUTING) docs.
- [ ] I have added tests that prove that my feature works in various situations or tests the bugfix (if appropriate).
- [x] I have checked that I maintained or improved code coverage.
- [x] I have written docstrings and checked that they render correctly by looking at the docs preview (link left as a comment on the PR).

# Further Comments

The change is split into two commits:

1. **Reading**: adds `_load_zarr_subset()` using `oindex` and fixes the varlength DATA bug. The boolean-mask API of `_load_prop_to_memory` and `build()` is unchanged — the conversion to integer indices happens internally.
2. **Writing**: adds `_write_zarr_array()` that chunks only along the first dimension (power-of-two size, ~8 MiB target). On zarr v3 it uses sharding, so each array becomes a single file with sub-chunks inside. On zarr v2 it just uses the larger explicit chunks. Empty arrays fall back to direct assignment.

All 116 `test_core_io` tests pass. I haven't added new tests yet — the existing masked-read tests (`test_build_w_masked_nodes`, `test_load_prop_into_memory`, etc.) already exercise the new code paths, since the public API is unchanged. Happy to add targeted benchmarks or edge-case tests if needed.

---------

Co-authored-by: Caroline Malin-Mayor <malinmayorc@janelia.hhmi.org>
Co-authored-by: Caroline Malin-Mayor <cmalinmayor@gmail.com>
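To illustrate the read-path idea from the PR description above, here is a minimal numpy sketch of turning a sparse boolean node mask into integer indices. The array and the index values are made-up stand-ins; with a real zarr array, `idx` would be passed as `arr.oindex[idx]` so that only the touched chunks are read, instead of scanning the whole boolean mask.

```python
import numpy as np

n_nodes = 8674
values = np.arange(n_nodes)  # stand-in for a zarr node-property array

# Sparse boolean mask selecting 5 of 8674 nodes (the benchmark's sparse case).
mask = np.zeros(n_nodes, dtype=bool)
mask[[3, 17, 120, 4000, 8000]] = True

# Convert the mask to integer indices once, internally; with zarr this is
# what feeds `arr.oindex[idx]`. The public boolean-mask API stays unchanged.
idx = np.flatnonzero(mask)

subset = values[idx]
# Same result as boolean indexing, without touching every element.
assert np.array_equal(subset, values[mask])
```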
main
2 days ago
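The first-dimension chunking rule described in the writing commit (power-of-two chunk size, ~8 MiB target, chunking only along axis 0) can be sketched in plain Python. This is a hypothetical reconstruction, not the actual `_write_zarr_array()` code, but under those assumptions it reproduces the "After" chunk shapes reported in the PR's chunking table.

```python
def first_dim_chunks(shape: tuple, itemsize: int,
                     target_bytes: int = 8 * 2**20) -> tuple:
    """Chunk only along the first dimension, aiming for ~8 MiB per chunk."""
    # Bytes per "row", i.e. one slice along axis 0 (all trailing dims intact).
    row_bytes = itemsize
    for dim in shape[1:]:
        row_bytes *= dim
    rows = max(1, target_bytes // row_bytes)
    # Round down to a power of two, then cap at the array length so small
    # arrays become a single chunk.
    chunk0 = min(1 << (rows.bit_length() - 1), shape[0])
    return (max(chunk0, 1), *shape[1:])

# e.g. edge_ids: (189870, 2) uint64 -> a single full-width chunk
print(first_dim_chunks((189870, 2), itemsize=8))
```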
GEFF IO improvement (#411)
main
7 days ago
Apply suggestion from @cmalinmayor
JoOkuma:jookuma/geff-io-improv
7 days ago
fixing zarr import
JoOkuma:jookuma/geff-io-improv
7 days ago
Test reading with empty node mask
JoOkuma:jookuma/geff-io-improv
8 days ago
Merge branch 'main' into jookuma/geff-io-improv
JoOkuma:jookuma/geff-io-improv
8 days ago
Dataframe/csv to geff (#407)

# Proposed Change

Convert a dataframe to a GEFF (and thus a csv read with pandas). Closes #319

@AnniekStok This would hopefully take over the functionality of your funtracks PR https://github.com/funkelab/funtracks/pull/186 - we can just call dataframes_to_memory_geff after splitting the edges into their own df.

The current implementation doesn't support list elements, but I think we should just write them into separate columns (unsquish on write) and then manipulate the InMemoryGeff after calling dataframes_to_memory_geff to resquish them as needed. We just need a way to record which columns should be squished, which can go in custom metadata.

Open questions:

- Should I add a benchmark?
- Should I implement the squishing? If so, how should the user tell us which columns to squish on read: an optional argument mapping each new name to its old column names, a GeffMetadata saved alongside the csvs with that dict in the extras, or both as options?

# Types of Changes

What types of changes does your code introduce? Delete those that do not apply.

- New feature or enhancement

Which topics does your change affect? Delete those that do not apply.

- Convert
- Core io

# Checklist

- [x] I have read the [developer/contributing](https://github.com/live-image-tracking-tools/geff/blob/main/CONTRIBUTING) docs.
- [x] I have added tests that prove that my feature works in various situations or tests the bugfix (if appropriate).
- [x] I have checked that I maintained or improved code coverage.
- [ ] I have written docstrings and checked that they render correctly by looking at the docs preview (link left as a comment on the PR).

# Further Comments

I would love at least the default-value refactoring to live in GEFF, along with the construct_props helper. If we keep the dataframe-specific code in funtracks, we can at least avoid duplicating the construction of the values and missing arrays with the correct dtype of default values.

---------

Co-authored-by: AnniekStok <anniek.stokkermans@gmail.com>
Co-authored-by: Morgan Schwartz <msschwartz21@gmail.com>
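The "splitting the edges into their own df" step mentioned above could look like the following pandas sketch. Everything here is an assumption for illustration: the column names (`id`, `parent_id`, `t`, `x`, `y`), the convention that edges come from parent links, and the sentinel `-1` are not specified by the PR, and `dataframes_to_memory_geff` is only referenced, not called.

```python
import pandas as pd

# Hypothetical tracking table: one row per detection, with a parent link.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "t": [0, 0, 1, 1],
    "x": [5.0, 9.0, 5.5, 9.2],
    "y": [1.0, 2.0, 1.1, 2.2],
    "parent_id": [-1, -1, 0, 1],  # -1 means "no parent" in this sketch
})

# Node table: every column except the link column.
node_df = df.drop(columns=["parent_id"])

# Edge table: one (source, target) pair per parent link.
linked = df[df["parent_id"] >= 0]
edge_df = pd.DataFrame({
    "source": linked["parent_id"].to_numpy(),
    "target": linked["id"].to_numpy(),
})

# node_df and edge_df are the two frames a helper like
# dataframes_to_memory_geff (name from the PR; signature assumed)
# would then consume.
```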
main
9 days ago
Dataframe/csv to geff (#407)
main
15 days ago

Latest Branches

CodSpeed Performance Gauge
+23%
GEFF IO improvement#411
7 days ago
a18dd53
JoOkuma:jookuma/geff-io-improv
-72%
15 days ago
ac3e3a5
dependabot/github_actions/softprops/action-gh-release-3
-2%
16 days ago
f974165
danielskatz:patch-1
© 2026 CodSpeed Technology