Avatar for the openvm-org user
openvm-org
openvm
BlogDocsChangelog

Performance History

Latest Results

docs: update tag for solidity-sdk (#2396)
main
2 days ago
test: run sha2 test vectors together
test/sha2-guest
2 days ago
test: run sha2 test vectors together
test/sha2-guest
2 days ago
feat: add gpu tracegen for boundary chip (#2378) Resolves INT-5725. ## Background: PersistentBoundaryAir Update (not in this PR) The `feat/access-adapter-removal` branch removes memory access adapters from the persistent memory path. Previously, separate `AccessAdapterAir` circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the granularity of memory bus interactions) and `CHUNK=8` (the granularity of Merkle tree hashing/Poseidon2 digests). The updated `PersistentBoundaryAir` eliminates this by operating on 8-byte chunks directly while tracking **per-sub-block timestamps**: ``` Old PersistentBoundaryCols: expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp ^^^^^^^^^ single timestamp New PersistentBoundaryCols: expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2] ^^^^^^^^^^^^^ one per 4-byte block ``` Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE = 2` sub-blocks. The boundary AIR emits **two** memory bus interactions per row (one per 4-byte sub-block), each with its own timestamp. Untouched sub-blocks within a touched chunk keep `timestamp=0`, which naturally balances against the initial-state row (also at `t=0`). --- ## This PR: GPU Trace Generation for the Updated Boundary Chip This PR adapts the GPU trace generation pipeline to the new per-sub-block-timestamp design. The core challenge: the CPU-side "touched memory" partition arrives as sorted **4-byte records**, but the boundary chip and Merkle tree need **8-byte records** with per-block timestamps and initial-memory fill for untouched sub-blocks. ### New CUDA Kernel: `inventory.cu` — Merge Records on GPU A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4, 1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases: 1. **`cukernel_build_candidates`** — Each thread inspects one input record. If it starts a new 8-byte output chunk (different `ptr/8` than the previous record), it: - Reads the full 8-byte initial memory from device - Overwrites the touched 4-byte sub-block with final values + timestamp - If the next input record belongs to the same output chunk, also patches the other sub-block - Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same chunk) 2. **`cukernel_scatter_compact`** — CUB `ExclusiveSum` on flags produces output positions; flagged records are scattered into a compact output array. ### Updated `boundary.cu` — Per-Block Timestamps The `BoundaryRecord` struct is parameterized on `BLOCKS`: ```c++ template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord { uint32_t address_space; uint32_t ptr; uint32_t timestamps[BLOCKS]; // was: uint32_t timestamp; uint32_t values[CHUNK]; }; ``` The persistent trace gen kernel writes `timestamps=[0,0]` for initial rows and the actual per-block timestamps for final rows. ### Updated `memory.rs` — Host-Side Orchestration The Rust side: 1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into GPU-compatible `MemoryInventoryRecord<4,1>` structs 2. Uploads to device and calls the merge kernel 3. Sends merged records to the boundary chip (`finalize_records_persistent_device`) 4. Converts merged records into Merkle records with `timestamp = max(timestamps[0], timestamps[1])` for the tree update --- ## Walkthrough: Sample Trace Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4` granularity: | addr_space | ptr | timestamp | values | |------------|-----|-----------|-----------------| | 1 | 0 | 5 | [1, 2, 3, 4] | | 1 | 4 | 10 | [5, 6, 7, 8] | | 1 | 16 | 3 | [9, 0, 0, 0] | Initial memory is all zeros. ### Step 1 — Convert to `InRec<4, 1>` ``` InRec[0]: { as=1, ptr=0, timestamps=[5], values=[1,2,3,4] } InRec[1]: { as=1, ptr=4, timestamps=[10], values=[5,6,7,8] } InRec[2]: { as=1, ptr=16, timestamps=[3], values=[9,0,0,0] } ``` ### Step 2 — GPU `cukernel_build_candidates` Each thread computes `output_chunk = ptr / 8`: | idx | ptr | output_chunk | same as prev? | flag | |-----|-----|-------------|---------------|------| | 0 | 0 | 0 | N/A (first) | 1 | | 1 | 4 | 0 | yes | 0 | | 2 | 16 | 2 | no | 1 | **Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`: - Reads initial memory `[0,0,0,0,0,0,0,0]` from device - `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`, `timestamps[0] = 5` - Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` → patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10` - Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }` **Thread 1** (flag=0): Skipped (same output chunk as thread 0). **Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`: - Reads initial memory `[0,0,0,0,0,0,0,0]` from device - `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`, `timestamps[0] = 3` - No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from initial memory) - Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }` ### Step 3 — Prefix sum + scatter compact ``` flags = [1, 0, 1] positions = [0, 1, 1] (exclusive prefix sum) out[0] = thread 0's record out[1] = thread 2's record out_num_records = 2 ``` ### Step 4 — Boundary chip trace (2 rows per record = 4 active rows) | Row | expand_dir | as | leaf_label | values | hash | timestamps | |-----|------------|----|------------|---------------------------|---------------|------------| | 0 | +1 (init) | 1 | 0 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] | | 1 | -1 (final) | 1 | 0 | [1,2,3,4,5,6,7,8] | H([1,..,8]) | [5, 10] | | 2 | +1 (init) | 1 | 2 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] | | 3 | -1 (final) | 1 | 2 | [9,0,0,0,0,0,0,0] | H([9,0,..0]) | [3, 0] | Each final row generates **two** memory bus sends: - Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4, values=[5,6,7,8], ts=10)` - Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1, ptr=20, values=[0,0,0,0], ts=0)` The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks of the final rows for untouched memory, eliminating the need for access adapters. ### Step 5 — Merkle tree records For the Merkle tree, each record uses a single timestamp = `max(timestamps[0], timestamps[1])`: | as | ptr | merkle_ts | values | |----|-----|-----------|---------------------------| | 1 | 0 | 10 | [1,2,3,4,5,6,7,8] | | 1 | 16 | 3 | [9,0,0,0,0,0,0,0] | These feed into the existing `update_with_touched_blocks` for Merkle root computation. --- ## Other Changes - **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 + DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 + CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records directly. Also fixed a potential `ilog2(0)` panic in `calculate_unpadded_height`. - **New test** `test_empty_touched_memory_uses_full_chunk_values` validates that the empty-partition edge case correctly reads initial memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root vs CPU. --------- Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com> Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>
feat/access-adapter-removal
2 days ago
feat: add gpu tracegen for boundary chip (#2378) Resolves INT-5725. ## Background: PersistentBoundaryAir Update (not in this PR) The `feat/access-adapter-removal` branch removes memory access adapters from the persistent memory path. Previously, separate `AccessAdapterAir` circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the granularity of memory bus interactions) and `CHUNK=8` (the granularity of Merkle tree hashing/Poseidon2 digests). The updated `PersistentBoundaryAir` eliminates this by operating on 8-byte chunks directly while tracking **per-sub-block timestamps**: ``` Old PersistentBoundaryCols: expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp ^^^^^^^^^ single timestamp New PersistentBoundaryCols: expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2] ^^^^^^^^^^^^^ one per 4-byte block ``` Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE = 2` sub-blocks. The boundary AIR emits **two** memory bus interactions per row (one per 4-byte sub-block), each with its own timestamp. Untouched sub-blocks within a touched chunk keep `timestamp=0`, which naturally balances against the initial-state row (also at `t=0`). --- ## This PR: GPU Trace Generation for the Updated Boundary Chip This PR adapts the GPU trace generation pipeline to the new per-sub-block-timestamp design. The core challenge: the CPU-side "touched memory" partition arrives as sorted **4-byte records**, but the boundary chip and Merkle tree need **8-byte records** with per-block timestamps and initial-memory fill for untouched sub-blocks. ### New CUDA Kernel: `inventory.cu` — Merge Records on GPU A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4, 1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases: 1. **`cukernel_build_candidates`** — Each thread inspects one input record. If it starts a new 8-byte output chunk (different `ptr/8` than the previous record), it: - Reads the full 8-byte initial memory from device - Overwrites the touched 4-byte sub-block with final values + timestamp - If the next input record belongs to the same output chunk, also patches the other sub-block - Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same chunk) 2. **`cukernel_scatter_compact`** — CUB `ExclusiveSum` on flags produces output positions; flagged records are scattered into a compact output array. ### Updated `boundary.cu` — Per-Block Timestamps The `BoundaryRecord` struct is parameterized on `BLOCKS`: ```c++ template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord { uint32_t address_space; uint32_t ptr; uint32_t timestamps[BLOCKS]; // was: uint32_t timestamp; uint32_t values[CHUNK]; }; ``` The persistent trace gen kernel writes `timestamps=[0,0]` for initial rows and the actual per-block timestamps for final rows. ### Updated `memory.rs` — Host-Side Orchestration The Rust side: 1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into GPU-compatible `MemoryInventoryRecord<4,1>` structs 2. Uploads to device and calls the merge kernel 3. Sends merged records to the boundary chip (`finalize_records_persistent_device`) 4. Converts merged records into Merkle records with `timestamp = max(timestamps[0], timestamps[1])` for the tree update --- ## Walkthrough: Sample Trace Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4` granularity: | addr_space | ptr | timestamp | values | |------------|-----|-----------|-----------------| | 1 | 0 | 5 | [1, 2, 3, 4] | | 1 | 4 | 10 | [5, 6, 7, 8] | | 1 | 16 | 3 | [9, 0, 0, 0] | Initial memory is all zeros. ### Step 1 — Convert to `InRec<4, 1>` ``` InRec[0]: { as=1, ptr=0, timestamps=[5], values=[1,2,3,4] } InRec[1]: { as=1, ptr=4, timestamps=[10], values=[5,6,7,8] } InRec[2]: { as=1, ptr=16, timestamps=[3], values=[9,0,0,0] } ``` ### Step 2 — GPU `cukernel_build_candidates` Each thread computes `output_chunk = ptr / 8`: | idx | ptr | output_chunk | same as prev? | flag | |-----|-----|-------------|---------------|------| | 0 | 0 | 0 | N/A (first) | 1 | | 1 | 4 | 0 | yes | 0 | | 2 | 16 | 2 | no | 1 | **Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`: - Reads initial memory `[0,0,0,0,0,0,0,0]` from device - `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`, `timestamps[0] = 5` - Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` → patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10` - Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }` **Thread 1** (flag=0): Skipped (same output chunk as thread 0). **Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`: - Reads initial memory `[0,0,0,0,0,0,0,0]` from device - `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`, `timestamps[0] = 3` - No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from initial memory) - Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }` ### Step 3 — Prefix sum + scatter compact ``` flags = [1, 0, 1] positions = [0, 1, 1] (exclusive prefix sum) out[0] = thread 0's record out[1] = thread 2's record out_num_records = 2 ``` ### Step 4 — Boundary chip trace (2 rows per record = 4 active rows) | Row | expand_dir | as | leaf_label | values | hash | timestamps | |-----|------------|----|------------|---------------------------|---------------|------------| | 0 | +1 (init) | 1 | 0 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] | | 1 | -1 (final) | 1 | 0 | [1,2,3,4,5,6,7,8] | H([1,..,8]) | [5, 10] | | 2 | +1 (init) | 1 | 2 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] | | 3 | -1 (final) | 1 | 2 | [9,0,0,0,0,0,0,0] | H([9,0,..0]) | [3, 0] | Each final row generates **two** memory bus sends: - Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4, values=[5,6,7,8], ts=10)` - Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1, ptr=20, values=[0,0,0,0], ts=0)` The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks of the final rows for untouched memory, eliminating the need for access adapters. ### Step 5 — Merkle tree records For the Merkle tree, each record uses a single timestamp = `max(timestamps[0], timestamps[1])`: | as | ptr | merkle_ts | values | |----|-----|-----------|---------------------------| | 1 | 0 | 10 | [1,2,3,4,5,6,7,8] | | 1 | 16 | 3 | [9,0,0,0,0,0,0,0] | These feed into the existing `update_with_touched_blocks` for Merkle root computation. --- ## Other Changes - **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 + DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 + CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records directly. Also fixed a potential `ilog2(0)` panic in `calculate_unpadded_height`. - **New test** `test_empty_touched_memory_uses_full_chunk_values` validates that the empty-partition edge case correctly reads initial memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root vs CPU. --------- Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com> Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>
feat/access-adapter-removal
2 days ago
chore: update download setup
docs/update-tag-solidity-sdk
2 days ago
feat: partially write load/store
rv64-instruction-spec
2 days ago
feat: partially write load/store
rv64-instruction-spec
2 days ago

Active Branches

feat: remove memory access adapters for address space `!= 4`
last run
2 days ago
#2382
CodSpeed Performance Gauge
-83%
last run
2 days ago
#2391
CodSpeed Performance Gauge
-3%
last run
1 month ago
#2304
CodSpeed Performance Gauge
+1%
© 2026 CodSpeed Technology
Home Terms Privacy Docs