Latest Results
feat: add gpu tracegen for boundary chip (#2378)
Resolves INT-5725.
## Background: PersistentBoundaryAir Update (not in this PR)
The `feat/access-adapter-removal` branch removes memory access adapters
from the persistent memory path. Previously, separate `AccessAdapterAir`
circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the
granularity of memory bus interactions) and `CHUNK=8` (the granularity
of Merkle tree hashing/Poseidon2 digests).
The updated `PersistentBoundaryAir` eliminates this by operating on
8-byte chunks directly while tracking **per-sub-block timestamps**:
```
Old PersistentBoundaryCols:
expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp
^^^^^^^^^
single timestamp
New PersistentBoundaryCols:
expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2]
^^^^^^^^^^^^^
one per 4-byte block
```
Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE
= 2` sub-blocks. The boundary AIR emits **two** memory bus interactions
per row (one per 4-byte sub-block), each with its own timestamp.
Untouched sub-blocks within a touched chunk keep `timestamp=0`, which
naturally balances against the initial-state row (also at `t=0`).
---
## This PR: GPU Trace Generation for the Updated Boundary Chip
This PR adapts the GPU trace generation pipeline to the new
per-sub-block-timestamp design. The core challenge: the CPU-side
"touched memory" partition arrives as sorted **4-byte records**, but the
boundary chip and Merkle tree need **8-byte records** with per-block
timestamps and initial-memory fill for untouched sub-blocks.
### New CUDA Kernel: `inventory.cu` — Merge Records on GPU
A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4,
1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases:
1. **`cukernel_build_candidates`** — Each thread inspects one input
record. If it starts a new 8-byte output chunk (different `ptr/8` than
the previous record), it:
- Reads the full 8-byte initial memory from device
- Overwrites the touched 4-byte sub-block with final values + timestamp
- If the next input record belongs to the same output chunk, also
patches the other sub-block
- Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same
chunk)
2. **`cukernel_scatter_compact`** — CUB `ExclusiveSum` on flags produces
output positions; flagged records are scattered into a compact output
array.
### Updated `boundary.cu` — Per-Block Timestamps
The `BoundaryRecord` struct is parameterized on `BLOCKS`:
```c++
template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord {
uint32_t address_space;
uint32_t ptr;
uint32_t timestamps[BLOCKS]; // was: uint32_t timestamp;
uint32_t values[CHUNK];
};
```
The persistent trace gen kernel writes `timestamps=[0,0]` for initial
rows and the actual per-block timestamps for final rows.
### Updated `memory.rs` — Host-Side Orchestration
The Rust side:
1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into
GPU-compatible `MemoryInventoryRecord<4,1>` structs
2. Uploads to device and calls the merge kernel
3. Sends merged records to the boundary chip
(`finalize_records_persistent_device`)
4. Converts merged records into Merkle records with `timestamp =
max(timestamps[0], timestamps[1])` for the tree update
---
## Walkthrough: Sample Trace
Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4`
granularity:
| addr_space | ptr | timestamp | values |
|------------|-----|-----------|-----------------|
| 1 | 0 | 5 | [1, 2, 3, 4] |
| 1 | 4 | 10 | [5, 6, 7, 8] |
| 1 | 16 | 3 | [9, 0, 0, 0] |
Initial memory is all zeros.
### Step 1 — Convert to `InRec<4, 1>`
```
InRec[0]: { as=1, ptr=0, timestamps=[5], values=[1,2,3,4] }
InRec[1]: { as=1, ptr=4, timestamps=[10], values=[5,6,7,8] }
InRec[2]: { as=1, ptr=16, timestamps=[3], values=[9,0,0,0] }
```
### Step 2 — GPU `cukernel_build_candidates`
Each thread computes `output_chunk = ptr / 8`:
| idx | ptr | output_chunk | same as prev? | flag |
|-----|-----|-------------|---------------|------|
| 0 | 0 | 0 | N/A (first) | 1 |
| 1 | 4 | 0 | yes | 0 |
| 2 | 16 | 2 | no | 1 |
**Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`,
`timestamps[0] = 5`
- Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` →
patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10`
- Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }`
**Thread 1** (flag=0): Skipped (same output chunk as thread 0).
**Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`,
`timestamps[0] = 3`
- No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from
initial memory)
- Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }`
### Step 3 — Prefix sum + scatter compact
```
flags = [1, 0, 1]
positions = [0, 1, 1] (exclusive prefix sum)
out[0] = thread 0's record
out[1] = thread 2's record
out_num_records = 2
```
### Step 4 — Boundary chip trace (2 rows per record = 4 active rows)
| Row | expand_dir | as | leaf_label | values | hash | timestamps |
|-----|------------|----|------------|---------------------------|---------------|------------|
| 0 | +1 (init) | 1 | 0 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 1 | -1 (final) | 1 | 0 | [1,2,3,4,5,6,7,8] | H([1,..,8]) | [5, 10] |
| 2 | +1 (init) | 1 | 2 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 3 | -1 (final) | 1 | 2 | [9,0,0,0,0,0,0,0] | H([9,0,..0]) | [3, 0] |
Each final row generates **two** memory bus sends:
- Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4,
values=[5,6,7,8], ts=10)`
- Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1,
ptr=20, values=[0,0,0,0], ts=0)`
The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks
of the final rows for untouched memory, eliminating the need for access
adapters.
### Step 5 — Merkle tree records
For the Merkle tree, each record uses a single timestamp =
`max(timestamps[0], timestamps[1])`:
| as | ptr | merkle_ts | values |
|----|-----|-----------|---------------------------|
| 1 | 0 | 10 | [1,2,3,4,5,6,7,8] |
| 1 | 16 | 3 | [9,0,0,0,0,0,0,0] |
These feed into the existing `update_with_touched_blocks` for Merkle
root computation.
---
## Other Changes
- **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 +
DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 +
CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records
directly. Also fixed a potential `ilog2(0)` panic in
`calculate_unpadded_height`.
- **New test** `test_empty_touched_memory_uses_full_chunk_values`
validates that the empty-partition edge case correctly reads initial
memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root
vs CPU.
---------
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>feat/access-adapter-removal feat: add gpu tracegen for boundary chip (#2378)
Resolves INT-5725.
## Background: PersistentBoundaryAir Update (not in this PR)
The `feat/access-adapter-removal` branch removes memory access adapters
from the persistent memory path. Previously, separate `AccessAdapterAir`
circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the
granularity of memory bus interactions) and `CHUNK=8` (the granularity
of Merkle tree hashing/Poseidon2 digests).
The updated `PersistentBoundaryAir` eliminates this by operating on
8-byte chunks directly while tracking **per-sub-block timestamps**:
```
Old PersistentBoundaryCols:
expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp
^^^^^^^^^
single timestamp
New PersistentBoundaryCols:
expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2]
^^^^^^^^^^^^^
one per 4-byte block
```
Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE
= 2` sub-blocks. The boundary AIR emits **two** memory bus interactions
per row (one per 4-byte sub-block), each with its own timestamp.
Untouched sub-blocks within a touched chunk keep `timestamp=0`, which
naturally balances against the initial-state row (also at `t=0`).
---
## This PR: GPU Trace Generation for the Updated Boundary Chip
This PR adapts the GPU trace generation pipeline to the new
per-sub-block-timestamp design. The core challenge: the CPU-side
"touched memory" partition arrives as sorted **4-byte records**, but the
boundary chip and Merkle tree need **8-byte records** with per-block
timestamps and initial-memory fill for untouched sub-blocks.
### New CUDA Kernel: `inventory.cu` — Merge Records on GPU
A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4,
1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases:
1. **`cukernel_build_candidates`** — Each thread inspects one input
record. If it starts a new 8-byte output chunk (different `ptr/8` than
the previous record), it:
- Reads the full 8-byte initial memory from device
- Overwrites the touched 4-byte sub-block with final values + timestamp
- If the next input record belongs to the same output chunk, also
patches the other sub-block
- Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same
chunk)
2. **`cukernel_scatter_compact`** — CUB `ExclusiveSum` on flags produces
output positions; flagged records are scattered into a compact output
array.
### Updated `boundary.cu` — Per-Block Timestamps
The `BoundaryRecord` struct is parameterized on `BLOCKS`:
```c++
template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord {
uint32_t address_space;
uint32_t ptr;
uint32_t timestamps[BLOCKS]; // was: uint32_t timestamp;
uint32_t values[CHUNK];
};
```
The persistent trace gen kernel writes `timestamps=[0,0]` for initial
rows and the actual per-block timestamps for final rows.
### Updated `memory.rs` — Host-Side Orchestration
The Rust side:
1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into
GPU-compatible `MemoryInventoryRecord<4,1>` structs
2. Uploads to device and calls the merge kernel
3. Sends merged records to the boundary chip
(`finalize_records_persistent_device`)
4. Converts merged records into Merkle records with `timestamp =
max(timestamps[0], timestamps[1])` for the tree update
---
## Walkthrough: Sample Trace
Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4`
granularity:
| addr_space | ptr | timestamp | values |
|------------|-----|-----------|-----------------|
| 1 | 0 | 5 | [1, 2, 3, 4] |
| 1 | 4 | 10 | [5, 6, 7, 8] |
| 1 | 16 | 3 | [9, 0, 0, 0] |
Initial memory is all zeros.
### Step 1 — Convert to `InRec<4, 1>`
```
InRec[0]: { as=1, ptr=0, timestamps=[5], values=[1,2,3,4] }
InRec[1]: { as=1, ptr=4, timestamps=[10], values=[5,6,7,8] }
InRec[2]: { as=1, ptr=16, timestamps=[3], values=[9,0,0,0] }
```
### Step 2 — GPU `cukernel_build_candidates`
Each thread computes `output_chunk = ptr / 8`:
| idx | ptr | output_chunk | same as prev? | flag |
|-----|-----|-------------|---------------|------|
| 0 | 0 | 0 | N/A (first) | 1 |
| 1 | 4 | 0 | yes | 0 |
| 2 | 16 | 2 | no | 1 |
**Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`,
`timestamps[0] = 5`
- Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` →
patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10`
- Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }`
**Thread 1** (flag=0): Skipped (same output chunk as thread 0).
**Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`,
`timestamps[0] = 3`
- No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from
initial memory)
- Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }`
### Step 3 — Prefix sum + scatter compact
```
flags = [1, 0, 1]
positions = [0, 1, 1] (exclusive prefix sum)
out[0] = thread 0's record
out[1] = thread 2's record
out_num_records = 2
```
### Step 4 — Boundary chip trace (2 rows per record = 4 active rows)
| Row | expand_dir | as | leaf_label | values | hash | timestamps |
|-----|------------|----|------------|---------------------------|---------------|------------|
| 0 | +1 (init) | 1 | 0 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 1 | -1 (final) | 1 | 0 | [1,2,3,4,5,6,7,8] | H([1,..,8]) | [5, 10] |
| 2 | +1 (init) | 1 | 2 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 3 | -1 (final) | 1 | 2 | [9,0,0,0,0,0,0,0] | H([9,0,..0]) | [3, 0] |
Each final row generates **two** memory bus sends:
- Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4,
values=[5,6,7,8], ts=10)`
- Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1,
ptr=20, values=[0,0,0,0], ts=0)`
The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks
of the final rows for untouched memory, eliminating the need for access
adapters.
### Step 5 — Merkle tree records
For the Merkle tree, each record uses a single timestamp =
`max(timestamps[0], timestamps[1])`:
| as | ptr | merkle_ts | values |
|----|-----|-----------|---------------------------|
| 1 | 0 | 10 | [1,2,3,4,5,6,7,8] |
| 1 | 16 | 3 | [9,0,0,0,0,0,0,0] |
These feed into the existing `update_with_touched_blocks` for Merkle
root computation.
---
## Other Changes
- **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 +
DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 +
CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records
directly. Also fixed a potential `ilog2(0)` panic in
`calculate_unpadded_height`.
- **New test** `test_empty_touched_memory_uses_full_chunk_values`
validates that the empty-partition edge case correctly reads initial
memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root
vs CPU.
---------
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>feat/access-adapter-removal Active Branches
#2382-83%
#2391-3%
#2304+1%
© 2026 CodSpeed Technology