This article explains regressions in micro-benchmarks that are caused by changed cache behavior. These regressions are typically < 1 μs, which is why they often go unnoticed in wall-time measurements.
Toolchain Updates
It is generally recommended to pin your toolchain (e.g. compilers, dependencies, etc.) to avoid unintended changes, which can be bugs or malware, but also performance regressions. Common issues:
- Stable compiler toolchains: Using dtolnay/rust-toolchain@stable in GitHub Actions is non-deterministic. When Rust stable updates (e.g. from 1.92 to 1.93), your toolchain and Rust compiler update with it, which can change compiler optimizations.
- CI runner/image updates: runs-on: ubuntu-latest can move to a new Ubuntu release, changing glibc, LLVM, and other system libraries.
- Dependency updates: Changes in Cargo.lock (new crate versions) or JS/Python lockfiles can alter inlining decisions and code layout.
- Target architecture changes: Compiling for different CPU microarchitectures (e.g. x86-64-v2 vs x86-64-v3 vs x86-64-v4) enables different instruction sets (SSE, AVX, AVX-512).
To avoid these issues, do not rely on latest or stable tags; pin a specific version instead. Commit your lockfiles to version control to ensure reproducible builds.
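As a sketch of what pinning could look like in GitHub Actions (the file name, runner image, and version numbers below are illustrative, not a prescription):

```yaml
# .github/workflows/bench.yml (illustrative excerpt)
jobs:
  bench:
    # Pin the runner image instead of ubuntu-latest.
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      # Pin an exact Rust release instead of the moving `stable` alias.
      - uses: dtolnay/rust-toolchain@1.92.0
      # --locked fails the build if Cargo.lock is out of date.
      - run: cargo bench --locked
```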
Compiler Non-determinism
When can this happen? Any change to the code or compiler can trigger different compiler decisions. However, recompiling the entire codebase with the same source code, compiler version, and flags is usually deterministic.

Compiling optimized code is hard because it is a tradeoff between compilation resource usage (speed, memory, …), runtime execution speed, and binary size: you don't want your simple code to take an hour to compile or take up 1 GB of disk space. This tradeoff is balanced by using heuristics and thresholds that cover most cases while being fast enough. An example of this is inlining: by inlining a function, the overhead of a call at runtime is removed, and the compiler can better optimize the inlined body in the context of its caller. However, if every function were inlined, the binary would be much bigger, which decreases performance due to increased instruction cache pressure.

There are many other optimizations that can affect cache behavior:
- Basic block reordering: Moving cold error paths into separate functions, rearranging if branches, …
- Loop transformations: Loop unrolling, peeling, fusing, …
- Bounds checks: Compilers are often smart enough to eliminate bounds checks, but when they cannot, the checks may hinder loop unrolling/vectorization (see the sketch after this list).
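To make the bounds-check point concrete, here is a minimal Rust sketch (the function names are made up for illustration): the indexed loop keeps a bounds check unless the compiler can prove the index is in range, while the iterator version never needs one and is easier to unroll and vectorize.

```rust
// Indexed loop: each `data[i]` carries a bounds check unless the compiler
// can prove `i < data.len()`, which can block unrolling/vectorization.
fn sum_indexed(data: &[u64], n: usize) -> u64 {
    let mut sum = 0;
    for i in 0..n {
        sum += data[i];
    }
    sum
}

// Iterator version: no bounds checks are needed, so the loop is easier
// for the compiler to unroll and vectorize.
fn sum_iter(data: &[u64], n: usize) -> u64 {
    data.iter().take(n).sum()
}

fn main() {
    let data: Vec<u64> = (0..1024).collect();
    assert_eq!(sum_indexed(&data, data.len()), sum_iter(&data, data.len()));
}
```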
Function Alignment
Do you think a given function always has the same performance? Consider a small function foo that contains a jump. Depending on where foo is placed in memory, the instructions after the jmp may land in a different cache line, so the CPU may need to fetch another cache line before it can execute them. This is counted as an instruction cache miss, which may cost anywhere from 10-40 cycles (if the line is found in L2/L3) to 100+ cycles (if it goes all the way to RAM).
Because of that, compilers try to align functions and keep the hot paths close
to each other to minimize cache misses.
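As a rough way to observe this (the function name and body are purely illustrative), you can print where a function starts relative to 64-byte cache-line boundaries; a function that starts near the end of a line is more likely to have its hot path straddle two lines:

```rust
fn foo(x: u64) -> u64 {
    // A branchy body like this may compile to a conditional jump; if the
    // code after the jump falls into the next cache line, executing it can
    // cost an extra instruction-cache fetch.
    if x % 2 == 0 { x / 2 } else { 3 * x + 1 }
}

fn main() {
    let addr = foo as usize;
    // x86-64 cache lines are 64 bytes. The load address changes with ASLR,
    // but the offset within a cache line is fixed by the binary's layout,
    // so this value is stable across runs of the same binary.
    println!("foo starts at offset {} within its cache line", addr % 64);
    println!("foo(7) = {}", foo(7));
}
```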
Allocators
Most allocators are designed to be fast while keeping fragmentation and memory usage at a minimum. Just like compilers, they use heuristics to decide when to request or release memory, which can lead to unpredictable performance. Here are some examples that can cause different performance:
- Time-Based Memory Decay: Allocators like jemalloc implement "decay" logic, where unused "dirty" memory pages are returned to the OS after a specific duration (e.g., 10 seconds); see the sketch after this list.
- Adaptive Thread-Cache Sizing: High-performance allocators (like tcmalloc) dynamically resize thread-local caches based on "demand history."
- Memory Fragmentation Patterns: Allocation order determines fragmentation. If one benchmark allocates 1 MB while another allocates 64 B, the allocator may have to request more memory to satisfy the larger request, leading to fragmentation.
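One way to cut down on this source of noise, sketched below under the assumption that the tikv-jemallocator crate is available, is to pin the allocator explicitly instead of relying on whatever the CI image provides:

```rust
// Assumes tikv-jemallocator is added as a dependency in Cargo.toml.
use tikv_jemallocator::Jemalloc;

// Pin the global allocator so results do not depend on the system allocator
// shipped with the CI image.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // The benchmark workload would go here; this is just a placeholder allocation.
    let buf: Vec<u8> = vec![0; 1 << 20];
    println!("allocated {} bytes", buf.len());
}
```

jemalloc's decay behavior can then be tuned at startup, e.g. by running the benchmark with MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0 so unused pages are purged immediately rather than after a time-based decay, trading some throughput for more repeatable behavior.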
