Creating a performance gate in a CI environment to prevent significant performance regressions from being deployed has been a long-standing goal for many software teams. But measuring performance on hosted CI runners is particularly challenging, mostly because of noisy neighbors leaking through the virtualization layers.
Still, it's worth the effort: performance regressions are harder to catch and more expensive to fix the longer they go unnoticed.
For many teams, the easiest way to get started is by running benchmarks in their existing CI environment.
Let's measure this noise using the benchmark suites of popular performance-focused open-source projects: next.js/turbopack by Vercel, ruff and uv by Astral, and reflex by Reflex.
To measure consistency, we'll use the coefficient of variation (CV): the standard deviation divided by the mean. This metric is useful because it expresses the relative dispersion of the results, making them comparable across benchmarks.
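As a quick illustration, here is how the coefficient of variation is computed for a single benchmark; the timing values below are made up for the example:

```python
import numpy as np

# Hypothetical wall-clock timings (in seconds) of the same benchmark
# across several CI runs of the suite.
timings = np.array([1.52, 1.48, 1.55, 1.50, 1.47, 1.58])

# Coefficient of variation: relative dispersion, comparable across benchmarks.
cv = timings.std(ddof=1) / timings.mean()
print(f"CV = {cv:.2%}")
```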
Each run is executed on a different machine, simulating real CI conditions. Within each run, each result is the outcome of multiple executions of the same benchmark, handled by the benchmarking framework in use.
This approach ensures that the measurements capture both machine-to-machine variability across runs and the stability of repeated executions within a single run.
Here are the results on GitHub Actions hosted runners after 100 runs for each benchmark suite:
Overall, the coefficient of variation is 2.66% on GitHub-hosted runners. Well... it might seem okay, but let's see what this number implies for incorrectly detected performance changes (both regressions and improvements).
Let's say we want a 2% performance gate to catch small regressions in each benchmark. Assuming the results follow a normal distribution, we can estimate the chance of observing a false positive.
Starting with $CV$, the coefficient of variation, and $\mu$, the mean, we can model the false positive rate for different performance gate thresholds.

Model one run as a random variable: $X \sim \mathcal{N}(\mu, \sigma^2)$, with $\sigma = CV \cdot \mu$ the standard deviation.

Define $t$, the performance gate threshold.

We observe a false positive (the gate flags a regression or improvement) when the run time drifts more than $t\mu$ from the mean:

$$|X - \mu| > t\mu$$

Standardise to a Z-score. Divide both sides by $\sigma = CV \cdot \mu$ to turn it into a standard normal problem:

$$\left|\frac{X - \mu}{\sigma}\right| > \frac{t}{CV}$$

Let $Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$. The event becomes $|Z| > \frac{t}{CV}$.

Compute the probability. Because the standard normal is symmetric:

$$P(\text{false positive}) = P\left(|Z| > \frac{t}{CV}\right) = 2\left(1 - \Phi\left(\frac{t}{CV}\right)\right)$$

where $\Phi$ is the standard normal CDF.
A coefficient of variation of 2.66% gives us a 45% chance of a false positive with this 2% performance gate. About 1 out of 2 runs would be a false positive. In an active development pipeline, this level of noise makes the results completely unreliable, turning performance checks into distractions rather than signals, and ultimately eroding trust in performance testing.
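As a sanity check, here is a minimal sketch of this formula in Python, using `scipy.stats.norm` and the numbers measured above:

```python
from scipy.stats import norm

cv = 0.0266  # coefficient of variation measured on GitHub-hosted runners
t = 0.02     # 2% performance gate threshold

# P(false positive) = P(|Z| > t / CV) = 2 * (1 - Phi(t / CV))
false_positive_rate = 2 * (1 - norm.cdf(t / cv))
print(f"{false_positive_rate:.1%}")  # ~45.2%
```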
To determine a more consistent performance gate threshold for the GitHub runners, we can plot the false positive rate for different values of the threshold:
Here, to guarantee a 1% false positive rate we'd have to use a 7% performance gate. That's not convincing, since any smaller performance change will go unnoticed. Moreover, regressions can compound pretty fast: every time a regression is merged, the baseline shifts higher, allowing further regressions to slip through.
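The same formula can be inverted to find the smallest gate threshold that stays below a target false positive rate; a quick sketch, reusing the CV measured above:

```python
from scipy.stats import norm

cv = 0.0266       # coefficient of variation on GitHub-hosted runners
target_fp = 0.01  # acceptable false positive rate: 1%

# Invert 2 * (1 - Phi(t / CV)) = target_fp  =>  t = CV * Phi^-1(1 - target_fp / 2)
threshold = cv * norm.ppf(1 - target_fp / 2)
print(f"{threshold:.1%}")  # ~6.9%, i.e. roughly the 7% gate mentioned above
```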
We've been working on isolating CI runners to eliminate the kind of noise that makes performance testing unreliable, while keeping the costs reasonable. This work spans both infrastructure and OS-level optimizations. Our Macro Runners run on bare-metal instances in the cloud, with additional stability configurations specifically designed for high-precision benchmarking.
We'll cover the technical details in an upcoming blog post (linked here once published), but in short: you can run your existing tests on this infrastructure with no changes to your benchmarks, and minimal changes in your CI workflow.
We reran the same benchmark suites on these runners, again, 100 times each. The improvement in variance was immediate:
Overall, we measure an average coefficient of variation of 0.56%, nearly 5 times lower than on GitHub-hosted runners.
Now, with the same 2% performance gate, we get a 0.04% chance (about 1 in 2,500 runs) of a false alarm, instead of the 45% we measured on GitHub-hosted runners before.
Consequently, this shifts the false positive curve dramatically:
Now, we can reach a sub-1% false positive rate with a 1.5% performance gate, catching finer-grained regressions without overwhelming contributors with false alarms!
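Plugging the new coefficient of variation into the same formula confirms both numbers:

```python
from scipy.stats import norm

cv = 0.0056  # coefficient of variation measured on CodSpeed Macro Runners

for t in (0.02, 0.015):  # 2% and 1.5% performance gates
    fp = 2 * (1 - norm.cdf(t / cv))
    print(f"gate {t:.1%} -> false positive rate {fp:.2%}")
# gate 2.0% -> false positive rate 0.04%
# gate 1.5% -> false positive rate 0.74%
```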
The configuration changes required to run on CodSpeed Macro Runners are minimal in a GitHub Actions workflow, boiling down to changing the `runs-on` field:
```diff
jobs:
  benchmarks:
-   runs-on: ubuntu-latest
+   runs-on: codspeed-macro
    steps:
      - uses: actions/checkout@v4
      # ...
-     - run: <Benchmark command>
+     - uses: CodSpeedHQ/action@v3
+       with:
+         run: <Benchmark command>
```
Wrapping the benchmark command with the CodSpeed action allows collecting and uploading the performance results, so you can configure performance gates and see the detailed results in your CodSpeed dashboard:
Example result from uv (check it out here)
If you've tried building continuous benchmarking in CI before and felt like the noise made it impossible, you're not alone: most CI environments just weren't built for this kind of precision.
CodSpeed Macro Runners give you a clean foundation to actually rely on your performance data. Less noise means fewer false alarms, and fewer false alarms means you can finally trust that small regressions won't slip through unnoticed. And better performance overall means happier users, lower costs, tighter developer feedback loops, and less time spent debugging post-deploy slowdowns.
We'll soon publish a deep dive to share how we built this infrastructure, from choosing the hardware to the stability tweaks we had to layer on top of the OS, and why these details matter so much when you're chasing < 1% regressions.
In the meantime, feel free to explore the live examples we presented here, check out the docs, or reach out if you want help setting things up.
Are Benchmarks From Cloud CI Services Reliable? by the author of Criterion.rs
All the benchmark suites are taken from projects that already use CodSpeed to run their benchmarks:
| | GitHub-hosted runners | CodSpeed Macro runners |
|---|---|---|
| Image | Ubuntu 24.04 GitHub Image | Ubuntu 24.04 with additional stability fine-tuning |
| CPU Architecture | x86-64 | AArch64 |
| CPU | AMD EPYC 7763 64-Core Processor | AWS Graviton (Cortex-A72) |
| Allocated CPUs | 4 vCPUs | All 16 CPUs (bare-metal) |
| RAM | 16 GB | 32 GB |
| Cost | $0.008 / min | $0.032 / min |