Creating a performance gate in a CI environment to prevent significant performance regressions from being deployed has been a long-standing goal for many software teams. But measuring performance on hosted CI runners is particularly challenging, mostly because of noisy neighbors leaking through the virtualization layers.
Still, it's worth the effort: performance regressions are harder to catch and more expensive to fix the longer they go unnoticed.
For many teams, the easiest way to get started is by running benchmarks in their existing CI environment.
Let's measure this noise using the benchmark suites of popular performance-focused open-source projects: next.js/turbopack by Vercel, ruff and uv by Astral, and reflex by Reflex.
To measure consistency, we'll use the coefficient of variation (CV): the standard deviation divided by the mean. This metric is useful because it expresses the relative dispersion of the results, making them comparable across benchmarks.
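As a quick illustration, here is how the coefficient of variation is computed for a single benchmark; the timing values below are made up for the example:

```python
import numpy as np

# Hypothetical wall-clock timings (in seconds) of the same benchmark
# across several CI runs of the suite.
timings = np.array([1.52, 1.48, 1.55, 1.50, 1.47, 1.58])

# Coefficient of variation: relative dispersion, comparable across benchmarks.
cv = timings.std(ddof=1) / timings.mean()
print(f"CV = {cv:.2%}")
```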
Each run is executed on a different machine, simulating real CI conditions. Within each run, each result is the outcome of multiple executions of the same benchmark, handled by the benchmarking framework in use.
This approach ensures that the measurements capture both machine-to-machine variability across runs and the stability of repeated executions within a single run.
Here are the results on GitHub Actions hosted runners after 100 runs for each benchmark suite:
Overall, the coefficient of variation is 2.66% on GitHub-hosted runners. Well... it might seem okay, but let's see what this number implies for incorrectly detected performance changes (both regressions and improvements).
Let's say we want a 2% performance gate to catch small regressions in each benchmark. Assuming the results follow a normal distribution, we can estimate the chance of observing a false positive.
Starting with $CV$, the coefficient of variation, and $\mu$, the mean, we can model the false positive rate for different performance gate thresholds.

Model one run as a random variable: $X \sim \mathcal{N}(\mu, \sigma^2)$, with $\sigma = CV \cdot \mu$ the standard deviation.

Define $t$, the performance gate threshold.

We observe a false positive (the gate flags a regression or improvement) when the run time drifts more than $t\mu$ from the mean:

$$|X - \mu| > t\mu$$

Standardise to a Z-score. Divide both sides by $\sigma = CV \cdot \mu$ to turn it into a standard normal problem:

$$\left|\frac{X - \mu}{\sigma}\right| > \frac{t}{CV}$$

Let $Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$. The event becomes $|Z| > \frac{t}{CV}$.

Compute the probability. Because the standard normal is symmetric:

$$P(\text{false positive}) = P\left(|Z| > \frac{t}{CV}\right) = 2\left(1 - \Phi\left(\frac{t}{CV}\right)\right)$$

where $\Phi$ is the standard normal CDF.
A coefficient of variation of 2.66% gives us a 45% chance of a false positive with this 2% performance gate. About 1 out of 2 runs would be a false positive. In an active development pipeline, this level of noise makes the results completely unreliable, turning performance checks into distractions rather than signals, and ultimately eroding trust in performance testing.
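As a sanity check, here is a minimal sketch of this formula in Python, using `scipy.stats.norm` and the numbers measured above:

```python
from scipy.stats import norm

cv = 0.0266  # coefficient of variation measured on GitHub-hosted runners
t = 0.02     # 2% performance gate threshold

# P(false positive) = P(|Z| > t / CV) = 2 * (1 - Phi(t / CV))
false_positive_rate = 2 * (1 - norm.cdf(t / cv))
print(f"{false_positive_rate:.1%}")  # ~45.2%
```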
To determine a more consistent performance gate threshold for the GitHub runners, we can plot the false positive rate for different values of the threshold:
Here, to guarantee a 1% false positive rate we'd have to use a 7% performance gate. That's not convincing, since any smaller performance change will go unnoticed. Moreover, regressions can compound pretty fast: every time a regression is merged, the baseline shifts higher, allowing further regressions to slip through.
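The same formula can be inverted to find the smallest gate threshold that stays below a target false positive rate; a quick sketch, reusing the CV measured above:

```python
from scipy.stats import norm

cv = 0.0266       # coefficient of variation on GitHub-hosted runners
target_fp = 0.01  # acceptable false positive rate: 1%

# Invert 2 * (1 - Phi(t / CV)) = target_fp  =>  t = CV * Phi^-1(1 - target_fp / 2)
threshold = cv * norm.ppf(1 - target_fp / 2)
print(f"{threshold:.1%}")  # ~6.9%, i.e. roughly the 7% gate mentioned above
```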
We've been working on isolating CI runners to eliminate the kind of noise that makes performance testing unreliable, while keeping the costs reasonable. This work spans both infrastructure and OS-level optimizations. Our Macro Runners run on bare-metal instances in the cloud, with additional stability configurations specifically designed for high-precision benchmarking.
We'll cover the technical details in an upcoming blog post (linked here once published), but in short: you can run your existing tests on this infrastructure with no changes to your benchmarks, and minimal changes in your CI workflow.
We reran the same benchmark suites on these runners, again, 100 times each. The improvement in variance was immediate:
Overall, we measure an average coefficient of variation of 0.56%, nearly 5 times lower than on GitHub-hosted runners.
Now, with the same 2% performance gate, we get a 0.04% chance (about 1 in 2,500 runs) of a false alarm, instead of the 45% we measured on GitHub-hosted runners before.
Consequently, this shifts the false positive curve dramatically:
Now, we can reach a sub-1% false positive rate with a 1.5% performance gate, catching finer-grained regressions without overwhelming contributors with false alarms!
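Plugging the new coefficient of variation into the same formula confirms both numbers:

```python
from scipy.stats import norm

cv = 0.0056  # coefficient of variation measured on CodSpeed Macro Runners

for t in (0.02, 0.015):  # 2% and 1.5% performance gates
    fp = 2 * (1 - norm.cdf(t / cv))
    print(f"gate {t:.1%} -> false positive rate {fp:.2%}")
# gate 2.0% -> false positive rate 0.04%
# gate 1.5% -> false positive rate 0.74%
```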
The configuration changes required to run on CodSpeed Macro Runners are minimal in a GitHub Actions workflow, boiling down to changing the `runs-on` field:
```diff
jobs:
  benchmarks:
-   runs-on: ubuntu-latest
+   runs-on: codspeed-macro
    steps:
      - uses: actions/checkout@v4
      # ...
-     - run: <Benchmark command>
+     - uses: CodSpeedHQ/action@v3
+       with:
+         run: <Benchmark command>
```
Wrapping the benchmark command with the CodSpeed action allows collecting and uploading the performance results, so you can configure performance gates and see the detailed results in your CodSpeed dashboard:
Example result from uv (check it out here)
If you've tried building continuous benchmarking in CI before and felt like the noise made it impossible, you're not alone: most CI environments just weren't built for this kind of precision.
CodSpeed Macro Runners give you a clean foundation to actually rely on your performance data. Less noise means fewer false alarms, and fewer false alarms means you can finally trust that small regressions won't slip through unnoticed. And better performance overall means happier users, lower costs, tighter developer feedback loops, and less time spent debugging post-deploy slowdowns.
We'll soon publish a deep dive to share how we built this infrastructure, from choosing the hardware to the stability tweaks we had to layer on top of the OS, and why these details matter so much when you're chasing < 1% regressions.
In the meantime, feel free to explore the live examples we presented here, check out the docs, or reach out if you want help setting things up.
Are Benchmarks From Cloud CI Services Reliable? by the author of Criterion.rs
All the benchmark suites are taken from projects that already use CodSpeed to run their benchmarks:
| | GitHub-hosted runners | CodSpeed Macro runners |
|---|---|---|
| Image | Ubuntu 24.04 GitHub Image | Ubuntu 24.04 with additional stability fine-tuning |
| CPU Architecture | x86-64 | AArch64 |
| CPU | AMD EPYC 7763 64-Core Processor | AWS Graviton (Cortex-A72) |
| Allocated CPUs | 4 vCPUs | All 16 CPUs (bare-metal) |
| RAM | 16 GB | 32 GB |
| Cost | $0.008 / min | $0.032 / min |