Explore and compare different Python benchmarking approaches—from command-line tools to integrated test frameworks—to find the best fit for your workflow.
Performance isn’t just about making your code work—it’s about making it work
well. When you’re building Python applications, understanding exactly how fast
your code runs becomes the difference between software that feels responsive and
software that makes users reach for the coffee while they wait.

Benchmarking Python code isn’t rocket science, but it does require the right
tools and techniques. Let’s explore four powerful approaches that will transform
you from someone who hopes their code is fast to someone who knows exactly how
fast it is!
If you have a Python performance question, try asking it to
p99.chat, the assistant for code performance optimization.
It can run, measure, and optimize any given code!
time is only available on UNIX-based systems, so if you’re working with
Windows, you can skip this first step.
Sometimes the simplest tools are the most revealing. The Unix time command
gives you a bird’s-eye view of your script’s performance, measuring everything
from CPU usage to memory consumption.

Let’s start with a practical example. Create a script that demonstrates
different algorithmic approaches:
bench.py
import sys
import random


def bubble_sort(arr):
    """Inefficient but educational sorting algorithm"""
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr


def quick_sort(arr):
    """More efficient divide-and-conquer approach"""
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)


if __name__ == "__main__":
    # Generate test data
    size = int(sys.argv[1]) if len(sys.argv) > 1 else 1000
    data = [random.randint(1, 1000) for _ in range(size)]

    # Pick the algorithm from command line arguments
    algorithm = sys.argv[2] if len(sys.argv) > 2 else "bubble"
    if algorithm == "bubble":
        result = bubble_sort(data.copy())
    else:
        result = quick_sort(data.copy())

    print(f"Sorted {len(result)} elements using {algorithm} sort")
Now let’s see the power of the time command in action:
$ time python3 bench.py 5000 bubble
Sorted 5000 elements using bubble sort
python3 bench.py 5000 bubble  0.89s user 0.01s system 99% cpu 0.910 total

$ time python3 bench.py 5000 quick
Sorted 5000 elements using quick sort
python3 bench.py 5000 quick  0.03s user 0.01s system 75% cpu 0.049 total
Look at that dramatic difference! Bubble sort consumed 0.89 seconds of CPU time
while quicksort finished in just 0.03 seconds—nearly 30x faster. The 99% CPU
utilization for bubble sort shows it’s working hard but inefficiently, while
quicksort’s lower CPU percentage reflects its brief execution time.

Different systems format the output slightly differently. On Linux systems, for
example, you might see:
$ time python bench.py 5000 bubble
Sorted 5000 elements using bubble sort

real    0m0.825s
user    0m0.806s
sys     0m0.022s

$ time python bench.py 5000 quick
Sorted 5000 elements using quick sort

real    0m0.091s
user    0m0.054s
sys     0m0.040s
The time command reveals three crucial metrics:
- Real time (real or total): wall-clock time from start to finish
- User time (user): CPU time spent in user mode (your Python code executing: loops, calculations, memory operations)
- System time (sys or system): CPU time spent in kernel mode (system calls, file I/O, memory allocation from the OS)
This approach is perfect when you want to understand your script’s overall
resource consumption, including startup overhead and system interactions.
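Note that the default output above covers timing only. If you also want the memory side of the story, GNU time (the standalone /usr/bin/time binary found on most Linux distributions, not the shell built-in) has a verbose mode that adds memory statistics. A quick sketch, with the machine-specific values omitted:

$ /usr/bin/time -v python3 bench.py 5000 quick
Sorted 5000 elements using quick sort
        Command being timed: "python3 bench.py 5000 quick"
        User time (seconds): ...
        System time (seconds): ...
        Maximum resident set size (kbytes): ...
        ...

On macOS, the bundled BSD time gives a similar summary with /usr/bin/time -l.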
While time gives you the basics, hyperfine transforms benchmarking into a
science. It runs multiple iterations, provides statistical analysis, and even
generates beautiful comparison charts.

After having installed hyperfine, you can get started pretty quickly:
$ hyperfine "python bench.py 5000 quick"
Benchmark 1: python bench.py 5000 quick
  Time (mean ± σ):      19.9 ms ±   2.4 ms    [User: 14.0 ms, System: 4.0 ms]
  Range (min … max):    17.8 ms …  36.5 ms    74 runs
Instead of running only once, hyperfine automatically ran your code 74 times and
calculated meaningful statistics. That ± 2.4 ms standard deviation tells you how
consistent your performance is, crucial information that a single time run
can’t provide.

You can also compare commands with hyperfine:
$ hyperfine "python bench.py 5000 quick" "python bench.py 5000 bubble"Benchmark 1: python bench.py 5000 quick Time (mean ± σ): 28.1 ms ± 2.0 ms [User: 21.6 ms, System: 4.5 ms] Range (min … max): 26.5 ms … 40.4 ms 68 runsBenchmark 2: python bench.py 5000 bubble Time (mean ± σ): 917.0 ms ± 24.6 ms [User: 895.4 ms, System: 8.3 ms] Range (min … max): 899.3 ms … 969.3 ms 10 runsSummary python bench.py 5000 quick ran 32.59 ± 2.52 times faster than python bench.py 5000 bubble
Now we’re talking! Hyperfine not only confirms our 30x performance difference
but quantifies the uncertainty in that measurement. The “± 2.52” tells us the
speedup could range from about 30x to 35x.
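Hyperfine has a few more tricks worth knowing once you go beyond one-off comparisons. As a sketch (the results.md file name is just an example), its --warmup, --parameter-scan, and --export-markdown flags let you warm up caches before measuring, benchmark several input sizes in one invocation, and save the results as a shareable table:

$ hyperfine --warmup 3 \
    --parameter-scan size 1000 5000 -D 2000 \
    --export-markdown results.md \
    "python bench.py {size} quick"

Here {size} is substituted with 1000, 3000, and 5000 (the -D 2000 step size), so a single run shows how the script scales with input size.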
When you need to focus on specific functions rather than entire scripts,
Python’s built-in timeit module becomes your microscope. It’s designed to
minimize timing overhead and provide accurate measurements of small code
snippets.

Here is an example measuring the functions we previously created:
timeit_bench.py
import random
import timeit

from bench import bubble_sort, quick_sort

# Generate test data
data = [random.randint(1, 1000) for _ in range(5000)]

bubble_time = timeit.timeit(
    lambda: bubble_sort(data.copy()),
    number=10,
)
quick_time = timeit.timeit(
    lambda: quick_sort(data.copy()),
    number=10,
)

# Average time per call for each algorithm
print(f"bubble_sort: {bubble_time / 10:.4f}s per call")
print(f"quick_sort: {quick_time / 10:.4f}s per call")
The conclusion remains the same—quicksort dramatically outperforms bubble
sort—but notice something interesting about these numbers. Both measurements are
significantly smaller than our earlier script-level benchmarks. We’ve eliminated
the noise of Python interpreter startup, module imports, and command-line
argument parsing. Now we’re measuring pure algorithmic performance, which gives
us a clearer picture of what’s happening inside our functions.
Notice how we use lambda functions to wrap our calls—this approach is
cleaner than string-based timing and provides better IDE support. The
data.copy() call ensures each iteration works with fresh data, preventing
any side effects from skewing our results.
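If you also want a sense of run-to-run variation, the kind of spread hyperfine reported earlier, the standard library offers timeit.repeat, which performs several independent measurements. A minimal sketch reusing the same functions:

import random
import timeit

from bench import quick_sort

data = [random.randint(1, 1000) for _ in range(5000)]

# Five independent measurements, each timing 10 calls
runs = timeit.repeat(lambda: quick_sort(data.copy()), repeat=5, number=10)
print(f"best: {min(runs):.4f}s, worst: {max(runs):.4f}s (per 10 calls)")

The timeit documentation suggests focusing on the minimum of the repeats, since the higher values usually reflect interference from other processes rather than your code.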
The beauty of timeit lies in its surgical precision. While our previous tools
measured entire script execution, timeit isolates the exact performance
characteristics of individual functions. This granular approach becomes
invaluable when you’re optimizing specific bottlenecks rather than entire
applications.
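The last tool turns these measurements into benchmark tests that live alongside your regular test suite: pytest-codspeed. Here is a minimal sketch of such a test file for our two sorting functions (the test_sorting.py name and the fixed seed are assumptions, not part of the original script); the plugin’s @pytest.mark.benchmark marker tells it which tests to measure:

test_sorting.py

import random

import pytest

from bench import bubble_sort, quick_sort

random.seed(42)  # keep the input identical across benchmark runs
DATA = [random.randint(1, 1000) for _ in range(5000)]


@pytest.mark.benchmark
def test_bubble_sort():
    bubble_sort(DATA.copy())


@pytest.mark.benchmark
def test_quick_sort():
    quick_sort(DATA.copy())

After installing the plugin with pip install pytest-codspeed, running pytest --codspeed executes the marked tests as benchmarks and reports timing statistics for each.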
Now we’re seeing the true power of statistical benchmarking! Look at those
numbers—bubble sort clocked in at 448 milliseconds while quicksort blazed
through in just 194 microseconds. That’s a staggering 2,300x performance
difference.

Notice how pytest-codspeed automatically determined the optimal number of
iterations: 6 runs for the slow bubble sort versus 1,005 runs for the
lightning-fast quicksort. This intelligent adaptation ensures statistical
significance regardless of your algorithm’s performance characteristics.

Learn more about the plugin in
the pytest-codspeed reference.
What makes this approach transformative isn’t just the numbers—it’s how easily
it integrates into your existing workflow. You’ve just created the foundation
for a performance monitoring system that can run locally during development and
automatically in CI/CD pipelines.

This is the first step toward performance-conscious development. While you can
now validate performance locally, the real power emerges when you integrate
these benchmarks into your continuous integration pipeline. Every pull request
becomes a performance checkpoint, every deployment includes performance
validation, and performance regressions are caught before they reach production.

The CodSpeed ecosystem makes this transition seamless—from local development to
continuous testing in just a few configuration steps. Check out this guide:
Each tool serves a specific purpose in your performance toolkit:
- Use the time command when you need a quick sanity check of overall script performance or want to understand system resource usage. It’s perfect for comparing different implementations at the application level.
- Choose hyperfine when you need statistical rigor for command-line tools or want to track performance across different input parameters. Its warmup runs and statistical analysis make it ideal for detecting small performance changes.
- Reach for timeit when you’re optimizing specific functions or comparing different algorithmic approaches. Its focus on eliminating timing overhead makes it perfect for micro-benchmarks.
- Implement pytest-codspeed when performance becomes a first-class concern in your development process. It transforms performance testing from an afterthought into an integral part of your test suite.