How to Benchmark Python Code?
Master essential benchmarking techniques, from basic timing to production-ready performance testing.
Performance isn’t just about making your code work—it’s about making it work well. When you’re building Python applications, understanding exactly how fast your code runs becomes the difference between software that feels responsive and software that makes users reach for the coffee while they wait.
Benchmarking Python code isn’t rocket science, but it does require the right tools and techniques. Let’s explore four powerful approaches that will transform you from someone who hopes their code is fast to someone who knows exactly how fast it is!
If you have a Python performance question, try asking it to p99.chat, the assistant for code performance optimization. It can run, measure, and optimize any given code!
Starting with the `time` command
`time` is only available on UNIX-based systems, so if you’re working with Windows, you can skip this first step.
Sometimes the simplest tools are the most revealing. The Unix `time` command gives you a bird’s-eye view of your script’s performance, measuring everything from CPU usage to memory consumption.
Let’s start with a practical example. Create a script that demonstrates different algorithmic approaches:
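Here is a minimal sketch of what such a script could look like; the file name `sort_compare.py`, the input size, and the command-line handling are illustrative assumptions rather than the exact original listing:

```python
# sort_compare.py -- illustrative sketch; file name, sizes, and CLI handling are assumptions
import random
import sys


def bubble_sort(data):
    """O(n^2): repeatedly swap adjacent out-of-order elements."""
    items = data.copy()
    n = len(items)
    for i in range(n):
        for j in range(n - i - 1):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items


def quick_sort(data):
    """O(n log n) on average: divide and conquer around a pivot."""
    if len(data) <= 1:
        return list(data)
    pivot = data[len(data) // 2]
    left = [x for x in data if x < pivot]
    middle = [x for x in data if x == pivot]
    right = [x for x in data if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)


if __name__ == "__main__":
    algorithm = sys.argv[1] if len(sys.argv) > 1 else "quick"
    data = [random.randint(0, 10_000) for _ in range(5_000)]
    sorter = bubble_sort if algorithm == "bubble" else quick_sort
    sorter(data)
```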
Now let’s see the power of the `time` command in action:
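Assuming the sketch above is saved as `sort_compare.py`, the invocations could look like this (the exact output layout depends on your shell and OS):

```bash
time python sort_compare.py bubble
time python sort_compare.py quick
```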
Look at that dramatic difference! Bubble sort consumed 0.89 seconds of CPU time while quicksort finished in just 0.03 seconds—nearly 30x faster. The 99% CPU utilization for bubble sort shows it’s working hard but inefficiently, while quicksort’s lower CPU percentage reflects its brief execution time.
Different systems format the output slightly differently. On Linux, for example, you might see:
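With bash’s built-in `time`, the report typically takes this three-line shape (the values below are placeholders, not real measurements):

```
real    0m0.920s
user    0m0.890s
sys     0m0.020s
```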
The `time` command reveals three crucial metrics:
- Real time (`real` or `total`): Wall-clock time from start to finish
- User time (`user`): CPU time spent in user mode (your Python code executing: loops, calculations, memory operations)
- System time (`sys` or `system`): CPU time spent in kernel mode (system calls, file I/O, memory allocation from the OS)
This approach is perfect when you want to understand your script’s overall resource consumption, including startup overhead and system interactions.
Precision Benchmarking with hyperfine
While `time` gives you the basics, hyperfine transforms benchmarking into a science. It runs multiple iterations, provides statistical analysis, and even generates beautiful comparison charts.
Once you’ve installed hyperfine, you can get started pretty quickly:
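A single-command run might look like this (the script name carries over from the earlier sketch):

```bash
hyperfine 'python sort_compare.py quick'
```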
Instead of running only once, hyperfine automatically ran your code 74 times and calculated meaningful statistics. That ± 2.4 ms standard deviation tells you how consistent your performance is, crucial information that a single `time` run can’t provide.
You can also compare commands with hyperfine:
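Passing several commands makes hyperfine benchmark each one and print a relative summary; a warmup pass helps stabilize the numbers:

```bash
hyperfine --warmup 3 \
  'python sort_compare.py bubble' \
  'python sort_compare.py quick'
```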
Now we’re talking! Hyperfine not only confirms our 30x performance difference but quantifies the uncertainty in that measurement. The “± 2.52” tells us the speedup could range from about 30x to 35x.
Function-Level Precision with timeit
When you need to focus on specific functions rather than entire scripts, Python’s built-in `timeit` module becomes your microscope. It’s designed to minimize timing overhead and provide accurate measurements of small code snippets.
Here is an example measuring the functions we previously created:
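A sketch of such a measurement, with module and variable names assumed from the earlier example, could be:

```python
import random
import timeit

# Module name assumed from the earlier sketch
from sort_compare import bubble_sort, quick_sort

data = [random.randint(0, 10_000) for _ in range(1_000)]

# number=10: call each function 10 times and report the total elapsed seconds
bubble_time = timeit.timeit(lambda: bubble_sort(data.copy()), number=10)
quick_time = timeit.timeit(lambda: quick_sort(data.copy()), number=10)

print(f"bubble_sort: {bubble_time:.4f}s for 10 runs")
print(f"quick_sort:  {quick_time:.4f}s for 10 runs")
```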
The conclusion remains the same—quicksort dramatically outperforms bubble sort—but notice something interesting about these numbers. Both measurements are significantly smaller than our earlier script-level benchmarks. We’ve eliminated the noise of Python interpreter startup, module imports, and command-line argument parsing. Now we’re measuring pure algorithmic performance, which gives us a clearer picture of what’s happening inside our functions.
Notice how we use `lambda` functions to wrap our calls; this approach is cleaner than string-based timing and provides better IDE support. The `data.copy()` call ensures each iteration works with fresh data, preventing any side effects from skewing our results.
The beauty of `timeit` lies in its surgical precision. While our previous tools measured entire script execution, `timeit` isolates the exact performance characteristics of individual functions. This granular approach becomes invaluable when you’re optimizing specific bottlenecks rather than entire applications.
Create Benchmarks from Existing Test Suites
First, let’s create proper tests for our sorting functions. The first step is to install the testing library, `pytest`.
We recommend using uv to create the project: simply run `uv init` and it will turn your directory into a Python project.
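In a uv-managed project, the install could be as simple as this (plain `pip install pytest` works just as well):

```bash
uv add --dev pytest
```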
Then we can create some tests:
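A minimal test module, assuming the functions live in `sort_compare.py`, might look like this:

```python
# test_sorting.py -- file and module names are assumptions
import random

from sort_compare import bubble_sort, quick_sort


def test_bubble_sort():
    data = [random.randint(0, 1_000) for _ in range(500)]
    assert bubble_sort(data) == sorted(data)


def test_quick_sort():
    data = [random.randint(0, 1_000) for _ in range(500)]
    assert quick_sort(data) == sorted(data)
```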
Now let’s run those tests:
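With uv, that could be:

```bash
uv run pytest
```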
Great! Your tests validate that both sorting algorithms produce correct results.
Turning the test cases into benchmarks
Now comes the real magic. Install `pytest-codspeed` and transform these correctness tests into performance benchmarks with minimal changes:
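Again with uv (or `pip install pytest-codspeed`):

```bash
uv add --dev pytest-codspeed
```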
Update your tests, adding the `benchmark` fixture as a parameter and using it to wrap the execution of the sort algorithm:
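Assuming the test module from above, the benchmarked version could look like this; the `data.copy()` calls keep each measured run working on fresh input:

```python
# test_sorting.py -- same tests, now wrapped with the benchmark fixture
import random

from sort_compare import bubble_sort, quick_sort


def test_bubble_sort(benchmark):
    data = [random.randint(0, 1_000) for _ in range(500)]
    # The fixture calls the wrapped function repeatedly and records timings
    benchmark(lambda: bubble_sort(data.copy()))


def test_quick_sort(benchmark):
    data = [random.randint(0, 1_000) for _ in range(500)]
    benchmark(lambda: quick_sort(data.copy()))
```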
Finally, let’s burn the CPU for a bit:
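Running the suite with the CodSpeed plugin enabled could look like:

```bash
uv run pytest --codspeed
```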
Now we’re seeing the true power of statistical benchmarking! Look at those numbers—bubble sort clocked in at 448 milliseconds while quicksort blazed through in just 194 microseconds. That’s a staggering 2,300x performance difference.
Notice how `pytest-codspeed` automatically determined the optimal number of iterations: 6 runs for the slow bubble sort versus 1,005 runs for the lightning-fast quicksort. This intelligent adaptation ensures statistical significance regardless of your algorithm’s performance characteristics.
The Foundation for Continuous Performance Testing
What makes this approach transformative isn’t just the numbers—it’s how easily it integrates into your existing workflow. You’ve just created the foundation for a performance monitoring system that can run locally during development and automatically in CI/CD pipelines.
This is the first step toward performance-conscious development. While you can now validate performance locally, the real power emerges when you integrate these benchmarks into your continuous integration pipeline. Every pull request becomes a performance checkpoint, every deployment includes performance validation, and performance regressions are caught before they reach production.
The CodSpeed ecosystem makes this transition seamless—from local development to continuous testing in just a few configuration steps. Check out this guide:
Running Python Benchmarks in your CI: a more advanced resource on continuous performance testing in Python.
Choosing Your Benchmarking Strategy
Each tool serves a specific purpose in your performance toolkit:
- Use the `time` command when you need a quick sanity check of overall script performance or want to understand system resource usage. It’s perfect for comparing different implementations at the application level.
- Choose `hyperfine` when you need statistical rigor for command-line tools or want to track performance across different input parameters. Its warmup runs and statistical analysis make it ideal for detecting small performance changes.
- Reach for `timeit` when you’re optimizing specific functions or comparing different algorithmic approaches. Its focus on eliminating timing overhead makes it perfect for micro-benchmarks.
- Implement `pytest-codspeed` when performance becomes a first-class concern in your development process. It transforms performance testing from an afterthought into an integral part of your test suite.