
Why Rust doesn't need a standard div_rem: An LLVM tale

Posted on October 9th, 2023 by Arthur Pastel

During the last Advent of Code, while getting started with Rust, I discovered that the Rust standard library was missing the usual div_rem (a.k.a. divmod) function that would simultaneously return the quotient and remainder. I was confused, since I had already used the equivalent in C++ with std::div and in Python with divmod. It seemed like a basic way of improving performance, since both values can come out of the same assembly instruction, nicely avoiding extra CPU cycles.

Looking it up, I found an existing RFC created in 2017, but it hasn't received much attention since.

So, do we need a div_rem implementation in the standard library? I decided to investigate.

A naive approach

Let's naively implement the function:
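A minimal sketch of such a naive implementation (the function name and u64 operand types are my assumptions):

```rust
/// Naive div_rem: two separate operations that the compiler
/// is expected to fuse into a single division instruction.
fn div_rem(a: u64, b: u64) -> (u64, u64) {
    (a / b, a % b)
}

fn main() {
    // 7 = 2 * 3 + 1
    assert_eq!(div_rem(7, 3), (2, 1));
}
```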

To dissect what is actually executed, let's analyze the compilation steps.

[Figure: The compilation chain of our Rust program]

First, let's have a look at the LLVM Intermediate Representation (IR). If you're not familiar with it, LLVM IR is an extremely low-level programming language, close to assembly but without being tied to specific hardware. It's extremely useful for compiler writers since it's a common language that can be used to target multiple architectures. It's also a great tool to understand what the compiler is doing and how it "interprets" our code.

For the previous code, here is the annotated LLVM IR we can obtain:
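Roughly, the unoptimized IR for a u64 div_rem looks like this (a simplified, hand-annotated sketch; value and label names are my assumptions):

```llvm
define { i64, i64 } @div_rem(i64 %a, i64 %b) {
start:
  ; zero check guarding the division
  %div_is_zero = icmp eq i64 %b, 0
  br i1 %div_is_zero, label %panic, label %bb1

bb1:
  %quotient = udiv i64 %a, %b
  ; a second, redundant zero check guarding the remainder
  %rem_is_zero = icmp eq i64 %b, 0
  br i1 %rem_is_zero, label %panic, label %bb2

bb2:
  %remainder = urem i64 %a, %b
  ; pack the pair into the returned aggregate
  %pair.0 = insertvalue { i64, i64 } undef, i64 %quotient, 0
  %pair.1 = insertvalue { i64, i64 } %pair.0, i64 %remainder, 1
  ret { i64, i64 } %pair.1

panic:
  ; calls the "attempt to divide by zero" panic handler (simplified)
  unreachable
}
```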

We can see that this code contains redundant checks for division and modulo by 0. We can also note two separate operations to get our quotient and remainder: udiv and urem. This is expected at this stage, since LLVM doesn't have an instruction combining them, but hopefully the compiler will be able to optimize this while generating the assembly.

Let's have a look at the next compilation step, so we can dive into the assembly code generated from the IR (with optimizations enabled):
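On x86-64, the optimized output looks roughly like this (a sketch; labels are my assumptions):

```asm
div_rem:
        test    rsi, rsi        ; is the divisor zero?
        je      .LBB0_panic
        mov     rax, rdi        ; dividend into rax
        xor     edx, edx        ; clear rdx (high half of the dividend)
        div     rsi             ; rax = quotient, rdx = remainder
        ret
.LBB0_panic:
        ; jump to the divide-by-zero panic handler
```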

And voilà! We can see that the compiler was able to optimize our code and fuse the two operations into a single div instruction. So having a dedicated div_rem implementation wouldn't be that helpful for performance when using the standard operators, since this optimization should almost always be done by the IR compiler.

You might have noticed quite a few things in the assembly code; let's have a look at them by creating our own assembly version.

An inline assembly version
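Here's what such an inline assembly version might look like (a sketch, x86-64 only; the function name and exact operand layout are my assumptions):

```rust
use std::arch::asm;

/// x86-64 only: wraps the `div` instruction directly.
/// Undefined behavior if b == 0 (the CPU raises a divide fault).
fn div_rem_asm(a: u64, b: u64) -> (u64, u64) {
    let quotient: u64;
    let remainder: u64;
    unsafe {
        asm!(
            "div {divisor}",
            divisor = in(reg) b,
            // rax holds the dividend on input and the quotient on output
            inout("rax") a => quotient,
            // rdx must be zeroed (high half of the dividend); remainder on output
            inout("rdx") 0u64 => remainder,
            options(pure, nomem, nostack),
        );
    }
    (quotient, remainder)
}

fn main() {
    assert_eq!(div_rem_asm(7, 3), (2, 1));
}
```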

A few things to note in this implementation:

  • The div instruction divides the rdx:rax register pair by its operand value. It will store the quotient in rax and the remainder in rdx.
  • rax holds the dividend before our assembly is executed. It will contain the quotient after. This is because its value is tied to the rax register, which is used both for input and output thanks to the inout operand.
  • rdx is also passed with an inout operand, allowing us to clear the register before the operation is done, since it is set to 0 before the assembly is executed. At the end, the rdx register will hold the remainder value that we return in the output tuple.
  • The divisor is passed through any general-purpose register as a simple in(reg) input value that we pass directly to the div instruction.
  • Finally, we pass a few flags that help the compiler know how our inline assembly behaves, using options: our block is pure (no side effects), nomem (we're neither reading nor writing memory) and nostack (we don't push anything to the stack).

Compiling this, we can see that it ends up a bit more concise than the assembly we obtained previously:
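Roughly, the compiled inline-asm version comes out as (a sketch; the label is my assumption):

```asm
div_rem_asm:
        mov     rax, rdi        ; dividend into rax
        xor     edx, edx        ; clear rdx
        div     rsi             ; rax = quotient, rdx = remainder
        ret                     ; the pair is returned in rax:rdx
```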

It should be faster, but we're cheating a bit on Rust here, since we're not taking into account the panic check that was done in the previous version.

Switching to an inline assembly version: 32.7 ns → 30.5 ns (measured with CodSpeed)

It's faster! Still, the overhead of those panic checks is not that dramatic in the end. Branch prediction probably contributes a lot to this: it was elided in the first IR snippet, but in the full IR the Rust compiler hints to the rest of the pipeline that those checks are likely to succeed. The compiler's ability to optimize, and to hint that these checks are more likely to pass than fail, can lead to surprisingly efficient code, even when it appears to be doing "much more" than the assembly version.

The cost of panic handling: an unchecked version

Now that we've built our unchecked version in assembly, we can get quite close to it by using unchecked_div and unchecked_rem from std::intrinsics:
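The intrinsics are nightly-only; as a stable-Rust sketch of the same idea, std::hint::unreachable_unchecked tells the compiler the divisor can never be zero, which removes the panic path (the function name is my assumption):

```rust
/// Equivalent of the unchecked intrinsics on stable Rust:
/// the hint lets the optimizer delete the zero check entirely.
fn div_rem_unchecked(a: u64, b: u64) -> (u64, u64) {
    // SAFETY: the caller must guarantee that b != 0
    if b == 0 {
        unsafe { std::hint::unreachable_unchecked() }
    }
    (a / b, a % b)
}

fn main() {
    assert_eq!(div_rem_unchecked(7, 3), (2, 1));
}
```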

And its intermediate representation:
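A simplified sketch of what that IR looks like (value names are my assumptions):

```llvm
define { i64, i64 } @div_rem_unchecked(i64 %a, i64 %b) {
start:
  ; no zero checks: behavior is undefined if %b is 0
  %quotient = udiv i64 %a, %b
  %remainder = urem i64 %a, %b
  %pair.0 = insertvalue { i64, i64 } undef, i64 %quotient, 0
  %pair.1 = insertvalue { i64, i64 } %pair.0, i64 %remainder, 1
  ret { i64, i64 } %pair.1
}
```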

As you can see, this IR doesn't contain any division-by-zero checks, and the performance should be close to, or even better than, our assembly version.

Switching to the unchecked version: 30.5 ns → 30.5 ns (measured with CodSpeed)

That's exactly the same time our assembly version takes. So let's have a look at the assembly:
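On x86-64, the unchecked build should come out essentially identical to the hand-written one (a sketch; the label is my assumption):

```asm
div_rem_unchecked:
        mov     rax, rdi        ; dividend into rax
        xor     edx, edx        ; clear rdx
        div     rsi             ; rax = quotient, rdx = remainder
        ret
```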

And it's the same as our assembly version! So writing our own assembly version is not really worth the bother in the end.

To conclude, there is no performance drawback to not having a single div_rem function, since the compiler optimizations will take care of this burden for us. It might seem odd to write those two operations separately, but except in very specific optimization-killing cases, they will be fused into a single assembly instruction.

If you enjoyed this article, you can follow us on Twitter to get notified when we publish new articles.


Copyright © 2023 CodSpeed Technology SAS. All rights reserved.