During the last advent of code, while starting with Rust, I discovered that the Rust standard library was missing the usual div_mod
(a.k.a. div_rem
) function that would simultaneously return the quotient and remainder. I got confused since I already used the operator in C++ with std::div or Python with divmod. It seemed a basic way of improving performance since both values would come out with the same DIV
assembly instruction, nicely avoiding extra CPU cycles.
Looking it up, I found an existing RFC created in 2017, but it hasn't received much attention since.
So, do we need a div_rem
implementation in the standard library? I decided to investigate.
Let's naively implement the function:
pub fn naive_div_rem(a: u32, b: u32) -> (u32, u32) { (a / b, a % b) }
To dissect what is executed we're gonna analyze the compilation steps.
First, let's have a look at the LLVM Intermediate Representation(IR). If you're not familiar with it, is an extremely low-level programming language close to assembly but without being tied to specific hardware. It's extremely useful for compiler writers since it's a common language that can be used to target multiple architectures. It's also a great tool to understand what the compiler is doing and how it "interpret" our code.
For the previous code, here is the annotated LLVM IR we can obtain:
; Define a function called 'example::naive_div_rem define { i32, i32 } @example::naive_div_rem(i32 %a, i32 %b) unnamed_addr { start: %_0 = alloca { i32, i32 }, align 4; Allocate memory for our return tuple %_4 = icmp eq i32 %b, 0; Check if divisor %b is zero br i1 %_4, label %panic, label %bb1; If %b is zero, panic, else continue bb1: %_3 = udiv i32 %a, %b; Perform unsigned division(only get the quotient) %_6 = icmp eq i32 %b, 0; Check again for the divisor being zero (for the modulo) br i1 %_6, label %panic1, label %bb2; If %b is zero, panic, else continue bb2: %_5 = urem i32 %a, %b; Compute the unsigned remainder ; ... ; Stores quotient and the remainder in a tuple, load them back, ; and constructs a return tuple with the values. ; ... ret { i32, i32 } %8 ; ... omitted panic handlers ... }
We can witness that this code contains redundant checks for the division and modulo by 0. Also, we can note two operations to get our quotient and remainder: udiv
and
urem
. This is expected at this stage since LLVM doesn't have an instruction combining
them but hopefully, the compiler will be able to optimize this while generating the assembly.
Let's have a look at the next compilation step so we can dive into the assembly code generated from the IR(x86_64
with optimizations enabled):
example::naive_div_rem: test esi, esi ; Test if esi (divisor) is zero je .LBB0_2 ; Jump to panic block if esi is zero mov eax, edi ; Move the dividend into eax xor edx, edx ; Zero out edx to prepare for the division div esi ; Unsigned division, quotient goes into eax, remainder into edx ret .LBB0_2: ; ... "attempt to divide by zero" panic handler ...
And voila! We can see that the compiler was able to optimize our code and fused the two
operations into a single div
instruction. So having a dedicated implementation
wouldn't be that helpful for performance while using rustc
since this optimization
should almost always be done by the IR compiler.
You might have noticed quite a few things in the assembly code, let's have a look at them by creating our own assembly version.
use std::arch::asm; pub fn asm_div_rem(a: u32, b: u32) -> (u32, u32) { let mut tmp: u32 = a; let mut remainder: u32 = 0; unsafe { asm!( "div {divisor}", inout("eax") tmp, inout("edx") remainder, divisor = in(reg) b, options(pure, nomem, nostack), ); } (tmp, remainder) }
A few things to note in this implementation:
div
instruction divides the eax
register by its operand value. It will store the quotient in eax
and the remainder in edx
.tmp
holds the dividend before our assembly is executed. It will contain the quotient after. This is because its value is tied to the eax
register and it will be used both for input and output thanks to the inout
operand.remainder
is also passed with the inout
operand, allowing us to clear the edx
register before the operation is done since remainder
is 0 before the assembly is executed. At the end, the register will hold the remainder value that we return in the output tuple.div
instruction.options
. Here: our block has no side effect, we're neither reading nor writing to the memory and we don't push anything to the stack.Compiling this, we can see that it ends up a bit more concise than the previous compiled assembly we built:
example::asm_div_rem: mov eax, edi ; Move the value of the dividend from EDI to EAX. xor edx, edx ; Zero out EDX to prepare for the division. div rsi ; Signed divide the dividend in EAX by the value in RSI('b'). ret ; Return from the function. quotient -> EAX, remainder -> EDX.
It should be faster but we're cheating a bit on rust here since we're not taking into account the panic check that was done in the previous version.
It's faster! The overhead of those panic checks is not that dramatic in the end. Branch prediction is probably contributing a lot to this. It was eluded in the first IR snippets but in there, the rust compiler hints the rest of the pipeline that those checks are likely to succeed. The compiler's ability to optimize and hint that these checks are more likely to pass than fail can lead to surprisingly efficient code, even when it appears to be doing "much more" than the assembly version.
Now that we built our unchecked version in assembly, we can get quite close to it by using unchecked div and rem from std::intrisics
:
#![feature(core_intrinsics)] pub unsafe fn unchecked_naive_div_rem(a: u32, b: u32) -> (u32, u32) { let quotient = std::intrinsics::unchecked_div(a, b); let remainder = std::intrinsics::unchecked_rem(a, b); (quotient, remainder) }
And its intermediate representation:
; Define a function named 'example::unchecked_naive_div_rem' that takes two arguments define { i32, i32 } @example::unchecked_naive_div_rem(i32 noundef %a, i32 noundef %b) unnamed_addr { start: %_0 = alloca { i32, i32 }, align 4; Allocate memory for a tuple to store the quotient and remainder %quotient = udiv i32 %a, %b; Perform unsigned division of %a by %b %remainder = urem i32 %a, %b; Compute the unsigned remainder of %a / %b ; ... ; Stores quotient and remainder in a tuple, loads them back, and constructs a return tuple with the values. ; ... ret { i32, i32 } %1 }
As you can see this IR doesn't contain anything about overflow checks and the performance should be close to or even better than our assembly version.
That's exactly the same time our assembly version takes. So let's have a look at the assembly:
example::unchecked_naive_div_rem: mov eax, edi xor edx, edx div esi ret
And it's the same as our assembly version! So bothering writing our assembly version is not really worth it in the end.
To conclude, there is no performance drawback of not having a single div_rem
function
since the compiler optimizations will take care of this burden for us. It might seem odd
to write those two operations but unless very specific optimization killer cases they
will be fused as a single assembly instruction.
If you enjoyed this article, you can follow us on Twitter to get notified when we publish new articles.