Rust Iterator Optimizations

2024-05-22

I recently read Nicole Tietz-Sokolskaya’s article “Rust’s iterators optimize nicely—and contain a footgun”. It’s a great investigation – read it before continuing!

Her conclusion is:

Rust will optimize iterator usage in much the same way that Haskell does. It will combine arbitrary iterator usage and reduce it down to a for loop. That’s pretty neat!

I agree, that’s neat! She also asks:

Now, how does it do it? That’s beyond my expertise. It happens somewhere in the compiler. I’d love to find that out, though!

I’ve been doing some work with LLVM’s optimization pipelines recently, so I thought I knew the answer. (Spoiler: I was right!) I played around with Compiler Explorer to confirm, and learned some more about how to use it along the way. Here’s what I found!

Hypothesis: Monomorphization and inlining

My hypothesis was that this comes from a combination of two techniques: monomorphization in the Rust compiler, and inlining in the LLVM optimizer.

The Rust compiler monomorphizes generic types and functions (all the things with <...> in their name). Given a type or function MyGeneric<X: ...>, Rust creates a separate copy for each concrete type that fills X (MyGeneric<Foo>, MyGeneric<Bar>), and treats them separately down the line: when type checking, when generating code, and so on.
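For a concrete sketch (this largest function is my own toy example, not from Nicole's post), a single generic definition becomes one compiled copy per concrete type:

```rust
// A minimal sketch of monomorphization; `largest` is a made-up example.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut best = items[0];
    for &item in &items[1..] {
        if item > best {
            best = item;
        }
    }
    best
}

fn main() {
    // rustc generates two independent functions, largest::<i32> and
    // largest::<f64>, and optimizes each one separately.
    println!("{}", largest(&[1, 5, 3]));
    println!("{}", largest(&[1.0, 0.5, 2.5]));
}
```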

For Iterator::map(), Iterator::filter(), and similar combinators, that means filling in the type of the underlying iterator (my_iter in my_iter.map(f)) and the functor applied (f in my_iter.map(f)). These concrete types are probably only used in one place: where the particular functor f is applied.
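As a sketch – this is my own reduction of the plus_evens example we'll meet below – each adapter call bakes the closure's anonymous type into the resulting iterator type:

```rust
// Each adapter captures the previous iterator's type plus the closure's
// anonymous type, so these monomorphized functions exist nowhere else.
pub fn plus_evens(xs: &[i32]) {
    xs.iter()                    // core::slice::Iter<'_, i32>
        .filter(|x| *x % 2 == 0) // Filter<Iter<'_, i32>, {closure#0}>
        .map(|x| x + 1)          // Map<Filter<..., {closure#0}>, {closure#1}>
        .for_each(|x| println!("{x}"));
}
```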

When the Rust compiler is done doing “Rust stuff” (type checking, borrow checking…), it hands the program off to a codegen backend to optimize and compile to machine code. By default, this backend is LLVM, so the Rust compiler emits code in the LLVM Intermediate Representation (IR).

The LLVM optimizer can see when a function is only called in one place: the monomorphized Map<MyArray, F1>::next is only called from Filter<Map<MyArray, F1>, F2>::next, which is only called from my_example. In these cases, instead of paying the overhead of a function call, the compiler might choose to inline the code from the callee into the caller, and then optimize that caller further.

I thought between these two techniques, the Rust compiler and LLVM optimizer would get most of the way towards “all these functions are the same”; additional optimizations would drive them even closer together.
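Hand-waving a bit, the end state I expected looks like this: a hand-written equivalent of what inlining the whole chain should leave behind (my illustration, not actual compiler output).

```rust
// Morally what remains after Filter::next, Map::next, and for_each are all
// inlined into the caller: one plain loop, no iterator calls left.
pub fn plus_evens_inlined(xs: &[i32]) {
    for &x in xs {
        if x % 2 == 0 {          // the filter closure's body, inlined
            let y = x + 1;       // the map closure's body, inlined
            println!("{y}");     // the for_each closure's body, inlined
        }
    }
}
```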

Turns out – yes, as far as I can tell! Let’s see it in action!

Monomorphization, experimentally

Matt Godbolt et al.’s Compiler Explorer is a purpose-built tool for this kind of investigation: taking a small code sample, and seeing how the code is treated by different stages of compilation.

I’m using rustc 1.78.0 as the Rust compiler, which should lead to pretty consistent output below; let me know if you spot any errors!

To start off, I’ll take a slightly different example than Nicole’s code. Check this out (in another tab)!

The upper-left panel has the source file, with each line highlighted in a different color. The bottom-left panel has the fully-compiled machine code (we’ll ignore that), and the right panel has the LLVM IR.

Note how each line on the right is colored to match a source line: this is “what that source line was compiled to”.

If we look at the functions @example::plus_evens::h3e97a5ef4523fe5e and @example::minus_odds::h9a0c327b983eb5fa, we see they have the same structure: allocate some values, make some calls, return. The calls are mostly the same, even: Iterator::filter, then Iterator::map, then Iterator::for_each.

But! Look closely – as of the compiler version I used, I got:

| plus_evens | minus_odds |
| --- | --- |
| _ZN4core5slice29_[...]hba5496e8cfd60364E | (same) |
| filter::hd00cdd721ef593af | filter::he3404cc7f49effb9 |
| map::h4dc70664041aed73 | map::h9ffee8949f77ed3e |
| for_each::h2fc5b39b6cc128db | for_each::h99898954021474f8 |

That first call – to core5slice29-something-something – is the same. That makes sense: in both cases, we’re taking a 5-element array and getting an iterator for it.

But the other calls are different!

The resulting type – the one we call for_each on – is something like Map<Filter<SliceIter, AnonPredicate>, AnonMapper>.

This has occasionally annoyed me when using Rust iterators: I have wound up with deeply nested types, of the form Fold<FilterMap<Enumerate<...>>> or similar, because each call “captures” the previous type in its return type.
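If you ever want to see one of these types spelled out, std::any::type_name will print it; a quick diagnostic trick, not something from Nicole's post:

```rust
// Prints the full monomorphized type of any value.
fn type_of<T>(_: &T) -> &'static str {
    std::any::type_name::<T>()
}

fn main() {
    let xs = [1, 2, 3, 4, 5];
    let iter = xs.iter().filter(|x| *x % 2 == 0).map(|x| x + 1);
    // Prints something like:
    //   core::iter::adapters::map::Map<core::iter::adapters::filter::Filter<
    //       core::slice::iter::Iter<'_, i32>, {closure}>, {closure}>
    println!("{}", type_of(&iter));
}
```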

But! Because of that, the functions generated for these types – their next, map, for_each methods – are only used in one place. That makes it easy for LLVM to decide to inline!

Inlining, experimentally

Inlining a function definition – replacing “call to function” with “text of function” – has some tradeoffs: it skips the call overhead and allows new optimizations, but expands the (instruction) size of each caller. If a function is only called from one place, though, it can be a net decrease in program size – usually a win!
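Rust code can also nudge the inliner with the #[inline] attribute – a hint, not a guarantee. A sketch with a made-up Doubler adapter:

```rust
// A made-up iterator adapter; #[inline] marks its next() as a good
// inlining candidate, the way std's own adapters do.
pub struct Doubler<I> {
    inner: I,
}

impl<I: Iterator<Item = i32>> Iterator for Doubler<I> {
    type Item = i32;

    #[inline] // hint to the optimizer that this small body is worth inlining
    fn next(&mut self) -> Option<i32> {
        self.inner.next().map(|x| x * 2)
    }
}
```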

I’ve heard and seen that LLVM is pretty aggressive in inlining. Let’s see how it works on these two examples, using Compiler Explorer’s “Opt Pipeline” view.

We can see what optimizations the compiler is doing by going to the compiler pane (the one that says rustc) and clicking “Add new…” > “Opt Pipeline”. It might take a minute to load,[1] then we get this view.

The new “Opt Pipeline Viewer” panel has a drop-down to select which function we’re looking at. Along the right are the LLVM optimization passes that the Rust compiler is running; I’ve filtered down to only those that have an effect. Clicking a pass name in the left-hand subpanel shows what the pass did, as a left/right diff; for instance, the “Eliminate PHI nodes” pass just changed a comment at the top.

By default, the Rust compiler isn’t doing a lot of optimization. If we add the -C opt-level=1 flag to the compiler invocation,[2] we get these results – check out how many more passes have an effect!

Looking at filter_then_map,[3] we can see “InlinerPass” is active (towards the top of the passes list). It changes the LLVM IR significantly, turning the function from “just a few calls” into “several basic blocks”. And the only remaining call is to std::io::stdio::_print – we’ve inlined the calls to iter, filter, map, and for_each!

If you look carefully – and have been reading LLVM IR for the last six weeks – you can see the control flow of filter and for_each:

[Figure: control-flow graph of filter_then_map]

There are some interesting numeric optimizations, which I’ll defer to a footnote.[4]

There are also missed optimizations, like the conditional edges out of the start block – I think that’s “left over” for_each code, for skipping the iteration if the slice is empty, but I’m not sure.

The nice thing about the “pass” architecture is that no single pass has to do everything: this pass “just” inlines; later passes can do optimizations like unrolling and eliminating the loop altogether.[5]
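For instance, with a fixed five-element array, unrolling plus constant folding can dissolve the whole chain – at higher opt-levels, something like this typically compiles down to returning a constant (worth checking in Compiler Explorer; exact results vary by compiler version):

```rust
// With the input known at compile time, later passes can unroll the loop and
// fold the whole iterator chain into a precomputed value: (2+1) + (4+1) = 8.
pub fn plus_evens_const() -> i32 {
    [1, 2, 3, 4, 5].iter().filter(|x| *x % 2 == 0).map(|x| x + 1).sum()
}
```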

Loops and combinators

OK, so we’ve gone through Compiler Explorer on these examples, up to opt-level=1. We’ve done .filter().map() and filter_map(); let’s add the for ... in version to cover all the cases from Nicole’s post.
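For reference, the three shapes look roughly like this (my own reconstruction of the comparison; Nicole's actual snippets differ in their details):

```rust
// Three spellings of the same computation; per Nicole's post (and the
// experiment below), they all optimize to the same machine code.
pub fn with_filter_then_map(xs: &[i32]) -> i32 {
    xs.iter().filter(|x| *x % 2 == 0).map(|x| x + 1).sum()
}

pub fn with_filter_map(xs: &[i32]) -> i32 {
    xs.iter()
        .filter_map(|x| if x % 2 == 0 { Some(x + 1) } else { None })
        .sum()
}

pub fn with_for_in(xs: &[i32]) -> i32 {
    let mut sum = 0;
    for x in xs {
        if x % 2 == 0 {
            sum += x + 1;
        }
    }
    sum
}
```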

This is what we get! The active passes aren’t quite the same: for_in activates some earlier passes compared to filter_map. (A guess: since more of the control flow is already in the function before inlining, these passes can act ~immediately.)

A lot of passes seem to be repeated, though – I think this is because one optimization can unlock new opportunities for another. The net result is exactly what Nicole says: the filter_then_map, filter_map, and for_in implementations all optimize to the same machine code.[6]

I like this! Write the code you find readable – let the machine figure out how to run it quickly!

Acknowledgements

Thanks to Nicole for writing the post that kicked this off, and Anvay for working with me through a bunch of LLVM experiments recently.


  1. Why does it take so long? I think Compiler Explorer is doing N invocations of the opt pipeline, one for each pass, and storing the differences. Since CE needs the “before” and “after”, each of those steps would be serialized… and there are several of them. This is just speculation, though – I haven’t checked the source code.

  2. Nicole used the “release” build profile for her post, which implies opt-level=3 for rustc (at time of writing). A higher opt-level (generally) adds more passes, which take more time but give better results; since level 1 offers inlining, I didn’t bother bumping up to the higher opt-levels.

  3. You can switch to filter_map by going to the “Opt Pipeline Viewer” panel and changing the “Function” dropdown. It looks basically the same, so I won’t go over it in detail.

  4. In block 30, the compiler has turned *x % 2 == 0 into and i32 %33, 1. This is a pretty typical trick, converting something that would require division into (faster) bit-masking. The fancier version is turning x + 1 into or disjoint i32 %33, 1: because the compiler “knows” that x is an even number, it can turn the addition of 1 into a single bit-set. In hardware, that can be “cheaper” than addition, since it doesn’t involve carry dependencies between bits. Try changing the condition to “print evens” in the source and see what happens!

  5. Loop unrolling is, like function inlining, one of those optimizations that expands instruction size in order to make control flow more linear; good for branch predictors, bad for instruction cache. I don’t know what heuristics LLVM’s unrolling pass uses to figure out whether it should or shouldn’t unroll.

  6. Interestingly, there are some differences at the LLVM IR layer at opt-level=1: a different order of variable initialization and other small changes. These differences disappear when compiled to x86_64 assembly; it seems like there are some ISA-specific optimization passes, which might drive towards similarity. I don’t know if there’s an explicit “canonicalize” step.