Besides altering your algorithms, another easy way to improve the performance of your code is to reduce heap allocations. Compared to the heap, stack allocations are significantly faster to both allocate and free, as they usually amount to bumping a pointer, often one held in a dedicated CPU register.
The heap, on the other hand, is a complicated beast. There is a fair amount of bookkeeping involved in acquiring and releasing memory for your application, and avoiding this overhead will usually net you some decent performance gains, especially if allocations are happening frequently in your code.
Importantly, Rust is a systems programming language, albeit with a slight twist. One of the important aspects of systems programming languages is that you usually have to pay more attention to memory than you would in a language where memory is managed for you. Given the “modern” feel that Rust has, it can sometimes be surprising how and where memory is being used, especially for those new to these types of languages, or perhaps just unfamiliar with the idiosyncrasies of Rust itself.
To hopefully demystify this subject a little, we’re going to delve into Rust’s memory allocators, replace the default global allocator, and add a bit of instrumentation to surface where and when allocations are issued.
Obviously this will not be a fully robust solution we’re developing here, but more of an introduction to the topic, and one that you can choose to venture further with if you so wish.
It should go without saying that there are likely (definitely) better options out there: ones that are more robust, with fewer bugs, and more fully featured. But writing our own gives us some exposure to the inner workings of the Rust standard library, and an understanding of why the alternatives might be better or worse than our simple solution.
Blindly reaching for a dependency when you don’t fully understand the problem you’re solving, let alone the solution, is not a good habit in my opinion. Understanding the benefits and drawbacks of your dependency choices is an often overlooked skill that you should always seek to exercise as a software developer. And how can you possibly do that if you never look behind the curtain?
For those who don’t want to go through the exercise, or who just want to see the code in its final form before going through the steps below, you can take a look at this gist.
Onwards…
How to replace the global allocator
Firstly, the global allocator can be replaced by simply adding the #[global_allocator] annotation to your allocator of choice:
use std::alloc::System;
#[global_allocator]
static GLOBAL_ALLOCATOR: System = System;
This bit of code is clearly redundant, as we are simply setting the global allocator back to the default System allocator. That happens regardless, but now you know how it is done.
Making our very own allocator
First off we need to make our own allocator type. We’ll call it InstrumentedAllocator!
pub struct InstrumentedAllocator<T>(T);
You’ve probably noticed that our allocator is a generic type, with a single field holding that generic type. We will cover the reason for this in a little bit.
Up next we need to implement GlobalAlloc for our struct. This is a mandatory trait that all global allocators must implement for us to be able to successfully annotate our allocator with #[global_allocator].
// We'll need these imports from std::alloc
use std::alloc::{GlobalAlloc, Layout};

unsafe impl<T> GlobalAlloc for InstrumentedAllocator<T>
where
    T: GlobalAlloc,
{
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {}
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {}
}
There are just a couple of mandatory functions (alloc and dealloc) which the GlobalAlloc trait requires we implement. I think their purposes are fairly self-explanatory: one acquires the memory, the other releases it.
To answer the question from before, “why is our allocator a generic type?”: we can see now that we want to store something that implements GlobalAlloc in our type. Why would we need to store another GlobalAlloc in our allocator, which is already implementing GlobalAlloc?
The answer is that we need some way to actually allocate memory. We could write all the allocation code ourselves, but the purpose of this exercise is only to instrument allocations. So to keep this simple we will use the system allocator to do the heavy lifting for us:
#[global_allocator]
static ALLOCATOR: InstrumentedAllocator<System> = InstrumentedAllocator(System);
It is worth noting that by using the system allocator we are avoiding portability issues. Out of the box the system allocator supports all the different platforms that Rust runs on. If we wrote our own memory allocator we would need to take those different platforms into consideration! Definitely out of scope for now.
Implementing alloc
Our code obviously won’t compile yet:
|
12 | unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
| ----- ^^^^^^^ expected *-ptr, found `()`
| |
| implicitly returns `()` as its body has no tail or `return` expression
|
= note: expected raw pointer `*mut u8`
found unit type `()`
We first need to provide implementations for our trait functions. We’ll start with alloc.
This function takes just one parameter, the desired layout of the memory being requested, and in response returns a mutable pointer to the start of our block of bytes.
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    let ptr = self.0.alloc(layout);
    ptr
}
About the only thing in here that might need a bit of an explanation is the use of self.0.alloc. Since our struct definition uses an unnamed field, we need to refer to it by its position, rather than by its name.
pub struct InstrumentedAllocator<T>(T);
// self.0 refers to this field ----^

// The (imaginary) code below is equivalent to the line above.
// It won't compile though, since it is against the rules to use
// plain numbers for field names.
pub struct InstrumentedAllocator<T> {
    pub 0: T,
}
So what self.0.alloc(layout) is saying is simply that we are calling alloc() on the type stored in the first field of our struct, which just so happens to be a generic type. In our case this generic type is the system allocator, since that is what we passed in earlier:
static ALLOCATOR: InstrumentedAllocator<System> = InstrumentedAllocator(System);
Hopefully that clears everything up. Regardless, onto dealloc…
Implementing dealloc
Much like alloc, dealloc is very simple for us to implement:
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
    self.0.dealloc(ptr, layout)
}
There isn’t much to explain here, other than that we are expected to pass in the same raw pointer we received from our call to alloc, along with the same Layout. Once again we are calling the system dealloc function, and then calling it a day.
And with that, we should now have a working allocator. If we stick something simple in main we can confirm that things are working.
fn main() {
    // Try something that we know will allocate on the heap
    let s = String::from("Hello, Clarice!");
    println!("{}", s);
}
Success!
Finished dev [unoptimized + debuginfo] target(s) in 1.64s
Running `target/debug/playground`
Hello, Clarice!
Obviously there is nothing new happening yet, but this just proves that our new allocator is working as it should.
On to the interesting stuff…
Instrumentation
Ok, with the groundwork done (the allocation), we can now start to instrument that allocation with a bit of information. Let’s just add a simple print to STDOUT with the number of bytes we’ve allocated.
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    let ptr = self.0.alloc(layout);
    println!("{} bytes", layout.size());
    ptr
}
error: process didn't exit successfully: `target\debug\alloc.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)
Uh, what? How does a println! cause a crash?
It is actually fairly simple if you think about it. While allocating the memory we are trying to call println!. Given that the string we’re trying to display is constructed at runtime, we first need to allocate some memory for that string before we can write to it. Of course, this will need to call the global allocator to request that extra bit of memory. Wait. We’re the global allocator! So we call our own alloc function again, which then calls println! again, which then needs to call alloc again… I think you get the picture.
What we have here is an infinite loop of allocations, which we need to somehow break if we are going to be able to call println! from within our allocator.
The simplest way to solve this is to have a boolean value to guard the println! and prevent further calls while the flag is still set. We’ll set the flag the first time we enter alloc, and then on subsequent calls we’ll check that flag, and if it is set, we won’t bother trying to call println! again. Once we have done our work, we’ll unset the flag, allowing future calls to println! again.
So after all that, our alloc function should look like this:
// Not shown before: the guard flag itself, a mutable static
// (accessible inside our unsafe fns, but not thread-safe; more on that later)
static mut GUARD: bool = false;

unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    let ptr = self.0.alloc(layout);
    if !GUARD {
        GUARD = true;
        println!("{} bytes", layout.size());
        GUARD = false;
    }
    ptr
}
If we run our application and check the output now we should get something like this:
5 bytes
48 bytes
64 bytes
15 bytes
Hello, Clarice!
Success! We can now see some allocations are being made and our program completes successfully.
Interestingly, though, there are a few more allocations than the one we would expect from our simple application. The last one for 15 bytes makes sense: we need exactly 15 bytes to hold our "Hello, Clarice!" string. But what are the other allocations for?
For that we’re unfortunately going to need a bit of extra information, which leads us to…
Backtracing the IP
What we really need is some information about where in our application we are when an allocation is requested. For that I recommend we use a crate called backtrace. This will allow us to get some runtime information about the stack frame we’re in, which will in turn also allow us to filter out some of the data we’re not interested in.
As I mentioned earlier, blindly pulling a crate into your project is not something that I would normally advocate for. However, this is an experiment, and understanding how backtraces are captured is somewhat out of scope for us at the moment. I do suggest taking a look at the source for backtrace when you get the chance though.
Also, I won’t go into too much detail about what backtraces are in this article. For our purposes, all you need to know is that a backtrace allows us to step back through time to see where we were in our code when an allocation was requested. We’ll get helpful information like the name of the file, the function, and even the line number where this occurred, assuming we are running our application in debug mode, of course. Debug mode keeps all that information in the final binary, whereas release mode strips it out for performance reasons. There are ways to add it back in for release mode, but I’ll leave that up to you to figure out.
With all that said, let’s add the following to our Cargo.toml file to get the backtrace crate:
[dependencies]
backtrace = "0.3.66"
Then in the guarded section of our alloc function we’re going to capture a backtrace, step through all the stack frames, and pull out the information we’re looking for.
// Don't forget to import Backtrace
use backtrace::Backtrace;

// A global value we can use if we don't find the values we're looking for
const UNKNOWN: &str = "<unknown>";

unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    let ptr = self.0.alloc(layout);
    if !GUARD {
        GUARD = true;
        let bt = Backtrace::new();
        for frame in bt.frames() {
            for symbol in frame.symbols() {
                // Here we are pulling out the name of the function, the file
                // in which the function can be found, and the line number
                // where the allocation occurs.
                let name = symbol
                    .name()
                    .map(|n| n.to_string())
                    .unwrap_or(UNKNOWN.to_string());
                let filename = symbol
                    .filename()
                    .map(|f| f.display().to_string())
                    .unwrap_or(UNKNOWN.to_string());
                let lineno = symbol
                    .lineno()
                    .map(|l| l.to_string())
                    .unwrap_or(UNKNOWN.to_string());
                println!("{}:{} {} {} bytes", name, lineno, filename, layout.size());
            }
        }
        GUARD = false;
    }
    ptr
}
Besides the nuances of backtrace itself, I think the code here should be fairly self-explanatory for those with some familiarity with Rust. There are some definite inefficiencies in the implementation above, but it’ll do for our purposes for now.
Running this now, we should start to see the bigger picture:
"backtrace::backtrace::dbghelp::trace":98 ".cargo\\registry\\src\\github.com-1ecc6299db9ec823\\backtrace-0.3.66\\src\\backtrace\\dbghelp.rs" 5 bytes
"backtrace::backtrace::trace_unsynchronized<backtrace::capture::impl$1::create::closure_env$0>":66 " .cargo\\registry\\src\\github.com-1ecc6299db9ec823\\backtrace-0.3.66\\src\\backtrace\\mod.rs" 5 bytes
"backtrace::backtrace::trace<backtrace::capture::impl$1::create::closure_env$0>":53 " .cargo\\registry\\src\\github.com-1ecc6299db9ec823\\backtrace-0.3.66\\src\\backtrace\\mod.rs" 5 bytes
"backtrace::capture::Backtrace::create":176 " .cargo\\registry\\src\\github.com-1ecc6299db9ec823\\backtrace-0.3.66\\src\\capture.rs" 5 bytes
"backtrace::capture::Backtrace::new":140 " .cargo\\registry\\src\\github.com-1ecc6299db9ec823\\backtrace-0.3.66\\src\\capture.rs" 5 bytes
"alloc::impl$0::alloc<std::alloc::System>":20 "rust-bits-and-pieces\\src\\alloc.rs" 5 bytes
"alloc::_::__rg_alloc":140 "rust-bits-and-pieces\\src\\alloc.rs" 5 bytes
And that bigger picture is a bit of a mess.
What we have here is the full path our code traversed at the point the backtrace was captured, including the inner functions of backtrace itself, the Rust standard library, and even kernel/OS-level calls. It is hard to make out the bit we’re interested in, so we’re going to need to cut back a lot of the cruft.
Here are a couple of conditionals we can add to the code block to remove all the things we’re not interested in:
// Add this as a global outside of the alloc function
const IGNORED_WORDS: [&str; 4] = [
    "invoke_main",
    "__scrt_common_main_seh",
    "alloc::_::__rg_alloc",
    "backtrace::",
];

...

// Back inside our alloc function again, we add in the conditionals
let filename = symbol
    .filename()
    .map(|f| f.display().to_string())
    .unwrap_or(UNKNOWN.to_string());

// This strips out most of the Rust std library code and anything without a filename
if filename.starts_with("/rustc") || filename.eq(UNKNOWN) {
    continue;
}

let name = symbol
    .name()
    .map(|n| n.to_string())
    .unwrap_or(UNKNOWN.to_string());

// And this strips out all the rest of the things we don't care about
if IGNORED_WORDS
    .iter()
    .any(|n| name.starts_with(n) || name.ends_with(n))
    || name.contains("std::alloc::System")
{
    continue;
}

...
...
Running this should now give something far more relevant:
"alloc::main":162 "rust-bits-and-pieces\\src\\alloc.rs" 15 bytes
Hello, Clarice!
Success! We can now see the exact spot where we allocated our string in main, and how many bytes were allocated.
Bonus objective: Threading
There is one major problem with our solution: it isn’t thread safe. More specifically, the guard we’re using isn’t thread safe. If multiple threads allocate at the same time, one thread could see the flag set by another thread and skip reporting its allocation entirely. (Unsynchronised reads and writes to a mutable static are also a data race, which is undefined behaviour.)
A quick way to resolve this is to have a guard for each thread, which can be achieved quite simply with thread_local storage. I won’t explain the intricacies of thread-local storage here, but suffice it to say that it should achieve our stated goal of a single instance of GUARD per thread. Once again, I suggest you read up a little about it after this exercise is done.
thread_local! {
    static mut GUARD: bool = false;
}
Unfortunately for us, this doesn’t compile. thread_local! does not support mutable statics, so we’ll need to reach for a Cell.
This article is pretty long already, so I won’t go into too much detail with regards to how Cell works, but importantly it allows us to mutate a value without it being marked as mut. It does this by creating a copy of the value in the Cell when you get it, and replacing the value inside when you set it, rather than sharing the underlying value around. The linked documentation on Cell is pretty good, and you should definitely give it a read if you are going to be doing Rust in any serious capacity.
use std::cell::Cell;

thread_local! {
    static GUARD: Cell<bool> = Cell::new(false);
}

// Inside alloc, the guarded section then becomes:
GUARD.with(|guard| {
    if !guard.get() {
        guard.set(true);
        // Guarded code goes in here
        guard.set(false);
    }
});
With that change in, things should work the same as they did before, except that you’ll get slightly less buggy behaviour in a threaded environment.
Once again, there are likely better and more efficient ways to solve this problem. Writing to STDOUT on every allocation means any performance-critical section of your code will definitely suffer under this allocator, but as a simple learning exercise it will do fine. If you just want to identify some of the places where allocations are taking place in your code, this should do the job.
Congratulations, you did it!
Once again, here is the final implementation for you to review.
Conclusion
If you take a look at the final implementation, there really isn’t that much code here. Yes, it isn’t very efficient, and sure, it doesn’t cover every scenario, but if your goal is to actually learn and take on this information, this exercise will hopefully have helped in that regard.
For mission critical projects you can certainly find a more complete solution, but if you’re just working on a hobby project, or experimenting, making your own will give a healthy dose of perspective, and perhaps a greater appreciation for how simple or hard things actually are.
References
- The documentation and source for the backtrace crate
- Rust documentation for std::alloc
- Rust documentation for Layout
- Rust documentation for Cell
- Rust documentation for thread_local storage
- Article by James Golick on how memory allocators work under the hood
- Rust’s secret sauce