
Rust: Dropping heavy things in another thread can make your code 10000x faster - timooo
https://abramov.io/rust-dropping-things-in-another-thread
======
fpgaminer
Some important things I think people should note before blindly commenting:

* The example code is obviously contrived. The real gist is that massive deallocations in the UI thread cause lag, which the example code proves. That very thing can easily happen in the real world.

* I didn't see any difference on my machine between a debug build and a release build.

* The example is performing 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.

* Rust is not copying anything, nor duplicating the structures here. In the example code the structures would be moved, not copied, which costs nothing. The deallocation is taking up 99% of the time.

* As an aside, compilers have used the trick of not freeing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical; we do deallocation trickery in the real world.
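For reference, the pathological case is easy to reproduce with std-only Rust; the names and sizes below are ours, not the article's, and timings vary by machine and allocator:

```rust
use std::thread;
use std::time::Instant;

fn heavy_thing(n: usize) -> Vec<Vec<u8>> {
    // One outer vector of n small inner vectors: dropping it performs
    // one deallocation per inner vector, which is the pathological case.
    (0..n).map(|i| vec![i as u8; 16]).collect()
}

// Dropping on the current thread: the caller pays for every free.
fn len_drop_here(things: Vec<Vec<u8>>) -> usize {
    things.len()
}

// Dropping on another thread: the caller pays only for the spawn.
fn len_drop_elsewhere(things: Vec<Vec<u8>>) -> (usize, thread::JoinHandle<()>) {
    let len = things.len();
    let handle = thread::spawn(move || drop(things));
    (len, handle)
}

fn main() {
    let things = heavy_thing(1_000_000);
    let t = Instant::now();
    let (len, handle) = len_drop_elsewhere(things);
    println!("drop in another thread: {} items, {:?}", len, t.elapsed());
    handle.join().unwrap(); // wait so the deferred drop actually finishes

    let things = heavy_thing(1_000_000);
    let t = Instant::now();
    let len = len_drop_here(things);
    println!("drop in this thread: {} items, {:?}", len, t.elapsed());
}
```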

~~~
papaf
This deallocation trick is neat but in C and C++ you could use a memory pool
to do this.

In theory, you could also use a memory pool in Rust but I think the standard
library uses malloc without some way of overriding this behaviour.

~~~
orf
You can change the global allocator in any Rust project. You can write your
own easily enough, or use one like jemalloc
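A minimal sketch of swapping the global allocator; to stay std-only this uses a counting wrapper around `System` rather than jemalloc, but the `#[global_allocator]` mechanism is the same:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// A toy global allocator that counts frees by delegating to the system
// allocator; jemalloc and friends are installed the same way.
struct CountingAlloc;

static FREES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        FREES.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let v: Vec<Vec<u8>> = (0..1000).map(|_| vec![0u8; 16]).collect();
    drop(v); // every inner vector plus the outer one hits `dealloc`
    println!("frees so far: {}", FREES.load(Ordering::Relaxed));
}
```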

~~~
josephg
Sure; but when you’re using arenas and things like that, you usually
want different objects allocated into different pools (or with different
lifetime properties). Rust only lets you pick one allocator for the entire
process, so you can’t specify “all the children of this data structure go in
arena A, and this other allocation goes into a traditional heap”.

It’s more awkward, but I much prefer Zig’s approach here where everything that
allocates takes an allocator as a parameter. Usually the allocator is
specified at compile time - in which case zig can generate identical code to
the rust compiler. But when you want the flexibility, it’s there.

Aside from compilers, this is heavily used in video games, where there are
often a lot of objects that get allocated per frame and can be discarded all
together. And in that case rust’s lifetime tracking would be a huge asset. The
dovecot email server (C) also makes superb use of a mix of memory containers
for different tasks. Given how messy email parsing is, dovecot is an absolute
pleasure to read.

~~~
afiori
As someone that is interested in the topic and in Zig's "provide your own
allocator" approach I have a question: would it be possible to make an
allocator wrapper that moves values to be deallocated to a different thread?

As far as I know it would require both Rust's borrowing semantics and Zig's
architectural choice.

From a purely fan-boy perspective, Zig's approach is something I would have
really liked for Rust to adopt (I have no idea which one came first
chronologically).

~~~
steveklabnik
It is not an allocator wrapper, but [https://docs.rs/defer-drop/1.0.1/defer_drop/index.html](https://docs.rs/defer-drop/1.0.1/defer_drop/index.html)

------
chowells
This is the standard problem with tracing data structures to free them. You
frequently run into it with systems based on malloc/free or reference
counting. The underlying problem is that freeing the structure takes time
proportional to the number of pointers in the structure it has to chase.

Generational/compacting GC has the opposite problem. Garbage collection takes
time proportional to the live set, and the amount of memory collected is
unimportant.

There's actually a lot to be said for Rust in that the ownership system lets
you transfer freeing responsibility off-thread safely and cheaply, so it
doesn't block the critical path.

But overall, there's nothing really unexpected here, if you're familiar with
memory management.

~~~
arcticbull
> This is the standard problem with tracing data structures to free them. You
> frequently run into it with systems based on malloc/free or reference
> counting. The underlying problem is that freeing the structure takes time
> proportional to the number of pointers in the structure it has to chase.

That doesn't seem to make intuitive sense. A GC has the same problem.

A garbage collector has to traverse the data structure in a similar way to
determine whether it (and its embedded keys and values) is part of the live
set or not, and to invoke finalizers. You're beginning your comparison after
the mark step, which isn't a fair assessment, since what Rust is doing is akin
to both the mark and sweep phases.

The only way to drop an extensively nested structure like this any faster than
traversing it would be an arena allocator, and forgetting about the entire
arena.

The difference between a GC and this kind of memory management is that the GC
does the traversal later, at some point, non-deterministically. Rust allows
you to decide between deallocating it in place, immediately, or deferring it
to a different thread.

~~~
chowells
I said generational/compacting collector. You're talking about a mark and
sweep collector.

A generational/compacting collector traverses pointers from the live roots,
and copies everything it finds to the start of its memory space, and then
declares the rest unused. If there is 1GB of unused memory, it's irrelevant.
Only the things that can be reached are even examined.

As I said, this has the opposite problem. When the live set becomes huge, this
can drag performance. When the live set is small, it doesn't matter how much
garbage it produces, performance is fast.

~~~
arcticbull
How are finalizers invoked if the structure isn't traversed? Would it just be
optimized away if none of the objects have finalizers? Hence my suggestion
about the arena allocators being a better point of comparison.

~~~
zucker42
Java is an example of a language with a generational copy collector by
default. Most objects in Java don't have a finalizer, since after all the main
point of, for example, destructors in C++ is to make sure you don't leak
memory, which the GC solves. But when the `finalize` method is used, it causes
significant overhead.

> Objects with finalizers (those that have a non-trivial finalize() method)
> have significant overhead compared to objects without finalizers, and should
> be used sparingly. Finalizeable objects are both slower to allocate and
> slower to collect. At allocation time, the JVM must register any
> finalizeable objects with the garbage collector, and (at least in the
> HotSpot JVM implementation) finalizeable objects must follow a slower
> allocation path than most other objects. Similarly, finalizeable objects are
> slower to collect, too. It takes at least two garbage collection cycles (in
> the best case) before a finalizeable object can be reclaimed, and the
> garbage collector has to do extra work to invoke the finalizer. [1]

Sure, you're technically correct that if the objects all had finalizers that
did the same thing as C++ destructors, it would be equivalent, but because of
the existence of a GC we don't have to do any work for most objects. A GC is
equivalent to an arena allocator in this sense.

Another point is the C++/Rust pattern of each object recursively freeing the
objects it owns presumably leads to slower deallocation, because in the
general case it involves pointer following and non-local access.

[1]
[https://www.ibm.com/developerworks/java/library/j-jtp01274/i...](https://www.ibm.com/developerworks/java/library/j-jtp01274/index.html)

~~~
msclrhd
Destructors in C++ aren't just for making sure you don't leak memory. They are
used for many lifetime-controlled things, such as: 1. general resource cleanup
(file handles, database connections, etc.) using RAII (Resource Acquisition Is
Initialization); 2. tracing function entry/exit.

~~~
mcguire
Apparently, that doesn't work in Rust:
[https://news.ycombinator.com/item?id=23363647](https://news.ycombinator.com/item?id=23363647)

~~~
the8472
It does work in Rust. You just cannot rely on Drop _for memory-safety_. If you
mem::forget a struct that holds onto some other resource then all that means
is that you're committing that resource to the lifetime of the process. We
usually call that a leak but it can be intentional.
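A small illustration of that point, with a counter standing in for the held resource, so the intentional leak is observable:

```rust
use std::mem;
use std::sync::atomic::{AtomicUsize, Ordering};

// Count destructor runs so the leak is visible.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Tracked;

impl Drop for Tracked {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    drop(Tracked); // the destructor runs: one drop recorded
    mem::forget(Tracked); // safe leak: the destructor never runs
    assert_eq!(DROPS.load(Ordering::SeqCst), 1);
    println!("drops recorded: {}", DROPS.load(Ordering::SeqCst));
}
```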

------
littlestymaar
The title is slightly wrong: it's not going to make your code _faster_ , it's
going to reduce _latency_ on the given thread.

It may be a net win if this is the UI thread of a desktop app, but overall it
will come at a performance cost: modern allocators have thread-local memory
pools, and now you're moving away from them. And if you're running your code
on a NUMA system (most servers nowadays), when moving from one thread to
another you can end up freeing non-local memory instead of local memory. Also,
you won't have any backpressure on your allocations, so you are at risk of
running out of memory (especially because your deallocations now occur more
slowly than they should).

Main takeaway: if you use it blindly it's an anti-pattern, but it can be a
good idea in its niche: the UI thread of a GUI.

~~~
pshc
Yes it’ll reduce latency, but doesn’t it also increase parallelism? A single-
threaded program ought to improve overall, unless the extra overhead you
mentioned dominates. A parallel program might improve or not.

I think if you wanted to do deferred destruction right, ideally you’d mod an
allocator to have functions like (alloc_local, alloc_global, free_now,
free_deferred) to avoid exhausting memory. Traits could make this ergonomic.

Also I admit I don’t understand why “you won’t have any backpressure on your
allocations,” shouldn’t deferred destruction give you more backpressure if
anything? I am probably confused.

~~~
tsimionescu
> Also I admit I don’t understand why “you won’t have any backpressure on your
> allocations,” shouldn’t deferred destruction give you more backpressure if
> anything? I am probably confused.

I think the point is that, if the same thread is doing both allocation and de-
allocation, the thread is naturally prevented from allocating too much by the
work it must do to de-allocate. If you move the de-allocation to another
thread, your first thread may now be allocating like crazy, and the de-
allocation thread may not be able to keep up.

In a real GC system, this is not that much of a problem, as the allocator and
de-allocator can work with each other (if the allocator can't allocate any
more memory, it will generally pause until the de-allocator can provide more
memory before failing). But in this naive implementation, the allocator thread
can exhaust all available memory and fail, even though there are a lot of
objects waiting in the de-allocation queue.
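One way to restore that backpressure is a bounded queue: a sketch using std's `sync_channel`, where a full queue blocks the allocating thread until the dropper catches up (names and sizes are ours):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Capacity 8: at most 8 heavy values can wait in the drop queue.
    // A 9th `send` blocks the allocating thread until the dropper
    // catches up, which is the backpressure the naive scheme lacks.
    let (tx, rx) = sync_channel::<Vec<Vec<u8>>>(8);
    let dropper = thread::spawn(move || {
        for garbage in rx {
            drop(garbage);
        }
    });

    for _ in 0..100 {
        let heavy: Vec<Vec<u8>> = (0..10_000).map(|_| vec![0u8; 8]).collect();
        tx.send(heavy).unwrap(); // blocks when the queue is full
    }
    drop(tx); // close the channel so the dropper thread exits
    dropper.join().unwrap();
}
```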

~~~
pshc
Ah, I see what you mean, thanks!

------
cesarb
Just be careful, because moving heavy things to be dropped to another thread
can change the _semantics_ of the program. For instance, consider what happens
if within that heavy thing you had a BufWriter: unless its buffer is empty,
dropping it writes out the buffer, so now your file is being written and
closed at a random moment in the future, instead of being guaranteed to have
been sent to the kernel and closed when the function returns.

And it can even be worse if it's holding a limited resource, like a file
descriptor or a database connection. That is, I wouldn't recommend using this
trick unless you're sure that the only thing the "heavy thing" is holding is
memory (and even then, keep in mind that _memory_ can also be a limited
resource).
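A sketch of defusing the hazard: flush explicitly before deferring, so only memory (not buffered writes) is released late. The in-memory `Vec` target here is for illustration; a real `File` would behave the same way:

```rust
use std::io::{BufWriter, Write};
use std::thread;

fn main() {
    // BufWriter's Drop writes out any buffered bytes, so deferring its
    // drop would defer the write too. Flushing explicitly first makes
    // the side effect happen deterministically, right here.
    let mut writer = BufWriter::new(Vec::new());
    writer.write_all(b"important bytes").unwrap();
    writer.flush().unwrap();

    // After the flush, the value only holds memory, so this is safer:
    thread::spawn(move || drop(writer)).join().unwrap();
}
```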

~~~
lostmyoldone
I only know a very little rust, but since it's generally a good practice to
never defer writing (or other side effects) to an ambiguous future point in
time - with memory allocations as the only plausible exception - is there any
way in rust to make sure one doesn't accidentally move complex objects with
drop side-effects into other threads?

Granted, the way the type system works you usually know the type of a variable
quite well, but could this happen with opaque types?

I'm very much out of my depth, but it felt like one of those things that could
really bite you if you are unaware, as happened with finalizers in Java
decades ago.

~~~
masklinn
> I only know a very little rust, but since it's generally a good practice to
> never defer writing (or other side effects) to an ambiguous future point in
> time - with memory allocations as the only plausible exception - is there
> any way in rust to make sure one doesn't accidentally move complex objects
> with drop side-effects into other threads?

If you're the one creating the structure, you could opt it out of Send, that'd
make it… not sendable. So it wouldn't be able to cross thread-boundaries. For
instance Rc is !Send, you simply can not send it across a thread-boundary
(because it's a non-threadsafe reference-counting handle).

If you don't control the type, then you'd have to wrap it (newtype pattern) or
remember to manually mem::drop it. The latter would obviously have no safety
whatsoever, the former you might be able to lint for I guess, though even that
is limited or complicated (because of type inference the problematic type
might never get explicitly mentioned).
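A sketch of the first option: `Rc` is already `!Send`, and a hypothetical newtype can be made `!Send` by embedding a raw-pointer `PhantomData` (our own names):

```rust
use std::marker::PhantomData;
use std::rc::Rc;

// A raw-pointer PhantomData is !Send, which makes the whole struct !Send:
// the compiler then rejects moving it across a thread boundary.
struct ThreadBound {
    buffer: Vec<u8>,
    _not_send: PhantomData<*const ()>,
}

fn main() {
    let local = ThreadBound { buffer: vec![0u8; 16], _not_send: PhantomData };
    let rc = Rc::new(5); // Rc is !Send for the same reason

    // Both of these fail to compile if uncommented:
    // std::thread::spawn(move || drop(local));
    // std::thread::spawn(move || drop(rc));

    println!("{} bytes stay on this thread", local.buffer.len());
    drop(local);
    drop(rc);
}
```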

------
pierrebai
I've seen variations on this trick multiple times. Using threads, using a
message sent to self, using a list and a timer to do the work "later", using a
list and waiting for idle time...

They all have one thing in common: papering over a bad design.

In the particular example given, the sub-vectors probably come from a common
source. One could keep a big buffer (a single allocation) and an array of
internal pointers. For example of such a design to hold a large array of text
strings, see for example this blog entry and its associated github repo:

    
    
        https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
        https://github.com/pierrebai/FastTextContainer
    

Roughly it is this:

    
    
        struct TextHolder
        {
            const char* common_buffer;
            std::vector<const char*> internal_pointers;
        };
    
    

This is of course addressing the example, but the underlying message is
generally applicable: change your flawed design, don't hide your flaws.
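A Rust rendition of the same idea, using byte ranges into one shared buffer instead of raw pointers (the names are ours, not the linked repo's):

```rust
use std::ops::Range;

// One contiguous buffer plus offsets into it: two allocations total,
// no matter how many "strings" are held, so teardown is cheap.
struct TextHolder {
    common_buffer: String,
    spans: Vec<Range<usize>>,
}

impl TextHolder {
    fn from_lines(lines: &[&str]) -> Self {
        let mut common_buffer = String::new();
        let mut spans = Vec::with_capacity(lines.len());
        for line in lines {
            let start = common_buffer.len();
            common_buffer.push_str(line);
            spans.push(start..common_buffer.len());
        }
        TextHolder { common_buffer, spans }
    }

    fn get(&self, i: usize) -> &str {
        &self.common_buffer[self.spans[i].clone()]
    }
}

fn main() {
    let holder = TextHolder::from_lines(&["alpha", "beta", "gamma"]);
    assert_eq!(holder.get(1), "beta");
} // dropping `holder` frees two allocations, not thousands
```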

~~~
viraptor
Yes. There are also a number of pool/arena allocators in Rust which could be
used here instead, to drop all entries at once.

------
saagarjha
Why would you ever write a get_size function that drops the object you call it
on? Surely in an actual, non-contrived use case, spawning another thread and
letting the drop occur there would just be plain worse?

~~~
epage
I believe this is contrived to prove a point.

And this isn't just a help in these contrived examples. I believe process
cleanup (an extreme case of cleaning up objects) is one of the cases where
garbage collection performs better, because it doesn't have to unwind the
stack, call cleanup functions that are not in the cache, and make a lot of
`free` calls to the allocator.

I vaguely remember reading about Google killing processes rather than having
them clean up correctly, relying on the OS to properly clean up any resources
of significance.

Now this doesn't mean you should do this in all cases. Profile first, see if
you can avoid the large objects, and then look into deferred de-allocations
... if the timing of resource cleanup meets your application's guarantees.

~~~
seventh-chord
Killing a process without freeing all allocations is, as far as I can tell,
routine in C. Especially for memory, it makes no sense "freeing" allocations
when the whole memory space is getting scrapped anyway. Of course, once you
add RAII the compiler can't reason about which destructors it can skip on
program exit, and if programmers are negligent of this you get programs that
are slow to close.

~~~
estebank
> Killing a process without freeing all allocations is, as far as I can tell,
> routine in C.

Many times by accident :)

> if programmers are negligent of this you get programs that are slow to
> close.

I wouldn't call that negligence, just not fully optimized.

------
epage
For those wanting a real world example where this can be useful:

I am writing a static site generator. When run in "watch" mode, it deletes
everything and starts over (I'd like to reduce these with partial updates but
can't always do it). Moving that cleanup to a thread would make "watch" more
responsive.

~~~
elcomet
That's not really the same issue that is mentioned in the article though, is
it?

The issue from the article would be solved by just passing a reference to the
variable.

In your case, cleanup is an action that _needs_ to be done before writing new
files. So you have to wait for cleanup anyway, don't you ?

~~~
ashtonkem
That's not true.

Typically any server with a watch functionality will have a mutable reference
to the data that's being watched. When you change that data out you're both
changing the mutable reference, and also deallocating any memory that was
previously used. One _could_ separate these two steps, moving the watched data
to another variable that's dropped in another thread, if you wanted.

------
ncmncm
There is nothing unique to Rust about this; it is a very old technique. It is
usually much inferior to the "arena allocator" method, where all the discarded
allocations are coalesced and released in a single, cheap operation that could
as well be done without another thread. That method is practical in many
languages, Rust possibly included. C++ supports it in the Standard Library,
for all the standard containers.

If important work must be done in the destructors, it is still better to farm
the work out to a thread pool, rather than starting another thread. Again, C++
supports this in its Standard Library, as I think Rust does too.

One could suggest that the only reason to present the idea in Rust is the
cynical one that Rust articles get free upvotes on HN.

~~~
ShroudedNight
> C++ supports it in the Standard Library, for all the standard containers.

I don't know what the situation is today, but in the past, the GCC standard
library containers had non-trivial destructors when running in debug mode.
Ensuring their proper invocation was required to avoid dangling pointers in
their bookkeeping. Non-obvious and painful to debug.

------
jeffdavis
Speedup numbers should be given when optimizing constant factors -- e.g. "I
made this operation 5X faster using SIMD" or "By employing readahead, I sped
up this file copy by 10X".

The points raised in this article are really different:

* don't do slow stuff in your latency-critical path

* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)

* dropping can be slow

The first and second points are good, but not really related to rust,
deallocations, or the number 10000.

The last point is worth discussing, but still not really related to the number
10000 and barely related to rust. Rust encourages an eager deallocation
strategy (kind of like C), whereas many other languages would use a more
deferred strategy (like many GCs).

It seems like deferred (e.g. GC) would be better here, because after the main
object is dropped, the GC doesn't bother to traverse all of the tiny
allocations because they are all dead (unreachable by the root), and it just
discards them. But that's not the full story either.

It's not terribly common to build up zillions of allocations and then
immediately free them. What's more common is to keep the structure (and its
zillions of allocations) around for a while, perhaps making small random
modifications, and then eventually freeing them all at once. If using a GC,
while the large structure is alive, the GC needs to scan all of those objects,
causing a pause each time, which is not great. The eager strategy is also not
great: it only needs to traverse the structure once (at deallocation time),
but it needs to individually deallocate.

The answer here is to recognize that all of the objects in the structure will
be deallocated together. Use a separate region/arena/heap for the entire
structure, and wipe out that region/arena/heap when the structure gets
dropped. You don't need to traverse anything while the structure is alive, or
when it gets dropped.

In rust, probably the most common way to approximate this is by using slices
into a larger buffer rather than separate allocations. I wish there was a
little better way of doing this, though. It would be awesome if you could make
new heaps specific to an object (like a hash table), then allocate the
keys/values on that heap. When you drop the structure, the memory disappears
without traversal.
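A rough sketch of that slices-into-a-buffer approximation: a map whose keys and values are `(offset, len)` spans into one backing buffer, so teardown is a couple of large frees (the `intern`/`resolve` helpers are hypothetical names of ours):

```rust
use std::collections::HashMap;

// Append a string's bytes to one shared buffer and return (offset, len).
fn intern(arena: &mut Vec<u8>, s: &str) -> (usize, usize) {
    let start = arena.len();
    arena.extend_from_slice(s.as_bytes());
    (start, s.len())
}

// Turn a span back into a &str borrowed from the buffer.
fn resolve(arena: &[u8], span: (usize, usize)) -> &str {
    std::str::from_utf8(&arena[span.0..span.0 + span.1]).unwrap()
}

fn main() {
    let mut arena: Vec<u8> = Vec::new();
    // The map stores only spans; all key/value bytes live in the arena.
    let mut map: HashMap<(usize, usize), (usize, usize)> = HashMap::new();
    for i in 0..1000 {
        let k = intern(&mut arena, &format!("key{}", i));
        let v = intern(&mut arena, &format!("value{}", i));
        map.insert(k, v);
    }
    // Dropping `map` and `arena` is two big deallocations plus the map's
    // table, instead of thousands of per-string frees.
    println!("entries: {}, arena bytes: {}", map.len(), arena.len());
}
```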

------
heftig
If I seriously wanted to move object destruction off-thread, I would use at
least a dedicated thread with a channel, so I could make sure the dropper is
done at some point (before the program terminates, at the latest). It also
avoids starting and stopping threads constantly.

Something like this: [https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf35e0f30b0a730f41b215e5bff8270a](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf35e0f30b0a730f41b215e5bff8270a)

You could have an even more advanced version spawning tasks into something
like rayon's thread pool, I assume.
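The linked playground isn't reproduced here, but the shape of such a dedicated dropper thread might look like this (our own names, not the playground's code):

```rust
use std::any::Any;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

// A long-lived dropper thread fed through a channel: each deferred drop
// is one cheap `send` instead of a whole thread spawn.
struct Dropper {
    tx: Option<Sender<Box<dyn Any + Send>>>,
    worker: Option<JoinHandle<()>>,
}

impl Dropper {
    fn new() -> Self {
        let (tx, rx) = channel::<Box<dyn Any + Send>>();
        let worker = thread::spawn(move || {
            for garbage in rx {
                drop(garbage); // all the destructors run on this thread
            }
        });
        Dropper { tx: Some(tx), worker: Some(worker) }
    }

    fn defer<T: Send + 'static>(&self, value: T) {
        self.tx.as_ref().unwrap().send(Box::new(value)).unwrap();
    }
}

impl Drop for Dropper {
    fn drop(&mut self) {
        self.tx.take(); // closing the channel ends the worker's loop
        self.worker.take().unwrap().join().unwrap(); // drain before exit
    }
}

fn main() {
    let dropper = Dropper::new();
    let heavy: Vec<Vec<u8>> = (0..100_000).map(|_| vec![0u8; 16]).collect();
    dropper.defer(heavy); // returns immediately; the worker frees it
} // Dropper's own Drop joins, so deferred drops finish before exit
```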

~~~
ReactiveJelly
Someone is working on this as a direct response to this blog:

[https://www.reddit.com/r/rust/comments/go4xcp/new_crate_defe...](https://www.reddit.com/r/rust/comments/go4xcp/new_crate_deferdrop_defer_dropping_your_data_to_a/)

And yes, spawning a thread for every drop is horrible. It's just to prove the
concept. The defer_drop crate uses a global worker thread.

------
maxton
I'm not very familiar with Rust, but I don't understand why you wouldn't just
use a reference-to-HeavyThing as the function argument, so that the object
isn't moved and then dropped in the `get_size` function?

~~~
epage
For these contrived cases, yes, you would just pass a reference to the
function but I think the point is to simplify the case down to demonstrate a
point.

~~~
burpsnard
In the olden days, it was just out.flush(); out.close();

------
dathinab
One thing I just noticed is that the example doesn't make sure to actually run
the new thread to completion before the main thread exits.

This means that if you do a "drop in other thread" and then main exits, the
drop might never run. Which is often fine, as the exit of main causes process
termination and as such will free the memory normally anyway.

But it would be a problem on some systems where memory cleanup on process
exit is less reliable. Though such systems are more rare by now, I think.

~~~
ReactiveJelly
It would have to be a non-desktop system.

I'm pretty sure Linux will always free process-private memory, and threads,
and file descriptors when a process exits.

The only things that can leak in typical cases are some kinds of shared memory
and maybe child processes?

------
jkoudys
It'd be interesting to implement this on a type that would defer all of these
drop threads (or one big drop thread built off a bunch of futures) until the
end of some major action, like sending the http response on an actix-web
thread. Could be a great way to get the fastest possible response time, since
then the client has their response before any delay on cleanup.

~~~
AaronFriel
There is no such thing as a free lunch here, so it would reduce the unloaded
response time but should have no effect (or a negative impact) on a highly
loaded server's response time. I'm finding this out when benchmarking a
message passing/queue management system. Anything I do to defer work onto a
separate threadpool improves latency up to a point, then reduces throughput.

~~~
jkoudys
If you're bottlenecked, then certainly. There's no free lunch, but for us,
problems that can be solved by simply scaling up the resources on the host are
relatively cheap compared to expensive developer time. When we're purely
focused on sales and not anywhere close to hitting a full mem/cpu bottleneck,
this would be good.

The situation you describe sounds a lot like dealing with garbage-collection
cycles, so you give a good recommendation on something to watch out for, as
Rust performing at the level of a GC'd language removes a big reason for
choosing Rust.

------
chubot
Looks like Evan Wallace ran into the same issue in practice in esbuild

[https://news.ycombinator.com/item?id=22336284](https://news.ycombinator.com/item?id=22336284)

 _I actually originally wrote esbuild in Rust and Go, and Go was the clear
winner._

 _The parser written in Go was both faster to compile and faster to execute
than the parser in Rust. The Go version compiled something like 100x faster
than Rust and ran at something around 10% faster (I forget the exact numbers,
sorry). Based on a profile, it looked like the Go version was faster because
GC happened on another thread while Rust had to run destructors on the same
thread._

ESBuild is a really impressive performance-oriented project:

[https://github.com/evanw/esbuild](https://github.com/evanw/esbuild)

 _The Rust version also had other problems. Many places in my code had switch
statements that branched over all AST nodes and in Rust that compiles to code
which uses stack space proportional to the total stack space used by all
branches instead of just the maximum stack space used by any one branch_ :
[https://github.com/rust-lang/rust/issues/34283](https://github.com/rust-lang/rust/issues/34283)

(copy of lobste.rs comment)

------
Ididntdothis
I used to do this sometimes with C++ when I realized that clearing out a
vector with lots of objects was slow. Is Rust basically based on unique_ptr?
One problem with this approach was that you still had to wait for these
threads when the application would shut down.

~~~
saagarjha
Rust basically gives the compiler understanding of unique_ptr and prevents you
from using it after you’ve moved it.

~~~
Ididntdothis
Would you have to keep track of these threads in Rust? I have done a lot of
desktop development where you have to be aware of what happens during
shutdown. Seems a lot of server guys write their code under the assumption
that it will never shut down.

~~~
pornel
You would need to add `thread.join()` at the end of main, or have some RAII
guard that does it for you.

In practice that's probably optional, because the heap and all resources are
usually torn down with the process anyway. Important things, like saving data
or committing transactions, shouldn't be done in destructors.

------
Animats
There's a worse case in deallocation. Tracing through a data structure being
released for a long-running program can cause page faults, unused data having
been swapped out. This is part of why some programs take far too long to exit.

------
SilasX
Completely different dynamic (because no Rust GC), but this reminds me of how
Twitch made their server, written in Go, a lot faster by allocating a bunch of
dummy memory at the beginning so the garbage collector doesn't trigger nearly
as often:

[https://news.ycombinator.com/item?id=21670110](https://news.ycombinator.com/item?id=21670110)

~~~
the8472
The Java equivalent to the Go case would simply be adjusting the -Xms flag.
The Go approach is needlessly convoluted because the runtime doesn't offer
any tuning knobs.

As for the rust case, if you squint then it's similar to a concurrent
collector.

------
grogers
Contrived examples like this are ridiculous. Creating such a heavy thing is
likely even more expensive than tearing it down. So unless you create it on a
separate thread, you probably shouldn't be freeing it on a separate one. It's
not going to solve your interactivity problem. If you are creating the object
on a separate thread then it's already going to be natural to free it on a
separate one too.

~~~
ReactiveJelly
Something is better than nothing.

------
earthboundkid
Maybe some sort of “collector” could come by and clean up “garbage” memory
periodically to improve performance…

------
snicker7
I wonder if it might be possible for OS's to provide a fast, asynchronous way
of deallocating memory.

------
andreygrehov
Does anyone know how this would work in Go?

~~~
echlebek
Lots to be learned at
[https://blog.golang.org/ismmkeynote](https://blog.golang.org/ismmkeynote)

------
wmichelin
Minor typo, `froget` instead of `forget`

------
thickice
Is this applicable for Go as well ?

------
thePunisher
The obvious solution would be to borrow the HeavyThing instead of having it
dropped inside the function.

------
crimsonalucard1
I guess choosing when or how a program deallocates is important in a language
that's close to the metal.

Rust tries to be zero-cost while providing abstractions that make it seem like
a high-level language, but ultimately things like this show that it's not
exactly zero-cost, because abstractions can incur hidden penalties. There
needs to be some internal syntax that allows a Rust user to explicitly control
deallocation when needed.

If I started reading code where people would randomly move a value into
another thread and essentially do nothing I would be extremely confused. Any
language that begins to rely on "trick" or "hacks" as standard patterns
exposes a design flaw.

Maybe if rust provided special syntax that a function can be decorated with so
that it does deallocation in another thread automatically? Or maybe an
internal function called drop_async...? This would make this pattern an
explicit part of the language rather than a strange hack/trick.

------
rhacker
Pass by reference?

~~~
bszupnick
If you pass by reference the heavy object won't be dropped. If your goal is to
drop a heavy object, this is a cool way to do it.

------
cperciva
If _freeing_ the data structure in question takes this long, how much time are
you wasting _duplicating_ the data structure?

~~~
saagarjha
I’m actually very curious why it takes this long; is Rust memseting the buffer
when dropping it?

Edit: it seems like turning on optimizations seems to improve the situation
quite a bit. Not sure why they were profiling the debug build.

~~~
fpgaminer
> it seems like turning on optimizations seems to improve the situation quite
> a bit.

I'm not seeing that on my local machine? Were you comparing on the Playground
which would be quite variable in its results?

    
    
        > cargo build
           Compiling foo v0.1.0 (/private/tmp/foo)
            Finished dev [unoptimized + debuginfo] target(s) in 0.42s
        > ./target/debug/foo
        drop in another thread 52.121µs
        drop in this thread 514.687233ms
        >
        >
        > cargo build --release
           Compiling foo v0.1.0 (/private/tmp/foo)
            Finished release [optimized] target(s) in 0.47s
        > ./target/release/foo
        drop in another thread 48.418µs
        drop in this thread 548.005373ms

~~~
saagarjha
I saw an increase of about 2x on my computer, though I didn't take too much
effort to control for noise.

------
andrewfromx
Hmm, my first thought is: having to do that is a lot like C and cleaning up my
own allocations. This feels like something Rust should automatically do for
me?

~~~
ReaLNero
In C, if you forget to clean up, you have a memory leak which is hard to track
down. In Rust, if you don't do this, you're not risking memory leaks, only
sacrificing performance. A profiler can tell you when you should drop
asynchronously.

~~~
madmax96
>A profiler can tell you when you should drop asynchronously

Is there any profiler that does this today?

What are the drawbacks with asynchronous drops?

~~~
ehsanu1
See some discussion here:
[https://www.reddit.com/r/rust/comments/gntv7l/dropping_heavy...](https://www.reddit.com/r/rust/comments/gntv7l/dropping_heavy_objects_in_another_thread_can_make/)

------
dirtydroog
Oh my good god.

I'm hoping this is down to developer naivety rather than being a feature of
rust.

~~~
sockgrant
1) he should pass by reference to avoid the extra copy. So in his example yes
it’s dev naivety

2) but somewhere, somehow this object will deallocate, so his trick of putting
it on another thread would work if the deallocation takes a while. Same for
C++ if you have a massive object in a unique_ptr. So it’s not a Rust issue

~~~
renewiltord
Where's the extra copy? I don't see one. He's moving the struct into the
function, getting size and then dropping it.

------
cs702
In other words, Rust's automagical memory deallocation is NOT a zero-cost
abstraction:

    
    
      fn get_len1(things: HeavyThings) -> usize {
          things.len()
      }
    
      fn get_len2(things: HeavyThings) -> usize {
          let len = things.len();
          thread::spawn(move || drop(things));
          len
      }
    

The OP shows an example in which a function like get_len2 is 10000x faster
than a function like get_len1 for a hashmap with 1M keys.

See also this comment by chowells:
[https://news.ycombinator.com/item?id=23362925](https://news.ycombinator.com/item?id=23362925)

~~~
dathinab
No, the zero-cost refers to the abstraction (and its runtime cost), which is
still zero-cost. Deallocating is part of the normal workload, not the abstraction.

Also this isn't Rust-specific. Most (all?) RAII languages are affected, and
many GC approaches have this effect too. Some do add _additional_ abstraction
to magically (always or sometimes) put the deallocation on another thread.

But deallocating in another thread is not generally good or bad. There are a
lot of use-cases where doing so is rather bad or can't be done (e.g. when TLS
is involved). Rust and other similar RAII languages at least let you decide
what you want to do.

Now it's (I think) generally known that certain kinds (not all) of GC do make
some things simpler for GUI-like usage. Though they also tend to offer less
control.

Note that it's a common pattern for small, short-lived, user-facing CLI tools
(which are not GC'ed) to leak resources instead of cleaning them up properly.
You can do so in Rust too if you want, but it's a potential problem for
longer-running applications.

Also, here is a `get_len` that is faster than both and also more idiomatic
Rust:

    
    
      fn get_len1(things: &HeavyThings) -> usize {
          things.len()
      }
    

If you have a certain thread (e.g. UI thread) in which you never want to do
any cleanup work you can consider using a container like:

    
    
      struct DropElsewhere<T: Send + 'static>(pub Option<T>);
    
      impl<T: Send + 'static> Drop for DropElsewhere<T> {
          fn drop(&mut self) {
              if let Some(value) = self.0.take() {
                  thread::spawn(move || drop(value));
              }
          }
      }
    

You can optimize this with `ManuallyDrop` to get close to zero runtime
overhead (it removes the `take` and `if let` part).
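
A sketch of that `ManuallyDrop` variant (same wrapper idea as above; `std::mem::ManuallyDrop::take` is unsafe, so we must guarantee the value is never touched again):

```rust
use std::mem::ManuallyDrop;
use std::thread;

struct DropElsewhere<T: Send + 'static>(ManuallyDrop<T>);

impl<T: Send + 'static> DropElsewhere<T> {
    fn new(value: T) -> Self {
        DropElsewhere(ManuallyDrop::new(value))
    }
}

impl<T: Send + 'static> Drop for DropElsewhere<T> {
    fn drop(&mut self) {
        // SAFETY: we are inside `drop`, so `self.0` is never accessed again
        // after this point; taking the value out exactly once is sound.
        let value = unsafe { ManuallyDrop::take(&mut self.0) };
        thread::spawn(move || drop(value));
    }
}

fn main() {
    let heavy = vec![vec![0u8; 1024]; 10_000];
    let wrapped = DropElsewhere::new(heavy);
    drop(wrapped); // deallocation happens on a background thread
}
```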

~~~
cs702
> No the zero-cost refers to the abstraction (and runtime cost), which still
> is zero-cost. Deallocating is part of the normal work load not the
> abstraction.

Yeah, you're right. In hindsight my comment was poorly thought-out and poorly
written.

------
floppy123
Why should I ever need to drop a heavy object just to get its size? Not in
C++ and also not in Rust; the different-thread idea is just creative
stupidity, sorry

------
staticfloat
It seems that this would be a great reason to not pass the entire heavy object
through your function, and to instead pass it as a reference. When passing an
object (rather than a reference to an object) there's a lot more work going on
both in function setup, and in object dropping. I'm not a rust guru, so I
don't know the precise wording, but it's simple enough to realize that if this
function, as claimed, must drop all the sub-objects within the `HeavyObject`
type, then those objects must have been copied from the original object.

If you instead define the function to take in a reference (by adding just two
`&` characters into your program), the single-threaded case is now almost 100x
faster than the multithreaded case.

Here's a link to a Rust Playground with just those two characters changed:
[https://play.rust-lang.org/?version=stable&mode=debug&editio...](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9bbe169c77cd31b6cd25b74c24635e12)

Note that the code that drops the data in a separate thread is not timing the
amount of time your CPU is spinning, dropping the data. So while this does
decrease the latency of the original thread, the best solution is to avoid
copying and then freeing large, complex objects as much as possible. While it
is of course necessary to do this sometimes, this particular example is just
not one of them. :)

As an aside, I'm somewhat surprised that the Rust compiler isn't inlining and
eliminating all the copying and dropping; this would seem to be a classic case
where compiler analysis should be able to determine that `a.size()` should be
computable without copying `a`, and it should be able to eliminate the
function call cost as well. Manually doing this gives the exact same timing as
my gist above, so I assume that this is happening when passing a reference,
but not happening when passing the object itself.

~~~
heftig
As already mentioned, Rust wasn't copying anything; the `HashMap` is not a
`Copy`-able type, so it was just moved around (it's also not very large: all
its items are behind a pointer to the heap).
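
One way to convince yourself of this: the address of an entry inside the map is unchanged after the map is moved by value, so the heap table was not duplicated. A small sketch (hypothetical function name):

```rust
use std::collections::HashMap;

fn take_ownership(m: HashMap<u64, String>) -> (HashMap<u64, String>, *const String) {
    // Record the address of an entry after the move into this function.
    let p: *const String = &m[&0u64];
    (m, p)
}

fn main() {
    let mut m = HashMap::new();
    m.insert(0u64, String::from("hello"));
    let before: *const String = &m[&0u64];

    let (m2, after) = take_ownership(m); // `m` is moved, not copied

    // The heap-allocated table did not move, so the entry's address is stable.
    assert_eq!(before, after);
    assert_eq!(m2.len(), 1);
}
```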

All you did was move the drop from the `fn_that_drops_heavy_things` to the end
of `main`, where it is outside the timing function.

