
Facebook's std::vector optimization - iamsalman
https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md
======
jlebar
If you're interested in these sorts of micro-optimizations, you may find
Mozilla's nsTArray (essentially std::vector) interesting.

One of its unusual design decisions is that the array's length and capacity
are stored next to the array elements themselves. This means that nsTArray
stores just one pointer, which makes for more compact DOM objects and so on.

To make this work requires some cooperation with Firefox's allocator
(jemalloc, the same one that FB uses, although afaik FB uses a newer version).
In particular, it would be a bummer if nsTArray decided to allocate space for
e.g. 4kb worth of elements and then tacked on a header of size 8 bytes,
because then we'd end up allocating 8kb from the OS (two pages) and wasting
most of that second page. So nsTArray works with the allocator to figure out
the right number of elements to allocate without wasting too much space.

We don't want to allocate a new header for zero-length arrays. The natural
thing to do would be to set nsTArray's pointer to NULL when it's empty, but
then you'd have to incur a branch on every access to the array's
size/capacity.

So instead, empty nsTArrays are pointers to a globally-shared "empty header"
that describes an array with capacity and length 0.
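A minimal sketch of that layout (hypothetical names, not Mozilla's actual code): the header sits immediately before the elements, and all empty arrays point at one shared static header, so size() and capacity() are branch-free:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Length and capacity live in a header directly in front of the elements,
// so the array object itself is a single pointer.
struct Header {
    uint32_t length;
    uint32_t capacity;
};

// One globally shared header describing an empty array; every empty array
// points here, so size()/capacity() never need a null check.
static Header sEmptyHeader = {0, 0};

class IntArray {
public:
    IntArray() : mHdr(&sEmptyHeader) {}
    ~IntArray() { if (mHdr != &sEmptyHeader) free(mHdr); }

    uint32_t size() const { return mHdr->length; }       // no branch
    uint32_t capacity() const { return mHdr->capacity; } // no branch

    void push_back(int value) {
        if (mHdr->length == mHdr->capacity) grow();
        elements()[mHdr->length++] = value;
    }
    int& operator[](uint32_t i) { return elements()[i]; }

private:
    int* elements() { return reinterpret_cast<int*>(mHdr + 1); }

    void grow() {
        uint32_t newCap = mHdr->capacity ? mHdr->capacity * 2 : 4;
        Header* h = static_cast<Header*>(
            malloc(sizeof(Header) + newCap * sizeof(int)));
        h->length = mHdr->length;
        h->capacity = newCap;
        for (uint32_t i = 0; i < h->length; ++i)
            reinterpret_cast<int*>(h + 1)[i] = elements()[i];
        if (mHdr != &sEmptyHeader) free(mHdr);
        mHdr = h;
    }

    Header* mHdr; // sizeof(IntArray) == sizeof(void*)
};
```

Copying and assignment are omitted for brevity; the point is only the single-pointer footprint and the shared empty header.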

Mozilla also has a class with some inline storage, like folly's fixed_array.
What's interesting about Mozilla's version, called nsAutoTArray, is that it
shares a structure with nsTArray, so you can cast it to a const nsTArray*.
This lets you write a function which will take a const nsTArray& or const
nsAutoTArray& without templates.

Anyway, I won't pretend that the code is pretty, but there's a bunch of good
stuff in there if you're willing to dig.

[http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsT...](http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsTArray.h)

~~~
nly
> One of its unusual design decisions is that the array's length and capacity
> are stored next to the array elements themselves.

GNU libstdc++ does this for std::string so you get prettier output in the
debugger. The object itself contains only a char*.

~~~
byuu
Seems like that would prevent small string optimization (a union of a small
char array and the heap char pointer.) That lets me store about 85% of the
strings in my assembler without any heap allocation, and is a huge win in my
book.
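A rough sketch of the union byuu describes (hypothetical, not his actual code; the 15-byte inline buffer is an assumption):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Small string optimization: short strings live in an inline buffer inside
// the object itself; only long ones touch the heap.
class SmallString {
    static const size_t kInline = 15; // inline capacity, excluding the NUL
    union {
        char inline_[kInline + 1];
        char* heap_;
    };
    size_t len_;
    bool onHeap_;

public:
    SmallString(const char* s) : len_(strlen(s)), onHeap_(len_ > kInline) {
        if (onHeap_) {
            heap_ = static_cast<char*>(malloc(len_ + 1));
            memcpy(heap_, s, len_ + 1);
        } else {
            memcpy(inline_, s, len_ + 1); // no heap allocation at all
        }
    }
    ~SmallString() { if (onHeap_) free(heap_); }

    const char* c_str() const { return onHeap_ ? heap_ : inline_; }
    size_t size() const { return len_; }
    bool usesHeap() const { return onHeap_; }
};
```

If most strings in a workload are short (as in the assembler example above), the heap allocation simply never happens.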

~~~
nly
Indeed, they don't do SSO. They use reference-counted COW, which violates the
C++11 standard.

------
userbinator
_When the request for growth comes about, the vector (assuming no in-place
resizing, see the appropriate section in this document) will allocate a chunk
next to its current chunk_

This is assuming a "next-fit" allocator, which is not always the case. I think
this is why the expansion factor of 2 was chosen - because it's an integer,
and doesn't assume any behaviour of the underlying allocator.

I'm mostly a C/Asm programmer, and dynamic allocation is one of the things
that I very much avoid if I don't have to - I prefer constant-space
algorithms. If it means a scan of the data first to find out the right size
before allocating, then I'll do that - modern CPUs are _very_ fast "going in a
straight line", and realloc costs add up quickly.
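The two-pass, constant-space approach can be sketched like this (count_evens and copy_evens are made-up helpers for the sketch): scan once to learn the exact size, allocate once, fill once; no realloc ever runs:

```cpp
#include <cassert>
#include <cstddef>

// Pass 1: scan the data to learn exactly how big the output must be.
static size_t count_evens(const int* data, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (data[i] % 2 == 0) ++count;
    return count;
}

// Pass 2: fill a buffer that was sized exactly right, so it never grows.
static void copy_evens(const int* data, size_t n, int* out) {
    for (size_t i = 0; i < n; ++i)
        if (data[i] % 2 == 0) *out++ = data[i];
}
```

Both passes are straight-line sequential scans, which is exactly the access pattern modern CPUs and prefetchers are fastest at.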

Another thing that I've done, which I'm not entirely sure would be possible in
"pure C++", is to adjust the pointers pointing to the object if reallocation
moves it (basically, add the difference between the old and new pointers to
each reference to the object); in theory I believe this involves UB - so it
might not be "100% standard C" either, but in practice, this works quite well.

~~~
kabdib
I solved a "realloc is really costly" problem by ditching the memory-is-
contiguous notion, paying a little more (really just a few cycles) on each
access rather than spending tons of time shuffling stuff around in memory.
This eliminated nearly all reallocations. The extra bit of computation was
invisible in the face of cache misses.
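A hedged sketch of that chunked layout (the names are made up): storage is a list of fixed-size blocks, so growth appends a block and existing elements never move; each access pays one divide and one modulo:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-contiguous "array": a vector of fixed-size blocks. Growth only ever
// appends a new block, so no element is ever copied or moved.
template <typename T, size_t BlockSize = 1024>
class ChunkedArray {
    std::vector<std::vector<T>> blocks_;
    size_t size_ = 0;

public:
    void push_back(const T& v) {
        if (size_ % BlockSize == 0) {
            blocks_.emplace_back();
            blocks_.back().reserve(BlockSize); // block never reallocates
        }
        blocks_.back().push_back(v);
        ++size_;
    }
    T& operator[](size_t i) {
        // one divide and one modulo instead of a single pointer add
        return blocks_[i / BlockSize][i % BlockSize];
    }
    size_t size() const { return size_; }
};
```

With a power-of-two BlockSize the divide and modulo compile down to a shift and a mask, which is the "few cycles" cost mentioned above.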

I'm guessing that most customers of std::vector don't really need contiguous
memory, they just need something that has fast linear access time. In this
sense, std::vector is a poor design.

~~~
colomon
No, that's quite intentional. If you don't need contiguous memory, you can
consider using std::deque.
[http://www.cplusplus.com/reference/deque/deque/](http://www.cplusplus.com/reference/deque/deque/)

~~~
kabdib
There are weasel words in the standard that let implementations still be
pretty inefficient. The problem is that memory reallocation is a leaky
abstraction, and an interface that doesn't make guarantees about memory
behavior can't be relied upon at scale.

The implementation of std::deque I just read uses a circular queue, resized to
a power of two upon growth. It would exhibit the same bad performance as
std::vector.

~~~
yoklov
Are you sure? You're not allowed to invalidate iterators or pointers to
elements in a deque, so it shouldn't be reallocating memory (aside from its
underlying map of nodes, which will need to grow very rarely).

Libstdc++ describes how it's implemented here: [https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-...](https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_deque.h#L650-L733)

Libc++ doesn't have as descriptive a comment, but it's implemented basically
the same way here: [https://github.com/llvm-mirror/libcxx/blob/master/include/de...](https://github.com/llvm-mirror/libcxx/blob/master/include/deque)

~~~
nly
You might not want to use deque because of how much memory it can use while
still small, e.g. libc++'s implementation uses a 4KiB page:

[http://rextester.com/VIB96468](http://rextester.com/VIB96468)

In the GNU libstdc++ implementation the object itself is pretty large (and
they use a page size of 512 bytes):

[http://rextester.com/RHYKB83240](http://rextester.com/RHYKB83240)

------
CJefferson
I'm glad to see this catch on and the C level primitives get greater use.

This has been a well-known problem in the C++ community for years; in
particular, Howard Hinnant put a lot of work into it. I believe the
fundamental problem has always been that C++ implementations rely on the
underlying C implementations of malloc and friends, and the C standards
committee could not be persuaded to add the necessary primitives.

A few years ago I tried to get a realloc which did not move (instead
returning failure) into glibc and jemalloc, and failed. Glad to see someone
else has succeeded.

------
shin_lao
I think the Folly small vector library is much more interesting and can yield
better performance (if you hit the sweet spot).

[https://github.com/facebook/folly/blob/master/folly/docs/sma...](https://github.com/facebook/folly/blob/master/folly/docs/small_vector.md)

From what I understand, using an "rvalue-reference ready" vector
implementation with a good memory allocator should perform at least as well
as FBVector.

------
jeorgun
Apparently the libstdc++ people aren't entirely convinced by the growth factor
claims:

[https://gcc.gnu.org/ml/libstdc++/2013-03/msg00059.html](https://gcc.gnu.org/ml/libstdc++/2013-03/msg00059.html)

------
cliff_r
The bit about special 'fast' handling of relocatable types should be obviated
by r-value references and move constructors in C++11/14, right?

I.e. if we want fast push_back() behavior, we can use a compiler that knows to
construct the element directly inside the vector's backing store rather than
creating a temporary object and copying it into the vector.

~~~
marksamman
emplace_back was added in C++11 which does just that:
[http://en.cppreference.com/w/cpp/container/vector/emplace_ba...](http://en.cppreference.com/w/cpp/container/vector/emplace_back)
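For instance, emplace_back forwards its arguments straight to the element's constructor, so the object is built in the vector's storage rather than as a temporary (Point and make_points are illustrative names):

```cpp
#include <cassert>
#include <vector>

struct Point {
    int x, y;
    Point(int x_, int y_) : x(x_), y(y_) {}
};

static std::vector<Point> make_points() {
    std::vector<Point> pts;
    pts.reserve(2);              // avoid a reallocation mid-demo
    pts.push_back(Point(1, 2));  // constructs a temporary, then copies/moves it
    pts.emplace_back(3, 4);      // constructs in place from the arguments
    return pts;
}
```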

------
darkpore
You can get around a lot of these issues by reserving the size needed up
front, or using a custom allocator with std::vector. Not as easy, but still
doable.
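As a quick sketch of the reserve() approach: a single call up front means one allocation and no growth-triggered element copies for the rest of the fill:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// reserve() up front: the vector performs one allocation, and no element is
// ever copied because of growth.
static std::vector<int> fill_reserved(size_t n) {
    std::vector<int> v;
    v.reserve(n); // one allocation; capacity() >= n from here on
    for (size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i)); // never triggers a reallocation
    return v;
}
```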

The reallocation issue isn't fixable this way however...

~~~
thrownaway2424
You can use a feedback directed optimization pass to choose the initial size.

------
ajasmin
TL;DR: The authors of malloc and std::vector never talked to each other. We
fixed that!

... also, most types are memmovable

------
pbw
Are there benchmarks, speedup? Seems strange to leave out that information or
did I just miss it?

~~~
shadytrees
Yes!
[https://www.google.com/search?q=folly+facebook+benchmarks](https://www.google.com/search?q=folly+facebook+benchmarks)

~~~
pbw
I don't see the results. Like a graph that shows std::vector vs. folly. I mean
isn't that the entire point?

------
14113
Is it normal to notate powers using double caret notation? (i.e. ^^) I've only
ever seen it using a single caret (^), in what presumably is meant to
correspond to an ascii version of Knuth up arrow notation
([https://en.wikipedia.org/wiki/Knuth's_up-
arrow_notation](https://en.wikipedia.org/wiki/Knuth's_up-arrow_notation)). I
found it a bit strange, and confusing in the article having powers denoted
using ^^, and had to go back to make sure I wasn't missing anything.

~~~
Sidnicious
It's likely to distinguish it from XOR, which is the caret operator in many
programming languages.

~~~
kevin_thibedeau
It is still weird. ** is the other established convention used for
exponentiation in languages like Python, Ada, m4, and others.

~~~
merraksh
But ** has an entirely different meaning in C/C++.

~~~
taejo
Only as a prefix operator; * has a pointer-related meaning as a prefix and an
arithmetic meaning as an infix, so there's no reason ** couldn't be the same.

~~~
phs2501
That'd be a syntactic ambiguity (in C-type languages): "a ** b" could be a to
the power of b (i.e. "(a) ** (b)") or a times the dereference of the pointer
b (i.e. "(a) * (*b)"). You might be able to disambiguate by precedence,
though it would be very ugly in the lexer (you could never have a STARSTAR
token; it would have to be handled in the grammar) and would be terribly
confusing.

------
xroche
Yep, this is my biggest issue with C++: you now have lambda functions and an
insane template spec, but you just cannot "realloc" a new[] array. Guys,
seriously?

~~~
bnegreve
If you need to realloc a fixed-size array, shouldn't you use a std::vector
instead?

~~~
marksamman
You probably should, but the problem is still there because std::vector
implementations don't use realloc. They call new[] with the new size, copy
over the data, and delete[] the old chunk. This rules out growing the vector
in place.
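A sketch of what realloc-based growth could look like for trivially copyable element types (grow_buffer is a hypothetical helper, not a std::vector extension): realloc may extend the block in place, skipping the copy entirely.

```cpp
#include <cstdlib>
#include <type_traits>

// Grow a heap buffer with realloc instead of new[]/copy/delete[]. This is
// only legal for trivially copyable types, because realloc moves raw bytes
// without running any constructors or destructors.
template <typename T>
T* grow_buffer(T* buf, size_t new_count) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "realloc only moves raw bytes");
    return static_cast<T*>(realloc(buf, new_count * sizeof(T)));
}
```

Note that realloc(nullptr, n) behaves like malloc(n), so the same helper covers the initial allocation.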

~~~
bnegreve
It's the same with realloc: there is no guarantee that it will grow the chunk
in place.

~~~
xroche
No. Modern realloc implementations are efficient when moving large memory
blocks, because they rely on the kernel's ability to quickly relocate memory
regions without a memcpy() (via mremap() on Linux).

Edit: shamelessly citing my blog entry on this subject:
[http://blog.httrack.com/blog/2014/04/05/a-story-of-realloc-a...](http://blog.httrack.com/blog/2014/04/05/a-story-of-realloc-and-laziness/)

~~~
davidtgoldblatt
This isn't true for either of the common high performance mallocs. tcmalloc
doesn't support mremap at all, and jemalloc disables it in its recommended
build configuration. glibc will remap, but any performance gains that get
realized from realloc tricks would probably be small compared to the benefits
of switching to a better malloc implementation.

------
general_failure
In Qt, you can mark types as Q_MOVABLE_TYPE and get optimizations from a lot
of containers

------
thomasahle
The factor-2 discussion is quite interesting. What if we could make each new
allocation always fit exactly in the space left over by the previously freed
chunks?

Solving the equations suggests a Fibonacci-like sequence, seeded by something
like "2, 3, 4, 5" and continuing 9, 14, 23, etc.
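The sequence can be computed directly (a small sketch): each new capacity equals the sum of all capacities up to two steps back, i.e. the total space freed so far, and after the seed this reduces to the Fibonacci recurrence a[n] = a[n-1] + a[n-2].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Capacities where each new chunk fits exactly in the sum of all previously
// freed chunks. After the seed, a[n] = a[n-1] + a[n-2] (Fibonacci-like).
static std::vector<long> growth_sequence(size_t count) {
    std::vector<long> a = {2, 3, 4, 5}; // seed from the comment above
    while (a.size() < count) {
        size_t n = a.size();
        a.push_back(a[n - 1] + a[n - 2]);
    }
    return a;
}
```

Checking the claim: the fifth term is 9 = 2 + 3 + 4, the sum of the first three (freed) chunks, and also 4 + 5 by the recurrence.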

~~~
judk
The golden ratio, then rounding down. What's the point of putting 4 into the
seed sequence?

~~~
thomasahle
Without _4_ , the sequence would be _2, 3, 5_. Then the next value would be
_8_ by Fibonacci. But that's bigger than the _5_ we get from adding up all
the unallocated pieces ( _2_ and _3_ ).

We could use just _2, 3, 4_ as a seed, but we can't use the Fibonacci formula
before the fourth element is added. Try some different seeds for yourself;
it's trickier than you'd think.

------
malkia
For big vectors, if there is an obvious way to do so, I always hint the
vector with reserve() - for example, when I know in advance roughly how much
will be copied, even if a bit less ends up being copied (or even a bit more,
at the cost of a reallocation :().

------
ck2
_Then the teleporting chief would have to shoot the original_

As an aside, there was a great Star Trek novel where there was a long range
transporter invented that accidentally cloned people.

(I think it was "Spock Must Die")

~~~
dalke
There's also the ST:TNG episode "Second Chances".
[http://en.wikipedia.org/wiki/Second_Chances_%28Star_Trek:_Th...](http://en.wikipedia.org/wiki/Second_Chances_%28Star_Trek:_The_Next_Generation%29)
where Riker is duplicated. (There's also the good/bad Kirk in 'The Enemy
Within', and the whole mirror universe concept, but those aren't duplicates.)

------
jheriko
i used to be a big fan of this sort of stuff, but the better solution for many
of the problems described is to avoid array resizing.

if std::vector is your bottleneck you have bigger problems i suspect.

reminds me a bit of eastl as well... which is much more comprehensive:
[http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2007/n227...](http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html)

------
johnwbyrd
If you're spending a lot of time changing the size of a std::vector array,
then maybe std::vector isn't the right type of structure to begin with...

------
judk
How is it reasonable to expect that previously freed memory would be available
later for the vector to move to?

------
chickenandrice
Greetings Facebook, several decades ago welcomes you. Game programmers have
figured out the same, and arguably better, ways of doing this for as long as
std::vector has existed. This is but one small reason most of us have had
in-house STL replacements for decades now.

Most of the time if performance and allocation is so critical, you're better
off not using a vector anyway. A fixed sized array is much more cache
friendly, makes pooling quite easy, and eliminates other performance costs
that suffer from std::vector's implementation.

More to the point, who would use a c++ library from Facebook? Hopefully don't
need to explain the reasons here.

~~~
dbaupp
_> More to the point, who would use a c++ library from Facebook? Hopefully
don't need to explain the reasons here._

Could you explain them for those of us not in the loop? Does Facebook have a
bad reputation for C++?

~~~
DonPellegrino
I would also like explanations, because Facebook actually has a good reputation
when it comes to their compiled languages engineers. Their C++ and D engineers
built HHVM/Hack, their OCaml engineers built some great analysis tools and
much of the supporting code for the HHVM/Hack platform, etc., the list goes
on, so I'd like to know why someone would want to avoid their C++ library
based on the "Facebook" name only.

~~~
chickenandrice
Because Facebook also has a reputation of not playing nice with people, the
rules, intellectual property, and so on. This is hardly a company anyone
should support or trust and if you can't figure that out, I can't help you.

As far as their work on HHVM goes, it was only necessary due to bad
technology choices made from the start. There's very little interesting about
this
work unless you somehow love PHP, want to make debugging your production
applications more difficult, and refuse to address your real problems. I am
100% sure no one outside of the PHP community cares about anything Facebook
has done in C++.

Simply having a large company with lots of developers who might have even had
good reputations elsewhere or even be smart doesn't mean much. Having worked
in many places with lots of smart developers, I can tell you stories about too
many geniuses in the room. Calling Facebook developers engineers is also about
as apt as calling janitors sanitation engineers. We're programmers, or
developers, or perhaps software architects at best depending on the position.
I happen to have an EE and CS degree but given I do programming for a living,
I'd hardly call myself an engineer. But we're way off topic :)

~~~
otterley
> As far as their work on HHVM, it was necessary due to failure by bad
> technology choices from the start.

HHVM arose out of Facebook's desire to save on server purchasing and operating
costs. Facebook could run perfectly well without it on Plain Old PHP, but
they'd have to buy and power more servers.

I'd hardly call PHP a "bad technology choice" given the outstanding financial
success of many companies that use it.

------
boomshoobop
Isn't Facebook itself an STD vector?

~~~
general_failure
Funny :)

------
johnwbyrd
Show me a programmer who is trying to reoptimize the STL, and I'll show you a
programmer who is about to be laid off.

The guy who tried this at EA didn't last long there.

~~~
xroche
The STL is not optimized at all in this case; that is precisely the point.
And like it or not, Facebook has talented engineers to do that.

~~~
richardwhiuk
The STL isn't a library that can be optimized - it's an interface definition
with expected complexity requirements. By its nature (i.e. not tied to a
platform) it doesn't have specific benchmark numbers. Specific
implementations (e.g. MSVC's, GCC's libstdc++, Clang's libc++) can be, and
are.

------
kenperkins
> ... Rocket surgeon

That's a new one. Usually it's rocket scientist or brain surgeon. What exactly
does a rocket surgeon do? :)

~~~
krallja
[http://www.sensible.com/rsme.html](http://www.sensible.com/rsme.html)

[http://tvtropes.org/pmwiki/pmwiki.php/Main/ThisAintRocketSur...](http://tvtropes.org/pmwiki/pmwiki.php/Main/ThisAintRocketSurgery)

[http://www.urbandictionary.com/define.php?term=rocket%20surg...](http://www.urbandictionary.com/define.php?term=rocket%20surgery)

