
PHP 7's new hashtable implementation - xrstf
http://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html
======
skrebbel
This is probably not a popular opinion, but I believe that PHP's associative
array is one of the best-designed data structures in programming languages.

Its main distinguishing property, as mentioned in this article, is that values
can be indexed by key, but are still iterated in the order they were set. This
is "do what I want" in so many cases that it's just nuts.

Sure, just as often it's needless overhead, but as a programmer who prefers
to reason about the domain rather than performance, I often don't care about
that. I hate that many other languages, including C#, Ruby and Python, force
me to choose between either an unordered map or a list of _(key, value)_
tuples. EDIT: clearly, I'm behind the times with that remark. Thanks,
commenters :-)

I wish more languages had a native data type like this. It scares me that _in
practice_ JS objects have the same property, but officially the iteration
order is not specified.

(That said, PHP's choice to mix regular arrays and associative arrays into a
single type strikes me as a bit odd. I've also never seen a good use case for
arrays with mixed string/int keys.)

~~~
halflings
> I hate that many other languages, including C#, Ruby and Python, force me to
> choose between either an unordered map or a list of (key, value) tuples

Another commenter mentioned Python's OrderedDict, and Ruby's hashtables are
ordered (from 1.9+): [https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/](https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/)
It looks like C# also has an OrderedDictionary class:
[http://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary.aspx](http://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary.aspx)

I've never had to use any of these classes, even though I do use hashmaps
_all the time_, so IMHO the Python way is best (default to a 'normal'
unordered hashmap, offer an ordered alternative).

~~~
breuleux
I think there is one reason to default to something ordered, which is that
iterating through unordered hashmaps can be non-deterministic from one run to
the next. For instance, if you hash by id, the same objects may be assigned
different ids from one run to the next, meaning that the map will be iterated
in a different order. If there is a bug that's sensitive to iteration order,
it will not be reproducible, and that can be very annoying to debug.

~~~
aidos
If you've got bugs due to iteration order, you have a whole other issue. If
you know that iteration order is important, it should be _explicit_ in the way
you prepare/store/handle the data. If your code only works when you iterate in
a specific order _and you didn't deliberately define that order_, it's only
really working by accident.

Don't get me wrong, it's actually a bug I've run into in the past, and you're
right, the non-deterministic issue makes it harder to debug. But when I
discovered it I blamed myself for not being explicit with my constraints.

------
adunn
This is great news. PHP doesn't have many structured data types, so arrays
(aka maps) are basically used for everything. Any improvement to them will
impact the entire application.

It would be nice to have separate types for arrays and maps though. I don't
understand why they were combined to begin with. Simplicity? Seems like there
are more edge cases and gotchas the way things are now.

~~~
_RPM
> It would be nice to have separate types for arrays and hash tables though. I
> don't understand why they were combined to begin with. Simplicity? Seems
> like there are more edge cases and gotchas the way things are now.

There is no "hash table" type in PHP userland.

~~~
imaginenore
There are no arrays in PHP.

There are only hash tables that are called "array" for simplicity.

~~~
colordrops
How do they maintain their order?

~~~
_RPM
By using a doubly linked list: each element's "bucket" contains a pointer to
the next element in order of insertion. Read zend_hash.h & zend_hash.c; it is
fairly complicated, and explaining it in depth is beyond the scope of this
comment.

This "bucket" also handles collisions, by separate chaining. There are
actually two "next" pointers: one for the collision chain, and one for the
next element in order of insertion. Very confusing, and it requires reading
through the code and playing with it.

------
allendoerfer
It is both nice and concerning that a ubiquitous element of a ubiquitous
language has that much potential for performance optimizations after 19 years
of development. The optimizations were not even complicated hacks for edge
cases, just a simpler implementation overall.

But then again, PHP itself, being stateless between requests, is already
quite fast; nice to see even more performance getting squeezed out. Imagine
the decrease in global energy consumption due to this change. :D

~~~
username223
> a ubiquitous language has that much potential for performance optimizations
> after 19 years of development.

"Code in haste, repent at leisure." PHP was designed with some unholy amalgam
of an array and a hash table as its only data structure, so there's plenty of
room for repentance.

It makes me feel old to remember learning about (singly-) linked lists,
arrays, and hash tables, plus some other nice things, in a freshman course
called "Introduction to Algorithms and Data Structures." Each had its
advantages and disadvantages, and I quickly learned which to choose in which
situation. Does this course still exist, or is the modern equivalent
"Algorithms and HashArrayLists?"

~~~
allendoerfer
You are right. That PHP actually runs is a small wonder. They have actually
rewritten the engine every few years.

When I was in college and attended said lecture, I was quite surprised to
learn what an array actually is. But on the other hand, if it had not been
for PHP and its small step up from HTML, I imagine many young programmers like
me would know neither the false nor the real array.

------
halflings
Make sure you don't miss this part:

> PHP uses hashtables for all arrays. However in the rather common case of
> continuous, integer-indexed arrays (i.e. real arrays) the whole hashing
> thing doesn’t make much sense. This is why PHP 7 introduces the concept of
> “packed hashtables”.

> [...] We keep these useless values around so that buckets always have the
> same structure, independently of whether or not packing is used. This means
> that iteration can always use the same code. However we might switch to a
> “fully packed” structure in the future, where a pure zval array is used if
> possible.

It's nice that they're starting to acknowledge that "real" arrays are
unnecessarily mixed with hashtables, which comes with pretty significant
overhead. Let's hope they'll soon add that separate type for arrays (or
"fully packed hashtables", if they prefer :)).

~~~
scotty79
> Let's hope they'll soon add that different separate type for arrays (or
> "fully packed hashtables" if they prefer :)).

Could be good, as long as the compiler can infer whether or not to use it,
without additional clutter in the code.

------
mmaunder
Awesome, but I still don't think it's enough. In benchmarks we did, the
memory usage of PHP's array() was horrific. Sorry I don't have actual numbers
to post, but we ended up using pack() and unpack() to store stuff that should
have been in an array because it would grow to 100's of megs using PHP's
array() and using a binary structure it stays under 10 megs. I just don't
think a 2.5x improvement is going to come close to being as efficient as it
could and should be.

~~~
porker
> we ended up using pack() and unpack() to store stuff that should have been
> in an array because it would grow to 100's of megs using PHP's array() and
> using a binary structure it stays under 10 megs.

How many items were you storing / how big was the data in each item?

~~~
porker
To whoever downvoted this comment, please say why.

------
gopalv
This is neat. Looking through it, it looks like it makes regular numeric
arrays faster as well, via the flags.

I wonder if the ->pDataPtr vs ->pData confusion has been resolved.

I'm probably a few years behind, but a lot of my confusion working with hashes
has been that pair of void* pointers.

~~~
nikic
Yes, pData and pDataPtr are no more. They were always pretty pointless -
both could have been dropped, even while retaining the rest of the previous
implementation, by using a struct-hack layout. In the new implementation they
aren't needed, because it's specialized to zvals (our 99%-or-so use case).

~~~
gopalv
Nice work. I read deeper into the changeset and I like what I see.

The ->pDataPtr was the one thing behind 90% of the bugs I caused with exts
(obviously stuff like the frozen_array hashtable handling was a completely
odd-ball case).

I read deeper and found that you also fixed the "void * *" in
zend_hash_find(), which is another pain point in the old API - you cannot rely
on the compiler type-checking at all.

I no longer work with PHP, but avenge me for the hair I've lost over the
IS_REF madness (copy_ctor vs separate_zval) :)

------
bcheung
You mean they are not using strlen as the hash function anymore?
[http://news.php.net/php.internals/70691](http://news.php.net/php.internals/70691)

~~~
overgard
Wow, I constantly think that nothing about the horribleness of PHP can
surprise me, and then Rasmus says something even more insane than his already
insane statements. I mean choosing function names based on length because you
didn't bother to write an actual hash function? AMAZING.

Anyway I'm just going to leave this here because it's great fun:
[http://en.wikiquote.org/wiki/Rasmus_Lerdorf](http://en.wikiquote.org/wiki/Rasmus_Lerdorf)

Some favorites:

"There are people who actually like programming. I don't understand why they
like programming."

"I'm not a real programmer. I throw together things until it works then I move
on. The real programmers will say "Yeah it works but you're leaking memory
everywhere. Perhaps we should fix that." I'll just restart Apache every 10
requests."

Ladies and gentlemen, the author of pretty much the most popular web
programming language out there!

~~~
gburt
To be fair, his "get things done" attitude is a big part of why PHP is so
popular. It avoids all pedantry and nerdiness (even when it absolutely
shouldn't), erring entirely on the side of "let's just get something working."

That might not be the right attitude for all programming languages, but it is
working for PHP in at least some sense.

~~~
overgard
I will admit that's a fair point. The one virtue PHP has is that it's pretty
direct in what it does.

------
Bahamut
What happened to PHP 6?

~~~
brvs
It's out being almost as successful as Perl6 and Python3 :-)

e: Actually wasn't there a blog post posted to HN suggesting Perl skip to 7
too?

~~~
esaym
Indeed:

[http://blogs.perl.org/users/ovid/2013/02/perl-7.html](http://blogs.perl.org/users/ovid/2013/02/perl-7.html)

[http://www.slideshare.net/andy.sh/perl7-light](http://www.slideshare.net/andy.sh/perl7-light)

[https://www.youtube.com/watch?v=E_8bjsimLsk](https://www.youtube.com/watch?v=E_8bjsimLsk)

------
rbanffy
Would anyone like to comment how these are implemented in Hack?

~~~
debacle
HHVM is using 2MB of RAM for the same code.

So this non-production code is using about twice as much RAM as production
HHVM.

------
bhouston
How does this compare to the optimized hashtable implementations in the
various JavaScript runtimes? I imagine their requirements are similar?

~~~
bzbarsky
First off, JavaScript runtimes (or at least V8 and SpiderMonkey, which are the
ones I've looked at) don't convert their arrays to hashtables unless they
really have to. If your array is not sparse and has no properties defined on
it with non-integer names, then it's an actual array of values in memory.

Past that, even if you start defining non-integer names you still store the
integer-named properties in a contiguous chunk of memory and store the named
properties separately. This is why in V8 and SpiderMonkey the property order
for an object (as seen by a for-in loop, say) doesn't match property addition
order. Instead, the integer-named properties are enumerated first (in
SpiderMonkey up to a certain limit) and then the other properties in addition
order. So yes, the requirements are similar and JS engines throw some of them
(property order preservation) under the bus to improve performance.

In SpiderMonkey, even once you have lots of properties (and are sparse and
whatnot) and have converted to a hashtable, things are not that simple. The
values are still stored in a contiguous memory buffer. There is a linked list
of property descriptors which contain things like the property name and an
index into the buffer, as well as metadata like whether the property is
readonly and whatnot. This list is shared across objects that have the same
property names added to them in the same order (though possibly with different
values). What's stored in the hashtable, which can also be shared across
objects, is a mapping from property name to nodes in this linked list.

In practice, objects that have a dedicated hashtable just for that one object
instead of sharing property descriptors with multiple other objects end up
being pretty rare.

Lastly, JS engines are at least experimenting with unboxed storage for arrays.
That is, instead of having a memory region filled with JS values, which might
be of any type, detect at runtime that your array happens to only contain
integers and have a memory region filled with integers; storing a non-integer
will then cause a realloc and boxing of the data.

~~~
esailija
Doubles and SMIs have been unboxed in arrays (the internal array that backs
the integer-keyed properties of an object) for a long time, at least in V8.

~~~
bzbarsky
Ah, good to know. SpiderMonkey is in the process of adding that right now.

------
aruggirello
It would have been nice to see performance comparisons too - though I
understand the new codebase might not be optimized for performance yet.

~~~
arenaninja
I'm also wondering what the effect will be on performance. Can't wait for an
alpha preview!

------
krick
So, the only thing I'm actually interested in is: does the API stay the same?
That is, it's the same old "all in one" data structure, with the same behavior
for all standard functions, with all the old gotchas left in place and no new
ones added, right?

~~~
McGlockenshire
This is only a change to the underlying implementation. Extension authors
_may_ need to update their code, but probably not in most cases.

There are no userland changes here.

------
nly
> The hash returned from the hashing function (DJBX33A for string keys) is a
> 32-bit or 64-bit unsigned integer

I thought there was a big hooha about PHP and other dynamic languages using
ill-suited hash functions, and that ultimately most runtimes moved to SipHash?

~~~
McGlockenshire
Can you provide a citation for "most"?

~~~
nly
I'm pretty sure the Perl, Python and Ruby reference implementations all use
it.

------
nly
An important variable determining memory consumption is going to be the
maximum bucket load factor. Does anyone know what it is as currently
implemented?

~~~
nikic
The maximum load factor is 1. A lower load factor only makes sense if open
addressing is used.

~~~
scottlamb
> The maximum load factor is 1. A lower load factor only makes sense if open
> addressing is used.

I don't think that's quite true for their data structure. Consider a full hash
table which is repeatedly used like a queue (first element removed; another
added). (I'd bet some PHP code out there is doing this.)

"The arHash array has the same size (nTableSize) as arData and both are
actually allocated as one chunk of memory." As arData (and thus the arHash)
becomes full, the arHash IS_UNDEF optimization becomes useless. Every
insertion is O(n) because every element has to be moved up one. On the other
hand, if there were 2n slots, all 2n would have to be touched only once every
2n insertions, which means insertion requires amortized constant time.

Then again, that'd perhaps cause there to be n-1 IS_UNDEF values at the
beginning, so iteration could be problematic. They could do various things to
avoid long runs of IS_UNDEF, but given that they could occur anywhere in the
hash (not just at the beginning), I think the best might be to use an unrolled
linked list as well. Then they could bound the number of consecutive IS_UNDEF
values while still getting much of the benefit of fewer pointers and better
locality. They could still put all the nodes in one allocation if they were so
inclined; there would just be some extra pointers and not strictly linear
iteration.

------
ape4
I assumed they used C++ std::map<>

~~~
nly
PHP isn't written in C++; it's written in some pretty macro-dense C. And even
if it were, std::map isn't a hash table.

~~~
acmd
But std::unordered_map is. I wonder what would be the actual C++ memory
consumption using that.

~~~
sharpneli
C++'s std::unordered_map does not maintain the order of the elements, which
PHP's arrays do, so it is not a fair comparison.

~~~
nly
Something approximately equivalent would be Boost's multi_index with a couple
of indexes.

    using php_array_t = multi_index_container<
        int64_t,
        indexed_by<
            random_access<>,
            hashed_unique<identity<int64_t>>
        >
    >;

There are so many design considerations at play though that such a comparison
would be pointless.

------
debacle
Copied from /r/php (care of [http://www.hhvm.rocks](http://www.hhvm.rocks)):

    
    
        dev@aerilon ~/dev $ php --version
        PHP 5.5.20-pl0-gentoo (cli) (built: Dec 22 2014 13:44:21)
        dev@aerilon ~/dev $ hhvm --version
        HipHop VM 3.5.0-dev (rel)
    
        dev@aerilon ~/dev $ php memusage.php
        13.97 MBs [14649088 bytes]
        dev@aerilon ~/dev $ hhvm memusage.php
        2 MBs [2097152 bytes]
    

So basically this implementation still uses 100% more RAM (hhvm is 64bit) by
default compared to the current production version of HHVM.

Great job, PHP internals team...

~~~
smsm42
The funny thing is, 5.5 isn't even the current stable version of PHP; that's
5.6. Let alone that the whole post is about PHP 7. It's like you're not even
paying attention, just writing the whole post for the sake of the last phrase.

~~~
debacle
I'm not sure if you know how to read:

    
    
        dev@aerilon ~/dev $ hhvm memusage.php
        2 MBs [2097152 bytes]
    

That is using the _same_ benchmark that nikic is using, and it's using _half
the RAM of his PHP 7 example_.

Please try and be less apologistic and use some reading comprehension. It
makes you look more intelligent and puts less stress on other people to
accommodate your intellectual laziness.

~~~
smsm42
There's no such word as "apologistic" (really, look in the worduary, which
should be available in your local bookery).

And the result shows PHP 7 reduced memory usage by hashtables threefold since
5.5. Yes, this is pretty good. No reason to be bitter or sarcastic.

~~~
locusm
LOL My 11 & 8 yr old thought that was pretty funny too...

