As the author mentions, if you run into a use case where you need to store 100,000 integers in memory, then you should use one of the many alternative structures available. Some of them were explicitly designed to store integers in an efficient manner. PHP arrays weren't designed to store integers efficiently, or anything else efficiently for that matter. They were designed to be fast and easy to use.
> then you should use one of the many alternative structures available. Some of them were explicitly designed to store integers in an efficient manner.
From the article:
> But if you do want to save memory you could consider using an SplFixedArray for large, static arrays. ... It basically does the same thing, but if you run it, you’ll notice that it uses “only” 5600640 bytes. That’s 56 bytes per element ...
That is only one of the alternatives and in my opinion, not a very good one. I forget the exact details, but there is an extension by the guy who wrote igbinary that is specifically designed for this use-case.
Considering PHP's scope and its limitations (it lacks things like threads, for example), I would say that is an extremely rare scenario. If you're storing 100,000 integers in a PHP data structure, you're most likely doing it wrong.
Isn't PHP-FPM a FastCGI process manager? I don't see any references to threading in its documentation. I also recall the PHP documentation saying PHP is unsafe with multiple threads due to a large number of libraries that are not coded to be thread-safe.
Wow. I always thought it was threaded. My mistake. It seems it launches multiple child processes. There is some form of memory sharing going on, though. Looks like it's some sort of a hybrid. Now I'm just thoroughly confused ...
How would you invoke such code? Squeezing it into an HTTP reply? Calling it from the PHP CLI?
If you're doing it in your webapp, don't! You're doing it wrong. Otherwise, why would you need PHP specifically? I mean, I guess you could, but what advantage does it pose compared to other languages available in pretty much every *nix system nowadays (Perl, Python)?
For the record, I have a nice collection of downvotes for defending PHP on point-and-laugh-at-PHP threads here at HN. I just don't see how the use case you're presenting is realistic.
Firstly, I think the point of the demo of loading that many integers into the array was to get a good measure of the memory usage per array item; by making the array big, small one-time memory overhead is washed out in the average.
As to why you would want to do that in real life -- if you have written a large system with accompanying libraries, objects, or other infrastructure, then even if you pass off some sort of calculation to background or crontab tasks, you still might want to use your PHP code. I do this for some of the Drupal sites I work on - I set up drush commands that run intensive calculations (like most-related-article and so on) behind the scenes and have the results cached.
The best solution would be to improve PHP's memory usage so that the benefits may be had on page loads as well. I suspect (but have little hard evidence) that the size of PHP code and data structures impedes performance as threads and processes get pulled in and out of the CPU.
A secondary solution would be to take some of the libraries out there for doing specific operations, like scientific computing and numerical libraries, and make them available to your code as a PHP extension. Obviously that doesn't solve as many problems as fixing PHP does.
That's the only place it's useful. Once the request comes in, I would split it off to a separate PHP CLI process. When the analysis is complete, I can transfer it to the user via web sockets.
So they would request a report. I would tell them great, it'll take a bit, keep working on something else and we'll let you know when it's ready. When it's ready, a notice shows up telling them to check it out.
> I mean I guess you could, but what advantage does it poses comparing to other languages available in pretty much every *nix system nowadays (perl, python)?
For something like this, the advantage would only be that you're sticking to the same language as the rest of your code base. It would make things simpler. So if you're using PHP, stick with that. If Perl, Python, etc. stick with that. It doesn't really matter. They can all do the work.
Yep, I already got some comments on that. You are either using a 32 bit system or a 32 bit binary (at least I think that the binaries PHP distributes for Windows are compiled for 32 bit, so even if you are on a 64 bit Windows you'll still get 32 bit numbers).
The Windows number is still 8 bytes per element larger than the number I wrote (76 per element). This might have various reasons; one could be that it was compiled with heap protection :)
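If you want to check which integer width your own PHP binary was built with, the standard PHP_INT_SIZE and PHP_INT_MAX constants tell you directly (just a quick sanity check, nothing specific to the article):

    // 4 on a 32 bit build, 8 on a 64 bit build
    var_dump(PHP_INT_SIZE);
    // 2147483647 on 32 bit, 9223372036854775807 on 64 bit
    var_dump(PHP_INT_MAX);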
> A union is a means to make some value accessible as various types. For example if you do a zvalue_value->lval you’ll get the value interpreted as an integer. If you use zvalue_value->ht on the other hand the value will be interpreted as a pointer to a hashtable (aka array).
This is not valid C usage of unions. They are _only_ for use as a method to save space, not for conversion between types, despite it being a very common usage of unions.
This can cause all manner of problems when compiler optimizations such as type-based alias analysis are used.
EDIT: Turns out I'm completely wrong on this and it's fine from C99 onwards.
Actually, the upcoming C1X standard makes type punning via union legal. In C99, alias analysis behaves exactly as you said, but using unions to convert between floats and ints is so common that it's legal in C1X, VC++, and I believe gcc.
Don't feel bad about it - the C standard can be quite subtle, and I've been known to spread lies about it as well.
Things about which I have stumbled somewhat recently:
* restrict-qualified pointer-to-const parameters do not guarantee that the pointed-to object won't be modified as restrict only applies if the pointer is actually used to access the object, which calling code can't know (ie restrict only enables optimizations in the called code and not reordering in calling code)
* functions with differently qualified, but otherwise compatible parameter types have compatible type (which is only mentioned in the last, parenthesized sentence of section 6.7.5.3)
> This is not valid C usage of unions. They are _only_ for use as a method to save space, not for conversion between types
That's incorrect. The following footnote was added to section 6.5.2.3 with TC3 in 2007 to clear up this particular misconception:
"If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation."
If you have multiple items of which you're only going to store one at a time but want them in the same "structure" then you use a union. (i.e. temporally disjoint)
Has anyone run similar tests on the soon-to-be-released PHP 5.4? From what I understand, one of the changes in that release is reduced memory consumption. Andi Gutmans has said that PHP 5.4 could lower PHP's memory footprint by as much as 35%.
I actually did most of my tests with PHP 5.4 and trunk binaries, but also tested PHP 5.3 and the numbers didn't change. PHP 5.2 used 8 bytes less, because the circular GC was introduced only in PHP 5.3.
By the way, you can test that yourself too. The codepad I posted the code on has a switch for PHP 5.2, 5.3 and 5.4, so you can easily see for yourself :)
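If you'd rather run it locally, something along these lines is roughly what the measurement boils down to (a minimal sketch of my own, not the article's exact code; the element count is an arbitrary choice):

    $count = 100000;

    // regular array: divide the memory delta by the element count
    $before = memory_get_usage();
    $array = array();
    for ($i = 0; $i < $count; $i++) {
        $array[$i] = $i;
    }
    echo (memory_get_usage() - $before) / $count, " bytes per element (array)\n";

    // same measurement for SplFixedArray, which the article compares against
    $before = memory_get_usage();
    $fixed = new SplFixedArray($count);
    for ($i = 0; $i < $count; $i++) {
        $fixed[$i] = $i;
    }
    echo (memory_get_usage() - $before) / $count, " bytes per element (SplFixedArray)\n";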
Lua is actually closer to PHP in how arrays are used in it.
"Until Lua 4.0, tables were implemented strictly as hash tables: all pairs were
explicitly stored. Lua 5.0 brought a new algorithm to optimize the use of tables
as arrays: it optimizes pairs with integer keys by not storing the keys and storing
the values in an actual array. More precisely, in Lua 5.0, tables are implemented
as hybrid data structures: they contain a hash part and an array part. Figure 2
shows a possible configuration for a table with the pairs "x" → 9.3, 1 → 100,
2 → 200, 3 → 300. Note the array part on the right: it does not store the integer
keys. This division is made only at a low implementation level; access to table
fields is transparent, even to the virtual machine. Tables automatically and dy-
namically adapt their two parts according to their contents: the array part tries
to store the values corresponding to integer keys from 1 to some limit n. Values
corresponding to non-integer keys or to integer keys outside the array range are
stored in the hash part.
When a table needs to grow, Lua recomputes the sizes for its hash part and
its array part. Either part may be empty. The computed size of the array part
is the largest n such that at least half the slots between 1 and n are in use
(to avoid wasting space with sparse arrays) and there is at least one used slot
between n/2 + 1 and n (to avoid a size n when n/2 would do). After computing
the new sizes, Lua creates the new parts and re-inserts the elements from the
old parts into the new ones. As an example, suppose that a is an empty table;
both its array part and hash part have size zero. If we execute a[1]=v, the table
needs to grow to accommodate the new key. Lua will choose n = 1 for the size
of the new array part (with a single entry 1 → v). The hash part will remain
empty."
Since PHP arrays are within a factor of two of the theoretical optimum for a dynamically typed language that unifies arrays and hashes, all these languages (add JavaScript to the list) should have roughly comparable behaviour unless they do some ugly special casing for hashes where all keys are integers. [edit: see the comment by InfernalH for a measurement that indicates that Ruby seems to do this.] You have to look that up in the relevant documentation and/or source code. Perl on the other hand should fare better, as it distinguishes arrays and hashes.
If you want performance, use a real programming language.
JavaScript engines have many types of arrays. They do indeed resort to "ugly special casing" for objects in which all keys are integers: for instance, SpiderMonkey calls them "dense arrays". V8 has a similar optimization. Likewise, Lua stores a hashtable part and an array part for all objects.
V8 and SpiderMonkey have optimizations for the values, too. When a value is a 31-bit integer (in the case of V8) or any number (in the case of SpiderMonkey), the value itself is optimized to avoid heap allocation. In V8's case optimized values take 32 bits, while in SpiderMonkey's case optimized values take 64 bits. So an array consisting of 100,000 integers will take 400K plus a negligible amount of malloc/GC slop in V8 and 800K in SpiderMonkey.
JavaScript also has typed arrays, which allow the programmer to use fine-grained, optimized array buffers. Performing this experiment with a typed array will yield a 400K buffer in all engines that support the feature.
Python and Ruby both have separate arrays and hashes. They're completely different data structures in both cases.
Python's dict type is a hash table like you'd expect. Python's list is a pointer-array-backed list. (it may inline ints/similar things--I don't remember if CPython does, and exact details are implementation-dependent), and raw arrays are in the standard library if you need them.
From a very quick check, a list of 100k ints in CPython is ~1.5mb, and a dict of 100k ints -> other ints is ~6mb.
Ruby's hash and array implementations are similar, I think, although I don't know Ruby as well, so I don't know the specifics.
Good info about Python, but you are incorrect about Ruby. Hashes and arrays are totally different in the Ruby implementations I know of.
EDIT: Sorry, I misread you. Ruby MRI and CPython are indeed similar.
At least in Ruby MRI (the mainline) arrays are implemented as a struct with a size and a pointer to a normal C array which contains the object references (references in MRI are pointers to object structs which use the lower bits to inline integers of <= 31 or 63 bits, true, false, nil and symbols).
I have not looked that much into hashes in MRI, but I believe they are hash tables which in Ruby 1.9 retain insertion order using pointers, like a singly linked list.
Ugly special casing requires code that would detect those special cases. This adds overhead in terms of processing cycles and code. Most arrays, the majority of the time, store a tiny amount of information - less than a dozen entries. Doing that sort of detection is massive overkill and would actually hurt performance. You want to speed things up for the real world, not artificial benchmarks. How many times in your code are you really going to need 100,000 integers in memory? Probably once in your entire lifetime, if that.
I use much larger integer arrays all the time but that's beside the point. The question is whether or not it makes sense to make dynamic language runtimes as efficient as possible. I think it does, not because it necessarily helps all or even most existing programs, but because it enables people to create a much wider range of software with a language they already know. Many modern websites wouldn't be possible with JavaScript engines from 1999.
And if you optimize at all, memory usage must be a top priority. Lack of memory is a lot more costly in terms of performance than a couple of C based heuristic checks. Creating and then garbage collecting an array full of individual integer objects doesn't just use a lot more memory, it's also orders of magnitude slower. The array doesn't have to be very large for that to matter as there may be large numbers of arrays.
Yeah, I too am pretty sure that CPython does not inline integers in the pointers. And from a quick glance at the source code I saw nothing of the sort.
Inlining integers in pointers by shifting up and adding 1 is a quite common trick though and I have seen it in more programming language implementations than MRI. I think at least some Prolog implementations and older versions of SpiderMonkey use it (newer versions use a similar trick with doubles).
> Inlining integers in pointers by shifting up and adding 1 is a quite common trick though and I have seen it in more programming language implementations than MRI
This trick was already used in Smalltalk-80, btw. A more recent variant of this is NaN tagging, made popular by LuaJIT.
> Inlining integers in pointers by shifting up and adding 1 is a quite common trick though and I have seen it in more programming language implementations than MRI.
Yeah, I know about it, I just did not think MRI had bothered with it anymore than CPython.
Python has lists and dicts, and lists don't get the per-item overhead of GC-ed keys. PyPy does better than CPython and implements specialised, low-overhead storage for some dicts, lists and sets of uniform type (in 1.6, nightly builds, and a branch respectively).
What does memory_get_usage() actually do? Does it report the "heap" size assigned to the process, or does it use PHP-internal counters for the allocated user data/variables? A C malloc subsystem will assign a whole lot of virtual memory, in steps of pages, or more if it decides to attach a piece using mmap().
In order to make this test case relevant, I'd say one has to know what memory_get_usage() does - it's at least meaningless to determine the overhead of an array based on it if, for whatever reason, creating the first PHP array in a program also initializes "big" memory pools that count towards the memory usage.
This is easy to check in the source. memory_get_usage() calls zend_memory_usage(), which accesses the size field on a global structure mm_heap, which is updated by PHP's memory allocation system (e.g. if you call *_zend_mm_alloc_int)
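Also worth noting (standard PHP behaviour, not something from the article): memory_get_usage() takes an optional $real_usage flag, so you can look at both views:

    // bytes of data actually handed out by the Zend memory manager
    // (this is what the article's numbers are based on)
    var_dump(memory_get_usage(false));

    // bytes the memory manager has reserved from the system, including
    // unused space sitting in its pools
    var_dump(memory_get_usage(true));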
I was debugging a PHP memory issue yesterday, and noticed that at some point my memory_get_usage() value became much, much smaller than the memory footprint recorded by top (20 MB in memory_get_usage(), 500 MB in top RES). It was a sudden jump during a loop execution according to top. Does top give accurate memory estimates for PHP scripts?
PHP is not unique in having a big memory footprint. AFAIC Python and Ruby are also memory hogs. An interesting question is how memory efficient a dynamic PL can be. Given that in modern computers memory access (cache misses) is fairly expensive, it probably makes sense to trade instructions for memory.
It's only in CPython that it's always a full-blown object. Smarter implementations, notably PyPy, will do escape analysis and never actually end up allocating those objects.
(My point is that not having unboxed types does not imply being a memory hog. You just need a smarter implementation.)
And once you realize that they are really hashes, you also realize that a fast and memory efficient representation will use at least nine machine words, or 72 bytes, per value: 1 + x for the pointer in the bucket array (you don't want 100% occupancy, to avoid hash collisions), 2 for the cons cell of the list stored in the bucket (the second element of which will usually be NULL if you have a good hash function and low occupancy), 2 for the key-value pair pointed to by the first element of the cons cell, representing the index and the value, and two each for the integer key and value plus their tag words. And you may also want to store some metadata (like the length of the list stored in each bucket) with the hash, so the PHP array in the example is within a factor of two of the naive optimum for a dynamically typed language without specialized arrays.
It's just the price you have to pay for not caring about the type of your "array" keys.
That or specialize. Lua, Javascript, and PyPy all have more efficient storage of integer-keyed hashes. Lua's solution is particularly simple and could easily be transferred to PHP.
A PHP array is in fact just an ordered hash map (i.e. a hash map with an associated linked list which tracks the iteration order).
When you do an array append, PHP tries to guess what the most logical key would be, adds that to the hash map and then to the end of the linked list.
    $a = array();
    $a[] = 'A';
    $a[] = 'B';
    print_r($a);

    Array
    (
        [0] => A
        [1] => B
    )

    $b = array();
    $b[5] = 'A';
    $b[100] = 'B';
    $b[] = 'C';
    print_r($b);

    Array
    (
        [5] => A
        [100] => B
        [101] => C
    )
Well, you are explicitly sorting the array by keys, so it's not like you are contradicting his sentence. Remove the krsort() call and your "not really" becomes "example".
I think it's important to note your first example actually isn't valid. Part of PHP's type-mixing means strings that consist of an integer representation, when used as array keys, are automatically cast to int.
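A quick illustration of that cast (standard behaviour, nothing specific to the article):

    $a = array();
    $a['5'] = 'A';   // the string key '5' is cast to the integer key 5
    $a['05'] = 'B';  // '05' is not a canonical integer string, so it stays a string key
    var_dump(array_keys($a));
    // array(2) { [0]=> int(5) [1]=> string(2) "05" }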
It is the same for JavaScript: even though you have an Array object that you can instantiate instead of the more generic Object, they are much the same thing under the hood, with both arrays and objects being hash maps [you can use the two interchangeably to an extent (if you don't need the extra functions defined by the array prototype) - reference object properties in a hash-like manner, or array contents the same way as object properties].
I suspect a number of languages have similar memory (in)efficiency with their array types because the arrays are implemented this way (and the stuff stored in each slot is untyped, so you'll not just store that integer; at the very least the engine will need a marker that identifies it as an integer rather than something else).
The high memory use is one of the prices you pay for the type flexibility.
> It is the same for JavaScript: even though you have an Array object that you can instantiate instead of the more generic Object, they are much the same thing under the hood, with both arrays and objects being hash maps
While this is true if you only consider language semantics, implementations do use actual arrays to store dense properties with integer names, i.e. arrays are backed by both flat and sparse storage, with some heuristic algorithm to determine when to use what.
It's funny. I don't have much experience with php, and I had to actually run your last example to know that it appended the values (instead of recreating the array with a single value each time).
Is there an append operator, or something more explicit? '+=' seems to do something strange...
Not sure exactly what you were trying. Using the [] syntax is common for appending elements (using the next available key). Otherwise you can use the array functions (array_push/pop/etc). As for plain operators, here is how PHP handles array operators: http://www.php.net/manual/en/language.operators.array.php
[] IS the append operator. And += only works right if both sides are arrays. If not, the right side will be converted. += is really intended for hashmaps, not integer indexes; for those, use [].
+= does union-by-key: any keys present in the right array, and not in the left, are appended to the left.
`array_merge` will append all numeric keys from the right array to the left, under new key values, as if you used [] one-by-one; for string keys, all keys from the right are copied into the left, possibly overwriting what was there.
In real code, I either have all-numeric or all-string keys, and `array_merge` does what I want in bulk operations: it's essentially equal to Python list.extend and dict.update, respectively. `+` on arrays is basically useless for me.
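A small side-by-side example of the two behaviours (standard PHP, just for illustration):

    $a = array('x' => 1, 0 => 'first');
    $b = array('x' => 2, 0 => 'second', 1 => 'third');

    // + keeps everything already in $a; only keys missing from $a come from $b
    print_r($a + $b);             // x => 1, 0 => first, 1 => third

    // array_merge overwrites string keys and renumbers integer keys
    print_r(array_merge($a, $b)); // x => 2, 0 => first, 1 => second, 2 => third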
I'm kinda hijacking this thread because I always wonder how to handle vars in PHP. Should we use short named vars like $a, $b? Should we avoid always using the same var and changing its type? $a = 30; $a = "thing";
Someone downmodded you for being offtopic, but I'll answer you anyway.
The name of the var doesn't matter at all. And you can change the type of a var at will. There is nothing at all in PHP that will be better if you avoid changing the type, so just do what is clearest for your program.