
Even a simple PHP script that works with a large data set can quickly run into trouble if you don't pay attention to algorithmic complexity.

Consider this simple script, which first builds a data set of 500,000 strings: "x0", "x2", "x4", "x6", and so on.

The task is to loop from 0 to 999 and check whether the string "x".$i is present. A linear search with array_search() is painfully slow, but if you switch the storage from array values to array keys and use a hash lookup via array_key_exists(), it suddenly flies:

% time php -dmemory_limit=512M -r '$entries = array(); for($i=0;$i<1000000;$i+=2){ $entries[] = "x".$i; }; $hits = 0; for ($i=0;$i<1000;$i++) { if(array_search("x".$i,$entries)!==false) $hits++; } var_dump($hits);'
int(500)
php -dmemory_limit=512M -r  11.25s user 0.09s system 99% cpu 11.362 total

So, 11.362 seconds for a linear algorithm.

% time php -dmemory_limit=512M -r '$entries = array(); for($i=0;$i<1000000;$i+=2){ $entries["x".$i]=true; }; $hits = 0; for ($i=0;$i<1000;$i++) { if(array_key_exists("x".$i,$entries)) $hits++; } var_dump($hits);'
int(500)
php -dmemory_limit=512M -r  0.35s user 0.06s system 99% cpu 0.419 total

And 0.419 seconds for the hash one.

If you increase the workload to looping from 0 to 9999, the linear case gets even worse, while the hash lookup finishes in about the same time as before!

% time php -dmemory_limit=512M -r '$entries = array(); for($i=0;$i<1000000;$i+=2){ $entries[] = "x".$i; }; $hits = 0; for ($i=0;$i<10000;$i++) { if(array_search("x".$i,$entries)!==false) $hits++; } var_dump($hits);'
int(5000)
php -dmemory_limit=512M -r  115.96s user 0.29s system 99% cpu 1:56.52 total

So, 1 minute 56.52 seconds for a linear algorithm.

% time php -dmemory_limit=512M -r '$entries = array(); for($i=0;$i<1000000;$i+=2){ $entries["x".$i]=true; }; $hits = 0; for ($i=0;$i<10000;$i++) { if(array_key_exists("x".$i,$entries)) $hits++; } var_dump($hits);'
int(5000)
php -dmemory_limit=512M -r  0.37s user 0.06s system 99% cpu 0.440 total

And 0.440 seconds for the hash one.
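
A side note (a sketch, not part of the original script): if the data already exists as plain array values, array_flip() gives you the key-based version in one linear pass, after which every membership test is a hash lookup:

    <?php
    // Sketch: convert a value list into a hash-keyed set once, then
    // answer membership queries in O(1) average time. array_flip()
    // swaps values and keys in a single linear pass.
    $entries = array();
    for ($i = 0; $i < 1000000; $i += 2) {
        $entries[] = "x" . $i;
    }

    $set = array_flip($entries);      // "x0" => 0, "x2" => 1, ...

    $hits = 0;
    for ($i = 0; $i < 1000; $i++) {
        // isset() is safe here: the flipped values are ints, never null.
        if (isset($set["x" . $i])) $hits++;
    }
    var_dump($hits);                  // int(500)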




keep in mind that since PHP must preallocate the array() to an arbitrary 'sane' size, it has to 'resize' (create a new one, copy it over, point reference, destroy old one) a large number of times here.

therefore, most of the time spent here will be malloc'ing (mmap, memcpy). the rest of it will be on executing the php interpreter, compiling the script, and cleanup.

the hash search and comparisons are so much faster than those operations (in PHP, especially), this isn't really a valid way to measure your thesis
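
e.g., a rough sketch to make those growth steps visible (the 1 MB threshold is an arbitrary pick of mine, and the exact jump points and sizes vary by PHP version):

    <?php
    // Sketch: watch for single-append memory jumps while building the
    // array; a large jump in one iteration corresponds to the table
    // being reallocated and copied.
    $entries = array();
    $prev = memory_get_usage();
    for ($i = 0; $i < 1000000; $i += 2) {
        $entries[] = "x" . $i;
        $now = memory_get_usage();
        if ($now - $prev > 1024 * 1024) {  // jump > 1 MB in one append
            echo count($entries), " elements: jump of ", $now - $prev, " bytes\n";
        }
        $prev = $now;
    }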

-----


> keep in mind that since PHP must preallocate the array() to an arbitrary 'sane' size, it has to 'resize' (create a new one, copy it over, point reference, destroy old one) a large number of times here.

Those allocations only happen once in a while (because at each reallocation the size of the array doubles), so the cost is amortized. I think you only get about log2(1000) ≈ 10 allocations in total for an array of 1000 elements.
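
For concreteness, a quick sketch of that arithmetic (assuming the table starts at 8 slots and doubles each time it fills, which I believe matches PHP's internals but is version-dependent):

    <?php
    // Sketch: count the doublings needed to grow a table from 8 slots
    // to $n elements. The copies across all doublings sum to O(n),
    // i.e. O(1) amortized per insert.
    function doublings($n) {
        $size = 8;
        $count = 0;
        while ($size < $n) {
            $size *= 2;
            $count++;
        }
        return $count;
    }
    echo doublings(1000), "\n";    // 7
    echo doublings(500000), "\n";  // 16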

-----


fair point, but they're rather costly due to the amount of memcpy'ing, in my experience.

you've probably seen/know this, given that you know PHP internals :), but for others: http://nikic.github.io/2011/12/12/How-big-are-PHP-arrays-rea...

-----


The initial array creation + startup/parse/compile/teardown should be quite constant between runs (probably most of the 0.4 seconds baseline). I think the 11sec->2min vs 0.41sec->0.44sec time increase illustrates the algorithmic complexity quite well, no?
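
e.g., a sketch that times the two phases separately with microtime(true), so the constant setup cost drops out of the comparison:

    <?php
    // Sketch: time the array build and the lookup loop independently.
    $t0 = microtime(true);
    $entries = array();
    for ($i = 0; $i < 1000000; $i += 2) {
        $entries["x" . $i] = true;
    }
    $t1 = microtime(true);

    $hits = 0;
    for ($i = 0; $i < 10000; $i++) {
        if (array_key_exists("x" . $i, $entries)) $hits++;
    }
    $t2 = microtime(true);

    printf("build %.3fs, search %.3fs, hits %d\n", $t1 - $t0, $t2 - $t1, $hits);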

-----



