

How to speed up massive data analysis by eliminating disk seeks - petewarden
http://petewarden.typepad.com/searchbrowser/2010/01/how-to-speed-up-massive-data-set-analysis-by-eliminating-disk-seeks.html

======
petewarden
I'm _certain_ I'm re-inventing the wheel with this approach, but I obviously
haven't been researching in the right places, since I hadn't run across this
approach before I cobbled it together.

I'm expecting an education on what I'm missing from the HN community!

~~~
fhars
The "external sorting" chapter of TAoCP might also be interresting in this
context (5.4 in Vol. 3), even though Knuth thought that "the once-crucial
topic of patterns for tape merging has become of limited relevance to currnt
needs" (p. 251) due to the rapid development of disks in the 80s and 90s, so
that his exposition might "be their last grabd appearance before they acept a
final curtain call." Little did he know that disk would soon become the new
tape... To his credit, he ended the introduction to the chapter with the
citation "for all we know now, these techniques may well become crucial
again."

------
NateLawson
Yes, you are reinventing the wheel. This kind of approach has been used for
decades in disk drive controllers. You sort track accesses in ascending or
descending order to prevent longer seeks. It's called the "elevator
algorithm".

<http://en.wikipedia.org/wiki/Elevator_algorithm>

This is combined with tag queuing, where multiple requests can be accepted
from the host at once. The greater your tag depth, the more insight the
controller gets into future seeks.
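
Something like this, as a toy sketch in Python (made-up names, obviously not
real controller firmware): serve the pending requests above the current head
position in ascending order, then sweep back down through the rest.

  # Toy sketch of an elevator-style (SCAN) scheduler: serve pending track
  # requests in ascending order past the head, then sweep back down.
  def elevator_order(head_position, pending_tracks):
      up = sorted(t for t in pending_tracks if t >= head_position)
      down = sorted((t for t in pending_tracks if t < head_position), reverse=True)
      return up + down

  # A deeper queue gives the scheduler more requests to reorder at once.
  print(elevator_order(50, [95, 10, 60, 40, 120, 55]))
  # -> [55, 60, 95, 120, 40, 10]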

~~~
petewarden
Interesting - so in those terms I'm creating a massively deep queue of access
requests, and then sorting them into an optimal order.

------
aaronblohowiak
Random access is _always_ slower than linear reading. Even if you aren't going
to disk, linear access avoids blowing out the processor cache and having to
fetch from main system memory. RAM is a cache, and L2 is a cache, &c. What you
are doing is pretty normal "old-school" Unix programming.

------
timtadh
It is a good point that classical DBMSs aren't always the right tool; however,
the methods a DBMS uses to perform queries are always good to know. The author
here implemented a sort-merge join, which is one of the classic join
algorithms. For a good overview of the trade-offs between the various join and
sort algorithms, see "Principles of Database & Knowledge-Base Systems Vol. 2"
by Jeff Ullman. The first chapter in the book is the one you want. It is
dated, but therefore cheap if you get it used.

Here is the WorldCat link: <http://www.worldcat.org/oclc/439156325>
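
To make it concrete, here is a rough sketch of a sort-merge join in Python
(made-up field names, not the author's actual code): sort both inputs on the
join key, then walk them in lockstep and emit the matches.

  from operator import itemgetter

  # Sketch of a sort-merge join: sort both inputs on the join key, then
  # advance two cursors in lockstep, emitting every matching pair.
  def sort_merge_join(left, right, key=itemgetter(0)):
      left = sorted(left, key=key)
      right = sorted(right, key=key)
      i = j = 0
      while i < len(left) and j < len(right):
          lk, rk = key(left[i]), key(right[j])
          if lk < rk:
              i += 1
          elif lk > rk:
              j += 1
          else:
              # Find the run of right-side rows sharing this key, then
              # pair them with every left-side row for the same key.
              j_end = j
              while j_end < len(right) and key(right[j_end]) == lk:
                  j_end += 1
              while i < len(left) and key(left[i]) == lk:
                  for r in right[j:j_end]:
                      yield left[i] + r[1:]
                  i += 1
              j = j_end

  pages = [("page2", "Widget Lovers"), ("page1", "Acme Fans")]
  fans = [("page1", "alice"), ("page2", "carol"), ("page1", "bob")]
  print(list(sort_merge_join(pages, fans)))
  # -> [('page1', 'Acme Fans', 'alice'), ('page1', 'Acme Fans', 'bob'),
  #     ('page2', 'Widget Lovers', 'carol')]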

~~~
petewarden
Cheers! That's exactly the sort of reference I was hoping for.

------
stephenjudkins
Take this post with a grain of salt, since I have the zeal of a recently saved
sinner, but you should try using Hive and Hadoop for this sort of thing.

We recently switched from a workflow that is very similar to the one you
describe to using Hive with Amazon's elastic map reduce. Hive presents a SQL-
like layer of abstraction over exactly this sort of thing. Instead of doing
the sorting and merging by hand, you simply write it as a series of joins.
It's like writing SQL, except the actual implementation works almost exactly
like what you're doing.

Integrating simple Ruby scripts for JSON processing was also trivial.

Elastic MapReduce also had near-zero infrastructure and management overhead
for us (besides the 10% Amazon charges for the machine instances). We use S3
for all data input and output, which is perfect for us.

Even when running on a single machine, using Hive was a big win in terms of
development time, and the performance of the jobs seemed only slightly slower
than using Unix utilities on big text files. It's almost a bonus that we can
also scale it out to dozens of machines for a huge speedup: a job that took
several hours on a single machine finished in less than five minutes, using
only a few hours of total EC2 machine time. Cheap and easy!

------
jbeda
This is basically a map-reduce. You should look at Hadoop as you start doing
more complicated stuff.

~~~
gstar
The sort step is map-reduce style, but the overall idea was to lay the data
out sequentially on disk before you process it.

------
speek
You should check out this paper: "Disk is the new RAM"
<http://www.ccs.neu.edu/home/gene/papers/acm-viewpoint08.pdf> It's really
neat.

------
jbl
You can even combine this approach with split and parallel make to do a cheap
single-machine parallel sort. I use a little script that generates a Makefile
which, when run with make -j N, splits an input file, sorts the parts, and
then merges them with sort -m. It's proved to be quite handy.
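
The same idea in Python rather than a Makefile, just as a toy sketch (not the
actual script): split the input into chunks, sort the chunks in parallel
worker processes, and k-way merge the sorted runs, which is what sort -m does.

  import heapq
  from multiprocessing import Pool

  # Toy single-machine parallel sort: sort each chunk in a worker
  # process, then merge the sorted runs (the equivalent of `sort -m`).
  def sort_chunk(chunk):
      return sorted(chunk)

  def parallel_sort(lines, workers=4):
      size = max(1, len(lines) // workers)
      chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
      with Pool(workers) as pool:
          runs = pool.map(sort_chunk, chunks)
      return list(heapq.merge(*runs))

  if __name__ == "__main__":
      words = ["banana", "cherry", "apple", "fig", "elderberry", "date"]
      print(parallel_sort(words, workers=2))
      # -> ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig']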

------
conquest
Where do you store the processed data for recall? It seems you have data per
fan page as well as a Google-style suggest index of those page names.

~~~
petewarden
I'm actually storing the data as text files containing JSON in the file
system, one per fan page. I normally use Mongo, Redis, Tokyo or MySQL for this
sort of thing, but since I'd already done all of the processing they normally
help me with as operations on disk files, I thought I'd try sticking with the
low-tech theme.

------
earle
This is precisely how data is stored in ranges within HBase/Hypertable. Your
data is also sorted between mapping and reducing.

------
tlack
Is it really faster to write to a bunch of text files, sort them into a new
bigger text file, and then do the insert? It seems like a lot of extra steps,
all involving a lot of reading and writing.

~~~
barrkel
It depends: for example, on the balance between reading and writing, on how
stream-oriented your processing is, on how important random access between
different records is, etc.

This approach can be easily augmented too. For example, doing a binary search
for a particular line in a text file when you don't have all the lines in RAM
is somewhat tedious; it can be made much easier by creating a simple index for
the file, consisting of a flat array of the file offset of every line start.
That flat array can be stored in a file also; then, both the total number of
lines and the contents of a line at any given index are trivial to retrieve.
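
A minimal sketch of that index in Python (hypothetical file name, no error
handling): one pass records the byte offset of every line start, and from then
on any line can be fetched with a single seek. The offsets array can itself be
written to disk, e.g. as fixed-width integers.

  # Build a flat array of line-start byte offsets for a text file, so
  # the line at any index can be fetched with one seek instead of a scan.
  def build_line_index(path):
      offsets = []
      with open(path, "rb") as f:
          pos = 0
          for line in f:
              offsets.append(pos)
              pos += len(line)
      return offsets

  def read_line(path, offsets, n):
      with open(path, "rb") as f:
          f.seek(offsets[n])
          return f.readline().decode("utf-8").rstrip("\n")

  # offsets = build_line_index("pages.txt")   # hypothetical file
  # print(len(offsets), read_line("pages.txt", offsets, 12345))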

If you have to handle a small number of updates while still handling lots of
reads, then you can use a two-layer approach. Keep a cache of all pending
updates in memory in an efficient manner (e.g. hash table), and look up the
cache before falling back to the disk; and when writing, both update the cache
and write out to an update log, which can be sorted and included in the main
store later, when it makes sense.
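
A rough sketch of that two-layer scheme (the names and the on-disk lookup
function are made up for illustration): reads check an in-memory dict of
pending updates before falling back to the sorted file, and writes go both to
the dict and to an append-only log that gets sorted and merged into the main
store later.

  # Two-layer store: an in-memory dict of pending updates in front of a
  # read-only sorted file, plus an append-only log of those updates that
  # is periodically sorted and merged back into the main store.
  class TwoLayerStore:
      def __init__(self, lookup_on_disk, log_path):
          self.pending = {}                     # key -> value, unmerged writes
          self.lookup_on_disk = lookup_on_disk  # e.g. binary search over the sorted file
          self.log = open(log_path, "a")

      def get(self, key):
          if key in self.pending:
              return self.pending[key]
          return self.lookup_on_disk(key)

      def put(self, key, value):
          self.pending[key] = value
          self.log.write(f"{key}\t{value}\n")   # replayed into the main file later
          self.log.flush()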

------
ilaksh
How about not using a mechanical disk?

Solid-state seeks are something like 10 to 50 times faster.

~~~
akronim
Because (as it says in the article) he's running on EC2, where that isn't an
option.

