
Single server systems can tackle big data - datascientist
http://strata.oreilly.com/2013/04/single-server-systems-can-tackle-big-data.html
======
bane
At one time I worked for a company that, as a side effort, built some pretty
cool data processing software to support the desktop software we were selling
as the main line of business.

Since we were running on a single desktop, it never occurred to us to think
about large server clusters cranking away on data day and night. We just had
to make it work on a regular ol' single machine. Since user experience is
paramount in user-facing desktop software, that meant clever algorithms and
performant code to process the data in a timely fashion.

One day, one of our customers found themselves incredibly dissatisfied with
the money and manpower that were being poured into a "big-data" data
processing solution on another project: 3 FTEs and a half-dozen very expensive
servers.

On a lark he asked us if our data processing tool could do what their big-data
system was doing. As it turned out, functionality-wise, we were about 80%
there. With a couple months' worth of work we were 95% there. The big question
was performance.

The customer set up a single instance of our desktop data processing software
and kicked it off. A few days later it had finished. "It's not fast enough,"
he said. So we spent a few months making it faster and getting rid of
bottlenecks. We got it down to just a few hours. "It's not fast enough!" he
said again.

So we halved the time... but that was it; that was pretty much all the
performance we were ever going to get out of our codebase. We asked, "How long
does your big data system take to do what we're doing in a morning?", fully
expecting that their multi-million dollar solution was cranking through the
same data in ten minutes to a half hour or so.

He responded, "two and a half weeks".

A year later, the other solutions provider called us up and asked us if we
were interested in purchasing their big-data processing division. We politely
declined.

The point is that the giant software stacks and interconnected clustering
systems we build to tackle "big data" sometimes introduce _so_ much overhead
into the system that a single machine, running some smart software, _can_
outperform the larger systems. We engineers sometimes get so lost in the
details of making cool things work that we don't stop to ask ourselves whether
what we're doing is actually the best approach in the end.

------
kijin
> _for workloads that are processing multi-gigabytes rather than terabyte+
> scale, a big-memory server may well provide better performance per dollar
> than a cluster._

Does the "performance per dollar" take into account the devastating damage
that a single, very powerful server would cause if it ever went down?

With 1 server that can process terabytes of data, you're either processing
terabytes of data or zero bytes of data. Redundancy would be difficult to
achieve, not only because the hardware is expensive, but also because your
software was designed to work best on a single server.

With 1000 servers that can process gigabytes of data each, you're always
processing terabytes of data even if a few servers go down from time to time.
Redundancy is already built in.

If you put all your hard drives in RAID 10, you need twice as many hard drives
as your storage needs dictate. If you put them in RAID 5/6 instead, you need
only one or two extra drives per array but get a similar level of overall
reliability. With hard drives, of course, there would be a significant
difference in performance. With CPU cores, on the other hand, the difference
would be smaller.
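
For concreteness, a back-of-the-envelope sketch of that overhead arithmetic
(the drive counts and capacities below are hypothetical, and each RAID level
is treated as a single array):

    # Rough RAID drive-count arithmetic (illustrative numbers, not a sizing tool).

    def drives_needed(usable_tb, drive_tb, level):
        """Total drives required to reach a given usable capacity."""
        data = -(-usable_tb // drive_tb)   # ceiling division
        if level == "raid10":
            return 2 * data                # every data drive is mirrored
        if level == "raid5":
            return data + 1                # one parity drive per array
        if level == "raid6":
            return data + 2                # two parity drives per array
        raise ValueError(level)

    for level in ("raid10", "raid5", "raid6"):
        print(level, drives_needed(usable_tb=40, drive_tb=4, level=level))
    # raid10: 20 drives, raid5: 11, raid6: 12, for the same 40 TB usable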

~~~
jacques_chester
I think you're conflating throughput with availability. They are distinct non-
functional requirements and they should be considered and architected for
distinctly.

~~~
kijin
Throughput is different from availability, and I'm not conflating them. But
they are not completely unrelated, either.

All I'm saying is that it might turn out to be more expensive to design a
high-availability architecture using a small number of high-throughput servers
than to design one using a large number of low-throughput servers. Sometimes
this doesn't matter because you only need to crunch data a few times a week.
Other times, you need 99.99% uptime.

------
pilooch
Recently a big engineering firm (let's not harm them by revealing their name
:)) called on me. "A query on our system takes 7 days," they said, "when it
succeeds..." These guys deal with up to 4TB a day. With proper multi-core
map-reduce and open source search, I brought this down to 3 sec on a single
48-core machine.
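
The single-machine approach described above can be as simple as a
multiprocessing map-reduce; here's a minimal sketch (the word-count workload
and file names are hypothetical stand-ins for the real query):

    # Minimal single-machine map-reduce across all cores. The word-count
    # workload and file names are hypothetical stand-ins for the real query.
    from collections import Counter
    from multiprocessing import Pool

    def map_chunk(path):
        """Map step: count words in one input file."""
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts.update(line.split())
        return counts

    def run(paths):
        with Pool() as pool:               # one worker per core by default
            partials = pool.map(map_chunk, paths)
        total = Counter()                  # reduce step: merge partial counts
        for c in partials:
            total.update(c)
        return total

    if __name__ == "__main__":
        print(run(["part-0000.txt", "part-0001.txt"]).most_common(5))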

They used this without revealing it to their (big) customers (who pay a lot
for the crappy system). Then they got a new version of their original system,
relying on Hadoop. Still, it was slower than the single-machine design.

Sometimes the bottleneck isn't in the hardware resources... Distributed
computing has many advantages, but I personally use it as a last resort (or I
prepare the architecture for scalability, trying never to hit the
two-machines-and-beyond case!).

------
necubi
This is a fairly uninteresting insight--if your data fits on one machine, use
one machine to process it!

But in many cases, your data does _not_ fit on one machine, even one with gobs
of RAM and SSDs. In my work, we receive hundreds of megs of data per second,
and our jobs run from hundreds of terabytes to petabytes of input data. That
is the scale where "big data" ceases to be a meaningless buzzword and becomes
a serious technical challenge, requiring distributed storage and processing
over hundreds of machines.

Really, this is about the devaluation of "big data" to mean anything over a
few gigabytes, which is a drastically simpler problem than peta-scale
computing.

------
jacques_chester
I was under the impression that "Big Data" is a moving line of "too big for
single systems". As system density increases, the Big Data threshold moves
too.

I don't see this ever being resolved for all time, but right now SSDs are the
one technology that has put disk-using designs back into the hunt.

Where single systems really shine is simplicity. Distributed systems push a
lot of complexity up into the solution space, and experience seems to suggest
that it cannot be safely abstracted away.

------
atdt
The scarcest resource in big data is analyst grey matter, not CPU cycles.
Distributed computing can be a bad choice for data analysis because it
transfers work from computers to humans: analysts have to spend time
explicitly thinking about how to express their queries in a form that can be
efficiently parallelized. The fact that every big data vendor advertises its
product as doing this work for you is a good sign that it's a pain point.
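
To make that concrete, here is a hypothetical illustration (not any vendor's
API): the same mean-per-key query written the way an analyst would naturally
express it, and then restated as explicit map and reduce steps.

    # Hypothetical illustration: mean value per key, two ways.
    from collections import defaultdict

    rows = [("a", 1.0), ("b", 2.0), ("a", 3.0)]   # made-up data

    # Single-machine version: the analyst just writes the query.
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
    means = {k: sums[k] / counts[k] for k in sums}

    # Parallelizable version: the analyst must restate the query as
    # map and reduce steps whose merge is associative -- emit partial
    # (sum, count) pairs, combine them, and divide only at the end.
    def map_row(row):
        key, value = row
        return key, (value, 1)

    def reduce_pair(a, b):
        return (a[0] + b[0], a[1] + b[1])

    partials = defaultdict(lambda: (0.0, 0))
    for key, pair in map(map_row, rows):
        partials[key] = reduce_pair(partials[key], pair)
    means_mr = {k: s / n for k, (s, n) in partials.items()}

    assert means == means_mr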

