

Why a Do it Yourself Big Data Stack Is a Better Option - jdrock
http://cloud.gigaom.com/2010/09/04/why-a-diy-big-data-stack-is-a-better-option/

======
jfager
Good article, but the point is that you should roll your own when you gain a
competitive advantage by doing so, and you have the talent to actually
execute. I would argue that if your situation matches these criteria, you'll
already know it. If you had to read a blog post to start thinking you should
write your own big data stack, you probably shouldn't.

~~~
gruseom
You can enjoy reading a blog post even if you've started thinking. I liked
hearing his reasoning and comparing it to our own.

~~~
jfager
Of course, I enjoyed the post as well, and I'm always interested to hear how
good companies are solving their problems. I just wanted to emphasize the
point that the situations where it's appropriate to roll your own are usually
pretty specific and obvious once you're down in the problem you're trying to
solve, and if you're a young startup that has never thought about it before,
there's probably no reason to spend the time and effort when there are a bunch
of other things you need to worry about.

------
akshayubhat
I could not understand several points made in this article:

1: He mentions Infochimps, but to my knowledge it's more of an eBay for
datasets than a supporter/provider of the big data stack. Also, is it even
successful? I'm unsure how Infochimps relates to the big data stack.

2: From what I remember reading about 80legs, it uses distributed grid
computing to run its crawlers (something like SETI@home). I doubt Hadoop was
ever designed for such applications, so this surely isn't a Hadoop use case.

3: Quoting:

    While the standard big data stack has made huge strides in making big data more accessible to everyone, it will always fall short against our stack when it comes to the cost of collecting data. We actually don’t store that much data. Because 80legs users can filter their data on the nodes, they’re able to return the minimum amount of data in their result sets. The processing (or reduction, pardon the pun) is done on the nodes. Actual result sets are very small relative to the size of the input set.

Again, I'm unsure how this is different from Hadoop. Hadoop uses the same
principle of "moving computation closer to the data," so a crawler implemented
using Hadoop (something Hadoop is not intended for) would also process and
store data locally rather than on some other node.
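
To make that concrete: the node-local filtering 80legs describes can also be
expressed as an ordinary Hadoop Streaming mapper. This is a hypothetical
sketch, not 80legs' or Hadoop's actual code; the tab-separated
`url<TAB>html` input format and the title-extraction logic are assumptions
chosen for illustration. The mapper runs where the data lives and emits only
a minimal (url, title) record instead of the full page body:

```python
import sys

def map_record(line):
    """Parse an assumed tab-separated (url, html) record and return a
    small (url, title) pair, or None to filter the record out entirely."""
    try:
        url, html = line.rstrip("\n").split("\t", 1)
    except ValueError:
        return None  # skip malformed records
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1 or end <= start:
        return None  # page has no title: emit nothing, shrinking the result set
    title = html[start + len("<title>"):end].strip()
    return (url, title)

if __name__ == "__main__":
    # Hadoop Streaming feeds records on stdin and reads key-value
    # output (tab-separated) from stdout.
    for raw in sys.stdin:
        result = map_record(raw)
        if result is not None:
            print("\t".join(result))
```

The point of the sketch is that the "reduction on the nodes" property the
article credits to the custom stack is exactly what a map-side filter gives
you in Hadoop: only the tiny filtered pairs cross the network.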

Also, he mentions "We have about 50,000 computers using their excess
bandwidth." 50,000?? The biggest Hadoop cluster that I know of (Yahoo's) has
~10-20k nodes, and Hadoop was never meant to be used at 50k scale for
crawling. So they had no option other than building their own system, and
that would still be true if they were building it today.

4: Quoting:

    One advantage is optimization — an “off-the-shelf” system is going to have some generalities built into it that can’t be optimized to fit your needs. The opportunity cost of going “standard” is a slew of competitive advantages.

The only issue I can think of with Hadoop is that it's written in Java;
otherwise it's an extremely extensible piece of software. Unless you are
designing a real-time messaging system or a distributed system for
high-frequency trading, Hadoop is good enough for most applications. Also,
what about the cost of finding programmers good enough to build such a
system? Another advantage of Hadoop is that under low load the remaining
nodes can be used for something else, maybe processing some data; with your
own solution that would be harder to do. Also, your IP and your secret sauce
aren't of much use if you don't have solid patents for them; otherwise they
mostly end up becoming a maintenance nightmare after the original engineers
cash out. Also, what if a big company already has a Hadoop cluster? It would
be even more difficult for them to integrate with your computing platform.

While I agree with the author's conclusion that a highly focused startup
should build its own proprietary solution, I can't agree with the evidence
behind that argument. A grid-based crawler with 50k machines isn't something
Hadoop was ever designed to support.

~~~
jdrock
A major point made in the article is that the standard big data stack does not
fit all big data _needs_.

There's a growing assumption that this stack is sort of all that's needed, and
that's just not the case.

------
delano
_most true competitive advantages are operational and cultural ones, contrary
to popular thinking_

That's true. Technology people tend to focus on the development side: creating
something new and novel rather than running and maintaining something that
already exists. Business people do pay more attention to operational concerns,
but from a cost-cutting perspective. In other words, both sides tend to see
operations as a loss rather than an opportunity.

This is probably changing though. It'll take a bit longer to know for sure,
but that could be what devops is really about.

~~~
siculars
I'm not sure how true that is. Take a look at recent companies like Zappos and
Amazon, or older ones like WalMart and FedEx. Operations and logistics are
their bread and butter. It was practically Dell's entire reason for existence
(until they fell into the abyss).

~~~
anamax
Operations and logistics are things that a company can control. The bit about
successful retail chains being real estate operations with store sidelines
also applies.

------
rarrrrrr
If you were considering rolling your own large storage infrastructure, a
reasonable place to start might be <https://spideroak.com/diy/> (entirely open
source.)

------
moonpolysoft
Just for the love of god don't write another dynamo clone. The world has
enough.

