

LexisNexis open sources Hadoop challenger - cpeterso
http://www.theregister.co.uk/2011/06/15/thor_roxy_hadoop_challenge/

======
stephenjudkins

      "We are four faster than Hadoop on the Thor side. If Hadoop needs 1,000 nodes we can do it with 250 – that means less cooling and data center space."
    

Many (most?) Hadoop jobs are IO-bound. It's unlikely that switching from Java
to C++ could speed those up 4x.
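
To illustrate, here is a minimal sketch (the class name and the "ERROR" pattern are made up) of the kind of job I mean: a grep-style Hadoop mapper whose per-record CPU work is a single substring test. Jobs like this spend their time reading splits off disk and shuffling over the network, which a faster language runtime does little to help.

    // Hypothetical grep-style mapper: trivial CPU work per record,
    // so disk and network I/O dominate the job's run time.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text matchKey = new Text("ERROR"); // assumed pattern

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // One substring test per input line -- negligible CPU.
            if (line.toString().contains("ERROR")) {
                context.write(matchKey, ONE);
            }
        }
    }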

We should reserve judgement until these things are actually open-sourced.

~~~
srean
I won't be surprised at all by 4x over Hadoop; if anything, I would be a
little underwhelmed.

What follows is an anecdotal data point. I have run similar code on similarly
sized input on a Yahoo configuration of Hadoop (which is written in Java) and
on Google's MapReduce (which is written in C++). Google's implementation
clearly ran in 4x to 5x less time. The difference was between my job finishing
in 2.5 hours and having to check the result the next day.

I would expect Yahoo's setup to be a fairly well tuned one too. That's
significant, because it takes a fair bit of tuning to get good performance
out of Hadoop. Google's setup might require tuning as well, but I have not
come across anyone who has set it up personally, so I have no idea how
difficult or easy it is/was. Furthermore, my Yahoo experience was much more
recent, so I believe the Hadoop installation would have had the advantage of
benefiting significantly from recent advances in the JVM. All in all, Google's
MapReduce was quite a pleasure to use (except when the nodes kept failing,
which did happen), and while I cannot vouch for a similar level of pleasure
with Hadoop, I saw no node crashes there (again, keep in mind this is
anecdotal).

This probably has more to do with the design than with C++ vs. Java, but the
language cannot be ruled out entirely. I can well imagine the JVM-based
solution being more memory intensive and hence running fewer jobs per node. A
possible explanation for the absence of node crashes on the Yahoo side is that
their clusters typically use newish, close-to-cutting-edge, expensive servers,
whereas Google's philosophy has been to use cheaper servers and make up for it
in the software stack. Then again, it could be that the data center where I
was running my jobs was simply having issues that month.
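
As a rough, self-contained illustration of the memory point (not a benchmark;
the record shape and the 16-byte packed-struct baseline are assumptions):
boxing each record in a small JVM object adds headers and references on top of
the raw payload, so a node with fixed RAM fits fewer concurrent tasks.

    // Rough illustration (not a benchmark): heap cost of boxing small
    // records on the JVM versus their raw payload size.
    import java.util.ArrayList;
    import java.util.List;

    public class HeapOverhead {
        // A tiny "record" of the kind a shuffle buffer might hold.
        static final class Record {
            final String key;   // object header + reference + char data
            final long value;
            Record(String key, long value) { this.key = key; this.value = value; }
        }

        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();

            int n = 5_000_000;
            List<Record> records = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                records.add(new Record("key" + (i % 1000), i));
            }

            System.gc();
            long after = rt.totalMemory() - rt.freeMemory();
            long rawBytes = n * 16L; // ~16 bytes/record in a packed C struct (assumed)
            System.out.printf("JVM heap: %d MB, raw payload: %d MB%n",
                    (after - before) >> 20, rawBytes >> 20);
        }
    }

On a typical 64-bit JVM the boxed layout can come out several times larger
than the raw payload, which matches the intuition that a C++ worker can pack
more concurrent tasks onto the same node.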

There was a recent post on HN about an ex-Googler complaining that Google's
infrastructure (its filesystem, etc.) is old in comparison to Hadoop. Old it
probably is, but in my experience it is not lacking in performance in any way;
in fact it is quite superior. Again, this is in no way a benchmark.

EDIT: Not sure what I am allowed to say, but you won't be too wrong if you
assume that the Yahoo servers were top of the line a couple of years ago,
whereas my time on the Google system was in the pre-multi-core, pre-64-bit
era. I don't remember anything about disk sizes, though.

~~~
gojomo
You say similar input and similar code... but don't mention similar
cores/RAM/disks. Do you know if those were similar, as well?

------
jchrisa
"The risk business is worth $10m a year" - typo or British m? Seems like it
would be much bigger than this. Or maybe this is just what LexisNexis makes on
their old product?

~~~
hollerith
A British million is the same as an American million. It's "billion" that has
a different definition on the other side of the pond.

------
Todd
There's no mention of the distinction between the file system and the compute
portion. Also, no mention of the particular computational approach (e.g.,
map/reduce). Arguably one of Hadoop's big benefits is HDFS.
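
For what it's worth, HDFS is usable on its own, without the MapReduce layer
on top; here is a minimal sketch (the namenode address and file path are made
up):

    // Minimal sketch: reading a file straight from HDFS, no MapReduce involved.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical cluster address and file path.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
                System.out.println(in.readLine());
            }
        }
    }

So any challenger needs an answer for the storage layer as well as the compute
layer.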

------
bdb
Awesome.

Here's a link to the actual project: <http://hpccsystems.com/>

~~~
joshu
Anything there yet?

~~~
bdb
No source yet, but there's a demonstration VM.

Unfortunately, it looks like they decided to license the code under the
AGPL[1], so it will probably be of little use to most of us.

[1] <http://hpccsystems.com/print/518>

~~~
joshu
I'm not really sure what AGPL means for a subcomponent like that. MongoDB is
similar. Does that just mean you have to share modifications to the component
itself?

~~~
skorgu
That's how the MongoDB guys interpret it[1]:

      To say this another way: if you modify the core database source code, the goal is that you have to contribute those modifications back to the community.

      Note however that it is NOT required that applications using mongo be published.

[1] <http://blog.mongodb.org/post/103832439/the-agpl>

------
hdeo
1> You can find many analysts, and many more engineers, with Hadoop skills.

2> Hardware & power costs are not huge (even assuming Hadoop is slower), at
least until you reach massive scale.

