I'd been a fan of Sector/Sphere and Dr. Gu for a while. But here is the biggest problem with the benchmark: "replication is disabled" for maximum performance, which means these are not realistic benchmarks for large-scale clusters (> 200 nodes), where node failure is the norm. I suspect that Sector's replication algorithm is not up to par with HDFS's, which uses fairly sophisticated replication pipelining for maximum throughput.
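To make the pipelining point concrete, here is a minimal sketch (my own toy, not HDFS code) of the idea: the writer streams packets to the first replica only, and each replica stores the packet and then forwards it to the next node in the chain, so no single link has to carry three full copies of the block. All names and sizes are made up.

```python
# A minimal sketch of pipelined replication (my own toy, not HDFS code):
# the writer streams packets to the first replica only; each replica
# persists the packet and forwards it to the next node in the chain.

class Replica:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next_node = next_node
        self.storage = bytearray()

    def receive(self, packet):
        self.storage += packet                # persist locally...
        if self.next_node:
            self.next_node.receive(packet)    # ...then forward down the pipeline


def pipeline_write(block, replica_names, packet_size=64 * 1024):
    # Build the chain dn1 -> dn2 -> dn3; the writer only talks to the head.
    chain = None
    for name in reversed(replica_names):
        chain = Replica(name, chain)
    for i in range(0, len(block), packet_size):
        chain.receive(block[i:i + packet_size])
    return chain


if __name__ == "__main__":
    head = pipeline_write(b"x" * 1_000_000, ["dn1", "dn2", "dn3"])
    print(len(head.storage))                  # 1000000: the full block reached every replica
```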
The most interesting thing about Sector/Sphere is that it doesn't use TCP/IP for transport but its own UDT stuff, which has its own pluses (more efficient on high-latency links) and minuses (stability issues due to lack of wide usage/testing).
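A rough way to see why a custom transport can win on high-latency links: stock TCP throughput degrades with round-trip time and packet loss. The sketch below uses the well-known Mathis et al. approximation (throughput ≈ MSS / (RTT · √loss)) with illustrative numbers; it says nothing about UDT's actual performance, only why there is room to beat plain TCP on long fat pipes.

```python
# Back-of-the-envelope for why stock TCP struggles on long, lossy links,
# using the Mathis et al. approximation: throughput ~ MSS / (RTT * sqrt(loss)).
# The numbers are illustrative only; this is not a measurement of UDT.

import math

def tcp_throughput_mbps(mss_bytes=1460, rtt_s=0.1, loss=1e-4):
    """Approximate steady-state TCP throughput in Mbit/s."""
    return mss_bytes / (rtt_s * math.sqrt(loss)) * 8 / 1e6

for rtt_ms in (1, 10, 100):
    print(f"RTT {rtt_ms:3d} ms -> ~{tcp_throughput_mbps(rtt_s=rtt_ms / 1000):7.1f} Mbit/s")
```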
As noted in something I posted yesterday, Hadoop is about being able to scale linearly, not raw performance, so it's no surprise that it underperforms here.
The really interesting question is how Sector/Sphere scales and how well it deals with outages.
Probably fairly similar to Hadoop. From what I've learned thus far, dealing with outages pretty much boils down to replication. Judging by http://sector.sourceforge.net/tech.html, they can scale and replicate, and have some other niceties to boot.
> Sector provides automatic failover for Map style data processing. When a slave fails during the processing of a data segment, another slave will be assigned for the same data segment to continue processing.
But that does not say whether the 'read' or 'write' that failed will be restarted or whether it starts the whole segment from scratch.
Which is what's interesting in Hadoop: it apparently (I haven't tried that) keeps chugging along as if literally nothing happened, so from the application's point of view the fault never occurred.
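Here is a toy sketch (not Hadoop's or Sector's code) of what that framework-level transparency might look like: the scheduler catches the worker failure and simply reruns the whole segment elsewhere, so the user-supplied map function never sees the fault. Whether a real system restarts the whole segment or resumes mid-stream is exactly the granularity question above.

```python
# A toy sketch (not Hadoop or Sector code) of framework-level retry: if a
# worker dies mid-segment, the scheduler hands the whole segment to another
# worker, so the user's map function never notices the fault.

import random

def run_segment(segment, workers, max_attempts=5):
    """Run one map task, reassigning the whole segment on worker failure."""
    for _ in range(max_attempts):
        worker = random.choice(workers)
        try:
            return worker(segment)          # the user-supplied map runs here
        except ConnectionError:
            # Worker presumed dead: restart the segment from scratch elsewhere.
            # Note this redoes any partial work, which is exactly the
            # granularity question raised above.
            continue
    raise RuntimeError("segment failed on every attempt")

def flaky_worker(segment):
    if random.random() < 0.3:               # simulate a node failure
        raise ConnectionError("worker lost")
    return [x * x for x in segment]         # the actual "map" work

if __name__ == "__main__":
    print(run_segment(range(5), workers=[flaky_worker] * 4))
```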
Hm. I see your point. I guess what it boils down to is how frequently the nodes report their status back to the master (or whatever it's called) so that it can determine which operations have succeeded and which are still pending or have failed.
I wonder how much of a tax it is to report the result of each operation sequentially vs. doing it in a batch at the end of the segment. If the difference in performance is because Hadoop reports sequentially while Sector reports in batches, it would be nice to have something that gives you a sliding bar to choose a point along the gradient between failover optimization and optimistic optimization, where failovers are few and far between.
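A hypothetical sketch of that sliding bar, assuming a made-up process_segment/report interface rather than anything from Hadoop or Sector: a single batch parameter moves you between per-operation reporting (cheap recovery, more chatter with the master) and one report per segment (less chatter, but a failure loses the whole segment's progress).

```python
# Hypothetical sketch of the sliding bar: report progress to the master
# every `batch` operations. batch=1 is the eager, per-operation extreme;
# batch=len(segment) is the optimistic, one-report-per-segment extreme.
# The interface here is invented, not Hadoop's or Sector's.

def process_segment(segment, report, batch=1):
    done = 0
    for op in segment:
        do_operation(op)                    # the real work would happen here
        done += 1
        if done % batch == 0:
            report(done)                    # checkpoint progress with the master
    if done % batch:
        report(done)                        # flush the final partial batch

def do_operation(op):
    pass                                    # placeholder for the actual work

if __name__ == "__main__":
    reports = []
    process_segment(range(10), reports.append, batch=4)
    print(reports)                          # [4, 8, 10]
```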
If you need 5 machines to get the same performance as Sector/Sphere on 1 machine, then that is a problem. The question is at which point Hadoop starts to outperform Sector/Sphere (assuming that Hadoop scales better). If the answer is 10,000 machines, then Sector/Sphere is better for almost everyone.
This is the same problem as with Haskell and parallelism on functional data structures. Maybe you can write a linearly scaling parallel quicksort on linked lists in Haskell, but I can write an array-based C sort that's 1000x faster. You could say that that's unfair and use arrays in Haskell, but now parallelism isn't so easy anymore.
Yes, that's exactly the question. But knowing that there might be a crossover point is not the same as knowing that there is one; it's quite possible there is none. I'm wondering whether there even is a scenario under which Hadoop would outperform Sector/Sphere.
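Just to make the crossover question concrete, here is a back-of-the-envelope sketch with entirely invented parameters: give Sector/Sphere a 5x per-node edge but a larger serial/coordination fraction, model both with Amdahl's law, and see where the curves cross (around 268 nodes with these numbers). The point is only that a crossover can exist, not that these figures describe either system.

```python
# Back-of-the-envelope for the crossover question, with invented parameters:
# Sector/Sphere gets a 5x per-node edge but a larger serial/coordination
# fraction; both are modeled with Amdahl's law to find where Hadoop overtakes.

def amdahl_throughput(nodes, per_node, serial_fraction):
    """Aggregate throughput when a fixed fraction of the work can't be parallelized."""
    speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)
    return per_node * speedup

def crossover(max_nodes=100_000):
    for n in range(1, max_nodes + 1):
        hadoop = amdahl_throughput(n, per_node=1.0, serial_fraction=0.001)
        sector = amdahl_throughput(n, per_node=5.0, serial_fraction=0.02)
        if hadoop > sector:
            return n
    return None                             # no crossover within the range

if __name__ == "__main__":
    print(crossover())                      # 268 with these made-up numbers
```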
The only time I tried Hadoop I was so underwhelmed by the performance that I immediately decided it wasn't for us. For a while GlusterFS looked like it might be a winner, but I'm getting more and more skeptical over time.
Pohmelfs (in the Linux kernel) is an absolute pain to get working at all. So anything that gives us a cluster file system with good failover will get a look, provided the performance is acceptable compared to what we do today (a custom-built CDN).