

Sector/Sphere is significantly faster than Hadoop - yarapavan
http://sector.sourceforge.net/benchmark.html

======
vicaya
I'd been a fan of Sector/Sphere and Dr. Gu for a while. But here is the
biggest problem of the benchmark: "replication is disabled" for maximum
performance, which means these are not realistic benchmarks for large scale
clusters (> 200 nodes), where node failure is the norm. I suspect that
Sector's replication algorithm is not up to par with HDFS', which uses fairly
sophisticated replication pipelining for maximum throughput.
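
To put rough numbers on why pipelining helps throughput, here is a back-of-envelope sketch. It is a toy cost model, not HDFS or Sector code: the function names and parameters are made up for illustration, and it assumes every link has the same bandwidth while ignoring latency, acks, and disk speed.

```python
def fanout_time(size, bandwidth, replicas):
    """Writer sends a full copy to each replica over its single uplink."""
    return replicas * size / bandwidth


def pipeline_time(size, bandwidth, replicas, packet):
    """Writer sends one copy; each replica forwards packets to the next.

    The last replica finishes one packet-time later per extra hop.
    """
    return size / bandwidth + (replicas - 1) * packet / bandwidth


if __name__ == "__main__":
    # Hypothetical numbers: 64 units of data, 1 unit/s links, 3 replicas.
    print(fanout_time(64, 1, 3))            # writer uplink carries 3 copies
    print(pipeline_time(64, 1, 3, 0.0625))  # writer uplink carries 1 copy
```

Under these assumptions the pipelined write costs about the same as writing a single copy, while the fan-out cost grows with the replica count.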

The most interesting thing about Sector/Sphere is that it doesn't use TCP for
transport but its own UDT protocol (built on top of UDP), which has its own
pluses (more efficient on high latency links) and minuses (stability issues
due to lack of wide usage/testing).

------
jacquesm
As noted in something I posted yesterday, Hadoop is about being able to scale
linearly, not raw performance, so it's no surprise that it underperforms.

The really interesting question is how Sector/Sphere scales and how well it
deals with outages.

~~~
lzimm
Probably fairly similar to Hadoop. From what I've learned thus far, dealing
with outages pretty much boils down to replication. From
<http://sector.sourceforge.net/tech.html> it looks like they can scale and
replicate, and have some other nicenesses to boot.

~~~
jacquesm
> Sector provides automatic failover for Map style data processing. When a
> slave fails during the processing of a data segment, another slave will be
> assigned for the same data segment to continue processing.

But that does not say whether the 'read' or 'write' that failed will be
restarted or whether it starts the whole segment from scratch.

That is what's interesting about Hadoop: it apparently (I haven't tried it)
keeps chugging along as if literally nothing happened, so from the
application's point of view the fault never occurred.

~~~
lzimm
Hm. I see your point. I guess what it boils down to is how frequently the
nodes report their status back to the master (or whatever it's called) so that
it can determine which operations have succeeded and which are still
pending/failed.

I wonder how much of a tax it is to report the result of each operation
as it completes vs. reporting in a batch at the end of the segment. If the
difference in performance is because Hadoop reports per operation while Sector
batches, it would be nice to have something that gives you a sliding bar to
choose somewhere along the gradient between failover optimization and
optimistic optimization, where your failovers are few and far between.
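
A minimal sketch of that sliding bar, purely as a thought experiment: it does not describe how Hadoop or Sector actually report progress. The function `run_segment`, the `report_every` knob, and the cost figures are all assumptions made up for illustration. A worker processes a segment, reports to the master every `report_every` operations, and if it dies the replacement resumes from the last reported point.

```python
def run_segment(n_ops, report_every, fail_after=None,
                op_cost=1.0, report_cost=0.25):
    """Total cost to finish a segment, including work lost to one failure.

    fail_after: index of the operation at which the first worker dies
    (None means no failure). Unreported work is redone by the replacement.
    """
    cost = 0.0
    committed = 0   # operations the master knows are done
    failed = False
    while committed < n_ops:
        pos = committed              # a (new) worker resumes here
        while pos < n_ops:
            if fail_after is not None and not failed and pos == fail_after:
                failed = True        # worker dies; unreported work is lost
                break
            cost += op_cost
            pos += 1
            if pos - committed == report_every or pos == n_ops:
                cost += report_cost  # status report to the master
                committed = pos
    return cost


if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, run_segment(100, k), run_segment(100, k, fail_after=99))
```

With these made-up costs, reporting every operation is the slowest setting when nothing fails but loses almost nothing on a failure, while reporting once per segment is the cheapest until a late failure forces the whole segment to be redone.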

