

Data sorting world record: 1 terabyte, 1 minute - buluzhai
http://scienceblog.com/36957/data-sorting-world-record-falls-computer-scientists-break-terabyte-sort-barrier-in-60-seconds/

======
extension
BTW, these are 100 byte records, so 1TB = 10 billion records.

I don't know why they measure in bytes, which is nearly meaningless. I can
sort a single 1TB record in my head right now. There, I'm done.

~~~
adataminer
The size is given in bytes because it tells you how many machines it would take
to hold the data in memory, how much bandwidth is required to move the data
around, and what kind of latency to expect.
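
As a rough sketch of that reasoning (using the 24 GB-per-node figure quoted from the article further down the thread; the fraction of RAM actually usable for data is just a guess for illustration):

```python
# Back-of-envelope: total bytes, not record count, is what sizes the cluster.
# 24 GB per node is the figure quoted from the article; the usable fraction
# of RAM is only an assumption for illustration.
TOTAL_BYTES = 10**12            # 1 TB of input
RAM_PER_NODE = 24 * 10**9       # 24 GB per node
USABLE_FRACTION = 0.5           # assume roughly half the RAM is free for data

nodes_needed = TOTAL_BYTES / (RAM_PER_NODE * USABLE_FRACTION)
print(f"~{nodes_needed:.0f} nodes just to hold 1 TB in memory")
# ~83 nodes at 50% usable RAM, ~42 at 100%; the record run used 52 nodes.
```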

It may be hard to appreciate if you are just a programmer, but a computer
science graduate can easily relate to a size in bytes, since it represents an
exact amount rather than a vague "billions of entries".

In the latter case you need to specify two variables: the size of an entry and
the number of entries.

For a fixed total size, the larger the entries become, the easier they are to
sort, since there are fewer of them; and if the keys get small enough you can
sort in O(n).
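
As a minimal illustration of the O(n) point, here is a counting sort over a small key space (purely a sketch; it is not the algorithm used for the record):

```python
# Counting sort: O(n + k) when records are keyed by a small integer in [0, k).
# Purely illustrative -- not what the record-setting system does.
def counting_sort(records, key, key_range=256):
    buckets = [[] for _ in range(key_range)]  # one bucket per possible key value
    for r in records:
        buckets[key(r)].append(r)             # O(n) scatter, preserves input order
    return [r for b in buckets for r in b]    # O(n + k) gather, so the sort is stable

data = [(17, "c"), (3, "a"), (3, "b"), (255, "d")]
print(counting_sort(data, key=lambda r: r[0]))
# [(3, 'a'), (3, 'b'), (17, 'c'), (255, 'd')]
```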

The 100-byte record size is fixed in the literature and well understood by
practitioners; what you are experiencing is your naivete.

------
nl
That's pretty impressive. Back in 2008, Google did a 1TB sort in 68 seconds
([http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapr...](http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html)),
breaking the previous record of 209 seconds (which I think was done on Yahoo's
Hadoop cluster).

I expect these aren't quite comparable (the OP's link talks about the data
being non-generalized), but it's interesting how achievable these
high-performance numbers are becoming.

------
binarymax
Here is a link to the technical specifics from the TritonSort team (PDF):
<http://sortbenchmark.org/tritonsort_2010_May_15.pdf>

~~~
jasondavies
Here's Google's cached copy:
[http://webcache.googleusercontent.com/search?q=cache:2fmfGIe...](http://webcache.googleusercontent.com/search?q=cache:2fmfGIe5xN8J:sortbenchmark.org/tritonsort_2010_May_15.pdf+tritonsort_2010_May_15.pdf&cd=2&hl=en&ct=clnk&gl=uk)

------
agentultra
_To break the terabyte barrier for the Indy Minute Sort, the computer science
researchers built a system made up of 52 computer nodes. Each node is a
commodity server with two quad-core processors, 24 gigabytes (GB) memory and
sixteen 500 GB disks — all inter-connected by a Cisco Nexus 5020 switch._

That's some pretty impressive hardware. Must've been pretty fun. Not something
just anyone gets to work on. :)
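
For scale, the quoted numbers imply roughly the following sustained throughput, assuming the 1 TB is read from disk and written back once each within the 60 seconds (that single-pass assumption is mine, not from the article):

```python
# Rough throughput implied by the quoted setup. The "one read pass plus one
# write pass" assumption is mine; node and disk counts are from the article.
TOTAL_BYTES = 10**12
SECONDS = 60
NODES = 52
DISKS_PER_NODE = 16

io_bytes = 2 * TOTAL_BYTES                 # one full read pass + one full write pass
per_node = io_bytes / SECONDS / NODES      # sustained bytes/s each node must move
per_disk = per_node / DISKS_PER_NODE       # sustained bytes/s each disk must move
print(f"per node: {per_node/1e6:.0f} MB/s, per disk: {per_disk/1e6:.0f} MB/s")
# per node: ~641 MB/s, per disk: ~40 MB/s
```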

------
stuff4ben
It's hard to tell whether the record is being set thanks to better hardware or
better algorithms. I'd like to see benchmark improvements come from algorithms
rather than hardware, but then I'm a software guy.

~~~
javanix
I was under the impression (I could be remembering this wrong, however) that
comparison-based sorting algorithms have a proven lower bound of Ω(n log n)
comparisons, and that we already have algorithms that meet it.

If that is correct, most improvements would probably come from hardware and
software use-case tuning.
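
For reference, the proven bound is the decision-tree argument for comparison sorts: the algorithm must distinguish all n! input orderings, so it needs at least log2(n!) comparisons in the worst case, which grows like n·log2(n). A quick numeric check:

```python
import math

# log2(n!) is the minimum number of worst-case comparisons for any
# comparison-based sort (decision-tree argument); it tracks n*log2(n).
for n in (10, 1_000, 1_000_000):
    log2_factorial = math.lgamma(n + 1) / math.log(2)   # log2(n!) via ln(n!)
    print(f"n={n:>9}: log2(n!) ~ {log2_factorial:,.0f}, n*log2(n) ~ {n * math.log2(n):,.0f}")
```

Non-comparison sorts (radix, counting) sidestep this bound, which is the O(n) point made upthread.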

~~~
sp332
Big-O notation is a theoretical tool; it's somewhat useful in practice, but it
won't necessarily tell you which of two algorithms is faster. It doesn't tell
you about cache performance, memory requirements, or even whether there's a
large coefficient on that n * log(n) term.
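
For example, a textbook pure-Python mergesort and the built-in sorted() are both O(n log n), yet the constant factors differ by well over an order of magnitude. A small timing sketch (exact numbers will vary by machine):

```python
import random, timeit

# Both routines are O(n log n); the gap below is entirely constant factor
# (interpreter overhead, allocations, cache behaviour), which big-O ignores.
def mergesort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = mergesort(a[:mid]), mergesort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

data = [random.random() for _ in range(100_000)]
print("pure-Python mergesort:", timeit.timeit(lambda: mergesort(data), number=3))
print("built-in sorted():    ", timeit.timeit(lambda: sorted(data), number=3))
```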

