With only 2TB of local storage, your 4TB dataset (or output set) has to pass over the network for every execution.
At 10Gb/s, that alone takes on the order of an hour: 4TB is 32Tb, so ~53 minutes at full line rate, and real transfers rarely hit line rate.
If you then want to write it to a local SSD, multiply that by 3x to 10x depending on the drive's sustained write speed.
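Back-of-envelope in Python; the sustained write rates below are illustrative assumptions, not measurements (drives vary wildly, especially once their SLC caches fill):

```python
DATASET_BYTES = 4 * 10**12         # 4 TB
NET_BITS_PER_S = 10 * 10**9        # 10 Gb/s link

net_s = DATASET_BYTES * 8 / NET_BITS_PER_S
print(f"network transfer: {net_s / 60:.0f} min")        # ~53 min at line rate

# Hypothetical sustained write rates -- real drives vary a lot.
for label, bps in [("SATA SSD (~400 MB/s)", 0.4e9),
                   ("QLC NVMe after cache (~150 MB/s)", 0.15e9)]:
    ssd_s = DATASET_BYTES / bps
    print(f"{label}: {ssd_s / 60:.0f} min ({ssd_s / net_s:.1f}x the transfer)")
```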
Even once the 4TB is in memory, it is still funneled through something like 4 x 30MB of L3 cache, which means a single-threaded implementation will be slowed down 3x to 5x by memory latency.
A multi-core implementation will probably suffer even more, since every core is contending for the same caches and memory bandwidth.
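A minimal sketch of the effect, assuming NumPy; the sizes and the exact ratio are machine-dependent, but the gap between streaming and latency-bound access is where the slowdown above comes from:

```python
import time
import numpy as np

N = 10**8                           # ~800 MB of int64 -- far beyond any L3
data = np.arange(N, dtype=np.int64)

t0 = time.perf_counter()
_ = data.sum()                      # sequential scan: prefetcher-friendly
t_seq = time.perf_counter() - t0

perm = np.random.permutation(N)     # random order defeats the prefetcher
t0 = time.perf_counter()
_ = data[perm].sum()                # gather: pays DRAM latency per load
t_rand = time.perf_counter() - t0

print(f"sequential {t_seq:.3f}s vs random {t_rand:.3f}s "
      f"(~{t_rand / t_seq:.0f}x slower)")
```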
Distributed systems are VERY hard, but dealing with them is inevitable for certain workloads.
That's what I mean. You have to get really big to have Big Data. Fine, you can't do PageRank on a petabyte of web crawls on one machine. But the datasets people actually use for benchmarking, at least the benchmarks that are made public, you definitely can. You can go far larger than a typical benchmark dataset and still do it on one machine.