A single terabyte is a few orders of magnitude from what you need big-data-anything for. You could probably work with that just fine on your average 64GB RAM desktop with an SSD.

Another poster already replied with a decent refutation of this claim, but a single pass over a TB of data is often not enough for 'big data' use cases, and at tens of minutes per pass it may very well be infeasible to operate on such a dataset with only 64GB of memory.

In the machine learning world, some of the algorithms that are industrial workhorses will require you to have your dataset in memory (e.g. all the common GBM libraries), and will walk over it lots of times.

You may be able to perform some gymnastics and allow the OS to swap your terabyte+ dataset around inside your 64GB of RAM, but the algorithms are now going to take forever to complete as you thrash your swap constantly while the training algorithm is running.

tl;dr - to train a model on a terabyte dataset in the machine learning context, you may very well need that much RAM, plus some overhead.

A small computer with 1 SSD will take at least 10-20 minutes to make a pass over 1TB of data, if everything is perfectly pipelined.

Samsung claims their 970 Pro NVMe can read 3.5GB/s sequentially. That's about 300 seconds or 5 minutes per TB.
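For anyone who wants to check that figure, here is the arithmetic as a rough sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and that the drive sustains its rated sequential speed for the whole pass; `seconds_per_pass` is just an illustrative helper name.

```python
TB = 10**12  # decimal terabyte (assumption: vendor-style units)
GB = 10**9   # decimal gigabyte


def seconds_per_pass(dataset_bytes: float, bytes_per_second: float) -> float:
    """Time for one sequential pass, assuming the rated speed is sustained."""
    return dataset_bytes / bytes_per_second


t = seconds_per_pass(1 * TB, 3.5 * GB)
print(f"{t:.0f} s, or {t / 60:.1f} min per TB")  # 286 s, or 4.8 min per TB
```

Multiply by the number of passes your algorithm makes to get a lower bound on wall-clock time.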

It can't though.

It can, and their fastest enterprise SSD can write at that speed too, or do sequential reads at 7-8GB/s, or random reads at over 4 GB/s.

I just ran `time cp /dev/nvme0n1 /dev/null` on the 1TB 970 Pro. The result:

  real    4m50.724s
  user    0m2.001s
  sys     3m10.282s

So with literally zero optimization effort, we've hit the spec (and saturated a PCIe 3.0 x4 link).
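To turn that `time` output into a throughput number, something like the sketch below works. It assumes the device holds a nominal 10^12 bytes (the real block device size may differ a little), and `parse_time` is a hypothetical helper that only handles the `XmY.YYYs` format shown above.

```python
def parse_time(field: str) -> float:
    """Convert a `time`-style field such as '4m50.724s' to seconds."""
    minutes, seconds = field.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


DEVICE_BYTES = 10**12  # assumption: nominal 1 TB, not the exact device size

real = parse_time("4m50.724s")
sys_time = parse_time("3m10.282s")

throughput_gb_s = DEVICE_BYTES / real / 1e9
print(f"{throughput_gb_s:.2f} GB/s sequential read")
print(f"{sys_time / real:.0%} of wall time spent in the kernel")
```

Under that size assumption it comes out to roughly 3.4 GB/s, which is in line with the 3.5GB/s spec.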

Impressive performance for a $345 consumer grade SSD.


That's impressive and all, but with any fragmentation or non-linear access, performance will fall off a cliff.

You'd probably be surprised. For reads, there are tons of drives that will saturate PCIe 3.0 x4 with 4kB random reads. Throughput is a bit lower because of more overhead from smaller commands, but still several GB/s. Fragmentation won't appreciably slow you down any further, as long as you keep feeding the drive a reasonably large queue of requests (so you do need your software to be working with a decent degree of parallelism).
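To illustrate what "keep feeding the drive a reasonably large queue of requests" can look like in software, here is a minimal sketch that issues block-aligned random reads from a thread pool. It is not a proper benchmark: the function name and defaults are made up for illustration, it expects a regular file at least one block long, and the reads go through the page cache, so repeated runs will mostly measure RAM. A dedicated I/O benchmarking tool with direct I/O is the right way to get real numbers.

```python
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor


def random_read_throughput(path, block_size=4096, n_reads=10_000, workers=32):
    """Issue n_reads block-aligned random reads using `workers` threads.

    Returns bytes read per second. Illustrative only: reads are buffered,
    so cached data will inflate the result.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        n_blocks = os.fstat(fd).st_size // block_size
        offsets = [random.randrange(n_blocks) * block_size for _ in range(n_reads)]

        def read_block(offset):
            # os.pread takes an explicit offset, so threads can share one fd.
            return len(os.pread(fd, block_size, offset))

        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            total = sum(pool.map(read_block, offsets))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total / elapsed
```

Comparing `workers=1` against `workers=32` on a large, uncached file should show how much the drive depends on having many requests in flight.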

What will cause you serious and unavoidable trouble is if you cannot structure things to have any spatial locality. If you only want one 64-bit value out of the 4kB block you've fetched, and you'll come back later another 511 times to fetch the other 64-bit values in that block, then your performance deficit relative to DRAM will be greatly amplified (because your DRAM fetches would be 64B cachelines fetched 8x each instead of 4kB blocks fetched 512x each).
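The amplification numbers above work out as follows. This is a worst-case sketch that assumes zero locality and nothing staying cached between accesses; `read_amplification` is an illustrative name.

```python
VALUE_BYTES = 8  # one 64-bit value


def read_amplification(fetch_bytes: int, useful_bytes: int = VALUE_BYTES) -> int:
    """Bytes transferred per useful byte when each fetch yields one value."""
    return fetch_bytes // useful_bytes


ssd = read_amplification(4096)  # 4kB block per 64-bit value
dram = read_amplification(64)   # 64B cacheline per 64-bit value

print(ssd, dram, ssd // dram)  # 512 8 64
```

So on top of the raw latency and bandwidth gap, the SSD moves 64x more data than DRAM for the same useful work in this access pattern.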

