
Google seeks new disks for data centers - dbcooper
http://googlecloudplatform.blogspot.com/2016/02/Google-seeks-new-disks-for-data-centers.html
======
tallanvor
That blog post doesn't really say anything. You have to click through a couple
of links to find the real paper:
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44830.pdf)
which has some interesting ideas, although it's a bit too brief for the
subject, I think.

------
davidiach
"For example, for YouTube alone, users upload over 400 hours of video every
minute, which at one gigabyte per hour requires more than one petabyte (1M GB)
of new storage every day or about 100x the Library of Congress. As shown in
the graph, this continues to grow exponentially, with a 10x increase every
five years."

This blew my mind. What does this even mean in terms of logistics? How many
people do you need just to add all those hard drives? How many new
datacenters do you need to build every 5 years?

~~~
nostrademons
It's actually not that much. Current consumer desktop hard drives top out at
8T (you can get enterprise RAID boxes of up to 48T now, but leave them out of
the equation). 1P/day ~= 128 hard drives, so at 4min/drive, that's one person.
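
Back-of-the-envelope, with assumed figures (8TB drives, 4 minutes to rack each one), it works out to something like:

    # Rough sketch of the arithmetic above; capacity and swap time are assumptions.
    PB_PER_DAY = 1e15         # ~1 PB of new storage per day
    DRIVE_BYTES = 8e12        # 8 TB consumer drive
    MIN_PER_DRIVE = 4         # assumed time to physically add one drive

    drives_per_day = PB_PER_DAY / DRIVE_BYTES           # ~125 drives/day
    labor_hours = drives_per_day * MIN_PER_DRIVE / 60   # ~8.3 hours: one person's shift
    print(round(drives_per_day), round(labor_hours, 1))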

Of course, that's just YouTube, and Google has many other needs for data. But
people forget just how big the denominators are on this quantity, and how
effective Kryder's Law has been. They also forget how much reserve capacity
there is in human labor; Google's datacenters have tiny employee counts
because they are so automated, and could easily scale up into the exabyte/day
range.

A more interesting question is what the differential rates of Kryder's Law vs.
Moore's Law will do to how we architect software. Already, people in the know
say that "disk is the new tape" \- disk drive capacity has been increasing
much faster than seek times, bus bandwidth, and available processing power,
which means that you have to start treating the drive as a sequential storage
device and not as a random-access platter. That's behind a lot of the shift
from B-trees (as in conventional RDBMSes) to LSM-trees (as in
BigTable/LevelDB), and also the resurgence of batch-processing frameworks like
MapReduce. How does the software you build change when reading & writing data
sequentially is really cheap, but accessing it randomly is expensive?
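
For a concrete feel of that B-tree-to-LSM shift, here is a toy LSM-style store (an illustrative sketch only, not how BigTable or LevelDB actually implement it): writes land in an in-memory buffer and get flushed as sorted, append-only runs, so the disk only ever sees sequential writes; reads check the buffer first, then the runs from newest to oldest.

    # Toy LSM-style key-value store: on-disk structures would only ever be
    # written sequentially. Real LSM-trees add a WAL, bloom filters, compaction.
    class TinyLSM:
        def __init__(self, memtable_limit=4):
            self.memtable = {}    # random access is cheap here (RAM)
            self.segments = []    # immutable sorted runs, newest last
            self.limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.limit:
                # Flush the buffer as one sorted run: a single sequential write.
                self.segments.append(sorted(self.memtable.items()))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in reversed(self.segments):   # newest data wins
                for k, v in run:                  # real code would binary-search
                    if k == key:
                        return v
            return None

    db = TinyLSM()
    for i in range(10):
        db.put("k%d" % i, i)
    print(db.get("k3"))    # -> 3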

~~~
msellout
> How does the software you build change when reading & writing data
> sequentially is really cheap, but accessing it randomly is expensive?

I thought we were already in that situation. Cache is king.

~~~
thfuran
The real change is that with increasingly performant and large SSDs, it's
becoming reasonable to have main storage for which random access is orders of
magnitude faster than HDDs will ever be. Still much slower than cache, but a
few orders of magnitude here and there are likely to shift what optimal
strategies are.
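
Very rough ballpark latencies (my numbers, not from the paper) give a sense of where those orders of magnitude sit:

    # Approximate random-access latencies, in seconds (order-of-magnitude only).
    latency = {"L1 cache": 1e-9, "DRAM": 1e-7, "SSD": 1e-4, "HDD seek": 1e-2}
    for name, t in latency.items():
        print(f"{name:8s} ~{t:.0e} s  ({latency['HDD seek']/t:,.0f}x faster than an HDD seek)")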

------
baruch
A big thing they are saying in the paper is that they want data center HDDs
that care slightly less about medium error rates, trading that for better
areal density and more consistent tail latencies, and that leave reliability
to the system above, which has to handle it anyway.

I've been thinking along these lines for a while and would tend to reduce the
ERC (Error Recovery Control) timeout to a minimum, but disks still aren't
designed to run with very low values. Google has several interesting ideas
along these lines in the paper.

It would be really awesome if they get the HDD makers to go along with this.

------
ck2
Should be interesting to see what they come up with.

Somehow I don't think they will come up with anything better than
multi-million-dollar research efforts like HAMR for write-once read-many, but
we'll see, I guess.

I wonder if there will be 100TB spinning read/write drives by 2020, an
exponential leap rather than an incremental one.

Is there any reason for them to stay with the 5.25 or 3.5 inch form factor in
a datacenter?

Why not go back to 8 inch for massive surface area? Or is that too much mass
to spin?

~~~
darkr
> I wonder if there will be 100TB spinning read/write drives by 2020, an
> exponential leap instead of incremental

The problem with spinning disks is that speed and reliability have not
increased in line with capacity. There comes a point at which it doesn't make
sense to make them any bigger (capacity-wise).

Just creating a filesystem on an 8TB disk takes hours. If you bring in
block-level disk encryption, with the requirement to fill the disk with random
data and then encrypt it before creating the filesystem, you're looking at a
multi-day task. Scaled up to 100TB, you could be looking at a month just to
bring a disk online.
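
Putting rough numbers on that (assuming ~200 MB/s sustained throughput, which is optimistic once random-data generation and encryption are involved):

    # Time for a single full pass over a drive at an assumed sustained rate.
    def pass_hours(capacity_bytes, bytes_per_sec=200e6):   # ~200 MB/s assumed
        return capacity_bytes / bytes_per_sec / 3600

    print(pass_hours(8e12))     # 8 TB   -> ~11 hours per pass
    print(pass_hours(100e12))   # 100 TB -> ~139 hours (about 6 days) per pass
    # Random-fill plus encryption means multiple passes at lower rates, which
    # is how you end up with days for 8 TB and weeks or more for 100 TB.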

On the reliability front: a spinning 8TB disk is probably about as reliable as
a 1TB disk, so that means an 8x increase in the probability of data loss, as
well as ~8x more data to recover/re-distribute for every failure.

~~~
bsdetector
> Just creating a filesystem on an 8TB disk takes hours.

Maybe I'm being dense here, but why would you create a filesystem on the hard
drive? The hard drive should only store file contents.

There's all kinds of seeking, syncing, and random access needed for the
filesystem. Seeks for file contents can't be avoided, but ones caused by the
filesystem can be.

If there's no metadata-on-ssd + data-on-disk filesystem for Linux, there
should be.

~~~
woodman
ZFS allows for that level of control: assign a cache device to the pool, then
set the secondarycache property to metadata. Yes, the metadata will hit the
primary cache (memory) first, but it will spill onto the secondary device as
it is pressured out of ARC. I've got an SD card playing that role in my build
machine; it works really well for keeping track of all the tiny files that
make up the FreeBSD base and ports trees.
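
For reference, that setup looks roughly like this (the pool and device names here are placeholders):

    # Add a device as L2ARC cache for the pool, then keep only metadata on it.
    zpool add tank cache /dev/ada2          # 'tank' and 'ada2' are placeholders
    zfs set secondarycache=metadata tank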

------
cfcef
The point about ECC and accepting lower error rates makes a lot of sense. It's
the end-to-end principle again: since all of these HDDs are going into a
global pool in which each of them is disposable and written content is
protected by FEC spread across multiple drives, there is no need for each
drive to spend a lot of resources, going deep into diminishing returns, trying
to make itself as resilient as possible. If a network transmission fails, it
is retried and doesn't need to be swathed in huge numbers of elaborate
checksums and ultra-reliable links; if a hard drive fails, the content is
recovered from the FEC and re-written out to a new drive.
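
As a toy illustration of the recover-from-FEC idea (single XOR parity across "drives"; real systems use proper erasure codes such as Reed-Solomon):

    # Toy single-parity code: any one lost block is rebuilt by XORing the rest.
    from functools import reduce

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data drives
    parity = xor_blocks(data)            # block on a fourth, parity drive

    lost = 1                                                    # drive 1 dies
    survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
    rebuilt = xor_blocks(survivors)
    assert rebuilt == data[lost]         # recovered; rewrite it to a fresh drive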

------
novaleaf
Reminds me of the 5.25" Quantum Bigfoot. I'd be happy with a big disk like
that in a storage network / RAID setup.

[https://en.wikipedia.org/wiki/Quantum_Bigfoot](https://en.wikipedia.org/wiki/Quantum_Bigfoot)

~~~
mapmap
The article doesn't go into much detail on performance. I'm curious whether
reading was any faster because of the higher linear speed at the outer tracks
of a larger platter, or whether it was a wash, with seek times increasing due
to the larger area.

------
ibmthrowaway271
> For example, for YouTube alone, users upload over 400 hours of video every
> minute, which at one gigabyte per hour requires more than one petabyte (1M
> GB) of new storage every day or about 100x the Library of Congress

Hmm, something's up with the sums in the middle of that.

400 hours of video every minute is much more than one gigabyte per hour. It's
way more than one terabyte per hour.

Working backwards:

1 PB/day =~ 42 TB/hour =~ 728 GB/minute

~~~
lcpriest
I believe they are saying that the 400 hours of video uploaded every minute
amount to 400GB worth of data.
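
A quick sanity check of that reading (1 GB per hour is the figure from the quote):

    # 400 hours of video uploaded per minute, at ~1 GB per hour of video.
    gb_per_minute = 400 * 1
    gb_per_day = gb_per_minute * 60 * 24
    print(gb_per_day)   # 576,000 GB ~ 0.6 PB/day of raw uploads, i.e. the same
                        # order of magnitude as the "more than one petabyte" claim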

~~~
ibmthrowaway271
Ah, yes, good point.

