If you are looking for performance, do not get the 8TB drives. In my experience,...

sytse · on Dec 11, 2016

Thanks for commenting! You're right that 2TB disks will have more heads per TB and therefore more IO. We want to increase performance of the 8TB drives with Bcache. If we go to smaller drives we won't be able to fit enough capacity in our common server that has only 3 drives per node. In this case we'll have to go to dedicated file servers, reducing the commonality of our setup. We're using JBOD with Ceph instead of RAID.

If the 8TB drives don't offer the IO we need we'll have to leave part of them empty. I assume that if you fill them with 2TB they should perform equally well. Word on the street is that Google offers Nearline to fill all the capacity they can't use because of latency restrictions for their own applications.

It is an interesting idea to do SSD cache + SSD storage + HDD instead of just SSD cache + HDD. I'm not sure if Ceph allows for this.

AnthonBerg · on Dec 12, 2016

I recommend testing and evaluating Bcache carefully before planning to use it in a production system. I found it unwieldly, opaque, and difficult to manage when testing on a simple load; Benefits were mild.

I'm sure it has its uses! But I'd emphatically advise anyone to obtakn hands-on experience with Bcache before planning around it! As I'm sure you are already doing or considering of course :)

lykron · on Dec 11, 2016

It doesn't work like that. 2TB drive will always be faster than an 8TB drive. The amount of data has no effect when compared to the physical attributes of the drive. More platters will increase the response time.

Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while the infrequent data to a slower tier.

toomuchtodo · on Dec 11, 2016

> Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while the infrequent data to a slower tier.

By "Tiering", is this moving data between different drive types? Or by moving the data to different parts of the platter to optimize for rotational physics?

lykron · on Dec 11, 2016

By Tiering, I'm talking about moving blocks of data from slower to faster or visa verse. If I have 10TB of 10k IOPS storage and 100TB of 1k IOPS storage in a tiered setup, data that is frequently accessed would reside in the 10k IOPS tier while less frequently accessed data would be in the 1k tier. In this case, the blocks of popular repositories would be stored in SSD, while the blocks of your side project that you haven't touched in 4 years would be on the HDD. You still have access to it, it might just take a bit longer to clone.

Ceph can probably explain it better than I can. http://docs.ceph.com/docs/jewel/rados/operations/cache-tieri...

sytse · on Dec 12, 2016

That is pretty awesome. Should the SSD's for the fast storage be on the OSD nodes that also have the HDD's or should it be separate OSD nodes?

illumin8 · on Dec 12, 2016

The fact that you're asking this question on Hacker News leaves little doubt in my mind that you and your team are not prepared for this (running bare metal).

I read the entire article, and while you talked about having a backup system (in a single server, no less!) that can restore your dataset in a reasonable amount of time, you have no capability for disaster recovery. What happens when a plumbing leak in the datacenter causes water to dump into your server racks? How long did it take you to acquire servers, build the racks and get them ready for you to host customer data? Can your business withstand that amount of downtime (days, weeks, months?) and still operate?

These questions are the ones you need to be asking.

In other words, double your budget because you'll need a DR site in another datacenter.

lykron · on Dec 12, 2016

They don't need another datacenter. AWS has had more issues they NYI's datacenters have had in the past year. There are many companies that are based out of a single datacenter that haven't had any major issues.

toomuchtodo · on Dec 12, 2016

Backblaze comes to mind.

lykron · on Dec 12, 2016

That would be something I would test. I don't know Ceph, so I would be taking a shot in the dark. I would guess it would not make much of a difference as everything is block level. I, personally, would do 1x PCIe SSD for cache, 1x 2/4TB SSD, and 2x 4TB HDD for each storage node.

Edit. If Ceph is smart enough, it would be aware of the tiers present on the node, and would tier blocks on that node. So a Block on node A will stay on node A.

matt4077 · on Dec 11, 2016

Though 2TB of data on an 8TB drive does mean only 1/4 as many requests hit it, right?

lykron · on Dec 11, 2016

I'm not sure I understand your question. A 2TB/2TB disk will have the same number of requests as a 2TB/8TB disk, as they both have the same amount of data.

If you are talking physically, 2TB/8TB would theoretically be faster than a 2TB/2TB disk if the performance attributes were the same. But a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.

Dylan16807 · on Dec 12, 2016

> a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.

I'm skeptical here. So your minimum seek time goes up on the 8TB because it's harder to align. But your maximum seek time should drop tremendously because the drive arm only has to move along the outer fifth of the drive. And your throughput should be great because of the doubled linear density combined with using only the outer edge.

lykron · on Dec 12, 2016

I'm not saying that you wouldn't see any performance gain. 2TB on the outer track will be faster than 8TBs on the same 8TB disk, but I'm saying any gains will be lost due the dense nature of the drives.

A quick google search shows that there are marginal gains on the outer track vs the inner, but that is only on sequential workloads. For something like GitLab, the workloads would be anything but.

http://superuser.com/questions/643013/are-partitions-to-the-...

https://www.pythian.com/blog/hard-drive-inner-or-outer/

Dylan16807 · on Dec 12, 2016

Ignore the part about where the partition is, then.

1. If I look at 2TB vs. 8TB HGST drives, their seek times are 8ms and 9ms respectively. But if you're only using a quarter of the 8TB drive, the drive arm needs to move less than a quarter as much. Won't that save at least 1ms?

2. The 8TB drive has a lot more data per inch, and it's spinning at the same speed. Once a read or write begins, it's going to finish a lot faster.

3. Here's a benchmark putting 8TB ahead in every single way http://hdd.userbenchmark.com/Compare/HGST-Ultrastar-He8-Heli...

alfalfasprout · on Dec 11, 2016

So, the 8TB drives are fine if you're doing sequential writes, mainly doing write-only workloads, or have a massive buffer in front.

In my experience, the PCIE drives they mention (the DC P3700) are incredible. They're blazing fast (though the 3D X-Point stuff is obviously faster), have big supercaps (in case of power failure), and really low latency for an SSD (150 microseconds or so). They're a pretty suitable alternative to in-memory logbuffers where latency is less crucial. Much cheaper than the equivalent memory too. Having a few of these in front of a RAID 10 made up of 8TB drives will work just fine.

FWIW your experience with a RAID 6 of big disks is unsurprising. Raid 6 is almost always a bad choice nowadays since disks are so big. Terrible recovery times and middling performance.

lykron · on Dec 11, 2016

True about RAID6, I was just commenting on performance. A 2TB will spank an 8TB in any configuration.

I would think that GitLab's workload is mostly random, which would pose a problem for larger drives. The SSDs are a great idea, but I've only seen 8TB drives used when there are 2 to 3 tiers; with 8TB drives being all the way on the bottom. I'm not sure how effective having a single SSD as a cache drive for 24TBs of 8TB disks will be.

BJanecke · on Dec 11, 2016

Don't have hard numbers you say?

https://www.backblaze.com/blog/hard-drive-reliability-stats-...

[EDIT]

Oops that's reliability, I can jump the gun sometimes, sorry.