From my own personal experience, I would go with a PCIe SSD cache/write buffer, and then a primary SSD tier and a HDD tier. Storage, as it seems you guys have experienced, is something you want to get right on the first try.
N1: Yes, the database servers should be on physically separate enclosures.
D5: Don't add the complexity of PXE booting.
For Network, you need a network engineer.
This is the kind of thing you need to sit down with a VAR for. Find a datacenter and see who they recommend.
If the 8TB drives don't offer the IO we need, we'll have to leave part of them empty. I assume that if you only fill them to 2TB they should perform equally well. Word on the street is that Google offers Nearline to make use of all the capacity they can't use for their own applications because of latency restrictions.
It is an interesting idea to do SSD cache + SSD storage + HDD instead of just SSD cache + HDD. I'm not sure if Ceph allows for this.
I'm sure it has its uses! But I'd emphatically advise anyone to obtain hands-on experience with Bcache before planning around it! As I'm sure you are already doing or considering, of course :)
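For anyone wondering what that hands-on experience involves, the basic bcache setup is only a few commands. A minimal sketch, assuming an NVMe SSD as cache and an HDD as backing store (the device names below are hypothetical placeholders, not anything from the article):

```shell
# Format the backing HDD and the caching SSD (hypothetical devices).
make-bcache -B /dev/sdb          # backing device; exposes /dev/bcache0
make-bcache -C /dev/nvme0n1      # cache device; creates a cache set

# Find the cache set UUID and attach it to the backing device.
bcache-super-show /dev/nvme0n1 | grep cset.uuid
echo <cset-uuid> > /sys/block/bcache0/bcache/attach

# Default mode is writethrough; writeback turns the SSD into a write
# buffer as well (faster, but data loss risk if the SSD dies dirty).
echo writeback > /sys/block/bcache0/bcache/cache_mode
```

The writeback/writethrough choice is exactly where the "get hands-on first" advice matters, since it changes the failure modes of the whole stack.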
Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while moving infrequently accessed data to a slower tier.
By "Tiering", is this moving data between different drive types? Or by moving the data to different parts of the platter to optimize for rotational physics?
Ceph can probably explain it better than I can. http://docs.ceph.com/docs/jewel/rados/operations/cache-tieri...
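Going by those docs, wiring a cache tier in front of a backing pool is just a handful of commands. A sketch with made-up pool names ("cold-storage" on HDDs, "hot-cache" on SSDs) and example tunable values:

```shell
# Attach the SSD pool as a cache tier in front of the HDD pool.
ceph osd tier add cold-storage hot-cache
ceph osd tier cache-mode hot-cache writeback
ceph osd tier set-overlay cold-storage hot-cache

# The cache tier needs a hit set to track which objects are hot.
ceph osd pool set hot-cache hit_set_type bloom

# Cap the cache so the agent starts flushing/evicting (example value).
ceph osd pool set hot-cache target_max_bytes 1000000000000
```

So it's tiering between pools (i.e. drive types), not anything platter-level.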
I read the entire article, and while you talked about having a backup system (in a single server, no less!) that can restore your dataset in a reasonable amount of time, you have no capability for disaster recovery. What happens when a plumbing leak in the datacenter causes water to dump into your server racks? How long did it take you to acquire servers, build the racks and get them ready for you to host customer data? Can your business withstand that amount of downtime (days, weeks, months?) and still operate?
These questions are the ones you need to be asking.
In other words, double your budget because you'll need a DR site in another datacenter.
Edit: If Ceph is smart enough, it will be aware of the tiers present on a node and will tier blocks within that node, so a block on node A will stay on node A.
If you are talking physically, using 2TB of an 8TB drive would theoretically be faster than a full 2TB disk if the performance attributes were the same. But a 2TB HDD will have a faster average seek time than an 8TB HDD due to its physical design. Any performance gain from only partially filling the larger drive would probably be offset by the slower drive overall.
I'm skeptical here. So your minimum seek time goes up on the 8TB because it's harder to align. But your maximum seek time should drop tremendously because the drive arm only has to move along the outer fifth of the drive. And your throughput should be great because of the doubled linear density combined with using only the outer edge.
A quick google search shows that there are marginal gains on the outer tracks vs. the inner ones, but that only applies to sequential workloads. For something like GitLab, the workload would be anything but sequential.
1. If I look at 2TB vs. 8TB HGST drives, their seek times are 8ms and 9ms respectively. But if you're only using a quarter of the 8TB drive, the drive arm needs to move less than a quarter as much. Won't that save at least 1ms?
2. The 8TB drive has a lot more data per inch, and it's spinning at the same speed. Once a read or write begins, it's going to finish a lot faster.
3. Here's a benchmark putting 8TB ahead in every single way http://hdd.userbenchmark.com/Compare/HGST-Ultrastar-He8-Heli...
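The arithmetic in point 1 can be sketched with a crude seek model. The split between fixed overhead (head settle, controller) and travel time is an assumption for illustration, not a measured figure; only the 9 ms full-stroke average comes from the spec sheets above:

```python
# Back-of-envelope model of short-stroking an 8TB drive down to 2TB.
# Assumption: of a 9 ms full-stroke average seek, ~4 ms is fixed
# overhead and ~5 ms scales with average arm travel distance.

FIXED_MS = 4.0    # settle + controller overhead, roughly constant
TRAVEL_MS = 5.0   # travel component at full stroke (assumed split)

def avg_seek_ms(stroke_fraction):
    """Average seek when the arm is confined to a fraction of the stroke.

    For uniformly random requests, mean travel distance scales roughly
    linearly with the usable stroke length.
    """
    return FIXED_MS + TRAVEL_MS * stroke_fraction

full = avg_seek_ms(1.0)     # whole 8TB in use
short = avg_seek_ms(0.25)   # outer quarter only; actually generous,
                            # since denser outer tracks mean 2TB of
                            # capacity spans less than 1/4 of the stroke

print(f"full stroke:  {full:.2f} ms")   # 9.00 ms
print(f"short stroke: {short:.2f} ms")  # 5.25 ms
```

Even with this rough split, confining the arm to the outer quarter saves well over the 1 ms asked about, which is the core of the argument.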
In my experience, the PCIe drives they mention (the DC P3700) are incredible. They're blazing fast (though the 3D XPoint stuff is obviously faster), have big supercaps (in case of power failure), and really low latency for an SSD (150 microseconds or so). They're a pretty suitable alternative to in-memory log buffers where latency is less crucial. Much cheaper than the equivalent memory, too. Having a few of these in front of a RAID 10 made up of 8TB drives will work just fine.
FWIW, your experience with a RAID 6 of big disks is unsurprising. RAID 6 is almost always a bad choice nowadays since disks are so big: terrible recovery times and middling performance.
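To put a number on "terrible recovery times", here's a rough estimate. The ~100 MB/s sustained rebuild rate is an assumption (and a generous one for an array still serving traffic; rebuilds are usually throttled well below that):

```python
# Rough rebuild-time estimate for one failed 8TB member.
# Assumption: the array sustains ~100 MB/s of rebuild throughput.

DRIVE_BYTES = 8 * 10**12    # 8 TB
REBUILD_MBPS = 100          # assumed sustained rebuild rate

seconds = DRIVE_BYTES / (REBUILD_MBPS * 10**6)
hours = seconds / 3600
print(f"~{hours:.0f} hours per rebuild")  # ~22 hours
```

That's nearly a day of degraded operation per failure at best, during which every read hammers the surviving disks, which is why big-disk parity RAID gets such a bad rap.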
I would think that GitLab's workload is mostly random, which poses a problem for larger drives. The SSDs are a great idea, but I've only seen 8TB drives used when there are 2 to 3 tiers, with the 8TB drives all the way on the bottom. I'm not sure how effective a single SSD cache drive will be in front of 24TB of 8TB disks.
Oops that's reliability, I can jump the gun sometimes, sorry.