does not really agree with
> Each of the two physical network connections will connect to a different top of rack router.
Sure, you can do it with something like MLAG, but that's really just moving your SPOF to somewhere else (the router software running MLAG). Router software being super buggy, I wouldn't rely on MLAG being up at all times.
> N1 Which router should we purchase?
Pick your favorite. For what you're looking for here, everything is largely using the same silicon (Broadcom chipsets).
> N2 How do we interconnect the routers while keeping the network simple and fast?
Don't fall into the trap of extending vlans everywhere. You should definitely be routing (not switching) between different routers. You can read through http://blog.ipspace.net/ for some info on layer 3 only datacenter networks.
You'd want to use something like OSPF or BGP between routers.
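For a sense of scale: eBGP between two ToR routers is only a few lines of config. A minimal sketch in FRRouting syntax, where the ASNs, addresses, and subnets are all made-up examples:

```
! frr.conf on rack1's ToR router (example ASNs/addresses)
router bgp 65001
 bgp router-id 10.0.0.1
 ! eBGP peering to rack2's ToR over the point-to-point link
 neighbor 10.0.12.2 remote-as 65002
 address-family ipv4 unicast
  ! advertise this rack's server subnet to the other router
  network 10.1.0.0/24
 exit-address-family
```

Each rack stays its own L3 domain, and failures are contained to a rack instead of a stretched L2 fabric.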
> N3 Should we have a separate network for Ceph traffic?
Yes, if you want your Ceph cluster to remain usable during rebuilds. Ceph will peg the internal network during any sort of rebuild event.
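Splitting the traffic is just two lines in ceph.conf; the subnets below are made-up examples:

```
# ceph.conf -- separate client and replication traffic
[global]
public network  = 10.1.0.0/24   # clients <-> OSDs/monitors
cluster network = 10.2.0.0/24   # OSD <-> OSD replication and recovery
```

With this in place, rebuild/backfill traffic stays on the cluster network and client I/O on the public one.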
> N4 Do we need an SDN compatible router or can we purchase something more affordable?
You probably don't need SDN unless you actually have an SDN use case in mind. I'd bet you can get away with simpler gear.
> N5 What router should we use for the management network?
Doesn't really matter; gigabit routers are pretty robust/cheap/similar. I'd suggest the same vendor as whatever you go with for your public network routers.
Also, consider another standalone network for IPMI. I can tell you that the Supermicro IPMI controllers are significantly more reliable if you use the dedicated IPMI ports and isolate them. You can use shitty 100Mbit switches for this; the IPMI controllers don't support anything faster.
> D5 Is it a good idea to have a boot drive or should we use PXE boot every time it starts?
PXE booting at every boot is cool, but it can end up eating a lot of time. If you haven't already designed your systems to do this and don't have experience with PXE, don't.
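If you do go the PXE route, the server side is small. A minimal sketch with dnsmasq doing DHCP and TFTP, where the interface name, address range, and paths are made-up examples:

```
# dnsmasq.conf -- minimal DHCP + TFTP PXE setup
# (interface, range, and paths are example values)
interface=eth1
dhcp-range=10.1.0.100,10.1.0.200,12h
enable-tftp
tftp-root=/srv/tftp
dhcp-boot=pxelinux.0        # bootloader served from tftp-root
```

The time sink usually isn't this config but keeping the boot images, kernel args, and per-host provisioning logic current for every boot.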
> The default rack height seems to be 45U nowadays (42U used to be the standard).
You may not have accounted for PDUs here. Some racks will support 'zero-U' PDUs, but you'd need to confirm this before moving on.
> H3 How can we minimize installation costs? Should we ask to configure the servers to PXE boot?
Assume remote hands is dumb. Provide stupidly detailed instructions for them. Server hardware will PXE by default, so that's not really a concern. IPMI controllers come up via DHCP too, so once you've got access to those you shouldn't need remote hands anymore.
> D2 Should we use Bcache to improve latency on the Ceph OSD servers with SSD?
Did you consider just putting your Ceph journals on the SSD? That's a much more standard config than somehow using bcache with the OSD drives.
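For reference, pointing journals at SSD partitions is plain ceph.conf; the device paths and size below are made-up examples:

```
# ceph.conf -- put OSD journals on SSD partitions
[osd]
osd journal size = 10240          # journal size in MB
[osd.0]
osd journal = /dev/sdg1           # SSD partition (example device)
[osd.1]
osd journal = /dev/sdg2
```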
I would strongly consider doing this via pure L3 routing. This is a scale at which the benefits of L2 fabric switching vs L3 multihomed routing (yes, routing decisions on every node) begin to be interesting decisions.
We're already planning a separate router for the management network ("Apart from those routers we'll have a separate router for a 1Gbps management network.").
All Ceph journals will be on SSD too. I've added a question about combining this with bcache in https://gitlab.com/gitlab-com/www-gitlab-com/commit/a9cc9aad...
For switches, yes. Many of the switches share the same merchant silicon (Broadcom Trident-II, Tomahawk, et al.); however, there are switches like the Juniper EX9200 which aren't based on merchant silicon. Routers (N1) are also typically not based on merchant silicon (Juniper Trio-3D, for example).