
Server-Side Hardware for MOGs - kiyanwang
http://ithare.com/server-side-hardware-for-mogs/
======
myrandomcomment
The design of the proposed networks is not really optimal for providing low
latency. You would do much better by dropping the firewalls and putting the
front end game servers on the net with the correct setup security-wise (only
running the services needed to answer for the game). On the game servers,
have a back end network for storage and DB (2 networks on a shared NIC with
VLANs, or even 2 different NICs). It really depends on the load and
throughput requirements. This is an answer given the design they proposed.
L2 and bare metal servers are not the right answer when you are designing
from scratch in 2017.

Ideally you do an L3 Clos pod (leaf/spine) behind an SLB that directs traffic
to containers on the servers running the game system. Give every container a
/32 (BGP fabric on the Clos pulled down to the server - running routing on
the server) and run the internal DB / app / storage on a different set of
/32s. Towards the external router, do not announce the internal DB / app /
storage networks, and then the world cannot get to them. Redundancy is a
feature of the Clos and BGP + containers. Scaling is simple if your
application is built to scale with the addition of service containers.
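As a sketch, the per-container /32 announcement might look like this in an
FRR-style BGP config (all ASNs and addresses here are made up; the internal
DB / app / storage /32s are simply never listed, so the external router
never learns them):

```
! FRR-style sketch - ASNs and addresses are hypothetical
router bgp 65010
 bgp router-id 10.0.2.10
 ! uplink to the leaf switch
 neighbor 169.254.0.1 remote-as 65000
 address-family ipv4 unicast
  ! public per-container /32s: announced into the fabric
  network 192.0.2.10/32
  network 192.0.2.11/32
  ! internal DB / app / storage /32s are deliberately not
  ! listed, so they are never announced upstream
 exit-address-family
```

When a container moves to another server, that server's routing daemon
announces the same /32 and the fabric converges on the new location.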

Also, 10G is cheap and the latency is much lower than 1G. Heck, I know people
building the 2nd design I talked about above running 50G to the servers and
6x100G from the leaves to the spines. The cost of the switches (Broadcom
Tomahawk ASIC in kit from Arista, Cisco, Juniper, whitebox) is really low.

~~~
lazylizard
actually, what's wrong with bare metal servers?

~~~
myrandomcomment
Nothing, if you have a need for a single big app that can consume the full
system. It makes redundancy more complex, as you cannot move the app if there
is a failure; it requires clustering and an SLB-type design.

~~~
lazylizard
i'm thinking a cluster of bare metal servers is rather simpler than a cluster
of containers on top of the same number of bare metal servers?

~~~
myrandomcomment
What you are trying to do is the key here. Let's say you have some web
servers and you also have some applications (games). The load on the games
depends on the number of users and how popular they are. If you have a few
racks of servers that are all 100% the same, you can bring up containers on
demand: web containers, and game containers based on load. Bare metal is
really for something where you have a reason to consume all the HW; let's
say for a DB you want to keep the whole data set in memory, then bare metal
in a cluster with a big box that has lots of RAM. Like I said, it is all
really use case dependent. The trend today in systems at the target scale is
load on demand via VM or container.
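The "containers on demand based on load" idea boils down to a sizing rule.
A toy sketch (the capacity numbers and limits here are hypothetical, not
from any real deployment):

```python
import math

def desired_containers(current_players, players_per_container,
                       minimum=2, maximum=40):
    """Hypothetical sizing rule: one game container per N players,
    with a floor for redundancy and a ceiling for rack capacity."""
    wanted = math.ceil(current_players / players_per_container)
    return max(minimum, min(maximum, wanted))

print(desired_containers(0, 100))       # idle: keep the redundancy floor
print(desired_containers(950, 100))     # scale with load
print(desired_containers(100000, 100))  # clamp at rack capacity
```

The orchestrator then starts or stops containers until the running count
matches the desired count.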

Also, if you use containers with a routing protocol on the server, they can
move and update the network as they move.

------
cthalupa
Reading through this and it feels very... dated.

Firewalling your game servers is just not a good idea if latency matters at
all. It shouldn't really be presented as an option. Separate VLANs or subnets
or some other method of segregating traffic is the correct answer here. I
understand there is a disclaimer, but it's just not ever the right choice.

Out of band management: If you're in a production setting and having to
manually fix mistakes when you broke your IP configuration, you're doing it
wrong. Why are you not using configuration management tools? Why are you
manually configuring your servers? If you're using the cloud, why wouldn't you
just toss away a broken instance?

Fault tolerance: Vague statements about how building in redundancy is somehow
a single point of failure itself. It mentions further details are available in
some other chapter, but that chapter appears to not be on the site, or isn't
written yet, or something. This seems somewhat nonsensical to me, as there are
many proven high availability configurations.

RAID: Software RAID is fine for a game server vs. hardware RAID? There are so
many caveats to this... If your RAID is software based, your CPU is handling
the calculations required for the RAID. Do you have the spare cores for
this? Will it negatively impact server performance? Will it perform as well
as it would with a dedicated controller? How often are you loading things
from disk?
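To make "the CPU is handling the calculations" concrete, here is a toy
sketch of the XOR parity a software RAID-5 computes on every full-stripe
write (real implementations like Linux md use heavily optimized SIMD
routines, but the arithmetic per byte is the same):

```python
def xor_blocks(blocks):
    """XOR equal-length blocks byte-wise - the parity arithmetic a
    software RAID-5 burns CPU cycles on for every stripe written."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Parity lets you rebuild any one lost block from the survivors:
data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(data)
rebuilt = xor_blocks([data[0], data[2], parity])  # recovers data[1]
print(rebuilt)  # b'bbbb'
```

At gigabytes per second of write traffic, those cycles come out of the same
cores the game server wants.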

Blade servers: Unsubstantiated anecdotal evidence about how blade servers are
less reliable. If they're really less reliable, there should be data on this.
This appears to be the beta for a book that will be sold. If this is a
product, back up your assertions.

"Mission critical servers": Recommendations to put your mission critical
servers on a 4U because this will somehow make them more reliable than on a 1U
or 2U, despite many failure modes having absolutely nothing to do with the
amount of space the server takes up on the rack.

SANs: Apparently these are not acceptable for database servers. I believe many
of us are in for rude awakenings.

RAID with battery backed write cache vs. NVMe: Apparently it does not matter
what type of storage you are using underneath the RAID as long as you have a
write cache. No mention of what happens if you are pushing more data to the
RAID controller than can be cached before being flushed to the disks backing
it.

Cloud storage: "Just now" cloud providers are providing virtual machines with
local disk options. Last I checked, many of the big name cloud providers
launched their virtual machine offerings with local disk options.

Vendors: Just avoid SuperMicro? No explanation here beyond they're "not mature
enough". I suppose two and a half decades isn't mature.

Browsing some of the other sections, like the 'Cloud' chapter...

"Hardware Replacement": Mentions that cloud providers will automatically
instantly relaunch a busted instance for you, and that you'll lose your hard
disk data. Many providers don't automatically do anything of this sort, and if
you are using network storage, you certainly aren't losing your storage
because you moved your instance to new hardware.

Network throughput: Somehow a 100mbps port is going to push out 13 petabytes
of data in one month. I'm not entirely sure how.
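For reference, the arithmetic for a fully saturated 100 Mbit/s port over a
30-day month:

```python
# Maximum data a saturated 100 Mbit/s port can move in a 30-day month.
bits_per_second = 100_000_000
seconds_per_month = 30 * 24 * 3600          # 2,592,000 s
total_bytes = bits_per_second / 8 * seconds_per_month
print(f"{total_bytes / 1e12:.1f} TB")       # ~32.4 TB
```

That is roughly 32 TB, about 400x short of 13 petabytes.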

VM migration: Apparently all clouds live migrate VMs, and these live
migrations with small latency spikes are worse than complete failure.

Large boxes: Those 4U boxes in the other chapter supposedly do not have
equivalents available in the cloud.

No mentions at all of the tradeoffs in performance on virtualization vs. bare
metal, mitigating factors, etc.

I honestly feel like this was written by someone who is stuck architecting
things like it's 2002.

~~~
myrandomcomment
+1 to everything you said. It is a strange mix of random statements not
backed up by facts, and design elements from 12 years ago somehow shoehorned
onto the cloud.

