
Shuffle Sharding: massive and magical fault isolation - motter
http://www.awsarchitectureblog.com/2014/04/shuffle-sharding.html
======
notacoward
Great stuff, but the same generally beneficial approach taken too far can run
into its own problems.

http://hackingdistributed.com/2014/02/14/chainsets/

To put it simply, seven million is not a big number, and it's the wrong number
anyway. The author confused permutations and combinations; the correct number
of four-card hands from a deck including jokers is only 316,251. For the more
common N=3 it's a paltry 24,804. If you're doing "pick any N" to choose replica
sets for millions of objects (for example), then pretty quickly every node will
have a sharding relationship with every other. The probability of a widespread
failure wiping out every member of _some_ shard - leading to loss of data or
service - approaches one. You're better off constraining the permutations
somehow, certainly not all the way down to the bare minimum, but so that the
total probability of data/service loss after N failures remains small.
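
For anyone checking the arithmetic, Python 3.8+ can reproduce these numbers
directly with `math.perm` and `math.comb`; the 100-node figure below is just
an illustrative assumption, not anything from the article:

```python
from math import comb, perm

# 52 cards plus two jokers = 54. Ordered draws (permutations) give the
# inflated "seven million" figure; unordered hands are combinations.
print(perm(54, 4))  # 7590024 -- ordered four-card draws
print(comb(54, 4))  # 316251  -- distinct four-card hands
print(comb(54, 3))  # 24804   -- distinct three-card hands (N=3)

# With, say, 100 nodes there are only comb(100, 3) = 161700 possible
# 3-node replica sets. Spread millions of objects across them and nearly
# every set is occupied, so any 3 concurrent node failures wipe out all
# replicas of *some* object.
print(comb(100, 3))  # 161700
```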

I really hope people actually do the math instead of just cargo-culting an
idea with a catchy name.

~~~
colmmacc
Author here! Well that's embarrassing, you're right about the 7 million
number. I've updated the post to correct it. We'll be following up in a later
blog post on the numbers we use for Route 53.

~~~
notacoward
Awesome, thank you. Those of us who waste^H^H^H^H^Hspend our lives pondering
these things need to stick together. Cheers.

~~~
otoburb
Off-topic: ^W is more efficient and carries the same connotation of a
strikethrough. I only recently found out about ^W myself, so I'm curious
whether you actively chose ^H (character delete) over ^W (word delete), or
whether ^W is simply less well known as a bash/emacs command?

I feel that ^H is more widely known because those are the actual characters
that people used to see in older terminals before remembering to type "stty
erase ^H".

~~~
notacoward
I used ^H because that has become the standard way to indicate a "correction"
for humorous purposes. As to why that became the standard, it does go back to
the days before Delete keys started going where the Backspace key is supposed
to be, requiring that the erase character be set to DEL to compensate. It has
nothing to do with emacs.

------
danudey
Our approach has turned out to work really well, and it's very simple.

We have N servers in round-robin DNS. When our mobile client starts up, it
does a DNS lookup, fetches the entire list of servers, and then picks one to
connect to. If that connection fails, it tries another one, etc. until it runs
out (which has never happened).

We also ship the client with a pre-populated list of IP addresses (the current
server list as of build time) and the client caches the list it gets from DNS
whenever it does a lookup. This means that even in the event of a complete DNS
failure, even for hours at a time, our clients are still able to connect. This
was quite handy when GoDaddy's DNS was inaccessible a year or two ago due to
what I recall was a DDoS attack.
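
As a rough sketch of that fallback chain (Python here for brevity; the
names, port, and error handling are assumptions, not danudey's actual
client code):

```python
import socket

# Server IPs baked in at build time (placeholder documentation addresses).
BUILT_IN_SERVERS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
_cached_servers = []  # refreshed after every successful DNS lookup

def candidate_servers(hostname):
    """DNS first, then the last cached list, then the built-in list."""
    global _cached_servers
    try:
        # getaddrinfo returns every A record behind the round-robin name.
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        _cached_servers = [info[4][0] for info in infos]
    except socket.gaierror:
        pass  # DNS is down; fall back to whatever we already know
    return _cached_servers or BUILT_IN_SERVERS

def connect(hostname):
    for ip in candidate_servers(hostname):
        try:
            return socket.create_connection((ip, 443), timeout=5)
        except OSError:
            continue  # that server failed; try the next one
    raise ConnectionError("all known servers unreachable")
```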

A few weeks ago my ISP's DNS servers went down, and since I have the same
mobile and DSL provider, I was completely unable to do anything on the
internet — except play our game. It was then that I wondered 'why don't more
apps do this?' It seems like a simple problem; if you can't do a DNS lookup,
assume the previous IP is still valid. Assuming you're using HTTPS, there
should be no more exposure from a security perspective unless someone takes
control of your IP address _and_ fakes your SSL certificate, at which point
you're screwed anyway.

~~~
URSpider94
_Our approach has turned out to work really well, and it's very simple.

We have N servers in round-robin DNS. When our mobile client starts up, it
does a DNS lookup, fetches the entire list of servers, and then picks one to
connect to. If that connection fails, it tries another one, etc. until it runs
out (which has never happened)._

The point of the article is that this approach is vulnerable in the case where
something about the client request harms the server -- either takes it down or
impairs its response. In such a case, a single bad client could rotate
successively through the round robin and take out every one of your servers.

The author is proposing a way to minimize the impact of such a bad actor while
still providing a form of round-robin failover for well-behaved requests.
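
To make the contrast concrete, the core trick can be sketched in a few
lines (a minimal illustration, not the article's actual implementation;
the function name, hashing scheme, and shard size are all assumptions):

```python
import hashlib
import random

def shuffle_shard(customer_id, servers, shard_size=2):
    """Deterministically pick a small per-customer subset of servers.

    A misbehaving customer's requests only ever reach its own shard, so
    the blast radius is shard_size servers instead of all of them, and
    overlapping shards mean most other customers keep at least one
    healthy server.
    """
    # Seed a private RNG from the customer id so the assignment is
    # stable across lookups and across machines.
    seed = hashlib.sha256(customer_id.encode()).digest()
    return random.Random(seed).sample(servers, shard_size)

servers = ["server-%d" % i for i in range(8)]
print(shuffle_shard("customer-a", servers))  # always the same pair
print(shuffle_shard("customer-b", servers))  # very likely a different pair
```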

------
mey
Quick note about client retries: I highly suggest some type of guard on
side-effecting operations, either rejecting duplicate transaction ids
(generated by the client, or retrieved from the server before the
side-effecting operation) or returning the previous result if the transaction
id was recently processed.

Obviously, being stateful and correctly routing the request across shards
becomes harder and can hurt this scale-out solution. It also depends on the
functionality of the request; for example, an update of the same data may not
cause any damage, but a double submission of an order could.

https://en.wikipedia.org/wiki/Idempotence#Computer_science_meaning
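
A guard like that can be sketched in a few lines (the in-process dict and
the `place_order` stub are hypothetical stand-ins; a real service would
persist the table, with expiry, somewhere reachable from every shard):

```python
# Idempotency guard keyed by transaction id.
_processed = {}

def place_order(order):
    # Stand-in for the real side-effecting operation.
    return {"status": "placed", "order": order}

def submit_order(txn_id, order):
    if txn_id in _processed:
        # Duplicate (e.g. a client retry after a timeout): return the
        # previous result instead of placing a second order.
        return _processed[txn_id]
    result = place_order(order)
    _processed[txn_id] = result
    return result
```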

