

Netflix: Run Consistency Checkers All The Time To Fixup Transactions  - yarapavan
http://highscalability.com/blog/2011/4/6/netflix-run-consistency-checkers-all-the-time-to-fixup-trans.html

======
cookiecaper
The described approach just _feels_ wrong to me. I understand that the
problems Netflix faces are challenging, but usually when I run into messy
systems like that, where you do one thing to address one problem, and then
have to do four or five other things to fix the problems that one workaround
caused, there's a fundamental misengineering in the mix somewhere. Systems
should be simple and consistent as much as possible -- doing something like
this across the entire service is crazy.

I understand if they do it a little bit to test things, or whatever, but I
think if they sat back and were willing to carefully consider this problem,
they could come up with a much simpler workflow. The ability to do that is a
key part of quality engineering. A big cascading infrastructure of workarounds
that eventually only kind of gets things in a correct state is usually a sign
of a problem.

~~~
tybris
> Systems should be simple and consistent as much as possible

CAP theorem says otherwise.

Distributed systems are complicated, almost like reality. You can feel the
constraints of physics and math everywhere. You simply cannot get the best of
all worlds (i.e., consistency, availability, and partition tolerance). If you
have the money and need the scale, which often go hand in hand, you're simply
better off avoiding hard consistency models, because the other two are more
important to your business.

I love their approach. It's exactly what I think people should do to scale up
their systems. Avoid two-phase commits and other elegant availability
minefields and go for pragmatism. Note that this is only relevant for a few
dozen companies.

~~~
lsd5you
It is not 100% clear from the examples cited that scaling should necessitate
the consistency issues they have. Surely it is quite straightforward to scale
private user data (i.e., 'sharding' by user) such as play position and the
queue.
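
To make that concrete, here is a minimal sketch of per-user sharding (shard
names and hash choice are invented, just to illustrate the routing -- not
anything Netflix actually runs):

    import hashlib

    # Hypothetical shard map: all of a user's private data (queue,
    # play position, ratings) lives on exactly one of these nodes.
    SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

    def shard_for_user(user_id):
        # Hash the user id so routing is deterministic: reads and
        # writes for one user never need a cross-node transaction.
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for_user("alice"))  # always the same shard for alice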

For the data which is cross-user, such as what drives their recommendation
system (are there other parts? I don't know, not a user), consistency would be
compromised -- but given its soft/heuristic nature, that is presumably also
more or less irrelevant. Here eventual consistency would be good enough (and
ideally should not require extra processes)?
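
As a rough illustration of "good enough" (a generic last-writer-wins merge,
not whatever Netflix actually does):

    import dataclasses

    @dataclasses.dataclass
    class Versioned:
        value: float      # e.g. an aggregate rating feeding recommendations
        timestamp: float  # time of the write that produced it

    def merge(a, b):
        # Last-writer-wins: replicas that diverged while disconnected
        # converge once they exchange state. For soft/heuristic data a
        # briefly stale value costs essentially nothing.
        return a if a.timestamp >= b.timestamp else b

    r1 = Versioned(4.2, 100.0)
    r2 = Versioned(4.5, 105.0)
    print(merge(r1, r2))  # both replicas settle on value=4.5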

Like the halting problem, the implications of the CAP theorem are only
unavoidable in pathological cases. Until a system is identified as a
pathological case, I believe it is premature to invoke such theorems.

~~~
jonburs
Scaling does not imply availability. Sharding user data certainly is a good
strategy to achieve scalability (i.e., support a large number of users,
distributed across a fleet of commodity hardware), but it doesn't a priori
make that data highly available.

Let's imagine my personal Netflix data (queue, ratings, etc.) lives on a
lightly loaded (since things are well sharded) DB instance. What happens if
the hardware hosting that DB fails? OK, let's replicate to a slave, and put
into place automated failover mechanisms to promote the slave if the master
goes down (since with many shards it's really hard to effect a failover
manually in a timely fashion). Should the master and slave be in the same
datacenter? That, after all, is the failure scenario Netflix originally
wanted to address -- what happens when your single datacenter goes down (which
eventually will happen, no matter how much redundancy you think it has)?

Fine then, put the master and slave in different DCs (with fat and fast
pipes so 2PC or synchronous replication can keep things consistent without
impacting latency too much). Now where does the auto-promotion logic live --
the DC with the slave or the master? What happens if my client can reach both
DCs, but there are connectivity issues on the links connecting the DCs?
Network hardware can fail like any other kind, backhoes accidentally sever
buried optics, and routing protocols do take time to converge.

P is now rearing its ugly and inevitable head, and you're faced with the C/A
decision -- can I see and alter my queue (leading to inconsistency) or is it
unavailable? Is this scenario pathological?
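
To spell the trade-off out (a toy model, all names invented):

    class QueueShard:
        # Toy model of the scenario above: one replica of a user's
        # queue which may lose contact with its peer DC.
        def __init__(self, prefer_consistency):
            self.queue = []
            self.partitioned = False  # can we reach the peer DC?
            self.prefer_consistency = prefer_consistency

        def add_to_queue(self, title):
            if self.partitioned and self.prefer_consistency:
                # Choose C: refuse the write rather than let the two
                # replicas diverge -- the queue is simply unavailable.
                raise RuntimeError("queue unavailable during partition")
            # Choose A: accept the write locally. During a partition
            # this risks exactly the divergence that Netflix's
            # consistency checkers exist to repair afterwards.
            self.queue.append(title)

    shard = QueueShard(prefer_consistency=False)
    shard.partitioned = True
    shard.add_to_queue("Heat")  # accepted; may need fixup later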

~~~
lsd5you
Sorry, but this argument is the worst kind of wrong - it is half true.
100.00000% consistency/availability is of course impossible under any
circumstances.

> Note that this is only relevant for a few dozen companies.

... because of size/volume. Your argument would make it relevant to all
companies, unless Netflix and other large companies have to be held to a
higher standard - which I do not think is being argued.

The CAP theorem is really about bottlenecks: high volumes of possibly
conflicting writes to different nodes cannot be synchronised with guarantees.

It is possible to have distributed transactions on lightly loaded servers.
These could complete in a timely fashion (<10 secs) 99.9% of the time, with
(theoretically) a geometric drop-off in the probability of longer delays.
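
That drop-off is just independence at work (illustrative numbers only,
assuming each 10-second window is an independent retry):

    # If each 10 s window independently completes the transaction with
    # probability p = 0.999, then P(still pending after n windows)
    # decays geometrically: (1 - p) ** n.
    p = 0.999
    for n in range(1, 4):
        print("P(delay > %2d s) = %.0e" % (10 * n, (1 - p) ** n))
    # 1e-03, 1e-06, 1e-09 -- the geometric drop-off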

The answer to the disconnected data centre is to have 3 with majority rule.
Then the loss of a single link can no longer break consistency.
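
A sketch of that majority rule (DC names invented):

    # Three datacentres; a write commits only if a strict majority
    # acknowledges it, so a single severed link (or a lost DC) can
    # never yield two divergent "committed" histories.
    DCS = {"dc-a", "dc-b", "dc-c"}

    def can_commit(acks):
        return len(acks & DCS) > len(DCS) // 2

    print(can_commit({"dc-a", "dc-b"}))  # True: majority side proceeds
    print(can_commit({"dc-c"}))          # False: minority must refuse writes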

What is more, the article talks about the data being internally inconsistent
(dangling references, etc.); this seems to be totally unnecessary for
something like private user data. For the rest (recommendations), as
discussed, other strategies can be employed that at least guarantee the
internal consistency of a node.

