
Netflix Billing Migration to AWS - hepha1979
http://techblog.netflix.com/2016/06/netflix-billing-migration-to-aws.html
======
partiallypro
I find it to be a peculiar business decision to completely (as much as you
can) migrate to one of your largest competitors' cloud services. It seems like
Microsoft is the only one of the larger cloud providers that doesn't -really-
compete with Netflix. (Google has YouTube, and I guess even Microsoft has a
much smaller Windows Store presence.) Even if Amazon can't access the raw
data, they could see how you're utilizing it to improve their own video
service, and they get the benefit of billing you and potentially using their
pricing leverage to squeeze your margins.

I feel like Apple's approach of utilizing multiple providers makes more sense
(though they do this for uptime and redundancy). Maybe I'm being a pessimist.

~~~
tomphoolery
> Even if Amazon can't access the raw data, they could see how you're
> utilizing it to improve their own video service, and they get the benefit of
> billing you and potentially using their pricing leverage to squeeze your
> margins.

So you're saying Amazon is going to risk millions of dollars so they can make
a few more bucks on video streaming, which is like 3 levels down from their
primary business?

Prove it. I had an argument with my last company about this very issue. If
Amazon's primary business were video delivery, then that would make a lot of
sense. But where does Amazon's primary revenue stream _actually_ come from?
That's right, it's AWS. Amazon may be an excellent retailer, but they spend
just as much money as they make on the shipping and fulfillment side to get
shit on your doorstep faster and cheaper than anyone else out there. Each
person who spends money on Amazon can only really spend a few hundred dollars
per year. But even a small company that's entirely hosted on AWS, like 70% of
the companies I've worked for, pays Amazon thousands of dollars per month for
hosting. There are definitely more people than companies, but shipping stuff
to people costs Amazon a lot more than hosting your stuff does. Plus they do a
lot of R&D into making their own systems faster and more efficient, eventually
passing savings down when they re-work their pricing tiers.

Basically, my argument is/was that the whole idea of Amazon stealing your IP
to make a few extra bucks on whatever they happen to be doing is totally bunk.
Amazon is really in the infrastructure business, and if you're a video
startup...you're NOT. And take it from someone who watched a company try and
fail to build a competing private cloud with no budget and a skeleton
crew...it's stressful and not fun at all.

The main reason to use multiple providers is, as you said, uptime and
redundancy...and "not putting all your eggs in one basket", so to speak. It's
an engineering, not political, decision. My last company could have probably
saved themselves by moving everything to AWS and shutting down their Level3
internet-backbone connectivity and direct fiber from the office to the
datacenter (which is the same technology AWS is using anyway, except they have
an actual cloud API and not just a pile of servers), but they were too busy
conflating this engineering/performance decision with one that must be made
for political reasons.

~~~
tmptmp
> Prove it. I had an argument with my last company about this very issue. If
> Amazon's primary business was video delivery, then that would make a lot of
> sense. But where does Amazon's primary revenue stream actually come from?
> That's right, it's AWS.

Let me try. Earlier, Amazon's _primary_ business was selling books; then it
"became" selling almost every object that can be legally sold, and now you are
saying that it "is" AWS. What about tomorrow? Tomorrow, it may easily "become"
selling videos too. With Amazon, it's very much possible.

So it's not just a technical issue, it's a political/business issue too. Of
course, as you have said, they must take the trade-off into consideration. If
the trade-off is more like "killing yourself under the technical burden of
setting up a good network" vs "potentially allowing/helping Amazon to take
advantage of your hosted service on their AWS and thus become a future
competitor", then they may go to AWS and/or other cloud provider(s).

edit: allowing/helping

~~~
frik
Microsoft's business has shifted too. Long gone are the days when they were a
software vendor for end users. Nowadays they produce services and hardware
products for enterprise customers. And end users are the product: their new
subscription-based Office and Win10 collect a lot of private data, like key
presses (a keylogger) and audio from the microphone, scan documents, and
upload unspecified tracking data over many encrypted TLS connections (phoning
home). And Microsoft is known to suddenly compete with you. There is a rule:
never compete with Microsoft. They have more money than you, and they will
make their competing product/service/hardware available for less money than
you could offer.

~~~
corobo
Oh come off it. You're going to have to back that one with sources.

------
buro9
Thanks for writing this up, I'm just at the tail end of having re-architected
the CloudFlare billing system, also a subscription system written in Java with
a MySQL back-end, but fronted by a Go API that insulates the rest of the
business from the internals of the billing system.

The blog post covers a lot of the high level stuff really well, but I'm
interested to learn whether you experienced any issues along the way, what
they were, and how you dealt with them.

In CloudFlare's case, our migration was made more complex by also adding
PayPal and changing our processing gateway, both of which created risk that
we've had to work hard to understand and mitigate, e.g. how different gateways
may return different results with the same card.

~~~
pamonrails
We do a lot of migrations from custom systems or SaaS integrations to Kill
Bill (open-source subscription billing and payments platform) [0] and we've
summarized our strategy and usual pain points in our migration guide [1]. You
might find it useful.

Happy to chat offline too if you want to go into specifics.

[0] [http://killbill.io/](http://killbill.io/) [1]
[http://docs.killbill.io/0.16/migration_guide.html](http://docs.killbill.io/0.16/migration_guide.html)

------
tzakrajs
How many times is Netflix going to finalize its transition to AWS?

~~~
yeukhon
They won't be 100% on AWS, but probably close to 95%. Netflix owns its own
media CDN.

~~~
dexterdog
So almost all of their bandwidth is handled outside of AWS. I would wager a
guess that bandwidth is quite a bit more than 5% of their ops bill.

~~~
yeukhon
I think that's where you have to draw the line between owning your own
infrastructure and using a public cloud in the architecture.

------
ww520
DRBD works very well for high availability, and it's especially good for
providing failover for a master database, since that usually requires fast
failover, like under 10 seconds. Five or six years ago, I did a couple of HA
setups with DRBD along with Linux HA and a virtual IP. The failover worked
great.
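
For reference, a failover pair like the one described above is usually defined
with a short DRBD resource file. This is only a minimal sketch: the hostnames,
device paths, and addresses are hypothetical, and it assumes DRBD 8.x syntax
(protocol C is DRBD's synchronous mode):

```
# /etc/drbd.d/mysql.res -- minimal sketch; hostnames/devices are hypothetical
resource mysql {
  protocol C;            # synchronous: a write completes only after it
                         # reaches both nodes' disks
  device    /dev/drbd0;  # the block device MySQL's datadir sits on
  disk      /dev/sdb1;   # backing disk on each node
  meta-disk internal;
  on db1 { address 10.0.0.1:7788; }
  on db2 { address 10.0.0.2:7788; }
}
```

Linux HA (Heartbeat/Pacemaker) then decides which node is Primary, mounts the
filesystem on /dev/drbd0, starts MySQL, and claims the virtual IP.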

~~~
kbenson
I was under the impression the accepted way to fail over with MySQL and DRBD
was to fail out the old server and then start repairing tables, because you
couldn't be sure they were in a valid state. That's about decade-old info
though.

I always just set up master/master replication with mysql. You can get free
distributed reads that way if you architect your application right.

~~~
ww520
DRBD produces two mirrored copies of the data. When the primary fails, the
standby has whatever data the primary had just before the crash. When the
standby starts up, the RDBMS goes through its normal recovery and boot-up.
It's the same as if the primary crashed and was started up again.

MySQL's master/master has a number of complications and problems: 1. data loss
due to the async nature of replication, 2. update conflicts on the same data
on multiple masters, 3. two masters mean two IPs, so all the clients need to
know how to fail over to a different IP, 4. complications in adding or
removing a master.

With DRBD, the disks are mirrored, so there's no chance of data loss once a
transaction is committed. There's only one master, so no complicated conflict
resolution. Linux HA's virtual IP means the standby will take over the
primary's IP, so the clients don't need to know there's been a server
failover. Adding or removing a standby is easy. DRBD will sync the disks
automatically, with no downtime on the primary.

~~~
kbenson
> 1\. data loss due to async nature of replication

That depends on what you mean by async. The replication itself is synchronous
(statements cannot happen out of order), it's just not lockstep with disk
writes and commits. I think it's more illustrative to say it's delayed.

> 2\. update conflict on same data on multiple masters, 3. two masters mean
> two IP so all the clients need to know how to fail over to different IP

I'm not referring to multiple live masters, I'm referring to a set of servers
where each is both master and slave to the other, and one live master server
which gets the HA IP address. In that respect, there is no difference to a
DRBD replicated setup. Clients just use the HA IP.

> 4\. complication in adding or removing master.

I never found it that complicated, and I deployed it at least 5-6 times for
multiple companies, and even with slightly different topologies (master/master
where each master had an additional dedicated slave reserved for intensive
read-only queries). What did you find complicated about it?

~~~
ww520
1\. MySQL replication is async by default; see [1]. That means commit returns
before replication to the peer is complete. Committed data can be lost if the
master's disk is destroyed before a slave replicates it. NDB has synchronous
replication, but it has other limitations. 5.7 seems to have a semi-sync mode
now.

[1]
[http://dev.mysql.com/doc/refman/5.7/en/replication.html](http://dev.mysql.com/doc/refman/5.7/en/replication.html)
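
As a concrete illustration of that semi-sync mode, a sketch against a
hypothetical 5.7 master/replica pair (the plugin and variable names are the
stock MySQL ones; only the hosts are assumed):

```
# On the master: load and enable the semi-sync plugin, so COMMIT blocks
# until at least one replica acknowledges receiving the transaction.
mysql -e "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';"
mysql -e "SET GLOBAL rpl_semi_sync_master_enabled = 1;"

# On the replica: load the matching plugin, then restart the IO thread
# so it registers with the master as semi-sync capable.
mysql -e "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';"
mysql -e "SET GLOBAL rpl_semi_sync_slave_enabled = 1;"
mysql -e "STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;"
```

Note that semi-sync falls back to async if no replica acknowledges within
rpl_semi_sync_master_timeout, so it narrows, rather than eliminates, the
data-loss window.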

4\. As for complications, you need to find out and configure the binary log
position. Also, because replication is based on the binary log, if you ever
truncate the log you can't simply add a brand new master. You have to take a
backup on the primary, restore it to the new master, and then set the log
position to just prior to the backup. Just lots of extra complication.
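
For what it's worth, that backup-and-restore dance looks roughly like this
(hostnames and credentials are hypothetical; `--master-data=2` records the
binlog coordinates inside the dump so you don't have to hunt for them):

```
# Dump the primary, embedding its binlog file/position as a comment.
mysqldump --all-databases --single-transaction --master-data=2 \
  -h primary > seed.sql

# The recorded coordinates sit near the top of the dump:
grep -m1 'CHANGE MASTER TO' seed.sql

# Load the dump on the new server, then point it at those coordinates
# (the file/position values below are placeholders; use the ones grep printed).
mysql -h newmaster < seed.sql
mysql -h newmaster -e "CHANGE MASTER TO MASTER_HOST='primary',
  MASTER_USER='repl', MASTER_PASSWORD='secret',
  MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=120;
  START SLAVE;"
```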

~~~
kbenson
> That means commit returns before replication to peer is complete.

That's what I was talking about. It's just a matter of what aspect of it you
are talking about, but I'll give you that it's in their official
documentation, so there's no point in me pressing the issue.

> Committed data can be lost if master's disk is destroyed before a slave
> replicates it.

That is true. It's a trade-off you can make for slightly different CAP
assurances, or nuances in the failure states at least (mostly in what you
might expect to do in a split brain scenario).

> For complication, you need to find out and configure the binary log
> position.

Your backups should be logging the binary log position as well (--dump-slave
or --master-data). If they aren't, you aren't doing yourself any favors.

> Also because replication is on the binary log, if you ever truncate the log,
> you can't simply add a brand new master. You have to do a backup on the
> primary and restore to the new master, and then set up the log position to
> just prior to backup. Just lots of extra complication.

If you have to do another backup because the log was truncated recently, you
aren't much worse off than doing backups with a DRBD-replicated setup (even if
you have a slave configured and do backups off that, you can truncate logs and
then need to back up from the master as well). The downside is that you may
not want to immediately do a backup of the master, for load reasons, which
will leave you without a failover for a short while. Whether an extra
queryable resource is worth that is up to the architect.

I remember at a Percona training I was at a few years back, there were a few
more clustering options available that I hadn't played with (and still
haven't). Percona XtraDB Cluster was one, and it's supposed to support
synchronous master/master replication. That might be the best of both worlds,
if it lives up to its billing.

------
merb
They used a DRBD-replicated MySQL. I wonder why they used MySQL over
PostgreSQL then. It would be great to know if they looked into that as well.

~~~
setheron
What does DRBD have to do with MySQL vs PostgreSQL? DRBD is just block device
replication.

~~~
gdulli
I think it's just a reflex for (ex-MySQL) Postgres users to ask that anytime
they see someone using MySQL. If you have significant experience with both you
know the quality of life is different between the two.

~~~
merb
Actually, I never really used MySQL. But sometimes it's great to know _why_
decisions are made. Actually, I don't think they said, "well, let's use MySQL
over Oracle", especially since MySQL is an Oracle product too. There would've
been a way to use MariaDB as well. And I guess anything with license costs
falls out already (they explained why in the article).

Edit: My guess would be that they still keep Galera in mind, but since they
didn't share the why, one can only guess. And transaction wraparound, maybe.

~~~
rimantas
MySQL had usable replication years before PG. Maybe that's why.

------
adrianggg
I got really excited because I read the heading as "Netflix migration costs
paid for by AWS". I thought they'd worked out a deal to get a free tier during
the migration. Wow... oh... nevermind... :-)

~~~
ryanmerket
I work for AWS. I believe we do offer some migration assistance for bigger
startups. Hit me up if you want to learn more: rmerket@amazon.com

------
hetfeld
Dropped Oracle, using MySQL. Why not use PostgreSQL instead?

~~~
lapitopi
I work on the Netflix Billing Team.

PostgreSQL was indeed a very attractive option, but we wanted to keep a path
to Aurora open. When we were working on the migration, Aurora was still in
beta, so instead of going to Aurora directly, we decided to run our own MySQL
instances on EC2.

~~~
philliphaydon
Why would you go to RDS with such large amounts of data when AWS doesn't
provide the ability to get data out easily? If you moved away from AWS in the
future for whatever reason, your data would be more or less stuck in AWS.

~~~
jon-wood
Amazon released their Database Migration Service a while back, which allows
you to transfer data in just about any way you might want to. They'll migrate
data between different RDS engines (MySQL to Postgres, for example), and to
databases outside AWS. They even support near real-time replication to
database servers outside AWS, so you could hypothetically replicate your RDS
instances to a failover environment with another provider. There's very little
risk of your data being locked in now.

[1]
[https://aws.amazon.com/dms/?nc2=h_mo](https://aws.amazon.com/dms/?nc2=h_mo)

~~~
philliphaydon
Unless you use SQL Server. The DMS service is basically useless with SQL
Server. We can't get our 200GB DB out of AWS. And any method that works
without DMS takes about 40 hours.

~~~
toomuchtodo
Can you replicate to a slave outside of RDS and then perform a replica
promotion during scheduled maintenance?

------
crisopolis
At least they ditched Oracle (licensed)...

~~~
emcrazyone
I worked at a Fortune-5 that was heavily invested in Oracle.

Oracle has a nasty licensing model where they charge you per core regardless
of whether that core is a physical one or not (hyper-threading). While I was
there, the suits told all the engineering managers that Oracle was out and the
going-forward solution was Microsoft SQL Server, which, as I understand it,
has a more relaxed licensing model.

Another thing I'm wondering about: I would figure Netflix to be big enough to
have SAN storage. Just about every large company I worked at used SAN
replication technologies instead of open source stuff. And it's not a debate
about open source vs. commercial solutions. It's more about support. Large
companies want a throat to grab when things break badly.

~~~
c17r
MS SQL used to be per socket pricing. With 2012 they switched to per core.

That was a sad day.

~~~
e12e
Does anyone happen to know how MS actually charges per core these days? I
found: [https://www.microsoft.com/en-us/Licensing/learn-more/brief-l...](https://www.microsoft.com/en-us/Licensing/learn-more/brief-licensing-by-cores.aspx)
which points to
[http://go.microsoft.com/fwlink/?LinkID=229882](http://go.microsoft.com/fwlink/?LinkID=229882)
-- taken together, it looks like AMD hex-core+ CPUs count at .75 cores,
single-core CPUs count as 4, and dual-cores count as 2. Given that you need to
buy 2-packs, it appears you can get a single 2-pack for a single dual-core
(hyper-threading appears to be ignored), 2 packs for a single single-core CPU,
or 3 two-packs for 8 AMD CPUs?

------
cia48621793
What if the Netflix side on AWS was hacked, say their security credentials
were leaked?

------
ForHackernews
Does this mean Amazon can mine Netflix's data to improve their Prime Video
services?

~~~
thramp
Customer data is sacrosanct within Amazon. It cannot be touched without the
customer's consent.

source: I work for Amazon Web Services.

~~~
kevin_b_er
Amazon already showed its hand at maliciousness when it blackballed all
chromecasts from stores, including 3rd party sellers. I don't put suddenly
blackballing netflix beyond amazon's consideration. Amazon is cutthroat with
public customers, corporate customers, and with its own employees. I think
netflix is stupid to put more eggs in the vulture's nest.

In fact, I can pretty much guarantee that at the first opportunity where the
lawyers agree it is a usable hole, they'll try to kill netflix through denying
it service. Taking out netflix for a week or two while the engineers rebuild
the backends with a different provider would be excellent for amazon's video
division.

------
jason46
Is this why my daughter can't watch Young Justice? I've noticed quite a few
titles show as unavailable. Curious if Netflix is positioning itself to sell
to Amazon.

------
mikikian
Dropped Oracle, using MySQL. Why not use AWS Aurora instead?

~~~
nemothekid
FTA: While our subscription processing was using data in our Cassandra
datastore, our payment processor needed ACID capabilities of an RDBMS to
process charge transactions. We still had a multi-terabyte database that would
not fit in AWS RDS with TB limitations.

------
back_beyond
This thread is missing Google's PR team and links to its SRE book

------
innocenat
Am I the only one who initially thought that Netflix was passing the bill for
the migration to AWS?

------
Annatar
_Considering how much code and data was interacting with Oracle, one of our
objectives was to disintegrate our giant Oracle based solution into a services
based architecture. Some of our APIs needed to be multi-region and highly
available. So we decided to split our data into multiple data stores.
Subscriber data was migrated to Cassandra data store. Our payment processing
integration needed ACID transaction. Hence all relevant data was migrated to
MYSQL._

Considering that Cassandra is not ACID compliant with her "eventual
consistency", and that MySQL is notorious for corrupting data and not
functioning correctly, I am compelled to wonder just what kind of people work
at Netflix. And who gets the idea to go to AWS and pay the full virtualization
on Linux performance penalty?

Now, I've done Oracle engineering at some very large databases (hundreds of
millions of rows, OLTP and DWH), and I know that Oracle is a smoking fast
database when the right people develop on it. Also makes me wonder what kind
of code they had running, and what kind of people selected it, when they
managed to gum up what is essentially the Bugatti Veyron of databases.

Given this information from "Netflix" I won't be considering them as a
potential employer any time soon. It has to be a mess over there.

~~~
tedivm
The majority of the corruption issues with MySQL come from using the InnoDB
engine. AWS built their own MySQL-compatible engine, called Aurora, which I
would be shocked if Netflix wasn't using. It's designed for distributed
workloads and should be harder to corrupt.

I'll admit I'm confused about picking Cassandra as well, but not for the same
reasons you are. They're only storing subscriber data (billing address,
subscription type, etc.). That data is going to remain static for months at a
time. When it changes, the only potential problem that could occur is the
billing process using old data, but I'm guessing their system is smart enough
to try again in thirty minutes.

Oracle may be fast, but it's also expensive. This is billing, which means
batched jobs running in the background -- they don't care how long each
individual transaction takes, and I'm positive it's going to be cheaper to
spin up more servers to compensate than it is to pay Oracle's licensing fees.

~~~
tokensimian
I'm not sure if you saw this other thread, but I thought it answered one of
your questions so was worth sharing. Sounds like Aurora wasn't stable when
this migration began, so they did the migration with an eye towards the next
migration (to Aurora).

Sorry about the c/p but I don't see a permalink on lapitopi's comment.

lapitopi 1 day ago: I work on the Netflix Billing Team. PostgreSQL was indeed
a very attractive option, but we wanted to keep a path to Aurora open. When we
were working on the migration, Aurora was still in beta, so instead of going
to Aurora directly, we decided to run our own MySQL instances on EC2.

edit: begun/began

