
Twitter migrates data to Google Cloud - theBashShell
https://cloud.google.com/twitter/
======
boulos
Disclosure: I work at Google Cloud (and directly with Derek and the Twitter
team).

I’ll try to edit this page tomorrow when I’m at a computer, but there’s much
more information in Derek’s talk at NEXT [1]. They (rightfully) didn’t want to
get into a detailed “this is what we saw on <X>”, but Derek alludes to their
careful benchmarking across providers.

While you should always assume smart people make economically reasonable
decisions, Derek’s point about savings is about list price differences that
result from total system performance (and not any sort of special
discounting). I’m hoping a follow-up talk will let us say more about how the
migration is going, while this talk was focused on the decision to move part
of (!) their Hadoop environment to GCP.

[1]
[https://m.youtube.com/watch?v=4FLFcWgZdo4](https://m.youtube.com/watch?v=4FLFcWgZdo4)

~~~
dijit
FWIW I’m not a big cloud fan, but I was tasked with finding the “least worst”
cloud provider in terms of predictable throughput and total cost (incl.
dev/ops time), and GCP came out as a clear winner despite heavy financial
incentives from both Azure and AWS. (We’re a very large corp.)

Due to our new allegiance to Google Cloud I’ve been given a little more
privileged access to engineers (after the fact), and I can tell that I
definitely made the right choice. The people I spoke to favoured having a
clean/clear backend with actual quality-of-life features; but since those
don’t fit nicely into feature comparison charts, people often think that GCP
is less mature or featureful.

I’m a real convert, and our company uses all three of the big cloud providers
in some fashion, but my team only deals with GCP and we’ve had the fewest
headaches.

What I’m trying to say is: you guys are doing great. I’m really happy with the
product for my use-cases.

~~~
indigochill
In what way(s) do you find GCP better than AWS? My company was using both for
a while but has migrated more towards AWS lately. But I'm just learning my way
around cloud development now so I don't have insight into the major
differences between providers.

~~~
manigandham
GCP has the fastest, simplest, and cheapest primitives for building.
Organization and project hierarchy combined with IAM permissions (integrated
with G Suite if you use it) make security and access easy. Every project has
its own namespace and can be transferred to different owners or billed
separately.

VMs don't have a mess of instance types, just standard but customizable
cpu/ram with any disks and local SSDs attached. Live-migrated, billed to the
second, and automatically grouped per cpu/ram increments for billing
discounts.

Networking is 2 Gbps/core up to 16 Gbps/instance, with low latency and high
throughput regardless of zone, and doesn't need placement groups. VPCs are
globally connected and can be peered and shared easily across projects, so
the network is maintained in one place across multiple teams. Fast global
load balancing with a single IP across many protocols and immediate scaling.

Storage is fast with uncapped bandwidth and has strongly consistent listings.
BigQuery, BigTable and lots of other services have "no-ops" so there's no
overhead other than getting your work done. Support is also flat rate and not
a ripoff percentage.

The major downsides to GCP are the lack of managed services, limited and
outdated versions for what they do offer, poor documentation and broken SDKs,
and dealing with the opinionated approach of their core in-house products.
This usually means as a startup that you are more productive on AWS or Azure
because you can get going in a few clicks with a vast ecosystem, while GCP is
a better home for larger companies that have everything built and just need
strong core offerings with operational simplicity.

~~~
lnkmails
Your reasoning and conclusion resonate a lot with my own findings for a
project that we tried to bootstrap. GCP's GKE and Istio integration make them
more compelling to consider for containerized workloads; EKS isn't quite
there. One of the things we are struggling with is finding a qualified
partner who can help us build our new project on GCP. We don't have a mature
infrastructure team, and while we try to bootstrap one, we would like to rely
on partners to help us move the needle. Even partners on GCP's premier list
don't have case studies of migrating monoliths to GCP; most of them talk
about G Suite migration, which isn't quite the same. AWS wins this battle:
they have far more mature partners with a better track record and thought
leadership. I wonder if you see this the same way. Maybe partnership is not
crucial for you? Some insights from the community would help.

~~~
manigandham
Yes. AWS's biggest advantage now is the giant marketplace of vendors and
partners, so you can get help and managed services for just about anything.

In my experience, most startups just want managed options they can run
themselves rather than engaging partners, but if that's what you need then
AWS will have more companies to offer, although GCP does have qualified
partners. I recommend contacting one of the GCP developer advocates, either
in this thread or on Twitter, for help; or email me separately and I'll put
you in touch.

Also, I haven't worked with them, but
[https://shinesolutions.com/](https://shinesolutions.com/) has put out plenty
of articles and case studies that seem to show they're pretty capable; they
might work for you.

~~~
vira28
Just curious, which support tier are you using from GCP?

For an early-stage startup/indie developer, GCP's support model is not
appealing. $100 per user for the development role is unacceptable (at least
to us, at the stage where we are now).

~~~
manigandham
We use the production roles, but you don't have to sign up every single
person, just those who file tickets and interact with support.

If you're very early then you can also try the older support pricing:
[https://cloud.google.com/support/premium/](https://cloud.google.com/support/premium/)

If that's still too expensive then you can probably rely on the free support
and forums until you have more spend and revenue.

------
dustinmoris
Reading most comments here I feel like nobody wants to give a tiny bit of
credit to either Twitter or Google, which I find disenchanting.

It is true that any cloud provider would get good publicity from being able
to say that Twitter runs on their cloud, but that is precisely why we
shouldn't just shrug this off as a publicity/business decision: just as
Google might have tried to persuade Twitter, other cloud providers will most
certainly have tried their best too. If Twitter still went with GCP, then I
think it is only fair to assume there must have been some real advantage in
going with GCP. I don't think anyone who says this must have been a pure
business decision is being entirely honest with themselves, because a cloud
which cannot handle Twitter's volume is no good even if it were entirely
free.

I have nothing to do with Twitter (I actually don't even like Twitter,
because it is such a toxic social media platform) and I don't like many
things Google does, but as an engineer I can only say, after using Azure, AWS
and GCP for many years in a commercial setting, that GCP is indeed years
ahead. The quality, speed and reliability of GCP are second to none in my
experience, and I honestly couldn't say the same about AWS, and especially
not about Azure.

~~~
smt88
I'm not worried about GCP's quality or technical merits.

Google's management is what scares me. I'd never build a business around a
company with so little follow-through, commitment to product longevity, or
focus.

------
pizza
The scale of these megacorps is crazy! I wonder how you even plan to move that
much data. I know that Amazon has a service called Snowmobile for stuff like
this. They probably did something similar here.

"You can transfer up to 100PB per Snowmobile, a 45-foot long rugged sized
shipping container, pulled by a semi-trailer truck."

[https://www.youtube.com/watch?v=8vQmTZTq7nw](https://www.youtube.com/watch?v=8vQmTZTq7nw)

~~~
derekalyon
This is Derek, from the video. Sorry I am just joining the conversation now.

For this use case, we set up 800 Gbps of interconnect with Google.

~~~
chatmasta
So at 300PB it took about a month to transfer? Sounds like an interesting
project in itself.
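A quick back-of-envelope check supports the "about a month" figure. This is a sketch that assumes the full 800 Gbps is saturated the whole time, ignoring protocol overhead:

```python
# Back-of-envelope: time to move 300 PB over an 800 Gbps interconnect,
# assuming the link is fully saturated for the duration.

data_bits = 300e15 * 8   # 300 PB (decimal) expressed in bits
link_bps = 800e9         # 800 Gbps aggregate interconnect

seconds = data_bits / link_bps
days = seconds / 86_400  # seconds per day

print(f"{days:.1f} days")  # ~34.7 days, i.e. about a month
```

Any real transfer would take longer, since sustained full-link utilization is rarely achievable.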

~~~
derekalyon
Yeah! We’ll be operating in a hybrid model steady-state, so the links aren’t
just for the one-time copy.

~~~
AndrewKemendo
That's insane.

With the interconnect being faster than the network cards by several orders of
magnitude (I assume unless there is special hardware), how is it structured to
utilize all of the 800Gbps?

~~~
derekalyon
We are using dedicated clusters to push data over the links.

A member of our team proposed a session for Google NEXT '19 that goes into the
architecture we are using in a lot more depth. We're waiting to see if that
talk is accepted or not, but in any case we're planning to share more soon.
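Dedicated clusters make sense once you do the aggregation math: no single NIC approaches 800 Gbps, so many hosts have to push in parallel. A rough sketch, with per-host NIC speeds that are purely illustrative assumptions (not figures from the talk):

```python
# Hosts needed to saturate an 800 Gbps link, for a few plausible
# per-host NIC speeds (illustrative numbers only).

link_gbps = 800

for nic_gbps in (10, 25, 40):
    hosts = -(-link_gbps // nic_gbps)  # ceiling division
    print(f"{nic_gbps} Gbps NICs: {hosts} hosts")
```

Even with fast NICs, dozens of machines are needed just to keep the links busy, before accounting for disk and coordination overhead.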

------
andrewguenther
Is there anything new here? Twitter announced this move back in May of last
year:
[https://blog.twitter.com/engineering/en_us/topics/infrastruc...](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2018/a-new-collaboration-with-google-cloud.html)

The video linked on the page is also from last August:
[https://www.youtube.com/watch?v=T1zjmNAuMjs](https://www.youtube.com/watch?v=T1zjmNAuMjs)

Throw a 2018 tag on this?

------
echelon
Correct me if I'm wrong, but this doesn't say _what_ Twitter moved to the
cloud. It could be literally anything, and may not in fact be core tweet or
user data.

This strikes me as a team (or teams) outgrowing the storage options that were
offered internally and choosing to outsource to fit their needs. Isolated use
for one business case, and not indicative of a broad movement to the cloud on
the part of the whole org.

I could be wrong, of course.

I'd like to know more. And I'd certainly like to know what the engineers at
Twitter think.

~~~
sidcool
You are not wrong. The message here is that, whatever data was transferred,
Google was able to handle hundreds of petabytes of it. It could just be
billions of photos of cats.

~~~
anvuong
billions of photos of cats is invaluable data, mind you

~~~
inferiorhuman
Well Twitter is now part of the official White House record, isn't it?

------
egorfine
The next global Twitter downtime?

Their credit card fails at Google Payments, the account gets marked for
fraud, and no one in support can help them recover. Twitter is gone for good.

~~~
jasonvorhe
This has long been fixed.

~~~
marcinzm
Every few weeks there's a story about something in Google's automated
machinery failing a customer with no recourse except to post an angry
blog/Twitter/Hacker News note and hope it gets noticed. The core issue (bad
and powerless customer service) isn't fixed; all that changes is the
particular way it may bite you.

~~~
munchbunny
When you're Twitter-sized, that was never a problem to begin with. For mere
mortals, it continues to be a problem.

------
cyberferret
I've scoured the comments and article, but I might have missed it... What did
Twitter migrate _from_?? Did they have their own data centre before moving to
Google Cloud?

~~~
joatmon-snoo
Yep.

[https://www.datacenterknowledge.com/archives/2015/07/13/cust...](https://www.datacenterknowledge.com/archives/2015/07/13/custom-servers-process-tweets-in-twitter-data-centers)

~~~
userbinator
Another step backward for the decentralisation of the Internet... not that
Twitter was decentralised in the first place, but it's unsettling to see
Google taking over more and more of the "big pieces" of Internet.

~~~
zymhan
Funnily enough, Twitter and Google were/are neighbors in a QTS datacenter in
Atlanta.

~~~
forgot-my-pw
Just moving across the street then? It's probably the fastest data transfer
ever by physically moving all the hard drives!

------
ripvanwinkle
Any inside stories on what Twitter discovered when they benchmarked Google
against other providers? I imagine they must have done a very exhaustive
evaluation.

~~~
gkoberger
Or, Google thought the publicity was worth a gigantic discount.

(EDIT: The provided quote from a Twitter employee hints at this directly: "the
savings")

~~~
cheriot
Google may be playing the long game. I've seen an effect on my team after
moving from physical hardware and centralized capacity provisioning to GCP
with immediate gratification hardware. It's easier to spend money so we ask
for more stuff more often.

~~~
pdelgallego
It is the Jevons paradox [1]. I have seen it happen as well when I migrated a
client from a standard cloud architecture to a serverless architecture.

>>> In economics, the Jevons paradox (sometimes Jevons effect) occurs when
technological progress or government policy increases the efficiency with
which a resource is used (reducing the amount necessary for any one use), but
the rate of consumption of that resource rises due to increasing demand.

[1]
[https://en.wikipedia.org/wiki/Jevons_paradox](https://en.wikipedia.org/wiki/Jevons_paradox)

------
rapsey
Well hopefully google does not decide to abandon google cloud :)

~~~
advisedwang
Hopefully Google doesn't abandon search!

~~~
samfisher83
Search+Ads are to Google what Windows+Office are to Microsoft.

Those are the sacred cows.

~~~
ocdtrekkie
Microsoft has been hacking off bits of its dependency on Windows these days
in order to push and support its cloud business, to the point that I would
argue Microsoft is aggressively trying to push people off of Windows+Office
and onto Azure+365. Microsoft would far rather you use Linux on their cloud
platform than have you use Windows literally anywhere else.

I don't think Windows counts as a sacred cow anymore at Microsoft.

------
gigatexal
Huge win for Google. I hope this helps Twitter focus on features and not
firefighting infrastructure issues.

~~~
SmellyGeekBoy
Does Twitter need more features? Perhaps they could focus more on customer
support and moderation.

~~~
gigatexal
A downvote button would be nice. The ability to edit a tweet would also be
nice. It’s 2019 after all, and we can edit things on other platforms. But yes,
moderation would be great. Maybe they can leverage Google’s machine learning
hardware and smarts.

------
slackoverflower
Just makes the eventual Google acquisition one bit easier.

~~~
bduerst
What would Google gain from the acquisition that they don't already have?

AFAIK Twitter allows search indexing and is already a publisher on their
network.

~~~
ec109685
They don't provide their realtime firehose for free, though.

~~~
bduerst
Surely that cost isn't anywhere near $24B, which is the market cap for
Twitter.

~~~
ec109685
For $24B you get Twitter (an asset) plus its firehose. I was mostly pointing
out that Twitter has data that Google does not.

------
temp231239
Huh. I was under the impression that at a higher scale of operations,
building your own cloud is more economical. I guess money is not an issue for
Twitter.

~~~
pas
They are moving batch jobs to GCP. For those, the occasional burst loads and
the often-underutilized capacity probably make a flexible compute fabric (aka
a cloud provider) a better choice.

If they are moving everything, well, Netflix did that too. Maybe it's
opportunity cost; maybe having your engineers work on your core product
instead of running VMs is cheaper altogether.

~~~
tybit
Much of Netflix’s spend would presumably be on CDNs, which they run
themselves, not on AWS.

~~~
opportune
Uh, correct me if I'm wrong, but all CDNs that Netflix purchases also need a
large storage cache backing them, right? Meaning each CDN Netflix uses for
local caching also requires a colocated datastore to circumvent their
centralized-bandwidth issue.

~~~
PuffinBlue
You're not wrong per se, but Netflix takes a similar route to Google Global
Cache in that they provide the hardware and place it inside the networks of
other ISPs. So it's a CDN in the sense that it's a distributed content
delivery network, but not in the sense that they just use large traditional
CDN providers.

Netflix provides massive storage boxes to ISPs that serve content from within
the network of the ISP the user is connecting from. This can save the ISP a
lot of external traffic so they generally want to do this to save costs and
meet customer demands. YouTube does a similar thing.

They call it OpenConnect:

[https://openconnect.netflix.com/en_gb/](https://openconnect.netflix.com/en_gb/)

------
opportune
Side note: if you want to know why distributed infrastructure engineering has
been the place to be recently, consider that engineers making $200-500k/year
are making decisions that can save big corporations $1-5mm per year in
compute/storage costs.

------
kerng
The issue with GCP, and with Google in general, is support. I'm certain
that's not an issue for Twitter, as there are probably multiple people on
call just for the Twitter deal. But it might mislead the average company into
believing there is decent support with GCP. Hit any custom issue (like
billing) and it's a headache dealing with Google.

------
patrickg_zill
What Twitter is paying for, and the level of service they receive from
Google, is very different from the typical GCP pricing and service level that
would be available for, say, your company...

Realistically, it can't be considered the same service, can it?

------
sudhirj
One thing to remember is that when evaluating a cloud provider, benchmarks
are useless if discussed in absolute numbers or in terms of hardware specs.
Because these clouds are so horizontally scalable, it's more a question of
average cost per compute operation or cost per (peta)byte.

Both AWS and Google clouds are perfectly capable of running Twitter, and
running any software that Twitter uses, even if the number of machines is
different. The only "benchmark" applicable is actually the total _negotiated_
cost of the machines required to get the job done.

So it's not about being able to do things on Google cloud with fewer
processors or less RAM or faster hard drives - Google was willing to give them
a lower total cost of ownership for reasons known only to Google.

Do not assume that Google (or anybody else) will give you a similar
preferential deal. Ignore the "benchmarks".
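The cost-normalization idea can be illustrated with entirely made-up numbers (neither entry reflects any real provider's pricing or performance): the machine that looks cheaper per hour can be the more expensive one per unit of work.

```python
# Normalizing a benchmark to cost per unit of work. All numbers are
# hypothetical; neither entry corresponds to a real cloud provider.

providers = {
    "provider_a": {"usd_per_hour": 0.50, "ops_per_hour": 1_000_000},
    "provider_b": {"usd_per_hour": 0.40, "ops_per_hour": 700_000},
}

for name, p in providers.items():
    usd_per_million_ops = p["usd_per_hour"] / p["ops_per_hour"] * 1e6
    print(f"{name}: ${usd_per_million_ops:.2f} per million ops")
```

In this toy example the nominally cheaper $0.40/hour machine costs about $0.57 per million operations versus $0.50, so raw hardware price alone is the wrong comparison.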

~~~
sciurus
I don't think it's purely a measure of cost per compute operation. That
assumes you would run the same software on any cloud, using plain VMs or a
service that slightly abstracts them. That may be true in this case, since
Twitter will continue to use Hadoop/Spark, but sometimes you can get a real
advantage from switching to a service that only one cloud provider offers.
For instance, someone in these comments pointed out that Twitter already
migrated their analytics workload to BigQuery. Evaluating BigQuery vs
Redshift vs something else is not as clear-cut as cost per compute operation.

------
thisgoodlife
How are they gonna move all 300 PB of data to Google's data center? A fleet
of semi-trailer trucks?

~~~
aboutruby
For instance, AWS has 50-terabyte devices that they ship from your own data
center to their data center:
[https://aws.amazon.com/snowball/](https://aws.amazon.com/snowball/)

~~~
kondro
They also have the 100PB Snowmobile.

[https://aws.amazon.com/snowmobile/](https://aws.amazon.com/snowmobile/)
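Taking the capacities quoted in this thread at face value (50 TB per Snowball, 100 PB per Snowmobile), the physical-shipping math for 300 PB works out as:

```python
# Devices needed to ship 300 PB physically, using the capacities
# quoted upthread: 50 TB per Snowball, 100 PB per Snowmobile.

DATA_PB = 300

snowballs = DATA_PB * 1000 // 50   # 1 PB = 1000 TB (decimal)
snowmobiles = -(-DATA_PB // 100)   # ceiling division

print(snowballs)     # 6000 Snowball-class devices
print(snowmobiles)   # 3 Snowmobile trucks
```

Which goes some way toward explaining why Twitter chose an 800 Gbps interconnect over a fleet of trucks.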

------
tome
I'm curious what happens to everyone at Twitter who was employed in managing
their old data centre.

~~~
thecanman
Well I'm sure they still have amazing career prospects just based on their
resumes.

~~~
tome
Sure, but what's the actual process for dealing with them? Will they be let go
in small batches as the infrastructure moves over to GCP, or what?

------
yeukhon
So what will happen to the small Cloud team in NY? I thought they were
developing their own dc.

------
prepend
This really makes me wonder if Twitter finally made up their mind who they
will merge with.

~~~
partiallypro
I too was thinking this; does this mean a future acquisition is not out of
the question? Frankly, I find a Google-owned Twitter to be a terrifying
prospect. Not that Jack is doing a great job there... but I'd much rather
they be owned by multiple companies in a partnership than by one company.
Twitter, imo, is far, far more powerful than Facebook in driving world
events, given the leadership and journalists on the platform.

------
tanilama
I think with this announcement it is probably safe to say that, even at the
size of Twitter, it is still more economical to hand your operations to the
bigger players.

Hosting your own infrastructure makes less and less sense going forward; the
cloud is the new king now.

~~~
kortilla
Probably not safe to conclude anything. Twitter has very different
requirements than something like a bank, an insurance company, an auto
manufacturer, etc.

------
maloneyg
More beneficial for Google I believe.

------
cr0sh
I know my comment won't move this discussion forward in any fashion, but I
have to ask anyway.

What was this data, and why did they feel they needed to keep it?

I mean - 300PB is nothing to sneeze at. I can understand wanting to keep
important data around (IP, for instance, or recent server logs for the last 30
days or even a year) - but this amount seems to go well beyond that.

In fact, it almost seems like they have stored every single tweet ever
written. In addition to probably all of their logs, and who knows what else.

Why are they keeping all of that information? If it is all the tweets ever
written - they don't have copyright to it, so such information wouldn't belong
to them...

I'm not a big twitter user - can I search for a tweet on their system that I
(or someone else) made say...5 years ago or longer? If this is a bunch of
tweets - is that why they keep them around?

Are people in general aware of this? That is - if they are tweets - that
Twitter has stored all of them, forever and ever, and that they are all still
accessible (if not by the general public, then by law enforcement at the very
least)?

If they are tweets - do users have any recourse at being able to remove those
tweets (aside from those that they likely are legally bound to keep - I am
thinking of the current POTUS's tweets, for instance)?

What is the ultimate purpose behind keeping all of that data? Even if it
isn't a bunch of tweets, whatever it is can't really be that useful
otherwise, except maybe for the most recent stuff created in the past year,
at best. Things like that - logs, etc. - might be destined for rotated "cold
storage" and would ultimately "age out". Accounting records (financial data,
etc.) probably aren't that great a percentage of the data (if any at all) -
and whatever is there would likely need to be saved for a longer range, but
even then maybe only 10-15 years' worth (and it certainly wouldn't comprise
more than a fraction of the bulk).

So what is it? Why is it being kept? To what purpose? While 300PB isn't that
great of an amount in physical terms (that is, the storage medium probably
doesn't take up a great amount of space in a datacenter), it still isn't
anything tiny, either (especially depending on what medium was used - I mean,
I doubt they are storing it all on a ton of 256GB microSD cards - though that
does bring up an interesting idea/thought to mind - but I digress).
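For scale, the microSD aside works out to over a million cards. A sketch using decimal petabytes and gigabytes:

```python
# How many 256 GB microSD cards it would take to hold 300 PB
# (decimal units throughout: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes).

DATA_BYTES = 300 * 10**15    # 300 PB
CARD_BYTES = 256 * 10**9     # 256 GB per card

print(DATA_BYTES // CARD_BYTES)  # 1171875 -- well over a million cards
```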

I'm of the thought that companies shouldn't just "throw away" data - but it
should be tempered by "importance to humanity over time". Not everything
should be kept - but, for instance, it would have been nice if we (humanity)
had kept around the original transmissions from the surface of the moon, or
the blueprints to the rocket that took humans there. But we probably don't
need all the various interoffice memoranda and fluff at the various NASA
contractors (we probably don't need the accounting records - but BOMs might be
useful).

That's just a couple of examples I can think of off the top of my head, but
history is rife with instances of companies imploding or being bought out or
restructured in such a manner that data is just "destroyed" instead of kept.
Or lost, or otherwise made unretrievable to humanity.

Here with Twitter - an active company to be sure - we have the seemingly exact
opposite - almost like r/datahoarders were in charge.

...and that should concern people a bit, regardless of which company it is -
but especially a company like Twitter.

~~~
sidibe
I think the vast majority of people would expect Twitter to keep all the
tweets they've made...

------
yani
This feels like marketing material. Twitter jumped over to GCP because of the
offer they received.

~~~
SketchySeaBeast
The first clue was that the news was being hosted on
[https://cloud.google.com/](https://cloud.google.com/)

------
dugluak
Next step - acquire twitter?

~~~
smt88
I definitely wonder if Twitter's management considered this. Any price for
Twitter goes down by at least a few million if no major cloud migration is
necessary.

That said, a few million is a rounding error on a price Google would pay for
Twitter.

------
iheartpotatoes
wow. talk about "loss of service challenges".

------
synaesthesisx
[Removed]

~~~
jcmi
LOL

------
ranman
I have had a hard time dealing with twitter engineering since the account
activity API debacle.

