I’ll try to edit this page tomorrow when I’m at a computer, but there’s much more information in Derek’s talk at NEXT. They (rightfully) didn’t want to get into a detailed “this is what we saw on <X>”, but Derek alludes to their careful benchmarking across providers.
While you should always assume smart people make economically reasonable decisions, Derek’s point about savings is about list price differences that result from total system performance (and not any sort of special discounting). I’m hoping a follow-up talk will let us say more about how the migration is going, while this talk was focused on the decision to move part of (!) their Hadoop environment to GCP.
Due to our new allegiance to Google Cloud I’ve been given a little more privileged access to engineers (after the fact), and I can tell that I definitely made the right choice. The people I spoke to favoured having a clean/clear backend with actual quality-of-life features; but since those don’t go nicely into feature comparison charts, people often think that GCP is less mature or featureful.
I’m a real convert, and our company uses all three of the big cloud providers in some fashion, but my team only deals with GCP and we’ve had the least headaches.
What I’m trying to say is: you guys are doing great. I’m really happy with the product for my use-cases.
If it works for you that's good, but it wouldn't be my first choice even if I found it to cost less than other alternatives.
I will agree that GCP has some nice quality-of-life features, and is certainly favorable for a small project compared to some other providers, but I find it hard to trust Google's ability to keep up with the demands of a large organization.
VMs don't have a mess of instance types, just standard but customizable CPU/RAM with any disks and local SSDs attached. Live-migrated, billed to the second, and automatically grouped per CPU/RAM increment for billing discounts.
Networking is 2 Gbps/core up to 16 Gbps/instance with low latency and high throughput regardless of zone, and doesn't need placement groups. VPCs are globally connected and can be peered and shared easily across projects so that they're maintained in one place across multiple teams. Fast global load balancing with a single IP across many protocols and immediate scaling.
Storage is fast with uncapped bandwidth and has strongly consistent listings. BigQuery, BigTable and lots of other services have "no-ops" so there's no overhead other than getting your work done. Support is also flat rate and not a ripoff percentage.
The major downsides to GCP are the lack of managed services, limited and outdated versions for what they do offer, poor documentation and broken SDKs, and dealing with the opinionated approach of their core in-house products. This usually means as a startup that you are more productive on AWS or Azure because you can get going in a few clicks with a vast ecosystem, while GCP is a better home for larger companies that have everything built and just need strong core offerings with operational simplicity.
In my experience, most startups just want managed options they can run themselves rather than engaging partners but if that's what you need then AWS will have more companies to offer, although GCP does have qualified partners. I recommend contacting one of the GCP developer advocates either in this thread or on twitter for help, or email me separately and I'll put you in touch.
Also, I haven't worked with them, but https://shinesolutions.com/ has put out plenty of articles and case studies that seem to show they're pretty capable; that might work for you.
For an early-stage startup/indie developer, GCP's support model is not appealing.
$100 per user for the Development role is unacceptable (at least to us, at the stage we're at now).
If you're very early then you can also try the older support pricing: https://cloud.google.com/support/premium/
If that's still too expensive then you can probably rely on the free support and forums until you have more spend and revenue.
https://cloud.google.com/solutions/migration-center/ is a reasonable jumping off point. Sorry if that didn't come up more clearly.
GCP was always exactly the same perf, on the button every time.
Azure was all over the place, fast then slow then fast then slow.
AWS was in the middle.
If you want guaranteed perf then go with GCP, 100%.
That is spot on. After a year of research and testing on GCP and AWS, I concluded with the same thing. AWS is much more startup/indie-friendly.
We hate Oracle like the next person, but we're locked in. And there's no way we're going to the "Oracle Cloud"
I can understand why this is a no-go for the GP.
This is the cost of lock-in.
Personally I'd rather have the expertise on staff than pay oracle (and microsoft) for this shitty behaviour.
Edit: I'm being downvoted but it's true, if you need enterprise support you're paying the same percentages you would on any other cloud service.
The role-based support is flat-rate but you're right that the Enterprise tier is $15k or percentage spend. From my experience, most companies are fine with the 1-hour production tier.
I haven't worked with GCP beyond 6-figure scale but I don't think that matters. If you need the 15-min response times and the TAM guidance then that's the fee, but at least you can opt out if you don't.
So, obviously my experience is based on my use-case so things that are important to me are predictability of resources and consistency of data.
From a resource perspective (GCP Compute), there seems to be a tendency to set CPU affinity to certain cores, often in the same NUMA zone on the host. This means my tickrate doesn't get hijacked on certain cores randomly. That's quite nice honestly, and it's something that on-prem hyperconverged VM solutions don't manage. The same-zone NUMA affinity affects our performance quite drastically, as certain applications (such as in-memory databases or gameworlds) allocate large chunks of memory and then do little updates millions of times a second. This means memory bandwidth is very important.
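To make that concrete, here's a rough sketch (Python, Linux-only, purely illustrative) of how you can check from inside a guest which cores a process is allowed to run on and which NUMA node sysfs says each one belongs to; whether that topology reflects the physical host depends on what the hypervisor exposes.

    # Sketch: inspect CPU affinity and NUMA placement from inside a VM (Linux only).
    # Illustrative; the topology a guest sees may or may not match the physical host.
    import os

    def numa_node_of_cpu(cpu):
        # Each CPU directory in sysfs contains a nodeN entry on NUMA systems.
        base = "/sys/devices/system/cpu/cpu%d" % cpu
        for entry in os.listdir(base):
            if entry.startswith("node"):
                return int(entry[4:])
        return -1  # no NUMA information exposed

    allowed = sorted(os.sched_getaffinity(0))           # CPUs this process may run on
    nodes = {cpu: numa_node_of_cpu(cpu) for cpu in allowed}
    print("allowed CPUs:", allowed)
    print("NUMA node per CPU:", nodes)

    # Optionally pin the process to the CPUs of a single NUMA node to avoid
    # cross-node memory traffic (matters for bandwidth-hungry workloads).
    node0_cpus = {cpu for cpu, node in nodes.items() if node == 0}
    if node0_cpus:
        os.sched_setaffinity(0, node0_cpus)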
GCP also has live migration of instances which is 100% transparent to the application (even the CPU clock seems to be moved over). Unlike my colleagues on AWS, who get 24h notifications of host machine maintenance, I've never had to "deal" with even a single instance outage since starting on GCP a year ago. But I would assume that if AWS hasn't solved this problem yet, they will soon.
Things like VPC networking being inter-zone by default is something that just makes sense and solves some headaches that I hear other teams having.
From a storage performance perspective, on AWS we were hard-capped at 5GBps per instance to S3; no such hard limit exists for GCP instances to Cloud Storage, so we're able to make use of it (in fact, I can often saturate my 10 Gbit interface limit even to non-Google instances).
An AWS sales rep told us that this hard limit can't be removed no matter how much we pay.
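As a rough illustration of what the lack of a per-instance cap lets you do, here's a sketch that fans Cloud Storage downloads out across threads so one VM can push its NIC toward line rate. The bucket and prefix are hypothetical, and it assumes the google-cloud-storage Python client plus ambient credentials.

    # Sketch: fan out Cloud Storage downloads across threads to use the available
    # NIC bandwidth. Bucket/prefix names are hypothetical; needs
    # `pip install google-cloud-storage` and GCP credentials in the environment.
    from concurrent.futures import ThreadPoolExecutor
    from google.cloud import storage

    client = storage.Client()

    def fetch(blob):
        # Pull the object into memory; a real job would stream to disk or a pipeline.
        return len(blob.download_as_bytes())

    blobs = list(client.list_blobs("my-example-bucket", prefix="logs/2018-12/"))
    with ThreadPoolExecutor(max_workers=32) as pool:
        total = sum(pool.map(fetch, blobs))
    print("downloaded %.1f GB across %d objects" % (total / 1e9, len(blobs)))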
I can't attest to the difference in APIs, since I use terraform and that abstracts away anything I would be using.
Regarding storage again: we had reps from AWS talking to us at length and they would not guarantee that fsync() would be honoured on elastic storage. They stated that "it should never be needed", but honestly I'm not comfortable with that answer. I mean, my bare-metal database instances haven't gone down in 3 years, but that doesn't mean I'm going to start trading consistency for raw throughput.
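For context, this is the durability contract a database relies on: a commit is acknowledged only after fsync() returns, on the assumption that the data is then on non-volatile storage. A minimal sketch of the pattern:

    # Sketch: write-ahead-log style append where durability hinges on fsync().
    import os

    def durable_append(path, record):
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            os.write(fd, record)
            os.fsync(fd)   # must not return until the data is on stable storage
        finally:
            os.close(fd)

    durable_append("wal.log", b"txn-42: committed\n")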
There's a lot more but these are the most important things that I can think of right now.
The fixed performance EC2 instances (in contrast to instances with burstable performance) all have dedicated physical CPU allocations, and 1:1 affinity from vCPU to the underlying CPU. Similarly, memory allocations are fixed, pre-allocated, and are aligned to the NUMA properties of the host system. When combined with both our Xen hypervisor and Nitro hypervisor, this provides practically all of the performance of the host system with regard to memory latency and bandwidth.
With our latest 25 Gbps and 100 Gbps capable instances, EC2 to S3 traffic can make use of all of the bandwidth provided to the instance.
Please send me more information about the interaction you had on the EBS side, msw at amazon dot com. All writes to EBS are durably recorded to nonvolatile storage before the write is acknowledged to the operating system running in the EC2 instance. fsync() does not necessarily force unit access, and explicit FUA / flush / barriers are not required for data durability (unlike some storage devices or cloud storage systems). Perhaps there was confusion about the question that was asked.
In hyperconverged environments the storage and the compute live together on a service mesh, and thus some processing is done on the hypervisor to manage storage. Those processes don't have an affinity (in the case of VMware, for example), so a guest VM core sometimes gets suspended for a couple of hundred clock ticks.
This is a super challenging problem for general purpose virtualization stacks. EC2 has been working to avoid any guest VM interruptions. With the Nitro hypervisor, there are no management tasks that share CPUs with guest workloads.
See more here, including data from a real customer with real-time requirements running on EC2 under the Nitro hypervisor: https://youtu.be/e8DVmwj3OEs?t=1796
This is a great tool to help uncover if CPU and memory NUMA within a VM aligns to the physical host or not:
Paper from the USENIX 2012 conference: http://anil.recoil.org/papers/drafts/2012-usenix-ipc-draft1....
EC2 instances have been NUMA optimized since we launched our CC1 instances in 2010. I encourage you to try ipc-bench on your cloud provider of choice, if CPU and NUMA affinity is important to your workload.
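If you want a quick-and-dirty check along those lines without building ipc-bench, here's an illustrative Python sketch that pins two processes to chosen cores and measures round-trip latency over a pipe. Python overhead dominates the absolute numbers, so only the relative difference between core pairs (same node vs. suspected different NUMA nodes) is meaningful; the core IDs are placeholders.

    # Sketch: crude cross-core ping-pong latency probe (Linux only), loosely in the
    # spirit of ipc-bench. Compare core pairs against each other; don't trust the
    # absolute numbers, which are dominated by Python/IPC overhead.
    import os, time
    from multiprocessing import Pipe, Process

    ROUNDS = 20000

    def echo(conn, cpu):
        os.sched_setaffinity(0, {cpu})        # pin the echo side to one core
        for _ in range(ROUNDS):
            conn.send_bytes(conn.recv_bytes())

    def measure(cpu_a, cpu_b):
        parent, child = Pipe()
        p = Process(target=echo, args=(child, cpu_b))
        p.start()
        os.sched_setaffinity(0, {cpu_a})      # pin the measuring side
        start = time.perf_counter()
        for _ in range(ROUNDS):
            parent.send_bytes(b"x")
            parent.recv_bytes()
        elapsed = time.perf_counter() - start
        p.join()
        return elapsed / ROUNDS * 1e6         # round-trip time in microseconds

    if __name__ == "__main__":
        print("cpu0<->cpu1:", measure(0, 1), "us")
        print("cpu0<->cpu8:", measure(0, 8), "us")  # pick a core you suspect is on another node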
However, it may depend on instance type and other factors. This complexity is a problem with AWS.
Live migration is amazing.
Now if only Google paid more attention to certain features of their GLB... amazing tech with showstopper bugs means amazing tech I can’t use. (Also means it’s not so amazing)
"We have important news about your account (AWS Account ID: XXXX). EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-XXXX) in the us-east-1 region. Due to this degradation, your instance could already be unreachable. After 2018-12-28 19:00 UTC your instance, which has an EBS volume as the root device, will be stopped."
The stop date is 2 weeks after the email is sent, but as the message states, the instance is likely already dead. It was already dead in this case. GCP will live migrate your instance to another host.
Like the parent said, we used to receive these notices quite frequently, and the amount of notice varied wildly. I should ask the teams who currently/primarily use AWS; perhaps the situation has changed?
But it was not uncommon to have 24 hours of notice, especially for instances in us-east-2 for some reason.
GCP has much more modern APIs (they got to do it "right" the first time around, while AWS had to learn from their "mistakes").
GCP is less mature with documentation, but still excellent (AWS is first class). I've found bugs, and others I know have run into straight up wrong docs - a quick message to support yielded the correct information and docs update within the week.
GCP is far more "opinionated" about how you should run a service. AWS is opinionated as well, but less so. What this means is that while GCP will sell you resources in traditional way, to go GCP native or from scratch really requires buying into the GCP "way" of doing things more than AWS does. Basically, K8S or go home.
We use GCP and I enjoy it quite a bit. I have extensive AWS experience as well, but no strong preference.
AWS now has the new ARM machines in production, which are cheaper. AWS also allows yearly prepaid managed databases, while Google still bills per second (with some monthly discount). My AWS TCO comes out to be much, much lower (almost 30%) than GCP if I take into account committed use/prepayment discounts.
For us: on AWS/Azure we had to benchmark each instance when it came up and if it performed poorly it had to be reaped and redeployed constantly. The increased overhead in dealing with things "the AWS way" was enough by itself to offset the pricing difference with the steep AWS discount we were getting.
Understanding administrative overhead is complicated but I'm sure you are taking into account the cost of humans. (or, you've already absorbed the cost of automating it?) :)
As with all things; use what works for you. I'm just very happy with GCP over the others coming from bare metal.
In terms of benchmarking, this is not something we care about (since we don't have that much CPU utilization), so it's probably easier for me.
For my use case, AWS has better administrative tools than Google. I use Route 53, SES, S3 - in all those three, no other products come close (transfer acceleration... I'm looking at you).
The one place that I agree with you is Dataproc. That's far superior to EMR in terms of spin-up/down.
I prefer to think of reserved instances as a purely financial trick.
For example, if I have a new workload that won't fit on existing hardware with AWS, I can either: a) reserve the new instances, or b) not reserve on the assumption I will be able to optimise in the 'near' future.
In practice this is hard to know, but AWS forces you to come to a correct decision up front, whereas Google Cloud lets you defer that decision without being punished later for making the 'wrong' choice.
At the bottom line, commitment impedes change.
I now have a significantly reduced opex that saves me a lot of money. If I grow very fast (and outgrow it)... I really don't care, because I will most likely have a financing event.
If I dont grow fast enough to justify the spend, I have bigger problems.
In addition, I have to commend Google and Azure here - the way they do committed use is very flexible. They price it on number of units (cores, RAM, whatever) - so if you outgrow it, you still get the discount + full cost of additional units. On AWS, if you outgrow, you have to frikking sell off your machines on their auctions and buy new ones.
Only problem is that Google doesn't do this for databases (which are very expensive).
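A toy illustration of why the unit-based model is pleasant (all prices hypothetical): if you commit to 64 vCPUs and later run 96, the committed 64 keep their discount and only the extra 32 are billed at list price.

    # Toy cost model with made-up prices: committed units keep their discount
    # even after you outgrow the commitment; overflow is simply billed at list.
    LIST_PRICE_PER_VCPU_HR = 0.04   # hypothetical on-demand price
    COMMIT_DISCOUNT = 0.30          # hypothetical committed-use discount

    def monthly_cost(committed_vcpus, used_vcpus, hours=730):
        discounted = min(committed_vcpus, used_vcpus)
        overflow = max(0, used_vcpus - committed_vcpus)
        return hours * LIST_PRICE_PER_VCPU_HR * (discounted * (1 - COMMIT_DISCOUNT) + overflow)

    print(monthly_cost(64, 64))   # fully covered by the commitment
    print(monthly_cost(64, 96))   # commitment still applies to the first 64 vCPUs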
I don't believe this is entirely true. The AWS reserved credits are good within the machine family. So a t2 credit is good for all t2 instance sizes. Not as flexible as CPU/RAM credits, but more so than it was in the past.
It's not very nice; that's why they have a reserved instance marketplace - https://aws.amazon.com/ec2/purchasing-options/reserved-insta...
What is new is convertible reserved instances. I haven't used them, and you may be right there. But I still maintain the GCP/Azure way is light years better.
I do think GCP and Azure do it better, though.
Separating capacity from instance type makes it much more natural and easy to use for your actual requirements.
- >500k cores
- >300PB storage
- >12,500 cluster size
- >1T messages per day
And there's also this other talk, "How Twitter Migrated its On-Prem Analytics to Google Cloud" - focused on their migration to BigQuery:
- 20 TB/ day of raw log data, >100k events/sec
- Loading ~1TB/hour into BigQuery.
- Serving 5,000+ complex queries / second. p99 ~300ms
Disclosure: I'm Felipe Hoffa and I work for Google Cloud https://twitter.com/felipehoffa.
I’d be interested in their cost/benefit analysis for this one.
How? I mean there are 7B people on earth. Are people that active on Twitter? Or does "messages" mean something else here?
> Disclosure: I work at Google Cloud (and directly with Derek and the Twitter team).
Thanks for making this community awesome. I want to work at Google or with Google one day. One of the few companies I have always admired and always will.
It is possible that, while GCP gives good value based on the writing on the tin, someone else might beat that by a good margin under special terms.
That being said, I have reasons to think very few could beat Google in the compute or storage game, and it would be definitely much harder to beat the big G on the networking side of things.
Disclosure: I worked for GCP until early 2015, but I haven't worked for Google since then, have no inside info on this deal and am not relying on anything from my time at Google for this comment.
I like the google storage API the most honestly. Firebase is really nice but the storage API is very polished and basically does exactly what you need it to do without the complexity of S3.
Obviously that's the thing you'd choose to highlight in a public Google-hosted presentation after a strategic partnership that is a bonanza in marketing for Google Cloud. It wouldn't work so well to tell a room full of engineers "We moved to Google cuz they gave us a fat discount, one you can't get because you're not Twitter!"
The fact that you allocated a top level URL for it is embarrassing (https://cloud.google.com/twitter/). Can you imagine if Amazon advertised https://aws.amazon.com/netflix/?
It's a case study. Google didn't create a one-off URL for Twitter, but has the same URL pattern for all their public case studies:
Amazon has exactly the same thing, just with a slightly different URL - and I didn't think there is a URL police that determines which URLs are considered embarrassing for case studies and which aren't:
It is true that any cloud provider would get good publicity by being able to say that Twitter runs on their cloud, but that is precisely the reason why we shouldn't just shrug this off as a publicity/business decision: just as Google might have tried to persuade Twitter, other cloud providers will most certainly have tried their best too. If Twitter still went with GCP then I think it is only fair to assume that there must have been some real advantage in going with GCP. I don't think anyone who says this must have been a pure business decision is being entirely honest with themselves, because a cloud which cannot handle Twitter's volume is no good even if it were entirely free.
I have nothing to do with Twitter (actually don't even like Twitter because it is such a toxic social media platform) and I don't like many things which Google does, but as an engineer myself I can only say that after using Azure, AWS and GCP for many years in a commercial setting that GCP is indeed years ahead. The quality, speed and reliability of GCP is second to none from my experience and I honestly couldn't say the same about AWS and especially not at all about Azure.
Google's management is what scares me. I'd never build a business around a company with so little follow-through, commitment to product longevity, or focus.
Can you be more specific about where GCP's reliability is years ahead of the others? I'm guessing numbers will be pretty comparable on all platforms, so if we want to declare outright winners it would be useful to cite a number to indicate where you experienced this.
I hope this is a joke. As someone who started on GCP, and got tired of the TERRIBLE quality and reliability of their docs among other issues, saying that they are ahead is an absolute joke.
I've found some corner cases with AWS, but got quick resolution even on things they have in "beta" - and consistency from documentation through to system is high.
Secondly, AWS seems to support even old tech FOREVER. I have a SimpleDB-based app. That tech is 9 years old now. Still ticking. When you are building stuff up over time and can't afford to rebuild on the new hotness every other year, this is nice.
Anyone know the revenue AWS and GCP generate? Seriously hard to believe GCP is so many years ahead in this space from my own experience.
"You can transfer up to 100PB per Snowmobile, a 45-foot long rugged sized shipping container, pulled by a semi-trailer truck."
For this use case, we set up 800 Gbps of interconnect with Google.
With the interconnect being faster than the network cards by several orders of magnitude (I assume unless there is special hardware), how is it structured to utilize all of the 800Gbps?
A member of our team proposed a session for Google NEXT '19 that goes into the architecture we are using in a lot more depth. We're waiting to see if that talk is accepted or not, but in any case we're planning to share more soon.
First you can try to replicate some data and move read instances. After that, step by step, the rest.
It's really not possible to do that overnight, or even in months.
FYI: I'm currently migrating an infrastructure twice the size of Twitter's from AWS to Azure right now.
The video linked on the page is also from last August: https://www.youtube.com/watch?v=T1zjmNAuMjs
Throw a 2018 tag on this?
This strikes me as a team (or teams) outgrowing the storage options that were offered internally and choosing to outsource to fit their needs. Isolated use for one business case, and not indicative of a broad movement to the cloud on the part of the whole org.
I could be wrong, of course.
I'd like to know more. And I'd certainly like to know what the engineers at Twitter think.
Relevant technical bits:
> Today, we are excited to announce that we are working with Google Cloud to move cold data storage and our flexible compute Hadoop clusters to Google Cloud Platform. This will enable us to enhance the experience and productivity of our engineering teams working with our data platform.
> Twitter runs multiple large Hadoop clusters that are among the biggest in the world. In fact, our Hadoop file systems host more than 300PB of data across tens of thousands of servers.
It's going to be log data, backups, and users' original uploaded data (rather than re-encoded versions) that they hope will be valuable one day.
If you are still using full VMs AWS is the place to be
Azure is what I am running into when I get contracts in the midwest and I am really not sure why they are using it :)
Their credit card fails at Google Payments, the account gets marked for fraud, and no one in support can help them recover. Twitter is gone for good.
If you also use Google Domains or Fi, good luck getting your number and email back. You're basically locked out of all your accounts, with minimal chance of recovery.
This was on the zero-support GCP/GSuite package.
It took them 7-8 years to solve billing?
I don't have any internal numbers, but imagine if Twitter could meet its needs with an N-node Redshift cluster or an N/2-node Dataproc cluster, because Dataproc is so much faster at flipping the bits or whatever Twitter wants to do (probably network performance really). Google could charge more per node and still come out cheaper because of performance.
(EDIT: The provided quote from a Twitter employee hints at this directly: "the savings")
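Toy numbers (entirely made up) just to make the shape of that argument explicit: needing half the nodes leaves room to charge more per node and still win on total cost.

    # Hypothetical illustration: fewer, faster nodes can cost less overall even
    # at a higher per-node price.
    nodes_a, price_a = 100, 5.0   # option A: $/node-hour, made up
    nodes_b, price_b = 50, 7.0    # option B: half the nodes, higher unit price

    print("option A per hour:", nodes_a * price_a)   # 500.0
    print("option B per hour:", nodes_b * price_b)   # 350.0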
This is a company with more petabytes of video footage in their data lake than I've seen before and GCP wanted to give them free storage for the prestige (and possible upside, longterm) of having Big Company on their list.
Quote from reply above: "Derek’s point about savings is about list price differences that result from total system performance (and not any sort of special discounting)"
> In economics, the Jevons paradox (sometimes Jevons effect) occurs when technological progress or government policy increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the rate of consumption of that resource rises due to increasing demand.
> Derek’s point about savings is about list price differences that result from total system performance (and not any sort of special discounting).
But if everyone asks for the Twitter discount, you're screwed. However, by next year no one can compare the offerings side by side. Better machines, cheaper offerings, etc.
We shared a bunch of stuff in this video:
One of the interesting benchmarks was GridMix performance at scale. And overall performance for throughput-oriented workloads like Hadoop was generally strong. Most cloud benchmarking I have seen sticks to micro-benchmarks and single-node benchmarks, so it misses this.
You'd be surprised, if you are a big enough customer, even better if you are a reputable one, how much catering the cloud provider would be willing to offer.
Apparently this is a trend going back years, it doesn't seem to be doing anything good for Twitter or Square: https://sg.finance.yahoo.com/news/big-twitter-investor-defen...
People claiming that Twitter or Facebook has a political bias are missing the real point. For example, I think that Zuck really, truly, doesn't give a fink about political bias. How could he? He'd never get any sleep! In his position, all he cares about is the money. If that takes making this or that comment or public statement or testimony to Congress, it doesn't matter. It's all geared towards and carefully crafted towards keeping the money flowing.
In that sense, he and the top brass at a place like Facebook are all political whores, so to speak. Selling their convictions for whatever keeps the spigot open.
On a sidenote, Square, Stripe, Payment Ninja, <insert payments startup> so far seems to be lipstick on the pig that is the underlying debit/credit networks, often with a hefty price for said lipstick. The US badly needs a better version of Zelle that supports business transactions.
Not yet available in the US. Coming soon though.
Those are the sacred cows.
I don't think Windows counts as a sacred cow anymore at Microsoft.
AFAIK Twitter allows search indexing and is already a publisher on their network.
Google will "enhance the tweets" by showing 4-5 "relevant ads" before you even see the tweet. Tweethancer
Also, IMO, calling them "similarly sized" would be incorrect, except perhaps in signed-up users (of which Google got a ton from "log in with Google and your account is already set up"), not daily active users.
https://en.wikipedia.org/wiki/Twitter (see sidebar)
> Google+ currently has low usage and engagement: 90 percent of Google+ user sessions are less than five seconds
Hell, the Chromecast screensaver photos are hosted on Google Plus.
The G+ numbers looked suspicious if you ran any sort of public website, where e.g. I saw metrics for referrals or shares at least an order of magnitude lower than Facebook or Twitter.
> User engagement on Google+ was low compared with its competitors; ComScore estimated that users averaged just 3.3 minutes on the site in January 2012, versus 7.5 hours for Facebook.
If they are moving everything, well, Netflix did that too. Maybe opportunity cost, maybe having your engineers work on your core product instead of running VMs is cheaper altogether.
Netflix provides massive storage boxes to ISPs that serve content from within the network of the ISP the user is connecting from. This can save the ISP a lot of external traffic so they generally want to do this to save costs and meet customer demands. YouTube does a similar thing.
They call it OpenConnect:
It sounds like they moved stuff like logging/metrics and query systems onto GCP, which makes sense because as another poster said utilization is probably bursty.
Realistically, it can't be considered as the same service, can it?
Both AWS and Google clouds are perfectly capable of running Twitter, and running any software that Twitter uses, even if the number of machines is different. The only "benchmark" applicable is actually the total negotiated cost of the machines required to get the job done.
So it's not about being able to do things on Google cloud with fewer processors or less RAM or faster hard drives - Google was willing to give them a lower total cost of ownership for reasons known only to Google.
Do not assume that Google (or anybody else) will give you a similar preferential deal. Ignore the "benchmarks".
After you do all that, just start with Heroku and Postgres with some AWS Lambda, then move to ALB + AWS Fargate + RDS|Aurora / DynamoDB as you get bigger, then to NLB + ECS with a cluster of 20-80 On Demand and Spot Instances, then to a cluster of ARM instances on Spot.
If you need to build your own datacenter at that point, you'll know. And you would have built a 12Factor app to make all of the above work, so migrating will be easy.
Pipes! That data isn't static either. As Derek said in a sibling comment, they set up hundreds of Gbps of peering. Each 100 Gbps gets you a petabyte per day (roughly). Planning the move and doing the tooling to keep stuff synchronized takes longer than the actual transfer once you've got, say, 800 Gbps (roughly 30 days). And again, you're going to have lots of data churn per month, so you want pipes anyway.
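The back-of-envelope math behind the "1 PB/day per 100 Gbps" and "~30 days at 800 Gbps" figures, for anyone who wants to check it (decimal units, ignoring protocol overhead):

    # Back-of-envelope transfer math (decimal units, no protocol overhead).
    bytes_per_day_per_100gbps = 100 / 8 * 1e9 * 86_400    # bits/s -> bytes/day
    print(bytes_per_day_per_100gbps / 1e15, "PB/day")      # ~1.08

    dataset_bytes = 300e15                                 # ~300 PB of HDFS data
    aggregate_bps = 800e9                                  # 800 Gbps of interconnect
    days = dataset_bytes / (aggregate_bps / 8 * 86_400)
    print(days, "days to move 300 PB at 800 Gbps")         # ~35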
Hosting your own infrastructure makes less and less sense going forward, the cloud is the new king now.
What was this data, and why did they feel they needed to keep it?
I mean - 300PB is nothing to sneeze at. I can understand wanting to keep important data around (IP, for instance, or recent server logs for the last 30 days or even a year) - but this amount seems to go well beyond that.
In fact, it almost seems like they have stored every single tweet ever written. In addition to probably all of their logs, and who knows what else.
Why are they keeping all of that information? If it is all the tweets ever written - they don't have copyright to it, so such information wouldn't belong to them...
I'm not a big twitter user - can I search for a tweet on their system that I (or someone else) made say...5 years ago or longer? If this is a bunch of tweets - is that why they keep them around?
Are people in general aware of this? That is - if they are tweets - that Twitter has stored all of them, forever and ever, and that they are all still accessible (if not by the general public, then by law enforcement at the very least)?
If they are tweets - do users have any recourse at being able to remove those tweets (aside from those that they likely are legally bound to keep - I am thinking of the current POTUS's tweets, for instance)?
What is the ultimate purpose behind keeping all of that data? Even if it wasn't a bunch of tweets, whatever it is can't really be that useful otherwise, except for maybe the most recent stuff created in the past year, at best. Things like that - logs, etc - might be destined for a rotated "cold storage" and would ultimately "age out". Accounting records (financial data, etc) probably isn't that great a percentage of the data (if at all?) - but whatever was there, too, would likely need to be saved for a greater range - but even there maybe only 10-15 years worth (and it certainly wouldn't even comprise a fraction of the bulk).
So what is it? Why is it being kept? To what purpose? While 300PB isn't that great of an amount in physical terms (that is, the storage medium probably doesn't take up a great amount of space in a datacenter), it still isn't anything tiny, either (especially depending on what medium was used - I mean, I doubt they are storing it all on a ton of 256GB microSD cards - though that does bring up an interesting idea/thought to mind - but I digress).
I'm of the thought that companies shouldn't just "throw away" data - but it should be tempered by "importance to humanity over time". Not everything should be kept - but, for instance, it would have been nice if we (humanity) had kept around the original transmissions from the surface of the moon, or the blueprints to the rocket that took humans there. But we probably don't need all the various interoffice memoranda and fluff at the various NASA contractors (we probably don't need the accounting records - but BOMs might be useful).
That's just a couple of examples I can think of off the top of my head, but history is rife with instances of companies imploding or being bought out or restructured in such a manner that data is just "destroyed" instead of kept. Or lost, or otherwise made unretrievable to humanity.
Here with Twitter - an active company to be sure - we have the seemingly exact opposite - almost like r/datahoarders were in charge.
...and that should concern people a bit, regardless of which company it is - but especially a company like Twitter.
That said, a few million is a rounding error on a price Google would pay for Twitter.