I've been following Uber's big data platform engineering for a while, and this is a really interesting update. Specifically, it's interesting how well their Gen 3 stack held up. It's also an interesting choice to solve the incremental-update problem at storage time instead of inserting another upstream ETL process (which I'm sure would be incredibly expensive at this scale).
Also interesting: A lot of companies, you look at their big data ecosystem and it's littered with tons of tools. Uber seems like they've always done a good job keeping that pared down, which indicates to me that their team knows what they're doing for sure.
I have only heard second hand, but internally Uber has always had a large litter of tools. There is always a bit of chaos edited out of what is presented to the outside world. They run several container schedulers (YARN, Mesos, Myriad), and they used to run several separate workflow orchestrators.
The proliferation of tools lets teams test out different ideas under production conditions, which is not bad per se. But the duplicated effort and variability have to come with some advantage to be worth continuing. Infrastructure engineers wanting to reinvent the wheel can create a lot of technical debt fast, and highly leveraged debt at that.
This takes nothing away from the great work Uber has been doing in the area. They have definitely set a great direction for their platform.
This is par for the course in any large org - Google, Yahoo, Facebook are all in the same boat. It's just not possible for a single data infrastructure to be everything to everyone.
Being littered with tools is not necessarily a bad thing. A tool is a modular, reusable component that acts as a building block for something more complex.
Consider Netflix. They are big open source contributors to the data industry. Many of their tools interest me because I can see their value outside of Netflix. Uber on the other hand is solving esoteric problems with esoteric solutions.
I find it interesting that one of their major pain points was data schemas. Having worked at places that used plain JSON and places that used protobuf, I highly recommend that anyone starting an even mildly complex data engineering project (complex in either the data or the number of stakeholders) use something like protobuf, Apache Arrow, or a columnar format if you need one.
Having a clearly defined schema that can be shared between teams (we had a specific repo for all protobuf definitions with enforced pull requests) significantly reduces the amount of headaches down the road.
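To make that concrete, here is a minimal sketch of what an explicit, shared schema buys you. I'm using PyArrow as a stand-in for a protobuf repo, and every field name below is made up:

    # Minimal sketch of an explicit, shared schema; PyArrow stands in here
    # for a protobuf repo, and every field name below is made up.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # The schema lives in one reviewed place instead of being implied by JSON blobs.
    trip_schema = pa.schema([
        pa.field("trip_id", pa.string(), nullable=False),
        pa.field("rider_id", pa.string(), nullable=False),
        pa.field("started_at", pa.timestamp("ms", tz="UTC")),
        pa.field("fare_cents", pa.int64()),
    ])

    # Producers have to conform: a type mismatch fails loudly at write time
    # instead of becoming a silent parsing headache for downstream teams.
    batch = {
        "trip_id": ["t-1", "t-2"],
        "rider_id": ["r-9", "r-7"],
        "started_at": [1535760000000, 1535763600000],  # ms since epoch
        "fare_cents": [1250, 830],
    }
    table = pa.table(batch, schema=trip_schema)
    pq.write_table(table, "trips.parquet")  # columnar; the schema travels with the file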
I was wondering how Uber could possibly need 100PB of space; but if you consider that they've served roughly 10 billion rides, it actually only comes out to roughly 100 kilobytes per ride.
I think you miscalculated.
I get 10MB per trip on average, assuming 10bn total trips. Maybe deduct something for maps and inefficient file formats & data structures?
The data sizing heuristic for basic vehicle positioning telemetry is a GB per vehicle per year assuming a professional driver (e.g. taxi, trucking, etc). A petabyte is about a million vehicle-years of positioning data, which isn't that much.
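For anyone who wants to sanity-check that heuristic, a rough back-of-envelope (every input here is an assumption, not a real number from Uber):

    # Back-of-envelope for the GB-per-vehicle-per-year heuristic.
    # All inputs are assumptions, not Uber's actual numbers.
    driving_hours_per_day = 8       # professional driver
    driving_days_per_year = 300
    ping_interval_s = 1             # one position fix per second
    bytes_per_ping = 100            # timestamp, lat/lon, speed, heading, ids

    pings_per_year = driving_hours_per_day * 3600 * driving_days_per_year / ping_interval_s
    gb_per_vehicle_year = pings_per_year * bytes_per_ping / 1e9
    print(f"~{gb_per_vehicle_year:.2f} GB per vehicle-year")  # ~0.86 GB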
Remember, Uber also stores application positioning telemetry as well, so it isn't just the number of vehicles.
It makes me wonder if those yellow cabs are just stuffed full of hard drives to store all their historic data too. None of that data is useful or necessary. I would love to see any “insight” that would beat the knowledge of a basically competent cabbie on where and when people need rides. Uber just has lots of investor funny money that they want to spend to make it look like technology has something to do with their business when they are actually just a marketing and booking engine for independent taxi operators. Investors just get more excited and start throwing “internet money” your way if you brag about things like amount of data and engineers.
I had a friend head off in 2011 to work at Hailo in London; they designed much of the data ingest & storage architecture there. That was built on Cassandra, and some of the numbers were staggering, even for an (initially) single-city, < 2,000 driver system.
You're not just tracking the start and end points of successful rides -- you're tracking users whenever their app is open (both drivers and users move around before and after initiating a ride, and you need to share that data between the parties), and you're also tracking empty vehicles so you can identify over- and under-serviced areas ... doubtless lots of other real-time stuff you want to know. But the big thing is that you want that data available afterwards for mining, improving your service, regulatory obligations, and so on.
Only if you believe in retaining all data in perpetuity. I suppose it demonstrates the difference in US and European perspectives on data.
Most of what you describe has little meaning or use at a few months old. Sure you might want to collect detailed awareness of individual drivers over say a month, generic traffic patterns over six months, and little or nothing beyond.
Once the data has outlived that direct, primary purpose, you dispose of it, or it's a famous, excessive data breach waiting to happen. There is absolutely nothing wrong with keeping only billing data beyond, say, six months max.
> Most of what you describe has little meaning or use at a few months old. Sure you might want to collect detailed awareness of individual drivers over say a month, generic traffic patterns over six months, and little or nothing beyond.
The really invasive pattern-matches happen best retroactively at that scale.
Age of the data does not typically impact many such analyses.
I think there are tradeoffs, for example, they probably store the route, which is good (refunds for really terrible routes) but bad (tracking). That's quite a bit of space, probably the big one. Rating and possibly comments, again could be pretty big. Multiple UUIDs or whatever (cust/driver IDs, CC txn IDs) will take up some space as well. 300B is probably a fair bit too low.
That said, 100kB does seem high. 10kB or so would seem more appropriate, as a very rough ballpark.
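Rough numbers on that, with every input being a guess rather than anything from Uber:

    # Very rough per-trip storage estimate; every number is a guess.
    trip_minutes = 20
    points_per_minute = 15          # a GPS point every ~4 seconds
    bytes_per_point = 24            # timestamp + lat + lon + a little metadata

    route_bytes = trip_minutes * points_per_minute * bytes_per_point   # 7,200 B
    metadata_bytes = 1_000          # ids, fare breakdown, rating, timestamps
    comment_bytes = 200             # most ratings have no comment, so a small average

    total_kb = (route_bytes + metadata_bytes + comment_bytes) / 1000
    print(f"~{total_kb:.1f} kB per trip")   # ~8.4 kB, i.e. the ~10 kB ballpark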
Beyond the route itself, consider also: time spent in traffic (speed per geo-coordinate) for use in predicting surges, deviations between planned and taken routes, etc. The degree to which Uber has a pulse on urban traffic is fascinating and probably exceeds even that of city planners with access to ubiquitous traffic cameras.
Almost none of which they need to keep. Route for 30 days perhaps, start and end point for longer. I imagine most ratings are just that with no comments.
Good post. The snapshot-based approach at ingestion time was the part I couldn't figure out: why was it considered a good decision during implementation?
I've experimented with Parquet data on S3 for a work POC, and the latency to fetch the data/create tables/run the Spark SQL query (on an EMR cluster) was quite noticeable. I was advised that EMRFS would make it run quicker, but I never got around to playing with that. But I guess creating in-memory tables from raw data snapshots would still hold? Or maybe I missed something.
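For reference, the POC was roughly this shape (bucket, path, and column names below are placeholders, not the real ones):

    # Rough shape of the POC: Parquet on S3 queried with Spark SQL.
    # Bucket, path, and column names below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-on-s3-poc").getOrCreate()

    # On EMR the s3:// scheme goes through EMRFS; with plain Apache Spark +
    # hadoop-aws you'd use s3a:// instead.
    trips = spark.read.parquet("s3://my-poc-bucket/warehouse/trips/")

    trips.createOrReplaceTempView("trips")
    daily = spark.sql("""
        SELECT city_id, to_date(started_at) AS day, count(*) AS rides
        FROM trips
        GROUP BY city_id, to_date(started_at)
    """)
    daily.show(20)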
Also, I take it that if 24 hours is the latency requirement from ingestion to availability of this data, this obviously isn't the data platform powering the real-time booking/sharing of Uber rides. I'd be curious to see what data pipeline powers that at Uber.
Reading this, I can’t help but think Uber would be better off adopting one of the commercial data warehouses that separates compute from storage: Snowflake or BigQuery. They have full support for updates, they support huge scale, and because they’re more efficient the cost is comparable to Presto in spite of the margin. You can ingest huge quantities of updates if you batch them up correctly, and there are commercial tools that will do the entire ingest for you (cough Fivetran).
The pattern I have seen in many current big data architectures reflects a classic build-vs-buy decision quite well.
Organizations that have the resources tend to go open source with a lot of custom tools (such as Uber, or FAANG). Those companies tend to be the "makers" of such tech as well.
Organizations that don't have those resources or in-house experience, or desire to build, rely on the commercial licensing for open source offerings, cloud based or traditional commercial offerings.
At these scales of data and time horizons, owning the engineering resources capable of supporting the tools might be necessary, but surely expensive.
If one can go with a stable, well known commercial offering that reaches the desired scale but keeps the volumes of data in open standard formats, I think that is a good compromise. Many commercial vendors have gone that way, for example, Microsoft recently went all in on integrating Spark and HDFS closely with SQL Server 2019, and a lot of other database vendors have already done that as well (e.g. HP with Vertica on Hadoop, etc.).
I also think it's possible that at these scales and performance, people doing that big data work really are on the bleeding edge and therefore, innovation and new development is almost a requirement to make it all work together and meet the aggressive performance and efficiency benchmarks desired. Especially when having to do complex things like handling both the speed and batch layers in one unifying architecture (as in lambda architecture).
The problem with commercial offerings is that at larger scales, special optimizations are often needed that don't apply to most of the vendor's other customers.
For example, imagine that System X can scale well up to 1 terabyte of data, but that after that it becomes increasingly complex to handle more data and it requires special optimizations to maintain an acceptable level of performance.
From System X's perspective as a vendor, 99% of their customers are perfectly fine with the performance of System X and just want more features (which the vendor can charge for in the next version of System X). On the other side, there's that annoying 1% of customers that just keeps asking for more and more performance optimizations that the other 99% don't care about.
From the vendors' perspective, it makes more sense to invest development efforts into features (that can translate into more money) than in performance optimizations that only appeal to a narrow segment of their users. For the company that requires these optimizations, since the commercial codebase is proprietary, they're locked in and have no easy way out.
That's pretty much why the large tech companies invest in owning their own part of the stack. It mitigates a scaling risk, keeps expertise in house and ensures that the solution that is developed in house matches 100% with the often specialized needs of the business.
This makes sense in theory but in practice I have observed that commercial data warehouses are incredibly well optimized. There are lots of companies we don’t hear about using Snowflake and BigQuery to analyze giant datasets. They don’t blog about it because they’re just using boring commercial tech. I think the real reason companies like Uber build their own stack is just good old not-invented-here.
Tough to tell from the diagrams exactly, but it looks like the majority of their data is stored in their own data centers, which might mean some reluctance to migrate and ship data to the cloud (“cloud storage” only makes an appearance in the Gen 1 chart). There also exists at least one public example where they’ve bought a commercial database to build on (https://www.memsql.com/blog/real-time-analytics-at-uber-scal...). I’d be willing to concede that there might be legit reasons for not wanting to use BQ or Snowflake.
That being said, we use Snowflake heavily at Strava and are very happy with it.
They are, but their costs also scale somewhat linearly (i.e. for CPU core-based licensing, 10x the cores means ~10x the cost, give or take a little based on the vendor discount). At some point, that makes having an engineering team working on a custom solution the cheaper option: why pay vendor X n million dollars per year when you could have 4n engineers working on a system that's perfectly adapted to your needs?
That being said, there is definitely a large amount of NIH promotion-ware being developed at large tech companies, as you mention.
How does BigQuery support updates? IIRC in BigQuery you have to overwrite whole partitions. At least in Hive you can have secondary partitions to mitigate this.
Thanks. I checked the quotas. They seem to have increased to 200 per table. Last time I checked (last year) they were pretty limiting. I like the MERGE statement though. Hope it comes out of Beta soon.
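For anyone curious what the batched-update path looks like, it's something along these lines (project/dataset/table names are invented): load the day's changes into a staging table, then merge, e.g. via the Python client.

    # Sketch of batched upserts into BigQuery with MERGE; all project, dataset,
    # and table names here are invented.
    from google.cloud import bigquery

    client = bigquery.Client()

    merge_sql = """
    MERGE `my_project.analytics.trips` AS t
    USING `my_project.analytics.trips_staging` AS s
    ON t.trip_id = s.trip_id
    WHEN MATCHED THEN
      UPDATE SET t.status = s.status, t.fare_cents = s.fare_cents
    WHEN NOT MATCHED THEN
      INSERT (trip_id, status, fare_cents)
      VALUES (s.trip_id, s.status, s.fare_cents)
    """

    client.query(merge_sql).result()  # one DML statement per batch, not per row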
The pricing is per scan, so an update is not billed as an in-place change; you're charged for the bytes scanned in the selected columns of the partitions that get scanned.
My take is that companies like Uber, Google, Facebook see and encounter issues ahead of the curve - so they build their own stuff and luckily open source it, which is good. Maybe others will run into similar issues and can leverage those solutions in case they work for them also.
Isn't a DW just part of the overall equation? These days I assume you'd want to do a lot more, like training ML models as part of your pipeline, things that Spark/Scala allow you to do that would be harder with just SQL. I think most customers use both.
Yes but BigQuery provides read/write libraries. So you can use Spark or Apache Beam to read the data into your cluster and then process it using your framework of choice - XGBoost/SparkML etc. The predictions can then be written back to BigQuery using those same libraries.
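Roughly, the round trip looks like this with the spark-bigquery connector (table and bucket names are made up, and the connector jar has to be on the classpath):

    # Sketch of the round trip with the spark-bigquery connector.
    # Table and bucket names are made up.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-roundtrip").getOrCreate()

    # Read a BigQuery table into a Spark DataFrame, train/score with SparkML,
    # XGBoost, etc., then write the predictions back.
    features = (spark.read.format("bigquery")
                .option("table", "my_project.analytics.trip_features")
                .load())

    predictions = features.withColumn("score", features["fare_cents"] * 0.0)  # stand-in for a real model

    (predictions.write.format("bigquery")
        .option("table", "my_project.analytics.trip_scores")
        .option("temporaryGcsBucket", "my-staging-bucket")
        .save())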
Sure but that would mean copying the data in/out. Instead you could get data-compute co-location with a proper HDFS cluster. But I think BigQuery launched SQL ML APIs recently to train simple distributed ML models on the fly.
A "proper" HDFS cluster also wastes incredible amounts of disk with it's naive redundancy. It's a bear to maintain (see sibling comment) and the disaster recovery story (at a DC level) is non-existent.
If you invest a chunk of the money saved by not doing 3- or 4-way data replication into a good network, a lot of that move-compute-to-data complexity becomes unnecessary. The second big efficiency promise of basic MapReduce is doing lots of sequential I/O on those pesky disks, which can do ~100x the sequential throughput of their paltry random-read performance. Alas, with today's mixed workloads, you're certainly not getting the >100 MB/s from a disk that might make a benchmark scream. So the network savings from compute-next-to-storage become less useful (you're still going to do that shuffle before your reduce anyway, aren't you?).
Best I can tell, the compute-next-to-storage trick does still matter as you get to large data sets (PB in a job), but then as you keep growing your infrastructure (and I'd wager Uber would be just about at this point), the benefits of disaggregation start weighing increasingly heavily. In particular, fleet management becomes considerably easier with less resource stranding.
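To put rough numbers on both points (all of these are assumptions, not measurements):

    # Rough numbers behind the replication and sequential-I/O points above.
    # Every figure here is an assumption, not a measurement.
    logical_pb = 100
    replication_factor = 3
    raw_pb = logical_pb * replication_factor      # 300 PB of raw disk for 100 PB of data

    seq_mb_s = 150              # what a single disk can stream in a benchmark
    random_mb_s = 1.5           # ~100x worse once seeks dominate
    sequential_fraction = 0.3   # share of I/O that stays sequential in a mixed workload

    blended_mb_s = sequential_fraction * seq_mb_s + (1 - sequential_fraction) * random_mb_s
    print(f"{raw_pb} PB raw; ~{blended_mb_s:.0f} MB/s per disk in practice")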
I wonder which BI tools they use for running ad-hoc queries on their Presto cluster. User behavioral analytics is a hassle when you use SQL, and generic BI solutions don't help with that.
Also, I assume that they have dashboards that use pre-aggregated tables for faster results, they probably have ETL jobs for this use-case but is the pre-aggregated data stored on HDFS as well?
After their data problem exceeded a single MySQL instance, hypothetically, what would have happened if they had switched to Google Cloud Spanner? Ostensibly Google has a lot more than 100 petabytes in Spanner. Could you still run basic queries on it without switching to HBase?
Pretty sure that's because in some cities the driver is dispatched by a local car company and the driver works for them, not Uber. I've seen that in a few places.
In some countries, e.g. here in the UAE, drivers have to be employees. Most drivers are from India/Pakistan, so they need a visa, and somebody has to sponsor it.
Also, a lot of countries have laws saying [taxi] drivers _have_ to be employees, irrespective of visa stuff.
Usually the “big” qualifier is a function of RAM, not hard disk space. Getting hundreds of petabytes of data onto persistent storage in a “large” room has been possible for many years now.
What I'm trying to say is: as soon as you can fit the data and processing unit(s) into one well-cooled room in a datacenter, managed by two guys per shift, it is not a "big" problem anymore. Making it all local would probably speed up their queries/analytics enormously as well.
"Big data" is anything that's too big to cram into a standard database and still access in reasonable times. There's a practical definition, it doesn't just mean "YUUUGE".
That website doesn't list a price, but I doubt that running a rack of these would be cost-effective. I wouldn't be surprised if those drives have a cost per terabyte at least a couple of times that of commodity drives.
My point is, you could fit all of Uber's data (I know, I know: replication, sync, etc.) into racks in a SINGLE well-cooled room in a datacenter, managed by two guys per shift.
And this is not "Big" as far as I can see.
It would probably speed up their queries/analytics as well, everything being local.