I've been following Uber's big data platform engineering for a while, and this is a really interesting update. Specifically, it's interesting how well their Gen 3 stack held up. It's also an interesting choice to solve the incremental-update problem at storage time instead of inserting another upstream ETL process (which I'm sure would be incredibly expensive at this scale).
Also interesting: A lot of companies, you look at their big data ecosystem and it's littered with tons of tools. Uber seems like they've always done a good job keeping that pared down, which indicates to me that their team knows what they're doing for sure.
I have only heard second hand, but internally Uber has always had a large litter of tools. There is always a bit of chaos edited out of what is presented to the outside world. They run several container schedulers (YARN, Mesos, Myriad), and they used to run several separate workflow orchestrators.
The proliferation of tools lets teams test out different ideas under production conditions, which is not bad per se. But the duplicated effort and variability have to come with some advantage to be worth continuing. Infrastructure engineers wanting to reinvent the wheel can create a lot of technical debt fast, and highly leveraged debt at that.
This takes nothing away from the great work Uber has been doing in the area. They have definitely set a great direction for their platform.
This is par for the course in any large org - Google, Yahoo, Facebook are all in the same boat. It's just not possible for a single data infrastructure to be everything to everyone.
Being littered with tools is not necessarily a bad thing. A tool is a modular, reusable component that acts as a building block for something more complex.
Consider Netflix. They are big open source contributors to the data industry. Many of their tools interest me because I can see their value outside of Netflix. Uber on the other hand is solving esoteric problems with esoteric solutions.
I find it interesting that one of their major pain points was data schemas. Having worked at places that used plain JSON and places that used protobuf, I highly recommend that anyone starting an even mildly complex data engineering project (complex in either the data or the number of stakeholders) use something like protobuf, Apache Arrow, or a columnar format if you need one.
Having a clearly defined schema that can be shared between teams (we had a specific repo for all protobuf definitions with enforced pull requests) significantly reduces the amount of headaches down the road.
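To make that concrete, here is a minimal sketch of what an explicit, shared schema buys you. I'm using PyArrow as a stand-in for a protobuf repo, and every field name below is made up:

    # Minimal sketch of an explicit, shared schema; PyArrow stands in here
    # for a protobuf repo, and every field name below is made up.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # The schema lives in one reviewed place instead of being implied by JSON blobs.
    trip_schema = pa.schema([
        pa.field("trip_id", pa.string(), nullable=False),
        pa.field("rider_id", pa.string(), nullable=False),
        pa.field("started_at", pa.timestamp("ms", tz="UTC")),
        pa.field("fare_cents", pa.int64()),
    ])

    # Producers have to conform: a type mismatch fails loudly at write time
    # instead of becoming a silent parsing headache for downstream teams.
    batch = {
        "trip_id": ["t-1", "t-2"],
        "rider_id": ["r-9", "r-7"],
        "started_at": [1535760000000, 1535763600000],  # ms since epoch
        "fare_cents": [1250, 830],
    }
    table = pa.table(batch, schema=trip_schema)
    pq.write_table(table, "trips.parquet")  # columnar; the schema travels with the file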
I was wondering how Uber could possibly need 100PB of space; but if you consider that they've served roughly 10 billion rides, it actually only comes out to roughly 100 kilobytes per ride.
I think you miscalculated.
I get 10MB per trip on average, assuming 10bn total trips. Maybe deduct something for maps and inefficient file formats & data structures?
The data sizing heuristic for basic vehicle positioning telemetry is a GB per vehicle per year assuming a professional driver (e.g. taxi, trucking, etc). A petabyte is about a million vehicle-years of positioning data, which isn't that much.
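For anyone who wants to sanity-check that heuristic, a rough back-of-envelope (every input here is an assumption, not a real number from Uber):

    # Back-of-envelope for the GB-per-vehicle-per-year heuristic.
    # All inputs are assumptions, not Uber's actual numbers.
    driving_hours_per_day = 8       # professional driver
    driving_days_per_year = 300
    ping_interval_s = 1             # one position fix per second
    bytes_per_ping = 100            # timestamp, lat/lon, speed, heading, ids

    pings_per_year = driving_hours_per_day * 3600 * driving_days_per_year / ping_interval_s
    gb_per_vehicle_year = pings_per_year * bytes_per_ping / 1e9
    print(f"~{gb_per_vehicle_year:.2f} GB per vehicle-year")  # ~0.86 GB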
Remember, Uber also stores application positioning telemetry as well, so it isn't just the number of vehicles.
It makes me wonder if those yellow cabs are just stuffed full of hard drives to store all their historic data too. None of that data is useful or necessary. I would love to see any “insight” that would beat the knowledge of a basically competent cabbie on where and when people need rides. Uber just has lots of investor funny money that they want to spend to make it look like technology has something to do with their business when they are actually just a marketing and booking engine for independent taxi operators. Investors just get more excited and start throwing “internet money” your way if you brag about things like amount of data and engineers.
I had a friend head off in 2011 to work at Hailo in London; they designed much of the data ingest & storage architecture there. That was built on Cassandra, and some of the numbers were staggering, even for an (initially) single-city, < 2,000 driver system.
You're not just tracking the start and end points of successful rides -- you're tracking users whenever their app is open (both drivers and users move around before and after initiating a ride, and you need to share that data between the parties), and you're also tracking empty vehicles so you can identify over- and under-serviced areas ... doubtless lots of other real-time stuff you want to know. But the big thing is that you want that data available afterwards for mining, improving your service, regulatory obligations, and so on.
Only if you believe in retaining all data in perpetuity. I suppose it demonstrates the difference in US and European perspectives on data.
Most of what you describe has little meaning or use at a few months old. Sure you might want to collect detailed awareness of individual drivers over say a month, generic traffic patterns over six months, and little or nothing beyond.
Once the data has outlived that direct, primary purpose, you dispose of it, or it's a famous, excessive data breach waiting to happen. There is absolutely nothing wrong with keeping only billing data beyond, say, six months max.
> Most of what you describe has little meaning or use at a few months old. Sure you might want to collect detailed awareness of individual drivers over say a month, generic traffic patterns over six months, and little or nothing beyond.
The really invasive pattern-matches happen best retroactively at that scale.
Age of the data does not typically impact many such analyses.
I think there are tradeoffs, for example, they probably store the route, which is good (refunds for really terrible routes) but bad (tracking). That's quite a bit of space, probably the big one. Rating and possibly comments, again could be pretty big. Multiple UUIDs or whatever (cust/driver IDs, CC txn IDs) will take up some space as well. 300B is probably a fair bit too low.
That said, 100kB does seem high. 10kB or so would seem more appropriate, as a very rough ballpark.
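Rough numbers on that, with every input being a guess rather than anything from Uber:

    # Very rough per-trip storage estimate; every number is a guess.
    trip_minutes = 20
    points_per_minute = 15          # a GPS point every ~4 seconds
    bytes_per_point = 24            # timestamp + lat + lon + a little metadata

    route_bytes = trip_minutes * points_per_minute * bytes_per_point   # 7,200 B
    metadata_bytes = 1_000          # ids, fare breakdown, rating, timestamps
    comment_bytes = 200             # most ratings have no comment, so a small average

    total_kb = (route_bytes + metadata_bytes + comment_bytes) / 1000
    print(f"~{total_kb:.1f} kB per trip")   # ~8.4 kB, i.e. the ~10 kB ballpark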
Beyond the route itself, consider also: time spent in traffic (speed per geo-coordinate) for use in predicting surges, deviations between planned and taken routes, etc. The degree to which Uber has a pulse on urban traffic is fascinating and probably exceeds even that of city planners with access to ubiquitous traffic cameras.
Almost none of which they need to keep. Route for 30 days perhaps, start and end point for longer. I imagine most ratings are just that with no comments.
Good post. The snapshot-based approach at ingestion time was the part I couldn't figure out: why was it considered a good decision during implementation?
I've experimented with Parquet data on S3 for a work POC, and the latency to fetch the data/create tables/run the Spark SQL query (on an EMR cluster) was quite noticeable. I was advised that EMRFS would make it run quicker, but I never got around to playing with that. But I guess creating in-memory tables from raw data snapshots would still hold? Or maybe I missed something.
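For reference, the POC was roughly this shape (bucket, path, and column names below are placeholders, not the real ones):

    # Rough shape of the POC: Parquet on S3 queried with Spark SQL.
    # Bucket, path, and column names below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-on-s3-poc").getOrCreate()

    # On EMR the s3:// scheme goes through EMRFS; with plain Apache Spark +
    # hadoop-aws you'd use s3a:// instead.
    trips = spark.read.parquet("s3://my-poc-bucket/warehouse/trips/")

    trips.createOrReplaceTempView("trips")
    daily = spark.sql("""
        SELECT city_id, to_date(started_at) AS day, count(*) AS rides
        FROM trips
        GROUP BY city_id, to_date(started_at)
    """)
    daily.show(20)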
Also, I take it that if 24 hours is the latency requirement from ingestion to availability of this data, this obviously isn't the data platform powering the real-time booking/sharing of Uber rides. I'd be curious to see what data pipeline powers that at Uber.
Reading this, I can’t help but think Uber would be better off adopting one of the commercial data warehouses that separates compute from storage: Snowflake or BigQuery. They have full support for updates, they support huge scale, and because they’re more efficient the cost is comparable to Presto in spite of the margin. You can ingest huge quantities of updates if you batch them up correctly, and there are commercial tools that will do the entire ingest for you (cough Fivetran).
The pattern I have seen in many current big data architectures reflects a classic build-vs-buy decision quite well.
Organizations that have the resources tend to go open source with a lot of custom tools (such as Uber, or FAANG). Those companies tend to be the "makers" of such tech as well.
Organizations that don't have those resources or in-house experience, or desire to build, rely on the commercial licensing for open source offerings, cloud based or traditional commercial offerings.
At these scales of data and time horizons, owning the engineering resources capable of supporting the tools might be necessary, but surely expensive.
If one can go with a stable, well known commercial offering that reaches the desired scale but keeps the volumes of data in open standard formats, I think that is a good compromise. Many commercial vendors have gone that way, for example, Microsoft recently went all in on integrating Spark and HDFS closely with SQL Server 2019, and a lot of other database vendors have already done that as well (e.g. HP with Vertica on Hadoop, etc.).
I also think it's possible that at these scales and performance, people doing that big data work really are on the bleeding edge and therefore, innovation and new development is almost a requirement to make it all work together and meet the aggressive performance and efficiency benchmarks desired. Especially when having to do complex things like handling both the speed and batch layers in one unifying architecture (as in lambda architecture).
The problem with commercial offerings is that at larger scales, special optimizations are often needed that don't apply to most of the vendor's other customers.
For example, imagine that System X can scale well up to 1 terabyte of data, but that after that it becomes increasingly complex to handle more data and it requires special optimizations to maintain an acceptable level of performance.
From System X's perspective as a vendor, 99% of their customers are perfectly fine with the performance of System X and just want more features (which the vendor can charge for in the next version of System X). On the other side, there's that annoying 1% of customers that just keeps asking for more and more performance optimizations that the other 99% don't care about.
From the vendors' perspective, it makes more sense to invest development efforts into features (that can translate into more money) than in performance optimizations that only appeal to a narrow segment of their users. For the company that requires these optimizations, since the commercial codebase is proprietary, they're locked in and have no easy way out.
That's pretty much why the large tech companies invest in owning their own part of the stack. It mitigates a scaling risk, keeps expertise in house and ensures that the solution that is developed in house matches 100% with the often specialized needs of the business.
This makes sense in theory but in practice I have observed that commercial data warehouses are incredibly well optimized. There are lots of companies we don’t hear about using Snowflake and BigQuery to analyze giant datasets. They don’t blog about it because they’re just using boring commercial tech. I think the real reason companies like Uber build their own stack is just good old not-invented-here.
Tough to tell from the diagrams exactly, but it looks like the majority of their data is stored in their own data centers, which might mean some reluctance to migrate and ship data to the cloud (“cloud storage” only makes an appearance in the Gen 1 chart). There also exists at least one public example where they’ve bought a commercial database to build on (https://www.memsql.com/blog/real-time-analytics-at-uber-scal...). I’d be willing to concede that there might be legit reasons for not wanting to use BQ or Snowflake.
That being said, we use Snowflake heavily at Strava and are very happy with it.
They are, but their costs also scale somewhat linearly (i.e. for CPU core-based licensing, 10x the cores means ~10x the cost, give or take a little based on the vendor discount). At some point, that makes having an engineering team working on a custom solution the cheaper option: why pay vendor X n million dollars per year when you could have 4n engineers working on a system that's perfectly adapted to your needs?
That being said, there is definitely a large amount of NIH promotion-ware being developed at large tech companies, as you mention.
How does BigQuery support updates? IIRC in BigQuery you have to overwrite whole partitions. At least in Hive you can have secondary partitions to mitigate this.
Thanks. I checked the quotas. They seem to have increased to 200 per table. Last time I checked (last year) they were pretty limiting. I like the MERGE statement though. Hope it comes out of Beta soon.
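For anyone curious what the batched-update path looks like, it's something along these lines (project/dataset/table names are invented): load the day's changes into a staging table, then merge, e.g. via the Python client.

    # Sketch of batched upserts into BigQuery with MERGE; all project, dataset,
    # and table names here are invented.
    from google.cloud import bigquery

    client = bigquery.Client()

    merge_sql = """
    MERGE `my_project.analytics.trips` AS t
    USING `my_project.analytics.trips_staging` AS s
    ON t.trip_id = s.trip_id
    WHEN MATCHED THEN
      UPDATE SET t.status = s.status, t.fare_cents = s.fare_cents
    WHEN NOT MATCHED THEN
      INSERT (trip_id, status, fare_cents)
      VALUES (s.trip_id, s.status, s.fare_cents)
    """

    client.query(merge_sql).result()  # one DML statement per batch, not per row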
The pricing is per scan, so an update is not billed as an in-place change; you're charged for the bytes scanned in the selected columns of the partitions that get scanned.
My take is that companies like Uber, Google, Facebook see and encounter issues ahead of the curve - so they build their own stuff and luckily open source it, which is good. Maybe others will run into similar issues and can leverage those solutions in case they work for them also.
Isn't a DW just part of the overall equation? These days I assume you'd want to do a lot more, like training ML models as part of your pipeline, things that Spark/Scala allow you to do that would be harder with just SQL. I think most customers use both.
Yes but BigQuery provides read/write libraries. So you can use Spark or Apache Beam to read the data into your cluster and then process it using your framework of choice - XGBoost/SparkML etc. The predictions can then be written back to BigQuery using those same libraries.
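Roughly, the round trip looks like this with the spark-bigquery connector (table and bucket names are made up, and the connector jar has to be on the classpath):

    # Sketch of the round trip with the spark-bigquery connector.
    # Table and bucket names are made up.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-roundtrip").getOrCreate()

    # Read a BigQuery table into a Spark DataFrame, train/score with SparkML,
    # XGBoost, etc., then write the predictions back.
    features = (spark.read.format("bigquery")
                .option("table", "my_project.analytics.trip_features")
                .load())

    predictions = features.withColumn("score", features["fare_cents"] * 0.0)  # stand-in for a real model

    (predictions.write.format("bigquery")
        .option("table", "my_project.analytics.trip_scores")
        .option("temporaryGcsBucket", "my-staging-bucket")
        .save())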
Sure but that would mean copying the data in/out. Instead you could get data-compute co-location with a proper HDFS cluster. But I think BigQuery launched SQL ML APIs recently to train simple distributed ML models on the fly.
A "proper" HDFS cluster also wastes incredible amounts of disk with it's naive redundancy. It's a bear to maintain (see sibling comment) and the disaster recovery story (at a DC level) is non-existent.
If you invest a chunk of the money saved by not doing 3- or 4-way data replication into a good network, a lot of that move-compute-to-data complexity becomes unnecessary. The second big efficiency promise of basic MapReduce is doing lots of sequential I/O on those pesky disks, which can do ~100x the sequential throughput of their paltry random-read performance. Alas, with today's mixed workloads, you're certainly not getting the >100 MB/s from a disk that might make a benchmark scream. So the network savings from compute-next-to-storage become less useful (you're still going to do that shuffle before your reduce anyway, aren't you?).
Best I can tell, the compute-next-to-storage trick does still matter as you get to large data sets (PB in a job), but then as you keep growing your infrastructure (and I'd wager Uber would be just about at this point), the benefits of disaggregation start weighing increasingly heavily. In particular, fleet management becomes considerably easier with less resource stranding.
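To put rough numbers on both points (all of these are assumptions, not measurements):

    # Rough numbers behind the replication and sequential-I/O points above.
    # Every figure here is an assumption, not a measurement.
    logical_pb = 100
    replication_factor = 3
    raw_pb = logical_pb * replication_factor      # 300 PB of raw disk for 100 PB of data

    seq_mb_s = 150              # what a single disk can stream in a benchmark
    random_mb_s = 1.5           # ~100x worse once seeks dominate
    sequential_fraction = 0.3   # share of I/O that stays sequential in a mixed workload

    blended_mb_s = sequential_fraction * seq_mb_s + (1 - sequential_fraction) * random_mb_s
    print(f"{raw_pb} PB raw; ~{blended_mb_s:.0f} MB/s per disk in practice")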
I wonder which BI tools they use for running ad-hoc queries on their Presto cluster. User behavioral analytics is a hassle when you use SQL, and generic BI solutions don't help with that.
Also, I assume that they have dashboards that use pre-aggregated tables for faster results, they probably have ETL jobs for this use-case but is the pre-aggregated data stored on HDFS as well?
After their data problem exceeded a single MySQL instance, hypothetically, what would have happened if they had switched to Google Cloud Spanner? Ostensibly Google has a lot more than 100 petabytes in Spanner. Could you still run basic queries on it without switching to HBase?
Pretty sure that's because in some cities the driver is dispatched by a local car company and the driver works for them, not Uber. I've seen that in a few places.
In some countries, e.g. here in the UAE, drivers have to be employees. Most drivers are from India/Pakistan, so they need a visa, and somebody has to sponsor it.
Also, a lot of countries have laws saying [taxi] drivers _have_ to be employees, irrespective of visa stuff.
Usually the “big” qualifier is a function of RAM, not hard disk space. Getting hundreds of petabytes of data onto persistent storage in a “large” room has been possible for many years now.
What I'm trying to say is: as soon as you can fit the data and processing unit(s) into one well-cooled room in a datacenter, managed by two guys per shift, it is not a "big" problem anymore. Making it all local would probably speed up their queries/analytics enormously as well.
"Big data" is anything that's too big to cram into a standard database and still access in reasonable times. There's a practical definition, it doesn't just mean "YUUUGE".
That website doesn't list a price, but I doubt that running a rack of these would be cost-effective. I wouldn't be surprised if those drives have a cost per terabyte at least a couple of times that of commodity drives.
My point is, you could fit all of Uber's data (I know, I know: replication, sync, etc.) into racks in a SINGLE well-cooled room in a datacenter, managed by two guys per shift.
And this is not "Big" as far as I can see.
It would probably speed up their queries/analytics as well, everything being local.