
Uber’s Big Data Platform: 100+ Petabytes with Minute Latency - eaguyhn
https://eng.uber.com/uber-big-data-platform/
======
bradhe
I've been following Uber's big data platform engineering for a while, and
this is a really interesting update. Specifically, it's interesting how well
their Gen 3 stack held up. Also an interesting choice to solve the
incremental update problem at storage time instead of inserting another
upstream ETL process (which would be incredibly expensive at this level of
scale, I'm sure).

Also interesting: at a lot of companies, you look at the big data ecosystem
and it's littered with tons of tools. Uber seems like they've always done a
good job keeping that pared down, which indicates to me that their team knows
what they're doing.

~~~
artwr
I have only heard this second hand, but internally Uber has always had a
large litter of tools. There is always a little bit of chaos edited out of
what is presented to the outside world. They run several container schedulers
(YARN, Mesos, Myriad), and they used to run several separate workflow
orchestrators.

The proliferation of tools lets them test out different ideas under
production conditions, which is not bad per se. But the duplication of effort
and the variability have to come with some advantage to be worth continuing.
Infrastructure engineers wanting to reinvent the wheel can create a lot of
technical debt fast, and highly leveraged debt at that.

This takes nothing away from the great work Uber has been doing in the area.
They have definitely set a great direction for their platform.

~~~
tomnipotent
This is par for the course in any large org - Google, Yahoo, Facebook are all
in the same boat. It's just not possible for a single data infrastructure to
be everything to everyone.

------
vazamb
I find it interesting that one of their major pain points was data schemas.
After having worked at places that used plain JSON and at places that used
protobuf, I can highly recommend that anyone starting an even mildly complex
data engineering project (complex in the data or in the number of
stakeholders) use something like protobuf, Apache Arrow, or a columnar format
if you need one.

Having a clearly defined schema that can be shared between teams (we had a
dedicated repo for all protobuf definitions, with enforced pull requests)
significantly reduces the number of headaches down the road.
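
For a concrete picture, a shared definition might look something like this
(all names here are hypothetical, not Uber's actual schemas):

    # trip.proto, living in the shared schema repo (hypothetical example):
    #
    #   syntax = "proto3";
    #   message Trip {
    #     string trip_id    = 1;
    #     string rider_id   = 2;
    #     string driver_id  = 3;
    #     int64  started_at = 4;  // epoch millis
    #     double fare_usd   = 5;
    #   }
    #
    # After running `protoc --python_out=. trip.proto`, every team talks
    # through the same generated class, so the schema is enforced at
    # serialization time rather than discovered at query time:
    import trip_pb2  # hypothetical protoc-generated module

    trip = trip_pb2.Trip(trip_id="t-123", rider_id="r-456", fare_usd=11.50)
    payload = trip.SerializeToString()  # compact, schema-checked wire format

    decoded = trip_pb2.Trip()
    decoded.ParseFromString(payload)    # consumers get typed fields back
    assert decoded.trip_id == "t-123"

Pull requests against the repo then double as schema reviews, which is where
most of the headache prevention actually happens.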

------
Invictus0
I was wondering how Uber could possibly need 100 PB of space; but if you
consider that they've served roughly 10 billion rides, it comes out to
roughly 10 megabytes per ride (10^17 bytes / 10^10 rides = 10^7 bytes).

~~~
NeedMoreTea
Which is about 9.9997 MB more than I would hope and expect as a customer.

~~~
singingboyo
I think there are tradeoffs. For example, they probably store the route,
which is good (refunds for really terrible routes) but bad (tracking). That's
quite a bit of space, probably the biggest item. Ratings and possibly
comments, again, could be pretty big. Multiple UUIDs or whatever
(customer/driver IDs, CC transaction IDs) will take up some space as well.
300B is probably a fair bit too low.

That said, 10 MB does seem high. 1 MB or so would seem more appropriate, as a
very rough ballpark.

~~~
rasen58
They probably need to store a ton of things (like maps) that aren't per-ride
data though, so it's probably not fair to say each ride is 10 MB.

~~~
Invictus0
Even if they have 20 PB of maps, which is probably way too high, it's still
8 MB per ride.

------
manish_gill
Good post. The snapshot-based approach at ingestion time was the part I
couldn't figure out: why was it considered a good decision at the time it was
implemented?

I've experimented with Parquet data on S3 for a work POC, and the latency to
fetch the data, create tables, and run the Spark SQL query (on an EMR
cluster) was quite noticeable. I was advised that EMRFS would make it run
quicker, but I never got around to playing with that. But I guess the cost of
creating in-memory tables from raw data snapshots would still remain? Or
maybe I missed something.
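
For reference, the shape of what I was testing was roughly the following; the
bucket and paths are made up, and the initial S3 scan was where most of the
latency showed up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-poc").getOrCreate()

    # Read raw Parquet snapshots straight off S3 (hypothetical bucket/path).
    # On EMR this goes through EMRFS when you use the s3:// scheme.
    trips = spark.read.parquet("s3://my-bucket/warehouse/trips/")

    # Register an in-memory view so analysts can hit it with Spark SQL.
    trips.createOrReplaceTempView("trips")

    # (started_at is assumed to be epoch milliseconds here)
    daily = spark.sql("""
        SELECT date_trunc('day', from_unixtime(started_at / 1000)) AS day,
               count(*) AS rides
        FROM trips
        GROUP BY 1
        ORDER BY 1
    """)
    daily.show()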

Also, I take it that if 24 hours is the latency requirement from ingestion to
availability of this data, this obviously isn't the data platform powering
the real-time booking/sharing of Uber rides. I'd be curious to see the data
pipeline that powers that for Uber.

------
georgewfraser
Reading this, I can’t help but think Uber would be better off adopting one of
the commercial data warehouses that separate compute from storage: Snowflake
or BigQuery. They have full support for updates, they support huge scale, and
because they’re more efficient, the cost is comparable to Presto in spite of
the vendor margin. You can ingest huge quantities of updates if you batch
them up correctly, and there are commercial tools that will do the entire
ingest for you ( _cough_ Fivetran).

Disclosure: am CEO of Fivetran.

~~~
sixdimensional
The pattern I have seen in many current big data architectures reflects a
classic build-vs-buy decision.

Organizations that have the resources tend to go open source with a lot of
custom tools (such as Uber, or FAANG). Those companies tend to be the "makers"
of such tech as well.

Organizations that don't have those resources, the in-house experience, or
the desire to build rely on commercially licensed open source offerings,
cloud-based services, or traditional commercial products.

At these scales of data and over these time horizons, owning the engineering
resources capable of supporting the tools might be necessary, but it is
surely expensive.

If one can go with a stable, well known commercial offering that reaches the
desired scale but keeps the volumes of data in open standard formats, I think
that is a good compromise. Many commercial vendors have gone that way, for
example, Microsoft recently went all in on integrating Spark and HDFS closely
with SQL Server 2019, and a lot of other database vendors have already done
that as well (e.g. HP with Vertica on Hadoop, etc.).

I also think it's possible that at these scales and performance levels, the
people doing this big data work really are on the bleeding edge, and
therefore innovation and new development are almost a requirement to make it
all work together and meet the aggressive performance and efficiency
benchmarks desired. Especially when having to do complex things like handling
both the speed and batch layers in one unifying architecture (as in the
Lambda architecture).

~~~
jfim
The problem with commercial offerings is that at larger scales there often
need to be special optimizations, and those don't apply to most of that
vendor's customers.

For example, imagine that System X can scale well up to 1 terabyte of data,
but that after that it becomes increasingly complex to handle more data and it
requires special optimizations to maintain an acceptable level of performance.

From System X's perspective as a vendor, 99% of their customers are perfectly
fine with the performance of System X and just want more features (which the
vendor can charge for when customers upgrade to the next version of System
X). On the other side, there's that annoying 1% of customers that just keeps
asking for more and more performance optimizations that the other 99% don't
care about.

From the vendors' perspective, it makes more sense to invest development
efforts into features (that can translate into more money) than in performance
optimizations that only appeal to a narrow segment of their users. For the
company that requires these optimizations, since the commercial codebase is
proprietary, they're locked in and have no easy way out.

That's pretty much why the large tech companies invest in owning this part of
the stack. It mitigates a scaling risk, keeps expertise in house, and ensures
that the solution developed in house matches the often specialized needs of
the business 100%.

~~~
georgewfraser
This makes sense in theory but in practice I have observed that commercial
data warehouses are incredibly well optimized. There are lots of companies we
don’t hear about using Snowflake and BigQuery to analyze giant datasets. They
don’t blog about it because they’re just using boring commercial tech. I think
the real reason companies like Uber build their own stack is just good old
not-invented-here.

~~~
carlineng
Tough to tell from the diagrams exactly, but it looks like the majority of
their data is stored in their own data centers, which might mean some
reluctance to migrate and ship data to the cloud (“cloud storage” only makes
an appearance in the Gen 1 chart). There also exists at least one public
example where they’ve bought a commercial database to build on
(https://www.memsql.com/blog/real-time-analytics-at-uber-scale/). I’d
be willing to concede that there might be legit reasons for not wanting to use
BQ or Snowflake.

That being said, we use Snowflake heavily at Strava and are very happy with
it.

------
buremba
I wonder which BI tools they use for running ad-hoc queries on their Presto
cluster. User behavioral analytics is a hassle when you use SQL, and generic
BI solutions don't help with that.

Also, I assume they have dashboards that use pre-aggregated tables for faster
results; they probably have ETL jobs for this use case, but is the
pre-aggregated data stored on HDFS as well?
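
Something along these lines is what I'd expect, i.e. a nightly Spark job that
writes the rollup back to HDFS next to the raw data (table names and paths
here are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

    # Hypothetical raw trip data already sitting in HDFS as Parquet.
    spark.read.parquet("hdfs:///warehouse/raw/trips/") \
        .createOrReplaceTempView("trips")

    # Pre-aggregate down to one row per city per day; dashboards scan this
    # tiny table instead of the raw events.
    rollup = spark.sql("""
        SELECT city_id,
               to_date(pickup_ts) AS day,
               count(*)      AS rides,
               sum(fare_usd) AS revenue
        FROM trips
        GROUP BY city_id, to_date(pickup_ts)
    """)

    rollup.write.mode("overwrite").partitionBy("day").parquet(
        "hdfs:///warehouse/agg/trips_daily_by_city/")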

------
philip1209
After their data problem exceeded a single MySQL instance - hypothetically,
what would have happened if they had switched to Google Cloud Spanner?
Ostensibly Google has a lot more than 100 petabytes in Spanner. Could you
still run basic queries on it without switching to HBase?

~~~
bradhe
Fairly certain the Spanner service wasn’t available then.

------
faizshah
Where exactly does Flink/AthenaX fit into Uber's stack?

------
tomnipotent
Hudi sounds a lot like event sourcing, but at data warehousing scale: a
changelog backed by a snapshot of the latest state.
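
In miniature, the read path would be something like this (a toy sketch of the
pattern, not Hudi's actual implementation):

    # Append-only changelog of upserts plus a periodically compacted
    # snapshot; reads = snapshot + replay of the uncompacted tail.
    log = []        # (key, record) upserts, in arrival order
    snapshot = {}   # merged state as of the last compaction

    def upsert(key, record):
        log.append((key, record))

    def compact():
        # Fold the log into the snapshot and truncate it, roughly what a
        # Hudi compaction does when merging delta files into base files.
        global log
        for key, record in log:
            snapshot[key] = record
        log = []

    def read_current_state():
        state = dict(snapshot)
        for key, record in log:
            state[key] = record
        return state

    upsert("trip-1", {"status": "requested"})
    compact()
    upsert("trip-1", {"status": "completed"})
    assert read_current_state()["trip-1"]["status"] == "completed"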

------
saganus
Interesting that they use the term "driver-partner" in some parts but just
"driver" in others.

I guess they want to avoid liability as much as possible?

Would it really be possible to use a blog post in a legal proceeding to
determine whether Uber has drivers or partners?

~~~
riteshpatel
Pretty sure that's because in some cities the driver is dispatched by a local
car company and the driver works for them, not Uber. I've seen that in a few
places.

------
Iwan-Zotow
What exactly is "Big" here? It is on the order of ten thousand hard drives, a
dozen or so racks...

~~~
nemothekid
Usually the “big” qualifier is a function of RAM, not hard disk space.
Getting hundreds of petabytes of data onto persistent storage in a “large”
room has been possible for many years now.

“Big” data was never about how to store the data.

~~~
Iwan-Zotow
What I'm trying to say is that as soon as you can fit the data and processing
units into one well-cooled room in a datacenter, managed by two guys per
shift, it is not a "Big" problem anymore. Making it all local would probably
speed up their queries/analytics enormously as well.

