
Large-Scale Transactional Data Lake at Uber Using Apache Hudi - santhoshkumar3
https://eng.uber.com/apache-hudi-graduation/
======
planck01
Is Hadoop still the way to go for big data solutions? My feeling, based on my
social bubble, is that fewer and fewer solutions are built on it. The recent
Stack Overflow survey seems to indicate it is not very popular anymore.

~~~
monkeyfacebag
I think you may be conflating Hadoop the MapReduce framework with Hadoop the
ecosystem, which includes HDFS, Hive, Spark and others. To the best of my
knowledge, the former is waning in popularity (supplanted by tools like Spark),
but the latter remains in wide use.

~~~
throwaway_pdp09
Straight up, what are people's views on all this big data stuff? I see it in so
many job ads and I really can't believe it's necessary. Sure, at the far end
of the bell curve are companies like Uber, but otherwise, are people using it
to process a few terabytes that could be handled better on one multicore
server? How many companies have enough data to justify it? Personal opinions
welcome.

~~~
cyberdrunk
What I've seen is that thousands of jobs for dozens of teams are run on a
company's Hadoop cluster. Sure, each team could provision their own custom
infra and run the job there, but having a centralized way to do it, with all
the extra niceties (scalable capacity, good monitoring, logging, HA), can
provide some company-wide efficiencies. Plus, some of the jobs can in fact be
huge and you may need dozens of nodes to process them (we have such jobs in a
bank, where we don't really have big data). Doing it without a cluster would
be problematic.

~~~
throwaway_pdp09
Sounds like you're one who might actually need it.

------
loic-sharma
The image in the 'The road ahead' section seems to be using Azure's icons. For
example:

Azure Event Hubs icon -
[https://images.app.goo.gl/NLu8jSKWYPwMyTtD9](https://images.app.goo.gl/NLu8jSKWYPwMyTtD9)

Azure Marketplace icon -
[https://images.app.goo.gl/X9vZeWWAR2TDsawe8](https://images.app.goo.gl/X9vZeWWAR2TDsawe8)

Azure Service Health icon -
[https://images.app.goo.gl/fJcUZqwUWbzK7tC99](https://images.app.goo.gl/fJcUZqwUWbzK7tC99)

They probably should change that...

------
nknealk
Is anyone non-Uber using Hudi in production? Can someone comment on what
developing against it is like compared to other warehousing technologies?

~~~
cmollis
We're about to evaluate it since it's incubating now in AWS' EMR. I think if
you have existing Spark workloads that need to update or delete existing data
stored in Parquet, then it might be a good choice since it fills in a major
gap there. The other choice is Delta Lake, which provides similar capability.
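
To make the "update or delete" gap concrete, here is a minimal sketch of
upsert semantics in plain Python. It is not Hudi's or Delta Lake's API; the
`upsert` function, the `id` and `ts` field names, and the `_deleted` marker
are all hypothetical, chosen only to show the idea of merging a batch of
changes into existing rows by key, with the newest version winning.

```python
def upsert(table, changes, key="id", ordering="ts"):
    """Merge a batch of changes into existing rows, matching on `key`.

    Hypothetical helper for illustration only: a change whose `ordering`
    value is older than the stored row is ignored, and a change carrying
    a truthy "_deleted" marker removes the row.
    """
    merged = {row[key]: row for row in table}
    for change in changes:
        k = change[key]
        if change.get("_deleted"):
            merged.pop(k, None)  # delete if present, no error otherwise
            continue
        current = merged.get(k)
        if current is None or change[ordering] >= current[ordering]:
            # insert a new row or overwrite the older version
            merged[k] = {f: v for f, v in change.items() if f != "_deleted"}
    return sorted(merged.values(), key=lambda r: r[key])


table = [
    {"id": 1, "city": "SF", "ts": 10},
    {"id": 2, "city": "NYC", "ts": 10},
]
changes = [
    {"id": 2, "city": "Boston", "ts": 20},   # update
    {"id": 3, "city": "LA", "ts": 20},       # insert
    {"id": 1, "ts": 30, "_deleted": True},   # delete
]
result = upsert(table, changes)
```

With immutable Parquet files this merge means rewriting the affected files
yourself; the point of the tools above is to do that bookkeeping for you.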

~~~
xyzzy_plugh
I believe Netflix's Iceberg (now also Apache) aims to solve the same problems.

~~~
hashhar
Iceberg solves more problems than Hudi does. The biggest win is computed
partitions and not having to define a strict partitioning strategy from the
start.
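
A rough sketch of what "computed partitions" means, in plain Python rather
than Iceberg's actual API. The partition value is derived from a regular
column by a transform (here a hypothetical `day_transform` over an assumed
`event_time` field), so writers and readers only deal with the timestamp and
the scan works out which partitions to open. All names are made up for
illustration.

```python
from collections import defaultdict
from datetime import datetime


def day_transform(ts):
    """Derive the partition value (an ISO date string) from a timestamp."""
    return ts.date().isoformat()


def write(rows, transform, column="event_time"):
    """Group rows into partitions computed from `column`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[transform(row[column])].append(row)
    return dict(partitions)


def scan(partitions, start, end, transform, column="event_time"):
    """Filter on the timestamp; partition pruning is derived from it."""
    lo, hi = transform(start), transform(end)
    # ISO date strings sort chronologically, so string comparison is enough
    touched = [p for p in sorted(partitions) if lo <= p <= hi]
    rows = [
        r for p in touched for r in partitions[p]
        if start <= r[column] <= end
    ]
    return touched, rows


rows = [
    {"trip": "a", "event_time": datetime(2020, 6, 1, 23, 30)},
    {"trip": "b", "event_time": datetime(2020, 6, 2, 8, 0)},
    {"trip": "c", "event_time": datetime(2020, 6, 3, 9, 15)},
]
partitions = write(rows, day_transform)
touched, found = scan(
    partitions,
    datetime(2020, 6, 2, 0, 0),
    datetime(2020, 6, 2, 23, 59),
    day_transform,
)
```

Because the query never names a partition column, swapping `day_transform`
for a different transform later would not require rewriting the queries.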

