Hacker News
Large-Scale Transactional Data Lake at Uber Using Apache Hudi (uber.com)
66 points by santhoshkumar3 on June 10, 2020 | 31 comments

Is Hadoop still the way to go for big data solutions? My feeling, based on my social bubble, is that fewer and fewer solutions are built on it. The recent Stack Overflow survey seems to indicate it is not very popular anymore.

I think you may be conflating Hadoop the MapReduce framework with Hadoop the ecosystem, which includes HDFS, Hive, Spark and others. To the best of my knowledge, the former is waning in popularity (supplanted by tools like Spark), but the latter remains in wide use.

Straight up, what are people's views on all this big data stuff? I see it in so many job ads and I really can't believe it's necessary. Sure, at the far end of one side of the bell curve are companies like Uber, but otherwise, are people using it to process a few terabytes that could be done better on one multicore server? How many companies have enough data to justify it? Personal opinions welcome.

What I've seen is that thousands of jobs for dozens of teams are run on the company's Hadoop cluster. Sure, each team could provision its own custom infra and run the jobs there, but having a centralized way to do it, with all the extra niceties (scalable capacity, good monitoring, logging, HA), can provide some company-wide efficiencies. Plus, some of the jobs can in fact be huge and you may need dozens of nodes to process them (we have such jobs in a bank, where we don't really have big data); doing it without a cluster would be problematic.

Sounds like you're one who might actually need it.

Traditional SQL-based data warehouses still rule the world, and probably will for the foreseeable future. HN reports on a small number of businesses that use bleeding-edge technologies (often just for the sake of it). The vast majority of businesses stick to the "traditional" way of doing things.

At the end of the day, most places just want something to point PowerBI at etc.

I've found that a lot of big data solutions in use by companies are only in use because the business requirement was "we want all the data and we'll figure out what to do with it later", but then it turns out only a fraction of the data is used, or only in a heavily aggregated form. That makes sense, of course; nobody is going to go through a few terabytes worth of records manually.

Answer the question about how to process a few TB on a multicore server, and you might still find yourself using Spark, or something like it.

If you start from the assumption that you've been ingesting data and storing it in a compressed columnar format like Parquet or ORC, then you're already locked into a solution that exists in the Hadoop ecosystem. This turns out to be an effective way to deal with terabytes of data because depending on the query you want to execute, the file format (and a smart partitioning structure) helps turn your problem of reading terabytes into one of reading gigabytes. And, everything in Hadoop land is generally a query, so you're going to need a query engine like Hive, Impala, Spark, etc. AND, you probably need something like Hive metadata so that you're not crawling some directory structure for the schema and input files every time you start up a new process to run a query.
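The partition-pruning effect described above can be sketched in plain Python. The date-based layout and file names here are hypothetical, just to illustrate how a Hive-style `key=value` directory structure lets a query engine skip most of the data before reading a single byte:

```python
# Hive-style partitioned layout: one directory per day, as Spark/Hive would write it.
# A query with a predicate on the partition key never opens the other directories.
files = [
    "events/date=2020-06-08/part-0000.parquet",
    "events/date=2020-06-09/part-0000.parquet",
    "events/date=2020-06-10/part-0000.parquet",
    "events/date=2020-06-10/part-0001.parquet",
]

def prune(paths, key, value):
    """Keep only the files whose partition directory matches the predicate."""
    return [p for p in paths if f"{key}={value}" in p.split("/")]

# "WHERE date = '2020-06-10'" touches 2 of the 4 files; the rest are never read.
to_read = prune(files, "date", "2020-06-10")
print(to_read)
```

This is roughly what the Hive metastore plus a smart partitioning scheme buys you: the metadata answers "which files could possibly match" so the scan shrinks from terabytes to gigabytes.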

You _could_ write something on your own that just forks out a bunch of threads in a single process to rip through the data, but why? Think about what you're effectively implementing - a bespoke query engine that runs one query plan. Spark has already written an API and a query engine (multi-process distributed, unlike whatever you're likely to hand-roll) and lots of input/output code. You can spark-submit --master local[*] and use up all of the cores and ram everything into one monster JVM if you really wanted to.

Finally, you could stuff all of this in memory, but at what cost, and what are you going to use to do it? (Worse, what happens if the VM goes down - those terabytes are going to take their sweet time reloading over a network link.)

How? Plenty of disks to give you the IO. It's usually about the IO. If you want memory, go to a server vendor's site and configure a box with lots of DRAM slots maxed out. You might be surprised. But really, getting enough memory is less important than IO, so stick with disks, SSDs I suppose.

You can get a single chip with dozens of cores for not too much.

I guess that's how I'd do it.

> This turns out to be an effective way to deal with terabytes of data

A few terabytes don't need a cluster, typically.

> You _could_ write something on your own that just forks out a bunch of threads in a single process to rip through the data, but why?

Because it's simple and easy. I wrote one in a few days. Not much code.

> You can spark-submit --master local[*] and use up all of the cores and ram everything into one monster JVM if you really wanted to.

Point is, do you need to?

> those terabytes are going to take their sweet time reloading over a network link

You do not run big IO over a network like that if you can avoid it. With a single server with plenty of SSDs, you trivially don't.
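The IO point can be made with quick arithmetic. A back-of-envelope sketch; the drive count and per-drive throughput are assumed, ballpark figures for commodity NVMe around 2020, not measurements:

```python
# Back-of-envelope: full sequential scan of 2 TB on one box with local NVMe,
# versus pulling the same data over a 1 Gbit/s network link.
tb = 2 * 1000**4                  # 2 TB in bytes
nvme_per_drive = 3 * 1000**3      # ~3 GB/s sequential read per NVMe SSD (assumed)
drives = 8                        # assumed drive count

scan_seconds = tb / (nvme_per_drive * drives)
network_seconds = tb / (1_000_000_000 / 8)  # 1 Gbit/s = 125 MB/s

print(round(scan_seconds), round(network_seconds / 3600, 1))
# roughly a minute and a half locally, vs. hours over the link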

My take anyway.

Ah, so the thing I was attacking was the "if it fits in RAM, it isn't big data" meme. AWS is happy to sell you on the idea that you can use S3 as the persistent storage and then bring up the compute whenever. If you bring up the compute on-demand, then the network link matters, whether it's EBS spoon-feeding you data or you using multiple VMs and the right object storage strategy in S3 to suck the data out as fast as possible.

You pay a premium to have this stuff loaded up on faster hardware. If you look at how Redshift works on their compute-optimized nodes, everything's sitting in local NVMe SSDs (a slightly different strategy is used on ra3 nodes). They handle the fact that this is ephemeral with automatic backups. Cluster restores aren't exactly fast. I actually agree with you; if you've got some monster SSDs attached, which are comparatively cheap, why focus on the RAM... I believe there are reasons to do that sometimes, but not everything demands quite that level of performance.

For this point:

>> You _could_ write something on your own that just forks out a bunch of threads in a single process to rip through the data, but why?

> Because it's simple and easy. I wrote one in a few days. Not much code.

I think it depends on what formats you're using and how it's laid out on disk. A lot of people reshape their data into a table-like structured or semi-structured format, and that makes it a candidate for putting it into a database like Postgres or Redshift, and other times it makes more sense to bring the database to the data (the Hadoop ecosystem of stuff.)

For example my company still has stuff that writes out row-like objects into S3, and it's not too hard to write a single process job that spawns threads, reads the input files, and does some computation on those. They're on S3, so throughput kinda sucks, and copying it to local SSD only makes sense if you want to make multiple passes. But the Parquet format alternative of this data is tremendously faster to work with and genuinely easier to use, and the only barrier to entry is that you run Spark on your local machine and commit to using Spark with Elastic MapReduce to do processing for this data. Sure, you might only end up filtering through tens or hundreds of gigabytes; terabytes is usually rare. But part of that is because you only read 10-20% of every input file to do the work. It also integrates extremely well with other stuff we're using - it's even easier to just use Snowflake against the same data set, for example.
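The "only read 10-20% of every input file" figure is column pruning at work. A toy illustration in plain Python (the record shape and column names are made up), comparing a row layout, where every byte of every record must be scanned, with a columnar layout, where a query touches only the column it needs:

```python
import json

# The same 1,000 records in two layouts.
records = [{"uuid": str(i), "ts": i, "city": "x" * 20, "fare": 1.0, "notes": "y" * 80}
           for i in range(1000)]

row_layout = [json.dumps(r) for r in records]                  # row-oriented: one blob per record
col_layout = {k: [r[k] for r in records] for k in records[0]}  # column-oriented: one array per field

# SELECT avg(fare): the row layout must scan every full record...
row_bytes = sum(len(s) for s in row_layout)
# ...while the columnar layout reads only the 'fare' column.
col_bytes = len(json.dumps(col_layout["fare"]))

print(f"columnar read is {col_bytes / row_bytes:.0%} of the row-oriented read")
```

Parquet adds compression, encodings and statistics on top, but the core win is exactly this: queries that touch a few columns never read the rest.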

Sorry, I think I'm rambling at this point and not presenting a really coherent argument.

I'd pretty much define big data as quantity, so yes, if it fits in ram it ain't. But that's just my definition (though reasonable, surely).

You have immediately conflated 'big data' with a ton of cloud tech. My point is that you arguably don't need big data frameworks, and therefore arguably sticking it in the cloud is pointless too. If you can buy a reasonable server and stick it in the corner of the office, do so (noise & security aside, yeah).

> Sure, you might only end up filtering through tens or hundreds of gigabytes ...S3 ...Parquet ...Spark ...Elastic MapReduce ...Snowflake

Dude, do you need all this stuff!? Really? For a poxy few hundred GB?

"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. "


It’s the same with almost all hyped-up technology or movements trying to push stuff into C-suite conversations.

Exactly. Btw, companies at the other end do not use Hadoop either. S3 + Presto is super popular nowadays in the cloud.

Spark is an example of a tool that (also) exists inside the Hadoop ecosystem and is still widely adopted, but at the same time it's losing traction as better options arise.

What better options? Can you drop some links? Thanks :)

So the main issue with Spark is that the streaming story is not that great. In general a Spark cluster per se is not bad, but I see more and more people using Apache Beam and then just picking whichever runner they already have or that fits them best. https://beam.apache.org/

Apache Flink

Depending on cloud vs on-prem.

Not really, just use Apache Beam and set the runner to whatever you have / prefer

No thanks. I do not see any value in an (n+1)th unnecessary abstraction over things that I am already familiar with. An average customer does not want to use these things either.

Is Spark that widely used? Aren't people moving away from it toward deep learning frameworks like TF?

Spark is more than just an ML framework. It's extensively used for ETL, stream and batch processing, etc.

Not at all. There are way better solutions out there.

The image in the 'The road ahead' section seems to be using Azure's icons. For example:

Azure Event Hubs icon - https://images.app.goo.gl/NLu8jSKWYPwMyTtD9

Azure Marketplace icon - https://images.app.goo.gl/X9vZeWWAR2TDsawe8

Azure Service Health icon - https://images.app.goo.gl/fJcUZqwUWbzK7tC99

They probably should change that...

Is anyone non-Uber using Hudi in production? Can someone comment on what developing against it is like compared to other warehousing technologies?

Yes, we are using it in production (somewhat), for a few tables.

Our workflow with it is:

1. Stream new data into a partitioned S3 bucket available as a table in metastore.

2. Use Hudi to create a copy-on-write compacted version of the table (think SELECT * FROM (SELECT *, row_number() OVER (PARTITION BY uuid ORDER BY ts DESC) AS rnum FROM event_log) WHERE rnum = 1).

Now people who are interested on the snapshot of the table rather than the event log can query the Hudi table instead of the raw event log tables.

This has improved perf for the common use-cases while allowing us to use a common pipeline. Earlier we would have to dump such data into an RDS (Aurora Postgres) with UPSERT but that meant table bloat was an issue and cost was high depending on the date-range we wanted to store compacted.
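The compaction in step 2 boils down to "latest record per key". A plain-Python sketch of the same logic the row_number() query expresses, using the uuid/ts field names from the comment above (the status field is made up for illustration):

```python
def compact(event_log):
    """Keep only the most recent version of each record, i.e. the rows where
    row_number() OVER (PARTITION BY uuid ORDER BY ts DESC) = 1."""
    latest = {}
    for rec in event_log:
        cur = latest.get(rec["uuid"])
        if cur is None or rec["ts"] > cur["ts"]:
            latest[rec["uuid"]] = rec
    return list(latest.values())

log = [
    {"uuid": "a", "ts": 1, "status": "created"},
    {"uuid": "a", "ts": 3, "status": "completed"},
    {"uuid": "b", "ts": 2, "status": "created"},
]
snapshot = compact(log)
print(sorted((r["uuid"], r["status"]) for r in snapshot))
# [('a', 'completed'), ('b', 'created')]
```

Hudi's copy-on-write tables maintain this snapshot view incrementally on upsert, which is what lets readers query "current state" instead of replaying the whole event log, and without the table bloat of doing the same upserts in Postgres.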

We're about to evaluate it since it's incubating now in AWS' EMR. I think if you have existing spark workloads that require update or delete of existing data stored in parquet, then it might be a good choice since it fills in a major gap there. The other choice is Delta Lake which provides similar capability.

I believe Netflix's Iceberg (now also Apache) aims to solve the same problems.

Iceberg solves more problems than what Hudi does. The biggest win is computed partitions and not having to define a strict partitioning strategy from the start.

Unlike Hudi and Delta Lake, Iceberg does not support updates/deletes yet.
