At the end of the day, most places just want something to point Power BI (or similar) at.
If you start from the assumption that you've been ingesting data and storing it in a compressed columnar format like Parquet or ORC, then you're already locked into a solution that lives in the Hadoop ecosystem. This turns out to be an effective way to deal with terabytes of data, because depending on the query you want to execute, the file format (and a smart partitioning structure) helps turn your problem of reading terabytes into one of reading gigabytes. And everything in Hadoop land is generally a query, so you're going to need a query engine like Hive, Impala, Spark, etc. AND you probably need something like a Hive metastore so that you're not crawling some directory structure for the schema and input files every time you start up a new process to run a query.
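To make the terabytes-into-gigabytes point concrete, here's a minimal PySpark sketch, assuming a date-partitioned Parquet layout; the bucket, the dt partition column, and the user_id field are all hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

    # dt is a partition column, so Spark only lists and reads the matching
    # dt=2023-06-01 directory -- the rest of the table is never touched.
    df = spark.read.parquet("s3://my-bucket/events/")
    df.where("dt = '2023-06-01'").groupBy("user_id").count().show()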
You _could_ write something on your own that just forks out a bunch of threads in a single process to rip through the data, but why? Think about what you're effectively implementing: a bespoke query engine that runs one query plan. Spark has already written an API, a query engine (multi-process and distributed, unlike whatever you're likely to hand-roll), and lots of input/output code. You can spark-submit --master local[*] and use up all of the cores and ram everything into one monster JVM if you really wanted to.
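For what it's worth, the local-mode version looks roughly like this; the memory figure and paths are illustrative:

    # Run with: spark-submit --master local[*] --driver-memory 200g job.py
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")   # all cores, one monster JVM
             .appName("local-scan")
             .getOrCreate())

    # Same planner, Parquet reader, and I/O code you'd get on a cluster.
    spark.read.parquet("/data/events").createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events WHERE dt = '2023-06-01'").show()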
Finally, you could stuff all of this in memory, but at what cost, and what are you going to use to do it? (Worse, what happens if the VM goes down - those terabytes are going to take their sweet time reloading over a network link.)
You can get a single chip with dozens of cores for not too much.
I guess that's how I'd do it.
> This turns out to be an effective way to deal with terabytes of data
A few terabytes typically don't need a cluster.
> You _could_ write something on your own that just forks out a bunch of threads in a single process to rip through the data, but why?
Because it's simple and easy. I wrote one in a few days. Not much code.
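Something in the spirit of what I mean, as a sketch - assuming a directory of Parquet files, and with made-up paths and column names:

    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path
    import pyarrow.parquet as pq

    def scan(path):
        # Read just the columns we need; pyarrow handles decompression.
        table = pq.read_table(path, columns=["user_id", "amount"])
        return table.column("amount").to_pandas().sum()

    files = list(Path("/data/events").glob("*.parquet"))
    with ThreadPoolExecutor(max_workers=32) as pool:
        print(sum(pool.map(scan, files)))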
> You can spark-submit --master local[*] and use up all of the cores and ram everything into one monster JVM if you really wanted to.
Point is, do you need to?
> those terabytes are going to take their sweet time reloading over a network link
You do not run big I/O over a network like that if you can avoid it. With a single server with plenty of SSDs, you trivially avoid it.
My take anyway.
You pay a premium to have this stuff loaded onto faster hardware. Look at how Redshift works on its compute-optimized nodes: everything sits on local NVMe SSDs (RA3 nodes use a slightly different strategy). They handle the fact that this storage is ephemeral with automatic backups, and cluster restores aren't exactly fast. I actually agree with you: if you've got some monster SSDs attached, which are comparatively cheap, why focus on the RAM? I believe there are reasons to do that sometimes, but not everything demands quite that level of performance.
For this point:
>> You _could_ write something on your own that just forks out a bunch of threads in a single process to rip through the data, but why?
> Because it's simple and easy. I wrote one in a few days. Not much code.
I think it depends on what formats you're using and how the data is laid out on disk. A lot of people reshape their data into a table-like structured or semi-structured format; sometimes that makes it a candidate for putting into a database like Postgres or Redshift, and other times it makes more sense to bring the database to the data (the Hadoop ecosystem of stuff).
For example, my company still has stuff that writes out row-like objects into S3, and it's not too hard to write a single-process job that spawns threads, reads the input files, and does some computation on them. They're on S3, so throughput kinda sucks, and copying them to local SSD only makes sense if you want to make multiple passes. But the Parquet alternative of this data is tremendously faster to work with and genuinely easier to use, and the only barrier to entry is that you run Spark on your local machine and commit to using Spark with Elastic MapReduce to process this data. Sure, you might only end up filtering through tens or hundreds of gigabytes; terabyte-scale jobs are rare. But part of that is because you only read 10-20% of every input file to do the work. It also integrates extremely well with other stuff we're using - it's even easier to point Snowflake at the same data set, for example.
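To illustrate the 10-20% point: Parquet is columnar, so selecting two columns out of twenty means the other column chunks never leave S3. A hedged sketch, with hypothetical paths and field names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    events = spark.read.parquet("s3://my-bucket/events-parquet/")
    (events
     .select("user_id", "amount")      # column pruning: only these chunks get read
     .where(F.col("amount") > 100)     # predicate pushdown against row-group stats
     .groupBy("user_id")
     .agg(F.sum("amount").alias("total"))
     .show())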
Sorry, I think I'm rambling at this point and not presenting a really coherent argument.
You have immediately conflated 'big data' with a ton of cloud tech. My point is that you arguably don't need big data frameworks, and therefore sticking it all in the cloud is arguably pointless too. If you can buy a reasonable server and stick it in the corner of the office, do so (noise and security caveats, yeah).
> Sure, you might only end up filtering through tens or hundreds of gigabytes ...S3 ...Parquet ...Spark ...Elastic MapReduce ...Snowflake
Dude, do you need all this stuff!? Really? For a poxy few hundred GB?
Azure Event Hubs icon - https://images.app.goo.gl/NLu8jSKWYPwMyTtD9
Azure Marketplace icon - https://images.app.goo.gl/X9vZeWWAR2TDsawe8
Azure Service Health icon - https://images.app.goo.gl/fJcUZqwUWbzK7tC99
They probably should change that...
Our workflow with it is:
1. Stream new data into a partitioned S3 bucket, available as a table in the metastore.
2. Use Hudi to create a copy-on-write compacted version of the table (think SELECT * FROM (SELECT *, row_number() OVER (PARTITION BY uuid ORDER BY ts DESC) AS rnum FROM event_log) WHERE rnum = 1; sketch below).
Now people who are interested in a snapshot of the table, rather than the event log, can query the Hudi table instead of the raw event-log tables.
This has improved performance for the common use-cases while letting us keep a common pipeline. Earlier we had to dump such data into RDS (Aurora Postgres) with UPSERTs, but that meant table bloat was an issue, and cost was high depending on the date range we wanted to store compacted.
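For the curious, step 2 with the Hudi Spark datasource looks roughly like this - a sketch, assuming the hudi-spark bundle is on the classpath, and with illustrative table, field, and path names. COPY_ON_WRITE with uuid as the record key and ts as the precombine field gives the same latest-row-per-uuid semantics as the window query above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    hudi_options = {
        "hoodie.table.name": "events_snapshot",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.precombine.field": "ts",  # highest ts wins on conflict
    }

    raw = spark.read.parquet("s3://my-bucket/event-log/")  # hypothetical raw event log
    (raw.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/events_snapshot/"))

Snapshot queries against events_snapshot then return one row per uuid, without anyone having to write the window function themselves.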