
OLTP databases. I've used Oracle, MySQL, and Postgres in anger (in that order, historically), plus DynamoDB and MongoDB if you want to count those. I have to admit Oracle has some technical superiority but the pricing is insane. MySQL is a joke. MongoDB has zero use cases. Just Use Postgres.

OLAP databases/query engines. Redshift, Athena/Presto, Snowflake, BigQuery, ClickHouse. It's complicated, actually. I guess my strong opinion is that Snowflake is massively overrated and it will perform worse and cost you more than you expect.

DAG frameworks. make, Luigi, Drake, Airflow, DBT, Prefect, NextFlow, others I'm forgetting. Airflow is hot garbage. It's frankly an indictment of Data Engineering as a specialty that we've formed a consensus around such a poorly-designed piece of software. In fairness, no one has really gotten this right yet. I like Luigi but it has its own shortcomings and continued development is an open question. Sadface.




I've put a lot of time into Airflow and feel similarly that it's a huge pain and a risk to rely on it. I've replaced it with Temporal (https://temporal.io/) and while I don't have the breadth of experience with the frameworks you listed, I do think Temporal is a great replacement for Airflow.


I worked in Data Engineering for 6 years before recently deciding to at least take a siesta to write some "normal" software as a product engineer.

I really like data engineering in general - writing Spark jobs, manipulating huge amounts of data, orchestrating pipelines, writing SQL... I just like it all a lot. I like SQL - there, I said it!!

But damn do the rough edges just burn you out. Airflow is one of those edges. IMO part of the problem is that everyone wants to hold it in a slightly different way, and so you end up with Airflow being this opinionless monster thing that lets you use it any way you want as long as it "works". And everything built on top of this opinionless mess is of course also kind of a mess. Plus, at a certain scale, Airflow itself needs to be maintained as a distributed system.

Spark is another one of those things with lots of rough edges. Some of it might've been that the places I worked at were holding it all wrong, but Spark and all its adjacent context are so damn complicated that you either have a small group of experts writing all the Spark code and have to heavily prioritize what they work on, or you have a bunch of non-data-engineer folks writing Spark jobs and kind of making it work, sorta, but it's all as inefficient as possible while still working, and it's really brittle because they don't know all the ways they need to think about scaling.

Notably, I got really tired of people coming to our team going "we wrote this big Spark job, it worked fine up until it didn't, can you please help us fix it, it's mission critical". And it's just an OOM because they're pulling data into memory and have been bumping -Xmx every 2 weeks for six months. Or, perhaps even worse, they were doing some wonky thing that wouldn't scale, and now they're running into weird errors because they used the feature so wrong (like having a small dataframe with 1-2 partitions that then gets used in a subsequent stage that's HUGE, and all the executors absolutely swamp the poor node or two hosting those partitions).
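
The fix for that last one is usually boring. Something like this PySpark sketch (table names are made up, and it assumes the small side genuinely fits in executor memory): broadcast the tiny dataframe instead of letting one or two nodes serve its partitions to the whole cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("join-skew-sketch").getOrCreate()

    # Hypothetical inputs: a huge fact table and a tiny dimension table
    # that ended up with only a partition or two.
    events = spark.read.parquet("s3://bucket/events/")        # huge
    countries = spark.read.parquet("s3://bucket/countries/")  # tiny

    # The problem case: a plain join makes every executor fetch from the
    # node(s) hosting the tiny dataframe's couple of partitions.
    # joined = events.join(countries, "country_code")

    # Broadcasting ships a copy of the small side to each executor instead.
    joined = events.join(F.broadcast(countries), "country_code")
    joined.write.parquet("s3://bucket/events_enriched/")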

Anyway, it's easy to write a Spark job that works at the time of writing, and really hard to write one that will still work 6 months from now.

Add to that the fact that Spark is a giant money firehose and that data engineering departments are constantly asked to sacrifice reliability to increase efficiency (just run everything hotter! who needs slack or buffer space?), and it all makes the above issues worse.


> I like SQL - there, I said it!!

SQL is the only, only tech thing I love. Other stuff might be fine tools to accomplish various goals but SQL is Good and Beautiful and True.


> I guess my strong opinion is that Snowflake is massively overrated and it will perform worse and cost you more than you expect.

Are there specific use cases or experiences that prompt you to say this? I've seen a lot of examples (such as web analytics or SIEM) where teams have built very capable stacks on ClickHouse or similar analytic databases. Basically, if you have a focused use case, it's often possible to build a custom stack with open source on Kubernetes that outperforms Snowflake along axes like p95 response, cost-efficiency at scale, and data ownership. It would be interesting to hear more about your experience.


It's trivially easy to beat Snowflake on latency because its latency is truly awful. It often takes 1-4 SECONDS end-to-end to run a query that touches just a few thousand (not million!) rows. In theory this is fine for OLAP but when you have a Looker dashboard with 20+ tiles, it becomes a serious problem. ClickHouse absolutely thrashes Snowflake at this, routinely running millions-of-rows queries in hundreds of milliseconds.

Anyway the specific thing I'm remembering about cost is a case where a data team I joined had built a (dumb) CI process that ran a whole DBT pipeline when a PR was opened. After a month or so we got a bill for something like $50k.

Snowflake's rack-rate pricing is $2/credit and an XS warehouse is 1 credit/hr. That XS warehouse is, allegedly, an 8-core/16GB(?) instance with about a hundred gigs of SSD cache, from a "c" family if you're on AWS. Of course, since your data lives in S3 (cache notwithstanding), you're likely to be network-constrained for many query patterns. BigQuery, which is unquestionably faster than Snowflake, proves that this can be done efficiently. But compared to Redshift (non-RA3) or ClickHouse, where your data sits on locally attached disks, Snowflake just gets smoked. The only lever they give you to get more performance is to spend more money, which is great for their bottom line but bad for you.

The pitch is that because you can turn it all off when you're not using it (which in fairness they make very easy!), the overall costs end up low. Ehhhhhhhhhhh, maybe. It only takes one person leaving a Looker dashboard open with auto-refresh enabled to keep a warehouse constantly online, and that will add up fast. Plus if you are being silly and building DW data hourly, as is popular, it's going to need to be on anyway. (Do daily builds! You don't need more than that!) Point being, the cost model you will get from sales reps makes very optimistic assumptions about utilization, and it is very likely you will be hit with a bill larger than expected. In practice while it is technically easy to control utilization, it is not actually easy because there are humans in the loop.
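
To put numbers on the "someone left a warehouse on" failure mode, here's the back-of-the-envelope math using the rack rates above (warehouse credit rates are from memory of Snowflake's published size table, so double-check them):

    # XS = 1 credit/hr and each size up doubles it; assume $2/credit.
    hours_per_month = 730
    usd_per_credit = 2
    for size, credits_per_hour in [("XS", 1), ("S", 2), ("M", 4), ("L", 8)]:
        monthly = credits_per_hour * hours_per_month * usd_per_credit
        print(f"{size} warehouse on 24/7: ~${monthly:,}/month")
    # XS ~ $1,460/mo, L ~ $11,680/mo -- one auto-refreshing dashboard per
    # warehouse is all it takes to hit these numbers.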


Agree with this. Snowflake has best-in-class dev experience and performance for Spark-like workloads (so ETL or unconstrained analytics queries).

It has close to worst-in-class performance as a serving layer.

If you're creating an environment to serve analysts and cached BI tools, you'll have a great time.

If you're trying to drive anything from Snowflake where you care about operations measured in ms or single digit seconds, you'll have a bad time and probably set a lot of money on fire in the process.


Thank you. Much appreciated! I work on ClickHouse, and it's definitely the cat's meow for user facing analytics.


Can confirm that Airflow is a terribly designed system. So inflexible, with gotchas that only arise after a ton of development. AWS Step Functions are infinitely better and simpler.


Could you link some resources on critiques of Airflow? I have some colleagues using it and knowing of any impending footguns would be helpful.


The Prefect guys wrote this: https://www.prefect.io/guide/blog/why-not-airflow. It's something, but far from comprehensive.

There's not a lot of writing about this. Folks seem content to fight Airflow's deficiencies. Most of them are too young to know any better. The critiques you'll find are generally written by competitors, or folks adopting a competitor.

Here's the big one I see lots of folks get wrong: do NOT run your code in Airflow's address space.

Airflow was copied from Facebook Dataswarm and comes with a certain set of operational assumptions suitable for giant companies. These assumptions are, helpfully, not documented anywhere. In short, it is assumed that Airflow runs all the time and is ~never restarted. It is run by a team that is different from the team that uses it. That ops team should be well-staffed and infinitely-resourced.

Your team is probably not like that.

So instead of deploying a big fleet of machines, you are probably going to do a simple-looking thing: make a docker container, put Airflow in it, then add your code. This gives you a single-repo, single-artifact way of deploying your Airflow stuff. But, since that's not how Airflow was designed to work, you have signed yourself up for a number of varieties of Pain.

First, you are now very tightly coupled to Airflow's versioning choices. Whatever version of Python runs Airflow, runs your code. Whatever versions of libraries Airflow uses, you must use. This is bad. At one point I supported a data science job that used a trained model serialized with joblib. That serialization was coupled to Python 3.6 and some precise version of SciKitLearn. We wanted to upgrade Python! We couldn't! Don't use PythonOperator. You need separation between Airflow itself and Your Code. Use a virtualenv, use another container, use K8s if you must, but please please do not run your own code INSIDE Airflow.
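
To make that concrete, here's a minimal sketch of the "use a virtualenv" option with Airflow's stock PythonVirtualenvOperator (task names, paths, and pins are all made up; DockerOperator or KubernetesPodOperator give you even stronger isolation):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonVirtualenvOperator

    def score_model():
        # Imports live inside the callable because it runs in its own
        # throwaway virtualenv, with its own pins, not Airflow's env.
        import joblib
        model = joblib.load("/data/model.joblib")
        # ...score and write results...

    with DAG(
        dag_id="scoring",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonVirtualenvOperator(
            task_id="score",
            python_callable=score_model,
            requirements=["scikit-learn==0.24.2", "joblib==1.1.0"],
            system_site_packages=False,
        )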

Second, you cannot deploy without killing jobs. Airflow's intended "deployment" mechanism is "you ship DAG code into the DAG folder via mumble mumble figure it out for yourself". The docs are silent. It is NOT intended that you ship by balling up this mega-container, terminating the Airflow that's running, and starting up the new image. You can do this, to be sure. But anything running will be killed. This will be fine right up until it isn't. Or maybe not, maybe for you it'll be fine forever, but just please please realize that as far as Airflow's authors are concerned, even though they didn't say so, you are Doing It Wrong.


The Prefect blog post yields a 404 for me. Here is the Wayback Machine link: https://web.archive.org/web/20220702112732/https://www.prefe...


Thank you for this, this is a helpful guide to looking into some of these things further.


Color me very interested in Airflow footguns as well.

We're currently considering it as a self-hosted process engine to automate technical business processes and to coordinate a few automation systems like Jenkins. Crap like: trigger a database restore via System A, wait until it completes, update tickets, trigger some data migration from that database via System B, update tickets. Maybe bounce things back to human operators on errors and wait for fixes / clarification. Trigger a bunch of deployments in parallel and report on success.

Systems in this space are either (a) huge, hugely expensive enterprise applications designed to consume all business processes, which is a bit overkill for our current needs (Camunda, SAP, Stackstorm, ...), (b) overfitted onto a very specific data analysis setup (aka: if you don't have Hadoop, don't touch it), or (c) overly simplistic, offering no real benefit beyond investing in self-hatred, Guinness, and making Jenkins work.

Airflow seemed like a decent middle ground there for workflows a bit beyond what you can sensibly do via Jenkins jobs.


Please don't do this.

If you want an Airflow-ish approach without punishing your future self, pick Prefect. Otherwise go with Temporal. Above all, do not adopt Airflow in 2023 for the use cases you describe.


Prefect seems like a really good suggestion, thank you.

We're pretty much committing to Python as our language of choice in the infra layer. Most of the team is being sent on courses over the next month, too. So I have a whole lot of Python scripts popping up across the infrastructure.

And this approach of slapping some @task and @flow decorators onto scripts or helper functions seems to work really well with what the team is doing. It took me like 30-40 minutes to convert one of those scripts into what looks like a decently fine workflow. Very intrigued.
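
Roughly what the conversion looked like, for the curious (a sketch with made-up task names, not our real code):

    from prefect import flow, task

    @task(retries=2)
    def restore_database(ticket_id: str) -> str:
        # formerly a plain helper function in the script
        ...
        return "restored"

    @task
    def update_ticket(ticket_id: str, status: str) -> None:
        ...

    @flow
    def restore_and_report(ticket_id: str):
        status = restore_database(ticket_id)
        update_ticket(ticket_id, status)

    if __name__ == "__main__":
        restore_and_report("OPS-1234")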


Jenkins is honestly Really Good as a scheduling system. The best, frankly.

If you can combine it with a library- or tool-style DAG system (vs. server-style like Airflow), like make or Luigi or NextFlow or even Step Functions, that is a great sweet spot.
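
A sketch of what that sweet spot can look like, with Luigi driven by a Jenkins cron trigger (file names and parameters are made up):

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"data/raw-{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("...")  # pull from the source system here

    class Load(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return Extract(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"data/loaded-{self.date}.marker")

        def run(self):
            with self.input().open() as f:
                pass  # load into the warehouse here
            with self.output().open("w") as f:
                f.write("done")

    if __name__ == "__main__":
        # Jenkins runs: python pipeline.py Load --date 2023-01-01 --local-scheduler
        luigi.run()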


In my experience Jenkins is like shell scripting: in the hands of someone who understands its strengths and weaknesses and is very disciplined in how it is used and maintained, it's both performant and flexible. If you follow the path of least resistance, it becomes a mess.


I agree. For some reason, Airflow really sucks at running something on a cron-like timer. I can't remember all the issues I had with it exactly (this was several jobs ago), but getting it to run a job at the same time every day was a nightmare. One issue I do recall was that it treated the first run differently, basically ignoring your defined schedule. And I think "first run" meant every time Airflow was restarted. So if Bad Things happened when that job ran at the wrong time of day, you had to add extra logic into the job itself to abort if it was being called at the wrong time. How they managed to make this so difficult is mind-boggling. By contrast, Jenkins will reliably execute a job on a timer without issues.


Use makefiles for as long as you can.

If your DAGs get too complicated for makefiles, it's time to rethink and restructure your DAGs.


An inherent problem with data engineering dag frameworks is that they are not directly integrated with the nodes, they live above the nodes. The nodes don’t know about the DAG and depending on the software they’re running they’ll have different semantics for checking if something is stuck, cancellation, etc.

I think there’s a lot of room for innovation here. Given 1+ data streams or ingestion locations, a bunch of SQL scripts, a DAG to orchestrate the scripts, and 1+ data destinations there are many different execution models that could be used but aren’t. You’ve only specified the pipeline semantics rather than implementation, so smart tooling should be able to automatically implement patterns like streaming or intermediate queueing without much further input. IMO that’s what DAG frameworks could be: “compilers” for your data pipeline, rather than orchestrators. There’s progress in the area but nothing that quite gets there yet AFAIK


Re OLAP: 100% agreement on Snowflake being a poor performer. BigQuery is Google-scale I guess which might be necessary but feels way over-complicated for most uses. I've only dabbled with ClickHouse but at every turn I've been impressed with their offering.


I’m surprised that naming a poorly-performing database “Snowflake” didn’t seem like an obviously bad marketing decision…


> MongoDB has zero use cases. Just Use Postgres.

I'm kinda on that bandwagon, but at a new job we're using Mongoose and Mongo for our database, because the consultants who implemented the webapp before us decided so.

How do I convince mgmt that we might want to switch that? When? Is it even worth it or possible now that it's already our default?


"Just use Postgres" is advice for choosing a database for a new project, it does not apply when you're already running something else in production successfully. In general I don't think it would be worth switching unless Mongo is causing significant tangible problems in your specific environment that Postgres would solve.


For one, compare https://jepsen.io/analyses/postgresql-12.3 and https://jepsen.io/analyses/mongodb-4.2.6 - similar vintages, entirely different levels of reliability.

From a migration perspective, the presence of JSONB fields in Postgres, and much of the modern tooling around them, means that you can put even a highly nested document structure directly into Postgres and have a clear migration path without rewriting code.

(Do note, however, that Postgres' JSONB does not preserve object key order; it canonicalizes objects so they have a stable identity. That avoids pitfall #1 in https://devblog.me/wtf-mongo, but legacy code may rely on key-order behavior.)
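
To make the migration path concrete, a minimal psycopg2 sketch (table and field names are made up): you can land the existing documents in a JSONB column as-is and still query into them.

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    # Land a Mongo-style document as-is in a JSONB column.
    cur.execute("CREATE TABLE IF NOT EXISTS orders (id serial PRIMARY KEY, doc jsonb)")
    cur.execute(
        "INSERT INTO orders (doc) VALUES (%s::jsonb)",
        [Json({"customer": "acme", "status": "open", "items": [{"sku": "A1", "qty": 3}]})],
    )

    # Query into the document: ->> extracts a field, @> tests containment.
    cur.execute(
        "SELECT doc->>'customer' FROM orders WHERE doc @> %s::jsonb",
        [Json({"status": "open"})],
    )
    print(cur.fetchall())
    conn.commit()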

MongoDB was great when it worked; I've still never experienced as fluid a workflow as Meteor enabled, where either a client update or a batch process could update a document in MongoDB and it would immediately propagate to every other interested client. RIP - while there's an active community, the Meteor team went on to do Apollo GraphQL, whose real-time support is a shadow of that original developer experience.

That said, on that same project, I can say anecdotally that I was bitten hard by some of the disaster recovery and divergent-oplog issues mentioned in the meme-y https://aphyr.com/posts/284-jepsen-mongodb - granted, the software has improved significantly since then. But I'm of the opinion that a company developing database products can only shift its internal culture so much; such a "cowboy" mentality, one that led to releasing a database product with these glossed-over problems and unreliable defaults, never truly goes away.

Rewrites always require a cost-benefit analysis. Modern MongoDB (you are running a modern version, right?) may have addressed enough concerns that it's the right decision to stick with. But even for document database needs, there's no reason not to use Postgres for green-field projects.


I think it deeply depends on how much variation in code there actually is to support the database(s), along with how much data you have, how many collections, and how you would handle the migration and/or scaling with your proposed alternative. Is it worth stopping all feature development for weeks/months to migrate? How profit-driven or constrained are you?

There's a lot that I like and dislike with MongoDB. As a developer, it can be a relatively nice experience. As an administrator, I hate it beyond words. It really depends on your needs, entrenchment and the skills of your staff.


Depends on your data and how it needs to be modeled and used, imo. If it’s sparse and interconnected then obviously SQL makes a lot of sense. But for data that is very document-like e.g. 10Ks you’d get from SEC EDGAR, then document based is suitable. I’ve found even with relational databases, sometimes you need to denormalize the data a bit to achieve performance benchmarks for complex queries, and there’s a tipping point where you may want to sync your data into something document-based like Elasticsearch to get improved search capabilities, too.


Doing the work to switch off of it may well be a worse idea than starting on it was in the first place.

There are many, many companies for which this decision won't matter too much.


I love Postgres, but 100% of my humble use cases are better served by SQLite and I wish I realised that sooner.


I have a coworker who wanted to use RDS to store a single configuration table with a dozen or so rows.

I told them to please investigate using SQLite with an S3 bucket for storage. Their tools run about once a month on automation; they don't need a full DB.

Due to some unfortunately placed warnings about SQLite in production (context matters too), they were discouraged from this path.

I didn't argue, I just said "fine, use Aurora", and I'm sure this will bite us in some unforeseen way down the road, but that will be an Infra issue and then it's my team's job to handle it.
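
For reference, the whole SQLite-on-S3 approach would have been something like this (bucket and table names made up), which is why the Aurora thing stings a bit:

    import sqlite3
    import boto3

    s3 = boto3.client("s3")

    # Pull the database file down, read the handful of config rows,
    # and push it back only if the monthly job changed anything.
    s3.download_file("my-config-bucket", "config.db", "/tmp/config.db")

    conn = sqlite3.connect("/tmp/config.db")
    rows = conn.execute("SELECT key, value FROM config").fetchall()
    conn.close()

    # ...do the monthly work with `rows`...

    s3.upload_file("/tmp/config.db", "my-config-bucket", "config.db")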


My first database was Informix, and that was a wonderful product. Better than Oracle in many ways. They had a great CLI interface and API. Unfortunately, Oracle had the mindshare and IBM acquired it.


Have you tried Mage AI for DAGs/ELT?

I haven't used it yet; we're evaluating it, though, and I'd love to hear from HN'ers who have tried it.


Can you share what kinds of tasks you use DAG frameworks such as Temporal for?


How much did you pay for Oracle? Is it a one-time payment or annual?


Some older context: they price per CPU core, and I remember something like $10k/CPU core per year. This is from years ago, and there were "enterprise" agreements in place and many millions of dollars being spent. I can't imagine they've lowered prices, other than maybe for cloud offerings.

The support contract costs more than the "license" if my memory serves me, but it's very expensive. If you're doing anything remotely complicated, using their advanced tools beyond a basic RDBMS (replication, clustering, etc.), then you need that support contract.


We were paying over $1M/yr for it inclusive of support. One real production instance (with Veritas Cluster), one reporting instance, 3 dev/QA instances. This was circa 2012, when the company had roughly $300M/yr in revenue.

I heard that a year or two after I left the company got an Oracle Audit and ended up paying a huge penalty.


Not one-time, and it's based on a number of different factors. Don't look that way.


Here is another workflow orchestrator for you to try: Flyte!


It's on my list for sure. The hard requirement of running on K8s makes me sad in advance, though.



