I am just an amateur, but gosh I love doing ETL. There's something about the specification of the data you're supposed to be getting, and then the harsh reality of the filthy trash you will actually be fed; about making something that can carefully examine each of the assumptions in the spec and test for all the ways it won't go right; writing tests for things that will, you are told, "never happen" (only to get an email from your program three years later that this has, in fact, happened); interpolating data where you can; and in general turning a "simple load process" into one of those grinders people can feed cows and tungsten carbide rods into.
I feel like Data mentioning that he just ... loves scanning for life forms.
I've always found ETL frameworks to have their own problems. They seem great on paper, but usually they don't account for your specific source systems, APIs, applications, data size, data distribution or scheduling situation. If your project adopts one, developers end up hacking the framework instead of writing simple code that does the specific thing they need to do.
Before you know it you have super long and super inefficient code just to fit the framework. It takes about the same time to read and understand an ETL framework as it does to write your own Python/Bash script, and at least with your own code it's easier to see bottlenecks.
I started writing (and never completed) a dead-simple ETL framework that left most of the work up to the programmer. It was basically just an HTTP server that you could hit with cron jobs, a DAG data structure for mapping the process, and some annotations that you could use to specify that a node/program step was concurrency safe, blocking, etc. You're entirely right that there's way more to ETL than meets the eye, but this still makes me want to dig it back up.
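Roughly the shape I had in mind, as a hypothetical sketch (the `Pipeline` class, the `@step` decorator and the step names are all made up for illustration, not the original code):

```python
# Hypothetical sketch of the "annotations + DAG" idea described above.
from dataclasses import dataclass, field

@dataclass
class Step:
    fn: callable
    deps: list = field(default_factory=list)
    concurrency_safe: bool = False   # safe to run alongside other steps
    blocking: bool = True            # downstream steps wait on this one

class Pipeline:
    def __init__(self):
        self.steps = {}

    def step(self, deps=(), concurrency_safe=False, blocking=True):
        def register(fn):
            self.steps[fn.__name__] = Step(fn, list(deps), concurrency_safe, blocking)
            return fn
        return register

    def run(self):
        # Naive topological execution; a real version would parallelise the
        # concurrency-safe steps and detect cycles.
        done = set()
        while len(done) < len(self.steps):
            for name, s in self.steps.items():
                if name not in done and all(d in done for d in s.deps):
                    s.fn()
                    done.add(name)

pipeline = Pipeline()

@pipeline.step(concurrency_safe=True)
def extract():
    print("pull raw rows from the source")

@pipeline.step(deps=["extract"])
def load():
    print("write rows to the warehouse")

# A tiny HTTP endpoint (hit by cron) would just call pipeline.run().
pipeline.run()
```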
Prophecy.io lets you create visual components from any Spark function. Same with Airflow. So you can use standard components (built-in or your new ones) without being restricted.
Founder here - we’re working to solve this exact problem.
We're very different from the ETL tools in that we're bringing software development best practices to data.
When you do visual drag and drop, Prophecy generates high-quality code on git that is 100% open source (Spark, Airflow), and you have tests and CI/CD - so you're visually doing solid data engineering.
You can toggle between code and visual - so if you change the code (some), the visual graph updates - so small edits directly to git don’t break the visual layer.
All visual components are generated from a spec - think a Spark function with some more info. So the data platform teams will create their own library/framework and roll it out to the wider teams. How it works is that in the visual editor, you start with the standard Spark library, but can load visual components for Delta, or encryption, or data quality.
Our customers are typically fed up with these ETL tools and moving to us. We can also import the ETL formats (AbInitio, Informatica, …) in an automated way (we reverse engineered their formats and created source-to-source compilers).
Couple that with the way ETL frameworks quickly become undocumented, featuriferous messes that are opaque to anyone who isn't deeply embedded in the framework, yeah.
First example was connecting Iterable - which it looks like Airbyte supports - to BigQuery.
In the past I had someone help me set up Snowflake, which was too complicated for me to maintain / understand myself; AWS in general is too complicated for me compared to the simpler Google Cloud.
I have also tried Stitch and Fivetran at different times, mostly to try to save time setting up non-webhook syncs from FB Marketing and Front. The volume of Iterable data would be prohibitively expensive for us on those paid platforms.
In the end I was able to do FB Marketing myself in less than 1k lines of Python, modified from a script I found on GitHub, using Cloud Scheduler and Cloud Functions. I don't know Python, so that felt somewhat satisfying.
Another nuance in favor of a hosted/paid platform is that it looks like Airbyte uses an API lookup sync instead of webhooks. That lets Airbyte get some extra metadata to join on that I don't collect. That's valuable!
For Iterable I ended up making a GAE app to accept incoming webhook data -> push to Pub/Sub -> which pushes to a Cloud Function -> which writes to BigQuery.
The latency for BigQuery writes was too much to try to do it all at once, and I don't think Iterable does webhook retries. Iterable is also MEGA bursty: I've seen our GAE scale up to sometimes 40+ instances within minutes after we hit send on a campaign. Getting the latency down for cold starts and scaling was the hardest problem to figure out; Cloud Functions didn't work there. It's not perfect but it's good enough for our needs. The simpler FB function grabs data 100% correctly each day, which feels good. Last I talked to some of the paid ETL vendors it was a flat $500 minimum a month - not worth it.
From learning all this I've been able to reuse this GAE -> Pub/Sub -> Cloud Function -> BigQuery/Spanner pattern for other stuff I build, and it has saved a lot of time and headache.
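A minimal sketch of the webhook -> Pub/Sub hop in that pattern (project, topic, and route names are placeholders, and error handling is omitted):

```python
# Rough sketch: ack the webhook immediately, push the payload to Pub/Sub, and
# let a Cloud Function subscribed to the topic do the slow BigQuery write.
import json
from flask import Flask, request
from google.cloud import pubsub_v1

app = Flask(__name__)
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iterable-events")  # hypothetical names

@app.route("/webhooks/iterable", methods=["POST"])
def receive_event():
    payload = request.get_json(force=True)
    future = publisher.publish(topic_path, json.dumps(payload).encode("utf-8"))
    future.result(timeout=10)  # make sure the message is actually published
    return ("", 204)
```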
I think this is saying that particular class expects to receive a refresh token as input. The "full oauth loop" means the UI needs to produce a refresh token via user consent in the browser.
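For illustration, the "full oauth loop" in miniature: after the user consents in the browser, the returned authorization code is exchanged for access and refresh tokens. The endpoint URL and credentials below are placeholders, not any real provider's API.

```python
import requests

def exchange_code_for_tokens(auth_code: str) -> dict:
    resp = requests.post(
        "https://auth.example.com/oauth/token",   # provider's token endpoint (placeholder)
        data={
            "grant_type": "authorization_code",
            "code": auth_code,
            "client_id": "YOUR_CLIENT_ID",
            "client_secret": "YOUR_CLIENT_SECRET",
            "redirect_uri": "https://yourapp.example.com/callback",
        },
        timeout=30,
    )
    resp.raise_for_status()
    tokens = resp.json()
    # tokens["refresh_token"] is what the class in question expects as input.
    return tokens
```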
I agree. I've used AWS Data Pipeline for some jobs but there is a steep learning curve. It is good for launching servers on demand to run your ETL jobs if you need that.
The best solution I have found is writing ETL scripts in Laravel which I use for most projects anyway. The framework has built-in scheduling and error reporting.
I think some of the points made here about ETL scripts being just 'ETL scripts' are very relevant. Definitely been on the other side of the table arguing for a quick 3-hour script.
Having written plenty of ETL scripts - in Java with Hadoop/Spark, Python with Airflow, and pure Bash - that later morphed into tech debt monsters, I think many people underestimate how quickly these can snowball into proper products with actual requirements.
Unless one is extremely confident an ETL script will remain a non-critical good-to-have part of the stack, I believe evaluating and adopting a good ETL framework, especially one with pre-built integrations, is a good case of 'sharpening the axe before cutting the tree' and well worth the time.
We've been very careful to minimise Airbyte's learning curve. Starting up Airbyte is as easy as checking out the git repo and running 'docker compose up'. A UI allows users to select, configure and schedule jobs from a list of 120+ supported connectors. It's not uncommon to see users successfully using Airbyte within tens of minutes.
If a connector is not supported, we offer a Python CDK that lets anyone develop their own connectors in a matter of hours. We have a commitment to supporting community contributed connectors so there is no worry about contributions going to waste.
Everything is open source, so anyone is free to dive as deep as they need or want to.
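For a sense of scale, here is roughly what a CDK connector looks like. This is from memory, so exact class and method names may differ between CDK versions, and the API and stream below are hypothetical.

```python
# Rough shape of a custom source built on the Python CDK (names may vary by version).
from airbyte_cdk.sources import AbstractSource
from airbyte_cdk.sources.streams.http import HttpStream

class Customers(HttpStream):
    url_base = "https://api.example.com/v1/"   # hypothetical API
    primary_key = "id"

    def path(self, **kwargs):
        return "customers"

    def parse_response(self, response, **kwargs):
        # Each yielded dict becomes a record in the destination.
        yield from response.json()["data"]

    def next_page_token(self, response):
        return None  # no pagination in this sketch

class SourceExample(AbstractSource):
    def check_connection(self, logger, config):
        return True, None

    def streams(self, config):
        return [Customers()]
```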
"Definitely been on the other side of the table arguing for a quick 3-hour script."
But that's probably the time it took to write it, right? 80% of the cost of software is in maintenance, so there's another 12 hours' worth of maintenance left to account for. If you know you're going to spend 15 hours on it, then you might as well use a system you know will cost less to extend or scale over time.
"We've been very careful to minimise Airbyte's learning curve."
That's good for quickly onboarding new customers, but not necessarily for the system to be scalable or extensible.
"Starting up Airbyte is as easy as checking out the git repo and running 'docker compose up'."
I'm always curious about this. docker-compose doesn't run on more than a single host, unless you're using it with the AWS ECS integration (and maybe Swarm?). So sure, the developer can "get it running" quickly to look at it, but to actually deploy something to production they'll have to rewrite the docker-compose setup in something else. If you provide them with Terraform modules or a Helm chart, that would get them into production faster. And maybe even a canned CI/CD pipeline in a container so they can start iterating on it immediately. It's more work for your company, but it reduces the friction for developers getting to production, and enables businesses to start using your product in production immediately, which I think is a pretty big differentiator of business value.
"If you know you're going to spend 15 hours on it, then you might as well use a system you know will cost less to extend or scale over time."
I wish younger me realised that earlier :)
"And maybe even a canned CI/CD pipeline in a container so they can start iterating on it immediately."
Definitely! Although a good number of users are surprisingly happy with their Airbyte instances on a single node.
We do have a Kubernetes offering for those looking to scale Airbyte beyond a single node. We also have Kustomize/Helm deploys for this, though I'll be the first to admit that the Helm charts are mostly community-maintained and can be improved. This is one of our (my) top priorities going into the next quarter.
I think it is great stuff you are building, especially for data integration of external sources that don't offer webhooks or other pub/sub integration options. But is Airbyte really open source? I'm not questioning the reasons for moving to a source-available license given how AWS has treated open source, but I think it is a bit of false marketing for Airbyte to refer to itself as open source. Sure, some pieces are MIT, but the core is Elastic License 2.0 from what I can see. https://airbyte.io/blog/a-new-license-to-future-proof-the-co...
Airflow is a general orchestration tool that fits into the Python stack extremely well. It wasn't built to scale though, so once you want to run something more than once a second, you are going to be jumping through hoops.
My experience with Dataflow is 1.5 years old, so things might have changed, but I found it to be more of a unified, simplified Hadoop/Spark framework. It unifies the batch/streaming concepts but is still pretty low-level.
Within ELT, or ETL, Airflow/Dataflow can fulfill all 3 components.
Airbyte focuses just on EL (though we have basic T functionality around normalisation). Our intention is to leave T to the warehouse, since warehouses like Redshift/Snowflake/BigQuery are extremely powerful these days, and tools like dbt give users more flexibility to recombine and consume the raw data than a specialised ELT pipeline.
In summary, I would say Airbyte is a specialised subset of Airflow/Dataflow, and it's possible to use Airbyte with either tool, though I'd guide someone towards dbt.
Please be aware there is also a newer concept, which is ELT.
Extract Load Transform.
This is the data lake principle: you first dump all data verbatim into one coherent data store (the data ingestion phase).
The transform will eventually be done by the data scientists (data curation phase).
The data curation is really difficult, as it requires in-depth knowledge of the data, removing slack, creating and looking up reference data, normalizing data for the data warehouse schema, etc.
The data ingestion is a relatively easy task, because the dirty data is just collected from upstream databases without any processing/conversion.
I believe data lakes are kind of "cheating" by IT departments.
1. Great explanation of ELT; as a longtime ETL monkey, this broke through for me.
2. GDPR/COPPA means you really have to think about your data lake before dumping verbatim into long term storage.
Ultimately regulations will change over time. I think the most important thing to keep in mind is storing data in a form that is amenable to large-scale batch transformation. That way you can respond by kicking off a task to "rewrite the world" to scrub fields or drop records. Big reason why I encourage stuff like Avro, Parquet, or even just loading into BigQuery.
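For example, a hedged sketch of such a scrub pass over Parquet files with pyarrow (paths and the column name are placeholders; a real job would fan this out over every partition in the lake):

```python
import pyarrow.parquet as pq

def scrub_file(in_path: str, out_path: str, drop_cols=("email",)):
    # Read the file, drop the sensitive columns, and rewrite it in place of the old one.
    table = pq.read_table(in_path)
    cleaned = table.drop(list(drop_cols))
    pq.write_table(cleaned, out_path)

scrub_file("raw/events/2021-06-01.parquet", "clean/events/2021-06-01.parquet")
```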
Their standard scenario to avoid is actually a perfectly acceptable process to grow through before wedging another bit of infrastructure into your org that needs its own provisioning, maintenance and support, redundancy planning, etc.
In that scenario the situation is so nebulous that the original implementers had no way to know where the business/use case would end up, so why would they immediately jump to a belt-and-braces solution?
It's the infrastructure version of my go-to phrase: don't code for every future.
We currently are using Airflow for ELTs/ETLs to ingest from different Postgres databases to BigQuery/Google Cloud Storage. Airbyte looks sweet for the same task and would free us from a big effort burden, but its Postgres source only supports SELECT * statements (i.e. you can't deselect columns).
That's kind of a dealbreaker for us, because for security reasons our Postgres users' permissions are granularly configured with column-based security. I hope the Airbyte team solves this eventually, because the software is looking great.
That's crazy, why would it have the approach of "select all columns" when you might never need all of them? Could you create a view for it to select from instead?
Either way, if someone told me an ETL tool supports only literally all the columns or nothing, I'd assume it was an MVP product to demonstrate the idea... not a production system.
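For what it's worth, a sketch of the view workaround: expose a view containing only the permitted columns and point the connector's user at that. Schema, table, column and role names here are all made up.

```python
import psycopg2

DDL = """
CREATE VIEW reporting.users_safe AS
    SELECT id, created_at, plan        -- only the columns the connector may read
    FROM public.users;

GRANT USAGE ON SCHEMA reporting TO airbyte_reader;
GRANT SELECT ON reporting.users_safe TO airbyte_reader;
"""

# Run once as an admin; the connector then connects as airbyte_reader and
# its SELECT * only ever sees the view's columns.
with psycopg2.connect("dbname=app user=admin") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```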
It’s strange to advertise a framework that doesn’t support more than “select *”. A framework should make easy things easy and complicated things possible. Ideally there is a natural way to do stuff, so different people understand each other’s approach.
The whole article sounds like you target total amateurs in the ETL domain.
Every decent ETL engineer already knows everything that is revealed after "a few months".
That said, I always welcome new competition in the field. Tools still suck.
I had this same kind of impression writing product/business reports as an engineering manager at Oracle and elsewhere. But my way of solving it was to build a smarter IDE that helps you do this stuff without ETL.
You shouldn't need to waste time every time you build a report figuring out again how to query Postgres/MySQL/Elastic in Python and how to generate graphs and where to host it or how to set up recurring alerts on changes. The only part you should care about is the actual queries to write, the joins to do between datasets, and the fields to graph. The actual integration code (connecting to the database, copying a file from a remote server, etc.) should be handled for you.
The tool I'm building to solve this is open source if you're curious!
Very, very flawed reasoning here. It's basically arguing that YAGNI as a principle is exactly backwards: you're definitely going to need all this stuff eventually so embrace maximum complexity from day 1. Terrible, terrible strategy.
The only ETL framework you need is a general-purpose programming language. Cf. that article "You can't buy integration" from the other day.
I just got out of a job where we were working on a legacy ETL "script" in Elixir, and terrible code architecture decisions aside, I think the pattern where you have to launch and monitor long-lasting jobs is a breeze on the BEAM.
You just spawn one process for each job, report back via the mailbox and monitor via a Supervisor. Unfortunately, making changes to that system, where all sources were hardcoded, was to say the least abysmal, but the job-running core was quite elegant.
Hopefully Elixir and other BEAM-compiled languages will gain enough traction; I can't imagine rewriting something that Erlang gives you out of the box in OOP languages with mutable objects.
I think Elixir with Broadway is just one step shy of a full-fledged ETL framework. Thinking seriously about introducing it to a big financial org, but reluctant because of the existing investment in Spark and Airflow.
Do you use some kind of library to help with transformations? If you have some volume, can't plain Python become a bottleneck? Any tips in this area would be appreciated.
I'm not the person you were asking, but I, too, use mostly Python for ETL tasks as a good chunk of my job.
Python's speed, or lack thereof, is rarely an issue. Most tasks that I write/run finish in less than a minute, including the time for transformations. Some of those can get fairly hairy, as well. Most of the time is spent querying the DB and uploading to the remote.
For the tasks where the transformations are too complex, you can greatly decrease the time necessary by tweaking the queries (i.e., limiting rows to just the changed rows, etc.). And frankly, once you realize most data doesn't need to update in real time, it doesn't matter how long the transformation step takes (as long as it's less than about 23 1/2 hours).
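As an illustration of the "limit rows to just changed rows" trick: keep a watermark of the last successful run and only pull newer rows. Table, columns and connection details below are placeholders.

```python
import datetime
import psycopg2

def extract_changed_rows(conn, last_run: datetime.datetime):
    # Only pull rows modified since the previous successful run.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, updated_at FROM events WHERE updated_at > %s",
            (last_run,),
        )
        return cur.fetchall()

conn = psycopg2.connect("dbname=app user=etl")
rows = extract_changed_rows(conn, datetime.datetime(2021, 6, 1))
# ... transform and load only these rows, then persist max(updated_at)
# as the watermark for the next run.
```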
A despair overcame me when the data we were processing changed date/time format in the middle of a file. It was a file for one day's data! Data on 5-minute intervals for 3 thousand data sources. Received in an email, no less. (Long story.) But the date/time format changed. The initial load was garbage. As many know, some countries use mm/dd/yy and others dd/mm/yy. I just wanted to cry when I dug in and saw the change. Instead, I hacked a little more on our lowly ETL script to check for this circumstance. This was just one example of an ongoing stream of data weirdness, as ETLers everywhere know.
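A hedged sketch of the kind of guard I mean: detect when an mm/dd/yy vs dd/mm/yy value can only be one or the other, and fail loudly instead of loading garbage when it is genuinely ambiguous. Heuristic only; real data needs a source-specific rule.

```python
from datetime import datetime

def parse_ambiguous(date_str: str):
    first, second, _ = date_str.split("/")
    day_first_only = int(first) > 12      # e.g. 25/06/21 cannot be month-first
    month_first_only = int(second) > 12   # e.g. 06/25/21 cannot be day-first
    if day_first_only and not month_first_only:
        return datetime.strptime(date_str, "%d/%m/%y")
    if month_first_only and not day_first_only:
        return datetime.strptime(date_str, "%m/%d/%y")
    # Both fields <= 12 (or both invalid): refuse to guess so the load
    # fails loudly instead of silently swapping day and month.
    raise ValueError(f"ambiguous date: {date_str!r}")
```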
In an early startup (< 4 engineers in my case) I can't imagine using anything but a skanky script at the outset. We went broke before we needed a framework. Needed more engineering time on other things.
I was once tasked with replacing a dying set of ETLs composed in "ScribeSoft"; apparently the built-in scheduling and speed left too much to be desired, and calling other jobs from inside a job would halt the current job. Ended up replacing everything with a C# console application that ran every minute unless it was currently running. There were a lot of bugs on both ends, but they were tired of paying $5k/yr for the ETL to run.
After I wrote the initial application, they handed it off to their South African dev team to maintain it.
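For illustration, the "run every minute unless it's already running" guard looks roughly like this (sketched in Python rather than the original C#; the lock-file path is a placeholder and fcntl is Unix-only):

```python
import fcntl
import sys

def main():
    print("run the ETL pass here")

if __name__ == "__main__":
    lock = open("/tmp/etl.lock", "w")
    try:
        # Non-blocking exclusive lock: fails immediately if another run holds it.
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)   # previous run still in progress; skip this minute
    main()
```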
Of course, the flip side is that sometimes that initial step of:
> "We are doing a prototype on our new mapping product, we need to write a one-off script to pull in some map data."
... is just that - a one off script. And it can prove to be a lot quicker to write a one-off script than getting involved with an ETL framework. I am not arguing against ETL Frameworks (Airbyte etc). Just that over-engineering carries its own costs, just like under-engineering does.
Write your own. You can either learn a general purpose programming language that will do exactly what you want, and your skills will be portable, or you'll learn some ETL-vendor-lockin-domain-specific-language. Sure, they may do some things for you and make some things easier, but you have to learn their system, and in that time you could be writing your own that does exactly what you want.
The work companies put into ETL is absolutely bizarre from a business standpoint.
Say you want to build a new kind of hammer. Normally what you do is you pay a company for access to some facility that has forging process, and you put your effort into designing the hammer, working with the facility to get it forged the way you want, and selling it.
Building ETL pipelines from scratch is like building an iron ore mining facility, building diggers, trucks, shipping containers, trains, boats, iron smelting and forging equipment, and warehouses, on top of designing and selling the hammer. People today are so brainwashed by the idea that they need to be writing software that they go to these extreme lengths to build everything from scratch, when they should almost always be paying somebody else to do most of it and only spend time working on their business's actual problem. Building factories and logistics chains should not be a problem your business needs to solve if all you need to do is build and sell a hammer.
I have been working a lot in this space for the last two years, but especially in the last 6 months. I believe we're about to enter a phase where much more elegant and less restrictive ETL platforms or frameworks are as commonplace as modern software CI/CD offerings. Prefect and Dagster both stand out to me as viable replacements for Airflow.
I am tasked with standardizing ETL in a small org, and Airbyte is on my list to evaluate. As I write this there are 64 comments in the thread and only a single comment mentions actual experience using Airbyte. Any other actual insights using the tool? What about the Airflow downsides mentioned in the article? Thx!
At our company, we actually built ETL-framework-agnostic wrappers, plus monitoring, logging and scheduling tooling, around the four different ETL products we used: Micro Focus COBOL, Torrent Orchestrate, DataStage (which incorporated Torrent) and Ab Initio.
The wrappers invoked the ETL command, reformatted and consolidated logs.
For scheduling, we relied mostly on CA Autosys, instead of whatever scheduling mechanisms came with the ETL Product.
We found this approach made it easier to transition from one product to another, as it was consistently faster to plug the new ETL product into the supporting framework than to implement everything the new ETL product offered.
As we move from our on-prem environment to the cloud, we hope we can implement a similar strategy even if we have to switch the support frameworks.
Lots of dislikes on this, but I find Airbyte to be a great idea - and really exciting for a lot of people. Just because you -can- write a SQL extract in cron doesn't mean you should - you're missing a lot of the problems that you don't have to face until you hit scale. At that point you're in trouble.
I'm interested in the concept, but I couldn't find a demo or 'hello world' example to look at. This is a very important aspect of promoting a new product.
Look at how Prefect does that (I know they are well along the path), but something is missing.
Hiya, I'm the original author. tl;dr for those deciding whether or not to read it:
If you are thinking about build versus buy for your ETL solution, in this day and age there are enough great tools out there that buy is almost always the right option. You may think you can write a "simple" little ETL script to solve your problem, but invariably it grows into a monster that becomes a reliability liability and an engineering time suck. The post goes into more depth on why that is. Enjoy!
Argo Workflows looks like the best of both worlds to me. You can easily build up complex ETL/data processing DAGs where each step is a Docker container, so you can choose the best tool for the job. Argo has all the retry/backoff logic built in, and you can plug the workflows into Argo Events. It runs on Kubernetes, so you can scale your compute according to your workflow demands.
Scripts collect data from external sources.
Scripts insert data into the DB.
Other scripts read from the DB and add data to other tables.
Sure, it was mentioned that data was missing and so on. But in a mature project you basically already have some monitoring, so you could just use the existing solution instead of a new framework.
Can any of these ETL frameworks kick off an ETL script without a rewrite? Something that would handle scheduling, retries, emit metrics around those actions, but let me use my own tools for the actual data manipulation.
One of the standard patterns is Airflow with the Kubernetes executor, and then the pod can do whatever. There's the occasional side effect and some setup, but more or less it works as advertised.
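A sketch of that shape using the KubernetesPodOperator, a close cousin of the executor approach: the DAG only does scheduling and retries, and the container image runs your existing script with whatever tools it likes. The import path, image name and DAG details below are assumptions and depend on your Airflow/provider versions.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="nightly_extract",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = KubernetesPodOperator(
        task_id="run_extract_script",
        name="extract",
        namespace="default",
        image="registry.example.com/etl/extract:latest",  # your script, containerised
        retries=3,                         # Airflow handles retries and metrics
        retry_delay=timedelta(minutes=5),
    )
```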
Reminds me of the xkcd "how standards proliferate"
Arguably you could use Kubernetes as a scheduler, or "ETL framework/kit": it supports cron jobs, has a RESTful API, local and remote secrets storage, native support for cloud storage, support for multiple logging solutions, distributed workloads, etc.
Years ago I worked for a financial services company and they would run their batch ETL jobs via a product called Tidal that later got bought by Cisco. I really liked using Tidal for the longest time, but 100% of what Tidal does, you can replicate with the scheduler features of Kubernetes.