Launch HN: Airbyte (YC W20) – Open-Source ELT (Fivetran/Stitch Alternative)
178 points by mtricot on Jan 26, 2021 | 87 comments
Hi HN!

Michel here with John, Shrif, Jared, Charles, and Chris. We are building an open-source ELT platform that replicates data from any application, API, database, etc. into your data warehouses, data lakes, or databases: https://airbyte.io.

I’ve been in data engineering for 11 years. Before Airbyte, I was the head of integrations at Liveramp, where we built and scaled over 1,000 data ingestion connectors to replicate 100TB worth of data every day. John, on the other hand, has already built 3 startups with 2 exits. His latest one didn’t work out, though. He spent almost a year building ETL pipelines for an engineering management platform, but he eventually ran out of money before reaching product-market fit.

By late 2019, we had known each other for 7 years, and always wanted to work together. When John’s third startup shut down, it was finally the right timing for both of us. And we knew which problem we wanted to address: data integration, and ELT more specifically.

We started interviewing customers of Fivetran, Stitchdata, and Matillion to see whether the existing solutions were solving their problems. We learned they all fell short, and always in the same patterns.

Some limitations we identified are due to the fact that they are closed source. This prevents them from addressing the long tail of integrations, because there will always be an ROI consideration when building and maintaining new connectors. A good example is Fivetran, which, after 8 years, offers around 150 connectors. This is not a lot when you look at the number of existing tools out there (more than 10,000). In fact, all of their customers we talked to are building and maintaining their own connectors (along with orchestration, scheduling, monitoring, etc.) in-house, as the connectors they needed were either not supported in the way they needed or not supported at all.

Some of those customers also tried to leverage existing open-source solutions, but the quality of the existing connectors is inconsistent, as many haven't been updated in years. Plus, they are not usable out of the box.

That’s when we knew we wanted Airbyte to be open-source (MIT license), usable out of the box, and able to cover the long tail of integrations. By making it trivial to build new connectors on Airbyte in any language (they run as Docker containers), we hope the community will help us build and maintain the long tail of connectors. Open-source not only enables us to address all use cases (including internal DBs and APIs), it also allows us to solve the problem inherent to cloud-based solutions: the security and privacy of your data. Companies don’t need to trust yet another 3rd-party vendor. And because it is self-hosted, it will disrupt the pricing of existing solutions.
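
To give a feel for what "connectors as Docker containers" means in practice, here is a rough sketch of a minimal source connector in Python. It is illustrative rather than our exact spec; assume the contract is simply "one JSON message per record on stdout", which is what makes connectors language-agnostic:

  # toy_source.py -- illustrative sketch of a source connector.
  # Assumes an Airbyte-style contract: one JSON message per line on stdout.
  import json
  import sys
  import time

  def read():
      # A real connector would page through an API or query a database here.
      rows = [{"id": 1, "email": "a@example.com"},
              {"id": 2, "email": "b@example.com"}]
      for row in rows:
          sys.stdout.write(json.dumps({
              "type": "RECORD",
              "record": {
                  "stream": "users",
                  "data": row,
                  "emitted_at": int(time.time() * 1000),
              },
          }) + "\n")

  if __name__ == "__main__":
      read()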

Here’s a 2-minute demo video if you want to check out how it looks: https://www.youtube.com/watch?v=sKDviQrOAbU

Airbyte can run on a single node without any external infrastructure. We also integrate with Kubernetes (alpha), and will soon integrate with Airflow so you can run replication tasks across your cluster.

Today, our early version supports 41 sources and 6 destinations (https://docs.airbyte.io/integrations/destinations). We’re releasing new connectors (https://docs.airbyte.io/changelog/connectors) every week (6 of them have already been contributed by the community). We bootstrapped some connectors using the highest-quality ones from Singer. Our connectors will always remain open-source.

Our goal is to solve data integration for as many companies as possible, and the success of Airbyte is predicated on the open-source project becoming loved and ubiquitous. For this reason, we will spend the entirety of 2021 strengthening the open-source edition; we are dedicated to making it amazing for all users. We will eventually create a paid edition (open core model) with enterprise-level features (support, SLA, hosting and management, privacy compliance, role and access management, SSO, etc.) to address the needs of our most demanding users.

Give it a spin: https://github.com/airbytehq/airbyte/ & https://demo.airbyte.io. Let us know what you think. This is our first time building an open-source technology, so we know we have a lot to learn!




The fundamental challenge of open-source ETL is that high-quality connectors require understanding and working around all kinds of corner cases in the API of each data source. It’s very hard to get open source contributors to do this kind of work; it’s a real slog. Hence at Fivetran we’ve always stuck with the commercial route.


Personally, the most infuriating thing about a tool is when I can fix the damn thing given the source code, but I have to go through the support staff to the engineering team and then wait for "this is on our roadmap but not something we're currently prioritizing". Right, I know. I don't expect other people to do work for me. I just need them to let me do the work myself.

The massive advantage of the OSS route isn't that you can ask the community to build a tool for you; it's that when you inevitably have a corner case or some behaviour you want to encode, you can just make a RenesPostgres connector, copy in the Postgres connector, and fix it.

I don't understand why anyone keeps their source all closed. Even one of those "you can't release this but you can edit it" licenses is better.

Half of why I use Kong as an API Gateway is that I can just edit the source code of their plugins. Thank fuck for that.


I think that if the wider open source community can maintain API client libraries for every imaginable SaaS API and every popular programming language, there's no reason that it can't maintain open source ELT connectors for all of these sources as well.

I work at GitLab as project lead of Meltano (https://meltano.com/) — which embraces Singer instead of abandoning it — and we've seen a lot of interest from data consultancies looking for mature tooling around deploying and developing Singer taps, many of whom have expressed that they'd be happy to maintain open source ELT connectors for data sources that are commonly used by their clients, if they can significantly save on ELT costs that would otherwise get passed on to those clients.

Of course, only one data consultancy (or data team at a company) would need to maintain an open source tap, and others that need the same source for _their_ clients can contribute and help keep it up to date.


We couldn’t agree more that producing high-quality connectors requires a lot of work. The hardest part about this task is that connectors must evolve quickly (due to changes in the API, new corner cases, etc). The quality of the connector is not just how well the first version works but how well it works throughout its entire lifetime.

Our perspective is that by providing these connectors as open source, we can arrive at higher-quality connectors. With a closed source solution, a user has to go through customer service and persuade them that there is indeed a problem. A story we have heard countless times is that SaaS ETL providers are slow to fix corner cases discovered by users, leading to extended downtime. With an OSS solution, a user can fix the problem themselves and be back online immediately.

We proactively maintain all connectors, but we believe that by sharing that responsibility with the OSS community, we can achieve the highest quality connectors.

One of the main focuses of Airbyte is to provide a very strong open-source (MIT) standard for testing and developing connectors (base packages, standard tests, best practices…) in order to achieve the highest quality.
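
To make that concrete, a "standard test" in this spirit could look roughly like the sketch below. The image name and the set of message types are placeholders, not our actual test suite; the point is that the same test can run against a connector written in any language, because it only inspects the container's stdout:

  # Sketch of a standard connector test (image name is a placeholder).
  import json
  import subprocess

  VALID_TYPES = {"RECORD", "STATE", "LOG"}

  def test_connector_emits_valid_protocol_messages():
      out = subprocess.run(
          ["docker", "run", "--rm", "example/source-connector", "read"],
          capture_output=True, text=True, check=True,
      ).stdout
      for line in out.splitlines():
          message = json.loads(line)          # every line must be valid JSON
          assert message["type"] in VALID_TYPES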


Similar thoughts (btw I came here looking for your comment ha!).

I guess you mentioned in one of the videos that at Fivetran, it is your responsibility to ensure data integrity across all of the sources/integrations, and has been since the early days. This led customers to trust the product early on, and let the team draw learnings from abstract patterns across sources.

I have come to believe that having explicit ownership of issues is THE MOST important thing whenever there is physical movement of data across an org's ecosystem.


How do you determine this explicit ownership of issues? I've come across many governance problems linked to a lack of transparency in "bug ownership", but I've often failed to find common ground for clients and third parties: who's responsible? Who should pay for it?

Quite often it's the one with the loudest mouth or the biggest sponsor who wins.


A customer perspective from a mid-stage CFO: I like saving money, but prefer to pay for software solutions like this directly. I pay you, and you make sure this set of connectors {in and out} continues to durably work. Meanwhile, our engineers can focus on building our product.


This is something we will definitely offer as well, with an SLA. And because maintenance is done not only by us but by the community as well, fixes will propagate to all users much faster than if they had to go through customer support.

Open-source doesn't mean you can't have both. You can look at how Databricks or Confluent are doing it.


Couldn't agree more.

For me, work like ELT (https://fivetran.com or https://getcensus.com) is the type of work that no engineer in the world will get a promotion for.

A data/software/backend/platform/etc. engineer's time is better spent on something else.


Exactly!

Every engineer we talked to wants it off their plate, which is why we believe it should be commoditized with an open-source standard.


Yeah, Fivetran often seems to have issues that come down to "we had a discussion with the data source/sink provider and found they had a bug in their latest release." Even if an open source contributor gets to that point, they won't have the strong-arm ability to force the provider to fix the bug ASAP.


Fivetran has built all the custom OAuth flows for their 150 custom integrations, and you can build them into your own (internal or external) applications; it is neat. @georgewfraser When do you plan to add the ability to configure connectors that need extra config after the initial connection, e.g. choosing reports from Google Analytics?


That's an excellent point, and not easy to demonstrate until someone does experience an edge case with their connectors. The main value of open sourcing a framework for integrations (e.g. Singer) is to allow customers to easily support a large number of the long tail integrations that exist out there.


The Open Source Fivetran alternative. Yay, it was about time! A simple license: MIT. Clear differentiation between free & paid plans. I am liking what I am seeing so far. One of our clients is in the advertising industry and is syncing data from 20 different API vendors to Postgres. So I am one of your potential customers.

However, there is a big problem I'm noticing with "Open source alternatives" lately on HN. I had to mention this.

Even a simple installation of Airbyte on my local machine fails :( I tried docker-compose up!

I simply wanna know why a basic example is not working on an important day for your company? :) Is this a genuine mistake? Sorry, this feedback will sound harsh, but companies are taking the words 'open source' for a complete ride. It's a great marketing trick. Gets you plenty of eyeballs, good will & trust to begin with. Then later we figure out it's not even self-hostable.

Here is a bad example that you may not want to follow: Supabase, "The Open Source Firebase Alternative". The product is not self-hostable despite them calling themselves the open source Firebase all over the internet. The founders of Supabase have been disingenuous in not addressing self-hosting[1][2], and it's been a long time since their launch. The self-hosting section on their website[3] doesn't provide any details on how to self-host, and they are careless enough to even mention "how to migrate away" from Supabase in that section.

[1] : https://github.com/supabase/supabase/discussions/219#discuss... [2] : https://github.com/supabase/supabase/issues/85#issuecomment-... [3] : https://supabase.io/docs/guides/platform#self-hosting


hey yclurker, Supabase founder here.

I'm sorry we haven't delivered a better self-hosting experience. This is clearly something we could do better. As I mentioned in your link[1], we're targeting a release of our CLI in Q1 (last week of March).

> it's been a long time since their launch

It has been 8 months since our alpha launch*, and just over 1 month since our beta. I hope that sets some context, because personally I think we (and the community) have delivered a lot in that time. I'm very proud of what our small team has been able to deliver.

> The product is not self hostable

Note that Supabase is self-hostable, it's just lacking documentation, which we will rectify in Q1.

[*alpha]: https://news.ycombinator.com/item?id=23319901


@kiwicopple: I'm really sorry to say this; I understand you try your best to sound reasonable every time. It's just that you alone believe that we can't see past your replies. But it is pretty evident that you won't make Supabase self-hostable. Dude, even proprietary closed products provide docker / docker-compose by default in their repos. Quite honestly, shame on Supabase for setting such a shady example.


In the link you shared I mention how to get the docker-compose file by running `supabase eject`

I probably can't satisfy you, but I'll make our intentions very clear to everyone else reading this:

https://github.com/supabase/supabase/commit/913add2e3ca45e55...

Sorry for the lack of documentation - we will add more docs and make everything easier to use before a "Launch week" that we have planned for the last week of March.


From your commit in README.md

>> You can emulate Supabase using `docker-compose` by following these steps inside the `./docker` folder.

What does emulating even mean for an open source product? Why do I want to emulate it and not run the real thing?

>> I probably can't satisfy you

Dude, if you can show me one other product that claims to be "open source" yet can't be self-hosted, I will admit it.


I can confirm that self-hosted is the best of both worlds: fully private, you control access through your own infrastructure, and it all works in the browser. This simplicity is important for many small-medium companies. That is why my go-to tool for ad-hoc data slicing by non-technical users is https://www.metabase.com. I hope that Airbyte/dbt will be the ELT complement. And I hope to find a way to contribute.


Perfectly timed comment :) We are writing a tutorial at the moment to showcase how to use Airbyte/DBT/Metabase together.


Shameless plug: for ad-hoc reporting by non-technical users, our https://www.seektable.com can also be considered (esp. if your users heavily use pivot tables). It is closed-source but has fully functional free accounts & a self-hosted version.


To continue to complement this stack, feel free to check out Grouparoo, a "reverse ETL" tool that is also self-hosted. https://www.grouparoo.com


Sorry that it doesn't work :( Murphy's law on launch day, I would say...

Right now all our users self-host and the whole project is meant to be self-hosted for data privacy and security reasons.

Do you want to join our Slack (https://slack.airbyte.io)? We can help you resolve it!


  Traceback (most recent call last):
  File "site-packages/urllib3/connectionpool.py", line 677, in urlopen
  File "site-packages/urllib3/connectionpool.py", line 392, in _make_request
  File "http/client.py", line 1252, in request
  File "http/client.py", line 1298, in _send_request
  File "http/client.py", line 1247, in endheaders
  File "http/client.py", line 1026, in _send_output
  File "http/client.py", line 966, in send
  File "site-packages/docker/transport/unixconn.py", line 43, in connect


Thanks for posting! I don't have enough context to help you solve it.

Do you want to send a screenshot of your terminal? michel [@] airbyte.io


Isn’t there a GitHub issue submission page?



I had a read through your docs but was unable to find any info on how you handle the sync of hard deletions from sources.

We use Stitch at the moment and have found this to be a surprisingly hard problem to solve without binary log replication, and without full refreshes. For the moment we've ruled out binlogs as we use Aurora MySQL which requires you to binlog from the master, and tying my data warehouse replication to the master node concerns me.

In incremental mode, for common DBs (e.g. MySQL / Postgres), will Airbyte ever pick up hard deletions?


We don’t today; we will tomorrow, when we start working on CDC.

You’re raising a good point with MySQL, we will need to take this limitation into account. Hopefully there is a workaround.
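
For readers curious what binlog-based hard-delete detection looks like, here is a rough sketch using the pymysqlreplication library. This illustrates the underlying mechanism, not our CDC implementation; connection details are placeholders:

  # Illustrative only: stream DELETE events from the MySQL binlog.
  from pymysqlreplication import BinLogStreamReader
  from pymysqlreplication.row_event import DeleteRowsEvent

  stream = BinLogStreamReader(
      connection_settings={"host": "mysql-host", "port": 3306,
                           "user": "repl", "passwd": "secret"},
      server_id=100,                 # must be unique among replication clients
      only_events=[DeleteRowsEvent],
      blocking=True,
      resume_stream=True,
  )

  for event in stream:
      for row in event.rows:
          # A hard delete surfaces here; mark the row as deleted in the
          # warehouse instead of silently losing it.
          print("hard delete:", event.table, row["values"])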


Check this out: https://github.com/francoisp/rosettable. It uses a binlog reader to call Postgres triggers, upon which you can build real-time two-way sync, including hard-delete detection.


I've done a lot of ETL over the years, and I watched the video and read the intro. I apologize, but I haven't had a chance to read through the full documentation. My question is: a lot of the data I pull is relational and hierarchical; how would I pass variables along to related connectors? And is there a way to wait for a parent connection to finish before the child connector runs? I've built many ETL pipelines over the years, and it's not easy to keep up with the changes. The bottleneck was always the engineering adapting to the changing schema.


It is not possible as of now, but we are about to start integrating with DAG managers (Airflow, Dagster, Prefect...). That will make it possible to schedule the connectors with the proper dependencies.

See: https://github.com/airbytehq/airbyte/issues/836


Correct me if I'm wrong, but Airbyte is mainly an EL tool to be used with other schedulers? The scheduler will have to provide, internally or externally, the logic and parameters to the nodes that drive the Airbyte connectors?


We are still figuring out how we want to integrate with other schedulers.

One option would be that you configure your source/destination with the Airbyte UI or API, and with the external scheduler you just reference a connection object by its id.

We need to run some experiments and talk to the community to see what makes the most sense. If you have opinions or scenarios, would you write them in the ticket?
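
As a sketch of that option (the endpoint and payload here are illustrative, not a frozen API), an external scheduler task could be as thin as:

  # Trigger a pre-configured Airbyte connection from an external scheduler.
  # The connection id comes from the Airbyte UI/API; endpoint is illustrative.
  import requests

  AIRBYTE_URL = "http://localhost:8001/api/v1"
  CONNECTION_ID = "your-connection-id"

  def trigger_sync():
      # Credentials for the source/destination stay inside Airbyte; the
      # scheduler only knows the connection id.
      resp = requests.post(f"{AIRBYTE_URL}/connections/sync",
                           json={"connectionId": CONNECTION_ID})
      resp.raise_for_status()
      return resp.json()

  if __name__ == "__main__":
      print(trigger_sync())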


Some hard questions that I need to ask as someone who sets up complex data pipelines for customers as part of his job.

How do you want to monetize your product? What is your runway as of today? When do you project you will be self-sustainable?

It's all great to have an open source solution for pushing the data around, but I don't want to invest in learning a new tool only to see it vanish in 2-3 years or so.


Sure! Those are actually great questions!

Regarding monetization, you can see more details here: https://airbyte.io/pricing. We are considering 2 monetization approaches:

- Open core (connectors staying open-source forever), with premium features such as hosting & management (a cloud-based control panel without access to your data plane) and enterprise features (privacy compliance, SSO, user access management, etc.)

- What we call “Powered by Airbyte”, where we enable you to offer integrations to your own customers using our API

Regarding our runway, with the team as is, mid 2025. We intend to grow the team though, given the adoption growth we have. We’ve already been approached for a Series-A, but will consider it in mid 2022.

Regarding self-sustainability, do you mean financially? Possibly at the end of 2023 or 2024.

How does that sound to you? Genuinely curious.


Your pricing page does not have any prices, so it's hard to tell. I'll definitely keep an eye on your progress though. Your connector-as-container approach is interesting, although I think IT departments may have issues with an on-prem solution that spawns other containers by itself.

One more thing - how will you protect yourself from, let's say, AWS forking you and selling a managed Airbyte version?


The pricing depends on the number of connectors you need (but not the volume of data), the features you use, and the SLA. We already saw that there was a great willingness to pay, so we’re not really worried about that part. We will learn a great deal about that next year. Our sole focus in 2021 is the open-source part of the project, and becoming the new standard to replicate data.

Regarding cloud providers such as AWS, there are several things we can do to monetize this part with them (including with “Powered by Airbyte”). But, honestly, that would be a great problem to have, as it would indicate that we’ve become the standard way to replicate data.


> The pricing depends on the number of connectors you need

It's not uncommon for even small companies to have as many as a dozen sources. This sort of pricing scheme can hurt smaller companies with low data volume but many sources (think e-commerce, especially in the era of "headless").


Very true. We're thinking about offering some kind of volume-based pricing for small companies, but with a limit that we won't go over. There are some solutions, and we will try to offer the pricing most adapted to the needs of our users. It will be an iterative process for sure!


This is a key question. Licensing via GPL-3 would provide some protection against this, so can you explain why you went with MIT?


Building connectors is a death-by-a-thousand-paper-cuts problem. The community will encounter more of these edge cases than anyone who keeps a version of the connector internally and tries to improve it without contributing back.

With this in mind, GPL would bring no additional benefit; it might actually be a friction point for adoption in some organizations.


Very nice! Data wants to be free. Please check out this rosettable thing we open-sourced last year; it could come in handy: it allows real-time ETL-like sync between MySQL and Postgres. https://github.com/francoisp/rosettable I've implemented the struts of a framework built around rosettable, with mautic <-> davical as a proof of concept. It keeps a user-defined map (one-to-one, many-to-one, one-to-many, and many-to-many) in sync, live, both ways, surviving restarts even if there have been divergent edits on both sides while offline.

I haven't released this as OSS yet, but I'll put a demo online shortly.

For data lakes, I've also implemented something that could be useful: check out duplexRsync: https://github.com/francoisp/duplexRsync. I'll try to reach out via email to make sure you guys see this.

cheers! best, F


Thanks! Will look at it!


Congrats! We at Hightouch [0] ("reverse ETL") are excited to see Airbyte here on HN. We've been following Michel & John for a while now, since the YC days, and from the outside, it seems like they've been consistently shipping incredibly quickly ever since the open-source project launch.

@mtricot -- You mention that a big value prop of Airbyte is providing an interface for building custom connectors. Have there been interesting learnings on designing an ideal "interface" to provide developers? How does the interface you provide compare to that of Fivetran's Functions offering [1]?

[0]: https://hightouch.io

[1]: https://fivetran.com/docs/functions


Answering your first question: when talking about the interface, we need to separate the data protocol from the developer experience (DX) of creating & maintaining a connector. We believe the data protocol we have in place should address 95% of use cases, and, as we get more sophisticated use cases, we will evolve the protocol (for example, for more scale). Regarding the DX, we are continuously working on it to make it a breeze and to ensure super high quality.

Answering your second question: Fivetran Functions are a nice escape hatch, but none of the users we talked to mentioned them. They always mentioned building in-house for missing connectors. My interpretation is that it is too much vendor lock-in for a cloud-based product.


Do you plan to support one-click integrations? (= Stitch Connect JS [0] / Fivetran "Connect Card" [1])

[0] https://www.stitchdata.com/docs/developers/stitch-connect/ja... [1] https://fivetran.com/docs/rest-api/connectors/connect-card


Yes, indeed, we do. This should come within the next 3 months. Here's the issue if you want to keep track of our progress on the matter: https://github.com/airbytehq/airbyte/issues/768


I notice dbt integration on your roadmap. As a current Stitch and dbt Cloud customer, can you give some insight into what you're considering there?


Yes, we love DBT too!

It’s great for handling transformations and since we want to focus on the EL part, we think there’s good synergy there.

Airbyte is already using the DBT CLI internally, and as we provide more transformations of the data during syncs, we’ll make it easier to integrate with DBT projects downstream:

- Native Transformations as part of the sync process: for example, schema migrations for source data changes, un-nesting complex object columns (from APIs), etc.

- Customizable models to override or further extend Airbyte’s proposed transformations, to be executed in the same sync pipeline

- Seamless DX between custom downstream transformations and transformations made by Airbyte

- Integration with external orchestrators (Airflow, DBT Cloud jobs) with webhook triggers?

We’re happy to hear more ideas/needs to build this roadmap though!

You can have a look at the current state of Airbyte with DBT here: https://docs.airbyte.io/tutorials/connecting-el-with-t-using...


How do you differentiate from Meltano?


The main difference is that Meltano is based on Singer while Airbyte is based on the Airbyte Protocol.

We don’t believe Singer is a good building block, because it requires a significant time investment from its users to compensate for the absence of centralized enforcement of the Singer protocol. Since it is not enforced, there is often no guarantee that any pair of Singer connectors is compatible. That defeats the point of a specification. All taps live in their own repos, and all contributions are made to address the contributor’s case, not the general use case. The lack of a standard makes it very difficult to maintain all those connectors, and you end up with a majority of Singer taps being out of date.

Airbyte doesn’t have the same data protocol as Singer (but we are compatible). Our goal is to make building and maintaining new connectors a lot easier than it is with Singer, and therefore Meltano. That’s why we were able to ramp up our connectors (46 now) within just 5 months, while Meltano is focused on fixing the issues with Singer. We think it’s much harder to patch over Singer and reverse course on an abandonware project than it is to start from the ground up with these issues in mind. We wouldn’t be surprised if Meltano starts supporting Airbyte connectors in the future.

We detail these differences here: https://docs.airbyte.io/faq/differences-with.../meltano-vs-a...


Does this imply Airbyte only supports connectors that Airbyte validates and integrates into the platform? Can I use an Airbyte connector that lives in a repo on my private GitHub?

This solves the problem of getting high quality connectors built, but how do you plan to maintain them? What if the original contributor falls off the face of the earth?


You can also use an external connector if you want to. This is a very valid use-case, especially if you connect internal APIs or private sources that wouldn't make sense for the community.

If the original contributor falls off the face of the earth, it is OK! That's the beauty of Open-Source. Another person who is using it can jump in. We can also jump in.


It looks very promising and is something I wish to scale Quantale (https://quantale.io/) to. It's a project I am working on with very similar logic: adding data sources and making them available for data visualization. Congrats on your launch!


Thanks!


Will it support change data capture (CDC) from SQL databases?


Yes, it is on our roadmap. We will likely have an alpha version in the next two months.

https://github.com/airbytehq/airbyte/issues/957


Fantastic. We've been using both a managed service and an in-house solution (Debezium + Kafka) for CDC from transactional to analytics databases. Looking forward to seeing this in Airbyte.


It's nice to see some competition in this space, especially open source. I'm a little confused, though: I thought Singer was the open source version of Stitch (which you mention briefly): https://www.singer.io. But maybe Singer doesn't have all the UI features that Stitch does, and that's where Airbyte is different? I would love to know more about the differences.


At GitLab, we're not ready to give up on the Singer spec, community, and ecosystem yet, which is why I've been working on Meltano for the past year: https://meltano.com/

We think that the biggest things holding back Singer are the lack of documentation and tooling around taking existing taps and targets to production, and around building, debugging, maintaining, and testing new or existing high-quality taps and targets.

Meltano itself addresses the first problem, and provides a robust and reliable platform for building, running & orchestrating Singer- and dbt-based ELT pipelines. It's built for developers who are comfortable with CLIs and YML files, and want their pipelines to be defined in a Git repository so that they get the benefits of DevOps best practices like code review and CI/CD.

At the same time, we have been working with some members of the community on a new framework for building taps and targets: https://gitlab.com/meltano/meltano/-/issues/2401, which we have decided to call the Singer SDK: https://gitlab.com/meltano/singer-sdk. We are moving as many specification-specific details as possible (things like incremental state replication and stream/field selection) into the framework, so that individual taps only need to worry about getting the data from the source and can be expected to behave more consistently and correctly across the board.


I am one of Airbyte's founders. The initial version was actually fully based on Singer, and this is when we realized it wouldn't be possible for us to depend on it.

Among the main reasons: Singer seems to have been abandoned by StitchData (after they got acquired by Talend), the quality of the connectors is too unpredictable, and Singer connectors are not usable out of the box.

We would have preferred to use an existing standard if one had existed. It was a tough decision for us to create something from scratch, but now we are very satisfied with the decision. It is way easier for the community and for us to build connectors that meet quality standards, and we can make it MIT so the community has control over the evolution of the protocol.

We actually wrote a few articles about it:

https://docs.airbyte.io/faq/differences-with.../singer-vs-ai...

https://airbyte.io/articles/data-engineering-thoughts/airbyt...


Forgot to mention: we have a compatibility layer with Singer, so it is possible to run Singer taps in Airbyte. A few of our sources are actually among the highest-quality Singer ones.


Stitch is partially open source: many of the integrations they list on their website are hosted versions of OSS Singer taps, but a couple are not OSS.

I'm not sure how Stitch's acquisition will affect Singer contributions and such going forward.

Also, Singer has no UI, it's all CLI.


Awesome tooling, great to see some open source work in this space.

How do you handle Testing/Data Reconciliation with Airbyte?

How do I know if I have successfully transferred 100/100 records for the day? Is there a pattern you can recommend here, or a batch row count id that is recorded and can be queried against the source to confirm that the correct number of rows were added for batch x?
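
E.g., is the intended pattern something roughly like this (connections, table, and the batch_id column are all made up)?

  # Naive reconciliation: compare per-batch row counts between source and
  # destination after a sync. DB-API connections; names are placeholders.
  def count_rows(conn, table, batch_id):
      cur = conn.cursor()
      cur.execute(f"SELECT COUNT(*) FROM {table} WHERE batch_id = ?", (batch_id,))
      return cur.fetchone()[0]

  def reconcile(source_conn, dest_conn, table, batch_id):
      src = count_rows(source_conn, table, batch_id)
      dst = count_rows(dest_conn, table, batch_id)
      if src != dst:
          raise RuntimeError(
              f"{table} batch {batch_id}: {src} rows at source, {dst} loaded")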


Awesome project. When compared to major vendor solutions such as Azure Data Factory, is your core advantage the breadth of connectors and open-source core? Are you concerned that other such providers offer polish like no-code web applications to broaden product reach beyond programmers?

Also, while I see the E&L, what sort of T solutions do you offer or have planned on your roadmap?


Thank you for your kind words!

The breadth of connectors is a large part of it, as well as the fact that we are multi-cloud by design, so we limit vendor lock-in.

We are not too worried about other providers using our technology. The goal is to make Airbyte the open-source standard for EL, and every time a company builds on top of Airbyte we become more of a standard. The data market is so big and keeps growing, so there is room for more than one player. Who knows -- maybe one day Azure will use “Powered by Airbyte” to offer more connectors to their customers.

For the T part, it will depend on your destination. If you’re using a warehouse, DBT is an amazing tool, so we will deepen our integration with them. If you’re using other processing technologies, then we will see where the community brings us.


Shouldn't it be ETL (extract, transform, load) [1]?

[1] https://en.wikipedia.org/wiki/Extract%2C_transform%2C_load


No, it should not. The point of modern data warehouses like Snowflake is that you run the transformations in the warehouse, vs. some external transformation layer (think Informatica).

In the old approach, you would run the transform BEFORE loading data into the warehouse. The disadvantage of that approach is that you lose all fidelity of the raw data.

In the new approach (Airbyte's approach), you load the raw data into the warehouse, and then run your transform jobs in the warehouse. You can do that because modern warehouses are cheap and scalable. The benefit of that approach is that you keep your raw data with all its fidelity, opening up endless opportunities for exploratory slicing and dicing.

That's why it's called "ELT" (new) these days, to distinguish it from "ETL" (old).
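
A toy illustration of the ordering, using sqlite3 as a stand-in for the warehouse (load raw first, then transform with SQL inside the warehouse):

  # ELT in miniature: E + L land the raw data untouched, T runs in-warehouse.
  import sqlite3

  wh = sqlite3.connect(":memory:")   # stand-in for Snowflake/BigQuery/etc.
  wh.execute("CREATE TABLE raw_events (user_id INT, amount REAL)")

  # Extract + Load: keep full fidelity of the raw rows.
  wh.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [(1, 9.99), (1, 20.00), (2, 5.00)])

  # Transform: run as late (and as often) as you like, in the warehouse.
  wh.execute("""CREATE TABLE revenue_per_user AS
                SELECT user_id, SUM(amount) AS revenue
                FROM raw_events GROUP BY user_id""")

  print(wh.execute("SELECT * FROM revenue_per_user").fetchall())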


If you are interested, John (one of the co-founders) wrote an article about how we are imagining ELT evolving. https://airbyte.io/articles/data-engineering-thoughts/why-th...


ETL and ELT are similar but separate things. ELT has been popular for a few years now, and the data lake approach as well as cheap storage has solidified it as the current preferred way to do a data warehouse.


Is there a guide to setting it up behind a reverse proxy with HTTPS? I have it running with Nginx, but it won't load; it insists on using /8001/api. How do I change the API endpoint?


You can. You can play with the env variable API_URL.

For example, if you have a reverse proxy that serves both the webapp and the API, you can just launch with: API_URL=/api/v1/ docker-compose up


I just effortlessly synced Freshdesk to Postgres in 20 minutes. Thank you.


You're welcome! Glad you liked it!


Looks promising. Any plans to support the Amplitude API?


Yes, we'll definitely be looking at having an Amplitude source connector; the issue is already here: https://github.com/airbytehq/airbyte/issues/1457


Congratulations on the launch. I’m just getting started in the data pipelines world and was wondering, how does Airbyte compare to Embulk?


Airbyte addresses two pains of data engineers:

- having to build and maintain connectors

- managing data integration pipelines on behalf of less technical profiles

Because Airbyte is UI/API-based, we can offer a great experience to less technical profiles, so that they can become independent and leverage data in the most efficient way.

Embulk addresses a different use case.


I work for a big financial institution, and we are actively evaluating tools in this space. Do you include connectors for Salesforce?


Yes we do. You can check our connectors here: https://docs.airbyte.io/integrations


Will Airbyte support MySQL as destination?


Yes! It is on our short-term roadmap: https://github.com/airbytehq/airbyte/issues/1483

Also, we are OSS so if you want to contribute, we can guide you through it!


qq: how do I sell ELT to my managers who have their hearts set on ETL?


As mentioned by cgardens in reply to another question, you can read John's article on the future of ETL/ELT (John is one of the founders of Airbyte): https://airbyte.io/articles/data-engineering-thoughts/why-th...

It provides some solid arguments that should help your discussion with your managers.



