Show HN: OLake (open source) - Fastest database-to-Iceberg data replication tool
Hi HN,

Today we’re excited to introduce OLake (github.com/datazip-inc/olake, 130+ stars and growing fast), an open-source tool built to replicate database data (MongoDB for now; MySQL and Postgres connectors are under development) into a data lakehouse quickly, without the hassle of managing Debezium or Kafka. It is at least 10x faster than Airbyte and Fivetran at a fraction of the cost; see the benchmarks at https://olake.io/docs/connectors/mongodb/benchmarks.

You might think “we don't need yet another ETL tool.” True, but we tried the existing tools (proprietary and open source alike) and none of them were a good fit.

We made this mistake with our first product by building a lot of connectors, and we learned the hard way to pick one pressing pain point and build a world-class solution for it.

Who is it for?

We built this for data engineers and engineering teams struggling with:

1. Debezium + Kafka setup, and Debezium's 16MB-per-document size limitation when working with MongoDB. We are Debezium-free.

2. Lost cursors during the CDC process, with no option left but to resync the entire dataset.

3. Syncs running for hours with little visibility into what's happening under the hood (sync logs, completion time, which table is being replicated, and so on).

4. The complexity of setting up a Debezium + Kafka pipeline or other solutions.

5. Existing ETL tools being very generic, not optimized for syncing database data to a lakehouse and handling the associated complexities (metadata and schema management).

6. Knowing where to restart a sync. With OLake you get resumable syncs, visibility into exactly where a sync paused, and a stored cursor token.

What is OLake?

OLake is engineered from the ground up to address the above common pain points.

By using databases' native features (e.g., extracting data in BSON format from MongoDB) and the modern Apache Iceberg table format (the future going ahead), OLake delivers:

Parallelized initial loads and continuous change data capture (CDC), so you can replicate hundreds of GB in minutes to Parquet and land it in S3. Read about OLake's architecture at https://olake.io/blog/olake-architecture.
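
To make the chunked-parallel-load idea concrete, here is a minimal sketch, not OLake's actual code: split a MongoDB collection into disjoint _id ranges and export each range to Parquet concurrently. The connection string, database/collection names, and chunk boundaries are all hypothetical; it assumes pymongo and pyarrow are installed.

    # A minimal sketch of chunked parallel loading, not OLake's actual code.
    # Names, connection string, and chunk boundaries are hypothetical.
    from concurrent.futures import ThreadPoolExecutor

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client["shop"]["orders"]  # hypothetical database and collection

    def export_chunk(lo, hi, idx):
        # Each worker reads one disjoint _id range, so workers never overlap.
        docs = coll.find({"_id": {"$gte": lo, "$lt": hi}})
        rows = [{"id": str(d["_id"]), "raw": str(d)} for d in docs]
        pq.write_table(pa.Table.from_pylist(rows),
                       f"orders_{idx}.parquet")  # an S3 path in practice

    # Real boundaries would come from sampling the _id space; hard-coded here
    # for integer _ids purely as an illustration.
    chunks = [(0, 10_000), (10_000, 20_000), (20_000, 30_000)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for i, (lo, hi) in enumerate(chunks):
            pool.submit(export_chunk, lo, hi, i)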

Adaptive fault tolerance: designed to handle disruptions like a lost cursor while ensuring data integrity with minimal latency (you can configure the sync speed yourself). We store the sync state along with a resume token, so you know exactly where to resume your sync.
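
For illustration, here is a minimal sketch of the resume-token idea using a plain MongoDB change stream. This is not OLake internals; the state file path and the handle() sink are hypothetical stand-ins.

    # A minimal sketch of resumable CDC with a MongoDB change stream, not
    # OLake internals. The state file and handle() sink are hypothetical.
    import json
    import os

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client["shop"]["orders"]  # hypothetical collection

    STATE_FILE = "resume_token.json"  # hypothetical location for sync state

    def handle(change):
        # Stand-in for the real sink that writes the event to the lake.
        print(change["operationType"], change.get("documentKey"))

    token = None
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            token = json.load(f)

    # resume_after picks the stream up right after the last processed event,
    # so a crashed sync continues instead of starting over.
    with coll.watch(resume_after=token) as stream:
        for change in stream:
            handle(change)
            with open(STATE_FILE, "w") as f:
                json.dump(change["_id"], f)  # an event's _id is its resume token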

A modular, scalable architecture with configurable batching (select the streams you want to sync) and parallelism across them, to avoid OOMs and crashes.
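
As an illustration of those knobs, a hypothetical per-sync configuration might look like the sketch below. The key names are invented for this example and are not OLake's actual schema; see the docs for the real format.

    # Purely illustrative: invented key names, not OLake's actual config schema.
    sync_config = {
        "streams": [
            {"name": "orders", "sync_mode": "cdc"},          # continuous CDC
            {"name": "users", "sync_mode": "full_refresh"},  # one-shot load
        ],
        "max_parallel_streams": 2,  # cap concurrency to avoid OOMs
        "batch_size": 10_000,       # records buffered per Parquet flush
    }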

Why OLake?

As your production data grows, so do the challenges of managing it. For small businesses, self-serve tools and third-party SaaS connectors are often enough, but they typically max out at around 1TB of data per month, and then you are back to square one, googling for the perfect tool that is quick and fits your budget.

If you have something like 1TB/month landing in a database, with a good chance it will grow rapidly, and you are looking to replicate it to a data lake for analytics use cases, we can help. Reach out to us at hello@olake.io.

We are not saying we are the perfect solution to your every problem; this open-source project is very new, and we want to build it with your support.

Join our Slack community (https://getolake.slack.com) and help us build and set the industry standard for database-to-lakehouse ETL tools, so that there is no need for “yet another” attempt to fix something that isn’t broken.

About us: OLake is a proud open-source project from Datazip, founded by data enthusiasts Sandeep Devarapalli, Shubham Baldava, and Rohan Khameshra, and built in India.

Contribute - olake.io/docs/getting-started

We are calling for contributors. OLake is Apache 2.0 licensed and maintained by Datazip.




We had tried multiple tools to manage ELT workloads for analytics for our end users. I was skeptical at first because of the massive time it would take us to migrate everything. We did it anyway, and it paid off; totally worth the effort. I wonder what it is these guys are doing differently, but keep at it, guys [Thumbs Up]



We're using OLake and we absolutely love it. Great job, team.





