Show HN: Replicate App-Embedded DuckDBs into PostgreSQL with SyncLite OSS (github.com/syncliteio)
73 points by syncliteio 3 months ago | 20 comments



Hello HN,

Thank you for the feedback on our earlier introductory post about SyncLite; we have been incorporating it!

Earlier, we posted about a specific case of replicating app-embedded DuckDBs into a centralized PostgreSQL database.

We would like to further highlight SyncLite as a generic data consolidation framework that replicates/consolidates data from edge/mobile applications using popular embedded databases (SQLite, DuckDB, Apache Derby, H2, HyperSQL) into centralized, industry-leading databases (PostgreSQL, MySQL, MongoDB, DuckDB, SQLite, etc.).

We would love to get suggestions for improvements, new features/functionalities, new connectors, etc.

Brief summary of SyncLite's core infrastructure:

SyncLite Logger: a single Java library (JDBC driver) that encapsulates popular embedded databases (SQLite, DuckDB, Apache Derby, H2, HyperSQL/HSQLDB), allowing user applications to perform transactional operations on them while capturing those operations and writing them into log files (see the usage sketch after this summary).

Staging Storage: the log files are continuously staged on configurable staging storage such as S3, MinIO, Kafka, or SFTP.

SyncLite Consolidator: a Java application that continuously scans these log files from the configured staging storage, reads the incoming command logs, translates them into change-data-capture (CDC) logs, and applies them to one or more configured destination databases. It includes many advanced features, such as table/column/value filtering and mapping, trigger installation, fine-tunable writes, and support for multiple destinations.
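
To make this concrete, here is a minimal usage sketch against SyncLite Logger. The driver class name and JDBC URL scheme below are assumptions for illustration; the exact strings are in the repository's code samples.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SyncLiteQuickstart {
        public static void main(String[] args) throws Exception {
            // Assumed driver class and URL scheme -- consult the repo
            // samples for the exact names.
            Class.forName("io.synclite.logger.DuckDB");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:synclite_duckdb:/path/to/app.db")) {
                conn.setAutoCommit(false);
                try (Statement stmt = conn.createStatement()) {
                    stmt.execute("CREATE TABLE IF NOT EXISTS readings (id INTEGER, val DOUBLE)");
                    stmt.execute("INSERT INTO readings VALUES (1, 42.0)");
                }
                // On commit, the statements are applied to the local DuckDB
                // file and also captured in the transactional SQL log that
                // gets shipped to the configured staging storage.
                conn.commit();
            }
        }
    }

The application keeps using plain JDBC; the logging and shipping happen behind the driver.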


I’ve read the initial paragraphs of the README and I have no idea what it does. I think a technical, no-buzzword summary would be good.


SyncLite at the very core offers two key capabilities:

1. A holistic solution for Database Replication/Consolidation for numerous embedded databases used in several instances of edge/desktop applications, into centralized databases/dw/data lake etc.

2. A holistic data streaming solution offering a Kafka Producer and SQL API designed for final destination data delivery.


Please take this as well-intentioned feedback. #1 and #2 are really difficult to parse. A really carefully considered elevator pitch that distills the essence of your product into concrete use cases would go really far in helping people understand what you're building.

e.g. I think you're trying to say: SyncLite helps <developer?/dba?/who?> replicate databases from a variety of apps into a centralized location <for what purpose?>.

What's not clear is what problem this is solving, or what the ultimate goal is. Same comment for data streaming. I can imagine a few, but would prefer not to imagine.

Assume some of your potential users/customers don't know which specific solution they need, but probably know what their problem is. Your description should help someone with a problem your product solves recognize that your product seeks to address that problem. As someone who has worked with ETL products extensively (as a PM for those products and as a developer using those products), it's hard for me to know if your tool is interesting to me even as someone who's been intimately involved with the space.

One area where some jargon may be helpful is using more standard ETL terminology. You appear to have a kind of ETL tool, but are using terms that have me unsure about whether or not that's the space you sit in or how to mentally model your product.

Based on a quick skim, I suspect a lot of my questions are answered by digging deeper into the docs, and that's good. Just gotta get those details boiled down into a more descriptive summary.


Thank you @haswell for the feedback!

We will improve this further. Here is the brief summary, updated at the start of the README:

SyncLite is an open-source, no-code, no-limits relational data consolidation platform empowering developers to rapidly build data intensive applications for edge, desktop and mobile environments. SyncLite excels at performing real-time, transactional data replication and consolidation from a myriad of sources including edge/desktop applications using popular embedded databases (SQLite, DuckDB, Apache Derby, H2, HyperSQL), data streaming applications, IoT message brokers, traditional database systems(ETL) and more into a diverse array of databases, data warehouses, and data lakes, enabling AI and ML use-cases at all three levels: Edge, Fog and Cloud.


That sounds like almost pure buzz-speak. Here's a more approachable try:

SyncLite is a tool for managing and synchronizing data between different systems. It allows you to copy and merge data from various sources — like desktop applications' and IoT devices' embedded databases — with a central database or data storage system in real time. Thanks to change data capture (CDC), SyncLite is particularly useful when you need to keep data up-to-date across multiple locations or devices, or when you want to consolidate data from many sources into a single place for analysis or machine learning — then sync that back to the edge.

On its own, that's a much better intro, I think. But readers may have more questions, so here are sort-of TL;DR takeaways from the rest of the README, aiming at "Why do I care?":

Here’s when SyncLite can be useful:

1. Real-Time Data Sync: If your application requires data from various sources (like local apps or sensors) to be updated continuously and in sync, SyncLite automates this process without needing to write a lot of custom code.

2. Data Consolidation: When you have multiple data streams—whether from embedded databases, IoT devices, or streaming data apps—and you need to bring them together into a central database or storage for analysis, SyncLite handles this consolidation efficiently.

3. Simplifying ETL and Migration: If you're migrating data between different databases or setting up ETL (Extract, Transform, Load) pipelines, SyncLite offers straightforward tools to manage these tasks, reducing the need for complex scripting or manual intervention.

4. IoT and Edge Data Integration: For applications involving IoT devices or edge computing, SyncLite makes it easier to capture and process data from many distributed devices, syncing it to central servers for processing or analysis.

5. Flexible Deployment: SyncLite can be set up in various environments, whether you prefer using Docker, traditional servers, or cloud services. This makes it adaptable to your existing infrastructure.

SyncLite's goal is a simple, scalable way to manage data synchronization and consolidation across different environments and data sources, reducing the need for custom development and providing tools to manage real-time data flows effectively.

As for the buzzword-laden take posted above, here's an attempt to unpack those claims against your longer-form README, which doesn't seem to fully justify all the buzzwords (I left out jargon that could be justified or mostly justified, even though it shouldn't have been jargon):

1. No-code: The documentation does not fully justify this claim. While it describes the platform as "no-code," the setup involves deploying servers, configuring data pipelines, and potentially writing scripts for integration. "No-code" would imply a more user-friendly interface without the need for configuration or scripting, which isn't the case here.

2. No-limits: The documentation does not provide evidence to support the "no-limits" claim. The platform seems robust, but every system has limitations related to scalability, performance, or specific use cases. The documentation doesn’t address any of these potential limitations, so this claim remains unsubstantiated.

3. Empowering developers: The documentation uses this phrase to market the tool but doesn't provide concrete examples or evidence of how it empowers developers in practice. It describes features that could make data management easier, but "empowering" is subjective and not directly substantiated with user testimonials or specific use case examples.

4. Rapidly build: The claim is somewhat justified but lacks specific examples or benchmarks to show how SyncLite speeds up development compared to other tools. The documentation touches on various features that could potentially reduce development time, but doesn’t quantify what or how it's more rapid.

5. Excels at performing real-time, transactional data replication and consolidation: This claim is partially justified. The documentation describes the real-time replication and consolidation capabilities but lacks performance metrics or comparisons to show how it "excels" over other similar tools. It would benefit from more specific examples or case studies demonstrating its effectiveness.

6. Enabling AI and ML use-cases: The claim is not fully justified. While the documentation mentions AI and ML, it does not provide specific examples or tools for these use cases, such as data preparation or model training features. This makes the claim feel more like a marketing angle to cynically claim applicability for AI/ML targeted funding, rather than a substantiated capability.

7. Edge, Fog, and Cloud: The documentation mentions deployment across these environments, but it doesn’t fully explain the differences or advantages in each case. The term "Fog" computing is less commonly understood, and the documentation does not provide enough detail to clarify this concept or its benefits within SyncLite.

By contrast, the revised intro for SyncLite clearly and concisely describes its core function (synchronizing and managing data between various distributed systems and a central core in real time) while specifying practical use cases and benefits without jargon or overpromising.


Thank you @Terretta for your thoughtful feedback on SyncLite! I appreciate your suggestions for a clearer and more practical introduction. Your version definitely captures the core benefits more effectively, and we will be revising the README and documentation to reflect this more straightforward approach.

We will ensure our messaging accurately represents SyncLite's capabilities. Thanks again for your input!


"holistics", "edge", "final destination data delivery" are all buzz words.


Removing the buzzwords gives me:

1. Database Replication/Consolidation for embedded databases used in edge/desktop applications, into centralized databases/a data warehouse/a data lake

2. A Kafka Producer and SQL API for streaming data back to embedded edge/desktop application databases

Is my understanding correct?


1. Database Replication/Consolidation for embedded databases used in edge/desktop applications, into centralized databases/a data warehouse/a data lake

2. A Kafka Producer and SQL API for streaming data from edge/desktop applications into centralized databases/data warehouses/data lakes.

And then there are tools built on top of this infrastructure: a database ETL tool, an IoT data connector tool, etc.


SyncLite Open Source (https://www.synclite.io/) ==>

- SyncLite Logger, a wrapper JDBC driver supporting popular embedded databases (SQLite, DuckDB, Apache Derby, H2, HyperSQL), creates transactional SQL logs for the embedded DBs and ships them to the configured staging storage (S3/MinIO/SFTP/Kafka/OneDrive/Google Drive, etc.)

- Several application instances, each creating multiple embedded databases (devices), are synchronized in real time onto the staging storage.

- SyncLite Consolidator, a centralized tool, continuously processes device logs, generates CDC logs, and replicates/consolidates the staged devices into a wide range of industry-leading databases, data warehouses, and data lakes.

Look forward to feedback and suggestions.


SyncLite looks great for companies that are starting to build from scratch: ingest data into an embedded database and, via CDC, move the data into a centralized database. I'm also trying out a similar idea, only with a somewhat reverse approach, from cloud warehouse -> embedded DuckDB, to reduce the compute cost for BI and embedded analytics use cases. The combination of cloud and embedded databases is the future IMO.

For the project I'm working on, tech such as Apache Iceberg and embedded DuckDB enables querying Snowflake + BigQuery tables directly from your BI tool, without any compute cost: https://github.com/buremba/universql


Thanks @buremba

Absolutely agree on "combination of cloud and embedded databases is the future IMO"

Universql looks interesting as well.

SyncLite also provides the ability to send custom commands back from the SyncLite Consolidator to individual applications (devices), while edge/desktop applications can implement callbacks to be invoked on receiving these commands. A command can be anything; for example, it could be a way to tell the application to download data from a cloud-hosted data warehouse and use it as a starting point. (A rough sketch of what such a handler might look like is below.)
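
To illustrate the idea only: the interface and method names in this sketch are hypothetical, not the actual SyncLite API.

    // Hypothetical sketch -- not the actual SyncLite API. Shows the shape
    // of a device-side callback invoked when the consolidator sends a
    // command to this application instance.
    public interface CommandHandler {
        void onCommand(String command, String payload);
    }

    public class WarehouseRefreshHandler implements CommandHandler {
        @Override
        public void onCommand(String command, String payload) {
            if ("REFRESH_FROM_WAREHOUSE".equals(command)) {
                // Download a snapshot from the cloud-hosted data warehouse
                // and load it into the local embedded database as the new
                // starting point.
            }
        }
    }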


Is this Java/JVM-specific? How are changes captured on the client?

From reading through the docs it seems like the CDC work is done in the JDBC driver (SyncLite Logger), but I had to read quite far to see that, and it's difficult to put together how it works and in what types of applications this can be used. For example, can I capture changes to a SQLite or DuckDB database in an Android or iOS application?

I'm interested to see the approaches used, since there is very little out there for capturing changes in embedded databases.


SyncLite Logger is a JDBC wrapper/driver and can be consumed in Java and Python applications:

A few code samples are shown here: https://github.com/syncliteio/SyncLite and https://github.com/syncliteio/SyncLite/tree/main/synclite-co...

A JSP/servlet sample web app is here: https://github.com/syncliteio/synclite-sample-web-app

SyncLite Logger transactionally captures the exact SQL statements (DDLs and DMLs), as executed on the underlying embedded databases, into SQL log files.

The log files for each device (i.e., database) are shipped to the configured staging storage (local FS/SFTP/S3/MinIO/Kafka, etc.).

SyncLite Consolidator continuously looks for new devices and for new logs from each device.

The received raw SQL logs are not directly applicable to the final destination: they are logical statements that may involve complex dependencies/joins, e.g. insert into t1 select <cols> from t2 JOIN t3 ...

The Consolidator applies the incoming SQL logs transactionally to a native embedded database (SQLite) and, in the process, generates record-level CDC logs by identifying the exact changed records and tables.

The CDC logs consist of record-level INSERT, UPDATE, and DELETE statements, plus DDL statements: CREATE TABLE, ALTER TABLE, DROP TABLE, RENAME TABLE.
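
For instance (a made-up illustration of the translation), a logical statement in a device's log such as:

    INSERT INTO t1 SELECT a.id, b.val FROM a JOIN b ON a.id = b.id;

would, after being replayed on the Consolidator's internal SQLite copy, come out as record-level CDC statements like:

    INSERT INTO t1 (id, val) VALUES (1, 'x');
    INSERT INTO t1 (id, val) VALUES (2, 'y');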

The CDC log files, as they are generated, are applied to the final destination databases.

A summary of this is here: https://www.synclite.io/synclite/sync-ready-apps

Thank you for the feedback; we are improving the documentation.


Thanks, that gives a good overview


The whole concept here feels backwards to me. Why are we writing into an OLAP database and replicating to an OLTP one?


The framework is generic: it handles data consolidation from numerous applications, which may use one or more popular embedded databases (SQLite, DuckDB, Apache Derby, H2, HyperSQL), into a wide range of industry-leading databases, including PostgreSQL, MySQL, MongoDB, DuckDB, and more.

A potential use case for consolidating data from many DuckDBs into a destination PostgreSQL + PGVector would be to empower developers to build edge-first Gen AI / RAG search applications using DuckDB's vector storage and search capabilities, while enabling real-time consolidation of data + embeddings from all application instances into a centralized PostgreSQL + PGVector, readily enabling global RAG applications.
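
As a rough sketch of the edge-side write path for that use case (hypothetical schema and values; this reuses the imports and the conn from the quickstart sketch earlier in the thread; storing the embedding in textual form is one portable option, since pgvector can cast text like '[0.1, 0.2, ...]' to a vector on the destination side):

    // Hypothetical schema: a document chunk plus its embedding, written
    // into the app-embedded DuckDB through the SyncLite Logger connection;
    // the consolidator then replicates it into PostgreSQL + PGVector.
    try (Statement stmt = conn.createStatement()) {
        stmt.execute("CREATE TABLE IF NOT EXISTS chunks ("
                   + "id INTEGER, content VARCHAR, embedding VARCHAR)");
        stmt.execute("INSERT INTO chunks VALUES "
                   + "(1, 'hello world', '[0.12, 0.98, 0.05]')");
    }
    conn.commit();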

More details here:

https://www.synclite.io/solutions/gen-ai-search-rag

https://medium.com/@mahendra.chavan/synclite-bridging-the-ga...


Why does the title specifically mention DuckDB and Postgres? It sounds like it supports multiple embedded and remote databases, and I don't see anything special in the README about this pairing...


While the title mentions one specific combination, the framework is generic: it handles data consolidation from numerous applications, which may use one or more popular embedded databases (SQLite, DuckDB, Apache Derby, H2, HyperSQL), into a wide range of industry-leading databases, including PostgreSQL, MySQL, MongoDB, DuckDB, and more.

A potential use case for consolidating data from many DuckDBs into a destination PostgreSQL + PGVector would be to build edge-first Gen AI / RAG search applications using DuckDB's vector storage and search capabilities, while enabling real-time consolidation of data + embeddings from all application instances into a centralized PostgreSQL + PGVector, readily enabling global RAG applications.

More details here:

https://www.synclite.io/solutions/gen-ai-search-rag



