edrenova's comments (Hacker News)

Just to jump in here: Neosync supports RDS (and more), and you can self-host it.

https://github.com/nucleuscloud/neosync

(I'm one of the co-founders)


I tried to figure out how (or if) this does what I need, but your README had no examples. I clicked a couple of levels deep, found no obvious demonstrations, and left.

I checked the homepage, but I personally don't watch Loom-style demos, definitely not 5-minute ones, so I left.

-

When I click on OP's link, or just search for it on Google, it takes less than a full page for the extension to show me an extremely straightforward demonstration of its value. You should have something like that.

A simple example of what queries will look like, what setup will look like, all concisely communicated, no 5 minute lectures involved.


Thanks for the shout-out! Co-founder of Neosync here - love seeing more tools in this space and pushing the envelope further. Good luck!


Nice write-up. Mock data generation with LLMs is pretty tough. We spent time trying to do it across multiple tables and it always had issues. Whether you use classical ML models like GANs or LLMs, they struggle to produce large volumes of data while respecting foreign keys, constraints, and other relationships.

Maybe someday it'll get better, but for now we've found that a more traditional algorithmic approach is more consistent.
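To illustrate the algorithmic approach (a minimal sketch, not Neosync's actual implementation; table and column names are made up), foreign-key validity can be guaranteed by construction: generate parent rows first, then sample child FKs only from keys that actually exist:

```python
import random

def generate_users(n):
    # Parent table: primary keys assigned sequentially, so
    # uniqueness holds by construction.
    return [{"id": i, "name": f"user_{i}"} for i in range(1, n + 1)]

def generate_orders(n, users, rng):
    # Child table: every foreign key is sampled from keys that
    # actually exist in the parent table, so referential
    # integrity can never be violated.
    user_ids = [u["id"] for u in users]
    return [
        {"id": i, "user_id": rng.choice(user_ids), "total_cents": rng.randint(100, 99_999)}
        for i in range(1, n + 1)
    ]

rng = random.Random(42)  # seeded for reproducible test data
users = generate_users(10)
orders = generate_orders(50, users, rng)

valid_ids = {u["id"] for u in users}
assert all(o["user_id"] in valid_ids for o in orders)  # FKs hold by construction
```

An LLM has to learn these invariants statistically; the generator above simply cannot emit a dangling FK.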

Transparency: founder of Neosync - open source data anonymization - github.com/nucleuscloud/neosync


I’ve spent some time in enterprise TFO/demo engineering, and this kind of generative tool would’ve been a game changer. Synthetic data sits at the intersection of being genuinely hard and being in high business demand. When you're working with customer data, it’s pretty risky: just anonymizing PII doesn’t cut it. You’ve got to create data that’s far enough removed from the original to really stay in the clear. But even if you can do it once, AI tools often need thousands of rows to make the demo worthwhile. Without that volume, the visualizations fall flat and the demo has no impact.

I found the challenge with LLMs isn’t generating a "real enough" data point; that’s doable. It’s more: "How do I load this in?", then "How do I generate hundreds of these?", and beyond that, "How do I make these pseudo-random in a way that tells a coherent story with the graphs?" It always feels like you’re right on the edge, but getting it to work reliably is harder than it looks.
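The "coherent story" part can often be handled without an LLM at all. As a hedged sketch (illustrative numbers, not anyone's product): seed the RNG so demo runs are reproducible, and shape the randomness around a deliberate trend so the chart reads as growth rather than flat noise:

```python
import math
import random

def demo_signups(days=90, seed=7):
    # Seeded RNG -> the "story" is identical on every demo run.
    rng = random.Random(seed)
    rows = []
    for d in range(days):
        trend = 20 + 0.8 * d                        # steady-growth narrative
        weekly = 6 * math.sin(2 * math.pi * d / 7)  # weekday/weekend rhythm
        noise = rng.gauss(0, 3)                     # messiness, but bounded
        rows.append(max(0, round(trend + weekly + noise)))
    return rows

data = demo_signups()
# Later days visibly exceed early days, so the graph tells a story.
assert sum(data[-7:]) > sum(data[:7])
```

Hundreds of rows like these are cheap to generate and load, and the trend/seasonality/noise split is easy to tune per demo.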


Yup, agreed. We built an orchestration engine into Neosync for that reason. It handles all of the reading/writing from DBs for you and can also generate data from scratch (using LLMs or not).


GANs are barely ten years old and already they have reached the classical ML algorithm status.


Thanks for the question! Faker is useful but lacks features like referential integrity, data orchestration, and the ability to read/write to a DB. So Faker can work for simple API schemas, but if you need something more robust for an entire database, that's where we can help.
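One concrete thing "an entire database" adds beyond per-field fakes is insert ordering: parents must be populated before children or FK inserts fail. A minimal sketch using a topological sort of the FK graph (hypothetical table names, not Neosync's API):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical FK graph: each table maps to the tables it references.
fk_deps = {
    "users": set(),
    "accounts": {"users"},
    "orders": {"users", "accounts"},
    "order_items": {"orders"},
}

# static_order() yields referenced tables before the tables that
# reference them, so seeding the DB in this order never hits an
# FK violation.
insert_order = list(TopologicalSorter(fk_deps).static_order())
print(insert_order)  # e.g. ['users', 'accounts', 'orders', 'order_items']
```

A plain Faker loop has no notion of this graph; a database-aware tool has to build and walk it.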


Thanks! Yeah, we generally recommend not making your databases public and instead connecting to them through a bastion host; we support this at Neosync. Also, ideally don't connect to a live DB; use a snapshot or backup instead. A read replica could work as well, but a snapshot is better.
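For readers unfamiliar with the bastion pattern, a hedged sketch (hostnames are illustrative): open a local port-forward through the bastion with `ssh -L`, then point your client at localhost:

```python
import subprocess  # Popen(cmd) would actually open the tunnel

def bastion_tunnel_cmd(bastion, db_host, db_port=5432, local_port=5433):
    # ssh -L forwards local_port on this machine to db_port on the
    # private database host, hopping through the public bastion.
    return [
        "ssh", "-N",  # -N: forward ports only, don't run a remote shell
        "-L", f"{local_port}:{db_host}:{db_port}",
        bastion,
    ]

cmd = bastion_tunnel_cmd(
    "ec2-user@bastion.example.com",
    "mydb.cluster-xyz.us-east-1.rds.amazonaws.com",
)
print(" ".join(cmd))
# With the tunnel up (subprocess.Popen(cmd)), clients connect to
# localhost:5433 as if it were the private database.
```

The database itself never needs a public endpoint; only the bastion is reachable from outside.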


Cool to see this launch! Actually came across this a few weeks ago and tried it out. Really nice for local dev :)


Thanks for trying it out edrenova!


The ideal experience is that you anonymize prod and sync it locally. Whether it's for testing or debugging, it's the only way to get representative data.

When you write mock data, you almost always write "happy path" data that usually just works. But prod data is messy and chaotic, which is really hard to replicate manually.
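One property that makes anonymized prod data usable downstream is determinism: the same input always maps to the same fake value, so joins and dedup across tables still line up. A minimal sketch (my own illustration, not Neosync's implementation):

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # illustrative key; keep out of source control in practice

def anonymize_email(email: str) -> str:
    # Keyed HMAC rather than a plain hash, so values can't be
    # recovered by brute-forcing common emails without the key.
    digest = hmac.new(SECRET, email.lower().encode(), hashlib.sha256).hexdigest()[:12]
    return f"user_{digest}@example.com"

a = anonymize_email("Jane.Doe@acme.com")
b = anonymize_email("jane.doe@ACME.com")
assert a == b           # deterministic: joins and dedup still work
assert "acme" not in a  # original name and domain are gone
```

Because the mapping is stable, a `user_id` anonymized in the `users` table matches the same value anonymized in `orders`, which is exactly what mock data written by hand tends to get wrong.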

This is actually exactly what we do at Neosync (https://github.com/nucleuscloud/neosync). We help you anonymize your prod data and then sync it across environments. You can also generate synthetic data as well. We take care of all of the orchestration. And Neosync is open source.

(for transparency: I'm one of the co-founders)


Excited to announce a new partnership between Neon (open source serverless postgres) and Neosync (open source data anonymization) to give developers the easiest way to create data branches with anonymized production data for better testing, debugging and developer experience.


hey! so sorry about this - it's fixed now!

also - happy to chat further if you have any questions - evis@neosync.dev


Thanks for addressing it and your availability. Keen to look into how Neosync might fit my team's dev needs very soon.


Nice! Appreciate you sharing it. Would love to see the code at some point, but it looks like it's confidential.

I spent a lot of time building tokenization solutions at a previous startup so we'll definitely support tokenization at some point. There is a good use-case for it as well!

