ZomboDB developer here. I don't frequent HN so please forgive any missteps...
Just wanted to say that while the code has proven itself production-worthy, I'm still rewriting the documentation and the test suite, and that stuff is going to take a while.
What ZDB does, how it performs, and the ways it can be (creatively) used aren't reflected in the current docs. Neither are the caveats, edge cases, and known bugs.
If this thread shows any interest I'll do my best to field questions and generally discuss how it's actually fun to work in a world with MVCC-correct Elasticsearch indexes.
I love that you publicly acknowledge what is and isn't done. The documentation topics you lay out in this comment are super important; it was great to see that you already include a list of "non-features".
Would also be great to see the "production credentials" - where has it been used, under what load/conditions, how much data, what kinds of queries, etc.
I'm just some guy on the Internet who is fortunate enough to be able to bring a thing he's spent the past two years working on behind closed doors to the public. I haven't participated in open source since 1999 (no, really) so I'm out of the loop with how things work in 2015.
But if I'm going to continue to shepherd this beast forward, there's no point in hiding its flaws. Besides, how will I know to fix them if I don't document them? Along those lines, I'm sure it's clear the docs are still a WIP, and the more I've written the more I realize needs to be written. I suspect the actual code won't change one bit (ha!) over the next few weeks.
Regarding "production credentials", the README mentions the company where this started, and I've alluded to some large (to me) round numbers. Those will have to do for now. :)
This sort of scenario (a PG index based on ES) isn't what I'd want to use for the scale of something like Netflix (for example) where you've got billions of rows and tens-of-thousands of queries a second. But at the same time, that level of scale only happens at the top. And it's lonely up there. There's a lot more room down here on the ground.
----
While I'm sitting here at 4:30am waiting for HN to stop telling me I'm posting too quickly (wth, I'd like to get some sleep eventually!), here's another thought...
Data is hard. I've spent my entire professional career dealing with data. Trying to bridge the gap between two distinct databases has proven really challenging (and fun and rewarding), but there's quite a bit of work left to do, and there are thousands of programmers out there who are waay smarter than I am (starting with the entire crew of postgresql-hackers), so I feel like if I can at least list the things I know I don't know, someone else may come along and have an answer.
Quick question, what does not being crash safe mean in practice? Will I have to recreate all indexes in pg and this will trigger a restart of Elastic and recreation of all indexes there?
That's a good question and a point of improvement for the documentation.
What I was trying to get across by saying that it's not "crash safe" is that the Elasticsearch index is not WAL logged.
As such, if Postgres crashes (bugs, hardware failure, kernel oops, etc.) and has to recover, from WAL, transactions that touched blocks belonging to tables with ZomboDB indexes, those ZomboDB indexes are now inconsistent. :(
In this regard, it's akin to Postgres' "hash" index type.
That said, this may be a "simple matter of programming" to resolve. I have some ideas, but it hasn't been a priority yet.
The recovery path is to REINDEX TABLE foo; Of course, the total time it takes to reindex is highly dependent on how much data you have, but indexing is really fast. My MBP can sustain 30k records/sec indexing documents with an average size of 7.5k against a single-node ES cluster. We've seen >100k recs/sec on high-performance hardware with very large ES clusters.
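In case it helps, that recovery looks something like this (foo is just a placeholder table name):

    -- rebuild every index on the table, which re-syncs the ES side with Postgres:
    REINDEX TABLE foo;
    -- or rebuild just the one ZomboDB index (index name is made up):
    REINDEX INDEX idx_zdb_foo;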
It's also worth noting that if the ES cluster disappears, any transaction that tries to touch the index will abort.
I am wondering whether this is a good idea or not. I would like to build a search app on top of this, but I worry about support going forward and think just using Elasticsearch would be better in the long term. Anyone have any thoughts?
Despite being the developer, I understand where you're coming from. It's always hard to choose an open-source project and decide that you're going to base an important part of your "app" on it.
ZomboDB hit github on Friday and I tweeted about it on Sunday. There's no doubt it's new. There's no doubt it's different. There's no doubt it ain't for everybody.
As the developer, this thing is my full time job and livelihood now, so I've got a lot of motivation to continue development and definitely a lot of work to do to turn it into something lucrative.
If you evaluate it on its own merits, you might find it's worth at least spending a few minutes playing with. It really is nice to do something like...
SELECT * FROM table WHERE zdb(table) ==> '(beer w/3 wine) wo/10 cheese and title:(food, dinner, restaurant) and operating_hours:"11:00am"';
... against 100M rows of mixed structured and full-text data.
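For context, wiring one of these up looks roughly like this (a sketch: localhost:9200 is a stand-in for your ES cluster URL, and the index name is a placeholder):

    CREATE INDEX idx_zdb_table
              ON table
           USING zombodb (zdb(table))
            WITH (url='http://localhost:9200/');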
Plus, and here's the kicker: it's transaction-safe. If an INSERT/UPDATE/DELETE statement aborts, the ES index isn't DOA. VACUUM will come along later and clean up the dead rows, and in the meantime PG will maintain MVCC correctness for all active/future sessions. The pain of trying to keep an external index synchronized literally goes away.
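A minimal sketch of what I mean (the column, value, and id are made up):

    BEGIN;
    UPDATE table SET title = 'dinner menu' WHERE id = 42;  -- ES index is updated in step
    ROLLBACK;  -- aborted: the new row version is simply invisible to searches

    -- other sessions never see the aborted change:
    SELECT * FROM table WHERE zdb(table) ==> 'title:"dinner menu"';  -- zero rows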
On the other hand, maybe you don't care about transactions, have a read-only dataset, and don't have a "source of truth" system for your data. If any of that is true, then you don't need ZomboDB.
It was built as _the_ search platform for a web-based document search and review system, providing real-time search, analytics, and high-volume updates.
Let me try to be a little less vague here (I guess too much time has passed to edit?).
In general, we developed this for searching structured (but not necessarily relational) full-text content with an enormous amount of associated metadata.
ZomboDB has come out of the legal e-Discovery world and is the backing search platform for one of the major e-Disco service providers in the industry.
It's hard to describe a typical dataset, but anywhere from 600k rows to 100M rows. Some datasets are just a few gig (on disk in Postgres) and others approach 1TB.
A typical usage pattern for a document review system is that humans (and sometimes automated computer processes) literally read each document, make judgement calls about each one, apply metadata, and move to the next document. Rinse-wash-repeat. On a large-scale litigation review, it's not uncommon to have hundreds of users doing this simultaneously.
As such, over time, every document gets UPDATEd at least once (typically 4-5 times due to administrative tasks).
You can imagine that 100M documents with maybe 400 reviewers is a bit of an organizational problem in terms of teams of reviewers, what they should be reviewing, etc., so it's important that the data never lie. If the system says there are 1,863,462 "Potentially Privileged" documents, then that had better be the actual answer.
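(As a hedged example, with "docs" and "tags" as made-up table/field names, it's queries like this whose answers have to be exactly right:)

    SELECT count(*) FROM docs WHERE zdb(docs) ==> 'tags:"Potentially Privileged"';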
Because a system like this has to provide real-time progress reporting, analytics on judgement calls, along with the ability to generally search and organize a lot of data, we needed something that first and foremost provided transaction isolation. Enter Postgres. We also needed sophisticated full-text searching. Enter Elasticsearch.
From there it was trying to answer the question "how do we make full-text searching honor our transaction isolation rules?" That's how ZomboDB came to be.
I would think ZomboDB would be useful for any kind of data management system that wants a canonical "source of truth" system (i.e., Postgres) to be easily text-searchable. Document review, medical records management, inventory/catalog systems, etc. As I said in another post, the fact that ES is abstracted away and the need to asynchronously synchronize data into it goes away is fairly compelling. My hope is that some of its bigger caveats (such as crash recovery) get solved sooner rather than later.