
Show HN: ZomboDB – Postgres extension for indexes backed by Elasticsearch - nstart
https://github.com/zombodb/zombodb
======
zombodb
ZomboDB developer here. I don't frequent HN so please forgive any missteps...

Just wanted to say that while the code has proven itself production worthy,
I'm still re-writing the documentation and the test suite, and that stuff is
going to take a while.

What ZDB does, how it performs, and the ways it can be (creatively) used
aren't reflected in the current docs. Neither are the caveats, edge cases, and
known-bugs.

If this thread shows any interest I'll do my best to field questions and
generally discuss how it's actually fun to work in a world with MVCC-correct
Elasticsearch indexes.

~~~
rattray
I love that you publicly acknowledge what is and isn't done. The documentation
topics you lay out in this comment are super-important; it was great to see
that you already include a list of "non-features".

Would also be great to see the "production credentials" - where has it been
used, under what load/conditions, how much data, what kinds of queries, etc.

~~~
zombodb
I'm just some guy on the Internet who is fortunate enough to be able to bring
a thing he's spent the past two years working on behind closed doors to the
public. I haven't participated in open-source since 1999 (no, really) so I'm
out of the loop with how things work in 2015.

But if I'm going to continue to shepherd this beast forward, there's no point
in hiding its flaws. Besides, how will I know to fix them if I don't document
them? Along those lines, I'm sure it's clear the docs are still a WIP, and the
more I've written the more I realize needs to be written. I suspect the actual
code won't change one bit (ha!) over the next few weeks.

Regarding "production credentials", the README mentions the company where this
started, and I've alluded to some large (to me) round numbers. Those will have
to do for now. :)

This sort of scenario (a PG index based on ES) isn't what I'd want to use for
the scale of something like Netflix (for example) where you've got billions of
rows and tens-of-thousands of queries a second. But at the same time, that
level of scale only happens at the top. And it's lonely up there. There's a
lot more room down here on the ground.

----

While I'm sitting here at 4:30am waiting for HN to stop telling me I'm posting
too quickly (wth, I'd like to get some sleep eventually!), here's another
thought...

Data is hard. I've spent my entire professional career dealing with data.
Trying to bridge the gap between two distinct databases has proven really
challenging (and fun and rewarding) but there's quite a bit of work left to
do, and there are thousands of programmers out there who are way smarter than
I am (starting with the entire crew of postgresql-hackers), so I feel like if I
can at least list the things I know I don't know, someone else may come along
and have an answer.

------
idoco
Liked the homage to zombo.com :)

Had to reference this xkcd.com/855

------
strictfp
Quick question, what does not being crash safe mean in practice? Will I have
to recreate all indexes in pg and this will trigger a restart of Elastic and
recreation of all indexes there?

~~~
zombodb
That's a good question and a point of improvement for the documentation.

What I was trying to get across by saying that it's not "crash safe" is that
the Elasticsearch index is _not_ WAL logged.

As such, if _Postgres_ crashes (bugs, hardware failure, kernel oops, etc.) and
then has to recover, from WAL, transactions that touched blocks belonging to
tables with ZomboDB indexes, those ZomboDB indexes end up inconsistent. :(

In this regard, it's akin to Postgres' "hash" index type.

That said, this may be a "simple matter of programming" to resolve. I have
some ideas, but it hasn't been a priority yet.

The recovery path is REINDEX TABLE foo. Of course, the total time it takes
to reindex is highly dependent on how much data you have, but indexing is
really fast. My MBP can sustain 30k records/sec indexing documents with an
average size of 7.5k against a single-node ES cluster. We've seen >100k
recs/sec on high-performance hardware with _very_ large ES clusters.

It's also worth noting that if the ES cluster disappears, any transaction that
tries to touch the index will abort.
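To make that recovery path concrete, here's a sketch (the table and index
names are hypothetical, and the CREATE INDEX options follow the README's
examples, so double-check against the docs for your version):

    -- a ZomboDB index backed by a local single-node ES cluster
    CREATE INDEX idx_zdb_products
              ON products
           USING zombodb (zdb(products))
            WITH (url='http://localhost:9200/');

    -- after a Postgres crash + WAL recovery, the ES side may be
    -- inconsistent; rebuilding from the table (the source of truth)
    -- repairs it
    REINDEX TABLE products;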

------
comrade1
Can you do anything with zombo? Anything at all? Is the only limit yourself?

~~~
zombodb
You can at least execute unstructured queries against your structured data,
and stay transactionally safe while doing so.

Maybe it's not anything, but it's at least something. ;)

------
x5n1
i am wondering whether this is a good idea or not. i would like to build a
search app on top of this, but worry about support going forward and think
just using Elasticsearch would be better in the long term. anyone have any
thoughts?

~~~
zombodb
Despite being the developer, I understand where you're coming from. It's
always hard to choose an open-source project and decide that you're going to
base an important part of your "app" on it.

ZomboDB hit github on Friday and I tweeted about it on Sunday. There's no
doubt it's new. There's no doubt it's different. There's no doubt it ain't for
everybody.

As the developer, this thing is my full time job and livelihood now, so I've
got a lot of motivation to continue development and definitely a lot of work
to do to turn it into something lucrative.

If you evaluate it on its own merits, you might find it's worth at least
spending a few minutes playing with. It really is nice to do something like...

    SELECT * FROM table
     WHERE zdb(table) ==> '(beer w/3 wine) wo/10 cheese and
           title:(food, dinner, restaurant) and operating_hours:"11:00am"';

... against 100M rows of mixed structured and full-text data.

Plus, and here's the kicker: it's transaction safe. If an INSERT/UPDATE/DELETE
statement aborts, the ES index isn't DOA. VACUUM will come along later and
clean up the dead rows, and in the meantime PG will maintain MVCC correctness
for all active/future sessions. The pain of trying to keep an external index
synchronized literally goes away.
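A minimal sketch of the failure case (hypothetical "docs" table, using the
zdb(table) ==> '...' operator shown above):

    BEGIN;
    UPDATE docs SET body = 'wine and cheese pairing notes' WHERE id = 42;
    ROLLBACK;  -- statement aborts; the new row version is dead

    -- the dead version may still be sitting in the ES index, but Postgres'
    -- MVCC visibility rules filter it out of the results automatically
    SELECT * FROM docs WHERE zdb(docs) ==> 'wine w/3 cheese';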

On the other hand, maybe you don't care about transactions, have a read-only
dataset, and don't have a "source of truth" system for your data. If any of
that is true, then you don't need ZomboDB.

~~~
nstart
As you've seen it being used in production environments, I'm curious. What
kind of applications have you seen being built with this?

~~~
zombodb
It was built as _the_ search platform for a web-based document search and
review system, providing real-time search, analytics, and high-volume updates.

~~~
zombodb
Let me try to be a little less vague here (I guess too much time has passed to
edit?).

In general, we developed this for searching structured (but not necessarily
relational) full-text content with an enormous amount of associated metadata.

ZomboDB has come out of the legal e-Discovery world and is the backing search
platform for one of the major e-Disco service providers in the industry.

It's hard to describe a typical dataset, but anywhere from 600k rows to 100M
rows. Some datasets are just a few gig (on disk in Postgres) and others
approach 1TB.

A typical usage pattern for a document review system is that humans (and
sometimes automated computer processes) literally read each document, make
judgement calls about each one, apply metadata, and move to the next document.
Rinse-wash-repeat. On a large-scale litigation review, it's not uncommon to
have hundreds of users doing this simultaneously.

As such, over time, every document gets UPDATEd at least once (typically 4-5
times due to administrative tasks).

You can imagine that 100M documents with maybe 400 reviewers is a bit of an
organizational problem in terms of teams of reviewers, what they should be
reviewing, etc., so it's important that the data never lie. If the system says
there are 1,863,462 "Potentially Privileged" documents, then that had better
be the actual answer.

Because a system like this has to provide real-time progress reporting,
analytics on judgement calls, along with the ability to generally search and
organize a lot of data, we needed something that first and foremost provided
transaction isolation. Enter Postgres. We also needed sophisticated full-text
searching. Enter Elasticsearch.
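For instance, a progress-report count can just be an ordinary transactional
query (the review_status field is hypothetical, following the field:term
syntax from my earlier example):

    -- only committed judgement calls are counted; concurrent,
    -- uncommitted reviewer sessions can't skew the number
    SELECT count(*)
      FROM documents
     WHERE zdb(documents) ==> 'review_status:"Potentially Privileged"';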

From there it was trying to answer the question "how do we make full-text
searching honor our transaction isolation rules?" That's how ZomboDB came to
be.

I would think ZomboDB would be useful for any kind of data management system
that wants its canonical "source of truth" system (ie, Postgres) to be easily
text-searchable: Document Review, Medical Records Management, Inventory/Catalog
Systems, etc. As I said in another post, the fact that ES is abstracted away,
and that the need to asynchronously synchronize data into it goes away, is
fairly compelling. My hope is that some of its bigger caveats (such as crash
recovery) get solved sooner rather than later.

