
Ask HN: Graph DB for full text search? - staticautomatic
I was hoping those of you familiar with graph databases and FTS could offer
some advice. I'm trying to identify a good DB for some somewhat unusual
requirements. I am working with documents that are themselves collections of
documents and properties, best represented as a graph.

I need to perform a mix of key/value queries and full-text search, often in
combination.

Hard requirements:

1. Suitable as a primary database (i.e. not Elastic)
2. First-class graph support
3. Stable in production
4. Strong FTS capabilities (primarily meaning lots of available analyzers
   and/or relatively painless support for custom analyzers, i.e. not Bleve)
5. Not fully proprietary
6. A sane multi-tenant model

Preferably also capable of being packaged for on-prem use.

It does not have to be massively scalable, HA, or capable of holding huge
amounts of data. I need really solid capabilities but on what most would
consider a quite small number of documents, each of small size.

I've looked at a lot of the usual suspects but have used hardly any myself,
and the trade-offs are making my head swim despite my best effort at feature
comparison.
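
To make the workload concrete, here is a toy in-memory sketch in Python of
the kind of query described above: walk the documents nested under a parent,
then keep those that match both a text term and a key/value filter. The names
`Node`, `TinyGraphIndex`, and the sample data are all hypothetical, and the
whitespace tokenization is a placeholder for a real analyzer; this is an
illustration of the query shape, not any database's API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """A document node: key/value properties, free text, and child documents."""
    node_id: str
    props: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)  # ids of contained documents


class TinyGraphIndex:
    def __init__(self):
        self.nodes = {}
        self.postings = defaultdict(set)  # term -> ids of nodes containing it

    def add(self, node):
        self.nodes[node.node_id] = node
        # Placeholder analysis: lowercase and split on whitespace.
        for term in node.text.lower().split():
            self.postings[term].add(node.node_id)

    def descendants(self, root_id):
        """Yield the id of every document nested under root_id."""
        stack = list(self.nodes[root_id].children)
        seen = set()
        while stack:
            nid = stack.pop()
            if nid in seen:
                continue
            seen.add(nid)
            yield nid
            stack.extend(self.nodes[nid].children)

    def search(self, root_id, term, **where):
        """Ids under root_id whose text has `term` and whose props match `where`."""
        hits = self.postings.get(term.lower(), set())
        return sorted(
            nid for nid in self.descendants(root_id)
            if nid in hits
            and all(self.nodes[nid].props.get(k) == v for k, v in where.items())
        )


idx = TinyGraphIndex()
idx.add(Node("case-1", {"type": "case"}, "", ["memo-1", "memo-2"]))
idx.add(Node("memo-1", {"type": "memo", "status": "final"}, "Quarterly revenue forecast"))
idx.add(Node("memo-2", {"type": "memo", "status": "draft"}, "Revenue notes"))
print(idx.search("case-1", "revenue", status="final"))  # ['memo-1']
```

The hard part in a real system is everything this sketch skips: the quality
of the analyzers feeding the postings, and doing the traversal and the text
match in one indexed query rather than in application code.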
======
verdverm
Have you looked into Postgres with extensions? I believe there is one for
graph-like queries, and it has pretty good full-text search as well.

~~~
staticautomatic
Postgres is great but it's totally unsuitable for the kinds of queries I need
to run. They'd be horribly inefficient even with indices significantly larger
than the data tables themselves.

------
FridgeSeal
Have you had a look at Dgraph?

It supports the kinds of queries you're looking for: full text, kv, etc.
Appropriately indexed cols automatically get things like stemming, stop-word
removal, support for a large variety of languages, etc.
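
For anyone unfamiliar with what stemming and stop-word removal mean at index
time, a rough Python sketch of an analysis chain follows. The stop-word list
and the suffix-stripping "stemmer" are deliberately crude stand-ins chosen
for illustration; they are assumptions, not what Dgraph, Bleve, or Lucene
actually ship.

```python
import re

# Illustrative only; real engines ship per-language stop-word lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase and split on runs of word characters."""
    return re.findall(r"\w+", text.lower())


def strip_suffix(token):
    """Very crude stemmer: drop one common English suffix if enough remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


print(analyze("The indexing of linked documents"))  # ['index', 'link', 'document']
```

The same chain runs over query text, which is why a search for "link" can
match a document containing "linked".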

Definitely primary db capable. Apart from Neo4j it seems to be the only graph
database I've found that _actually_ uses graph data structures as opposed to
putting a graph query language over the top of a relational database.

It's open core: the core is open source, and the enterprise/paid tier covers
hosting and a couple of extra features.

Deployment is super straightforward: comes with pre-built containers, just
deploy them anywhere you run containers, or you can also just deploy the
single binary. They also, super handily, provide a whole bunch of tools for
deploying it in other ways (docker compose files, Terraform files, single and
HA mode Kubernetes deployment yamls, etc).

~~~
staticautomatic
Believe it or not, Dgraph has far and away the worst FTS of any graph or
multi-model DB I evaluated. After speaking with Dgraph's engineers about it on
their Slack channel, I learned that this is apparently by design, for reasons
left unsaid and which I'm pretty sure I would charitably describe as
ridiculous.

Like Couchbase, Dgraph has decided to use the Go library Bleve for FTS. Bleve
is good at what it does, but it just doesn't do very much. The number and
kinds of analyzers that Bleve has absolutely pale in comparison to Lucene's. So
for starters, it's just not all that great. Never mind that Bleve is pretty
easy to extend. I don't want to reinvent the wheel writing analyzers that are
freely available in Lucene, and I certainly don't want to have to deal with
incorporating them back into my code base and testing them every time there's
a new release.

But it gets worse. Unlike Couchbase, Dgraph doesn't even fully use Bleve.
Rather than tracking Bleve releases and inheriting its analyzers, Dgraph has
made the completely baffling decision to implement only a subset of them. It's
already the case with Bleve, for example, that pretty much the only sentence
tokenizer available is the Unicode tokenizer. I don't want to use the Unicode
tokenizer anyway, but even if I wanted to do the tokenization myself, there's
no straightforward way for me to get my tokens into Dgraph because Bleve's
"single token" analyzer (which just accepts a stream of individual tokens) is
not one of the two or three analyzers Dgraph elected to incorporate.
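
The idea of tokenizing outside the database can be sketched in a few lines of
Python: the index needs an analyzer that emits each supplied value verbatim
as one token, with no further splitting or filtering. `keyword_analyzer` and
`index_pretokenized` are hypothetical names for this illustration and do not
correspond to Bleve's or Dgraph's actual API.

```python
from collections import defaultdict


def keyword_analyzer(value):
    """Emit the input unchanged as exactly one token."""
    return [value]


def index_pretokenized(docs, analyzer=keyword_analyzer):
    """Build term -> doc ids from documents whose tokens were produced elsewhere."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            for term in analyzer(tok):
                postings[term].add(doc_id)
    return postings


# Tokens come from an external tokenizer, so multi-word tokens survive intact.
docs = {
    "d1": ["new york", "graph db"],
    "d2": ["york", "graph db"],
}
postings = index_pretokenized(docs)
print(sorted(postings["new york"]))  # ['d1']
print(sorted(postings["graph db"]))  # ['d1', 'd2']
```

Without a pass-through analyzer like this, the database would re-split
"new york" into two terms and the external tokenization would be lost.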

As far as I'm concerned, that is some bullllll shit.

~~~
mrjn
(Author of Dgraph) This is the first complaint I’ve heard about the full-text
index of Dgraph. Are there examples where Dgraph’s FTS isn’t as good as
others? If so, could you please file an issue on our GitHub repo, so we can
investigate and bring it on par with what Elastic and others have to offer?

Also, Dgraph allows custom indexers. So, you could build a custom FTS which
would better fit the task at hand.

~~~
staticautomatic
I appreciate you weighing in here, but with all due respect this is homework
you would have already done if you were serious about your FTS being on par
with Lucene, and homework that I shouldn't have to do for you.

It's totally obvious just from comparing Elastic's documentation with Bleve's
that Elastic has way more tokenizers and filters than Bleve. And it's also
totally obvious from comparing Bleve's code to Dgraph's that Dgraph implements
a subset of Bleve's.

Whoever was responding to me on Slack sounded like they didn't even know what
I was talking about. When I asked whether Dgraph planned to implement all of
Bleve's tokenizers, the response I got was "Dgraph uses Bleve to generate the
full-text tokens for the full-text index." When I pointed out that currently
only some of them have been implemented and reiterated my question, the answer
I got was "Dgraph's product decisions are independent of Bleve's features."

Bleve would be a fine choice for FTS if you were planning on implementing the
whole thing, writing additional analyzers to reach parity with Elastic, and
preferably upstreaming them to Bleve. But if you're asking me to open a
GitHub issue saying "will you please consider implementing the rest of Bleve?"
the answer is no, thanks. I'll just revisit Dgraph some other time.

------
winrid
What have you looked into and dismissed already besides Elastic?

~~~
staticautomatic
Dgraph is out bc of Bleve. Others have varying levels of shortcomings but
aren't necessarily out. For example, Arango supports custom analyzers, but I'd
really rather have more analyzers available out of the box.

Neo4j seems like a clear winner, but I'm a little skittish because so many
people seem divided on using it in production. OrientDB also seems like it
could be a very good choice.

I think my main problem with evaluating the multi-model alternatives to Neo is
that I don't feel capable of reasoning about the appropriateness of
implementing this in a multi-model database. I have to admit I don't quite get
how their graph layers work beyond "graph something something linked lists".
At best I have a suspicion that it could turn into some kind of composite
indexing hell.

