

GraphLint: Creating a Domain Specific Language for Graph Validation - wfleiss
http://www.knewton.com/tech/blog/2012/02/graphlint-creating-a-domain-specific-language-for-graph-validation/

======
andrewcooke
fwiw, when faced with a similar problem i've used graph databases (these days
i guess neo4j, but at the time i used datalog). you either find errors loading
the database, or via queries that parallel the rule-like approach used here.
it's not as cool as this, but you end up with something quite similar to what
is described (a way to find errors "declaratively") without needing to
implement a new language.

also, i wonder if life would have been easier if the implementation had used
prolog? not only is it famously nice for supporting dsls (don't ask me how -
i've never done it, but you can imagine that parsing is pretty easy, since you
get backtracking "for free"), but it might provide much of the scaffolding
needed to evaluate queries.

~~~
jordanlewis
I'm a software engineer at Knewton who worked with Kenny on this project.

We're not using Neo4j because of other unrelated constraints that have led us
to write our own graph data store. Kenny's work implemented the "rule
paralleling queries" approach you mention.

I'm not sure about Prolog, but Java's Antlr library was really easy to use.
Writing the parser+lexer infrastructure only took a few days. It wasn't quite
as clean as using Antlr in a language like SML, but for implementing a parser
in Java it was pretty great.

~~~
andrewcooke
interesting, thanks (sorry for misunderstanding slightly). possibly related /
useful in future - did you notice that neo4j's licence terms changed a while
back? i can't remember the details, but i had avoided it in the past because
of licensing and remember that the change seemed to be an improvement.

~~~
espeed
The Neo4j community edition is now free and licensed under GPL
(<http://neo4j.org/download/>), and they are working on an Enterprise license
that is free for startups ([https://groups.google.com/d/msg/gremlin-
users/kw9N6xYb6l4/uH...](https://groups.google.com/d/msg/gremlin-
users/kw9N6xYb6l4/uHW23ukWsukJ)).

There is also a Neo4j Heroku Add On (<http://addons.heroku.com/neo4j>) that is
currently free for developers.

------
alexholehouse
This is really cool - prolog and datalog are great for a wide variety of
things, but I'd imagine the performance advantages writing a DSL rather than
having to go through the whole *log interpreter is a major plus.

Out of interest, what's the performance like at scale (50 000+ nodes)? (In
hind-site, I guess I don't actually have anything to compare it to, but a 50
000 node graph seems pretty big!)

~~~
jordanlewis
Kenny's initial implementation of the validation on a sample 50,000 node data
graph and a specific ontology took four seconds.

The goal of Kenny's project was to get a working DSL and prototype of the
validation algorithm. We don't really have much performance data to compare it
to either at this point. As we gather that data by using the system, we'll
make it faster as necessary. (Nathan Marz's "suffering-oriented programming"
post is a perfect description of this development process
[http://nathanmarz.com/blog/suffering-oriented-
programming.ht...](http://nathanmarz.com/blog/suffering-oriented-
programming.html))

------
mbell
What advantages does this hold over doing the same thing with Gremlin? Other
than being slightly prettier syntactically. [0]

[0] <https://github.com/tinkerpop/gremlin/wiki>

~~~
jordanlewis
Kenny's work is a language for describing the schema of a graph (e.g. "this
graph is a bipartite graph"). We want our non-programmers to be able to write
graph schemas, hence the DSL. I imagine one could use the Gremlin language to
specify a graph schema, but I think it would be a bit obtuse as Gremlin is a
query language.

We're actually using the Blueprints model under the hood to implement the
validation algorithm.

