Hacker News new | past | comments | ask | show | jobs | submit login
PathQuery, Google's Graph Query Language (arxiv.org)
172 points by mpweiher 29 days ago | hide | past | favorite | 47 comments



I was invited to work at Google in 2013 on an internal project using their Knowledge Graph. I had just written two books on RDF/SPARQL/linked-data and at first I was taken aback by the query language used at the time. However, when I realized how well the KG scaled and how fast the queries were, I became a fan.

BTW, I think this paper is very well written. Graph queries is not an easy topic and their examples and text make the data types, syntax, and how to perform matches and aggregations easy to follow.


More than a decade ago I worked for a few years on a product that was based on OWL/RDF/SPARQL and since then I still see a lot of problems I'm facing in terms of subject, predicate, object triples.

The semantic web feels like a fad that has clearly passed but there are plenty of good ideas from that field that are under-utilized. For example, a lot of what gets encoded as JSON can be described with these triples and once the data is in that form, there are interesting questions you can ask of it.


I think it's far from a fad, the adoption of the semantic web has been enormous in the past 5 years. Almost every major tech company (and many financial institutions) now have their own Knowledge Graph team.


I'm interested in industry history, so I wonder if you can comment on how much of this, if any, came into Google from Metaweb. There are no Metaweb-connected people among the authors (that I can tell) but Warren Harris is acknowledged. Was it a language that was in development before the acquisition, or was it needed mainly after?


Warren was certainly more familiar with the guts of MQL than anybody else, but PathQuery was designed for Google. PathQuery was actually the second system he built for extracting protobufs from graphs. He was also a longtime functional languages aficionado and the prototype implementation of PathQuery was written in Haskell. It took many smart-person-hours to get Warren's interpreter to work at Google scale and speed.

I think what differentiates PathQuery from many "query languages" is that it takes on the task of transforming that data into somebody else's schema (aka "turning protos into other protos"). Having a DSL for this is particularly attractive in a company that will otherwise expect you to do string formatting in C++. But even under normal circumstances there are wins from having data transformation vertically integrated with your query language.


Mainly after. Metaweb brought MQL with them, which ran on their own single server graphd, which was eventually shutdown.

Warren started working on PathQuery as a replacement for MQL. PathQuery was originally executed on Pregel and was far from a realtime query language.

A realtime query engine was later developed for PathQuery in the search stack so interesting search queries could be answered a la minute from the Knowledge Graph, rather than just pre-generating results for a limited class of queries.

This was all done in the 2012-2013 timeframe. Most of the people involved are long gone.


I've forgotten the details: by real-time query engine are you referring to DGraph, Livegraph or something else?


Definitely not graphd. Maybe Livegraph? I forget what the project was called, but it was the first realtime query engine built for the Knowledge Graph and it used PathQuery as the query language.


PathQuery was started by Warren Harris. At Metaweb, Warren did a rewrite of MQL in OCaml which we never shipped because of the Google acquisition. He spent much of the first year at Google working on a prototype of PathQuery in Haskell. The prototype was rewritten in C++ and used in the first KG-serving infrastructure sometime around 2012/2013.


As someone who worked on the Metaweb/KG team during that period, it's safe to say the development of PathQuery was a direct result of the acquisition. The lack of Metaweb-linked authors relates to the fact that PathQuery sits closer to 'serving knowledge' than 'representing knowledge'. But lessons learned from the development of MQL surely made their way into PathQuery, through Warren and others.


What are your thoughts on just using plain datalog/prolog?


I have only used datalog in the form of Clojure and Datomic on a consulting job about ten years ago.

I used to be a huge Prolog fan, but I don't think that anyone has paid me to do Prolog develop in over 20 years. My largest Prolog project was porting a prototype AI planning system that I wrote in Common Lisp to ExperProlog. It took about 6 weeks to prototype the system in Common Lisp, and re-writing it in Prolog only took about two weeks.

I noticed last month that someone was offering an online Prolog course using Swi-Prolog. That might be a good place to start.


Having written a chunk of my phd in prolog and even a simplified implementation of one, I'm definitely a fan. At the same time, our startup (graphistry) works w/ teams needing to look at all sorts of graph data, including knowledge graphs like the topic here, and I'd be nervous about meeting the performance & scale needs with out-of-the-box prolog systems. Likewise, a lot of prolog's power comes from ideas like automatically backtracking search, and I can't imagine using that in most team environments.

Datalog solves most of that :) My read of the paper is their semantics largely reduces to datalog. The most interesting semantic exception is their bounded recursion, which suggests they're doing optimizations normal datalog wouldn't. Likewise, graph workloads often have weird phenomena to optimize for in terms of expected queries & data, where I'm guessing datalog might fit semantically but not how people would normally implement an engine. The PathQuery paper carefully side-stepped all such discussion, so I've been curious.

At Graphistry, we do end-to-end GPU stuff, so I've been thinking about this particular problem, and came to a similar conclusion of a datalog-ish subset being the most straightforward way to get more performance/$ for this kind of task.


You might like Google's Logica

https://opensource.googleblog.com/2021/04/logica-organizing-...

(Datalog/Prolog family language compiled to SQL)


I've been using this fairly heavily recently (internal to Google), to the point where I'm thinking of investing in writing an org-babel mode for it. It's a really nice way to structure queries!


Oh wow that is neat!

And yes, this kind of thing is why datalog is a lot more amenable to fast query plans & runtimes than prolog. This part is especially cool: https://github.com/EvgSkv/logica/blob/main/compiler/dialects...


PathQuery looks a lot like GROQ [1], which is the query language used by the Sanity data store [2]. For example, one of the queries in the paper:

  @entities
    .[/type == (Id(/ museum ) , Id(/ theme_park ))]
    .{
      id: ?cur
      require name: /name.[TextLang() == en ]
      @merge : events::GetInfo()
    }
can be written something like:

  *[type == "museum" || type == "theme_park"] {
    id,
    name: select(name[lang == "en"] => name),
    ...{
      // GROQ doesn't have functions, so GetInfo()
      // would need to be inlined here
    }
  }
PathQuery is of course a lot more complex, but the basic structure seems very similar. One thing GROQ does not have yet is recursive querying of the type needed to traverse graphs, but this is on our roadmap to implement.

We've published a public draft specification for GROQ, and hoping it will be adopted by more tools. For example, we already have an open-source JavaScript implementation of it.

(Disclosure: I work on GROQ at Sanity.)

[1] https://github.com/sanity-io/GROQ

[2] https://www.sanity.io/


It seems PathQuery's entire novelty is better handling of recursive querying for query optimization, but "details are admittedly not discussed herein".

So the paper has a sort of weird feel. It goes like this: PathQuery is a good language, because it has good semantics and it is optimizable. Some words on its good semantics. We don't discuss how it is optimizable, but trust us, it is.


From what I can tell, PathQuery is also clever in how queries express the shape of the returned data, which we in GROQ call projection. For example, in GROQ you can do:

  *[type == "user"] {
    id, name,
    "slug": id + "-" + lower(name),
    "photos": photos[] {
      url, width, height
    } | order(position),
    "newestComment": *[_type == "comment" && author == ^.id] | order(createdAt desc)[0] {
      id, body
    }
  }
PathQuery seems to give you similar tools in transforming data as part of your pipeline, which I think is how query languages should be like.


Curious that there is no mention of Datalog [^1] for graph traversal in the paper.

To support multi-locale names in Datomic or Datascript-flavoured Datalog attractions_v1.pq would be written as (where :entity/locale is of type :db.type/ref):

    (d/q [:find ?id ?name ?lang
         :in $ ?type
         :where
         [?e :entity/type ?type]
         [?e :entity/locale ?loc]
         [?loc :locale/name ?name]
         [?loc :locale/lang ?lang]
      db #{"museum" "theme_park"})
    => (["/z/38dwfnb8" "Museum of Modern Art" "en"]
        ["/z/38dwfnb8" "Museo de Arte Moderno" "es"]
        ...)
You could store the locale inside the name string, but then you can't efficiently filter on it. If you want the whole shebang, can do `(pull ?id [*])`. It's not obvious in their example if the locale filtering is post-query or part of the unification.

With only 4-tuple EAVT-style indices, you need an intermediate join to support compound values like ["Hello" "en"] if you want to query against the components, so I have been toying with making my own graph DB with 5-tuple (or more) extensible indices, e.g. EAXVT, where X would be the locale, but then index order matters.

[^1]: http://www.learndatalogtoday.org/


Clojure's Datalog ("Clojurelog") basically looks a whole lot like SPARQL written using Clojure data structure literals and they do reference SPARQL in the paper.


Can someone explain why they did not mention SPARQL CONSTRUCT statements? I struggle with the notion that SPARQL isn’t “graphy” per their assertion.

I may be missing something though.

https://www.w3.org/TR/sparql11-query/#construct


I feel like graph languages can be so esoteric. It would be nice to have something closer to RDF, but with filter syntax, and the ability to group filters.

I've never built a language, so maybe I'm just crazy, but what's wrong with, say,

https://gist.github.com/insanitybit/cd997e8d367889708edc3ec2...

Basically I defined the subject in a sort of 'namespace' of constraints ("parent", "children", etc), and then the edge or predicate constraint on it.

I guess it's easy to just "make up" a language, and implementing is the difficult part, and then it's like "ok well then how do you solve <some unanticipated problem because idk what I'm even talking about>" but also, graphql, as much as it isn't good for a database API, is also quite straightforward and familiar looking and I feel like taking something like rdf as a basis would be reasonable.


I wrote a language that's very close to natural language for writing to and querying a RSLT, a kind of higher order graph.

https://github.com/JeffreyBenjaminBrown/hode/blob/master/doc...


I've tried using SPARQL in the past and found it to be very difficult. There are some aspects here that seem like improvements: - I really like that a query only operates on a single graph. Quads make what should be simple triple queries on a graph much harder than necessary in RDF/SPARQL. - I like the path-oriented declarative design. It seems like a declarative version of the imperative graph traversal libraries. - The inline creation of records based on state in a traversal is also intuitive. - The simple type system is a welcome change from RDF's xsd monstrosity

It still seems to be missing some things. For example, they show a recursive query, but no way to return the actual path taken, only the start and end points (I also can't figure out how to do in SPARQL); I don't see a way to explore predicate relationships; is there a method for reification?


Interesting read. I think there are some ideas in there that can be used for sparql 1.2. especially regarding external function definitions.

On the other hand the sparql in the paper is decent but not excellent. The result is subtly different from the path query one, but I think fine for example use case.


One issue I have with query languages is how poorly they interact with the "host language" . SQL requires non-composable string templating or complex, low-performance ORM's.

The only query languages that do this "right" are datalog (Datomic) and Q/K (KDB+, Shakti).


I think RethinkDB also did a great job with ReQL: https://rethinkdb.com/docs/sql-to-reql/javascript/


I think people focus much on syntax and not enough on semantics.

After working on 2-3 iterations of similar languages (sexpressions, clojure enhanced with pipe operator), my take away is to define an API (not a new language) and support multiple languages on top of it.

I picked python and the API is here:

https://github.com/adsharma/fquery/ https://adsharma.github.io/fquery/

Like SQL, DDL and DML should be separated too.

https://adsharma.github.io/flattools-programs/


Reminds me of GQL [2]. Anybody know the differences and or similarities?

[2] https://www.gqlstandards.org/


GQL is much more clear and readable while PathQuery is confusing and non semantic


Anybody know why a paper like this doesn’t include a date?


In case you're just wondering how recent it is (vs. "Why is the date not in the PDF?"), the page says the submission date was 17 Jun 2021.


Syntactically how does PathQuery compare to Gremlin?


Strange that this paper doesn't even _mention_ GraphQL.


I don't think GraphQL is used to query graph databases directly, is it? I've always understood it to be an API protocol that was built to serve some specific needs Facebook had (e.g. bundling multiple HTTP requests), not a full-on graph query language. I should probably mention that I don't really have extensive experience using it, though.


I think it is used as foundation for DQL[0], the query language of Dgraph.

But in and of itself, GraphQL isn't really built to query graph databases as you said.

[0] https://dgraph.io/docs/query-language/graphql-fundamentals/


GraphQL is basically used as SQL queries over stupid API:s. 99% of the graph DB advocates does not even understand what problem the graph DB solves. Soooo funny to argue with people when you can say that, oh, yes, we did this 20 years ago with recursive CTE:s.


you're right, on every point


Probably because GraphQL (despite it's name) isn't made for querying graphs.


Technically it is. But in a very restricted way.

Any graph, when you traverse it from a specific node, with fixed depth, and you don't explicitly work with node references, but rather node "values", looks like a tree that sprawls from that node.

Anyway, PathQuery is a completely different beast regardless.


If that counts as querying a graph then SQL is also a graph language. If anything SQL is more advanced because it can query a hypergraph.


SQL is neither a graph or hypergraph query language, because relations don't define the graph (or hypergraph). The query does. Deriving a graph and querying a graph are two different things.

A graph query language implies you're querying a graph with specific edges, not ones constructed on the fly from the query.

GraphQL's goal isn't to be "advanced" either. Its goal is to allow access to any data structure (be it backed by graph, RDBMS, file system, NoSQL, etc. etc.) without burdening said structure with capabilities that are not characteristic to it.

Like, if you query a graph, the idea the query may provide any arbitrary join expression would completely drive that graph's performance into the ground.


I can get behind the idea that SQL is a bit too powerful to allow it to talk to other databases easily.

And not automatically joining by foreign key is a mistake that's going to haunt us for a while I bet.

However when we're arguing capability then it's just as capable of querying a graph as graphDB is. More if the version of SQL supports recursive joins.


The thing I can't seem to get across to you, is that you don't query a graph by specifying a join expression. Graph databases aren't built for that. So no, SQL is not "capable of querying a graph". It's capable of deriving a graph out of relational data. Two completely different things. It's a bit like confusing a restaurant's chef with the restaurant's customer. They both have a meal in common, but one is consuming what the other is producing.

And no, arbitrary join expressions are not a feature that "haunts" SQL databases, because unlike graph databases, relational databases ARE built for that, and it's one of the primary reasons SQL databases are very resilient to change in face of constantly changing ad-hoc query requirements. And it's an important feature of relational algebra that is used every day by countless applications.

SQL and GraphQL serve different purposes at different application layers. Both do precisely what they have to do. The fact they're a bit similar is not coincidental, but also they're not mutually replaceable.


Sure it is. That's a major component - you can perform joins (edge traversals) across backend APIs.


GraphQL isn't really a comparable language, and PathQuery predates it by at least 3 years. While the paper is recent, GraphQL quite probably made zero to low impact on its topic.




Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: