Graphviz has its own foundational graph library that isn't used by any other project. It has its good and bad points.
Based on that experience, we had our very own bout of second-system syndrome.
We decided our graph library should be modular, type safe, and efficient. (These properties came up in the comments here, too.) This is probably just a variation of "good, fast, cheap - pick any two."
By modular, we mean we want to write collections of graph algorithm libraries that are developed and even compiled independently.
By type safe, we mean we want to detect programming errors during compilation, or link time at the latest. We don't want programs to throw runtime errors like "your node does not have a color attribute".
By efficient, we mean that accessing an attribute of a graph is as cheap as a field in a C struct. (So, we don't want to carry around an external hash table, or do a lot of string conversions, for instance.)
You can argue about whether these things are worth the price or even make sense, but that's what we wanted. We had some famous C++ creators in our lab, so we figured we could get help, and we were willing to give C++ another chance.
Gordon Woodhull, who had been an intern and kept working for us, is a brilliant programmer, and wrote an implementation of this kind of graph library working in templated C++. It's even published at https://www.dynagraph.org/ via sourceforge. The rest of us were not really sure we could ever understand how the code worked, so we had a code review with said famous C++ inventors. There were a lot of screens of code, and chinstroking, and greybeards pronounced "That would probably work." We knew we might have gone over the complexity cliff. (Let's not even talk about compile-time template errors, where one error fills an entire screen with details that only a... C++ inventor could love.) It was our fault, not anyone else's, and Gordon kept plugging away and even made all the dynamic graph layout stuff work, in Microsoft OLE, too. In hindsight it was probably our own Project Xanadu. While we got lost in this, a lot of things like Gephi (Java) and NetworkX and NetworKit (python) happened.
Also, John Ellson, a very talented software engineer who had written parts of Graphviz, revitalized the main effort.
As someone who did a lot of work with graphs, "why don't programming languages have a built-in graph data type?" is a question I’ve been asked a million times.
I'm thrilled I'll be able to point folks to a much more in-depth analysis like the one here, instead of saying some variation of "it's really hard to do it well" and having them just take my word for it.
> "why don't programming languages have a built-in graph data type?"
What I find a little funny about that question is that people miss the fact that there isn't even a tree data structure in most languages. Most languages have static arrays, dynamic arrays, linked lists, and... that's it as far as structural types go. Everything else (BSTs, hashtables, etc.) is some semantic abstraction that hides some capabilities of the underlying structure, not a purely structural representation.
Typed functional languages like Haskell (data)/ML (type) do have a built-in way to define new tree types, and so does Rust (enum). It's one of the biggest things I miss when I'm not using these languages, especially when combined with some metaprogramming (deriving/derive) to get some functions defined for these new types very quickly.
I think we might be speaking past each other? How is enum in Rust a tree type? You might be able to use it to create a tree type, but that's no different from using struct to make struct Tree : std::vector<Tree> {}; in C++. That wouldn't mean C++ has a tree type, it just means it's not hard to create your own. Whereas std::list is actually a linked list type that's already there.
Well, std::list is something you can create with C++ code (and likely is in most implementations of C++?), it doesn't need any special support. So I can see why someone might not treat std::list as any more special than a tree datastructure supplied by a different library?
However, algebraic data types really make your life easier, and more languages should have them.
I used to think that since graphs are such a broad data structure that can be represented in different ways depending on requirements that it just made more sense to implement them at a domain-ish level (the article mentions this in the "There are too many implementation choices" section).
Then I saw Petgraph [0], which is the first time I had really looked at a generic graph library. It's very interesting, but I still implement graphs at a domain level.
Yeah this is exactly how I think of it. I think "graphs" are just at a different level of the abstraction hierarchy than the data structures they're often grouped together with.
This is even further up the abstraction hierarchy, but to illustrate the point, nobody really wonders why languages don't ship with a built-in database implementation. And it's the same basic reason as with graphs; one size doesn't fit most.
Well, the original article actually describes relations as a great way to model graphs, and suggests that your language (or its standard library) should ship with a good datastructure for relations.
You would get most of what you need for a simple relational database this way.
For importing and exporting data, it often makes more sense to use something like a table or a tree rather than a graph. (Like a CSV file or a JSON file.)
So it's not clear what the interface would do. What methods should there be? Again, there are too many choices, and a Graph interface often isn't the best way to represent a view of some subset of a graph.
To add to this, recursively enumerable is the same as semi-decidable, and that only gets you to finite time.
One of the big reasons to fight to make dependencies DAG-like is that exhaustive search gets you to exponential time.
NP-complete and NP-hard problems are easy to run into with graphs.
Graph k-colorability, finding Hamiltonian cycles, max cliques, max independent sets, and vertex cover on n-vertex graphs aren't just in NP; they have no known sub-exponential time algorithms (assuming the exponential time hypothesis).
Btw, solving practical instances of NP problems is often not all that bad in practice. Even solving them to optimality.
But you need to move away from writing your own solvers. Instead you use a library that lets you describe your problem, and then throws off-the-shelf solvers at them. See eg https://developers.google.com/optimization
That's a good approach even for problems that are in P, because minor changes in the business logic requirements often only translate into minor changes in the programmatic problem description, but would translate to major changes in a bespoke, custom algorithm to solve them, even if everything stays in P.
It also separates the description of the problem from the solution. In the real world, the business logic requirements are seldom written down explicitly somewhere, and are only available implicitly as described by the code. So without this separation, it can be hard to disentangle what's a real requirement, and what's just something your custom heuristic algorithm happens to spit out.
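To make that concrete, here is a rough sketch of the "describe the problem, let an off-the-shelf solver do the work" style using OR-Tools' CP-SAT Python solver (the tiny graph and the choice of 3-coloring are invented for the example):

from ortools.sat.python import cp_model

edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
nodes = {v for e in edges for v in e}

model = cp_model.CpModel()
color = {v: model.NewIntVar(0, 2, f"color_{v}") for v in nodes}   # 3 colors available
for u, v in edges:
    model.Add(color[u] != color[v])    # adjacent nodes must get different colors

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print({v: solver.Value(color[v]) for v in nodes})
else:
    print("no 3-coloring exists")

A new business rule (say, two particular nodes must share a color) becomes one more constraint in the model rather than a rewrite of a hand-rolled algorithm.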
Many graph problems also end up having lower bounds, being subject to conjectures like SETH or 3sum.
SETH implies that some problems have no sub-quadratic algorithms, and several other graph problems have (conditional) cubic lower bounds.
Many real systems are often saved because it is actually hard to write code that isn't primitive recursive.
Cycles are often what destroy that, as the fact that WHILE and GOTO are what separate primitive recursive from general recursive functions helps show.
If you consider NP as second-order queries where the second-order quantifiers are only existentials, that will help explain why heuristics (educated guesses) help.
> Many graph problems also end up having lower bounds, being subject to conjectures like SETH or 3sum
Those are lower bounds on worst case instances. Not lower bounds on solving typical, practical instances.
> If you consider NP as second-order queries where the second-order quantifiers are only existentials, that will help explain why heuristics (educated guesses) help.
> A graph data type wouldn't have those heuristics.
Sounds like your heuristic for why heuristics help with many NP problems is less than helpful here.
In practice, you can encode many graph problems as eg SAT or integer programming or SMT etc and get good performance.
Even biggish instances of eg the traveling salesman problem are often solved well in practice.
I'm not sure why you bring up primitive recursive functions? Primitive recursion is able to express all of NP (and much more), so it's not much of a constraint in this discussion? (I agree that you have to try hard in practice to go beyond primitive recursion but stay finite.)
I had an opposite experience. I was doing some new-to-me graph work in tcl and had just assumed the tcl standard library wouldn't have any graph algorithms. Come to find out I was wrong, and ended up saving myself a bunch of time not having to re-invent the wheel:
> You can't really do that with a graph. Maybe if you offered a bunch of graph types.
And so why isn't this the solution?
Most languages support both hash map (fast lookup) and balanced tree (ordered entries) primitives, even though they both implement the "associative map" container type.
Can't we have 2, 3, 5, or even 8 different graph types?
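As a sketch of what that could look like in Python (the class names and the choice of these two representations are mine, purely for illustration): two small graph types with the same surface API but different trade-offs, much like hash map vs. balanced tree for associative maps.

class AdjListGraph:
    """Adjacency-list graph: cheap per-edge storage, fast neighbor iteration."""
    def __init__(self, n):
        self.adj = [[] for _ in range(n)]
    def add_edge(self, u, v):
        self.adj[u].append(v)
    def neighbors(self, u):
        return iter(self.adj[u])

class AdjMatrixGraph:
    """Adjacency-matrix graph: O(1) edge test, O(n^2) memory; suits dense graphs."""
    def __init__(self, n):
        self.m = [[False] * n for _ in range(n)]
    def add_edge(self, u, v):
        self.m[u][v] = True
    def neighbors(self, u):
        return (v for v, present in enumerate(self.m[u]) if present)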
It's also a bit different in that every insertion entails at least one allocation, and every lookup entails pointer-chasing through at least 3 dereferences. Java's lack of proper value types really kneecaps their hashmap implementation in many practical performance scenarios...
This is a super naive take, but I would consider the pointer to be the native graph type. What is wanted is not a graph type but the tooling to traverse graphs.
Generalized pointers (RAM addresses, disk storage locations, network locations, etc.) would be the general way to implement explicit literal graphs.
Other graphs, that are partly or wholly computed or recomputed as needed from other relationships, could be considered "implicit" graphs and can be implemented many other ways.
The fact that graphs can be any combination of literally or dependently defined, statically or dynamically defined, would add even more complexity to any truly general graph library.
Take Rust. Tree edges use Box, and DAG edges use Arc or Rc. Rust doesn’t really help with building non-DAG graphs as native data types - you can drop back to using “handles” for more generic graphs.
And I actually think that is kinda fair. Languages and standard libraries should support trees and DAGs out of the box. Perhaps Hillel is right that more complex graphs should be in a user-space library or something.
Hmm, I think you're taking the title of the article a little too literally, and focusing on just the "data type" aspect. Yeah, the article itself indirectly agrees with you that pointers are often how graphs are implemented: "Graphs are a generalization of linked lists, binary trees, and hash tables," which are data structures often implemented with pointers.
Yeah, the article is really saying what you're saying, that building a graph library ("tooling") is very complicated.
But, maybe I'm misunderstanding what you're saying??
You can use a pointer when building a graph data structure. You can also use numeric indices into arrays. Or you can store your graph in an entirely different way.
You are right that the data representation itself is only one small part of a data structure, the operations on that representation are also really important.
It's a bit like saying that a byte is a native number type, providing comparison to zero and addition as the only operations, and asking to figure the rest yourself.
I mean, it's technically sufficient, but I'd not call it proper support on a language level.
I actually brought this hot take up in my conversation with Hillel -- something along the lines of "void * * is the canonical native graph representation in C."
I think it's because you have to think it through when they get too big for one machine. A lot of those so-called "NoSQL" databases are really meant to represent graph databases (think DynamoDB) in an adjacency list. I've had success with Gremlin and JanusGraph, but it's a really messy space. It's not a programming language problem in my opinion, but a distributed systems one.
Python has one, as mentioned, but as a non-CS person I never encountered any problem in my programming life that so fundamentally called for graphs that I needed one, nor did I have so much fascination with graphs that it was a tool I wanted to force onto my problems.
I guess it is just that there are many problems that can be solved incredibly well without graphs and not that many where graphs outshine everything else so clearly that people would use them.
That being said, convince me of the opposite and I am happy.
What kinds of projects did you build? I'd risk saying that it's more likely you just didn't spot where graphs could be useful, or you implemented a graph not recognising it as such.
Even if you work with pure webdev - the least CS-requiring field of all programming, if you work with DOM, it's a tree/graph structure, or if you work with a sitemap - it's a graph.
From my experience working with non-cs programmers, they tend to struggle with finding solutions to problems that would be solved instantly with graph structures. Or they write suboptimal code because they end up doing BFS, where they should DFS, or brute force when they could use a specific graph algorithm.
I think this is because a graph is not a data-structure nor a data-type. It is really an abstraction.
Fundamentally, all I need to define a graph is a set of vertices v \in V and a function Neighbors(v). And that really is all that is needed for the most foundational set of graph algorithms.
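A minimal sketch of that claim in Python (the dict-of-lists "graph" below is just one arbitrary choice): breadth-first search never needs to know what a graph is, only how to get neighbors.

from collections import deque

def bfs(start, neighbors):
    """Return the set of vertices reachable from `start`, given only a neighbors function."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbors(v):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

adj = {1: [2, 3], 2: [4], 3: [4], 4: []}   # the underlying representation can be anything
print(bfs(1, lambda v: adj.get(v, [])))    # {1, 2, 3, 4}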
Everything else is case-by-case constraints. Does A->B imply B->A? Is the node set partitionable with certain constraints? Are there colors? Labels?
To make things even more general I can go up one level and consider the hypergraph. In which case I just have a set of vertices, and a set of sets of vertices. This can be represented in a myriad of different ways depending on what you are interested in. Of which (non-hyper) graph is simply a special case.
An alternative way to think about it, perhaps from the database perspective, is that it's a query optimization and indexing problem. Depending on what questions you want to ask about the graph, there will be different ways to represent the graph to answer the question better. Just like there is not one way to represent the abstraction called "Table", there is not one way to do "Graph" either. It really depends on the questions you care about.
Yes, graphs are ubiquitous because they are so abstract. They live on the same level of abstraction as pure numbers. There are useful "numerical" libraries that exist, and by analogy I think you could say there are also useful "graphical" libraries that exist. But we don't really have "number" libraries, and we don't really have "graph" libraries, because those concepts are a bit too abstract to write APIs against.
it's true that numbers are very abstract, which is what makes it so easy to design apis for them
the python runtime includes four built-in number types (small integer, arbitrary-precision integer, float, and complex) and the python standard library includes two more number types (decimal and fractions), and one of the most popular non-standard libraries for python is numpy, which provides some other kinds of numbers such as single-precision floats, vectors, and matrices. other systems like pari/gp have number libraries that provide other kinds of numbers, such as p-adic numbers and galois field elements
the only programming languages i've ever used that didn't have 'number' libraries were esoteric languages like brainfuck and the lambda calculus
numbers have all of mathematics as a background which is what makes it so easy to design apis for them
graphs are a much newer development, I think there's a very deep connection between category theory and graphs in general (and also computers make both much more useful somehow)
lambda calculus can be used to define numbers but it's a wonky construction, it's reminiscent of how sets can also be used to define numbers.
Maybe the naming would be a little weird for that use case, but they didn't specify what the output of `Neighbors(v)` is; I don't see any reason why it couldn't return a multiset or a relation (w, c) where `c` is the number of connections between `v` and `w`
Returning the same neighbor multiple times kind of misses the point. The point was that you need to return edges (not neighbors) because the edges connecting the same neighbors can be different.
Like, imagine two train lines between the same pair of stations. Or two roads between the same intersections. They might have different travel times, costs, etc.
I still don't see how this couldn't work with a `Neighbors(v)` function with an unspecified return type. The outputs I gave were examples of how it could be adapted for various use cases, not an exhaustive list of all possibilities; in the example with multiple edges with multiple weights, the output could instead be a relation of (v2, w, c) that indicates that v connects to v2 with weight w with c as the number different edges between the two with that weight.
In all of your earlier examples (and actually, including the current one too), you're treating vertices as first-class objects, but edges as second-class. There's no way to even identify an edge in your formulation -- only vertices.
This is a common oversight in graph structures that ends up being very annoying in many applications. You keep trying to work around it by associating the edge's properties with those of the vertex pairs and hoping that's sufficient for the application, but I'm trying to point out that the abstraction you're implicitly dancing around -- and the one that many practical uses need -- is actually one that treats edges as first-class. In fact, I would go further and say that if anything should be second-class, it ought to be the vertices, since they're already implied by the edges. (That is to say, for many practical applications of graphs, an edge determines its endpoints, but the endpoints don't determine the edge.)
I'm not clear what an abstraction that makes edges first class but vertices second class looks like. An edge connects two vertices, so if two edges connect to the same vertex, what does this look like?
The most minimal example of it I can think of would be a little ridiculous, but it could look like this:
interface Edge { }

interface Graph {
    List<Edge> getRoots(); // Returns some ~minimal set of edges with connectivity to all the others
    List<Edge> getAdjacentEdges(Edge e, boolean tail);
}
There's no way to directly refer to a vertex here at all (unlike with edges), and yet (since edges have identity) there's enough information to determine the graph structure!
What's an example of an algorithm that could use this sort of interface? All of the algorithms that immediately come to mind require more vertex information.
Well to be fair, that constraint is also part of the mathematical definition of a graph, where the set of edges E is a binary relation over vertices V (i.e., a subset of V x V). You'd need either a multiset or a labelled relation (i.e., a subset of V x L x V for some set of labels L) to overcome that.
"Definitions in graph theory vary. [...] A multigraph is a generalization that allows multiple edges to have the same pair of endpoints. In some texts, multigraphs are simply called graphs."
It's a bit disingenuous to skip over the Graph subsection of that article, right after the "definitions vary" line:
> A graph (sometimes called an undirected graph to distinguish it from a directed graph, or a simple graph to distinguish it from a multigraph) is a pair G = (V, E), where V is a set whose elements are called vertices (singular: vertex), and E is a set of unordered pairs of vertices, whose elements are called edges (sometimes links or lines).
An unqualified "graph" is almost always this one—a simple, undirected graph. If you mean something different you almost always need to use one of the more specific names to be clear.
Sorry, I didn't intend to make it personal, I was just pointing out that the very next paragraph after the chunk you quoted included the definition of "graph" that lou1306 was referring to, almost verbatim.
Definitions sometimes vary, but lou1306 is correct on the merits that the most widely accepted definition of an unqualified "graph" states that "the set of edges E is a binary relation over vertices V (i.e., a subset of V x V)".
You're pulling in context from ylow's post that isn't relevant to this subthread. I'm not defending ylow's definition, I'm defending lou1306's.
Here's the first few parts of the chain of thought of this subthread:
ylow> Fundamentally, all I need to define a graph is a set of vertices v \in V and function Neighbors(v).
You> Even that is severely overconstrained. It doesn't allow multiple edges to the same neighbor!
lou1306> Well to be fair, that constraint is also part of the mathematical definition of a graph ...
You> There is no "the" definition. From Wikipedia ...
You explicitly were only replying to the portion of ylow's comment that was about vertices and a neighbors function, and lou1306 was replying to your assertion that vertices+neighbors was overconstrained because it wouldn't allow multiple edges. All I'm saying is that lou1306 is correct in their definition of a graph. If that means that both you and ylow are wrong, that's fine with me!
> All I'm saying is that lou1306 is correct in their definition of a graph.
I never claimed otherwise. I explicitly said the opposite - there are multiple correct definitions. That's literally one of the reasons why there's no general purpose graph type - there are multiple definitions with different properties, all of which are referred to in various contexts as "graphs".
> If that means that both you and ylow are wrong, that's fine with me!
This gives off very strong "somebody is wrong on the internet" vibes...
All I said was (a) in the context of the current discussion (not decided by me!), graphs were already assumed to encompass more than the vanilla undirected V x V definition people are pointing me to, and (b) in that context, one more example (supporting the parent's point that I was replying to!) was graphs with multiple edges. All of which seems quite uncontroversial, true, and in line with the context of the parent comment I replied to. I have nothing to add.
I seriously don't get where the desire to die on this hill is coming from, but I don't share it, so I won't keep this going.
1. For simple and small graph problems, a simple vector-of-vectors adjacency list is easy enough to code up.
2. For complex and huge graph problems, the only way to get performant solutions is to tailor the graph implementation to the specific details of the problem to be solved.
And it's hard to see what kind of language support would help, other than just having a super-smart compiler which could analyze the code and determine whether an adjacency list, matrix, 3d array, etc. was the best way to implement it. That's the kind of optimization which we won't see in compilers for a while.
It's another instance of the phenomenon which Stroustrup noticed: we are really good at code sharing of small things like vectors, and of large things like operating systems. It's the middle-sized problems we are bad at.
> we are really good at code sharing of small things like vectors, and of large things like operating systems. It's the middle-sized problems we are bad at.
Interesting. But I am not sure we are good at sharing small things - every programming language has its own implementation of vectors. Within one language ecosystem, the API of a vector is small, and that's probably what makes it easy to share.
For operating systems, the API is relatively small compared to the internal complexity of the OS. This is also true for libraries for numerical problems, which are also easily shared. But the more you want to customize things (e.g. share a complicated data structure), this complicates the API and inhibits sharing.
So it seems this is determined by the surface area (relative size of the API) of the thing being shared.
Well, we could always be better at sharing small things; but recall, the comment was made by Bjarne Stroustrup, and he probably thought that he had pretty much nailed the vector by that time :-)
The point of the OP is a bit broader than that: for something like a vector, we have at least figured out some language features which would help a programmer make an efficient and generic implementation. Templates are not great, but at least they are something.
For graphs, we don't even have that. What kind of built-in graph support would work for pathfinding in a video game, or the internet, or a social networking graph a la Facebook, or a routing graph for a 100-million-transistor chip...?
We are getting better at abstraction all the time, but abstracting across all these kinds of applications is something which eludes us. It's really hard to see how you could give a programmer anything which would actually save him some time.
> every programming language has its own implementation of vectors
Many programming languages have more than one implementation of vectors. Turns out you want tiny vectors stored on the stack, and big vectors stored on the heap...
> And it's hard to see what kind of language support would help, other than just having a super-smart compiler which could analyze the code and determine whether an adjacency list, matrix, 3d array, etc. was the best way to implement it. That's the kind of optimization which we won't see in compilers for a while.
I’m not so sure? Looking at an algorithm against an abstract graph type, then filling in the implementation to optimize for the particular algorithm, seems right in the wheelhouse of code-specialized LLMs.
My experience with Cypher is that the query optimizer doesn't know enough about the graph to pick up on trivial optimizations. This can't be fixed without a way to tell the optimizer about those properties, and even just designing a language to tell the optimizer those things is difficult.
Just an LLM looking at your query isn't going to cut it. It will need to take your actual data into account.
Good point. The game has really changed in terms of what kinds of programs we can write now. Perhaps it's too pessimistic to not expect these sorts of optimizing compilers soon.
Electric Clojure uses Clojure itself (s-expressions) as a graph authoring syntax, using a macro to reify dataflow through a reactive client/server system (here the use case is full stack user interfaces but the idea generalizes) https://github.com/hyperfiddle/electric (I'm the founder).
IMO, the answer to the question "Where are all the graph types?" is: the graph authoring DSL needs to express scope, control flow and abstraction, which essentially makes it isomorphic to a programming language, freed of its evaluation model. In Python and Typescript, embedding a complete programming language is something that's rather hard to do!
I think the post mostly answers the questions "why are graph _algorithms_ not better supported in programming languages", with a focus that is much more on "big data" graph processing than graph support in general.
I think if you look at graph support in general you are also looking at wider questions, like "why are OGMs (Object Graph Mappers) not as popular as ORMs" and "why is JSON so prevalent while RDF (or another low-level graph serialization) isn't"?
And I think in the end it comes down to historic reasons (RDF emerged a bit too early, never evolved, and accrued an ecosystem of horrible academic standards and implementations), and to graphs having just a smidge more inherent complexity in implementation and learning curve that just doesn't scale well across many developers.
------
I also wouldn't put too much weight on the "Graph Querying Language" part of the article. It sadly reads like exactly the marketing copy you would read from Neo4J or SPARQL enthusiasts that haven't tried building a product on top of it.
> The main difference between all GQLs and SQL is that the “joins” (relationships) are first-class entities.
Joins are first-class entities in SQL. They even have their own keyword (hint: it starts with J and ends with OIN) ;)
If you also go to the lower levels of any graph query language and look at its query plans, you'll notice that there isn't any meaningful difference from what you'll find for an SQL-based query. The standardization of GQL[0] as an SQL extension is evidence for that.
> In SPARQL relationships are just edges, making the same query easy.
SPARQL is easy if you need to do exact path traversals. If you try to do anything sophisticated with it (like you would in the backend of a webapp), you'll quickly run into footguns like joins with unbound values, and you accidentally join your whole result set away.
> Joins are first-class entities in SQL. They even have their own keyword (hint: it starts with J and ends with OIN) ;)
Having its own keyword is pretty strong evidence that something isn't first-class (e.g. typeclasses are not first-class in Haskell; control flow is not first-class in most programming languages).
I think OP is using “first class” as in “an explicit semantic affordance”, and you seem to be using “first class” as in “a supported operand” (or similar, feel free to correct me if I’m misunderstanding). In which case both points are right, albeit orthogonal.
JOINs, and joins in RECURSIVE queries at that, are the heart of "graph" databases, so SQL RDBMSes generally do it fine, but... without syntactic shortcuts. A graph QL is all about adding those syntactic shortcuts.
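For illustration, here is reachability written as a recursive join in SQLite (the table and column names are made up for the example); a graph query language mostly gives you a shorter way to spell this:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d'), ('x','y');
""")
# Everything reachable from 'a': a recursive self-join on the edge table.
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT node FROM reachable
""").fetchall()
print([r[0] for r in rows])   # ['a', 'b', 'c', 'd']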
Graph drawing tools are also very underwhelming. They work pretty well for small graphs, but once you have something like 500 nodes or more their output becomes completely incomprehensible or very difficult to look at. They miss the ability to automatically organize those graphs into hierarchical structures and to provide a nice interface for exploring them; we are used to everything around us having some kind of hierarchy. I think that is the same kind of problem that will need to be solved in order to have a generic graph data type. This kind of thing will also need to be implemented at the compiler level, where those generic graph algorithms are adapted to the generated hierarchy of structures; and if you add a theorem prover that can check that a certain subgraph will always have a certain structure, you can statically generate those procedures, while for the other supergraphs those methods will be generated dynamically at runtime.
So whoever solves the problem of generic graph drawing will have the ability, or the insight, to implement this too.
Some algorithms do better at this than others, but "make a good diagram of a graph" is an intelligence-complete problem in the general case. Two people might render structurally-identical graphs in very different ways, to emphasize different aspects of the data. This is in fact a similar problem to the "generic graph algorithm" and "generic graph data structure" problems.
Graphs straddle the line between code and data. For instance, any given program has a call graph, so in a real sense, the "generic graph algorithm" is just computation.
> we are used to everything around us having some kind of hierarchy
I think the problem is more that we are used to the illusion/delusion that everything is hierarchical. The problem we then encounter with graph drawing is that it has to try to reconcile the fact that things in practice are rarely really hierarchical, and it's hard to draw the lines of where the hierarchies are with mathematical rigor. And that problem gets worse and worse the fewer properties you are allowed to assume about the underlying graph structure (connectedness, cyclic/acyclic, sparse/dense).
In practice, when you want to build a UI that interacts with graphs, it's often feasible to determine/impose one or two levels of meta-hierarchy with which you can do clustering (which reduces the layout-destroying impact of hairball nodes and improves rendering performance by reducing the node count) and layout with fCOSE (Cytoscape.js has an implementation of that).
On the core observation "there are too many implementation choices", that is not quite right. True, the author mentions 4, and there are further variations. In practice, a library can:
1. Implement all suitable graph representations.
2. Implement algorithms tailored to the representation(s) that offer the highest performance.
3. Provide transformations from one representation to another. This is O(#representations), trivial to implement and trivial to use. Quite fair workload for both maintainers and users.
4. Bonus, provide import / export transformations from / to common standard library datatypes and idioms.
Memory and transformations are cheap, 99% of use-cases would likely find the overhead of transforming data, both in RAM and CPU, negligible.
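Point 3 really is mechanical. A sketch in Python of two such conversions (the choice of edge list, adjacency list, and matrix here is arbitrary):

def edge_list_to_adj_list(n, edges):
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    return adj

def adj_list_to_matrix(adj):
    n = len(adj)
    mat = [[False] * n for _ in range(n)]
    for u, vs in enumerate(adj):
        for v in vs:
            mat[u][v] = True
    return mat

edges = [(0, 1), (1, 2), (2, 0)]
print(adj_list_to_matrix(edge_list_to_adj_list(3, edges)))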
Sounds like the makings of a huge library that I’m not sure I’d even use in my work. I use graphs heavily in my work, and my experience matches the people the author interviewed.
We always end up reimplementing graphs because:
- Performance matters, and no off the shelf graph library I’ve seen can take advantage of many of the regularities in our particular data set. (We have an append-only DAG which we can internally run-length encode because almost all nodes just have an edge pointing to the last added item).
- I haven’t seen any generic graph library which supports the specific queries I need to make on my graphs. The big one is a subgraph diffing function.
- Writing something custom just isn’t much work anyway! Graphs are way simpler to reimplement than btrees. You can have a simple graph implementation in tens of lines. Our highly optimised library - with all the supporting algorithms - is still only a few hundred lines of code.
I think it would be handy to have ways to export the data into some standard format. But eh. I think pulling a library in for our use case would add more problems than it would solve.
What do you mean by “subgraph diffing”? I work with graphs a lot and use SQL almost all the time. Sometimes I need to compute connected components with python.
I have a DAG. I can then define a proper subgraph from some set of nodes {A, B, C, ...} such that the subgraph contains all transitive dependencies of that set of nodes. Given two sets of nodes X and Y, I want the set difference between the subgraphs of nodes defined by X and Y (and all of their transitive dependencies). So, what nodes exist in the subgraph of X but not in Y, and vice versa?
Ie, if we have the graph { A -> B, A -> C } then the diff between {A} and {C} is ({}, {C}). And the diff between {B} and {C} is... well, ({B}, {C}).
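If I'm reading the example right (edges stored so that each node maps to the nodes it depends on, which is what makes the results come out as ({}, {C}) and ({B}, {C})), a naive sketch of that diff looks something like this:

def closure(deps, roots):
    """All of `roots` plus their transitive dependencies."""
    seen, stack = set(), list(roots)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(deps.get(v, ()))
    return seen

def subgraph_diff(deps, xs, ys):
    cx, cy = closure(deps, xs), closure(deps, ys)
    return cx - cy, cy - cx

deps = {"A": set(), "B": {"A"}, "C": {"A"}}   # the graph { A -> B, A -> C }
print(subgraph_diff(deps, {"A"}, {"C"}))      # (set(), {'C'})
print(subgraph_diff(deps, {"B"}, {"C"}))      # ({'B'}, {'C'})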
I've often wondered about a missing application: "Excel for graphs".
Just like Excel for tabular data, it would support RAM-sized data (enough to require a computer, but not so much that you need a data center), implement lots of algorithms and visualizations "well enough", and require no programming skill to operate.
As the article says, a lot of our real-world problems are graph problems - why are programmers the only ones who should have the tools to solve them?
Yeah, I feel like the article is too quick with its conclusions. Many other problems can be made arbitrarily complex and difficult with additional requirements. But there are still data structures and standard libraries that provide a good-enough experience for most use cases. And if you need something extra spicy, you build a custom solution.
The article claims that graphs are often just too big, but yeah, if you ask people who are actively working on graph algorithms they might have that sort of experience. But most programmers and users probably only work with really small graphs.
> As the article says, a lot of our real-world problems are graph problems - why are programmers the only ones who should have the tools to solve them?
I think programmers and mathematicians are the only ones that model these problems as graphs. I doubt a casual person sees graphs in random real world problems.
And something I learned working in a big corporations, everything can be an excel spreadsheet if you try hard enough.
> As the article says, a lot of our real-world problems are graph problems
The article struggles to back that up though. Eg, it notes that the internet can be modelled with a graph. Undeniably true. But so what? The internet can be represented as many different things and it is unclear that representing it as a graph has any generically useful engineering implications. There is an argument that is just as good that representing the internet as a neural-network (ie, a black-box matrix-encoded function of arbitrary inputs to coherent outputs) is the ideal representation for getting useful info out of it.
Maybe for someone like Google that is a billion-dollar idea (even then though, it might not be - I don't know if they represent their index as a graph or not). But the internet overall isn't much of a graph problem to many other people, and representing it as a graph doesn't solve much.
It is rare to see someone solving a real-life problem on paper as a graph. Using tables happens all the time. Graphs are common, graph problems are uncommon.
Another comment in this thread is about how hard graphs are to visualize, but a 3D interface gives you a lot more room.
When VR hype began I thought "well what's the excel of VR?". Microsoft's answer was "2D spreadsheets floating in 3D space". What nonsense. I think graphs.
email my username at gmail.com if anyone is interested in exploring this together!
> There’s a gap between how often software engineers could use graphs and how little our programming ecosystems support them. Where are all the graph types?
Erlang's briefly mentioned at the end of the article:
> There are two other languages I found with graph types: Erlang and SWI-Prolog. I don’t know either language and cannot tell when they were added; with Erlang, at least, it was before 2008. I reached out to a person on the Erlang core language committee but did not hear back.
Thanks for sharing. That looks well done. It has some pathfinding algorithms and the reducer is very neat for traversing the graph. I like that it's based on maps, which could make it more performant than the ets based erlang digraph.
I wonder if it would be possible to mathematically define (in a theorem proving language like Coq) a bunch of accessor methods as well as a bunch of implementation primitives and then "compile" a custom graph implementation with whatever properties you need for your application. Some accessor methods will be very efficient for some implementations and very inefficient for others, but every method will still be available for every implementation. Profiling your application performance can help adjust the implementation "compiler" settings.
This sounds like Partial Evaluation and the Futamura Projection. The research around that shows that your interpreter determines the shape of the compiled output, so a formal proof of its application isn't necessary, if the mix-equivalent has the appropriate syntax and semantics for graph processes in its design.
I know this has been done for procedural languages and for declarative logical languages but I'm not aware of something like this specifically for graph processing and highly specialized code generation of graph processing. I wouldn't be surprised if Mix has been extended for this already, even if it has I'm sure there is still value in it.
For example, I'd like to program against a sequence abstraction. When sort is applied to it, I hope it's a vector. When slice or splice, I hope it's some sort of linked structure. Size is as cheap as empty for the vector but much more expensive for a linked list.
It should be possible to determine a reasonable data representation statically based on the operations and control flow graph, inserting conversions where the optimal choice is different.
The drawback of course is that people write different programs for different data structures. Knowing what things are cheap and what aren't guides the design. There's also a relinquishing of control implied by letting the compiler choose for you that people may dislike.
As an anecdote for the latter, clojure uses vectors for lambda arguments. I thought that was silly since it's a lisp that mostly works in terms of seq abstractions, why not have the compiler choose based on what you do with the sequence? The professional clojure devs I was talking to really didn't like that idea.
Clojure uses vector syntax for lambda arguments. `read` sees a vector. What comes out of eval is a lambda. Does a Vector get built in the process? You'd have to check, my bet would be that the argument list spends a little while as a Java Array, for performance reasons, but that a Clojure Vector is not actually constructed.
You can do something like this with OCaml/SML's module system.
And certainly from an abstraction point of view you can do this in any dependently typed language like Idris/Agda/Coq, but these don't have great implementations.
I've been thinking about something like this: a mathematical definition of a function such that we can search for it. Imagine we had something like "Find a function that fits this signature: input arr[numbers], output such that for every adjacent pair x1, x2 in arr, x2 > x1".
I'd highly recommend Erwig's FGL library in Haskell as a really nice example of a generally performant graph data structure that is easy to work with. The API feels like working with lists, because you are essentially consing contexts (node, neighbours) onto a list of contexts that form your graph. Many standard graph algorithms are then built up from depth- or breadth-first search, and you can compose really succinct programs to manipulate the graph. Graphs are labelled with any Haskell data structure, and there is a graphviz library complementary to FGL to generate dot files which can be rendered according to the data carried in the node labels. Often in an application you want to both perform computations on your graph and render a visualization simultaneously, for the end user or for debugging, and with FGL you minimise duplication of application and rendering logic.
FGL is a great example of how to make a "nice" high-level graph interface suited for functional programming. I'm a big fan. But it's orders of magnitude too slow and memory-inefficient for performance-sensitive graph computations—if you have even moderately sized graphs and graph algorithms are a bottleneck, you'll need to use something else, and probably something domain-specific. Given the way the interface works, I don't think you could have a high-performance version that would scale to large graphs.
In my experience this leaves FGL in an awkward spot: on the one hand, it isn't sufficient for heavy-duty graph processing; on the other, if you don't need fancy high-performance graph algorithms, chances are that encoding your problem as a graph is going to be more awkward than defining some domain-specific types for what you're doing. Graphs are such a general structure that they're usually the wrong level of abstraction for higher-level domain-specific logic.
Of course, sometimes you're writing graph code specifically and you need a nice way to express your graph algorithms without worrying about performance. In that case, FGL is great. I wrote a tutorial about using it to [generate mazes][1] and it helped me express the algorithms better than I would have been able to do without it. But that still leaves it as too narrow for something to be "the" graph representation in a language's standard library.
> But it's orders of magnitude too slow and memory-inefficient for performance-sensitive graph computations—if you have even moderately sized graphs and graph algorithms are a bottleneck, you'll need to use something else, and probably something domain-specific.
This seems a little pessimistic to me. There are plenty of application domains that can be conveniently represented using graphs where you might have thousands of nodes and edges — which is what I’d characterise as “moderately sized” — and your needs might only extend to relatively simple and efficient graph algorithms. FGL is excellent in this kind of scenario.
If you do need the kind of algorithms that explode in complexity then even a representation a couple of orders of magnitude more efficient won’t help you much either. Big-O is the thing that is going to spoil your day in this story, not the constant factor. Some problems simply don’t have convenient fast solutions and ideally with those you find a way to change the representation so the original problem doesn’t arise in the first place.
It’s true that there’s also a zone where you have significantly larger graphs but still only need computationally tractable algorithms, and in that case the overheads of a library like FGL become a factor in what is viable. I also don’t disagree with you (and Hillel in the original piece) that it would be difficult to define comprehensive graph functionality to include in a standard library when there are so many different trade-offs involved.
A good — and not entirely unconnected — analogy might be calculating with matrices. It’s convenient to have support for simple but widely useful cases like 3x3 and 4x4 built into your language or standard library. However, once you’re solving systems with hundreds or thousands of rows, you probably want more specialised tools like BLAS/LAPACK, and the structure of your matrix and how you can decompose it start to matter a lot more.
Interesting. Under the hood, FGL maps the graph to relatively efficient data structures like Patricia trees, as implemented in Data.IntMap, so I would expect it to scale reasonably for inserting edges and mapping over nodes. I agree the memory inefficiency is definitely a limiting factor of the library. As you say, I think it is best suited for expressing graph algorithms, and if those calculations become the bottleneck a custom solution can be developed with the proof of concept already in place.
Out of curiosity, what number of nodes/edges would you consider a "moderately sized graph"? My use cases are typically on the order of 500-1000 nodes with a similar number of edges that require BFS and topological sorting.
I don't have an exact number in mind—I imagine it's pretty context-specific—but 500–1000 nodes seems like it would qualify.
I've played around with IntMap before and it's not a great data structure. It's a binary Patricia trie, which means that you quickly get a relatively deep tree with lots of pointer traversals. Unless I've managed to confuse myself on how it works, you'd end up with, what, at least 10 traversals to look up a value from 1000 keys?
In Haskell though I think Alga has an even nicer API. Don't know about performance as I haven't had a need to use Haskell to process enormous graphs. https://github.com/snowleopard/alga
One interesting perspective is to view the sequence of lists -> trees -> DAGs -> general graphs as a loosening of restrictions:
- list nodes may have one child
- tree nodes may have multiple
- DAG nodes may have multiple parents though restricted by topological ordering
- graph nodes may have multiple parents from anywhere in the collection
Lists and trees can be fully captured by sum and product types, but extending this representation style to DAGs and graphs doesn't work--you either get inefficiency (for DAGs) and then infinite regress (for cyclic graphs) attempting to continue the "syntactic" style of representation, or you need to adopt an "indirect" representation based on identifiers or indices or hash consing.
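A small Python sketch of that contrast (dataclasses plus a 3.10+ union standing in for sum and product types): the tree is a finite nesting of constructors, while the cyclic graph has to fall back to an indirect representation.

from dataclasses import dataclass

@dataclass
class Leaf:
    value: int

@dataclass
class Node:
    left: "Tree"
    right: "Tree"

Tree = Leaf | Node                 # a sum of products; every value is a finite tree

t = Node(Leaf(1), Node(Leaf(2), Leaf(3)))

# A cyclic graph has no such finite "syntactic" form, so we switch to
# identifiers plus an adjacency map instead.
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a -> b -> c -> a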
> Lists and trees can be fully captured by sum and product types, but extending this representation style to DAGs and graphs doesn't work--you either get inefficiency (for DAGs) and then infinite regress (for cyclic graphs) attempting to continue the "syntactic" style of representation
You also need "inductive" recursive types to represent lists and trees, in addition to sums and products.
One way of representing the type of a list of T is like:
mu X.1+T*X
(Hence sum and product types, but also inductive or "least fixed point" recursion.)
But you can also use "coinductive" recursive types to represent "processes" (or "greatest fixed point" recursion) with almost the same notation:
nu X.1+T*X
This represents a FSM which at any point yields either a termination (the "1" on the left side of the recursive sum) or a value and a continuation (the "T" in "T*X" is the value and the "X" in "T*X" is the continuation).
This doesn't answer every question about graph representation, obviously, but it's a useful tool for attacking some graph problems you'd like to represent "syntactically" as you say (though you have to think in terms of "coinduction" instead of "induction" e.g. bisimilarity instead of equality).
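Loosely speaking, Python generators behave like such nu-values: at each step they either stop (the "1" branch) or yield a value and continue (the "T*X" branch), and they are allowed to never stop. This is only an informal analogy, not a typed encoding:

def countdown(n):
    # Either terminates (falls off the loop: the "1" branch)
    # or yields a value and a continuation (the "T*X" branch).
    while n > 0:
        yield n
        n -= 1

def ones():
    # Never takes the termination branch: fine coinductively, impossible inductively.
    while True:
        yield 1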
That’s a neat idea. The other side of it might be expressiveness.
The more expressive constructs are usually more productive and concise. The less expressive constructs are usually easier to optimize or analyze with a machine. The old rule, like in LANGSEC, is to pick the least-expressive option that works. Some people also develop transforming code (eg metaprogramming) to let you write highly-expressive code that generates correct, low-expressiveness code.
I like thinking of this concept via free constructions.
As is somewhat commonly known the free Monoid is the List type; monoids are not commutative so we get a sense of "direction", like a list has a start and an end.
If we add commutativity and look at free commutative monoids, we find they are equivalent to multisets.
If we instead take associativity away and look at free magmas, we get binary trees, I think?
In some sense removing constraints from the binary operator results in more general free types. Would be interesting to find what free construction makes digraphs but I have to bounce.
Another data type that would be quite useful is a table (like in a database). For the same reasons, too many design choices.
Anyway, that being said, I have felt that progress will be made in programming languages if the compiler gets to choose an implementation of a data structure, kinda like when a database chooses an execution plan. So you just use an abstract structure (like sequence, map, set, table, graph) and based on the program profile, the compiler will pick the specific implementation. It will also transform the structure into another isomorphic one as needed. (Some programming languages already do something like this, for example, array of structs to struct of arrays conversion.)
> So you just use an abstract structure (like sequence, map, set, table, graph) and based on the program profile, the compiler will pick the specific implementation. It will also transform the structure into another isomorphic one as needed.
I'm so not looking forward to having to debug a sudden change in perf characteristics when one additional usage of some feature tips a heuristic over the line and an implementation gets swapped out between builds.
> a sudden change in perf characteristics when one additional usage of some feature tips a heuristic over
This already happens with humans, changing features will change how the product is used and thus performance characteristics changes.
The question is, do you trust the compiler to do a good job? Of course you won't; until the late 90s, people didn't trust compilers to do a better job than humans writing assembler.
So it's important to have a good UX for this feature, where the compiler communicates what data types it is using and gives humans the option to override its decisions, so that users can gain trust in this feature.
> This already happens with humans, changing features will change how the product is used and thus performance characteristics changes.
Yes, but I can usually look at the function itself for what changed, or the function it calls. I don't need to look three functions away (assuming no inheritance, which I tend to avoid).
devils advocate: Is this maybe a case of discarding an 80% solution because you can’t do the last 20%?
I understand the constraints, but imagine how legible you could make code by replacing some key parts with a graph type that everybody knows.
I honestly think that having a type that supports a small subset of possibilities and only has the simplest algorithms implemented would go a long way.
It isn't just an 80/20 problem. Imagine that you replace every linked list, binary tree, trie, etc., with a generic directed graph datatype. The resulting performance would be abysmal. The resulting notation would be horrid, too.
Here's our nice linked list:
def last_element(ll):
    last = ll
    while ll is not None:
        last = ll
        ll = ll.next
    return last
And here's an implementation with generic graph notation:
def last_element(g):
    for v, deg in g.out_degree:
        if deg == 0:
            return v
    return None
There are several problems with this; most importantly, there can be silent failures when g is not a linked list. But it also throws out a useful abstraction where a list is equivalent to a node, so I wrote a horrid implementation that takes O(n) regardless of the position in the list. And then comes all the baggage of representation, because you can't just represent a node with a pointer anymore.
When your data structure better reflects the, well, structure of your data, you can go faster and safer. There's a reason we teach undergrads about these specific datatypes and don't just sweep it all under a rug with "it's a graph!"
I think most languages support representing many kinds of graphs very well, particularly directed graphs without data on the edges: with objects, lists, and object references (or pointers).
A tree is a graph. A typical Java-style object composing other objects composing other objects again, etc etc, often with cycles and parent backreferences and whatnot, is a graph. The html DOM is a graph.
I recognize that these are often very tree-like, which feels like cheating in the same way as saying “well a list is also a graph!” is. But given that cycles are common enough that serializers (eg JSON.stringify) need to special-case those, I think maybe this is simply not true, and they’re really just graphs. Very few tree-like class structures tend to remain pure trees.
The only thing missing from references/pointers to be able to represent what the author is looking for is having data on the edges. I think this is trivially solvable by putting nodes halfway along the edge (= adding a level of indirection, an operation so common that we don’t even think of it as “adding data to the edges”).
So I think the answer is that there’s no explicit data structure named “graph” because the basic building block of composition in nearly every language (reference/pointer) is an edge, and the basic building block of data representation (objects/structs/records) is a node. So for most graphs, trying to pour it all into some fancy Graph<V, E> datastructure feels like needless complexity.
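A sketch of the "node halfway along the edge" trick in Python (the station/line names are invented for the example): plain objects and references, with the edge data living in a small intermediate object rather than in some Graph<V, E> structure.

from dataclasses import dataclass, field

@dataclass
class Station:
    name: str
    lines: list["Line"] = field(default_factory=list)

@dataclass
class Line:                      # the "node halfway along the edge", carrying edge data
    travel_minutes: int
    dest: "Station"

paris, lyon = Station("Paris"), Station("Lyon")
paris.lines.append(Line(travel_minutes=115, dest=lyon))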
Back in the C days, it was common to not use generic data structures like lists; instead you'd have a next_item pointer as a field in the struct. For linked lists, this would save you from needing another memory allocation or wrapper struct, and since C doesn't have templates you'd either have to blindly cast the type or use macros or write a type-specific iterator anyways.
Lists eventually became a standard language feature in C++ and other languages, but it's trickier for trees and graphs. Taking the DOM example, you might be searching through child elements (div, span, etc) or nodes (text nodes, comment nodes) and different operations might only work with a specific subset of the "edges". There might be pointers to other objects, like from a DOM node to accessibility tree node. You might even have multiple parent node pointers, such as a pointer that takes you to the nearest shadow root or something.
Since there are multiple ways to traverse the same data structure, generic functions don't work on it. You could create a separate tree/graph for each thing you want to use it for, but that takes additional memory and has to be updated when the original struct changes. Or you could create some kind of adapter that has a get_edges() function, but this might not be very well optimized or might be clunky for many other reasons. So it usually just ends up being simpler rolling your own functions instead of using a library.
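For what it's worth, the adapter version can be as small as a traversal parameterised by a get_edges function; a hypothetical Python sketch (the attribute names in the trailing comment are made up):

def walk(root, get_edges):
    # Depth-first traversal that is generic over which references count as "edges".
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        yield node
        stack.extend(get_edges(node))

# e.g. walk(document, lambda n: n.children) vs.
#      walk(document, lambda n: n.child_nodes)   # hypothetical attribute names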
A graph, as presented in this article, is a model of something else. That we have more than one way to implement this model is rather natural, all told. The hope that we can have a singular syntax and data structure that represents a graph in code is almost exactly why the author of Java's Linked List posted this gem: https://twitter.com/joshbloch/status/583813919019573248
My favorite: the idea of having a linked list where the node is first-class in your code is almost precisely the problem. You rarely want/need to work at that level. In a very real sense, objects that have other objects are already trees of data. Many can back-reference, and then you have a graph.
And then there is the joy of trying to use matrix operations to work with graphs. You can do some powerful things, but at that point, you almost certainly want the matrix to be the abstraction.
Excited to see someone come up with good things in this idea. I retain very serious doubts that I want a singular model for my data.
This is basically my PhD thesis proposal. I don't think there's any fundamental technological problem here; it's just that for a graph to be efficient to process you need high-level optimisations that can take mathematical properties of graphs into account. For that you need to either reimplement a compiler inside your framework or be integrated into an existing compiler, and both obviously take a lot of work.
Some comments here mention GraphBLAS, which is the big breakthrough in decoupling the layout of the graph from an efficient implementation of an algorithm, but none mention MLIR-GraphBLAS [0] which is the most promising integration into a compiler that I've seen.
I still think it's early days, I wouldn't throw in the towel quite yet :)
How will I have any expectations of run-time behavior if I have to hope that my graph will fuse or fail to fuse at run time?
Reminds me of the issues that Haskell programmers face when an innocuous change causes list fusion to fail, tanking performance; to know how to coax the compiler into fusing again you have to have intimate knowledge of how that fusion process works, which isn't visible in the API - you need knowledge of compiler implementation/behavior.
The same could be said of a lot of things. For example in hash maps, you might have a cliff in performance if your hash function is not good for your data distribution, but you'll still happily use the default ones until you're really sure that they're not the right tool for the job. I also feel like this depends a lot on your language philosophy, some languages generally accept the cliffs, some try to expose enough of an API for you to work around the cliff if you feel like you know what you're doing, like custom allocators etc.
I have some personal hunches about how to have better guarantees about these properties but I feel like it's ok for this to not be solved with the v1.
If you code in .NET, please try my graph library Arborescence [1]. It's small and not very feature-rich. However, I believe it provides a good separation between abstractions [2], algorithms, and data structures.
Regarding the points mentioned in the article:
- you can use the edges with or without their own identity,
- you can use implicit graphs unfolded on the fly,
- you can use both adjacency (out-neighbors) and incidence (out-edges + head) interfaces,
- no edge type is imposed by the library, although it does provide the basic tail-head-pair structure as a utility.
One of the complications described by the author is performance. Personally, I find stdlib graph libraries extremely useful even if their performance is poor, because it’s often the case that my dataset is small enough, and even if performance turns out to be an issue, first spending time on the problem with an underperforming graph library is a very worthwhile exercise before trying to write my own optimized implementation.
By analogy, many programming languages are far from being the fastest, but they can nevertheless be very useful.
That said, I’m not surprised performance came up in interviews with experts; they probably have tons of interesting performance-related stories to tell from their extensive work on graphs.
The C++ Standard Template Library is now 29 years old. It doesn't have a graph (or generic B-tree) data structure. That says it all for me. In my too many years of programming, I have only needed to program a graph structure once or twice. Yes, they are "ubiquitous in software engineering", but still incredibly rare in most enterprise programming projects.
How many times have you had to program recursion with detection and special-casing of cycles? How many times did the tree thing you wrote artificially constrain the domain by not allowing multiple parents? Those are the cases where a graph is the right mental model, and a graph structure ought to be considered (you may still hack the solution together from special-cased recursion on lists, for engineering reasons).
Also, how much does the control flow jump between objects in your code? There's nothing more core to enterprise programming (at least of C++/Java school of thought) than the object graph. Which is what it says on the tin: a runtime directed graph of objects connected by pointers/references. A lot of enterprise code is, in a way, graph algorithms, just inlined and so smeared out that people don't recognize them for what they are.
Also, how many times was the domain model you were using plain broken, because whoever designed it didn't understand that most things in life don't arrange well into a hierarchy - they tend to form directed graphs.
Boost does have the BGL (which, following the STL, abstracts graph algorithms from graph implementations). But I don't think it has seen significant updates in more than two decades.
I think graph data structures are generally missing because they are not something that you would explicitly use most of the time. Sometimes graphs just emerge from existing data structures in an application.
You might want to run generic graph algorithms on such emergent graph data structures, but usually they don't have a uniform graph interface that you can make use of. So you either would need to copy the graph over to some normalized graph data structure, or implement a uniform graph interface facade over the existing objects. The latter is usually more efficient.
Anyway, there are libraries that tend to do this decently, I think Boost does it well[1]. But there are a lot of inherent complexities and so much open design space that you can't really serve with one or even a handful of data structures.
Graphs are such a general concept that if you squint everything is a graph. Your example works just as well with structs and member fields, we don't even need the OO hypothesis.
C and structs let you write code using whatever paradigm you want. So you can do object graphs. And you can also do closures. And you can do many other things.
Lots of things are possible when you just treat memory as bytes and pointers are just integers.
A limitation is that you can only have one such graph in the program. So any graph algorithm that returns a graph, or uses another graph during the computation, doesn't fit.
Don't fall for the abstraction! Mathematically speaking, if you have graphs G1 and G2; you can make another graph H with nodes {G1, G2} and edges {(G1, G2)} and nothing goes wrong. You can definitely view your whole operating system as a graph; it doesn't invalidate having a program running which processes graphs.
Right, I think that's just what I said the first time this came up: all languages with GC have graphs builtin.
(Though C has graphs too. If every node shares the same lifetime, then it's pretty easy to manage. Otherwise it can be pretty painful)
And the good news is that you simply use the TYPE SYSTEM to categorize your nodes and edges.
Your edges are references to other objects, which are typed. Nodes can be typed as well.
---
Although the original article does get at this -- there are many types of graphs, and some of them can be encoded in a typed object graph.
Some of them can't -- you need the equivalent of void* for the edges.
Others would need a List[T] for the edges, if the out degree is not fixed.
And that only covers directed graphs, etc.
Also, it's true that allocating all these tiny objects as GC objects can be very slow, so then you use other representations of graphs, like a list of pairs of node IDs.
I don't really think of it as a "missing" data structure, but yeah now I do see how that framing can be useful.
I think this is a useful model for me personally, and don't want to diminish its potential value to others; I often think of programs as graphs.
I think it's interesting to add to the discussion that I'm wary to reduce anything to any particular "Turing-complete concept". Because anything can be represented by anything.
For that matter it's true about a certain kind of C program. In a civilized language you have run-time typing and introspection and have all the type information so that it is straightforward to look at the program's data structures as a graph.
If you are looking at the debug symbols and a copy of the program you can usually figure this graph out but you might need to think about it sometimes.
> And then for each of those types we have hypergraphs, where an edge can connect three or more nodes, and ubergraphs, where edges can point to other edges.
Huh, I've heard of hypergraphs (although never actually really used them) but never an 'ubergraph'. Sounds tricky!
In practice, how often are there situations you definitely need hypergraphs? I had a particular situation where I needed graphs that were both vertex coloured (labelled) and edge coloured (labelled) - even then it was outside the normal situation for what I was doing (graph canonicalization).
ubergraphs are pretty weird, i've never actually seen them really used anywhere - just a couple of papers pointing out their existence. They have some weird quirks, like: every hypergraph (and thus every graph) has a dual, but ubergraphs with uberedges do not appear to have one.
This reminds me of my quest to find the proper method to model what I've termed 'large types' in an ideal language.
As is well known, algebraic data types as commonly found consist of sums of products, yet a great many useful types are larger than that; some hopefully illustrative examples include:
1) the type of subsets of another type would be 2^X (hopefully demonstrating what I mean by 'large'ness);
2) in practical languages like TypeScript, the 'Partial' of a product type A x B x C would be (1 + A) x (1 + B) x (1 + C);
3) data structures in general, as a term amenable to some certain set of operations, when needing to be represented for performance reasons e.g. a) union-find structures (quotients?); b) a list of words and their inverted indexes for searching; c) a sorted list
Reading more about type modelling, and learning of the disagreements over how even basic things like quotients ought to be represented as types, I've since resigned myself to treating this as an unsolved problem, and relegated the modelling of the kitchen sinks of types to the kitchen sink of types - i.e. the function type (curbed with suitable type constraints upon the signature - from an index type to a suitable base type) - after all, its power and province being the irreducible kernel of type polymorphism, shadow over Church's types, original sin against type decidability.
These types don't seem to escape the scope of what can be described with algebraic types, but the relationships between them seem like you're looking for a notion of type-level functions: subset ≡ X => 2^X, partial ≡ A×B => (A+1)×partial(B)
Consider the case of Partials - we might like to restrict the Partials to different subsets of the fields for different purposes; consider the modelling of an inverted index.
Certainly it is possible to represent each specific case as some algebraic type; but beyond trivial cases, I find that when I need such of these types, quickly I discover that there are myriad ways to express them, none of them uniquely natural, unlike the way a sum of products type (and its terms) can be pretty much unambiguously drawn from a specification.
This matters especially when e.g. I need to evolve my types in a data migration.
graph is a data structure, not a data type. if you squint enough, pointers are the building blocks (the data type, if you please) for building a graph data structure. a->b is pointer access, and looks like an edge in the graph.
graph data structure is the parent of tree; code execution / function call stacks work like a tree - think flame graphs.
stacks and pointers are baked into assembly and cpu architecture. your claims can't be farther from the truth.
A pointer-based graph structure will make matrix algos painful to implement. A graph is a concept. The article is quite meaningful about how it follows the various subtypes. Would recommend reading.
That should provide you some more context about my earlier comment.
By definition of concept (think ConceptNet), anything is a concept. Any noun is a concept. Graph theory defines a graph as a set of two sets: the set of nodes and the set of edges, where each edge is itself a set of two nodes (or a tuple of two nodes if the directionality of the edge also needs to be encoded). A node is anything that you can consider putting into a set. And according to set theory, a set is a well-defined collection of things.
According to the Web Ontology Language, a "thing" is the root of all things that can exist (see https://www.w3.org/TR/owl-ref/, specifically owl:Thing), except "nothing" maybe.
What all this means is that a graph is a collection of things, with things pointing to each other sometimes.
Pointers are the underlying data type that makes all other higher level data structures possible, including arrays, matrices, hashmaps, graphs, structs and more.
Any concrete data structure that uses indirection — which means pretty much anything more complicated than dense arrays and records — is indeed a form of graph.
But graphs, and more constrained forms like DAGs and trees, can also be abstract data types, implemented by a variety of concrete representations.
One of life’s little ironies is that implementing a general abstract graph using a general concrete graph whose records and pointers correspond (roughly) 1:1 with the nodes and edges in the abstract graph is often a poor choice.
Moreover, it’s not unusual to have an abstract graph implemented using a non-graph data structure (for example, a dense adjacency matrix) or to use a graph-like data structure to implement an abstract data type whose interface doesn’t look particularly graph-like (for example, a piece table).
A graph is a pair of two sets: the set of nodes and the set of edges.
As an abstract data type, you may define operations on the data structure (that's what makes it an abstract data type).
In the case of a graph, for example, you can define a connectivity check (the existence of an edge). And graph theory provides plenty more.
And a set (in the set-theoretic sense) is also a data structure; you may define the membership check as an operation on it.
On the other hand, a data type is a tag that a compiler associates with raw bits and bytes in memory in order to operate on them. For example: datetime is a data type, string is a data type, array is a data type, numbers are data types; these are not data structures. These are language primitives.
Further, graph is generalized by the hi-graph (which is a foundation for relational databases and tuple algebra), and specialized by, for example, DAGs and trees.
To build an edge in a graph, you need something that can point to something else. Say A points to B: the most fundamental way to capture this mapping is by using pointers (that is, the address of B, stored at a known location accessible from A). A->B or A.B are just syntactic elements layered on top of this.
Arrays, Matrices, Structs, Strings, are all made possible by pointers.
Pointers are a data type: a pointer tags a value (usually in the range 0..usize) as being the address of something else in memory. Pointers are not data structures.
I should say primitives vs non-primitives if that makes the difference between what is data type vs data structure.
First, the discussion about representation highlights that the issue is a lack of infinite resources. If we had an infinite computer, that could execute an infinite number of operations in zero time, and had infinite memory, then we wouldn't be worrying about whether it's better to store the graph as a matrix, an edge list, or a pointer graph.
Software Engineering is everywhere and always a job of optimization. Sometimes that optimization is premature, and sometimes it's too little too late. It's always about optimization.
Second, when we're talking about a graph of 10 nodes, it really doesn't matter what data structure we use. It can quickly matter if we have 100s or 1000s of nodes and edges because now the possible arrangements are huge as is the search space. But this is no different than other problems like the knapsack problem where the search space is huge: depending on the problem, there is very likely a "trick" to make it tractable, and that trick is different depending on the problem.
So, like the knapsack problem, there are different, specific solutions for specific graph problems.
Ahh, graphs... one of my favorite subjects. I love doing graphs but I'm kinda bad at implementation. But I still managed to slap D3 + Cola together to make my own little interactive visualizer I use for various networking visualizations (L1, L2, L3).
Maybe there could be some utility for a decently general Graph type that can be used for high-level testing and verification. Maybe you could implement your efficient graph per-problem and then validate it with a more high-level and declarative type. You would have to implement some isomorphism between the types though.
This is a good write up, but for me seems to miss the mark. I agree with the author that different algorithms require differently organized data structures. But what if your problem requires 2 different algorithms, which each work best with a different data structure? Either you pick one and run both with it (which is not ideal), or you run the first algorithm with one structure, convert it, and then run the second algorithm on the second structure.
And once you can do that… why not have every algorithm ensure it runs in its best structure, and convert where necessary (and possible) on the way in? Yes, there’s absolutely a performance or storage cost… but if the algorithm is that much faster, it should be worth it.
Basically a beefier version of sorting your data before searching in it. If an algorithm works best with a specific model of a graph, then let that be part of the algorithm.
> why not have every algorithm ensure it runs in its best structure, and convert where necessary (and possible) on the way in? Yes, there’s absolutely a performance or storage cost… but if the algorithm is that much faster, it should be worth it.
Because it's not that much faster, so it's not worth it. You're severely underestimating the amount of thought that went into the article, or the work of the experts interviewed.
Sorting your data before searching it will only pay off if you need to search multiple things. If instead you need to search for one specific thing then going through things linearly is O(n) while sorting and searching the sorted result will be O(n log(n)).
I would supplement it with the observation that when I was a younger programmer, like many people, I considered "generic" or "flexible" a positive when describing a library or framework. I have come to see it as generally negative, especially when the developer's summary puts these adjectives or something similar front and center.
Let me show you the most flexible possible Javascript framework. This will look like a joke, but it's not. It fits perfectly into an HN post. The most flexible possible JS framework is simply:
eval
Similarly flexible frameworks exist for dynamic scripting languages. For static languages one must invoke the entire compiler as the framework. Of course, if you think about it hard enough, I'm doing that for dynamic languages here too, it just has a snappier representation for dynamic languages.
Frameworks and libraries provide their value precisely through limiting things, and then building on those limitations. Of course, the limitations must be well-chosen, to make what can be built on them interesting enough to pay for the limitations the framework chooses. But the essence of them are in their limitations. I start out from the get-go with the maximally flexible framework my language allows, which is itself the language, and the additional framework needs to make limitations on my code in order to do anything useful.
(A problem when framework designers don't understand this is that they make a series of little incorrect design decisions that can often add up to a real pain. For instance, if I were to design a web framework that took over some amount of routing from the user, I would still leave you the ability to claim some bit of the URL space and route it entirely out of my framework, because I understand that my framework is based around limitations and you may need to expose a URL to something that can't work under those limitations. But someone who doesn't realize that frameworks intrinsically involve limitations might fail to give that callout because they can't imagine that someone might have a problem that their framework is not "flexible" and "generic" enough to handle. Imagine a CRUD framework, even a very good one, but I need to offer an endpoint based on streaming server events in the same URL space, which is intrinsically foreign to the CRUD framework's concept of page loads. This is just one example; real frameworks designed without this understanding will make dozens or hundreds of such little mistakes.)
Graphs have the same problem. It seems like they're so flexible and generic that they ought to be more used and more useful. But that's precisely what kills them. Even if you nail down the problem to exactly one representation, they still don't fit. For instance I have a great need for data structures that don't admit cycles, but if all I have is a graph, imposing that limitation from a coding perspective is a real challenge. Mathematically it's trivial, I just say "and this graph has no cycles" et voila [1], there are no cycles, but in code I need to enforce that somehow and there's no trivial solution to that.
Another way of viewing graphs is that we do work in graphs all the time, precisely because everything in RAM can be seen as a graph. GC algorithms even generally work by viewing everything in very raw graphy terms. It just turns out the API you'd expect to work over a graph just isn't useful in the general sense when applied to everything in a programming language's memory space. It seems like it ought to be, but it just isn't. It may seem like it would be great to have a set of employees and extract their names and then look that up into another database etc. etc. with a general graph query language or something, but it turns out the special considerations at each layer make it so that what the general purpose programming language is already doing is actually generally better. The details at each layer matter.
I like the metaphor of architecture astronautics and have often discussed "the 30,000 foot view" versus the view on the ground here on HN, and to my mind the key to the metaphor isn't the nerdery of being an astronaut or the difficulty. The key is that when you get high up, everything looks the same. It feels like graphs ought to be awesome when you're looking down at the entire computing landscape from metaphorical low Earth orbit. But down in the trenches, the local concerns overwhelm that viewpoint... and this is real. This is not just because we all suck or we don't try hard enough or we just Don't Get It. It's real. The architecture astronaut is just wrong in this case. It's not even a beautiful vision this world isn't good enough to manifest or any such conciliatory thing... it's just wrong. It is good to write good code and reduce the amount of bespoke details to be considered. The programming community has made great progress there and there is still great opportunity to do more. But there are an awful, awful lot of details in the world, and the world being detailed is fundamental.
[1]: Or if you are, like me, kinda a fan of surprise stringed instrument attacks, et viola.
> when I was a younger programmer, like many people, I considered "generic" or "flexible" a positive when describing a library or framework [...]
I've come to prefer what I call "design for deletion": Most of those long-term "flexibility someday" needs are best met by making sure the inflexible modules or flows can be clearly identified and ripped out for replacement. This leads to a certain kind of decoupling, although with a higher tolerance for coupling that can be kept in check by static analysis.
This is a contrast to my days of youthful exuberance where I thought I could solve the problem by making my work extensible or customizable. No, I cannot make the immortal program, so I should focus on making a mortal one which can pass gracefully.
> Even if you nail down the problem to exactly one representation, they still don't fit.
This is exactly the problem I've found with graph databases. I've never successfully used a graph database to solve a problem, and multiple other engineers I've spoken to have bounced off them in a similar way. The problem is I don't have an arbitrary graph problem, mine is specific, and as you say the choices a graph database makes matter. It really gives me greater appreciation for the relational model because so many things can be made to work on a relational database. It may not be elegant, but it works.
I think the way I'd like to approach graphs, if I do it again, would be to use a graph represented as sparse matrices in memory as an index. This is more or less in line with what you get from expressing the graph problem in application code, but maybe easier to understand and maintain? I guess that is to say I'm still optimistic there might be some general purpose solution like RedisGraph (now FalkorDB) that makes sense to use this way, but I'm not sure I'd try to use an out-of-core graph database again.
In my last job, I implemented a transpiler for most GQLs: we supported Gremlin, Cypher, SPARQL, NetworkX, GraphQL and SQL.
It took me about 3 months to implement each language.
We could also convert between different graph serialization formats, like RDF and some JSON formats.
The product is now dead (KgBase). The transpiler tool wasn't exposed to users, it was just part of the internal machinery used to seamlessly support many DBs on the same frontend.
The graph community should split in 2. Some are interested in graphs from a math/statistics view point, while others are interested in graphs as a generalization of relational DBs. Graph tools attempt to satisfy both camps simultaneously, but their interests and needs are very different.
I suspect a lot of graph theory can be reduced to abstract algebra. You consider matrices over lots of different scalar types. The scalar types should be:
- Rigs (rings without negation),
- idempotent (that is, where x + x = x for all x),
- equipped with involution (so that undirected graphs can be made the default by restricting the matrices which represent graphs to only self-adjoint matrices),
- and the entries of the matrix can be restricted to a *-ideal. Note that a *-ideal can be considered a scalar type in its own right.
Different choices of the above scalar types can be used to capture different graph types: Weighted, directed, undirected, bipartite.
There's no clue to physical implementation, other than that sparse graphs can be treated like sparse matrices. Anybody tried this? How did it work out?
The post "A Very General Method of Computing Shortest Paths" (https://r6.ca/blog/20110808T035622Z.html) shows that the Gauss-Jordan algorithm for solving matrix equations, the Floyd-Warshall algorithm for all-pairs shortest paths, and Kleene's algorithm for converting DFAs to regular expressions, are all the same algorithm on matrices of elements from a star-semiring.
According to the article, the physical implementation of graphs is precisely the thing you cannot ignore, because the performance difference between a generic solution and a "correct" solution is huge, basically the difference between the algorithm not completing and the algorithm completing.
That's the reason why it's hard to come up with a single one-size-fits-all graph implementation.
> Using semirings (uh, rigs) alone isn't impressive. Do they consider semirings with more algebraic structure attached to them?
Wiki says they have R, the two tropical semirings, the 'max-min' semiring, and GF(2). The tropical and max-min have your idempotency requirement, all but the max-min have involution.
FWIW, despite its name being an abbreviation for LISt Processing, the fundamental datatype of Lisp (the cons cell) is actually a vertex of a directed graph with the restriction that each vertex can only have two outgoing edges.
There's another glaring omission: grammars. Regular expressions are everywhere, languages have literals for them. Context free parsing with grammars? You're gonna have to break out the compiler compilers.
Why there is no parse(grammar, input) function in standard libraries is beyond me. The Earley algorithm seems well suited for it: it can take a grammar as input, and it even works online.
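For what it's worth, a recognizer version of that hypothetical parse(grammar, input) fits in a screenful of Python. This is only a sketch of the Earley idea (no parse trees, no performance tuning), and the grammar encoding is my own assumption: a dict from nonterminal to alternatives, each alternative a tuple of symbols, with anything that isn't a key treated as a terminal.

    def recognize(grammar, start, tokens):
        # Earley items are (lhs, rhs, dot, origin).
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
        for i in range(len(tokens) + 1):
            changed = True
            while changed:                     # run predict/scan/complete to a fixpoint
                changed = False
                for (lhs, rhs, dot, origin) in list(chart[i]):
                    if dot < len(rhs):
                        sym = rhs[dot]
                        if sym in grammar:     # predict
                            for alt in grammar[sym]:
                                item = (sym, alt, 0, i)
                                if item not in chart[i]:
                                    chart[i].add(item); changed = True
                        elif i < len(tokens) and tokens[i] == sym:   # scan
                            chart[i + 1].add((lhs, rhs, dot + 1, origin))
                    else:                      # complete
                        for (l2, r2, d2, o2) in list(chart[origin]):
                            if d2 < len(r2) and r2[d2] == lhs:
                                item = (l2, r2, d2 + 1, o2)
                                if item not in chart[i]:
                                    chart[i].add(item); changed = True
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for (lhs, rhs, dot, origin) in chart[-1])

    # Balanced parentheses: S -> '(' S ')' S | epsilon
    grammar = {"S": [("(", "S", ")", "S"), ()]}
    print(recognize(grammar, "S", list("(()())")))   # True
    print(recognize(grammar, "S", list("(()")))      # False

Running each chart set to a fixpoint is what keeps epsilon rules honest here; a production version would use the usual worklist refinements instead.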
This is looking like a programming language problem. There is no way to make graph algorithms available in a way where the details are abstracted away.
What would a programming language look like that could address all those issues?
Some C++ libraries do pretty well. In the Eigen math library you can call most functions on dense or sparse matrices, and some template magic makes it work efficiently for all combinations.
It's a shame it's so hard to write that kind of generic template code.
Datalog is great, but you don't need it. You can have relations as datastructures in more conventional languages and still reap many benefits.
I programmed professionally in a system that was a dialect of Haskell that had relations as a standard library data type. Expressing business logic in them was very pleasant.
I've also toyed around with adding relations to Python; that also worked just fine. (My toy library wasn't very fast: all operations were implemented naively. But it was still expressive.)
Partly it's just our beloved overthinking and rationalization of it. When I needed a graph to represent and edit a structure of supply contracts, I just stored {who, to-who[]} and loaded these arrays into a graph library written in js. Performance and formats didn't matter because there are only so many contracts, 5-20 per single graph. If there was no graph library, that would suck, and no amount of rationalization would change that. Complexity of the area never offsets the value of having at least something useful in it.
I needed a directed graph yesterday. I gave the nodes integer ids by appending them to an arena, then used a hashtable from integer to vector of integer. Iterating over it involves a set of integers to track which nodes have already been visited.
Deduplicating nodes on insert into the arena was more hassle than cobbling together the graph structure out of a hashtable and a vector.
Maybe one reason against putting graphs in the standard library is they're easily put together from more common structures for whatever special case you have in mind.
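In Python terms the same ad-hoc construction looks roughly like this (the arena is just a list; all names are invented):

    arena = []                 # node payloads; index in the list is the node id
    index = {}                 # payload -> id, used to deduplicate on insert
    succ = {}                  # id -> list of successor ids

    def intern(payload):
        if payload not in index:
            index[payload] = len(arena)
            arena.append(payload)
            succ[index[payload]] = []
        return index[payload]

    def add_edge(a, b):
        succ[intern(a)].append(intern(b))

    add_edge("build", "test")
    add_edge("build", "lint")
    add_edge("test", "deploy")

    def reachable(start):
        seen, stack = set(), [intern(start)]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            stack.extend(succ[n])
        return {arena[n] for n in seen}

    print(reachable("build"))   # {'build', 'test', 'lint', 'deploy'}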
> Maybe one reason against putting graphs in the standard library is they're easily put together from more common structures for whatever special case you have in mind.
This is a fair argument (how implementations tend to combine existing structures in bespoke ways). But any time I've needed to use a graph explicitly, it hasn't really mattered what underlying structures were involved.
What has mattered each time is having to invent my own little API to expose well-known/primitive graph operations, then go and implement them which is unnecessarily error prone.
Your example of de-duplicating nodes on insert sounds like it describes a property of your particular graph that may be better expressed through a type, which would also afford the necessary API.
I'm approaching this from an OOP-ish perspective so do with that what you will.
> I gave the nodes integer ids by appending them to an arena, then used a hashtable from integer to vector of integer. Iterating over it involves a set of integers to track which nodes have already been visited.
This is what sucks about using graphs IMO. I don't want to think about all that stuff, I just want to think about graphs. In practice I spend most of the time toiling around with noisy boilerplate that dominates my mental model and allows graph concerns to leak into business concerns.
I agree with this take - graphs are in an interesting superposition where implementing a large set of algorithms correctly and in their full generality is hard (as the article points out), but getting started with an implementation that solves a particular problem well enough for a particular case is easy and fun.
Well, we have many implementations for simpler abstractions like lists where it might be useful to have contiguous memory, or to have quick append/pop, or maybe inserting at the front, or slicing.
I think that having clearly defined "instances" of these tailored lists, like vector, deque, linked list, helps a bit, but graphs are a harder problem since there are more ways of tailoring them to specific purposes, and with this come more tradeoffs.
> Mathematica, MATLAB, Maple, etc all have graph libraries of some form or another. I am not paying the thousands of dollars in licensing needed to learn more.
Wolfram provides a free Mathematica called Wolfram Engine https://www.wolfram.com/engine/. It's Mathematica without the UI. I hear you can combine it with Jupyter Notebook to get a similar experience to Mathematica.
Ha! I wrote the predecessor for the Python integration some ~25 years ago https://library.wolfram.com/infocenter/MathSource/585/
and I think it uses a similar approach. The Mathematica engine had a protocol for sending and receiving expressions and evaluating them. I was pretty proud of my implementation as it was my first real attempt at doing anything remotely complicated with recursion.
Finding the balance between OO principles, fluid coding capabilities, separating the data, grammar, parser, and world model, and then constructing a standard IF library of common IF "things" is like juggling 20 kittens and 10 chainsaws.
Some things are confounding like do I define a container with a boolean property on an object or is a container a subclass of the base Thing? How does that extend to the underlying graph data store? What will queries look like and which solution is more meaningful to authors?
Seriously, 95% of the fun is figuring all of these things out.
I found that the best way to think about graph implementation is as sparse representations of the adjacency matrix. The CSR/CSC formats have fast lookup abilities, and building and switching between formats is relatively efficient. Most graph algorithms need primitives that can be built on top of this.
For distributed computing one can look into GraphLab or its smaller version, the now largely abandoned GraphChi.
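A small sketch of the CSR view using scipy.sparse: row i of the adjacency matrix holds the out-neighbors of node i, so neighbor lookup is a slice into the index arrays.

    import numpy as np
    from scipy.sparse import csr_matrix

    edges = [(0, 1), (0, 2), (1, 2), (2, 0)]          # directed edge list
    rows, cols = zip(*edges)
    n = 3
    adj = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

    def neighbors(node):
        start, end = adj.indptr[node], adj.indptr[node + 1]
        return adj.indices[start:end]

    print(neighbors(0))           # [1 2]
    print((adj @ adj).nonzero())  # two-hop connectivity via matrix multiplication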
> Because graphs require mutation (respectively break linearity).
I don't think this is actually the case. Graph nodes go in one container (`Vec` or `HashMap` or `BTreeMap`), and the edges go in another container (`HashMap` or `BTreeMap`). The object in which you store the node only needs to know what its name is, you can let something else know what its neighbors are.
I disagree: trees are connected graphs with n nodes and n-1 edges. They're graphs with a special structure that is highly exploitable for performance gains, but they're graphs.
Even with direct mutual references between some node type this can be represented in a lazy functional language such as Haskell pretty easily without mutation.
I think maybe they're talking about how, in order to create a data structure where two nodes contain a reference (edge) to each other, you have to resort to mutation. i.e. you have to create A, then B with reference to A, then mutate A to refer to B, or you have to create B, then A with reference to B, then mutate B to refer to A.
Though this ignores that there are other ways to represent graphs, such as adjacency matrices, etc.
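A tiny Python illustration of both points: with direct references one of the two nodes has to be mutated after the fact, while an indirect representation (here just a set of id pairs) needs no mutation of already-built values.

    class Node:
        def __init__(self, name):
            self.name = name
            self.neighbors = []

    a = Node("A")
    b = Node("B")
    b.neighbors.append(a)   # B can point at A immediately...
    a.neighbors.append(b)   # ...but A must be mutated after B exists to close the cycle

    # The same two-node cycle as plain data, built without mutating anything:
    edges = {("A", "B"), ("B", "A")}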
Given the wide variety of implementations, what I'd like is a questionnaire that asks you what sort of graph you are intending to work with, and then recommends what sort of algorithm implementations or software packages would be best... I'm hoping that the language models will get better at this sort of question over time.
I take solace in such findings. I learned that my quest was ill-conceived, and I abandoned it; I should be glad that I'm no longer searching for what cannot be found.
> Relational databases are graphs where the nodes are records and the edges are foreign keys
I disagree with the premise of the article: programming languages do have strong and mature support for graphs in the form of relational database interfaces, which cover most of the real-world use-cases for linked data.
Just in the last month I've been working with PEG patterns, which are mostly tree-shaped but rules make it a cyclic graph, and writing a rope, which is a DAG.
This is an interesting article but one major effort it doesn’t mention is the Boost Graph Library [0].
It is kind of clunky in places because it is written in C++03 and uses some weird idioms to simulate keyword arguments and provide generic ways of getting attributes for nodes. Also it suffers from the terrible template instantiation errors that most C++ template libraries do. But I still think it addresses a lot of the difficulties covered in the article:
> There are too many design choices
BGL is limited to directed/undirected multigraphs, so hypergraphs are not supported. However I think these cover most use cases.
In terms of implementation choices, BGL provides several concrete data types, such as an adjacency list and an adjacency matrix. It also provides adaptors for the GraphBase and LEDA graph libraries. If none of these are suitable you can write adaptor functions to support your custom data type. All algorithms* work unmodified on these concrete implementations.
> So which algorithms should come with the library?
BGL comes with most of the common ones [1], but I do wish it came with more. The implementations of them are quite hard to read because they are written in highly generic (and before many of the conveniences offered in C++11) C++ code.
> Performance is too important
Since BGL is generic using C++ templates instead of runtime polymorphism, it should (in theory) be able to work with a concrete graph implementation that is performant for a certain task and so let you reuse its generic algorithms.
I think the article describes a lot of the difficulties that Stepanov’s generic programming approach tries to solve (e.g. finding the most abstract but still efficient implementation of an algorithm, writing algorithms that depend on a limited set of type requirements, having many data types that can reuse the same algorithms). While C++ supports this style of programming it is not ideal for it, but I think BGL is the closest thing I have seen to a generic graph library that is also performant in many cases.
*Algorithms have varying requirements, e.g. some may need to be able to remove edges while others do not. But these requirements are generic and can be fulfilled by many different graph implementations.
As a sidenote, Stepanov’s introduction [0] to the Boost Graph Library book explains a lot about his approach to programming. He didn’t write the BGL, but his work on the STL was a major inspiration for it.
I fully agree. Despite the syntactical clunkiness and the need to learn some new concepts, BGL's design is beautiful -- surprisingly, it mostly achieves the STL goal of implementing M algorithms on N different representations using just O(M+N) code (vs. the usual O(M*N)), without sacrificing efficiency. Like the OP, after much wandering through the desert I had come to the conclusion that there was no way to generically yet performantly model graphs -- but once I understood (the basics of) BGL, I was persuaded otherwise.
It's a little disappointing that BGL didn't make an appearance in TFA.
To make matters worse, people often run graph algorithms on graphs that are too large to store, the details of which are generated on the fly at each step of the algorithm from the definition of the graph.
My own take: a document database, which devs seem to love, combined with relations aka edges, is basically a property graph, and that's basically what you need. Directionality is a property on the edge.
I think one of the elements that the author is missing here is that graphs are sparse matrices, and thus can be expressed with Linear Algebra. They mention adjacency matrices, but not sparse adjacency matrices, or incidence matrices (which can express multi- and hypergraphs).
Linear Algebra is how almost all academic graph theory is expressed, and large chunks of machine learning and AI research are expressed in this language as well. There was recent thread here about PageRank and how it's really an eigenvector problem over a matrix, and the reality is, all graphs are matrices, they're typically sparse ones.
One question you might ask is, why would I do this? Why not just write my graph algorithms as a function that traverses nodes and edges? And one of the big answers is, parallelism. How are you going to do it? Fork a thread at each edge? Use a thread pool? What if you want to do it on CUDA too? Now you have many problems. How do you know how to efficiently schedule work? By treating graph traversal as a matrix multiplication, you just say Ax = b, and let the library figure it out on the specific hardware you want to target.
Here for example is a recent question on the NetworkX repo for how to find the boundary of a triangular mesh, it's one single line of GraphBLAS if you consider the graph as a matrix:
This brings a very powerful language to the table, Linear Algebra. A language spoken by every scientist, engineer, mathematician and researcher on the planet. By treating graphs like matrices graph algorithms become expressible as mathematical formulas. For example, neural networks are graphs of adjacent layers, and the operation used to traverse from layer to layer is matrix multiplication. This generalizes to all matrices.
There is a lot of very new and powerful research and development going on around sparse graphs with linear algebra in the GraphBLAS API standard, and its best reference implementation, SuiteSparse:GraphBLAS:
SuiteSparse provides a highly optimized, parallel and CPU/GPU supported sparse Matrix Multiplication. This is relevant because traversing graph edges IS matrix multiplication when you realize that graphs are matrices.
Recently NetworkX has grown the ability to have different "graph engine" backends, and one of the first to be developed uses the python-graphblas library that binds to SuiteSparse. I'm not a direct contributor to that particular work, but as I understand it there have been great results.
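To make the "traversal is matrix multiplication" point concrete without GraphBLAS itself, here is a BFS level computation in plain scipy.sparse (a simplified sketch, not how SuiteSparse does it): one BFS step is a sparse matrix-vector product of the transposed adjacency matrix with the current frontier.

    import numpy as np
    from scipy.sparse import csr_matrix

    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    rows, cols = zip(*edges)
    n = 5
    A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

    def bfs_levels(source):
        frontier = np.zeros(n); frontier[source] = 1
        visited = frontier.copy()
        levels = {source: 0}
        depth = 0
        while frontier.any():
            depth += 1
            frontier = (A.T @ frontier > 0) & (visited == 0)   # next frontier
            for v in np.flatnonzero(frontier):
                levels[int(v)] = depth
            visited += frontier
            frontier = frontier.astype(float)
        return levels

    print(bfs_levels(0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}

The scheduling, blocking, and parallelism questions all move into the matrix multiply, which is the whole appeal.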
I was really excited about RedisGraph, and sad to see it was cancelled. In my (limited) experience with graph databases they proved very frustrating because it seemed like they tried to do too much. Ultimately the way I thought of a graph was an indexing strategy into some underlying data. So I needed the graph to be very quick, but I didn't have any requirement to store actual data in it--just references. This made triples based graph storage seem very heavy-handed.
The idea of composing sparse linear transformations to optimize queries is really cool. You can get a lot of work done in one shot that way, in a manner that's just quite a lot easier on the machine than chasing pointers around.
Hrm. This reminds me of a thought I had about, for the lack of a better term, "groups" in Python. Each data type has a property which is paramount and more or less defines the type.
Lists are ordered. The tuple is immutable. The dict is keyed. The set is unique. (But there are some overlaps: dicts are both keyed and their keys are unique.)
And I thought, what if you had a group where you could make it immutable or mutable, ordered or not-ordered, where the value was a key to something else (or not), and so on? But then I saw the weird edge cases and the explosions of complexity. Some combinations of these attributes are less appealing than others.
I really enjoyed reading this thread. The opening line (and core argument?)
"Graphs are ubiquitous in software engineering" triggered my neural network to respond.
TLDR: graphs are ubiquitous in _science_, and for a data type to be useful it should not put the crossbar too high.
(Some history.) The Center for Nonlinear Studies (https://cnls.lanl.gov/External/) has a rich legacy of organizing annual Los Alamos Lab driven conferences in Santa Fe that bring together emergent disciplines. We combine overview talks by world experts and enough spaces in between these talks so that new bridges can be built at outstanding Santa Fe restaurants. I was co-organizer of the 2003 conference on Complex Networks, and this one was turning out to be a real banger. Sitting in the back with Aric Hagberg and Dan Schult (from Colgate University, but then spending his sabbatical at the CNLS) we were struck by how many really smart people were using "complex networks", but with very few computational tools available to them. That was the origin of networkx ( = network "X", reflecting the multidisciplinary renaissance we were watching).

At that time python was a high-productivity starting framework to build the infrastructure, but I honestly expected that eventually there would be another language plugged in under the hood to make it more efficient for huge data sets on supercomputing architectures. We used the Guido van Rossum dict-of-dicts data structure idea (a few years old at that time) and built the most natural setup designed for a range of the disciplines. The python dict data type was a well-tested and integrated part of the language, so we felt this to be a solid base to work on. We freely borrowed ideas from many smarter than us. E.g. from David Eppstein [1] - one should be able to just say "if n in G" for "if the node n is in the graph G" and "G[n]" for "the neighborhood of node n in the graph G". We loved the graphviz drawing tools but quickly decided that graph drawing was a separate challenge [2].

Our goal was platform-independent, open source tools that would allow any graduate student, from any country, to use it in any field. Often when I came up with some strange subset of mathy graph stuff Aric would push back with YAGNI! [3]

Fast forward a few years, and the success of networkx --- due to the wise management and long midnight hours leadership by Aric and Dan, who inspired many new contributors --- continued to surprise us. The python dict-of-dicts technology allowed a wide range of fields to use these tools, and smart graduate students (working in diverse fields such as epidemiology, proteomics, ecology, architecture, social sciences, ...) could learn "applied graph theory" on the fly and easily write their own code. If we had used C++ Boost BGL or some other more efficient C data structures, this would likely have bypassed all these thousands [4] of applications. The evolution of networkx continues with great new ideas, as for example explained in the recent scipy talks by current maintainers and contributors. Thanks Jarrod Millman! [5]
Many programmers ached at better faster newer graph libraries, and I know that they used networkx as part of their development, as they should. Borrowing from paper dictionaries, there are some humorous quirks added into old code that allows one to track borrowed memes [6]. One reason the abundance of graph libraries will continue is that programmers, like woodworkers, enjoy the great joy of crafting them. I look forward to a future AI that creates a superb graph library just because it should be done. I hope it will contain random_lobster, and the Aric Hagberg Ankh-Morporkian quote "it is dictionaries all the way down".
Pieter Swart
Theoretical Division and Center for Nonlinear Studies, LANL
[2] In his CNLS 2003 talk, Bill Cheswick made the point that for typical internet related graphs any drawing tool soon delivers a "peacock splattered onto your windshield." https://www.cheswick.com/ches/
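(For readers who haven't seen it, the dict-of-dicts idea looks roughly like this - not NetworkX's actual code, just the shape of it - so that "n in G" and "G[n]" fall out of ordinary dictionary behaviour:)

    class Graph:
        def __init__(self):
            self.adj = {}                                  # node -> {neighbor: edge attrs}
        def add_edge(self, u, v, **attrs):
            self.adj.setdefault(u, {})[v] = dict(attrs)
            self.adj.setdefault(v, {})[u] = dict(attrs)    # undirected
        def __contains__(self, n):
            return n in self.adj
        def __getitem__(self, n):
            return self.adj[n]

    G = Graph()
    G.add_edge("a", "b", weight=3)
    print("a" in G)    # True
    print(G["a"])      # {'b': {'weight': 3}}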
I think this is a cop-out. Someone in the 1990s could have written the same thing about collections, or dictionaries, but we eventually came up with good-enough compromises that Python, Ruby, and Javascript all basically do the same thing. They don't implement literally every case, but they are good enough for "small" data, where the definition of "small" has grown to be quite large by human standards.
I think the real problem is syntax. Someone needs to come up with a textual graph literal that fits in source code, and it'll be good enough for the small type. Switch to a custom ubergraph when you exceed, idk, ten million entries.
Two problems I see here, based on the research I've done in high-performance graph algorithms:
- It's hard to find a "good-enough" graph implementation. The best hashtable is only a handful of percent better than the built-in ones. The best graph impl is 1000x or more better than any generic built-in one could be, so there's much more incentive to specialize (and people already specialize hashtables for just a handful of percent speedup!)
- The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset. The baseline complexity of implementing a graph algorithm for a small dataset is pretty low, and the real problems come in later / at larger scale. So in graphs there's less incentive to learn a complex library's API when "I could just hack it myself," unlike for hashtables where the API is simple and doing it myself is much harder.
> The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset.
I would disagree with this, it's actually really easy to make one if you're willing to do away with many features (which aren't essential, but provide performance benefits). Implementing one is just something you never have to do in most modern languages.
The work you cited is very impressive and very welcome.
But you seem to be implying that `std::unordered_map` is the default choice one would use, which in my experience is not accurate -- it is well-known to have serious perf shortcomings, and everyone I know uses some other implementation by default. Even so, the delta from `std::unordered_map` to the improved hashtable in the blog post is impressive, and just shy of 10x.
Graph algorithms frequently have 10x improvements from one state-of-the-art approach to the next -- for example, here's one from my own research[1]. The delta between state-of-the-art and "good default" in graph algorithms would often be around 100-1000x. And comparing state-of-the-art to the equivalent of an `std::unordered_map` would be another 10-100x on top of that, so 1000-100000x total.
Whoah, thank you for sharing. I only knew that just like dictionaries, there are quite a few implementation choices when making a graph, depending on what operations the algorithms needs to do often, and how sparse the data is.
but where RDF graphs (roughly like a JSON document) get passed over the "lines", but found that the heavy hitters in this space believed this sort of product has to use columnar execution to be "fast enough". You can certainly build something that can do operations on a dynamic RDF graph (really a set of triples), but in principle you could compile code that treats native data structures as if they were in RDF... You might get pretty good speed, but it won't be as fast as native, and it's hard to make it easier to code for than native.
RDF is not dependent on XML. There is a XML representation of RDF, but alternatives include JSON (using JSON-LD) and simple textual formats such as N3 and Turtle.
One trouble w/ RDF is that there are two official ways to do ordered collections, the RDF list which is basically a LISP list, and then RDF collections where the order is encoded in predicates. Neither is well-supported in most tools, for instance SPARQL lacks the list handling capabilities that you'd see in
The XMP spec, for instance, hacks Dublin Core by adding ordering information because... It matters what order the authors are in. Dublin Core on the other hand seems to be developed for cataloging elementary school libraries and they were huge fans of doing the easy stuff and leaving out anything moderately hard, so Dublin Core looks like it has a 1968 level of sophistication and MARC is so much more 2000s. People come to RDF, see all these problems that are ignored, and come to the conclusion RDF is not for them.
While there are (commonly used) alternative serialization formats, RDF is married to the XML data model, as all core datatypes of RDF are defined in terms of XSD (XML Schema Definition) datatypes, and it very closely mimics its data model.
If you want to have an RDF that is independent of that (e.g. based on Apache Arrow so that it's compatible with modern big data tooling) you might as well start from scratch.
And some languages (e.g. java, scala) have standard libraries with interfaces that describe ordered collections, dictionaries, etc, but offer multiple implementations so the programmer can pick based on their specific considerations, but still benefit from library code written around the interface.
It also uses hint/marker interfaces (eg. `RandomAccess`) so the library code can choose among algorithms internally while still presenting a uniform API to the client.
As somebody who's actually done the work, I don't think you've really spent time with the question. The problem is that graphs, in general, aren't special. They're a structural dumping ground.
Data structures are specialized graph representations. The source code you write is in a specialized graph representation. For the overwhelming majority of programmers, the fact that something can be viewed as a graph is much like saying "oh well that file is just a really long sequence of bits, so it's effectively an integer." It's not wrong, but is it useful?
We've got a couple common graph literals today: Graphviz's "dot" language [0] and the Mermaid language (in part inspired from "dot") [1]. Mermaid is increasingly common in documentation today, given Github supports it almost everywhere Markdown is supported. Parsing dot or Mermaid as embedded literals in another language doesn't seem that complicated (Mermaid's parser is written in JS and already runs anywhere you can run a JS parser).
Though one interesting thing to note here is that both languages are edge list languages and are optimized for algorithms that are useful for edge lists, particularly display/diagramming. That gets back to the article's point that there are other useful representations and those can matter for efficiency of algorithms.
They also can matter for efficiency of a graph literal in a source document. Edge lists are great for sparse graphs, but if you have a dense graph it might be easier to write as a literal with a massive spreadsheet of an adjacency matrix. Especially if it can just be a spreadsheet with modern affordances like scrolling and sticky rows/columns and such. There's no perfect answer for "every" graph.
Same - DOT is a great way to go from zero to functional in near minimal time. It's also pretty trivial to generate since it's not order dependent, which is great for lightweight automation.
Fine-tuning layout can be a real hassle though, sadly. I haven't found any quick tools for that yet.
The way I approach fine-tuning DOT layout is to add subgroups where it seems appropriate, give each subgroup a 1px border, and see where the layout engine situates it relative to nearby vertices/edges. Sometimes I have to put a border around a few subgroups, then adjust the size of vertices and the overall area to nudge things toward a local minimum. Note: I don't attempt to adjust the size of the subgroups; I'm not sure that even works, though maybe it depends on the choice of layout algorithm. Padding and other styling on the vertex box may help, too. It's been a few years for me, tbh.
Deciding where the appropriate subgroups are is a bit of an art. Sometimes it's obvious, as in graphs that are intentionally bipartite, or when there is a staged layout like a pipeline architecture. Sometimes it's not obvious even when it seems it should be, like when Graphviz really wants to make a certain edge very short. Be ready to backtrack sometimes. I usually remove the subgroup borders after I'm done, but a few times they have been useful to leave in.
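For what it's worth, a rough sketch of that bordered-subgroup trick might look like this (cluster names and styling are invented for illustration; in the dot engine, a subgraph whose name starts with `cluster` is drawn with a border you can style while tuning and hide afterwards):

```dot
digraph pipeline {
  subgraph cluster_frontend {
    color=gray; penwidth=1;   // thin border, just to see where this group lands
    // peripheries=0;         // uncomment to hide the border once tuning is done
    a; b;
  }
  subgraph cluster_backend {
    color=gray; penwidth=1;
    c; d;
  }
  a -> c;
  b -> d;
}
```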
One thing I really like about DOT is that you can add hyperlinks to vertices and edges, and they translate decently into the compiled output. I had an oncall dashboard that made liberal use of this feature, and I still think back on it fondly sometimes.
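For the curious, the attribute involved is `URL` (or `href`), which both nodes and edges accept; when you render to SVG (e.g. `dot -Tsvg`) the shapes come out as clickable links. A hedged sketch with placeholder names and URLs:

```dot
digraph oncall {
  // URL attributes turn into clickable links in SVG output.
  // Service names and URLs below are placeholders.
  api     [URL="https://dashboards.example.com/api"];
  billing [URL="https://dashboards.example.com/billing"];
  api -> billing [label="invoices", URL="https://runbooks.example.com/api-to-billing"];
}
```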
Can you point to where I can learn more about this? E.g., is there an example and explanation of hyperlink embedding?
I am especially interested in syntax suitable for creating input to https://www.viz-js.com and for producing SVGs with embedded hyperlinks.
If you're using subgraphs with an edge across subgraph boundaries, it is slightly order dependent: a node will appear in the area where it was first mentioned. If you define all the nodes/subgraphs first and all the edges at the bottom, you'd never notice this, though.
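A small sketch of the pitfall (names invented for illustration): because `b` is first mentioned inside the cluster, it gets drawn inside that cluster even if later edges treat it as an outside node:

```dot
digraph order_matters {
  subgraph cluster_left {
    a;
    a -> b;   // first mention of b happens here, so b is placed in cluster_left
  }
  b -> c;     // too late to move b; declaring nodes before edges avoids the surprise
}
```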
For a three-node graph with edges between vertices 0 and 1 and between vertices 0 and 2, with vertex labels 'A' and 'B' and edge labels 'C' and 'D'. Not great to parse (as I sadly found), but possible to read.
> matrix multiplication and permanents are known to be non-cheap to compute, requiring worse-than-quadratic time! So any sort of good graph algorithm must dig deeper and be more specialized for the task at hand