RAG Using Structured Data: Overview and Important Questions (kuzudb.com)
5 points by semihsalihoglu on Jan 4, 2024 | hide | past | favorite | 4 comments


This is the first post in a series I plan to write on the role graph DBMSs and knowledge graphs play in LLM applications, and on recent text-to-high-level-query-language work I read up on over the holiday season.

These blogs have two goals:

(i) give an overview of what I learned as an outsider looking for technical depth; (ii) discuss some avenues of work that I ran into that looked important.

This first post is on "Retrieval Augmented Generation using structured data", i.e., private records stored in relational or graph DBMSs. The post is long and full of links to some of the important material I read (given my academic background, many of these are papers), but it should be an easy read, especially if you are an outsider intimidated by this fast-moving space.

tl;dr for this post:

- I provide an overview of RAG.

- Compared to pre-LLM work, the simplicity and effectiveness of developing a natural language interface over your database using LLMs is impressive.

- There is little work that studies LLMs' ability to generate Cypher or SPARQL. I also hope to see more work on nested, recursive, and union-of-join queries.

- Everyone is studying how to prompt LLMs so they generate correct DBMS queries. Here, I hope to see work studying the effects of data modeling (normalization, views, graph modeling) on the accuracy of LLM-generated queries.
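To make the "natural language interface over your database" idea concrete, here is a minimal sketch of the text-to-Cypher flow the post discusses. The schema, question, and prompt wording are my own illustrative assumptions, not code from the post; the string returned here would be sent to an LLM, and the Cypher it returns executed against the DBMS (Kuzu, in this case).

```python
def build_text_to_cypher_prompt(schema: str, question: str) -> str:
    """Assemble a prompt asking an LLM to translate a natural-language
    question into a Cypher query against the given graph schema."""
    return (
        "You are a Cypher expert. Given this graph schema:\n"
        f"{schema}\n"
        "Write a single Cypher query that answers the question. "
        "Return only the query.\n"
        f"Question: {question}"
    )

# Hypothetical schema and question for illustration.
schema = "(:Person {name})-[:WROTE]->(:Post {title})"
prompt = build_text_to_cypher_prompt(schema, "Which posts did Alice write?")
# prompt -> sent to the LLM; the returned Cypher is run against the DBMS,
# and the query results ground the final natural-language answer.
```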

Hope some find this interesting.


So this post is about using RAG/LLM to generate queries (Cypher in this case, to be consumed by Kuzu). That way you could ask natural-language questions to be answered by the result of the query.

I wonder if you could comment about other areas of AI+Graphs (I think this is mostly Graph Neural Networks, not sure if anything else?).

For instance, I found PyG and Deep Graph Library but the use cases are so jargon-heavy [1], [2], I'm not sure about the real world applications, in layman terms.

--

1: https://pytorch-geometric.readthedocs.io/en/latest/tutorial/...

2: https://docs.dgl.ai/tutorials/blitz/index.html


Ok, using ChatGPT and Bard (the irony lol) I learned a bit more about GNNs:

GNNs are probabilistic models that can be trained to learn representations of graph-structured data and handle complex relationships, while classical graph algorithms are specialized for specific graph analysis tasks and operate based on predefined rules/steps.

* Why is PyG called "Geometric" and not "Topologic"?

Properties like connectivity, neighborhoods, and even geodesic distances can all be considered topological features of a graph. These features remain unchanged under continuous deformations like stretching or bending, which is the defining characteristic of topological equivalence. In this sense, "PyTorch Topologic" might be a more accurate reflection of the library's focus on analyzing the intrinsic structure and connections within graphs.

However, the term "geometric" still has some merit in the context of PyG. While most GNN operations rely on topological principles, some do incorporate notions of Euclidean geometry, such as:

- Node embeddings: Many GNNs learn low-dimensional vectors for each node, which can be interpreted as points in a vector space, allowing geometric operations like distances and angles to be applied.

- Spectral GNNs: These models leverage the eigenvalues and eigenvectors of the graph Laplacian, which encodes information about the geometric structure and distances between nodes.

- Manifold learning: Certain types of graphs can be seen as low-dimensional representations of high-dimensional manifolds. Applying GNNs in this context involves learning geometric properties on the manifold itself.
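The first two bullets above can be illustrated in a few lines of numpy (my own toy sketch, not PyG code): treating node embeddings as points so that distances and angles are meaningful, and the graph Laplacian that spectral GNNs build on.

```python
import numpy as np

# (1) Node embeddings as points: cosine similarity between two
# 4-dimensional embedding vectors (values are made up).
a = np.array([1.0, 0.0, 2.0, 1.0])
b = np.array([0.5, 0.1, 1.8, 1.2])
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# (2) Spectral view: the Laplacian L = D - A of a 3-node path graph
# 0 - 1 - 2, whose eigenvalues/eigenvectors spectral GNNs leverage.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
eigenvalues = np.linalg.eigvalsh(L)  # sorted ascending; smallest is 0
```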

Therefore, although topology plays a primary role in understanding and analyzing graphs, geometry can still be relevant in certain contexts and GNN operations.

* Real world applications:

- HuggingFace has a few models [0] around things like computational chemistry [1] or weather forecasting.

- PyGod [2] can be used for Outlier Detection (Anomaly Detection).

- Apparently ULTRA [3] can "infer" (in the knowledge graph sense), that Michael Jackson released some disco music :-p (see the paper).

- RGCN [4] can be used for knowledge graph link prediction (recovery of missing facts, i.e. subject-predicate-object triples) and entity classification (recovery of missing entity attributes).

- GreatX [5] tackles removing inherent noise, "Distribution Shift", and "Adversarial Attacks" (e.g., noise purposely introduced to hide a node's presence) from networks. Apparently this is a thing, and the field is called "Graph Reliability" or "Reliable Deep Graph Learning". The author even has a bunch of "awesome" style lists of links! [6]

- Finally this repo has a nice explanation of how/why to run machine learning algorithms "outside of the DB":

"Pytorch Geometric (PyG) has a whole arsenal of neural network layers and techniques to approach machine learning on graphs (aka graph representation learning, graph machine learning, deep graph learning) and has been used in this repo [7] to learn link patterns, also known as link or edge predictions."
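To give a flavour of the link-prediction use case above (RGCN, ULTRA): models like these learn a vector per entity and per relation, then score candidate triples. Below is a toy DistMult-style scorer with random (untrained) vectors — my own sketch of the idea, not code from any of the linked repos; in a real model the vectors are learned so that true triples score higher.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
# Untrained, randomly initialized embeddings for illustration only.
entities = {"Michael Jackson": rng.normal(size=dim),
            "disco": rng.normal(size=dim)}
relations = {"released_genre": rng.normal(size=dim)}

def score(head: str, rel: str, tail: str) -> float:
    """DistMult-style plausibility score for the triple (head, rel, tail):
    sum of the elementwise product of the three vectors."""
    return float(np.sum(entities[head] * relations[rel] * entities[tail]))

s = score("Michael Jackson", "released_genre", "disco")
```

One design note: DistMult's elementwise product makes the score symmetric in head and tail, which is why models like RGCN often pair it with relation-specific transformations.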

--

0: https://huggingface.co/models?pipeline_tag=graph-ml&sort=tre...

1: https://github.com/Microsoft/Graphormer

2: https://github.com/pygod-team/pygod

3: https://github.com/DeepGraphLearning/ULTRA

4: https://huggingface.co/riship-nv/RGCN

5: https://github.com/EdisonLeeeee/GreatX

6: https://edisonleeeee.github.io/projects.html

7: https://github.com/Orbifold/pyg-link-prediction


You seem to have done some research already, but let me answer briefly: GNNs and what I covered in the blog post, "RAG over structured data", are not connected. They are approaches to solving two different problems.

GNNs: Let's forget about LLMs completely. GNN is a term given to a specific class of ML models whose layers follow a graph structure. Suppose you have some data that represents real-world entities, and you have features, i.e., some vector of floating-point numbers representing properties of these entities. Now suppose you want to run some predictive task on these entities: e.g., your entities are customers and products, and you want to predict who could buy a product so you can recommend products to customers. There is a suite of ML tools you can use if you can represent your entities as vectors: e.g., you can use distances between these vector representations as an indication of closeness/similarity, and recommend to a customer A the products that were bought by customers close to A's vector representation. This is what it means to embed these entities in a vector space. One way to embed your entities is to run the nodes through an ML model that takes their features as input and produces another set of vectors (you could use the features alone as embeddings, but they are not really trained and often have higher dimensions than the embedded vectors). GNNs are a specific version of such ML models where the entities and the relationships between them are modeled as a graph, and the model's architecture, i.e., the operations it performs on the feature vectors, depends on the structure of this graph. (edited)
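The customer/product example above can be sketched in a few lines (my own toy illustration with made-up 2-d embeddings; a real system would learn the vectors, e.g., with a GNN): find the customer nearest to A in embedding space and recommend what they bought that A hasn't.

```python
import numpy as np

# Made-up 2-d embeddings; B is deliberately placed close to A.
customer_vecs = {
    "A": np.array([1.0, 0.2]),
    "B": np.array([0.9, 0.3]),
    "C": np.array([-1.0, 2.0]),
}
purchases = {"A": {"p1"}, "B": {"p1", "p2"}, "C": {"p3"}}

def recommend(target: str) -> set:
    """Recommend products bought by the nearest other customer
    (by Euclidean distance in embedding space) that target lacks."""
    others = [c for c in customer_vecs if c != target]
    nearest = min(others,
                  key=lambda c: np.linalg.norm(customer_vecs[c]
                                               - customer_vecs[target]))
    return purchases[nearest] - purchases[target]

recs = recommend("A")  # B is nearest to A, so B's extra purchase is recommended
```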

In short, GNNs are not deeply connected to LLMs.

GNNs became very popular several years ago because they were the only ML architectures where you could incorporate into the model and training objective not just the features but also the connections between entities. And they dominated academia until LLMs. In practice, I don't think they're as popular as they are in academia, but afaik several major companies, such as Pinterest, based their recommendation engines on models with GNN architectures.

But one can imagine building applications that use a mix of these technologies. You can use GNNs to create embeddings of KGs and then use these embeddings to extract information during retrieval in a RAG system. All these combinations are possible.
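The combination described above could look roughly like this: pretrained KG entity embeddings power the retrieval step, and the retrieved facts become the prompt context. All names, vectors, and facts here are invented for illustration; in practice a GNN would produce the embeddings.

```python
import numpy as np

# Hypothetical entity embeddings (in practice, output of a trained GNN).
entity_vecs = {
    "Kuzu":   np.array([0.9, 0.1, 0.0]),
    "Cypher": np.array([0.8, 0.2, 0.1]),
    "Paris":  np.array([0.0, 0.1, 0.9]),
}
facts = {"Kuzu": "Kuzu is an embedded graph DBMS.",
         "Cypher": "Cypher is a graph query language.",
         "Paris": "Paris is the capital of France."}

def retrieve(query_vec: np.ndarray, k: int = 2) -> list:
    """Return the k facts whose entity embeddings have the highest
    dot product with the query embedding."""
    ranked = sorted(entity_vecs,
                    key=lambda e: -float(query_vec @ entity_vecs[e]))
    return [facts[e] for e in ranked[:k]]

# A query embedding close to the graph-database entities.
context = retrieve(np.array([1.0, 0.0, 0.0]))
# `context` would then be prepended to the LLM prompt in the RAG system.
```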



