What Is a Knowledge Graph? (neo4j.com)
200 points by Anon84 5 months ago | 59 comments



Good article on the high-level concepts of a knowledge graph, but it has some concerning mischaracterizations of the core functions of ontologies supporting the class schema, and continued disparaging of competing standards-based (RDF triple-store) solutions. That the author omits the updates for property annotations using RDF* is probably not an accident and glosses over the issues with their proprietary clunky query language.

While knowledge graphs are useful in many ways, personally I wouldn't use Neo4J to build a knowledge graph as it doesn't really play to any of their strengths.

Also, I would rather stab myself with a fork than try to use Cypher to query a concept graph when better standards-based options are available.


    > While knowledge graphs are useful in many ways, personally I wouldn't use Neo4J to build a knowledge graph as it doesn't really play to any of their strengths.
I'd strongly disagree. The built-in Graph Data Science package has a lot of nice graph algos that are easy to reach for when you need things like community detection.

The ability to "land and expand" efficiently (my term for how I think about KG's in Neo4j) is quite nice with Cypher. Retrieval performance with "land and expand" is, however, highly dependent on your initial processing to build the graph and how well you've teased out the relationships in the dataset.

    > I would rather stab myself with a fork than try to use Cypher to query a concept graph when better standards-based options are available.
Cypher is a variant of the GQL standard, which was itself born from Cypher and, subsequently, the openCypher working group: https://opencypher.org/

More info:

https://neo4j.com/blog/gql-international-standard/

https://neo4j.com/blog/cypher-gql-world/


> That the author omits the updates for property annotations using RDF* is probably not an accident and glosses over the issues with their proprietary clunky query language.

Not just that: w.r.t. reification, they gloss over the fact that Neo4j has the opposite problem. Unlike RDF, it is unable to cleanly represent multiple values for the same property and requires reification or clunky lists to fix it.


    > clunky lists
Not sure what the problem is here. The nodes and relationships are represented as JSON so it's fairly easy to work with them. They also come with a pretty extensive set of list functions[0] and operators[1].

Neo4j's UNWIND makes it relatively straightforward to manipulate the lists as well[2].
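For example, something along these lines (the label and property names are made up):

    // expand a list property into rows, filter, then collect it back
    MATCH (p:Paper)
    UNWIND p.keywords AS kw
    WITH p, kw WHERE kw STARTS WITH 'graph'
    RETURN p.title, collect(kw) AS graphKeywords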

I'm not super familiar with RDF triplestores, but what's nice about Neo4j is that it's easy enough to use as a generalized database so you can store your knowledge graph right alongside of your entities and use it as the primary/only database.

[0] https://neo4j.com/docs/cypher-manual/current/functions/list/

[1] https://neo4j.com/docs/cypher-manual/current/syntax/operator...

[2] https://neo4j.com/docs/cypher-manual/current/clauses/unwind/...


It has been a while, so maybe things have changed, but the main reasons I remember are:

1) Lists stored as a property must be a homogeneous list of simple built-in datatypes, so no mixing of types, custom types, or language tagging like RDF has as first-class concepts.

2) Indexes on lists are much more limited (exact match only, IIRC), so depending on the size of the data and the search parameters it could be a big performance issue.

3) Cypher gets cumbersome if you have many multi-valued properties, because every clause becomes any(elem IN node.foo WHERE <clause>). In SPARQL it's just ?node schema:foo <clause>.
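Concretely, to find nodes where a multi-valued property contains "bar" (names made up, same as above):

    // Cypher: the multi-valued property is a list, so you need a predicate function
    MATCH (n:Thing)
    WHERE any(elem IN n.foo WHERE elem = 'bar')
    RETURN n

    # SPARQL: multiple values are just more triples, so the pattern stays flat
    PREFIX schema: <http://schema.org/>
    SELECT ?node WHERE { ?node schema:foo "bar" }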

I don't think everybody should run away from property graphs for RDF or anything, in terms of the whole package they are probably the right technical call ninety-something percent of the time. I just find Neo4J's fairly consistent mischaracterization annoying and I have a soft spot for how amazingly flexible RDF is, especially with RDF-star.


What would you recommend as an RDF database to explore?


GraphDB is the one I usually use. It has a web interface that eases the first steps. Virtuoso (especially Virtuoso 7, which is open source) is also an option, though it's a bit more command-line based.

In case you want to have a look at the SPARQL client I maintain, Datao.net, you can go to the website and drop me a mail. (I really need to update the video there, as the tool has evolved a lot since then.)


The new kid on the block is very much QLever. It's still lacking some features, especially w.r.t. real-time updates, that make it unsuitable for replacing the Wikidata SPARQL endpoint altogether just yet, but it's clearly getting there.


> The new kid

That kid is 7 years old already, and to my understanding it currently has only one active contributor. But the idea of the project is very strong.


If you just want to try some queries, there is a public sparql wikidata endpoint at https://query.wikidata.org . If you press on the file folder icon there are example queries, which let you get a feel for the query language.


Marklogic is the best triple store


While I'm all for standards-based options, I think the fetishization does a disservice to anyone dipping their toes into graph databases for the first time. For someone with no prior experience, Cypher is everywhere, and the ecosystem implements a ton of common graph algorithms that would otherwise be huge pain points. AuraDB provides an enterprise-level, fully managed offering, which is table stakes for, say, relational databases. Obviously the author has a bias, but one of the overarching philosophical differences between Neo4j and a triple-store solution is that the former is more flexible; that plays out in their downplaying of ontologies (which are important for keeping data manageable but are also hard to decide on and iterate on).


I can attest to that, or at least to the inverse situation. We have a giant data pile that would fit well onto a knowledge graph, and we have a lot of potential use cases for graph queries. But whenever I try to get started, I end up with a bunch of different technologies that seem so foreign to everything else we’re using, it’s really tough to get into. I can’t seem to wrap my head around SPARQL, Gremlin/TinkerPop has lots of documentation that never quite answers my questions, and the whole Neo4J ecosystem seems mostly a sales funnel for their paid offerings.

Do you by chance have any recommendations?


I think neo4j is a perfectly good starting point. Yeah, I feel like they definitely push their enterprise offering pretty hard, but having a fully managed offering is totally worth it IMO.


I enjoy Cypher; it's like you draw ASCII art to describe the path you want to match on and it gives you what you want. I was under the impression that, with things like openCypher, Cypher was becoming (if not already was) the main standard for interacting with a graph database (but I could be out of date). What are the better standards-based options you're referring to?


W3C SPARQL (SPARUL is now SPARQL 1.1 Update), SPARQL-star, GQL

GraphQL is a JSON HTTP API schema (2015): https://en.wikipedia.org/wiki/GraphQL

GQL (2024): https://en.wikipedia.org/wiki/Graph_Query_Language

W3C RDF-star and SPARQL-star (2023 editors' draft): https://w3c.github.io/rdf-star/cg-spec/editors_draft.html

SPARQL/Update implementations: https://en.wikipedia.org/wiki/SPARUL#SPARQL/Update_implement...

/? graphql sparql [ cypher gremlin ] site:github.com inurl:awesome https://www.google.com/search?q=graphql+sparql++site%253Agit...

But then there's data validation everywhere; for language-portable JSON-LD/RDF validation there are many implementations of JSON Schema for fixed-shape JSON-LD messages, there's the W3C SHACL (Shapes Constraint Language), and json-ld-schema is (JSON Schema + SHACL)

/? hnlog SHACL, inference, reasoning; https://news.ycombinator.com/item?id=38526588 https://westurner.github.io/hnlog/#comment-38526588


SPARQL, RDF triples.


ISO-GQL


Do you mind mentioning some of the options available that you consider better than Cypher?


>better standards-based options are available.

Which ones would you recommend?


I've been working on an implementation of graph RAG (GRAG) using Neo4j as the underlying store.

The overall DX is quite nice. The apoc-extended set of plugins[0] makes it very seamless to work with embeddings and LLMs during local dev/testing. The Graph Data Science package comes preloaded with a series of community detection algorithms[1] like Louvain and Leiden.
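As a rough sketch of what reaching for those looks like (the projected graph, label, and relationship names below are placeholders):

    // project a subgraph into GDS, then stream Louvain community assignments
    CALL gds.graph.project('entities', 'Entity', 'RELATES_TO');

    CALL gds.louvain.stream('entities')
    YIELD nodeId, communityId
    RETURN gds.util.asNode(nodeId).name AS entity, communityId
    ORDER BY communityId;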

Performance has been very, very good as long as your strategy to enter the graph is sound and you've structured your graph in such a way that you can meaningfully traverse the adjacent properties/nodes.

We've currently deployed the Community edition to AWS ECS Fargate using AWS Copilot + EFS as a persistent volume. There were some kinks with respect to the docs, but it works great otherwise.

It's worth a look for any teams that are trying to improve their RAG or are exploring GRAG in general. It's not a silver bullet; you still need to have some "insight" into how to process your input data source for the graph to do its magic. But the combination of the built-in graph algorithms and the ergonomics of Cypher make it possible to perform certain types of queries and "explorations" that would otherwise be either harder to optimize or more expensive in a relational store.

[0] https://neo4j.com/labs/apoc/5/ml/openai/

[1] https://neo4j.com/docs/graph-data-science/current/algorithms...


Thanks for the praise for APOC-ML, happy that it's useful.

Did you see the two blog posts that Tomaz Bratanic did on the topic?

For the ingestion: https://neo4j.com/developer-blog/global-graphrag-neo4j-langc...

For the retrievers: https://neo4j.com/developer-blog/microsoft-graphrag-neo4j/

My general point on GraphRAG is that it extracts and compresses the horizontal topic-clustering across many documents and makes that available for retrieval.

And that by creating the semantic network of entities, you can use patterns in the graph structure to answer questions that rely on information coming together from different documents. Think of a detective's board connecting facts with strings from many different sources.

Feel free to ping me for a deeper discussion: michael at neo4j


> Performance has been very

for how many records?


During our initial testing, ~1m nodes on a local Docker container with 1G RAM and 1vCPU.

But here I mean "performance" in both retrieval time and the overall quality of the fragments retrieved for RAG, compared to a `pgvector`-only implementation. It is possible to "simulate" these types of graph traversals in pg as well, but you'll have to work much harder to get the performance (we tried it first).


Huh. I've had the opposite experience. Neo4j has a pretty nice interface and package overall, but I was not impressed with the performance, and the developer experience was about on-par with Elasticsearch (not comparing the two databases, just the developer resources and communities). For general purpose use I've still not found anything better than Postgres (and yes, knowledge graphs I would consider general purpose). For my day-to-day work I'm constantly querying a regularly-updated knowledge graph consisting of >10M active, highly-connected nodes - I keep previous versions in the same database so I can traverse backwards through time. This is all on my laptop. No problems with latency or performance.

I'm always curious what people's use cases are with graph databases; do people find Cypher and SPARQL helpful? I've tried several times, but SQL is just so expressive. Postgres is still my favorite graph database (and CRUD RDBMS, and filesystem, and "data conversion tool").


If your performance is poor, try running your query with `PROFILE {your_query}`. It's very easy to write a query that ends up loading way more nodes than expected. Years ago we had one query that progressively performed worse -- turned out one leg was loading the full node space!

What I have found is that "land and expand" using an index to find the landing spots is key for performance. Reason being once you "land" effectively, "expand" is cheap and fast.
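e.g. something like this (an invented query) to check that the plan starts with a NodeIndexSeek rather than a full label scan:

    PROFILE
    MATCH (d:Doc {docId: $id})-[:MENTIONS]->(e:Entity)
    RETURN e.name, count(*) AS mentions
    ORDER BY mentions DESC;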

Some of it will also come down to your graph design. If you have a lot of super dense nodes (analogous to a large JOIN), it will create a lot of memory pressure which it does not handle well.

But in a RAG use case, I don't see these as being issues.


Number of nodes means nothing. What matters for performance is how interconnected your network is and how complex the relationships you want to extract are.


Right. I created a Neo4j db once with millions of nodes and relationships. Individual queries were very performant for all of my access patterns. Where it failed was with queries/sec. Throw more users at it, and it slowed to a crawl. Yes, read replicas are an option, but I was really discouraged with Neo4j performance with more than a few users.


If you are using the Community edition, check out the DozerDB plugin, which adds enterprise features such as multi-database support to Neo4j Community. It's still in its infancy but has already implemented multi-db and enterprise constraints. https://dozerdb.org


In addition to labelled property graphs and triples, a list of approaches to knowledge graphs should consider facts (tuples) connected via common values as a form of graph, with Datalog queries to query them. This is a lot more flexible than either approach IMHO, and also more easily connected to existing relational data.

RDFox is a tool that uses Datalog internally. RelationalAI uses a datalog based approach. Another example is Mangle Datalog, my own humble open source project that can be found on GitHub.

The language in the article about relational being "non native graph" is a bit biased. With some developer attention, there are massive opportunities to store data in a distributed manner, and with the right indices querying can be fast. Though to be fair, good performance will always need developer attention.


A knowledge graph is really just a projection of structured data from disparate sources into a common schema.

Take a bunch of tables and convert each row into tuples (rowkey, columnName, value), one per column. Now take the union of all the tables.

^ knowledge graph

That’s it…but it’s not very useful yet. It becomes more useful if you apply a shared ontology during the import—i.e., translate all the columns into the same namespace. Suppose we had a “contacts” table with columns {“first name”, “last name”, …} and an “events” table with columns {“participant given name”, “participant family name”, …} — basically you need to unify the word you use to describe the concept “first name”/“given name”/whatever across all sources.
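For instance, after the import those two sources might boil down to rows like this (values and predicate names invented, following the (rowkey, columnName, value) shape above):

    (contacts/row-17, person:givenName,  "Ada")
    (contacts/row-17, person:familyName, "Lovelace")
    (events/row-3,    person:givenName,  "Ada")
    (events/row-3,    event:date,        "1843-09-05")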

This can be cool/useful because you now only need one table (of triples) to describe all your structured data, but it’s also a pain because you may need to perform lots of self-joins or recursive queries to recover your data in order to do useful things with it. The final table has a very simple “meta” schema, and you erase the schema from each individual source so you can push the schema into the data.


For me a knowledge graph is a complex network.

When you try to grasp any complex topic your brain starts to build and connect a fuzzy network of topics and their respective positive or negative correlations and of course the weights between the connections.

Once you have unfuzzied the picture in your head, you realize that the network is active and dynamic, that it has different "modes" of operation, and that some weights and correlations can change over time while others are always static.

Mastering the dynamics of the knowledge graph is the final step in understanding it.


The world is a complex network. The knowledge graph is a reliable path through the network. I.e., "knowing" the way from your home to your office. Or "knowing" where to find your Word doc and how to use it.


What's great about knowledge graphs and property graphs in general is once you really get it (and it's not too difficult, especially if you come from a CS background) you start to see graphs all over the place. It's a really nice way to work with data for certain classes of problems. Once you get "enough" data in and "enough" of a variety of things connected, you start to see remarkable relationships emerge.


What sort of relationships have you seen in the data you've worked with that you'd describe as remarkable?

I've explained a similar thing to friends before, but I was always at a loss for relationships/insights that have led to concrete outcomes


You know Active Directory? It's basically a graph of objects, accounts and computers, all with their own complex permissions and relationships. During pentests it's VERY common to use a tool called BloodHound, which imports an Active Directory graph into Neo4j. You then use graph algorithms to find misconfigurations and paths to traverse to become domain admin, which pretty much always reveals something "unexpected".

https://github.com/BloodHoundAD/BloodHound


Supply chain and logistics. We’d start to see where the biggest risk points were and use that to diversify risk and also rank failover. We could make predictions about how the supply chain would be disrupted based on individual suppliers/movers/warehouses/etc having events that affected their ability to perform. You start to see how much some suppliers rely on each other, etc. Holy hell did Covid make that crazy!


I think anything that has physical {nodes} and {edges}, where all things in each category are alike but very distinct from the other category, and connectivity is only achievable via a limited set of (not free) paths.

E.g. telecom networks, electrical grids, travel networks (including logistics), etc.

All of these feature a very different node (usually a system or capital asset) vs an edge (usually a wire or constructed path), and insight into the structure is economically valuable.

(That's talking to more traditional uses of graph theory, less so to modern knowledge graphs)


We employ a knowledge graph at Deft (https://shopdeft.com) to enable searches over ~1M products, amounting to about 1B triples. Because of the complexity of the queries involved, the expressiveness of our data model — supporting n-ary/reified relations, negation, disjunction, linguistic vagueness, etc. — and our real-time latency targets, we built a graph DB engine "from scratch" (certain components are of course from open-source projects). Even RedisGraph wasn't fast enough for the purpose; ours (Deftgraph) is 700x faster on our queries thanks to some SOTA optimizations from various recent papers. You'll notice on our site that the overall search latency is generally acceptable but not great; the vast proportion of that latency comes from 1) LLMs and 2) a less-optimized other graph DB, Datomic, that we still store some of our data in for legacy reasons.

LLMs are great, but knowledge graphs are IMO indispensable to tame their shortcomings.


If you have a graph database that is 700x faster on real world use cases than the next nearest competitor, why aren't you selling it? Given the current AI gold rush, it seems like a no brainer to get some VC cash, hire some sales people, and start selling shovels.


I'd love to hear about your absolute numbers.

We had a similar problem, Datomic/Datascript not having an open format like RDF, but RDF being clunky and slow, so we built our own open-source solution in Rust (https://github.com/triblespace).

On an M1max we're currently at ~3us per query for a single result (so essentially per query overhead), and have something like 1m QRPS for queries with 3-4 joins.

I'm curious if you've somehow managed to shave off another order of magnitude, as I suspect that most WCO joins will be similarly limited by memory bandwidth. We for example worked out a novel join algorithm family (Atreides Join) and supporting trie based in-memory and succinct zero-copy on-disk data-structures, just to get rid of the query optimiser and its massive constant factor.


I would say that a better alternative to graph dbs would be prolog or datalog, because they're expressive enough to describe hypergraphs.

Prolog is better than Datalog in a lot of ways: CLP(Z), abduction, homoiconicity, being able to choose search strategies for different problems, tabling, etc.

There's been some work to integrate prolog with LLMs:

https://swi-prolog.discourse.group/t/llm-swi-prolog-and-larg...


As someone running Neo4j in production, I can just warn that the DBs are a pain and need a lot more care and love than Postgres or Oracle DBs, even much larger instances of those. Maybe their cloud offerings are better, but they are quite expensive.


Yes, Postgres is actually a great general-purpose graph database (excluding specialized network analysis that's actually pretty niche) if you can deal with the clunky recursive-CTE syntax for graph queries. (The new SQL standard actually comes with an added Property Graph Query/PGQ syntax specifically to make these queries easier to express.)


Have you tried AGE https://age.apache.org/ -- "Graph Database for PostgreSQL" ?


It will be quite a plot twist if Graph RAG paves the way for making knowledge graphs / semantic networks and the like cutting edge again... New "AI" meets old "AI" etc.


There is some recent research from Google on graph reasoning that pipes graph encodings through a vanilla LLM.


Wrote this a while ago. KGs are definitely the next new old thing.

https://aneeshsathe.com/2024/05/10/dancing-on-the-shoulders-...

With LLMs enabling easy, if noisy, KG creation, extracting knowledge into a computable form will lead to advances.

Drug discovery already uses the tech heavily, wouldn’t be surprised if it expands to more domains quickly now.


An interesting use of knowledge graphs is doing research into historic documents, such as when doing genealogical research or researching some historic event, person or location. In those applications, you often find that sources do not have direct references (a person's name in one document cannot always be identified with 100% certainty) or that sources contradict each other (one gives a different date than another). In this case another layer is needed: there is some need for attaching a source identification, the actual document (scans), an author and/or an authority to a source. In case you are extracting information from historical documents, it might be necessary to transcribe the contents, and in that case it would be nice to be able to mark parts of the text, to quickly verify the source of a fact.

I have not yet found an application that combines all those functions, and I have been considering building one myself.


I've built something along these lines. It utilises OCR to extract text content, indexes it for RAG, uses a separate service to identify/match concepts to reference data in an RDF knowledge graph, and displays the original source documents with the references to KG concepts overlaid.


> free copy of the O’Reilly book "Building Knowledge Graphs: A Practitioner’s Guide"

Knowledge Graph (disambiguation) https://en.wikipedia.org/wiki/Knowledge_Graph_(disambiguatio...

Knowledge graph: https://en.wikipedia.org/wiki/Knowledge_graph :

> In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to represent and operate on data. Knowledge graphs are often used to store interlinked descriptions of entities – objects, events, situations or abstract concepts – while also encoding the free-form semantics or relationships underlying these entities. [1][2]

> Since the development of the Semantic Web, knowledge graphs have often been associated with linked open data projects, focusing on the connections between concepts and entities. [3][4] They are also historically associated with and used by search engines such as Google, Bing, Yext and Yahoo; knowledge-engines and question-answering services such as WolframAlpha, Apple's Siri, and Amazon Alexa; and social networks

Ideally, a Knowledge Graph - starting with maybe a "personal knowledge base" in a text document format that can be rendered to HTML with templates - can be linked with other data about things with correlate-able names; ideally you can JOIN a knowledge graph with other graphs, if the node and edge relations with schema and URIs make it possible to JOIN.

A knowledge graph is a collection of nodes and edges (or nodes and edge nodes) with schema, so that it is query-able and JOIN-able.

A Named Graph URI may be the graphid ?g of an RDF statement in a quadstore:

  ?g ?s ?p ?o   // ?o_datatype ?o_lang
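e.g. a minimal SPARQL sketch for pulling one named graph out of a quadstore (the graph URI is just an example):

    SELECT ?s ?p ?o
    WHERE { GRAPH <http://example.org/my-kg> { ?s ?p ?o } }
    LIMIT 10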


I’ve got a django side project that uses neo4j. I use it to map out the static content in the domain space and a postgres database that handles more transactional stuff.

It works great. I'm not a db expert, but the flexibility and explicitness of the graph schema clicks for me. It took me a while to come around on Cypher, but now that I'm there it makes sense.


A rant about Chrome Bookmarks Manager (it's on-topic, I promise).

A few years ago, a Good Samaritan on HN told me my bookmarks (I had about 10,000 at the time) were my "knowledge graph".

I had no idea what that was, but upon researching the concept, I was mind-blown by the simple truth of what I had been told.

Since then I became even more rapacious with my bookmarking (and especially editing their "Name" field to add tags and keywords), and I have about 30,000 bookmarks now.

And they truly are my knowledge graph. More so than the 1,000 or so text files where I store my notes on various topics. Mainly because of the Bookmarks Search feature. Let's say I want to refresh my understanding of Permutations, Combinations, Factorials (like I wanted to do, and did, yesterday). All I have to do is enter those keywords into Bookmarks Search and I instantly get the best (i.e. most relevant and intuitive for me) articles I've ever found on these topics (because I've bookmarked and tagged/keyworded them in the past). I find that more and more bookmarks end up with 404's these days, but then there's Internet Archive. Invaluable.

Which brings me to Chrome Bookmarks. There seems to have been no innovation in the last 15 years (other than the time they replaced the prior [and post/current] system with the horrible "Cards", and thankfully reversed course due to the howling protests and decided to instead offer "Cards" as a Chrome Extension [which is apparently not popular]). For example:

- One still cannot exclusively search folder names.

- One cannot search within a single folder only (by extension, one cannot search within selected multiple folders only).

- One cannot use Regex to search, or do any kind of Fuzzy Search.

- One cannot download (cache) a copy of a bookmarked page (in case the page 404s in the future) to store permanently in some Zotero-type storage system.

- When adding a new bookmark, one cannot conduct a search (using text keywords) for the folder one wants to store it in.

I realize that some of these features are available via third-party Chrome Extensions and the like. However, I stopped relying on third party anything after I relied on one for tagging bookmarks and this third party Extension decided one day to 404 on me (it shut down).

Please Google do more with Chrome Bookmarks.

Rant over.

(Addendum 1: Regarding the text files: yeah, I've tried Zettelkasten software. The one that came closest to my liking was FeatherWiki. However, in the end I continued with plain ol' text files because they're the most Lindy [plain text files are likely going to be the last format rendered unreadable by whatever destroys civilization as we know it].)

(Addendum 2: there seem to be some people out there who have uncanny storage and retrieval systems. E.g. @Balajis on X comes to mind. When he's in a debate/argument on X, he has an uncanny ability to pull out relevant contextual material, instantly, from his back pocket when the situation demands it. Let's hope they share their systems.)


I know it's a third-party extension, but I created an open-source browser extension that helps me with a lot of that (I also use my bookmarks extensively for this):

https://github.com/Fannon/search-bookmarks-history-and-tabs#...

If you're afraid that it will go 404: this extension is open source, very easy to build and use locally, and it does not make any external requests or rely on external dependencies.


Thanks!


This doesn't seem on topic; how is it a knowledge graph? It's a (nested) list of bookmarks…


E.g. I add the keywords "Permutations, Combinations" to the "Name" field of a bunch of Bookmarks.

I have now crosslinked the Bookmarks.

So while it may not be exactly what the linked OP article is talking about, it's a way of storing knowledge, linked via horizontal relationships.

But whatever floats your boat.


There is a difference between "tagged data" and "linked data".


Whatever floats your boat.



