Show HN: Cozo – new Graph DB with Datalog, embedded like SQLite (github.com/cozodb)
425 points by zh217 on Nov 8, 2022 | 67 comments
Hi HN, I have been working on Cozo for the past half year, and it is now ready for public release.

My initial motivation was that I wanted a graph database: lightweight and easy to use, like SQLite; powerful and performant, like Postgres. I found none of the existing solutions good enough.

Having decided to roll my own, I needed to choose a query language. I am familiar with Cypher but consider it not much of an improvement over CTEs in SQL (Cypher is sometimes notationally more convenient, but not more expressive). I like Gremlin but would prefer something more declarative. Experiments with Datomic and its clones convinced me that Datalog is the way to go.

Then I needed a data model. I find the property graph model (Neo4j, etc.) over-constraining, and the triple store model (Datomic, etc.) suffers from inherent performance problems. Both also lack the most important property of the relational model: being an algebra. Non-algebraic models are not very composable: you may store data as property graphs or triples, but when you do a query, you always get back relations. So I decided to have relational algebra as the data model.

The end result, I now present to you. Let me know what you think, good or bad, and I'll do my best to address it. This is the first time I have used Rust in a significant project, and I love the experience!




How I have waited for this: A simple, accessible library for graph-like data with datalog (also in a statically compiled language, yay). Have even pondered using SWI-prolog for this kind of stuff, but it seems so much nicer to be able to use it embedded in more "normal" types of languages.

Looking forward to playing with this!

The main thing I will be wondering now is how it will scale to really large datasets. Any input on that?


Thanks for your interest in this!

It currently uses RocksDB as the storage engine. If your server has enough resources, I believe it can store TBs of data with no problem.

Running queries on datasets this big is a complicated story. Point lookups should be nearly instant, whereas running complicated graph algorithms on the whole dataset is (currently) out of the question, since all the rows a query touches must reside in memory. Also, the algorithmic complexity of some of the graph algorithms is too high for big data and there's nothing we can do about it. We aim to provide a smooth way for big data to be distilled layer by layer, but we are not there yet.


When you say "currently", does that imply it will change? Does that mean all rows will not need to be in memory?

What if you had not so many nodes, but each node had a lot of data? Would that improve things? Probably not, but normally I think of the number of nodes in the graph as the problem.


Yes. For example, in Postgres you can sort arbitrarily large tables, not constrained by main memory. Postgres uses external merge sort when the tables are really large. There are other situations in Postgres where the working data is kept on disk because it is too large. We will eventually be able to do that in Cozo as well, but no timetable is available yet.

For your second question, say you have a relation with lots of fields, one of them particularly large. As long as you don't use that field in your query, it will not impact memory usage. The query may be slower though since the RocksDB storage engine needs to read more pages from disk, but the fields that are loaded by RocksDB but not needed will be promptly evicted from memory.


Many thanks for the detailed answer!


For folks looking for documentation or getting-started examples, see:

- The tutorial: https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...

- The language documentation: https://cozodb.github.io/current/manual/

- The pycozo library README for some examples on how to run this from inline python: https://github.com/cozodb/pycozo#readme
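If you just want a quick taste before clicking through, here is a minimal sketch based on the pycozo README (assumptions: the embedded mode is installed, and `Client()` works with default arguments; the exact constructor arguments may differ, so check the README):

```python
from pycozo.client import Client

# Start an embedded Cozo instance inside this process (no server needed).
client = Client()

# Run a trivial CozoScript query; `in` unifies `a` with each list element.
print(client.run('?[a, b] := a in [1, 2, 3], b = a * a'))
```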


This is a really impressive piece of work! Congratulations!

I note that it appears to be a library, but it's licensed under the Affero GPL. I believe this means that if I link your library into a program, and if I then allow users to interact with that combined program in any way over a network, then I have to make it possible for users to download the source code to my entire program. Is that your goal here? Were you thinking of some kind of commercial licensing model for people writing server-side apps that use your library?

(I'm curious because I've been deciding whether or not to roll my own toy Datalog for a permissively-licensed open source Rust project.)


No, my understanding is that if you don't make any changes to the Cozo code, you don't need to release anything to the public. If you do, and you cannot release your non-Cozo code, then you must dynamically link to the library (and release your changes to the Cozo code). The Python, NodeJS and Java/Clojure libraries all use dynamic linking.

There is no plan for any commercial license - this is a personal project at the moment. My hope is for this project to grow into a true FOSS database with wide contributions and no company controlling it. If a community forms and after I understand the consequences a little bit more, the license may change if the community decides that it is better for the long-term good of the project. For the moment though, it is staying AGPL.


Let me preface by saying that this seems like a great piece of software and it is absolutely within your right to license it as whatever you would like, no matter what any of the commenters here think.

However, I don't believe your understanding of AGPL is accurate.

> No, my understanding is that if you don't make any changes to the Cozo code, you don't need to release anything to the public. If you do, and you cannot release your non-Cozo code, then you must dynamically link to the library (and release your changes to the Cozo code). The Python, NodeJS and Java/Clojure libraries all use dynamic linking.

This sounds like you're thinking of the LGPL, not the AGPL. The LGPL is less strict than the GPL because the exception you describe above applies; the AGPL, on the other hand, is more strict. Essentially, if you use any AGPL code to provide a service to users, then you must also make the source code available, even if the software itself is never delivered to users.

The intention here is that you can't get around GPL by hiding any use of the GPL code behind a server, so it makes perfect sense to use it for a database. But I don't think it does what you want.

Whichever way you decide to go, be it AGPL, LGPL or something else, I encourage you to make a choice before accepting any outside contributions. As soon as you have code from other authors without a CLA you will need to obtain their permission to change the license (with some exceptions).

(Disclaimer: I'm not a lawyer, just interested in licenses.)


It seems that I really did misunderstand the differences. It is now under LGPL. The repo still requires a CLA for contributions for the moment, until I am really sure.


> The repo still requires a CLA for contributions for the moment, until I am really sure.

I just wanted to mention that this sounds like a great idea for a new project: stay behind a CLA for a while, just in case the initial license turns out to be an issue.


Thank you for your perspective.

Maybe I was confused about the case of using an executable vs linking against a library. Let me double-check with a few friends who understand copyright law better than I do. If everything checks out, the next release will be under LGPL.

About the CLA: at the earlier suggestion of a friend, the repo currently requires a CLA (even though nobody outside has contributed yet). This will be lifted once the situation becomes clearer.


> If a community forms and after I understand the consequences a little bit more, the license may change if the community decides that it is better for the long-term good of the project. For the moment though, it is staying AGPL.

Yes, I do want to be clear: I encourage you to use whatever license you like. You wrote the code! I was just curious, because it would also affect the license of any hypothetical software I wrote that used the library.

Here's a super oversimplified version of the main license types (I am not a lawyer):

- Permissive: "Do whatever you want but don't sue me."

- LGPL: "If you give this library to other people, you must 'share and share alike' the source and your changes to this library."

- GPL: "If you use this code in your program, you must 'share and share alike' your entire program, but only if you give people copies of the program."

- AGPL: "If you use this code in your program, you must 'share and share alike' your entire program with anyone who can interact with it over a network."

The AGPL makes a ton of sense for an advanced database server, because otherwise AWS may make their own version and run it on their servers as a paid service, without contributing back.

But like I said, I'm simplifying way too much. Take a look at the FSF's license descriptions and/or talk to a lawyer. This shouldn't be stressful. Figure out what license supports the kind of users and community you want, pick it, and don't look back. :-)

(I may end up writing a super-simple non-persistent Datalog at some point for an open source project. My needs are much simpler than the things you support, anyways—I only ever need to run one particular query.)


I realized my mistake, as I said in the other comments. The main repo is now under LGPL. I'll see what I'll do with the bindings. Writing code is so much better than dealing with licenses!


Oh, cool!

And yeah, licenses can be challenging and frustrating, especially the first time you release a major project.

I am really super excited by the idea of embedded Datalog in Rust. I sometimes run into situations where I need something that fits in that awkward gap between SQL and Prolog. I want more expressiveness, better composability, and better graph support than SQL. But I also want finite-sized results that I can materialize in bounded time.

There has been some very neat work with incrementally-updated Datalog in the Rust community. For example, I think Datafrog is really neat: https://github.com/frankmcsherry/blog/blob/master/posts/2018... But it's great to see more cool projects in this space, so thank you.


I am not a lawyer, but I work in an open source programs office and am currently working specifically on open source license compliance.

Beyond what the sibling comments have said about LGPL sounding more like what you're going for, I'll just note that if you'd like broad adoption of this while still ensuring that changes to your code remain open, you might also want to consider the Mozilla Public License.

From what I understand of the MPL and LGPL, the MPL is better for instances where dynamic linking isn't possible. The MPL basically says that any changes _to the files you created_ must be available under the MPL, preserving their public availability.

That said, most organizations are fine with the LGPL, but it just gets gnarly if there are instances where you really want to statically link something but you still fully want to support the original library's openness.


Licensing under AGPL will make it hard for any startup to use Cozo. Lawyers always ask about AGPL in venture financing diligence and it is considered a red flag. You can argue that they are wrong, the linking exception and so on, but you’re basically shouting into the wind.


AGPL is a variant of the GPL, not the LGPL, meaning that dynamic linking still constitutes (according to them) a derivative work, so even programs that dynamically link against it must themselves be AGPL in their entirety. Dynamic linking is also meaningfully complicated to do in Rust, and this licensing of the crates.io crate will be a footgun for anyone not using cargo-deny.

I think this is a very cool project, but its use of *GPL essentially ensures I'm not going to use it for anything. If you're planning on reducing it to LGPL, I'm not sure what the GPL is getting you over going with the Rust standard license set of MIT + Apache 2.0.


If I'm not mistaken that sounds more like LGPL than the AGPL?


Maybe, and maybe I need to consult a lawyer someday to get the facts straight. To tell you the truth, my head hurts when I attempt to understand what these licenses say. Regardless, I intend this project to be true FOSS; the "finer detail" of which FOSS license it uses may change.


My understanding is the same as kylebarron's[0] since you lack linking protections (which you would get under LGPL), so any work that includes cozo would be a "derived work" under the (A)GPL. Interestingly there doesn't seem to be an affero LGPL license[1], which could be what you might want here.

Otherwise, simplest solution provided you want a copyleft license would be to use the LGPL I think.

NOTE: not a lawyer.

[0] https://softwareengineering.stackexchange.com/questions/1078...

[1] https://redmonk.com/dberkholz/2012/09/07/opening-the-infrast... (old link, but I couldn't find anything since then describing this kind of license?)


We kinda do have it; it's just mostly useless, given the linking clause. (Not entirely useless, though, as that article sets out.)

GPL and AGPL have the same layout, so you can just take the LGPL, and replace all references to 'GPL' and 'GNU General Public License' with 'AGPL' and 'GNU Affero General Public License'. Of course, you couldn't call that license 'GNU ALGPL' or 'GNU LAGPL'; you'd have to come up with your own name. (Disclaimer: I'm not a lawyer, and I haven't checked this as thoroughly as I would if I were going to use this for my own software.)

Maybe it's worth bothering Bradley M. Kuhn (http://ebb.org/bkuhn/) again and seeing what the current status of a Lesser AGPL is?


It's also odd, then, that the Python bindings are MIT, as the AGPL will convey throughout any aggregation or library usage, as the GPL would. The primary delta between GPL and AGPL is the latter's intent regarding network-offered services, which in the context of an embedded library/DB is odd. Rightly or wrongly, many orgs will refuse to allow usage of GPL/AGPL software due to licensing concerns about the effects on the rest of their IP. DuckDB (embedded analytics SQL) uses MIT, etc. So in terms of creating a "true FOSS" project, i.e. a community of users and contributors, it's definitely worth considering a licensing change IMHO, but of course, dealer's choice.


OP here. Nothing about the license is final yet since there are no outside contributors. I just changed the main repo to LGPL, not because what I believed in changed, but because it seems that I really misunderstood the licenses.


That's a fair enough stance. I'd recommend not taking any outside contributions until you are sure about the license, since it'll make it much harder to change the license if you do. Or maybe require all outside contributions to be licensed very permissively, like under the BSD license. Or you could use a CLA, but that's not something I'd recommend. Either way, licensing is hard :(. I can empathise with the head hurting... Oh, also, check out https://tldrlegal.com/ .


Very cool! I love the sqlite install everywhere model.

Could you compare use case with Souffle? https://souffle-lang.github.io/

I'd suggest putting the link to the docs more prominently on the github page

Is the "traditional" datalog `path(x,z) :- edge(x,y), path(y,z).` syntax not pleasant to the modern eye? I've grown to rather like it. Or is there something that syntax can't do?

I've been building a Datalog shim layer in python to bridge across a couple different datalog systems https://github.com/philzook58/snakelog (including a datalog built on top of the python sqlite bindings), so I should look into including yours


I find nothing wrong with the classical syntax, but there is a very practical, even stupid reason why the syntax is the way it is now. As you can see from the tutorial (https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...), you can run Cozo in Jupyter notebooks and mix it with Python code. This is the main way that I myself interact with Cozo. Since I don't fancy writing an unmaintainable mess of Jupyter frontend code that may become obsolete in a few years, CozoScript had better look enough like Python so as not to completely baffle the Jupyter syntax highlighter. That's why the syntax for comments is `#`, not `//`. That's also why the syntax for stored relations is `*stored`, not `&stored` or `%stored`.

This was a hack from the beginning, but over time I grew to like the syntax quite a bit. And hopefully, by being superficially similar to Python or JS, it causes less confusion for new users :)
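To make that concrete, here is roughly how the classical `path` rule from the sibling comment looks in CozoScript. This is a hedged sketch run through pycozo, with a made-up stored relation `edge`; the manual is authoritative on the exact syntax:

```python
from pycozo.client import Client

client = Client()

# Create and populate a toy stored relation (names are hypothetical).
client.run('?[fr, to] <- [[1, 2], [2, 3]] :replace edge {fr, to}')

# Classical Datalog:  path(x, z) :- edge(x, y), path(y, z).
# CozoScript: `#` comments, `*edge` marks a stored relation,
# `:=` replaces `:-`, and `?` is the entry rule of the query.
script = '''
path[x, z] := *edge[x, z]
path[x, z] := *edge[x, y], path[y, z]
?[x, z] := path[x, z]
'''
print(client.run(script))
```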


Interesting! I'm thinking ... perhaps a small syntax comparison for prolog/classical datalog vs cozo, would help people used to the classical syntax quickly get started.


Ah, that's very interesting. Thank you. `s.add(path(x,z) <= edge(x,y) & path(y,z))` is what I chose as python syntax, but it is clunkier.


This is amazing!

Have you looked at differential-datalog? It's Rust-based, maintained by VMware, and has a very rich, well-typed Datalog language. differential-datalog is in-memory only right now, but it could be ideal to integrate your graph as a datastore or disk spill cache.

https://github.com/vmware/differential-datalog


Differential-datalog is a cool project. I think the targeted use cases are different as compared to Cozo. The most important difference is that Cozo is focused on graphs, whereas differential-datalog is focused on incremental computation. These two goals are somewhat at odds with each other, as for queries with lots of joins (very common in graph computations), you can't know whether it's better to compute new results incrementally or to recompute everything until you actually run the query. Also, Cozo caters for the exploratory phase of data analysis (no need to define types/tables beforehand), whereas in differential-datalog everything must be explicit upfront.


For everyone else: it looks like parent submitted it and it's now on the frontpage: https://news.ycombinator.com/item?id=33521561


Thank you, this looks very useful. I will try the Python embedded mode when I have time.

I especially like the Datalog query examples in your project README file. I usually use RDF/RDFS and the SPARQL query language, with much less use of property graphs using Neo4J. I expect an easy ramp-up learning your library.

BTW, I read the discussion of your use of the AGPL license. For what it is worth, that license is fine with me. I usually release my open source projects using Apache 2, but when required libraries use GPL or AGPL, I simply use those licenses.


You mention that Cypher is not much of an improvement over CTEs in SQL; I was wondering if you could expand on this point a bit, if possible?

Some part of me is considering using the Apache AGE graph extension for Postgres, but another part wonders whether it's worth it, considering CTEs can do a lot very similarly.

I'll definitely be following Cozo's progress though; it sounds great on the face of it, and I'll have to consider using it as well. I wonder if it could make sense to use Postgres and Cozo together?


Yes of course.

Perhaps I should start by clarifying that I am talking about the range of queries the Cypher language can express, without any vendor-specific extensions, since my consideration was whether to use it as the query language for my own database. And Cypher is of course much more convenient to _type_ than SQL for expressing graph traversals - it was built for that.

With that understanding, any Cypher pattern can be translated into a series of joins and projections in SQL, and any recursive query in Cypher can be translated into a recursive CTE. Theoretically, SQL with recursive CTEs is not Turing complete (unless you also allow window functions in recursive CTEs, which I don't think any of the Cypher databases currently provide), whereas Datalog with function symbols is. Practically, you can easily write a shortest path query in pure Datalog without recourse to built-in algorithms (an example is shown in the README), and at least in Cozo it executes essentially as a variant of Dijkstra's algorithm. I'm not sure I can do that in Cypher. I don't think it is doable.
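For the curious, the README example I mentioned is along these lines. A hedged sketch (relation and column names hypothetical): because `min` is a semi-lattice aggregation, the rule below is allowed to refer to itself, and evaluation ends up resembling Dijkstra's algorithm rather than naive path enumeration:

```python
from pycozo.client import Client

client = Client()
client.run(':create route {fr, to => dist}')
client.run("?[fr, to, dist] <- [['A', 'B', 1.0], ['B', 'C', 2.0], ['A', 'C', 5.0]] "
           ":put route {fr, to => dist}")

# Shortest known distance from 'A' to each reachable node.
script = '''
shortest[to, min(dist)] := *route{fr: 'A', to, dist}
shortest[to, min(dist)] := shortest[mid, d], *route{fr: mid, to, dist: d2}, dist = d + d2
?[to, dist] := shortest[to, dist]
'''
print(client.run(script))
```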


Does Cypher even support nested and/or recursive queries? I remember asking the Neo4j guys at a meetup about that many years ago, and they didn't even seem to understand the question. Might have changed since then of course.

Otherwise, the thing I have noticed with Datalog (as well as Prolog) syntax is that you are able to build a vocabulary of reusable queries, in a much more usable way than in any of the solutions I've seen in SQL or other similar languages.

It thus allows you to raise your level of abstraction, by defining, layer by layer, your definitions (or "classes" if you will) with well-crafted queries that can be used for further refined classifying queries.


Re Datalog syntax: yes, the "composability" is the main reason that I decided to adopt it as the query language. This is also the reason why we made storing query results back into the database very easy (no pre-declaration of "tables" necessary) so that intermediate results can be materialized in the database at will and be used by multiple subsequent queries.
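For illustration, a hedged sketch of that workflow (relation names made up; see the manual for the exact `:put`/`:replace` options):

```python
from pycozo.client import Client

client = Client()
client.run('?[a, b] <- [["ann", "bob"], ["bob", "cy"]] :replace friend {a, b}')

# Materialize an intermediate result as its own stored relation in one step;
# `:replace` creates the relation if it does not already exist.
client.run('''
?[a, c] := *friend[a, b], *friend[b, c]
:replace fof {a, c}
''')

# Subsequent queries build on the materialized layer directly.
print(client.run('?[a, c] := *fof[a, c]'))
```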


Indeed, composability is the spot-on keyword here.


This looks nice!

Datascript seems to be another Datalog engine (in-memory only)

https://github.com/tonsky/datascript


There are a few more, including ones supporting on-disk databases: https://en.wikipedia.org/wiki/Datalog#Systems_implementing_D...


I thought there was a big class of queries Datalog could not express -- something about negation, queries like "all X for which not Y". Is that not true? Or if it is, is Datalog somehow Turing complete nonetheless?


Technically, Cozo is using something called "Datalog with stratified negation, stratified aggregation, and function symbols", allowing aggregations to be auto-recursive when they form a semi-lattice, together with built-in algorithms which are black boxes taking in relations and giving you back another relation. Your example is taken care of by the "stratified negation" part.
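Concretely, the "all X for which not Y" pattern looks something like this (a hedged sketch with made-up relations):

```python
from pycozo.client import Client

client = Client()
client.run('?[x] <- [[1], [2], [3]] :replace node {x}')
client.run('?[x] <- [[2]] :replace banned {x}')

# "All x for which not banned(x)": negation via `not` is fine here because
# `banned` can be fully computed in a stratum below this rule.
print(client.run('?[x] := *node[x], not *banned[x]'))
```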

I believe "Datalog with function symbols" is already Turing complete, but you are right, what they call "Datalog" without any qualification in academic papers is not.


Great initiative, hope this takes off :)

Just FYI, the largest, most-used knowledge graphs in the world (Google's and LinkedIn's) are not running on RDF4J or any RDF triplestore, but on their proprietary graph stores, which also use Datalog as a query language.

For those looking for an enterprise-ready equivalent (also datalog queries) and have a good wad of cash, consider https://www.oxfordsemantic.tech/product


This does look very nice!

Especially (from my point of view) having the Python interface.

What are the max practical graph sizes you anticipate?


For the moment: you can have as much data as you want on disk as long as the RocksDB storage engine can handle it, which I believe is quite large. For any single query though, you want all the data you touch to fit in memory. The good news is that Rust is very efficient in using memory. This will be improved in future versions.

For the built-in graph algorithms, you are also limited by the algorithmic complexity, which for some of them is quite high (notably betweenness centrality). There is nothing the database can do to help in this case, though we may add some approximate algorithms with lower complexities later.


And what is the biggest size that you have tested?


Around 10 GBs of data, on the standalone version. We will have systematic benchmarks when the API, syntax, etc. settle down a little bit.


Looks incredibly polished. Everything from the logo through the ready made bindings. Very impressive!


Good job! How to transact? The examples only show queries.


Transactions are described in the manual: https://cozodb.github.io/current/manual/stored.html#chaining....

Sorry about the docs being all over the place at the moment! My only excuse is that Cozo is very young. The documentation (and the implementation) still needs a lot of work!


This is very similar to the goals of a project I've been working on, though I've been focusing on the raw storage format (literally a drop-in replacement for RocksDB, so this could be interesting). I think datalog databases are far underrated.


> you may store data as property graphs or triples, but when you do a query, you always get back relations

Can you elaborate on this? In Datomic you can get back hierarchical data.


I believe you are referring to Datomic's pull syntax (https://docs.datomic.com/on-prem/query/pull.html). The way I see it is that this is an add-on to the query language, not a core part of it, since it applies to the output of the query only. This would be analogous to having a GraphQL server on top of Cozo's relational model. (In fact, in earlier versions of Cozo we did have something like the pull syntax, but we quickly decided that we do not want two distinct ways of querying data in the same database, nor do we want to create a GraphQL clone).

I personally find Datomic's way of doing things a bit too convoluted, as you need to

- learn how to define the schema

- learn how to transact data

- learn how to query data

- learn how to "pull" data after the query is done

and the steps are all very different.


Awesome work, congrats.

As someone who has never done anything with Datalog, I didn't see an example in the repo, and the docs (docs.rs) could use some more content.

I hope to see a 1.0 at some point and performance that can compete with SQLite.

Would love to have an alternative, especially as I have a few pet projects that have graph data (well, in the end the whole universe can be modelled as a graph ;))


I'm very happy that you like it!

The "teasers" section in the repo README contains a few examples. Or you could look at the tutorial (https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...), which contains all sorts of examples.

The Rust documentation on docs.rs could certainly be improved, will do that later!


Ah, yes, mea culpa. Was browsing on the phone and did miss that link indeed.

Is it also okay to store big data that would otherwise go into other storage, e.g. blog posts?

I mean, the content could also be modeled as a leaf node and not be part of the DB itself. (Not sure if that would be abusing the KV storage.)


In short: yes, but not right now. See this issue: https://github.com/cozodb/cozo/issues/2. Also in this case you are not really using it as an embedded database anymore, which is our original motivation. We currently also provide a "cozoserver", but it is pretty primitive at the moment. "Big data" capabilities, when they arrive in Cozo, will probably go into the server instead of the embedded binaries.


Hm, why wouldn't that be embedded?

How do you define embedded?

One of my applications is a simple "blog-like" webservice where you can either use an SQLite db or Postgres.

Personally I often prefer SQLite because it doesn't need a thousand configurations and I can just migrate all the content with copying a file.


My use of "embedded" means that the whole database runs in the same process as your application. This is how SQLite works. Your application doesn't "connect" to an SQLite database in the usual sense. Your application simply contains SQLite as part of itself. Contrast this with Postgres, where you first need to start a Postgres server and then have your application talk to it.
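Python's standard library sqlite3 module shows the same pattern, for comparison:

```python
import sqlite3

# No server process anywhere: the whole database engine runs inside this
# Python process, and the "database" is just a single file on disk.
conn = sqlite3.connect('blog.db')
conn.execute('CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, body TEXT)')
conn.commit()
conn.close()
```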


Exactly.

I was just curious because of your comment:

> Also in this case you are not really using it as an embedded database anymore, which is our original motivation

By your definition (and mine), I am indeed using it as an embedded database. It's running inside the process and storing (and persisting) blog posts.


I’m excited to get some more Rust docs!

Even just a pointer to serde's `::from_value(value).unwrap()` and `<TheType as Deserialize>::deserialize(value)` would be helpful to get people pointed in the right direction.

Looks like a super cool project, congrats!


Really nice!

I like the design choices of Datalog for the query language and relations for the data model. This contrasts with the typical choices made for graph databases, where the word "graph" seems to make links a mandatory query and representation tool.


A nitpick for the README: consider converting the examples from images to code blocks (you can even directly copy-paste them into the code blocks and they should retain their formatting).

Otherwise: yes, please. I love the idea.


Graph query over relational data, brilliant. I need this yesterday.


Same here, I've tried out Dgraph and Neo4j, but they seemed a little complex. This on the other hand seems much simpler (like SQLite!)


I have been meaning to do this exact project for 5 years at least. Congrats on making it happen - looking forward to using it


This is amazing. I can't wait to play with it



