Gitqlite: Query Git Repositories with SQL (github.com/augmentable-dev)
166 points by sebazzz on July 4, 2020 | 36 comments



Creator here! Thanks for submitting. I hope people find this interesting. I'm excited to explore more use cases beyond listing and reformatting git log output! There's a similar project called gitbase, but I had difficulty just spinning that up for ad-hoc CLI querying, so I created this project in the hope of a better UX for that type of use case. It uses SQLite virtual tables, which was something I had been meaning to play around with. This seemed like a cool use of them!
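To give a flavor of the ad-hoc querying it enables, here's a minimal example against the commits table (the table name shows up below in the thread; the column name is an assumption, so check the README for the actual schema):

    -- top 10 committers by commit count (author_name is assumed)
    SELECT author_name, COUNT(*) AS commit_count
    FROM commits
    GROUP BY author_name
    ORDER BY commit_count DESC
    LIMIT 10;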


A friend of mine did something similar a few years ago: https://github.com/filhodanuvem/gitql


Cool. It's taken me years to find an efficient (and most importantly, easily debuggable) way to transfer Git's history to SQL. Since you are looking for use cases, I'll share with you some of the ways that I'm leveraging SQL to help drive code analytics for Git.

Once Git's history is in an easy-to-query state, you can do some interesting things. For example, if you aggregate all the contributors in a Git repo, you can produce something like this:

https://imgur.com/tT4DSJc

which makes it very easy to identify project investment/commitment. In the above example, you can see that Microsoft is heavily invested in vscode, as a lot of the developers contributing to it have been doing so for more than 3 years. And if you aggregate contributions by file type, you can see how people are contributing as well. In the case of vscode, the contributions are mainly TypeScript.
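The aggregation behind a chart like that is roughly the following (the schema here is my shorthand, not gitqlite's actual one):

    -- per-contributor tenure: first commit, last commit, and volume
    SELECT author_email,
           MIN(author_when) AS first_commit,
           MAX(author_when) AS last_commit,
           COUNT(*) AS commit_count
    FROM commits
    GROUP BY author_email
    ORDER BY commit_count DESC;

Contributors whose first and last commits are more than 3 years apart are the long-term investment visible in the vscode chart.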

Here is another contributor example, which shows GitLab contributions:

https://imgur.com/3UU6Kda

What the above analytics show is that GitLab has a lot of contributors, and a lot of them are new contributors (6 months or less), which makes sense since they were hiring aggressively not too long ago. Not sure if this is still the case with Covid-19, but that can easily be confirmed 6 months from now with the same chart.

Now for something more interesting, in my opinion: code review analytics.

https://imgur.com/xWH6QFl

It has taken a lot of research and development to get to this point, but once you can easily query Git, you can surface very interesting things by cross-referencing it with external systems, like GitHub's pull request system.

In the pull request screenshot, I created a window that only considers open pull requests updated within the last 30 days. With this type of window, I can see what has changed across dozens, if not hundreds or thousands, of pull requests. For example, I can easily identify file collisions between pull requests, when their last commit was, and so forth.
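As a rough sketch, assuming the PR metadata has been pulled into tables like pull_requests and pr_files (both names are mine, not from any real system), the collision query is an aggregation along these lines:

    -- files touched by more than one open PR updated in the last 30 days
    SELECT f.path, COUNT(DISTINCT f.pr_id) AS colliding_prs
    FROM pr_files f
    JOIN pull_requests p ON p.pr_id = f.pr_id
    WHERE p.state = 'open'
      AND p.updated_at >= DATE('now', '-30 days')
    GROUP BY f.path
    HAVING COUNT(DISTINCT f.pr_id) > 1;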

I'm still working on refining my code review analytics, but the goal is to get it to an advanced state where you can see exactly what is happening across pull requests and derive insights from them.

So those are just some of the use cases I've developed that leverage being able to query Git history with SQL.


Do you have a schema somewhere? I've been working on a dataset that includes git repositories, and I've been muddling through it slowly. It's built around a dataset of ~126k builds I collected some time back, plus ~5k Pivotal Tracker stories for the same time period. It covers about 2.5 years across 3 teams.

The hardest parts have been (1) dealing with actual lines, which I gave up on, and (2) very busy robot repos with hundreds of thousands of commits.

My goal is to release the data as a single integrated set, but there's a ways to go. For one thing I need to find everyone in it to ask if they're OK with me doing so.


Sorry, I haven't published it yet. It honestly took a long time to develop the schemas, but I might publish them in the future. The issue right now is that I'm a single founder, so I really have to be smart with my time, and publishing things will just add to my workload.

My goal is to make the indexed data easily accessible, so that you can easily cross-reference Git's history with whatever external systems you may have. What I've created is really a search and analytics engine for Git, designed for querying via SQL or through a REST interface.

On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a 2700X CPU, the search engine can easily index hundreds of millions of changes.

https://imgur.com/WoS4Nr6

The search engine can run on as little as 500 MB of RAM (with 2 GB of swap space), though on such limited hardware you can only index small repositories.

Are these repos public and on GitHub? If so, I can include them in my indexing in the future.


> The issue right now is that I'm a single founder, so I really have to be smart with my time, and publishing things will just add to my workload.

Understood.

Do you store lines or full blobs at all? That's really where I came unglued on my first pass. I still want to reintroduce them somehow so that researchers can study changes more closely.

> On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a 2700X CPU, the search engine can easily index hundreds of millions of changes.

There's nothing quite like a good database on bulk hardware, is there?

> Are these repos public and on GitHub? If so, I can include them in my indexing in the future.

They are, but I am not sure about pointing them out just yet. What I'm doing looks to be a first for VMware, so we're moving cautiously.


> Do you store lines or full blobs at all? That's really where I came unglued on my first pass. I still want to reintroduce them somehow so that researchers can study changes more closely

No, since Git already does a pretty good job of storing blobs efficiently. I would like to be able to execute

    SELECT blob FROM blobs WHERE sha = <some sha>

but I can't justify the overhead of storing blobs in a database just yet. This isn't to say I won't in the future, but if I do, I'll probably introduce a key/value DB for it instead of using Postgres. I do index blobs and diffs with Lucene, though, and I also store the diffs in Postgres.

> What I'm doing looks to be a first for VMware, so we're moving cautiously

Understood.


That seems like a really cool project, but I couldn't figure out how exactly the WHERE filter is implemented. Can you point me to the code that implements it?


Can you query multiple repos with one query? Interesting tool!


At the moment, not easily, but someone else just asked about that and I think it's worth looking into! It makes sense and could be a cool use case to run queries on all the key repos in an org or project.
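Hypothetically, if each row carried a repo identifier (it doesn't today; the repo column below is imaginary), cross-repo queries would just be one more GROUP BY:

    -- purely hypothetical: a repo column does not exist yet
    SELECT repo, COUNT(*) AS commit_count
    FROM commits
    GROUP BY repo
    ORDER BY commit_count DESC;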


This would make the difference between neat and useful for my team.


same!


See also: Fossil, a version control system backed by a SQLite database. The author (of both Fossil and SQLite) points out that a real database gives you more flexible, efficient queries over your data. In git, by comparison, it is much easier, for example, to find the parents of a commit than its children.


I used Fossil professionally for about a year before the company got bought out and switched to git. It was interesting. It's been a few years since then, but here are my thoughts:

* The built-in web server was neat and useful.

* When things got into a weird state (anyone learning Git knows what I mean), it was extremely difficult to find a solution. Doing a web search was a waste of time; no one is sharing their useful Fossil SCM knowledge online.

* Cleartext passwords. To clarify, users are local to the clone, so it's not like they're being shared... but then again, cleartext passwords. I just looked it up again; it looks like it's SHA-1 now... bcrypt would be nice.

* Fossil SCM does not believe in altering history. You're not going to find any squash commits or branch rebases. There is a 'shun' command for removing sensitive information, but it's specifically designed not to work like git history edits and has weird behavior regarding clones.

Overall, I significantly prefer git. Being able to find solutions when something goes wrong is huge. Not having to train people on another tool when they join the team is also good: you can put git as a skill requirement without limiting your options; you can't do that with Fossil SCM. I learned Git first, so I'm very likely biased toward the git way vs. the Fossil way. Still, it was an interesting experience. A version control system with a built-in web server is certainly unique!


> In git, by comparison, it is much easier, for example, to find the parents of a commit than its children.

Sure, but both are really cheap.


Well, no, in huge repositories with lots of branches it's not. But this is true for Fossil as well. IIRC, MSFT did a study with the Windows repository showing that a relational approach to VCS bloated the repository too much, and out of this grew the Bloom filter approach to speeding up git blame/log.


> But this is true for Fossil as well.

Since I don't know anything about how Fossil stores commit metadata, this might be a naive question, but why is that?

I would guess that being able to efficiently query all the children of a particular commit should be just an index away.
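In SQLite terms, something like this (the table and column names are my guesses, not Fossil's actual schema):

    -- parent -> child edge table with an index on the parent side
    CREATE TABLE commit_edges (
        parent_sha TEXT NOT NULL,
        child_sha  TEXT NOT NULL
    );
    CREATE INDEX commit_edges_parent ON commit_edges (parent_sha);

    -- finding all children of a commit is then a single indexed lookup
    SELECT child_sha FROM commit_edges WHERE parent_sha = :sha;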


> I would guess that being able to efficiently query all the children of a particular commit should be just an index away.

Yes, but where will you find the space, and when will you compute it?


I'm not sure I understand what you mean. I was talking about Fossil, which uses SQLite under the hood, which would then maintain the index whenever data in it is modified.


Scale. Microsoft couldn't make a SQL VCS work. Maybe it's because they tried to use SQL Server instead of PostgreSQL, I dunno :) but IIRC the amount of metadata involved essentially meant they had no hope of doing git-like cloning: it was too much to clone.


You would need to implement a persistent data structure, but without refcounts you would have to do a GC. I guess you could do a refcount tree and only refcount at the last common node.

Really, atime is the devil. The funny thing is that all modern higher-level systems have some form of data access logging, which is what you want anyway. So why not form some easily compressible event log?


You can name the binary git-qlite if you want to invoke it as a subcommand of git:

    git qlite "SELECT * FROM commits"


How can you pass up naming it “git-sql”?!

    git sql "SELECT * ..."


Haha, you're totally right! If it's just a matter of renaming the binary and dropping it in a particular directory to get git to know about it, I can leave that part up to the user. I'll definitely have to make a note in the README.

gitqlite is a little clunky as a name, but I wanted to make sure to convey that it's SQLite doing a lot of the heavy lifting here!


Not-so-humble opinion: I'd encourage you to aim for ergonomics and let people read about the project if they want to know about the backend.


That's great to know; I'll take a look and add a note to the README.


I mainly use Fossil these days, with git as a secondary option. One of the main reasons I went with Fossil is that it's built on top of SQLite, so I can query against the database as well as extend it for customized project management.

Interesting stuff. When time permits, I'll have to check out how this implementation for git differs from fossil.


I wonder how hard it would be to load up these virtual tables (written in Go) against a Python sqlite3 connection.

I imagine this would involve loading Go code as a Python module, or maybe using FFI somehow?

Would be really fun to be able to call these virtual tables from Datasette.
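One hedged possibility: if the Go virtual tables could be compiled into a SQLite run-time loadable extension (a big if, and the './gitqlite' filename is hypothetical), plain SQL could load them, since SQLite exposes load_extension() as a SQL function:

    -- extension loading must be enabled on the connection first;
    -- in Python's sqlite3 that's conn.enable_load_extension(True)
    SELECT load_extension('./gitqlite');
    SELECT * FROM commits LIMIT 5;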


Check out Calcite. It's a library that transforms SQL into relational algebra and can be used with many backends, including git.


This post reminds me of a product idea I've been pondering for a while: A versioned database backed by git.

It would only be useful for data that makes sense to represent as individual files (perhaps JSON files with the file name as the id), with an API that allows the data to be read and written much like any other database. Perhaps similar to a NoSQL database, but because it's in git, everything is versioned and can be reverted.


There are a few of these around now (various developers of which are active on HN and who will no doubt pop up in a bit).

I prefer a bitemporal approach to data history. Easier for non-nerds to understand and much more powerful to boot.


There are two projects that take this direction: Splitgraph and Dolt.


Really was hoping this would be for finding projects on GitHub, not for searching within a repo. I often find it hard to do advanced searches using some special DSL GitHub has come up with (or really any company/website for that matter) and would love to use a SQL query to find specific git repos. I definitely feel the same way about YouTube channels and videos.


There are a couple of GitHub datasets in BigQuery; those might work for your use case.


Edit: I misunderstood and thought it was only for GitHub. Great project for querying git repos.


Git repositories outside GitHub exist.



