
Gitqlite: Query Git Repositories with SQL - sebazzz
https://github.com/augmentable-dev/gitqlite
======
patrickdevivo
Creator here! Thanks for submitting - I hope people find this interesting. I'm
excited to explore more use cases beyond listing and reformatting git log
output! There's a similar project called gitbase, but I had difficulty just
spinning that up for ad-hoc CLI querying, so I created this project in the
hope of a better UX for that type of use case. It uses SQLite virtual tables,
which I had been meaning to play around with. This seemed like a cool use
of them!
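
For a flavor of the UX I was going for, queries look roughly like this (the
commits table is real; treat the column names as a sketch and check the
README for the actual schema):

    
    
        gitqlite "SELECT count(*) FROM commits"
        gitqlite "SELECT author_name, count(*) FROM commits GROUP BY author_name"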

~~~
sdesol
Cool. It's taken me years to find an efficient (and most importantly, easily
debuggable) way to transfer Git's history to SQL. Since you are looking for
use cases, I'll share with you some of the ways that I'm leveraging SQL to
help drive code analytics for Git.

Once Git's history is in an easy-to-query state, you can do some interesting
things. For example, if you aggregate all the contributors in a Git repo, you
can produce something like this:

[https://imgur.com/tT4DSJc](https://imgur.com/tT4DSJc)

which makes it very easy to identify project investment/commitment. In the
above example, you can see that Microsoft is heavily invested in vscode, as
many of the developers contributing to it have been doing so for more than 3
years. And if you aggregate contributions by file type, you can see how
people are contributing as well. In the case of vscode, the contributions are
mainly TypeScript.

Here is another contributor example, which shows GitLab contributions:

[https://imgur.com/3UU6Kda](https://imgur.com/3UU6Kda)

What the above analytics show is that GitLab has a lot of contributors, and a
lot of them are new (6 months or less), which makes sense since they were
hiring aggressively not too long ago. Not sure if this is still the case with
Covid-19, but this can easily be confirmed 6 months from now with the same
chart.
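
The SQL behind that kind of chart is conceptually simple. A minimal sketch
(Postgres flavor), assuming a commits table with author_email and author_when
columns - my real schema is more involved than this:

    
    
        SELECT author_email,
               COUNT(*)                            AS commits,
               MIN(author_when)                    AS first_commit,
               MAX(author_when) - MIN(author_when) AS tenure
        FROM commits
        GROUP BY author_email
        ORDER BY tenure DESC;

Bucketing tenure into ranges (6 months or less, more than 3 years, and so on)
is then just a CASE expression over that result.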

Now for something more interesting, in my opinion: code review analytics.

[https://imgur.com/xWH6QFl](https://imgur.com/xWH6QFl)

It has taken a lot of research and development to get to this point, but once
you can easily query Git, you can surface very interesting things by
cross-referencing it with external systems, like GitHub's pull request system.

In the pull requests screenshot, I created a window that only considers open
pull requests that were updated within the last 30 days. With this type of
window, I can see what has changed across dozens, if not hundreds or
thousands, of pull requests. For example, I can easily identify file
collisions between pull requests, see when their last commit was, and so
forth.
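
As a rough sketch of that window query, assuming hypothetical pull_requests
and pull_request_files tables (my real schema differs), finding file
collisions between open pull requests looks something like this:

    
    
        SELECT f.path,
               COUNT(DISTINCT f.pull_request_id) AS open_prs_touching_file
        FROM pull_request_files f
        JOIN pull_requests pr ON pr.id = f.pull_request_id
        WHERE pr.state = 'open'
          AND pr.updated_at > NOW() - INTERVAL '30 days'
        GROUP BY f.path
        HAVING COUNT(DISTINCT f.pull_request_id) > 1
        ORDER BY open_prs_touching_file DESC;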

I'm still working on refining my code review analytics, but the goal is to
get it to an advanced state where you can see exactly what is happening
between pull requests and derive insights from them.

So those are just some of the use cases I've developed that leverage being
able to query Git's history with SQL.

~~~
jacques_chester
Do you have a schema somewhere? I've been working on a dataset that includes
git repositories and I've been muddling through it slowly. It's built around
a dataset of ~126k builds I collected some time back, plus ~5k Pivotal
Tracker stories for the same time period, covering about 2.5 years of work by
3 teams.

The hardest parts have been (1) dealing with actual lines, which I gave up on
and (2) very busy robot repos with hundreds of thousands of commits.

My goal is to release the data as a single integrated set, but there's a ways
to go. For one thing I need to find everyone in it to ask if they're OK with
me doing so.

~~~
sdesol
Sorry, I haven't published it yet. It honestly took a long time to develop
the schemas, but I might publish them in the future. The issue right now is
that I'm a single founder, so I really have to be smart with my time, and
publishing things will just add to my workload.

My goal is to make the indexed data easily accessible, so that you can
cross-reference Git's history with whatever external systems you may have.
What I've created is really a search and analytics engine for Git, designed
for querying via SQL or through a REST interface.

On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a
2700X CPU, the search engine can easily index hundreds of millions of changes.

[https://imgur.com/WoS4Nr6](https://imgur.com/WoS4Nr6)

The search engine can run on as little as 500 MB of RAM though (with 2 GB of
swap space), but with that kind of hardware, you can only index small
repositories.

Are these repos public and on GitHub? If so, I can include them in my indexing
in the future.

~~~
jacques_chester
> _The issue right now is that I'm a single founder, so I really have to be
> smart with my time, and publishing things will just add to my workload._

Understood.

Do you store lines or full blobs at all? That's really where I came unglued on
my first pass. I still want to reintroduce them somehow so that researchers
can study changes more closely.

> _On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage,
> and a 2700X CPU, the search engine can easily index hundreds of millions of
> changes._

There's nothing quite like a good database on bulk hardware, is there?

> _Are these repos public and on GitHub? If so, I can include them in my
> indexing in the future._

They are, but I am not sure about pointing them out just yet. What I'm doing
looks to be a first for VMware, so we're moving cautiously.

~~~
sdesol
> Do you store lines or full blobs at all? That's really where I came unglued
> on my first pass. I still want to reintroduce them somehow so that
> researchers can study changes more closely

No, since Git does a pretty good job of efficiently storing blobs. I would
like to be able to execute

"select blob from blobs where sha=<some sha>"

but I can't justify the overhead of storing this in a database. This isn't to
say I won't in the future, but if I do, I'll probably introduce a key/value
DB for it instead of using Postgres. I do index blobs and diffs with Lucene,
though, and I also store the diffs in Postgres.

Since Git does a very good job of storing blobs, I really can't justify using
a DB just yet.

> What I'm doing looks to be a first for VMware, so we're moving cautiously

Understood.

------
dan-robertson
See also: Fossil, a version control system backed by a SQLite database. The
author (of both Fossil and SQLite) points out that a real database gives you
more flexible, efficient queries over your data. In Git, by comparison, it is
much easier to find the parents of a commit than its children, for example.

~~~
microcolonel
> _In Git, by comparison, it is much easier to find the parents of a commit
> than its children, for example._

Sure, but both are really cheap.

~~~
cryptonector
Well, no, in huge repositories with lots of branches it's not. But this is
true for Fossil as well. IIRC, MSFT did a study with the Windows repository
showing that a relational approach to VCS bloated the repository too much,
and out of this grew the Bloom filter approach to speeding up git blame/log.

~~~
balfirevic
> But this is true for Fossil as well.

Since I don't know anything about how Fossil stores commit metadata, this
might be a naive question - but why is that?

I would guess that being able to efficiently query all the children of a
particular commit should be just an index away.
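
Concretely, a minimal sketch of what I mean (a hypothetical schema, not
necessarily how Fossil actually stores it):

    
    
        -- One row per parent -> child edge in the commit graph.
        CREATE TABLE commit_parents (
            child_sha  TEXT NOT NULL,
            parent_sha TEXT NOT NULL
        );
        
        -- With this index, "children of commit X" is a cheap lookup.
        CREATE INDEX idx_parent ON commit_parents (parent_sha);
        
        SELECT child_sha FROM commit_parents WHERE parent_sha = :sha;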

~~~
microcolonel
> _I would guess that being able to efficiently query all the children of a
> particular commit should be just an index away._

Yes, but where will you find the space, and when will you compute it?

~~~
balfirevic
I'm not sure I understand what you mean. I was talking about Fossil, which
uses SQLite under the hood; SQLite would then maintain the index whenever the
data is modified.

~~~
cryptonector
Scale. Microsoft couldn't make a SQL VCS work. Maybe it's because they tried
to use SQL Server instead of PostgreSQL, I dunno :) but IIRC the amount of
metadata involved essentially meant that they had no hope of doing git-like
cloning: it was too much to clone.

~~~
sitkack
You would need to implement a persistent data structure, but without
refcounts you would have to do a GC. I guess you could do a refcount tree and
only RC at the last common node.

Really, atime is the devil. The funny thing is that all modern higher-level
systems have some form of data-access logging, which is what you want anyway.
So why not form some easily compressible event log?

------
icholy
You can name the binary git-qlite if you want to invoke it as a subcommand of
git.

    
    
        git qlite "SELECT * FROM commits"

~~~
koolba
How can you pass up naming it “git-sql”?!

    
    
        git sql "SELECT * ..."

~~~
patrickdevivo
Haha, you're totally right! If it's just a matter of renaming the binary and
dropping it in a particular directory to get git to know about it, I can
leave this part up to the user. I'll definitely have to make a note in the
README.

gitqlite is a little clunky as a name, but I wanted to make sure to convey
that it's SQLite doing a lot of the heavy lifting here!
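
If anyone wants to try it before I document it, I believe this is all it
takes (git picks up any executable named git-<name> on your PATH as a
subcommand):

    
    
        cp gitqlite /usr/local/bin/git-qlite
        git qlite "SELECT * FROM commits"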

~~~
klyrs
Not-so-humble opinion: I'd encourage you to aim for ergonomics and let people
read about the project if they want to know about the backend.

------
xorander
I mainly use Fossil these days, with git as a secondary option. One of the
main reasons I went with Fossil is that it's built on top of SQLite, so I can
query against the database as well as extend it for customized project
management.

Interesting stuff. When time permits, I'll have to check out how this
implementation for git differs from Fossil.

------
simonw
I wonder how hard it would be to load up these virtual tables (written in Go)
against a Python sqlite3 connection.

I imagine this would involve loading Go code as a Python module, or maybe
using FFI somehow?

Would be really fun to be able to call these virtual tables from Datasette.

------
setheron
Check out Calcite. It's a library that transforms SQL into relational algebra
and can be used with many backends, including git.

------
Humphrey
This post reminds me of a product idea I've been pondering for a while: a
versioned database backed by git.

It would only be useful for data that makes sense to represent as individual
files (perhaps JSON files with the file name as the ID), with an API that
allows the data to be read and written much like other databases. Perhaps
similar to a NoSQL database, but because it's in git, everything is versioned
and can be reverted.
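
A crude sketch of that model with plain git, one JSON file per record (the
layout here is hypothetical):

    
    
        # Write: the file path is the key, the commit is the version.
        echo '{"id": 42, "name": "Ada"}' > users/42.json
        git add users/42.json && git commit -m "put users/42"
        
        # Read the current value, or any previous version.
        git show HEAD:users/42.json
        git show HEAD~1:users/42.json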

~~~
jacques_chester
There are a few of these around now (various developers of which are active on
HN and who will no doubt pop up in a bit).

I prefer a bitemporal approach to data history. Easier for non-nerds to
understand and much more powerful to boot.

------
mavsman
I was really hoping this would be for finding projects on GitHub, not for
searching within a repo. I often find it hard to do advanced searches using
some special DSL GitHub has come up with (or really any company/website, for
that matter) and would love to use a SQL query to find specific git repos. I
definitely feel the same way about YouTube channels and videos.

~~~
sitkack
There are a couple of GitHub datasets in BigQuery; those might work for your
use case.
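
For example, something like this against the public github_repos dataset
(assuming the snapshot's languages table still has its repo_name and language
columns):

    
    
        SELECT repo_name, l.bytes
        FROM `bigquery-public-data.github_repos.languages`,
             UNNEST(language) AS l
        WHERE l.name = 'Go'
        ORDER BY l.bytes DESC
        LIMIT 10;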

------
searchableguy
Edit: I misunderstood - I thought this was for GitHub. Great project for
querying git repos.

~~~
detaro
Git repositories outside GitHub exist.

