Once Git's history is in an easy-to-query state, you can do some interesting things. For example, if you aggregate all the contributors in a Git repo, you can produce something like this:
which makes it very easy to identify project investment/commitment. In the above example, you can see that Microsoft is heavily invested in vscode, as many of the developers contributing to it have been doing so for more than 3 years. And if you aggregate contributions by file type, you can see how people are contributing as well. In the case of vscode, the contributions are mainly TypeScript.
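To give a rough sense of the shape of these queries (the table and column names below are just illustrative, not my actual schema), the tenure and file-type aggregations look something like this:

  -- contributor tenure: long-tenured authors suggest sustained investment
  SELECT author_email,
         MIN(author_when) AS first_commit,
         MAX(author_when) AS last_commit,
         MAX(author_when) - MIN(author_when) AS tenure
  FROM commits
  GROUP BY author_email
  ORDER BY tenure DESC;

  -- contributions by file type (e.g. TypeScript files in vscode),
  -- assuming one commits row per file changed in a commit
  SELECT regexp_replace(file_path, '^.*\.', '') AS extension,
         COUNT(*) AS changes
  FROM commits
  GROUP BY extension
  ORDER BY changes DESC;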
Here is another contributor example, which shows GitLab contributions:
What the above analytics show is that GitLab has a lot of contributors, and many of them are new contributors (6 months or less), which makes sense since they were hiring aggressively not long ago. Not sure if this is still the case with Covid-19, but it can easily be confirmed 6 months from now with the same chart.
Now for something more interesting, in my opinion: code review analytics.
It has taken a lot of research and development to get to this point, but once you can easily query Git, you can surface very interesting things by cross-referencing it with external systems, like GitHub's pull request system.
In the pull request screenshot, I created a window that only considers open pull requests updated within the last 30 days. With this type of window, I can see what has changed across dozens, if not hundreds or thousands, of pull requests. For example, I can easily identify file collisions between pull requests, when their last commit was, and so forth.
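As a rough sketch of the collision query (table and column names are made up here for illustration), it's along the lines of:

  -- files touched by more than one open PR updated in the last 30 days
  SELECT f.file_path,
         COUNT(DISTINCT f.pr_number) AS open_prs
  FROM pull_request_files f
  JOIN pull_requests p ON p.pr_number = f.pr_number
  WHERE p.state = 'open'
    AND p.updated_at > now() - interval '30 days'
  GROUP BY f.file_path
  HAVING COUNT(DISTINCT f.pr_number) > 1
  ORDER BY open_prs DESC;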
I'm still working on refining my code review analytics, but the goal is to get them to an advanced state, where you can see exactly what is happening across pull requests and derive insights from them.
So those are just some of the use cases I've developed that leverage being able to query Git's history with SQL.
The hardest parts have been (1) dealing with actual lines, which I gave up on, and (2) very busy robot repos with hundreds of thousands of commits.
My goal is to release the data as a single integrated set, but there's a ways to go. For one thing I need to find everyone in it to ask if they're OK with me doing so.
My goal is to make the indexed data easily accessible, so that you can cross-reference Git's history with whatever external systems you may have. What I've created is really a search and analytics engine for Git, designed for querying via SQL or through a REST interface.
On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a 2700X CPU, the search engine can easily index hundreds of millions of changes.
The search engine can run on as little as 500 MB of RAM (with 2 GB of swap space), but with that kind of hardware you can only index small repositories.
Are these repos public and on GitHub? If so, I can include them in my indexing in the future.
Do you store lines or full blobs at all? That's really where I came unglued on my first pass. I still want to reintroduce them somehow so that researchers can study changes more closely.
> On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a 2700X CPU, the search engine can easily index hundreds of millions of changes.
There's nothing quite like a good database on bulk hardware, is there?
> Are these repos public and on GitHub? If so, I can include them in my indexing in the future.
They are, but I am not sure about pointing them out just yet. What I'm doing looks to be a first for VMware, so we're moving cautiously.
No, since Git does a pretty good job of efficiently storing blobs. I would like to be able to execute
"select blob from blobs where sha=<some sha>"
but I can't justify the overhead of storing this in a database. This isn't to say I won't in the future, but if I do, I'll probably introduce a key/value DB for it instead of using Postgres. I do index blobs and diffs with Lucene, though, and I also store the diffs in Postgres.
Since Git does a very good job of storing blobs, I really can't justify using a DB just yet.
> What I'm doing looks to be a first for VMware, so we're moving cautiously
* the built-in web server was neat and useful
* when things got in a weird state (anyone learning Git knows what I mean), it was extremely difficult to find a solution. Doing a web search was a waste of time; no one is sharing their useful Fossil SCM knowledge online.
* clear text passwords. To clarify this a bit, users are local to the clone, so it's not like they're being shared... but then again, clear text passwords. I just looked it up again; looks like it's SHA-1 now... bcrypt would be nice.
* Fossil SCM does not believe in altering history. You're not going to find any squash commits or branch rebases. There is a 'shun' command for removing sensitive information, but it's specifically designed to not work like Git history edits and has weird behavior regarding clones.
Overall, I significantly prefer Git. Being able to find solutions when something goes wrong is huge. Not having to train people on another tool when they join the team is also good: you can list Git as a skill requirement without limiting your options, and you can't do that with Fossil SCM. I learned Git first, so very likely I'm biased toward the Git way vs. the Fossil way. It was an interesting experience, though; a version control system with a built-in web server is certainly unique!
Sure, but both are really cheap.
Since I don't know anything about how Fossil stores commit metadata, this might be a naive question, but why is that?
I would guess that being able to efficiently query all the children of a particular commit should be just an index away.
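Naively, I'm picturing something like this (made-up table and column names):

  CREATE INDEX idx_commit_parents_parent
    ON commit_parents (parent_sha);

  SELECT child_sha
  FROM commit_parents
  WHERE parent_sha = '<some sha>';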
Yes, but where will you find the space, and when will you compute it?
Really, atime is the devil. The funny thing is that all modern higher-level systems have some form of data access logging, which is what you want anyway. So why not form some easily compressible event log?
git qlite "SELECT * FROM commits"
git sql "SELECT * ..."
gitqlite is a little clunky for a name, but I wanted to make sure to convey that it's sqlite doing a lot of the heavy lifting here!
Interesting stuff. When time permits, I'll have to check out how this implementation for Git differs from Fossil.
I imagine this would involve loading Go code as a Python module, or maybe using FFI somehow?
Would be really fun to be able to call these virtual tables from Datasette.
It would only be useful for data that made sense to be represented in individual files (perhaps JSON files with the file name as the ID), with an API that allowed the data to be accessed and written in a similar way to other databases. Perhaps similar to a NoSQL database, but because it's in Git, everything is versioned and can be reverted.
I prefer a bitemporal approach to data history. Easier for non-nerds to understand and much more powerful to boot.