Do you have a schema somewhere? I've been working on a dataset that includes git repositories, and I've been muddling through it slowly. It's built around a dataset of ~126k builds I collected some time back, plus ~5k Pivotal Tracker stories for the same time period. It covers about 2.5 years of work by 3 teams.

The hardest parts have been (1) dealing with actual lines of code, which I gave up on, and (2) very busy robot-driven repos with hundreds of thousands of commits.

My goal is to release the data as a single integrated set, but there's a ways to go. For one thing, I need to track down everyone in it and ask if they're OK with me doing so.



Sorry, I haven't published it yet. It honestly took a long time to develop the schemas, but I might publish them in the future. The issue right now is that I'm a single founder, so I really have to be smart with my time, and publishing things would just add to my workload.

My goal is to make the indexed data easily accessible, so that you can cross-reference Git history with whatever external systems you may have. What I've built is really a search and analytics engine for Git, designed to be queried via SQL or through a REST interface.
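As a rough sketch, a cross-referencing query might look something like this (the table and column names here are placeholders for illustration, not the engine's actual schema):

  -- hypothetical: join indexed commits against an external builds table
  select b.build_id, b.status, c.sha, c.author, c.message
  from builds b
  join commits c on c.sha = b.commit_sha
  where b.status = 'failed'
  order by b.finished_at desc
  limit 20;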

On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a Ryzen 2700X CPU, the search engine can easily index hundreds of millions of changes.

https://imgur.com/WoS4Nr6

The search engine can run on as little as 500 MB of RAM (with 2 GB of swap space), though with that kind of hardware you can only index small repositories.

Are these repos public and on GitHub? If so, I can include them in my indexing in the future.


> The issue right now is that I'm a single founder, so I really have to be smart with my time, and publishing things would just add to my workload.

Understood.

Do you store lines or full blobs at all? That's really where I came unglued on my first pass. I still want to reintroduce them somehow so that researchers can study changes more closely.

> On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a Ryzen 2700X CPU, the search engine can easily index hundreds of millions of changes.

There's nothing quite like a good database on bulk hardware, is there?

> Are these repos public and on GitHub? If so, I can include them in my indexing in the future.

They are, but I am not sure about pointing them out just yet. What I'm doing looks to be a first for VMware, so we're moving cautiously.


> Do you store lines or full blobs at all? That's really where I came unglued on my first pass. I still want to reintroduce them somehow so that researchers can study changes more closely.

No, since Git already does a very good job of storing blobs efficiently. I would like to be able to execute

  select blob from blobs where sha = '<some sha>'

but I can't justify the overhead of duplicating blob storage in a database just yet. This isn't to say I won't in the future; if I do, I'll probably introduce a key/value DB for it instead of using Postgres. I do index blobs and diffs with Lucene, though, and I also store the diffs in Postgres.
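On the diff side, something along these lines should be possible against the Postgres store (again, the table and column names are illustrative guesses, not the actual schema):

  -- hypothetical: diffs stored in Postgres, keyed by commit sha
  select c.sha, c.committed_at, d.patch
  from commits c
  join diffs d on d.commit_sha = c.sha
  where d.patch like '%TODO%'
  order by c.committed_at desc
  limit 20;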

> What I'm doing looks to be a first for VMware, so we're moving cautiously

Understood.



