

Mercurial Ate Our Breakfast [with Revsets], But We Don't Mind - stevejohnson
http://timunionsteve.posterous.com/mercurial-ate-our-breakfast-but-we-dont-mind

======
inerte
I don't know if you talked with the Mercurial guys before starting SourceQL,
but it seems your work overlapped someone else's effort. Everyone could have
finished the feature earlier and with better quality.

Next time, my suggestion is to announce what you are doing to the project
maintainers. Be part of the community instead of coming up with something
"done" that no one knows about.

(Sorry if I got the wrong impression about your involvement, but from your
story I could not be sure!)

~~~
stevejohnson
You got the right impression about our involvement, but the wrong idea about
the outcome.

When a professor asks you to do a semester project, you look for an area where
work within your abilities can be done. This is what we came up with. We spent
a few hours searching for similar projects and found nothing, and it would
have been impractical to get in touch with every maintainer of every SCM to
see if they were working on a feature like revsets.

We used Mercurial for two reasons: it is distributed, and it has a great hook
API. There was no traffic on the mailing list that would have hinted to us
that revsets were on the way. As far as I can tell, Matt Mackall himself was
just as uncommunicative as we were. The only place I can find revsets
mentioned prior to their introduction in Mercurial 1.6 was a mailing list post
where he announced them, fully baked and implemented. Can we be expected to be
more involved with the community than the primary maintainer of the project?

Anyway, we never felt comfortable sharing work that had a nonzero
probability of becoming vaporware. We only recently realized that the ideas
themselves might be worth something to others, and this is our way of
introducing ourselves to the community. I haven't rejoined the Mercurial
mailing list yet simply because I have been busy with work and it keeps
slipping my mind, but I do want to get plugged in with them.

~~~
timtadh
(the other author here)

In addition to what Steve said, this particular "feature" is really just the
very beginning for us. Our ideas about what queryable version control means go
way beyond the ability to select arbitrary sets of commits based on commit
metadata. We want to be able to do higher-order queries. For instance:

What is the average lifespan of a line of code written by Johnny?

In other words, how long do lines of code written by a particular developer
last on average? You could restrict this query to a particular subgraph
forest of the repository (for instance a branch, or the commits from this
quarter). These types of queries are aggregation queries, but there are other
classes of queries we would like to support as well. See some of the papers we
wrote on Scribd for more details. We will be writing another post soon
covering our vision for version control queries.
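
As a rough illustration of the aggregation query described above (the average
lifespan of one developer's lines), here is a minimal Python sketch. It is not
SourceQL syntax. The LineRecord type, the average_lifespan function, and the
sample data are all hypothetical, and it assumes the per-line "added" and
"removed" times have already been extracted from the repository somehow.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LineRecord:
    """Hypothetical record: one line of code and when it lived."""
    author: str
    added_at: int                # time (e.g. in days) the line was introduced
    removed_at: Optional[int]    # time it was deleted or rewritten; None if still alive


def average_lifespan(records: Iterable[LineRecord], author: str, now: int,
                     since: Optional[int] = None) -> Optional[float]:
    """Average lifespan of lines by `author`.

    Lines that are still alive are counted as living until `now`.
    `since` optionally restricts the query to lines added at or after
    that time (a stand-in for "commits this quarter").
    Returns None when no lines match.
    """
    spans = [
        (r.removed_at if r.removed_at is not None else now) - r.added_at
        for r in records
        if r.author == author and (since is None or r.added_at >= since)
    ]
    return sum(spans) / len(spans) if spans else None


# Made-up sample data, for illustration only.
records = [
    LineRecord("johnny", 0, 10),
    LineRecord("johnny", 5, None),
    LineRecord("johnny", 20, 26),
    LineRecord("alice", 0, None),
]

print(average_lifespan(records, "johnny", now=30))
print(average_lifespan(records, "johnny", now=30, since=5))
```

The hard part in practice is producing those per-line records, which means
tracking each line across commits; the aggregation itself is the easy step.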

------
stevejohnson
Author here, taking requests for clarification, suggestions for what to talk
about next, etc.

~~~
masklinn
Wouldn't the other obvious approach (to storing data in an easily queryable
form) be to store it in an actual database, as Fossil does, for instance?
(Veracity apparently does too, but I'm still not sure whether it stores only
metadata in a DB or everything.)

~~~
stevejohnson
Since it isn't all that slow to just ask Mercurial for batches of data, there
isn't really a need to put it into a database, especially since there is an
extra cost of updating the database for every change to the repository. The
real problem is string searching and concise data gathering syntax, which is
what we are focusing on.

In the case of Fossil, it may be possible to access the backend directly and
do any selects or joins at that level. It would be nice to access the data
for any system with SQL, for instance, but those languages are not suited for
working with graph-oriented data. So the implementation choice comes down to
either writing a language that transforms user queries into different forms
for each system, or writing a language that accesses some underlying database
that the repository data has been translated into.

We had been taking the latter approach, but the entrance of revsets pushes us
neatly into the former approach. We still have our code to do all the data
scraping, so we could use it for example with svn or git (which also has its
own subgraph selection mechanism).
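
To make the first of those two approaches concrete, here is a minimal Python
sketch of one abstract query being rendered into a different form for each
system. It is an illustration only, not how SourceQL works. CommitQuery,
to_hg_args, and to_git_args are hypothetical names, the quoting is naive, and
the two back ends do not match author names in exactly the same way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CommitQuery:
    """Hypothetical system-neutral query: filter commits by author and branch."""
    author: Optional[str] = None
    branch: Optional[str] = None


def to_hg_args(q: CommitQuery) -> List[str]:
    """Render the query as an `hg log -r <revset>` argument list."""
    terms = []
    if q.author:
        terms.append('author("%s")' % q.author)
    if q.branch:
        terms.append('branch("%s")' % q.branch)
    # With no filters, fall back to the revset that selects every changeset.
    revset = " and ".join(terms) if terms else "all()"
    return ["hg", "log", "-r", revset]


def to_git_args(q: CommitQuery) -> List[str]:
    """Render the same query as a `git log` argument list."""
    args = ["git", "log"]
    if q.author:
        args.append("--author=" + q.author)
    if q.branch:
        # Naming a branch makes git log walk history reachable from it.
        args.append(q.branch)
    return args


query = CommitQuery(author="steve", branch="stable")
print(to_hg_args(query))
print(to_git_args(query))
```

The argument lists are only built here, not executed. They could be passed to
something like subprocess.run inside a real repository.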

There also remains the fact that since we are students, we sometimes like to
implement things just for the hell of it. Like databases. We've already got
B+ trees and a parser library, which I'm sure have been written thousands of
times, but it's certainly a useful exercise to implement them. Besides, Go is
light on libraries right now and it's nice to help them out.

~~~
masklinn
> Since it isn't all that slow to just ask Mercurial for batches of data,
> there isn't really a need to put it into a database, especially since there
> is an extra cost of updating the database for every change to the
> repository.

Oh absolutely, my point was that it's simpler if the repo's data is in a DB in
the first place.

> those languages are not suited for working with graph-oriented data

Mmmm that's true.

------
mml
solr?

~~~
stevejohnson
The only similarity between Solr and SourceQL is that they scrape data from
version control systems. And they index strings. But the goals and access
methods are totally different.

