
Codeq - twism
http://blog.datomic.com/2012/10/codeq.html
======
gfunk911
The Smalltalk model (simplifying horribly as storing functions/code units
instead of files) has significant and obvious advantages, and it's been around
for 30 years.

The fact that the file model continues to dominate suggests to me that there
may be significant drawbacks as well.

Possibilities:

* Losing the ability to use file-based tools costs too much. The function model isn't bad in itself; existing tools just happen to be file-based.

* The model will eventually win, it's just taking a really long time for thoughts/tools to catch up.

* The model is better, but has not caught on for reasons unrelated to its utility. The model is Betamax.

* The file-based model is better

~~~
ken
Garbage collection took around 40 years to become mainstream. There are lots
of important concepts that the industry still ignores. I don't think that
'computer science being ignored by industry programmers for decades' is
evidence of anything except the industry's own technical apathy.

Personally, I think the image model is a huge win for a single programmer,
while the file model makes it easier to integrate changes from a big team.
(There's not really a "DVCS for images" yet.) When I look around, I see that
tools that have won tend to support big teams of less-efficient programmers.

So it's not really surprising to me that, by numbers, the file model
dominates. That's not an indictment of the image model.

What do you mean by "better"? There are more Corollas than BMW 5's on the
streets here, but does that mean they're better, or worse, or better for a
particular use case? I'll take the image model for building a fast prototype
any day, even if I have to hand it over to a big team writing C or Java for
the final product.

~~~
gfunk911
You state that the image model is good for a single programmer, but not large
teams.

How much of this do you feel is due to the lack of tools (VCS, IDEs, etc)
equipped to deal with the image model, and how much is inherent to the models?

~~~
ken
I was careful to say that the image model is great for a single programmer,
but not that it was bad for large teams -- simply that large teams are a
(current) strength of the file model. Large teams are built of individuals,
after all, and there's no reason they can't use images on their own. There's
just not much in the way of tooling (yet) to support integration at the image
level.

Is this limitation inherent? I think that's impossible to say. How many people
accurately predicted the importance of a DVCS, before ever seeing one built?
Or for that matter, a garbage collector? In cases where it's not obvious, the
tools drive our understanding.

I don't see anything inherent in image models that would prevent this, though
languages don't seem designed in a way that would make this particularly easy.
For example, right now we're collaboratively editing a (very simple) shared
image (HN).

------
kstenerud
I read it and re-read it, and I still don't understand what it does or what
advantages it has over... whatever it's supposed to have an advantage over.

~~~
Scriptor
One advantage is that you can view your repo's history as a list of new
functions being added/removed/changed instead of just diffs of what lines are
removed and added. You could add editor support and look at all previous
versions of a specific function you're working on.

More specifically, where git mostly looks at code as a collection of files and
differences in individual lines, codeq breaks it down into code as a
collection of semantic units (function definitions for clojure, possibly class
definitions, methods, etc. if other languages are supported). Your version
control system would actually understand if a particular method had been moved
from a subclass into a parent class, instead of just considering it as lines
being deleted here and added elsewhere.
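
To make the line-diff versus unit-diff distinction concrete, here is a minimal sketch. It is an illustration only, not how codeq is implemented: it uses Python and its standard `ast` module instead of Clojure, and the helper names `function_units` and `semantic_diff` are made up for this example.

```python
import ast
import hashlib


def function_units(source):
    """Map each top-level function name to a hash of its AST."""
    units = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line numbers by default and the parser drops
            # comments, so moving a function does not change its hash.
            units[node.name] = hashlib.sha1(ast.dump(node).encode()).hexdigest()
    return units


def semantic_diff(old, new):
    """Report functions added, removed, or changed between two versions."""
    before, after = function_units(old), function_units(new)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            name for name in before.keys() & after.keys()
            if before[name] != after[name]
        ),
    }


OLD = '''
def area(w, h):
    return w * h

def greet(name):
    return "hi " + name
'''

NEW = '''
def greet(name):
    # moved to the top, body unchanged
    return "hi " + name

def area(w, h):
    return w * h * 1.0

def perimeter(w, h):
    return 2 * (w + h)
'''

result = semantic_diff(OLD, NEW)
print(result)
```

A line-based diff of `OLD` and `NEW` would show `greet` as deleted in one place and added in another; the unit-based view reports only that `area` changed and `perimeter` is new.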

~~~
drumdance
Great explanation. This is one of my biggest frustrations with git.

------
jwr
I've waited a long time to see a tool that does not view my program code in
terms of lines of text. Being a Lisp (Common, Scheme, Clojure) programmer, I
always felt I'd much rather see a structural diff — what units of code
changed, not which lines changed.

I'm so glad this approach is finally coming, and in what style!

------
zacharypinter
Would love to see some examples of the query output here alongside the example
queries.

~~~
richhickey
Yeah, Stu told me I should do that :) In all cases the output from Datomic
Datalog is just a data structure. In the case of the query in the blog post,
it is just a collection of 2-tuples of date + source code string. The source
strings are largish, and it would have bulked up things, so I punted.

The rules don't have output until incorporated in a query - you can think of
them as akin to SQL views. However, they don't need to be installed in the db,
you can pass them as an arg, as the query does.
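
A rough analogy for the shape of this, in plain Python and not the Datomic API: the facts, the `versions_of` rule, and the `query` function below are all hypothetical. The point is only that a rule produces nothing until a query uses it, that it can be handed to the query as an argument, and that the result is an ordinary collection of (date, source) 2-tuples.

```python
from datetime import date

# Hypothetical facts: (function name, commit date, source text).
FACTS = [
    ("foo/bar", date(2012, 9, 20), "(defn bar [x] (inc x))"),
    ("foo/baz", date(2012, 9, 5), "(defn baz [] nil)"),
    ("foo/bar", date(2012, 9, 1), "(defn bar [x] x)"),
]


def versions_of(name):
    """A 'rule': a reusable, named condition with no output of its own."""
    return lambda fn: fn == name


def query(facts, rule):
    """Apply a rule passed in as an argument; return (date, source) 2-tuples."""
    return sorted((when, src) for (fn, when, src) in facts if rule(fn))


rows = query(FACTS, versions_of("foo/bar"))
for when, src in rows:
    print(when, src)
```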

------
snprbob86
I noticed that the analyzer's schema includes an analyzer revision, presumably
as a way to allow newer analyzers to be re-run against older versions of the
code.

This raises a question for Rich regarding Datomic and the notion of "derived
facts" a la "Out Of The Tar Pit":

Datalog Rules can be used to query by some trivial notion of derived facts at
any point in time, but most derived facts are expensive to compute and thus
should be cached and introduced by a new transaction. In the case of Codeq,
this includes the full output of the analyzer. It seems like a natural
extension of Datomic to support lazy calculation and caching of derived facts.
I could even imagine some cluster scheduling of that work, in a sense
producing a map-reduced immutable materialized view of sorts.

I realize I said a question was coming, but I'm having a hard time formulating
one... which probably means that I don't understand the problem well enough.
So, can you talk a little bit about how you envision Datomic evolving with
respect to derived facts?

~~~
richhickey
We don't have any support for materialized views at present, but they are on
the list of enhancements to consider.

~~~
rjn945
Obligatory Wikipedia link for those of us (like me) who don't know what
materialized views are: <http://en.wikipedia.org/wiki/Materialized_view>

In short: "In a database management system following the relational model, a
view is a virtual table representing the result of a database query. Whenever
a query or an update addresses an ordinary view's virtual table, the DBMS
converts these into queries or updates against the underlying base tables. A
materialized view takes a different approach in which the query result is
cached as a concrete table that may be updated from the original base tables
from time to time. This enables much more efficient access, at the cost of
some data being potentially out-of-date. It is most useful in data warehousing
scenarios, where frequent queries of the actual base tables can be extremely
expensive."
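
A small sketch of the difference, using Python's built-in `sqlite3`. SQLite has no native materialized views, so the cached copy here is an ordinary table filled from the view, and the table names and the `refresh` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('west', 20), ('east', 5);

    -- Ordinary view: the query runs against the base table on every read.
    CREATE VIEW totals_view AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

    -- Stand-in for a materialized view: the result is copied into a table.
    CREATE TABLE totals_cached AS SELECT * FROM totals_view;
""")


def totals(relation):
    """Read region totals from the named view or table."""
    return dict(conn.execute(f"SELECT region, total FROM {relation}"))


def refresh():
    """Rebuild the cached copy from the base table."""
    conn.executescript("""
        DELETE FROM totals_cached;
        INSERT INTO totals_cached SELECT * FROM totals_view;
    """)


# A new base-table row shows up in the view at once, but not in the cache.
conn.execute("INSERT INTO sales VALUES ('west', 100)")
live = totals("totals_view")
stale = totals("totals_cached")

refresh()
refreshed = totals("totals_cached")
print(live, stale, refreshed)
```

Reads from the cached table skip the aggregation entirely, which is the efficiency gain; the price is that it stays out of date until something calls `refresh`.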

------
gavinpc
To those who don't see the point of this, I would place it in the context of a
larger move towards code "babel" (for lack of a better term), i.e., a unified
interface for common abstractions. Imagine that (as Steve Yegge suggests) the
tooling available for statically-typed languages will eventually come to
dynamic languages as well (think CEDET). It reminds me of the fact that most
web server platforms still focus on emitting text instead of manipulating a
DOM. That will change.

What I _don't_ understand is why I need cookies enabled in Firefox to view the
post. This is obviously a Blogger issue, as I've seen those little gears many
times before. I can't imagine the great Rich Hickey intends it.

------
pshc
Nice work. Props on normalizing human names and use/mention distinction.

Are you going to stick to analysis, or support code transformations? If you're
going to transform code, how will you avoid IDE refactor tool hell?

I was struck with the same idea (turning ASTs into git DAGs; normalizing code)
back when I was first learning git, but the idea took me down a different path
- writing a structured (no-plaintext) programming environment. I'll get to the
version control portion soon enough, and I hope there'll be some lessons I can
take away from Codeq!

------
currywurst
I'm really enthusiastic about the newfound exposure that Datalog is getting!
Can anyone inform us of other real world systems using it? What kind of
limitations do you run into?

.QL (<http://en.wikipedia.org/wiki/.QL>) is a similar project for querying
codebases, developed by SemmleCode, that gives an "OO+SQL flavor" to the
queries.

~~~
trurl
LogicBlox (<http://logicblox.com/>) is using Datalog for real-world enterprise
software development. At present, the implementation is far more advanced than
any other. See
<http://www.logicblox.com/technical-reports/LB1201_LeapfrogTriejoin.pdf> for
example.

~~~
cliftonk
the link to your paper is broken

------
ivanb
This looks awfully similar to an idea I've had since 2009. I never got close to
implementing it, but I also intended to target Clojure and then JavaScript.
However, I thought about it more in the sense of a nicely structured global
open source library. The smallest unit of such a library would be a Lisp form.
There would be dependency management, version control, documentation and tests
for each form. It would then be possible to query the library with plaintext
queries like "SHA algorithm" or "vector 3d". The user would then be presented
with a list of the forms. By looking at the docstrings and tests of the forms,
the user would be able to choose the most fitting one. Checking out a form would
automatically fetch all the forms it requires.
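
The "fetch everything it requires" step is a transitive closure over the dependency graph. A minimal sketch, assuming a hypothetical index that maps each form to the forms it requires (the form names and the `checkout` function are invented for illustration):

```python
# Hypothetical library index: form name -> names of the forms it requires.
REQUIRES = {
    "sha256": ["rotate-right", "bytes->words"],
    "rotate-right": [],
    "bytes->words": ["chunk"],
    "chunk": [],
    "vector3d-add": [],
}


def checkout(form, requires):
    """Return the form plus all transitive requirements, dependencies first."""
    ordered, seen = [], set()

    def visit(name):
        if name in seen:  # already handled; also guards against cycles
            return
        seen.add(name)
        for dep in requires[name]:
            visit(dep)
        ordered.append(name)  # appended only after its dependencies

    visit(form)
    return ordered


fetched = checkout("sha256", REQUIRES)
print(fetched)
```

Unrelated forms such as `vector3d-add` are never pulled in, which is the appeal of form-level rather than package-level dependencies.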

It is nice to see a similar idea actually implemented.

------
ripperdoc
I'm just wondering who thought it was a good idea to put some share links
hovering ABOVE the scroll bar. <http://imgur.com/xF5UP>

I don't know how many times I have lost the scroll handle behind it when this
widget is used.

------
indeyets
Awesome. Can't wait till it supports more languages

------
programnature
Who said anything about the image model? This is about doing analysis of the
git repo. Not replacing the git repo.

------
dmorgan
This tool sounds like a nice addition to "Light Table".

~~~
ibdknox
LT already has a basic form of this, it just interacted with the filesystem
directly instead of going through git. I'll play with this some to see what I
can make out of it :)

------
dschiptsov
How about NO?)

~~~
dmorgan
How about contributing something to the discussion?

~~~
dschiptsov
Isn't it obvious?

1\. It is an IDE's job, to keep track of changes of individual expressions in
the code of a project.

2\. This information should be stored separately from the source code files,
as a meta-data to the project, not the individual files.

3\. I don't need a solution for a problem I do not have.

4\. The query language is ugly.

5\. I do not want to use any "free" commercial service for a solution which
can be implemented as an emacs-lisp package.

6\. I see nothing in this blog post of any interest.

7\. I have no over-excitement _just because_ something comes from Rich Hickey.

~~~
Scriptor
> 1\. It is an IDE's job, to keep track of changes of individual expressions
> in the code of a project.

Is the IDE supposed to have its own revision history stored somewhere? I
understand how it might be an IDE's job to recognize individual functions but
there's no way it is supposed to keep track of changes. That is entirely the
source control's job.

> 2\. This information should be stored separately from the source code files,
> as a meta-data to the project, not the individual files.

What information? This doesn't change your code in any way, it analyzes it and
stores the data resulting from that in the datomic db.

3, 4, 6 are entirely subjective, it's rather clear plenty of others find this
interesting and useful. I do agree that the query language isn't pretty.

5\. Maybe someone will dislike Datomic enough to implement this as an emacs
package. The source for codeq itself is open source, datomic is only a storage
backend.

7\. Let's be straight, judging from the points you made you certainly didn't
go into this with no bias. Some of us don't get _angry_ just because it comes
from Hickey.

~~~
parishda
The query language isn't pretty (that's what I thought at first, too!), but
it's derived from Datalog/Prolog, AND it is simple (not easy, at first, since
unusual), but I'd rather it NOT look too much like SQL. Just like I don't
confuse Ruby programming with Clojure.

