

Minimize Code, Maximize Data - edw519
http://database-programmer.blogspot.com/2008/05/minimize-code-maximize-data.html

======
Hexstream
I very much agree with this article.

Reminds me of the evolution of my lisp HTML compiler. At first I simply made a
macro that transforms:

    (:tag :attribute "value" (:p "Paragraph"))

into appropriate code to output the HTML at runtime. The problem was that I
couldn't inspect or transform my HTML; this macro-based scheme was way too
brittle.
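
For illustration, the expansion might have looked something like this
straight-line output code (a sketch; the stream variable *html-out* is
invented here):

    (progn
      (write-string "<tag attribute=\"value\">" *html-out*)
      (write-string "<p>Paragraph</p>" *html-out*)
      (write-string "</tag>" *html-out*))

Every expansion is just opaque WRITE-STRING calls, so there is no structure
left to inspect or transform.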

So I wrote another macro with the same syntax, but one that instead generates
code to build a semantic representation of the HTML at runtime:

    (make-instance 'html-node
                   :type "tag"
                   :attributes (list (cons "attribute" "value"))
                   :children (list (make-instance 'html-node
                                                  :type "p"
                                                  :children (list "Paragraph"))))

I then pass this representation to a compiler that first optimizes it
("flattening" the structure and appending contiguous strings together) and
then compiles it into an efficient tree of closures. I still have a version of
the macro that generates efficient code to output the HTML directly, which I
use in my dynamic HTML generation functions, but only when necessary.
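
Roughly, that optimize-then-compile step can be sketched like this (the
representation and names are simplified illustrations, not the actual
implementation):

    (defun flatten-and-merge (nodes)
      "Append contiguous strings together; leave dynamic thunks alone."
      (let (result)
        (dolist (node nodes (nreverse result))
          (if (and (stringp node) (stringp (first result)))
              (setf (first result)
                    (concatenate 'string (first result) node))
              (push node result)))))

    (defun compile-template (nodes)
      "Compile NODES (strings and thunks returning strings) into one closure."
      (let ((parts (mapcar (lambda (node)
                             (if (stringp node)
                                 (lambda (stream) (write-string node stream))
                                 (lambda (stream)
                                   (write-string (funcall node) stream))))
                           (flatten-and-merge nodes))))
        (lambda (stream)
          (dolist (part parts) (funcall part stream)))))

    ;; Example: (funcall (compile-template
    ;;                     (list "<p>" (lambda () "dynamic") "</p>"))
    ;;                   *standard-output*)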

The advantage is that now I can do all sorts of static analysis on my HTML (I
use a similar scheme for CSS). Believe it or not, in a "dynamic" site there's
TONS of static (as in, known before someone even tries to access the page)
information. I even have a scheme to "inject" stuff into pages; for example,
the (page-sensitive) navigation is automatically "baked" into the appropriate
pages.

Eventually I want to make a "site debugger" with all the semantic data I have
about my site. I'll have powerful querying capabilities to answer questions
like: "What pages in my site have an A tag that links to page X and is nested
in a container affected by CSS rule Y?". Maybe I can even make some kind of
advanced editor. Owning your data opens lots of interesting possibilities
indeed!
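
A minimal sketch of the link half of such a query, assuming an HTML-NODE class
shaped like the earlier example (the CSS half would also need the rule
representation):

    (defclass html-node ()
      ((type       :initarg :type       :reader node-type)
       (attributes :initarg :attributes :reader node-attributes :initform nil)
       (children   :initarg :children   :reader node-children   :initform nil)))

    (defun find-links-to (root target)
      "Collect every descendant \"a\" node of ROOT whose href equals TARGET."
      (let (hits)
        (labels ((walk (node)
                   (when (typep node 'html-node)
                     (when (and (string= (node-type node) "a")
                                (equal (cdr (assoc "href" (node-attributes node)
                                                   :test #'string=))
                                       target))
                       (push node hits))
                     (mapc #'walk (node-children node)))))
          (walk root))
        (nreverse hits)))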

So I get all the semantic data _plus_ the great speed: the best of both
worlds. I think this really illustrates the importance of compilers, and of
languages like Lisp that facilitate their implementation. I have never written
a low-level compiler, and it will be some time before I'm knowledgeable enough
to write one, but Lisp let me build these high-level HTML and CSS compilers
comparatively easily.

Another important thing is separating policy from mechanism. One obvious
problem with the programmer in the article who represented rules directly in
code is that he hard-coded policy into his program; policy is volatile, so it
really belongs in a config file.
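
A hedged sketch of policy-as-data (the file name and rule shape below are
invented for illustration):

    ;; rules.config might contain: ((0 59 "low") (60 89 "medium") (90 100 "high"))
    (defparameter *rules*
      (with-open-file (in "rules.config")
        (read in)))

    (defun rate (score)
      "Return the rating whose (min max rating) rule covers SCORE."
      (third (find-if (lambda (rule) (<= (first rule) score (second rule)))
                      *rules*)))

Changing the policy then means editing rules.config, not recompiling the
program.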

I think this somewhat resonates with the quote: "_Programs must be written for
people to read, and only incidentally for machines to execute._" Except
programs themselves often want to read programs for the purpose of analysis or
manipulation, hence the advantage of representing as much logic as you can in
a "dumb" language or data format.

------
olavk
It's a good point, but of course someone is going to say that code is just a
kind of data. So to be more specific, I like a principle like TBL's "Principle
of Least Power" <http://www.w3.org/2001/tag/doc/leastPower-2006-01-23.html>:
you should express information in the _least_ powerful format possible. If
possible, it is better to express logic in a constrained (e.g. declarative)
DSL than in a general-purpose language, and better still if it can be factored
out into configuration files or a database.

~~~
kendowns
Yes, code-is-data is one of those frustratingly true-but-out-of-context
observations. I created a complete framework that builds databases out of a
text file (YAML, as it happens), including security and automations
(<http://www.andromeda-project.org>), but was unaware of TBL's essay. I will
have to read it carefully and determine what citations may be in order.

------
daniel-cussen
Reminds me of Norvig's "more data beats better algorithms."

~~~
kendowns
A cursory Google search gave no obvious explanation of this; do you have a
specific link to serve as a starting point?

~~~
abstractbill
The grandparent was referring to this talk:

[http://www.justin.tv/hackertv/98128/Peter_Norvig_Director_of...](http://www.justin.tv/hackertv/98128/Peter_Norvig_Director_of_Research_Google)

~~~
bluishgreen
How about this? [http://anand.typepad.com/datawocky/2008/03/more-data-
usual.h...](http://anand.typepad.com/datawocky/2008/03/more-data-usual.html)

Norvig's point was similar, but Anand has more convincing examples.

------
dnaquin
Minimize what you don't understand, maximize what you do?

 _The code-centric solution, which I mentioned we are afraid to touch, is full
of conditionals and branches that make it dangerous to mess with for fear of
causing unintended side-effects._

Just a case of programming by coincidence.

------
scroyston
I'm all for data over code, but I'd use a Rete network so my program didn't
run in O(DISTRIBUTIONS * RULES) time. Yes, data design is important, but you
need to be algorithm-savvy to know the best way to design your data
structures.

~~~
kendowns
I think that in this case you get a non-trivial performance gain with some
basic implementation improvements. You can replace the row-by-row round trip
with a single SQL UPDATE with a LIMIT (or a LIMITed subquery if the server
does it that way).
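
For instance (illustrative table and columns; MySQL accepts LIMIT directly on
UPDATE, while other servers need the subquery form):

    UPDATE items
       SET rating = 'high'
     WHERE rating IS NULL
     ORDER BY score DESC
     LIMIT 100;

    -- For servers without UPDATE ... LIMIT, a LIMITed subquery:
    UPDATE items
       SET rating = 'high'
     WHERE id IN (SELECT id FROM (SELECT id FROM items
                                   WHERE rating IS NULL
                                   ORDER BY score DESC
                                   LIMIT 100) AS candidates);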

Also, the performance of the algorithm in the essay would be linear in the
number of rows requiring update; can you elaborate on whether the Rete
algorithm can do any better in this case?

And finally, thanks for the note on Rete; I will have to investigate it.

~~~
scroyston
Comment space is a bit limited for an adequate explanation. For rules that
have straightforward (but possibly compound) predicates, Rete will give you
O(1) lookup. From Wikipedia: "In most cases, the speed increase over naïve
implementations is several orders of magnitude (because Rete performance is
theoretically independent of the number of rules in the system)."

<http://en.wikipedia.org/wiki/Rete_algorithm>

My favorite reference on the subject is at the bottom of the Wikipedia page:
"Production Matching for Large Learning Systems" by R. Doorenbos.

Sadly (and strangely), many of the "Rete engines" today do it wrong and will
not give you efficient lookups.
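
Very roughly, the core trick is to index rules by the attribute/value pairs
their conditions test, so matching a fact becomes a hash lookup instead of a
scan over all rules. This is only a sketch of the alpha-network idea, not a
real Rete implementation:

    (defparameter *rules-by-test* (make-hash-table :test #'equal))

    (defun add-rule (attribute value action)
      "Index ACTION under the attribute/value pair its condition tests."
      (push action (gethash (cons attribute value) *rules-by-test*)))

    (defun match-fact (attribute value)
      "Fire every rule matching this fact: O(1) in the number of rules."
      (dolist (action (gethash (cons attribute value) *rules-by-test*))
        (funcall action)))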

------
wensing
"Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming." \--Rob Pike, 'Notes
on C Programming'

