> I want a program that will take a set of unit tests, and create a program that takes a piece of code and evolves it until all those tests pass.
Look into Genetic Programming (as opposed to Genetic Algorithms, Evolutionary Programming, or Evolution Strategies, three other fields that arose independently).
In GP, the genotype being evolved is a program, usually represented either as a tree of instructions or as a string of assembler-like instructions. There are other schemes, but these are the two major ones.
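To make the tree representation concrete, here's a minimal sketch in Java (class names like Node, Add, Var and Const are purely illustrative, not from any particular GP system):

    // Minimal sketch of the tree representation: each node is either an
    // operator with children or a terminal (a variable or a constant).
    abstract class Node {
        abstract double eval(double x);
    }

    class Add extends Node {
        final Node left, right;
        Add(Node l, Node r) { left = l; right = r; }
        double eval(double x) { return left.eval(x) + right.eval(x); }
    }

    class Var extends Node {
        double eval(double x) { return x; }
    }

    class Const extends Node {
        final double value;
        Const(double v) { value = v; }
        double eval(double x) { return value; }
    }

    // The program "x + (x + 3.0)" as a tree; mutation and crossover work
    // by replacing or swapping subtrees:
    //   Node program = new Add(new Var(), new Add(new Var(), new Const(3.0)));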
However, there's a fly in the ointment. Averaged over every possible computable problem, no one algorithm is the "best"; this is the "no free lunch" theorem. The upshot is that by cleverly encoding domain knowledge into the genotypic space you can often produce systems that outperform human-designed ones, but you cannot start from a completely generic genotype and expect to arrive at the "best" algorithm; you will always need some degree of manual intervention. Don't ask me to explain it better, I can't, because I don't follow the maths.
GP has lots of interesting nooks and crannies; I refer you to A Field Guide to Genetic Programming for a fairly readable introduction: http://www.gp-field-guide.org.uk/
I did something like this for my Master's: I evolved Java objects using genetic programming based on a set of unit tests. I managed to evolve a simple bank account example and a reverse Polish notation class. Here's a link to the paper: http://goo.gl/cgGWM
Interesting work. The problem I see with using standard "unit tests" as fitness functions to evolve objects (or their functions) is that the resulting program may only compute the correct output for the defined test cases.
Say, for example, we want a function that adds two integers. We create a test with:
    for (int x = 0, y = 0; x < 10; x++, y++) {
        int result = sumIntegers(x, y);
        assert result == x + y;
        // cast to double, and guard the first iteration (x == y == 0) to avoid dividing by zero
        double error = (x + y == 0) ? result : (double) (result - x - y) / (x + y);
        writeFitness(x, y, error);
    }
A correct function (with fitness == 0) can be:
    int sumIntegers(int x, int y) {
        if (x == 0 && y == 0) return 0;
        if (x == 1 && y == 1) return 2;
        if (x == 2 && y == 2) return 4;
        if (x == 3 && y == 3) return 6;
        ...
    }
True, but that's true of most data mining and search algorithms: you are in danger of over-fitting the sample data. You can take steps to avoid this by weighting fitness in favour of correct solutions that are shorter, in the hope that brevity rules out an exhaustive case statement.
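One common way to apply that pressure is so-called parsimony pressure: fold program size into the fitness score, so that of two programs passing the same tests, the shorter one wins. A minimal sketch, where penaltyPerNode is an assumed tuning constant:

    // Sketch of parsimony pressure: test error dominates, but program
    // size is folded in, so the exhaustive case statement loses to a
    // genuinely general solution of equal test error.
    double fitness(double testError, int programSizeInNodes) {
        final double penaltyPerNode = 0.001; // assumed tuning constant
        return testError + penaltyPerNode * programSizeInNodes;
    }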
I don't want to come off as if I disagree with you; the evolved programs will only be as good/complete as the unit tests they are evolved against.
That has to be related somehow to mutation testing. They used that at my engineering school to test our (Java-subset) compilers, along with a lot of unit (and other) tests.
Actually, the fact that mutation testing is useful is a sign of one of the major difficulties with automatic algorithm generation: a tiny change in a program typically destroys a piece of functionality completely, which means it's hard to follow any sort of fitness landscape to incrementally create a working program that passes all your tests. When writing an algorithm you're looking for a needle in a haystack, not tweaking parameters to find the peak of a broad mountain, and that takes a lot of the punch out of automated methods.
What this means, in part, is that brute-force search will often be just as effective as evolutionary or hill-climbing methods. Where evolutionary methods start to shine is when you allow them to reuse bits of code that have already been found to be effective, especially in other problems. That corresponds a lot more closely to the way humans write code: we lean on our memories of the kinds of things that have worked before, for everything from syntax to algorithm design to system architecture.
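Within a single run, the basic reuse mechanism in GP is crossover. A minimal sketch of one-point crossover on instruction-string genomes (the int[] encoding and names are assumptions for illustration; assumes both genomes have length >= 2):

    import java.util.Random;

    // Sketch of one-point crossover on linear genomes: the child reuses
    // a prefix of one parent and a suffix of the other, so fragments
    // that have already proven effective can spread through the population.
    class Crossover {
        static int[] onePoint(int[] mum, int[] dad, Random rng) {
            int cut = 1 + rng.nextInt(Math.min(mum.length, dad.length) - 1);
            int[] child = new int[dad.length];
            System.arraycopy(mum, 0, child, 0, cut);
            System.arraycopy(dad, cut, child, cut, dad.length - cut);
            return child;
        }
    }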
Sure... and put the rest of us out of business? :) Although it is interesting to think about what the make-up of the genetic material would be. Just a string of 0s and 1s? CPU opcodes? Or maybe some higher-level sub-units.
These have been tried. Historically, Genetic Algorithms have represented their genotypes as bit strings. Some Genetic Programming systems use strings of CPU instructions (pros: fast as heck; cons: they can produce indecipherable spaghetti code that breaks when you remove those 400 NOPs, because it's adapted to the cache size of this particular CPU).
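For a flavour of that instruction-string style, a minimal sketch (the opcodes here are made up for illustration):

    import java.util.Random;

    // Sketch of a linear genotype: a flat array of (made-up) opcodes.
    // Point mutation overwrites one instruction at random; junk like the
    // 400 NOPs can accumulate because nothing selects against it directly.
    class LinearGenome {
        static final int NOP = 0, ADD = 1, SUB = 2, MUL = 3, LOAD = 4;
        static final int OPCODE_COUNT = 5;

        static int[] pointMutation(int[] genome, Random rng) {
            int[] child = genome.clone();
            child[rng.nextInt(child.length)] = rng.nextInt(OPCODE_COUNT);
            return child;
        }
    }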
> Or maybe some higher-level sub-units.
These too. Sometimes as trees of instructions, dataflow diagrams, FSMs and so on.
Very interesting... "indecipherable spaghetti code that breaks when you remove those 400 NOPs, because it's adapted to the cache size of this particular CPU" sounds a little like our "junk" DNA (though I suspect the analogy breaks down in the details).
It can get pretty Rube Goldbergesque, apparently. The main rule of thumb is that if you're doing GP with CPU instruction strings as your representation, beware of including run time in the fitness function, because you will almost inevitably get over-adaptation that breaks on any other platform.
GP would just turn all the programmers to digital shepherds :P
It would be a really interesting challenge to get broad enough coverage in the fitness function that things like the previously-mentioned "400 NOPs" don't cause problems.
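One crude way to push coverage up is to score each candidate on a fresh random sample of inputs rather than a fixed test list, so the population can't just memorise the test cases. A hedged sketch, reusing the earlier sumIntegers example as the thing being evolved:

    import java.util.Random;
    import java.util.function.IntBinaryOperator;

    // Sketch: evaluate a candidate on randomly drawn inputs so it cannot
    // simply memorise a fixed test list.
    class RandomizedFitness {
        static double score(IntBinaryOperator candidate, Random rng) {
            int failures = 0;
            for (int i = 0; i < 100; i++) {
                int x = rng.nextInt(2000) - 1000;
                int y = rng.nextInt(2000) - 1000;
                if (candidate.applyAsInt(x, y) != x + y) failures++;
            }
            return failures / 100.0; // 0.0 means every sampled case passed
        }
    }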
Besides unit (or otherwise) tests, it should also be possible to use static analysis to measure fitness and/or use a static type system to restrict the possibilities. I'd think this would have some of the same problems, though. Anybody know of any work on that?
> It would be a really interesting challenge to get broad enough coverage in the fitness function that things like the previously-mentioned "400 NOPs" don't cause problems.
All the branches of evolutionary computing agree that evolved systems are really good at finding flaws in your problem statements. Most of the evolved systems you see in production are not the product of the first attempt to evolve the system.
> Besides unit (or otherwise) tests, it should also be possible to use static analysis to measure fitness and/or use a static type system to restrict the possibilities. I'd think this would have some of the same problems, though. Anybody know of any work on that?
Type consistency for GP is a whole subfield in itself ("strongly typed GP"). See the textbook I referred to; from memory, it has a chapter on it.
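The basic idea, as I understand it, is that every primitive declares its return and argument types, and tree construction only plugs a subtree into a slot it can legally fill. A rough sketch (the Primitive class is illustrative, not from any named system):

    import java.util.List;

    // Sketch of strongly typed GP: each primitive carries a return type
    // and argument types; generation and crossover only combine nodes
    // where the types line up.
    class Primitive {
        final String name;
        final Class<?> returns;
        final List<Class<?>> args;

        Primitive(String name, Class<?> returns, List<Class<?>> args) {
            this.name = name;
            this.returns = returns;
            this.args = args;
        }

        // Can 'child' legally fill this primitive's argument slot 'i'?
        boolean accepts(Primitive child, int i) {
            return args.get(i).isAssignableFrom(child.returns);
        }
    }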
For the longest time now I've been meaning to work on this. What I envisioned was essentially a JavaScript preprocessor that allowed you to insert placeholder tags saying "figure out this chunk of code, with these constraints: ..., these suggestions: ..., these requirements: ..., etc.", at least to start.
The real problem with designing a system like that is not getting a GP framework running [1]; that's a couple of days of work at most. The difficulty is setting up a constraint-and-specification system that works well and is convenient, which means it needs to be very flexible in how you define the spec, yet push the user to define everything carefully enough that the system can actually make progress towards a solution. Without a push in that direction, it's just too easy to write crappy unit tests that either don't test enough of the input space to pick out the right implementation, or don't provide enough feedback to an optimizer to guide it towards a useful solution.

This is a particularly tricky issue for GP, because programs that are close to right usually have outputs that are completely and utterly wrong, at least in my experience, so hill-climbing barely works at all on syntax trees (though it can be really useful to point a standard hill-climber at your numerical constants once a syntax tree is created, especially in some types of mathy code). Writing code is closer to a needle-in-a-haystack problem, and it tends to confound evolutionary searches because the fitness feedback is not very helpful.
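To make that constant-tuning aside concrete, here's a minimal hill-climbing sketch (ConstantTuner and loss are made-up names, not from any framework; it assumes the loss has a minimum near the starting value):

    import java.util.function.DoubleUnaryOperator;

    // Sketch of the constant-tuning step: once the shape of a syntax
    // tree is fixed, hill-climb one of its numeric constants against a
    // loss function (lower is better).
    class ConstantTuner {
        static double tune(double c, DoubleUnaryOperator loss) {
            double step = 1.0;
            while (step > 1e-6) {
                if (loss.applyAsDouble(c + step) < loss.applyAsDouble(c)) {
                    c += step;
                } else if (loss.applyAsDouble(c - step) < loss.applyAsDouble(c)) {
                    c -= step;
                } else {
                    step /= 2; // no improvement either way: shrink the step
                }
            }
            return c;
        }
    }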
There's also the perennial problem of side effects, which cause a lot of trouble for GP applications, especially when you're dealing with external resources like databases or web services: you can't afford to just poke them randomly and see if things work (you might cause real changes), yet you have to actually exercise them during development, because you rely on their responses to figure out whether you're doing things right. So the user would always need to provide additional layers of code to roll back or prevent changes in these resources, which could get messy.
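One way to contain that is to hide every external resource behind an interface and hand candidates an in-memory fake (or a transaction that always rolls back) during fitness evaluation. A sketch of the interface-plus-fake approach, with hypothetical names, in the spirit of the bank account example mentioned earlier:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: evolved code only ever sees this interface, so fitness
    // evaluation can supply a throwaway in-memory fake instead of letting
    // candidates poke a real database or web service.
    interface AccountStore {
        int balance(String id);
        void deposit(String id, int amount);
    }

    class InMemoryAccountStore implements AccountStore {
        private final Map<String, Integer> balances = new HashMap<>();
        public int balance(String id) { return balances.getOrDefault(id, 0); }
        public void deposit(String id, int amount) { balances.merge(id, amount, Integer::sum); }
    }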
A starting point for a project like this would probably be to look closely at something like Prolog, where you can do a bit of this sort of thing, at least for certain types of problem (though Prolog essentially does it by brute force). From there I would probably take to heart the idea of making the search strategy used at any point a core part of the interface, so that (for instance) you could do a randomized search through programs of some bounded complexity (some combination of limits on token count, syntax tree depth, etc.), run an evolutionary strategy with a set of tokens, or optimize a set of constants to maximize a fitness metric.
[1] I'd probably reject standard GP as the main approach; there are perfectly reasonable ways to automatically explore program space without using evolution (at least not exclusively). I'd look toward some sort of hybrid search strategy, with a bit of random walk, a bit of brute force, and some statistical modeling of the domain (a la MOSES [http://code.google.com/p/moses/]) to inform the "evolution" part of it, as well as possibly leaning on a centralized database that automatically catalogues commonly useful code patterns and snippets.
And I want to be able to guide its evolution, prompting it to use certain technologies in certain places.