

Tests for programmers Part IV: comparing routines - RiderOfGiraffes
http://www.solipsys.co.uk/Writings/TestsForProgrammers_Part_4.html

======
jacquesm
I think there are some implications for copyright-on-code from all this
analysis. It is surprising, given the simplicity of the problem, that there
are so many unique ways to solve it.

This contradicts the statement that 'given an adequate spec two competent
programmers will solve a problem in the same way except for stylistic
differences'.

I've seen that tossed around quite a few times and if it doesn't hold water
for such a simple problem it certainly won't be left standing if we start
analyzing larger pieces of code.

On the whole I think this research is one of the most interesting things on HN
as of late.

~~~
eru
Yes, probably. But it also seems to depend on the level of your language. I.e.
if your language's built-ins are close to the problem, the solutions will
probably be much shorter and thus have a higher chance of being similar.

~~~
JoachimSchipper
Actually, C's string handling is _very_ well-suited to this problem. (Which
doesn't mean it is perfect - in fact, it's highly inconvenient for many tasks.
Just not this one.)

~~~
eru
Yes, it is. But C's other nuts and bolts are so low level that they introduce
lots of variability.

In e.g. Haskell everyone would just do

    condense_by_removing :: Char -> String -> String
    condense_by_removing c = filter (/=c)

Because there's much less going on here--the loop being more or less implicit
only--there's less room for variations.

Mind you, Haskell programs also show lots of variability if the task gets
only slightly more complicated.

~~~
dkersten
The Haskell snippet doesn't remove the character _in place_ though, does it? I
imagine solutions would be more varied then.
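
For comparison, here is a rough sketch of what an in-place version might look like in C. The signature is an assumption (the name is simply borrowed from the Haskell snippet, and the argument order is a guess), and this is only one of the many possible shapes such a routine could take:

```c
#include <assert.h>
#include <stddef.h>

/* Remove every occurrence of c from the NUL-terminated string s, in place.
   Hypothetical signature: the name and argument order are assumptions. */
void condense_by_removing(char *s, char c)
{
    assert(s != NULL);

    char *dst = s;                      /* next slot for a kept character */
    for (const char *src = s; *src != '\0'; ++src) {
        if (*src != c)
            *dst++ = *src;              /* keep it; dst never passes src */
    }
    *dst = '\0';                        /* terminate the shortened string */
}
```

Called on a writable buffer holding "hello world" with 'l' as the character to remove, this leaves "heo word" in the same buffer. Even here there are choices to make (read/write pointers versus indices, for versus while), which is where the variety comes from.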

~~~
lincolnq
No. "In place" has virtually no meaning in Haskell.

I find this very interesting -- a popular idiom in one language can't even be
represented in another.

~~~
sesqu
One of the biggest reservations I initially had about abstraction, when
learning programming, was that I didn't consider performance a detail. As
such, I didn't litter my code with objects or pure functions even when they
made conceptual sense, unless I was convinced the overhead was small and the
abstraction helpful.

I've found that languages like Java make assumptions about your memory, and
languages like Prolog make assumptions about your data. I'm under the
impression that in Haskell in-place modification is an interpreter
optimization, whereas C is increasingly employed when resources are
constrained and abstraction layers are deemed too costly. Consequently, C is
maligned for resulting in faulty implementations behind every corner, yet it
remains one of the most used languages.

~~~
eru
The Haskell compiler is free to use in-place mutations, when it can prove that
nobody can access the old object any longer.

The language Clean supports this notion explicitly in its type system (and
uses it for IO, instead of, say, Monads like Haskell).

------
mullr
Wouldn't it be better to build a parse tree and then cluster based on the
difference between the trees? I think this is called "tree distance", or
sometimes "tree edit distance." There seems to be a reasonable amount of
research on the subject.

~~~
RiderOfGiraffes
Much harder to do with the tools I have immediately to hand, and much harder
to visualise the main results without special tools. The "fingerprint string"
gives 90% of the effect with very little effort on my part.

However, I will investigate further the "tree edit distance" to see if I can
bootstrap something quickly, just out of interest. Thanks for the reminder.

~~~
bediger
You should probably start with Brenda S Baker's work at Bell Labs:
[http://www.cs.ucdavis.edu/~devanbu/teaching/289/Schedule_fil...](http://www.cs.ucdavis.edu/~devanbu/teaching/289/Schedule_files/baker-wcre95.pdf)
references the final result.

Baker's alumni web page: <http://cm.bell-labs.com/cm/cs/who/bsb/index.html>

It's unfortunate that Baker's "dup" and "pdiff" haven't gotten the open source
treatment, or at least if they have, they're not widespread. SCO could have
saved themselves a lot of lawyer's fees by running "dup" over SysV and Linux
sources to see what's similar.

------
davisp
Pretty neat stuff. The method used for decimating representations is pretty
awesome. Seeing the aligned routines makes me think that quite a few of the
bioinformatics algorithms could be useful for such an analysis.

Two spring to mind. The MCL clustering algorithm could be applied quite
easily to the similarity matrix. As a more academic endeavor, it'd also be
interesting to see what kinds of differences in similarity you'd get by
applying Needleman-Wunsch.

<http://www.micans.org/mcl/>
[http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algori...](http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm)
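
To make the Needleman-Wunsch suggestion concrete, here is a minimal sketch of the global alignment score in C. The scoring values (+1 match, -1 mismatch, -1 gap) and the function name are illustrative assumptions, and feeding it the article's fingerprint strings is just one way it might be used, not something the article does:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative scoring values, not taken from the article. */
enum { MATCH = 1, MISMATCH = -1, GAP = -1 };

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Needleman-Wunsch global alignment score of a and b.
   Keeps only two rows of the DP table, so it yields the score but not
   the alignment itself. Returns INT_MIN if memory cannot be allocated. */
int nw_score(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    size_t n = strlen(a), m = strlen(b);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return INT_MIN;
    }

    /* Row 0: aligning the empty prefix of a against b costs only gaps. */
    for (size_t j = 0; j <= m; ++j)
        prev[j] = (int)j * GAP;

    for (size_t i = 1; i <= n; ++i) {
        curr[0] = (int)i * GAP;         /* prefix of a against empty b */
        for (size_t j = 1; j <= m; ++j) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
            int up   = prev[j] + GAP;       /* gap in b */
            int left = curr[j - 1] + GAP;   /* gap in a */
            curr[j] = max3(diag, up, left);
        }
        int *tmp = prev;                /* current row becomes previous */
        prev = curr;
        curr = tmp;
    }

    int score = prev[m];
    free(prev);
    free(curr);
    return score;
}
```

With these values, identical strings of length 3 score 3, and one substitution lowers that to 1. Pairwise scores like these could fill a similarity matrix of the kind MCL takes as input.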

------
scorchin
Thanks for the explanation on your process for creating the clusters and the
analysis. It cleared up a few questions I had.

What's interesting is the large clusters where either a for or a while loop
was used. I'm curious whether, given a (limited) choice, any hiring manager
would have a preference between the two implementations.

TLDR: How much does/would coding style influence the hiring process?

~~~
RiderOfGiraffes
That's something I'll be going into in the next stage - feature
identification. Some routines used for, some used while, some used do, and
some used switch.

Yes, switch. Admittedly not as the looping structure, but for the internal
decision process of the loop.

Some of these choices are indicative that the programmer isn't a native
idiomatic C programmer, but some might be genuine choices for one reason or
another. That's when style choice exposes internal concepts.

More later. (On current work load - much later. It's taking about 2 weeks per
article here, so it won't be quick. Sorry.)

~~~
scorchin
That's fair enough. Are you still coming to the London meetup on Thursday?

~~~
eru
Is it open for other hackers, too?

~~~
dkersten
And where are these things advertised? I'd love to attend, but need a little
bit more notice since I'd have to arrange a flight from Dublin and somewhere
to stay for a night or two.

~~~
scorchin
They're advertised as local posts here on HN. Here's the link:
<http://news.ycombinator.com/item?id=1434964>

Sorry for the delay, hope I've given you enough time to sort out travel
arrangements.

~~~
dkersten
Well, I wouldn't have made it anyway, but I'll keep my eyes open for the next
one. Maybe I'll be able to go then :)

------
Amnon
Just wondering, what's the solution to the riddle at the end? The test
function sets up the input, calls the routine, and then compares its output to
the desired output. There doesn't seem to be a place for error there.

~~~
JoachimSchipper
See <http://www.joachimschipper.nl/posts/20100622/answer.txt>. (Hidden behind
a hyperlink to not spoil the riddle.)

~~~
RiderOfGiraffes
I think I disagree with you. The input you mention is correctly handled by the
routine you mention, and is tested correctly.

Certainly there is a bug other than the one you think you've found, and I
don't think the circumstances you mention demonstrate a bug at all.

I would certainly be interested in seeing a more detailed analysis.

~~~
JoachimSchipper
You're right, that answer isn't correct. I still think that particular input
should be tested, though (you say it is - did you leave out that part of your
testing program? It should be possible to construct a program that fails on
such inputs.)

I don't think I've found a real "bug" yet. You include, but don't use,
stdlib.h; you exit with status 0, even on error; but these are nitpicks, not
what you mean. I'll think a bit more.

~~~
RiderOfGiraffes
I tested that case when you mentioned it. All routines submitted pass it
correctly, so I haven't worried about it too much. To do so would be to stray
too far from the original intention. I think it's hard to write a natural
looking routine that fails that test.

I return 0 in all cases because my test succeeds, even if the routine it's
testing fails. It's up to my shell to decide what to do about that error. As
it stands it reports the error, but it has succeeded in doing so, so it hasn't
failed.

But that's not the point, as you say.

And the real bug is still there.

~~~
JoachimSchipper
Well, I'm stumped. I can think of some other "cosmetic" issues and some things
you fail to test (e.g. that the function is in-place, runs in
O(strlen(z_terminated)) and does not access memory beyond
z_terminated[strlen(z_terminated)]), but that's it. Besides, as you mention,
such issues can usually be found just by looking at the function.
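
One of those untested properties can be checked mechanically. A rough sketch, assuming the routine has the shape `void f(char *, char)` (an assumption, as are all the names here): fill the buffer with guard bytes, run the routine, then verify both the result and that nothing after the original terminator was written. It catches stray writes only, not stray reads or the running time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 64, GUARD = 0x5A };

/* Stand-in routine under test (hypothetical signature). */
void condense_by_removing(char *s, char c)
{
    char *dst = s;
    for (const char *src = s; *src != '\0'; ++src)
        if (*src != c)
            *dst++ = *src;
    *dst = '\0';
}

/* Deliberately buggy variant: also clears one byte past the original
   terminator. Only safe to call from the harness below, which guarantees
   that byte exists. */
void sloppy_remove(char *s, char c)
{
    size_t n = strlen(s);
    condense_by_removing(s, c);
    s[n + 1] = '\0';
}

/* Returns 1 if routine turns input into expected inside the caller's
   buffer and leaves every byte after the original terminator untouched;
   returns 0 otherwise (or if input does not fit the harness buffer). */
int check_in_place(void (*routine)(char *, char),
                   const char *input, char c, const char *expected)
{
    assert(routine != NULL && input != NULL && expected != NULL);

    char buf[BUF_SIZE];
    size_t len = strlen(input);
    if (len + 1 >= BUF_SIZE)
        return 0;                       /* need room for a guard byte */

    memset(buf, GUARD, sizeof buf);     /* guard bytes everywhere */
    memcpy(buf, input, len + 1);        /* string plus its terminator */

    routine(buf, c);

    /* The result must still be terminated within the original extent. */
    if (memchr(buf, '\0', len + 1) == NULL)
        return 0;
    if (strcmp(buf, expected) != 0)
        return 0;

    /* Nothing past the original terminator may have been written. */
    for (size_t i = len + 1; i < sizeof buf; ++i)
        if ((unsigned char)buf[i] != GUARD)
            return 0;
    return 1;
}
```

Here `check_in_place(condense_by_removing, "hello", 'l', "heo")` passes, while the same call with `sloppy_remove` fails on the guard check even though the visible string is correct.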

Will you give out the answer at some point?

~~~
RiderOfGiraffes
Yes - in the next part of the analysis.

------
sireat
Interesting analysis so far.

Are we going to see some performance data?

While pure performance should not be the overriding consideration, it would
still be interesting to see how everyone's routine stacked up.

I would hazard a guess that some dissimilar looking routines will have very
similar performance.

Of course, in reality, all I am looking for is the confirmation that my code
did not suck too badly...

~~~
RiderOfGiraffes
What's your routine ID and I'll give you my opinion - I will discuss
performance later.

------
greyfade
While I was trying to figure out why my submission ended up where it did...

I realized my code has a bug.

It's too late for a resubmission, isn't it? :(

~~~
JoachimSchipper
I think _all_ solutions are used, anyway. And the submissions are anonymous
for exactly this reason.

