
What can you do in 2k LOC of C? - s3graham
http://www.h4ck3r.net/2011/02/02/what-can-you-do-in-2k-of-c/
======
js2

      $ git show e83c5163316f89bfbde7d9ab23ca2e25604af290 --stat
      commit e83c5163316f89bfbde7d9ab23ca2e25604af290
      Author: Linus Torvalds <torvalds@ppc970.osdl.org>
      Date:   Thu Apr 7 15:13:13 2005 -0700
    
        Initial revision of "git", the information manager from hell
    
       Makefile       |   40 +++++++++
       README         |  168 ++++++++++++++++++++++++++++++++++++
       cache.h        |   93 ++++++++++++++++++++
       cat-file.c     |   23 +++++
       commit-tree.c  |  172 +++++++++++++++++++++++++++++++++++++
       init-db.c      |   51 +++++++++++
       read-cache.c   |  259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
       read-tree.c    |   43 +++++++++
       show-diff.c    |   81 ++++++++++++++++++
       update-cache.c |  248 +++++++++++++++++++++++++++++++++++++++++++++++++++++
       write-tree.c   |   66 ++++++++++++++
       11 files changed, 1244 insertions(+), 0 deletions(-)
    

:-)

~~~
bdonlan
To be fair, that's 1244 _lines_, not bytes :)

~~~
kevinherron
So is what the OP is talking about :o (lines vs. bytes)

------
sliverstorm
_If you think about it for a second, you realize that TrueType rasterization
can't be that hard because printers were doing it long ago on crappy little
embedded processors, but the default is just to fall back on the big ugly
library, and then wrap it and pretend it's not there. How about instead, just
write some good code?_

This is why 'modern' software can still manage to bring a 3GHz quad-core to
its knees, IMHO.

~~~
cdavid
This is also why we have many more softwares available now.

The improvements in hardware specs are so useful not only because the hardware
is faster, but also because we can do much more without having to care about
the details. Writing tiny, efficient libraries is cool, but it takes a lot of
time, everything else being equal, so if you can afford not to do it, you
don't.

~~~
gnaritas
> This is also why we have many more softwares available now.

Assuming English isn't your first language, that would be phrased "much more
software"; software is always singular.

~~~
w1ntermute
<http://www.ar.media.kyoto-u.ac.jp/members/david/>

"David Cournapeau

a French PhD student in signal processing at Kyoto University"

France and Japan...he hasn't exactly been living in places where English is
prominently spoken with grammatical accuracy...cut him some slack. In fact, he
made the exact same error of pluralizing "software" as "softwares" on his
website.

~~~
gnaritas
> cut him some slack

You act as if I'm giving him a hard time; I'm not. I've generally found that
multilingual people want to be corrected when they get something wrong because
they find it helpful; I was doing him a favor.

------
pygy_
Roberto Ierusalimschy's lpeg is around 2.4k LOC of ANSI C without any
dependency besides libc and lua.h (needed to interface with Lua, since it's a
Lua library).

It implements an efficient pattern-matching system based on Parsing Expression
Grammars (akin to CFGs, but without ambiguity). It consists of a
pattern/grammar-to-bytecode compiler and a custom VM to interpret the result
of the compilation phase.

Nice and clean.

<http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html>

<http://en.wikipedia.org/wiki/Parsing_expression_grammar>
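The pattern-to-bytecode design is what makes LPEG fast, but the PEG semantics themselves are small. As a hypothetical illustration (not LPEG's actual code), here's the deterministic "ordered choice plus greedy repetition" behavior of a PEG sketched in plain C, for the toy grammar `Expr <- Num ('+' Num)*`:

```c
#include <ctype.h>

/* Each rule returns the number of characters consumed, or -1 on failure. */
int peg_num(const char *s) {
    int i = 0;
    while (isdigit((unsigned char)s[i])) i++;
    return i > 0 ? i : -1;
}

int peg_expr(const char *s) {
    int n = peg_num(s);
    if (n < 0) return -1;
    for (;;) {
        /* PEG repetition is greedy and deterministic: try '+' Num, and
           if it fails, backtrack to before the '+' and stop. There is
           never more than one parse -- unlike an ambiguous CFG. */
        if (s[n] != '+') return n;
        int m = peg_num(s + n + 1);
        if (m < 0) return n;    /* backtrack: leave the '+' unconsumed */
        n += 1 + m;
    }
}
```

Each rule either consumes a prefix of the input or fails outright, which is the practical difference PEGs buy you over CFGs.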

~~~
silentbicycle
The Lua code itself (while not <2 kloc) is also clean, well-designed C, and
definitely worth studying. It gets quite a bit of functionality out of 12.6
kloc, and if you work with C, Lua is an ally (it's like Tcl, but good). :)

Also, in my experience, LPEG is a great fit for middle-ground parsing - things
more complex than Perlish regexps (which it handles well), including some a
bit more complex still - but it's awkward for really complex parsing (since it
inherently combines lexing and parsing).

But I dissed Perl and Tcl, so -1. Screw useful information. ;)

~~~
beagle3
> but it's awkward for really complex parsing (since it inherently combines
> lexing and parsing).

I'm kind of surprised by that statement - grammars for programming languages
expressed in PEG always seem much cleaner to me than the lex/parse separation.
Can you give an example of a grammar that you find is complicated by PEG?

~~~
silentbicycle
I initially wrote the parser for a (proprietary) query language compiler using
LPEG, and keeping track of lexical issues (whitespace, etc.) along with the
grammatical structure complicated things. When I broke it apart and wrote a
quick FSM-based lexer and a recursive descent parser, it became much simpler.
Perhaps I could have factored the PEG grammar better, but with every PEG
grammar I've had grow beyond a certain point, I've wished I could separate the
lexing and parsing stages.

I much prefer working with LPEG to regular expressions for small to midsized
stuff, though.

------
sofuture
Shocked that silentbicycle hasn't mentioned it already, but Arthur Whitney
whipped up the first prototype/inspiration for the J language in a short bit
of macro-heavy C over the course of an afternoon.

42 lines?

<http://pastebin.com/s2usuqDq>

If this interests you at all, absolutely worth reading Roger Hui's
retrospective on the subject (more about J + Ken Iverson, but definitely
fascinating) <http://keiapl.org/rhui/>

~~~
ehsanul
That's interesting, but damn is that some ugly code. Slightly obfuscated on
purpose? Though of course it wouldn't win the IOCCC.

On the other hand, it's notable that many IOCCC submissions manage to pack a
lot of functionality into often less than 2k. I remember reading a few
descriptions of some of the winning entries, but I can't find them now. Here's
a glimpse though:
<http://cboard.cprogramming.com/brief-history-cprogramming-com/123-ioccc.html>

~~~
silentbicycle
Having spent a lot of time meditating on it (as sofuture mentioned), it isn't
"slightly obfuscated" so much as "stubbornly written like APL rather than C".

If you become inexplicably fascinated by that code and want help unraveling
it, my contact info is in my profile.

I don't consider it noteworthy as a _useful_ program under 2kloc (on modern
hardware it usually just crashes, it's quite cavalier with pointer casting,
and clearly a quick prototype either way), but it's like a pink space laser
beam of insight about the APL mindset. Real APLs take more than a page, but
not _that much_ more. Eliding loops does that.

A lot of the IOCCC code is delightfully perverse, too. Highly recommended.

~~~
ehsanul
Yeah, I realized that fact a little late. Relevant snippet from the link about
that initial bit of code for J:

 _I showed this fragment to others in the hope of interesting someone
competent in both C and APL to take up the work, and soon recruited Roger Hui,
who was attracted in part by the unusual style of C programming used by
Arthur, a style that made heavy use of preprocessing facilities to permit
writing further C in a distinctly APL style._

~~~
silentbicycle
Arthur's C is stubbornly unconventional, but often, he's got a point.

Also, the APL community seems to be pretty disjoint from the rest of CS
(though Arthur is also a Lisper). I think it'd be mutually beneficial if the
APLers and the MLers got together, in particular.

------
haberman
The core of my protobuf-decoding library upb
(<https://github.com/haberman/upb/wiki>) is 3k SLOC of C. That includes

    
    
      * a hash table implementation
      * a reference-counted string type
      * a generic interface for doing tree traversals of protobuf data
      * the protobuf decoder itself (which implements the previous interface)
      * all the code to load proto descriptors (including bootstrapping the first one,
        which is necessary to load others)

~~~
silentbicycle
It really says something about C that so many useful-but-small systems have a
good chunk of code devoted to hash table implementations, atoms ("a reference-
counted string type"), etc.

When working with C, it sometimes makes sense to have bespoke data structures,
but it always makes me think of Hanson's _C Interfaces and Implementations_,
Greenspun's tenth rule, etc.

My 1500loc C project also has a hash table implementation (and, soon, dynamic
arrays (gasp!)). That and serializing custom data structures to disk accounts
for more than half of its code. It's fast as hell, but I originally prototyped
it in like 100 lines of Lua.
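For a sense of scale, the kind of hash table a small C project ends up carrying looks roughly like this - a hypothetical minimal sketch (string keys to ints, fixed bucket count, separate chaining), not the actual code from upb or the indexer:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

struct entry { char *key; int val; struct entry *next; };
struct table { struct entry *buckets[NBUCKETS]; };

static unsigned hash(const char *s) {           /* djb2 */
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* insert or update; returns 0 on success, -1 on allocation failure */
int table_set(struct table *t, const char *key, int val) {
    unsigned h = hash(key);
    for (struct entry *e = t->buckets[h]; e; e = e->next)
        if (strcmp(e->key, key) == 0) { e->val = val; return 0; }
    struct entry *e = malloc(sizeof *e);
    if (!e) return -1;
    e->key = malloc(strlen(key) + 1);
    if (!e->key) { free(e); return -1; }
    strcpy(e->key, key);
    e->val = val;
    e->next = t->buckets[h];
    t->buckets[h] = e;
    return 0;
}

/* returns 1 and stores the value if found, 0 otherwise */
int table_get(const struct table *t, const char *key, int *out) {
    for (const struct entry *e = t->buckets[hash(key)]; e; e = e->next)
        if (strcmp(e->key, key) == 0) { *out = e->val; return 1; }
    return 0;
}
```

Fifty-odd lines before you've written a line of your actual program, which is exactly the point being made here.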

~~~
beagle3
> It really says something about C that so many useful-but-small systems have
> a good chunk of code devoted to hash table implementations, atoms ("a
> reference-counted string type"), etc.

Well, the only reason for that is that C never had a hash table in its
standard library. The same is true of any other programming language I've seen
that lacks one.

Python's and Lua's dicts are really hard to top (at least in the general case,
and definitely from within the language), so there aren't even any attempts.
Java's is slow and horrible, so there are hundreds of replacements, but
they're all interface-compatible.

> That and serializing custom data structures to disk accounts for more than
> half of its code.

That's one of the things K gets unbelievably right - there is one routine
(which with Arthur's style is probably 10 lines of C) for serialize, and
another for deserialize. They work for any K structure, and they do that with
blinding speed thanks to being totally memory mapped. My C code has switched
to that approach as well - and it works really, really well.
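The approach works whenever the in-memory representation is already flat: "serialize" is a single write and "deserialize" is a single mmap, with nothing parsed or copied. A hypothetical POSIX-only sketch (names and layout invented for illustration, not K's actual code):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* a flat, self-describing structure: length prefix + contiguous payload */
struct vec { size_t len; double data[]; };

/* serialize: one fwrite of the raw bytes */
int vec_save(const char *path, const struct vec *v) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t bytes = sizeof *v + v->len * sizeof v->data[0];
    int ok = fwrite(v, 1, bytes, f) == bytes;
    fclose(f);
    return ok ? 0 : -1;
}

/* deserialize: one mmap; the data is read straight out of the page cache */
const struct vec *vec_load(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```

The catch, of course, is that this only works while the on-disk layout matches the in-memory one (same word size, same endianness), which is a trade-off K happily makes.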

~~~
silentbicycle
My text indexer is using a mmap'd hash table too, and that's exactly why. :)

------
silentbicycle
I'm working on a text indexing/retrieval program, like locate
(<http://www.openbsd.org/cgi-bin/man.cgi?query=locate>) but for content and
not just filenames, and with an index <=2% the size of the indexed data.

It's very nearly together (integrating individually working parts now), and is
currently ~1,500 lines (according to sloccount).

Adding support for indexing Unicode text, more configuration, composite search
queries (A and B near C and not D), etc. will no doubt make the source expand
a bit, but it's still pretty small.

If you're interested in trying it out once it's ready, contact info is in my
profile. I'm shooting for within a week or two for a _beta vulgaris_.
(Requires Unix. ANSI C, strung together with sh and/or awk to avoid
dependencies.)

~~~
ajays
Does the index include the dictionary too, in your calculation? I'd be
interested in seeing this indexing/retrieval program of yours; I hope you
release it soon!

~~~
silentbicycle
Yes.

------
mayank
I wrote EasyEXIF: <http://code.google.com/p/easyexif/>

About 120 lines of C++ [see 1] that parses basic EXIF information out of a
JPEG image. I found all the other EXIF parsing tools and libraries a little
too heavyweight for something as simple as getting the date and time a picture
was taken, or the f/stop or exposure time. It only uses string.h for memcpy
and memset, and no other headers.

[1] <http://code.google.com/p/easyexif/source/browse/trunk/exif.cpp> (note:
Google's source view screws up my whitespace)
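Part of why this can stay small: EXIF lives in a single JPEG APP1 segment, so the outer loop is just walking length-prefixed markers until one starts with "Exif". A hypothetical sketch of that scan (not EasyEXIF's actual code):

```c
#include <stddef.h>
#include <string.h>

/* Walk JPEG segments; return the offset of the TIFF header inside an
   APP1 (0xFFE1) EXIF segment, or -1 if none is found. */
long find_exif(const unsigned char *buf, size_t len) {
    if (len < 4 || buf[0] != 0xFF || buf[1] != 0xD8) return -1;  /* no SOI */
    size_t i = 2;
    while (i + 4 <= len && buf[i] == 0xFF) {
        unsigned marker = buf[i + 1];
        /* big-endian segment length; it counts its own two bytes */
        size_t seglen = ((size_t)buf[i + 2] << 8) | buf[i + 3];
        if (seglen < 2 || i + 2 + seglen > len) return -1;  /* truncated */
        if (marker == 0xE1 && seglen >= 8 &&
            memcmp(buf + i + 4, "Exif\0\0", 6) == 0)
            return (long)(i + 10);      /* start of the TIFF header */
        i += 2 + seglen;
    }
    return -1;
}
```

Everything after that offset is plain TIFF IFD parsing, which is where the remaining hundred or so lines go.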

------
abecedarius
I was tempted to show off some toys I feel fatherly pride for, like my
500-line spreadsheet, but the only C program in this size range that I still
use much is <http://wry.me/~darius/software/req.tar.gz> \-- a rewrite-rule-
based programmable calculator. Since it's >20 years old it's not at all what
I'd write now.

(Toy spreadsheet at <https://github.com/darius/vicissicalc>)

------
jerf
Getting a standards-compliant XML parser into 2K lines is going to be a
challenge, if you're not going to cheat on what a "line" is. You must be able
to deal with both UTF-8 and UTF-16 [1] (and remember UTF-16 can be in either
endian order), you have several tables of things like which chars are valid
where, you've got data structures to declare, and there are a lot of edge
cases that may not leap to mind but that you must cover to really have an
_XML_ parser: CDATA handling, entity loading from a DTD, parsing DTDs at least
well enough to get those entities, processing instructions, etc. A useful
subset, certainly; something I'd actually call a real _XML parser_, I'm
skeptical. Not quite ready to write the idea off, but skeptical. (It's saved
from me writing it off entirely because, if I read the spec correctly, a
parser is not required to resolve external DTD references; if it were, it
would be stick-a-fork-in-it done - you'd eat hundreds of lines just using raw
sockets to do HTTP requests and manage them even halfway properly.)
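The endianness point is concrete: before reading a single character, a conforming parser has to sniff the byte-order mark, or failing that the byte pattern of the first '<'. A minimal detector along the lines of the encoding-detection appendix of the XML spec (a sketch only - it ignores UTF-32 and the in-document encoding declaration):

```c
#include <stddef.h>

enum enc { ENC_UTF8, ENC_UTF16_BE, ENC_UTF16_LE };

enum enc sniff_encoding(const unsigned char *buf, size_t n) {
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;                 /* UTF-8 BOM */
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16_BE;             /* BOM U+FEFF, big-endian */
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16_LE;             /* BOM, little-endian */
    if (n >= 2 && buf[0] == 0x00 && buf[1] == '<')
        return ENC_UTF16_BE;             /* no BOM: '<' pattern */
    if (n >= 2 && buf[0] == '<' && buf[1] == 0x00)
        return ENC_UTF16_LE;
    return ENC_UTF8;                     /* default per the spec */
}
```

And that's the cheap part - the decoder that this selects is where the lines pile up.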

A JSON parser? Heck yes, even with the UTF-8 handling. It wouldn't even
necessarily suck.

[1]: <http://www.w3.org/TR/REC-xml/#charsets>
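To gesture at why the JSON case is so much friendlier: each production in the JSON grammar is a short, self-contained function. A hypothetical fragment - two of the scanners, no unicode escapes, containers omitted:

```c
#include <ctype.h>

/* Each scanner returns the chars consumed, or -1 on malformed input. */

int json_scan_string(const char *s) {
    if (*s != '"') return -1;
    int i = 1;
    while (s[i] && s[i] != '"') {
        if (s[i] == '\\' && s[i + 1]) i++;   /* skip the escaped char */
        i++;
    }
    return s[i] == '"' ? i + 1 : -1;         /* -1 if unterminated */
}

int json_scan_number(const char *s) {
    int i = 0;
    if (s[i] == '-') i++;
    if (!isdigit((unsigned char)s[i])) return -1;
    while (isdigit((unsigned char)s[i])) i++;
    if (s[i] == '.') {                       /* optional fraction */
        i++;
        if (!isdigit((unsigned char)s[i])) return -1;
        while (isdigit((unsigned char)s[i])) i++;
    }
    return i;
}
```

Objects and arrays are just these plus recursion, which is why a full parser stays in the hundreds of lines rather than the thousands.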

~~~
tzs
> Getting a standards-compliant XML parser into 2K lines is going to be a
> challenge

This is a pretty good argument against XML.

~~~
jerf
Yes, that point may have had some influence on the way I phrased my post.

------
sigil
How about micro_httpd, an inetd-style HTTP/1.0 server in 200 lines of C.
(<http://acme.com/software/micro_httpd/>)

------
nostrademons
In 2000 lines of C, you can encode the HTML5 "named character reference"
table.

<http://dev.w3.org/html5/spec/Overview.html#named-character-references>

I hate HTML.
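The table itself is mechanical - sorted (name, codepoint) pairs plus a binary search - it's just that the HTML5 list has over two thousand names, some mapping to two codepoints (which this sketch ignores). A few real entries as illustration:

```c
#include <stdlib.h>
#include <string.h>

struct named_ref { const char *name; unsigned cp; };

/* must stay sorted by name for bsearch; real table has 2000+ rows */
static const struct named_ref refs[] = {
    { "AElig", 0x00C6 },
    { "amp",   0x0026 },
    { "copy",  0x00A9 },
    { "lt",    0x003C },
    { "nbsp",  0x00A0 },
};

static int cmp(const void *k, const void *e) {
    return strcmp((const char *)k,
                  ((const struct named_ref *)e)->name);
}

/* returns the codepoint, or 0 if the name is unknown */
unsigned lookup_entity(const char *name) {
    const struct named_ref *r = bsearch(name, refs,
        sizeof refs / sizeof refs[0], sizeof refs[0], cmp);
    return r ? r->cp : 0;
}
```

The lookup logic is five lines; the other ~1,995 lines of the 2000 are the data.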

~~~
johne
O.. M.. F.. G.. I had no idea they did this. I mean, what is the point? Talk
about "not invented here" mentality: "I've got a fantastic idea! Unicode
already has an official name for every Unicode character, so let's throw all
of that out and come up with our own names that have no relation to what those
Unicode guys did! And while we're at it, HTML4, HTML5? Version numbers are for
pu$$7$s, so let's drop that too, and then randomly throw in a few hundred extra
character entities every two to eighteen months for a fresh 'What's new'
bullet point!"

~~~
nostrademons
It's for backwards compatibility, like most of the HTML5 spec. If browsers
suddenly start becoming HTML5-compliant and as a result most of the web stops
working, then they've failed.

I agree with the reasoning. But damn, it makes things suck going forwards.
This is why we can't have nice things. :-(

------
pnathan
I wrote a cooperative multitasking RTOS for an ATMega processor in 1200 or so
lines.

~~~
jpd
If you're willing to share the code, I'm sure many of us would love to see it.

~~~
pnathan
Here you go!

<https://bitbucket.org/pnathan/uirtos>

Turned out it was a preemptive RTOS. I had forgotten that.

------
Zev
_An JSON / XML / YAML parser?_

The fastest JSON parser that I know of is implemented in ~1800 lines of
Objective-C: <https://github.com/johnezang/JSONKit>. It's about 2-3x faster
than the yajl bindings for Objective-C are, in my testing.

------
limmeau
Fabrice Bellard's IOCCC entry of 2002 was a C-subset compiler in 617 wc-lines
(ELF version).

<http://bellard.org/otcc/>

~~~
beagle3
It's an amazing piece of work: those 617 lines are written in its own C
subset, so it can compile itself into a native-code ELF binary.

This evolved into Bellard's "tcc" compiler, which is the fastest
(compile-time-wise) compiler available on Linux, and the only one I'm aware of
that can be used as a library to run code in memory (without having to
generate an .so and load it). And its output's performance, while lagging
behind -O3 gcc or clang, is not too shabby!

------
joe_the_user
The next question is:

 _How many of these 2K sections can you string together to make something of
further use?_ (using another 2k section, of course)

------
just-a-c-hacker
How about an X11 tiling window manager? <http://dwm.suckless.org/>

------
zephjc
Write a basic Lisp interpreter/compiler and write everything else in Lisp? :)

Would compiling Lisp code count against the challenge's LOC limit?

~~~
abecedarius
I did once write a 2k-line bytecode interpreter and runtime for R4RS Scheme --
the compiler, library, and debugger were another 2-3kloc of Scheme. I even
used it as my main hacking platform for years. But I didn't post it here
because:

* It depends on the Boehm garbage collector.

* (This is embarrassing.) It stopped working after some version bump of GCC, and the cause is not at all obvious to me. Since I don't Scheme much anymore I gave up.

<http://wry.me/~darius/software/uts.tar.gz> if anyone cares.

~~~
silentbicycle
No blame, but I'm curious "why you don't Scheme much anymore". Was it due to
Python, Common Lisp, etc?

I switched from (Chicken) Scheme to Common Lisp (and, in parallel, from Python
to Lua), and, having read a lot of old code of yours, I'm curious about your
choices - you seem consistently sensible and pragmatic. (Sorry to put you on
the spot.)

~~~
abecedarius
Thanks! Partly Python's gotten more tolerable as a language, partly I'm doing
more things needing libraries, partly I mostly code inside
<https://github.com/darius/halp> these days and it doesn't have a Scheme mode
so far. I do have a couple of recent Scheme projects up on github though --
optilamb and selfcentered.

Very impressed with LuaJIT2, btw -- I'd like to do more with it.

~~~
limmeau
Halp is fun. Thanks. Installed it already.

~~~
abecedarius
Glad you liked it!

------
grammaton
Back in The Day (tm) I wrote a full Gnutella client in about 1600 lines of
code.

------
whakojacko
Anyone know how big nginx is? I wouldn't be at all surprised if it's under 2k
LOC.

~~~
silentbicycle
You're probably joking, but sloccount says:

    
    
        ansic:        82714 (99.68%)
        perl:           124 (0.15%)
        sh:              78 (0.09%)
        asm:             48 (0.06%)
        cpp:             17 (0.02%)
    

(for nginx-0.8.53)

~~~
silentbicycle
Why have I gotten _13_ upvotes for this? Serious question.

~~~
nkurz
It's a show of thanks for taking the trouble to give a solid answer backed by
facts. Just as bad memes are publicly swatted down, good work (even small)
should be praised to set an example for others.

~~~
silentbicycle
If actual facts are that rare in software engineering, then that's one hell of
a "code smell". It really wasn't hard to get.

~~~
nostrademons
Facts are pretty common in (good) software engineering, but they're rare in
Internet discussion. Hence, solid facts in Internet discussion tend to rack up
points.

------
derleth
How much memory does the code the blog poster described leak? How does it
respond to edge cases and invalid input? Finally, is it portable beyond one
specific OS? Beyond POSIX or Windows-based systems?

Those questions are especially pertinent in C.

~~~
JoeAltmaier
Right! Coming up with standalone C means coming up with your own versions of
hard-to-write simple "system calls" like memmove.

I debugged a memcpy (taken from Linux! 10 years ago) on a RISC processor - it
had 12 bugs in a dozen lines of code. It wasn't designed to run on RISC, but
that shows how hard it can be to get this stuff right.
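For reference, the subtlety in memmove is a single decision: when the destination overlaps the tail of the source, you must copy backward. A minimal byte-at-a-time sketch - a production version would copy word-at-a-time, and the raw pointer comparison is technically undefined across distinct objects (a liberty libc implementations get to take):

```c
#include <stddef.h>

void *my_memmove(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s) {
        while (n--) *d++ = *s++;    /* copy forward */
    } else if (d > s) {
        d += n;
        s += n;
        while (n--) *--d = *--s;    /* copy backward so the overlapping
                                       tail isn't clobbered first */
    }
    return dst;
}
```

Getting this direction check wrong (or omitting it, as memcpy may) is exactly the class of bug that only shows up on overlapping inputs.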

~~~
sigil
> simple "system calls" like memmove

Minor, pet-peevy point: memmove(3) and memcpy(3) are not system calls.
(<http://en.wikipedia.org/wiki/System_call>)

~~~
JoeAltmaier
Right; runtime library.

Often they are not even that; they appear to be, but compilers can replace
them bodily with super-optimized code so they get better benchmarks.

