What can you do in 2k LOC of C? (h4ck3r.net)
156 points by s3graham on Feb 3, 2011 | 109 comments



  $ git show e83c5163316f89bfbde7d9ab23ca2e25604af290 --stat
  commit e83c5163316f89bfbde7d9ab23ca2e25604af290
  Author: Linus Torvalds <torvalds@ppc970.osdl.org>
  Date:   Thu Apr 7 15:13:13 2005 -0700

    Initial revision of "git", the information manager from hell

   Makefile       |   40 +++++++++
   README         |  168 ++++++++++++++++++++++++++++++++++++
   cache.h        |   93 ++++++++++++++++++++
   cat-file.c     |   23 +++++
   commit-tree.c  |  172 +++++++++++++++++++++++++++++++++++++
   init-db.c      |   51 +++++++++++
   read-cache.c   |  259 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   read-tree.c    |   43 +++++++++
   show-diff.c    |   81 ++++++++++++++++++
   update-cache.c |  248 +++++++++++++++++++++++++++++++++++++++++++++++++++++
   write-tree.c   |   66 ++++++++++++++
   11 files changed, 1244 insertions(+), 0 deletions(-)
:-)


Ah, that's a good one, and worth study at that revision with so few lines. Thanks!


To be fair, that's 1244 _lines_, not bytes :)


So is what the OP is talking about :o (lines vs. bytes)


If you think about it for a second, you realize that TrueType rasterization can't be that hard because printers were doing it long ago on crappy little embedded processors, but the default is just to fall back on the big ugly library, and then wrap it and pretend it's not there. How about instead, just write some good code?

This is why 'modern' software can still manage to bring a 3GHz quad-core to its knees, IMHO


This is also why we have many more softwares available now.

The improvements in hardware specs are so useful not only because the hardware is faster, but also because we can do much more without having to care about the details. Writing tiny, efficient libraries is cool, but it takes a lot of time, everything else being equal, so if you can afford not to do it, you don't.


> This is also why we have many more softwares available now.

Assuming English isn't your first language, that would be phrased "much more software"; software is always singular.


More precisely, it's a mass noun (as opposed to a count noun).

Count noun: a program, some programs

Mass noun: some software. (you can't say "a software" or "some softwares")


http://www.ar.media.kyoto-u.ac.jp/members/david/

"David Cournapeau

a French PhD student in signal processing at Kyoto University"

France and Japan...he hasn't exactly been living in places where English is prominently spoken with grammatical accuracy...cut him some slack. In fact, he made the exact same error of pluralizing "software" as "softwares" on his website.


> cut him some slack

You act as if I'm giving him a hard time; I'm not. I've generally found multi-lingual people want to be corrected when they do it wrong because they find it helpful; I was doing him a favor.


Thanks for the correction (but I guess the correction still stands even if English were my first language).


I personally suspect the explosion of storage space is the biggest factor in all this. I was just writing software for a uC with 2kb program space last week, and I started to bump into the limit. Optimizing my code for size actually resulted in much better code, because I had to stop and think "How can I do this... smarter?"

(Though if you're trying to fit 8kb into 2kb and have to start doing voodoo, it's true the quality will suffer)


Current software (compiled code) would usually fit perfectly in the 20Mb hard drives I was using 20 years ago. The extra space is taken by various data files. I don't think I ever cared about compiled size with PC software, even when working with 360kb floppies.


Roberto Ierusalimschy's lpeg is around 2.4 kloc of ANSI C without any dependency besides libc and lua.h (needed to interface with Lua, since it's a Lua library).

It implements an efficient pattern-matching system based on Parsing Expression Grammars (akin to CFGs, but without ambiguities). It consists of a pattern/grammar-to-bytecode compiler and a custom VM to interpret the result of the compilation phase.

Nice and clean.

http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html

http://en.wikipedia.org/wiki/Parsing_expression_grammar
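
For a feel of what that looks like in practice, here's a rough sketch of driving lpeg from a C host through the Lua API (the pattern and program are made-up examples, not from lpeg's docs; assumes Lua and lpeg are installed):

    #include <stdio.h>
    #include <lua.h>
    #include <lualib.h>
    #include <lauxlib.h>

    int main(void) {
        lua_State *L = luaL_newstate();
        luaL_openlibs(L);
        /* lpeg itself is a Lua library, so the patterns live in Lua code: */
        const char *script =
            "local lpeg = require 'lpeg'\n"
            "local num  = lpeg.R('09')^1 / tonumber\n"     /* one-or-more digits, captured as a number */
            "local list = lpeg.Ct(num * (',' * num)^0)\n"  /* comma-separated numbers into a table */
            "local t = list:match('10,20,30')\n"
            "return t[1] + t[2] + t[3]\n";
        if (luaL_dostring(L, script) != 0) {
            fprintf(stderr, "%s\n", lua_tostring(L, -1));
            return 1;
        }
        printf("%g\n", lua_tonumber(L, -1));               /* prints 60 */
        lua_close(L);
        return 0;
    }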


The Lua code itself (while not <2 kloc) is also clean, well designed C, and definitely worth study. It gets quite a bit of functionality out of 12.6 kloc, and if you work with C, Lua is an ally (it's like Tcl, but good). :)

Also, in my experience, LPEG is a great fit for middle-ground parsing - tasks more complex than Perlish regexps (which it also handles well), and some a bit more complex still - but it's awkward for really complex parsing (since it inherently combines lexing and parsing).

But I dissed Perl and Tcl, so -1. Screw useful information. ;)


> but it's awkward for really complex parsing (since it inherently combines lexing and parsing).

I'm kind of surprised by that statement - grammars for programming languages expressed in PEG always seem much cleaner to me than the lex/parse separation. Can you give an example of a grammar that you find is complicated by PEG?


I initially wrote the parser for a (proprietary) query language compiler using LPEG, and keeping track of lexical issues (whitespace, etc.) along with the grammatical structure complicated things. When I broke it apart and wrote a quick FSM-based lexer and a recursive descent parser, it became much simpler. Perhaps I could have factored the PEG grammar better, but with every PEG grammar I've had grow beyond a certain point, I've wished I could separate the lexing and parsing stages.

I much prefer working with LPEG to regular expressions for small to midsized stuff, though.


Shocked that silentbicycle hasn't mentioned it already, but Arthur Whitney whipped up the first prototype/inspiration for the J language in a short bit of macro-heavy C over the course of an afternoon.

42 lines?

http://pastebin.com/s2usuqDq

If this interests you at all, absolutely worth reading Roger Hui's retrospective on the subject (more about J + Ken Iverson, but definitely fascinating) http://keiapl.org/rhui/


That's interesting, but damn is that some ugly code. Slightly obfuscated on purpose? Though of course it wouldn't win the IOCCC.

On the other hand, it's notable that many IOCCC submissions happen to pack a lot of functionality in often less than 2k. I remember reading a few descriptions of some of the winning entries, but I can't find that now. Here's a glimpse though: http://cboard.cprogramming.com/brief-history-cprogramming-co...


Having spent a lot of time meditating on it (as sofuture mentioned), it isn't "slightly obfuscated" so much as "stubbornly written like APL rather than C".

If you become inexplicably fascinated by that code and want help unraveling it, my contact info is in my profile.

I don't consider it noteworthy as a useful program under 2kloc (on modern hardware it usually just crashes, it's quite cavalier with pointer casting, and clearly a quick prototype either way), but it's like a pink space laser beam of insight about the APL mindset. Real APLs take more than a page, but not that much more. Eliding loops does that.

A lot of the IOCCC code is also delightfully perverse, too. Highly recommended.


Yeah, I realized that fact a little late. Relevant snippet from the link about that initial bit of code for J:

I showed this fragment to others in the hope of interesting someone competent in both C and APL to take up the work, and soon recruited Roger Hui, who was attracted in part by the unusual style of C programming used by Arthur, a style that made heavy use of preprocessing facilities to permit writing further C in a distinctly APL style.


Arthur's C is stubbornly unconventional, but often, he's got a point.

Also, the APL community seems to be pretty disjoint from the rest of CS (though Arthur is also a Lisper). I think it'd be mutually beneficial if the APLers and the MLers got together, in particular.


"That's interesting, but damn is that some ugly code"

I would disagree. That code is beautiful. Elegance and brevity; it's as close to a pure expression of the coder's thoughts as it's reasonably possible to get.


Excuse me, but in my programming thought flow, there are checksums and limits, not c's and l's, and my code reflects that. Do I not qualify as a coder?


Quick, what does the i stand for in "for (i=0; i<n; i++) {"?

It was a quick prototype written by one APLer, for another APLer, using mutually understood conventions. For example, its dyadic verbs ("V2" functions) use the variable names 'a' and 'w'. For a while, I wondered why (maybe they stood for "Arthur" and "Whitney"?), but after reading Iverson's _A Programming Language_, I realized APL usually uses lowercase alpha (α) and omega (ω), which look like 'a' and 'w' in ASCII. Details like that would need no explanation.

Likewise, in

    typedef struct a{I t,r,d[3],p[2];}*A;
t is type (is it boxed? 0 or 1), r is rank (number of dimensions), d is the dimensions (hardcoded to a max of 3; a 2x3x4 matrix has d={2,3,4}), and p is a variable-length array sized by malloc-ing extra space; that's where the values go, either longs or void pointers. That'll save you some time if you ever try to pick the rest apart. :)
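
Spelled out with ordinary formatting and those field meanings as comments (my annotation, not the original source; IIRC the prototype typedefs I to long):

    typedef long I;                  /* per the prototype's earlier typedefs          */
    typedef struct a {
        I t;                         /* type: 0 = plain integers, 1 = boxed           */
        I r;                         /* rank: number of dimensions                    */
        I d[3];                      /* dimensions, max 3; a 2x3x4 array has {2,3,4}  */
        I p[2];                      /* payload: over-allocated via malloc, holds the
                                        values (longs) or boxed pointers              */
    } *A;                            /* an "A" is always a pointer to one of these    */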

APL code is often terse to the point of seeming cryptic, just as math papers are. You get used to it, and it's that way for a reason.


Wow, thanks for that clue. Will have to take a closer look at that source now. I don't feel too bad for not really getting it, given the creator of J saying this about it:

My immediate reaction on seeing the page was recoil and puzzlement: it looked nothing like any C code I had ever seen. (“Is it even C?”) However, Ken counselled that I should reserve judgment. As recounted in An Implementation of J, it then happened that:

I studied this interpreter for about a week for its organization and programming style; and on Sunday, August 27, 1989, at about four o’clock in the afternoon, wrote the first line of code that became the implementation described in this document.

The name “J” was chosen a few minutes later, when it became necessary to save the interpreter source file for the first time.


If it intrigues you, check out Henry Rich's _J for C Programmers_. There's a free PDF on Lulu, and it's included with J (http://www.jsoftware.com/).

Like Emacs, J's internal documentation is excellent, but you have to learn its idiosyncratic terminology before you can actually find things.


I obviously can't speak to Arthur Whitney's state of mind circa some summer afternoon 1990, but I don't believe the code is an attempt to obfuscate or be snobby/elitist at all. He was simply trying to express the basics of an idea in a terse form. It's the APL way.

I, for one, am happy simply finding the code 'of interest'. It's not great code, it's the initial sketch of an idea (I'll bet you AW would admit that readily (plus Roger Hui refers to it as "Incunabulum", which seems a bit tongue in cheek)) -- but, it's certainly not something you see every day, and it is thought provoking.


Never mind the comment about it being slightly obfuscated. Have to realize that this is the creator of J we're talking about. He probably saw it as highly expressive code (much meaning in as few characters as possible). Though I can't agree, it is definitely amazing that something so small could be the start of something like J.


It was actually written by Arthur Whitney, the author of K (not J), though Roger Hui (of J) studied it beforehand.

I do recommend studying it, though. Their APL dialects are really something; I wish there were good fully open-source implementations, and writing one is on my queue. Kona (http://github.com/kevinlawler/kona), an open-source implementation of K 3.2, is also coming along.

Just be warned that APLer C is usually nasty, brutish, and short. :)

FWIW, the best thing I've seen for getting the J mindset is _J for C Programmers_ - as Henry Rich says, to do J you must "think big" - don't think "for each part of this list, do this, then this, then this...", think, "apply this operation over these.". That scales up to multiple dimensions, and can often be run in parallel. :)


On a related note, reading J's getting started guide has been pretty interesting so far: http://www.jsoftware.com/help/primer/contents.htm


> http://pastebin.com/s2usuqDq

I'm going to save this for the next time someone asks why there aren't more females in CS/programming. I just tell them that this kind of code is held in high esteem, instead of being ridiculed for the stupidity that it is.


Are you saying that women can't appreciate elegance and brevity? Perhaps your kind of attitude is why there aren't more females in programming.


Or perhaps it's people who mistake obfuscation for elegance and brevity.


I'd love to hear why that code doesn't embody brevity?

As for obfuscation, it's rather straightforward -- it's just concise. A lack of instant understanding doesn't indicate that a piece of code is without merit, or not honest and straightforward.


To be clear. If someone submitted this code to you for code review you'd say, "nice and concise -- approved"? Really?

I get this person may have been working with constraints we don't know about, but it's not straightforward. Give this to 10 working C devs and ask them to tell you what it does without running it.

The standard cues of straightforward code aren't there -- well named variables, well named functions, comments, indentation, etc...


If we were working in an APL dialect, I would expect code like that. I can tell you what every token in that code is for, by the way. If we were working in C, no. But if you're trying to get C programmers to grok APL style, it's a good start.

The APL people are not fooling around, either - look at Kx's customers: http://kx.com/Customers/end-user-customers.php . And then check how much a license for kdb+ costs per core. Oh, and "a standard purchase requires licensing at least 8 cores."


Honestly, I think you're superhuman, then. For the most part I don't think ANY programming language is understandable without well named "tokens".

I'm curious, do you think you could tell me what some arbitrary C code does that I give you, where I've changed all the variables to "a", "b", "c", etc...? I'd be hardpressed to believe anyone in the world could.

Is APL structurally so expressive that humans can easily extract meaning from structure? If so this is the single greatest accomplishment in the history of computer science. This dwarfs any contribution from Lisp, C, FORTRAN, theory of computation, etc...

And the fact that I know some of the people that worked on the original versions of APL at IBM, yet they never mentioned this, makes me suspect.


> I'm curious, do you think you could tell me what some arbitrary C code does that I give you, where I've changed all the variables to "a", "b", "c", etc...? I'd be hardpressed to believe anyone in the world could.

If you know the specific context ahead of time (a FSM for lexing, a hash table implementation, etc), then it really isn't that hard.* Are you honestly telling me that you wouldn't be able to figure out how a hash table implementation worked if all the types and variable names were changed to single letters? It would slow you down, sure, but it wouldn't be impossible.

* Speaking as someone who has debugged code with variable names and comments in Swedish.

That code was written with the context of, "Hey, we both know APL really well, here's a quick prototype for a new implementation of it, could you give it a look?" When somebody is showing a Scheme implementation to other Lispers, they don't need to spend time explaining what a cdr is. Context matters.

Most of the functions in that take one or two arbitrarily-dimensioned arrays and loop simple operations over them. Everything takes (w) or (a, w) and returns z. i is a loop index (as usual). w->p is the array of values in w, ga allocates a new array. ("get array"? "generic allocate?")

   V1(iota){I n=*w->p;A z=ga(0,1,&n);DO(n,z->p[i]=i);R z;}
iota (the term comes from APL; J uses "i.", K uses "!", Q uses "til") takes a number and returns a vector of integers from 0 to n. iota 5 -> 0 1 2 3 4. Most of the definitions are equally simple, there just isn't much whitespace. DO is a macro that abstracts out the common "for (i=0; i<n; i++) { ... }" loop, which is used all over the place.
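
Expanded by hand, that one-liner is roughly the following (my reading of the V1/DO/R macros from the pastebin; the exact macro bodies may differ slightly):

    A iota(w) A w;                    /* V1(f): old K&R-style definition, one array in, one array out */
    {
        I n = *w->p;                  /* the argument's first value is n                      */
        A z = ga(0, 1, &n);           /* allocate an unboxed (t=0), rank-1 array of length n  */
        { I i; for (i = 0; i < n; i++) z->p[i] = i; }   /* DO(n, z->p[i]=i)                   */
        return z;                     /* R is just return                                     */
    }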


> Are you honestly telling me that you wouldn't be able to figure out how a hash table implementation worked if all the types and variable names were changed to single letters?

Yes, probably not. I've looked at code that I've personally written -- with variable names -- in the debugger, and still scratched my head trying to figure out what I had done. And not trying to overstate my ability, but I tend to be someone who is generally pretty good at reading code.

Here's some code you might run across in some DSP code. There are some globals, and hopefully I translated it correctly. But in any case, with good function/variable naming I could understand the intent in a few seconds. Without it, I think I'd scratch my head for some time. How long does it take you to figure out what it does (or what it is intended to do)?

   ov f(){
   xx b, hh, i, pp, j, jj = PL, k;
   xx D = NN>>2*b,d=n-2*zz;
   t *z[B]; t *p, m, n, q, oo;
   for (i=0; i<B; i++) z[i] = &X[ro[i]*jj];
   for (b=0; b<D; b++) { f(b, hh, d);
   for (i=0; i<B; i++) { pp=ro[i];
   k = (b << zz) + i;
   p = &YY[(hh<<zz) + (pp<<(n-zz))];
   for (j=0; j<B; j+=SS) { m = z[j][k];n=z[j+1][k];q = z[j+2][k];oo = z[j+3][k];
   p[j]=m;p[j+1]=n;p[j+2]= q;p[j+3]=oo;}}}}


I think we're arguing past each other. The J prototype took me part of a morning to figure out (and get working on more modern hardware), but I wasn't familiar with APL conventions at the time. Code like that is slower to read, just not impossible.

About your code sample - I haven't done anything with digital signal processing code, so I couldn't tell you. It looks like some sort of wave transformation, but that's like saying code from a 3D rendering engine is "doing something with triangles", sorry.

Most of the APL functions are only a line or two long, though - that makes a big difference. "accumulate i..N", "get new vector with a[i] + w[i]", etc. The only part that is individually complex is the parser.


I upvoted you, but as mentioned earlier in this thread by silentbicycle, this was written by one APLer to be read by another. I suppose even knowing APL makes it still difficult to follow, but only as difficult as APL itself is to follow, right? :)


And how is that at all related to female numbers in CS/programming?


The core of my protobuf-decoding library upb (https://github.com/haberman/upb/wiki) is 3k SLOC of C. That includes

  * a hash table implementation
  * a reference-counted string type
  * a generic interface for doing tree traversals of protobuf data
  * the protobuf decoder itself (which implements the previous interface)
  * all the code to load proto descriptors (including bootstrapping the first one,
    which is necessary to load others)
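
For flavor, a reference-counted string type along those lines can be sketched in a couple dozen lines of C (a toy version just to illustrate the scale, not the actual upb type):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t refcount;
        size_t len;
        char   data[];            /* string bytes follow the header in one allocation */
    } rcstr;

    rcstr *rcstr_new(const char *s, size_t len) {
        rcstr *r = malloc(sizeof *r + len + 1);
        if (!r) return NULL;
        r->refcount = 1;
        r->len = len;
        memcpy(r->data, s, len);
        r->data[len] = '\0';
        return r;
    }

    rcstr *rcstr_ref(rcstr *r) { r->refcount++; return r; }

    void rcstr_unref(rcstr *r) {
        if (r && --r->refcount == 0) free(r);
    }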


It really says something about C that so many useful-but-small systems have a good chunk of code devoted to hash table implementations, atoms ("a reference-counted string type"), etc.

When working with C, it sometimes makes sense to have bespoke data structures, but it always makes me think of Hanson's _C Interfaces and Implementations_, Greenspun's tenth rule, etc.

My 1500loc C project also has a hash table implementation (and, soon, dynamic arrays (gasp!)). That and serializing custom data structures to disk account for more than half of its code. It's fast as hell, but I originally prototyped it in like 100 lines of Lua.


> It really says something about C that so many useful-but-small systems have a good chunk of code devoted to hash table implementations, atoms ("a reference-counted string type"), etc.

Well, the only reason for that is that C never had a hash table in its standard library. The same is true of any other programming language I've seen that doesn't have one.

Python's and Lua's dicts are really hard to top (at least in the general case, and definitely inside the language), so there aren't even any attempts. Java's is slow and horrible, so there are hundreds of replacements, but they're all interface compatible.

> That and serializing custom data structures to disk accounts for more than half of its code.

That's one of the things K gets unbelievably right - there is one routine (which with Arthur's style is probably 10 lines of C) for serialize, and another for deserialize. They work for any K structure, and they do that with blinding speed thanks to being totally memory-mapped. My C code has switched to that approach as well - and it works really, really well.
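
A minimal sketch of that approach in C (made-up names and struct, just to illustrate): keep the structure pointer-free and contiguous, so serializing is a single write() and deserializing is a single mmap().

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    typedef struct { int64_t rank, dims[3], data[]; } Arr;   /* offsets inside, never pointers */

    static int save(const char *path, const Arr *a, size_t nbytes) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        ssize_t n = write(fd, a, nbytes);            /* the entire "serializer" */
        close(fd);
        return n == (ssize_t)nbytes ? 0 : -1;
    }

    static const Arr *load(const char *path, size_t *nbytes) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0) return NULL;
        if (fstat(fd, &st) < 0) { close(fd); return NULL; }
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                                   /* the mapping survives the close */
        if (p == MAP_FAILED) return NULL;
        *nbytes = (size_t)st.st_size;
        return (const Arr *)p;                       /* the entire "deserializer" */
    }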


My text indexer is using a mmap'd hash table too, and that's exactly why. :)


Sadly, I've written many hash-table implementations that are only ever used once, in a function. The hash table is just a local array, and the structures I work with generally already have a next pointer available for simplistic chaining.

Minimal hashtable implementation: an array declaration, a memset call, and a template search loop which either uses pointers or locations (pointers to 'next' pointers), depending on whether you need to look up the item, or potentially remove it. Roughly 2+n lines for n hash operations.
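
In sketch form, with a hypothetical node type (the "locations" variant for removal is omitted):

    #include <string.h>

    struct node { unsigned key; struct node *next; /* ...payload... */ };

    #define NBUCKETS 256                         /* power of two so & can replace % */

    void index_and_find(struct node *items, int n, unsigned wanted) {
        struct node *table[NBUCKETS];            /* the array declaration */
        memset(table, 0, sizeof table);          /* the memset call       */
        for (int i = 0; i < n; i++) {            /* insert: push onto the bucket's chain,
                                                    reusing the node's own next pointer */
            unsigned b = items[i].key & (NBUCKETS - 1);
            items[i].next = table[b];
            table[b] = &items[i];
        }
        for (struct node *p = table[wanted & (NBUCKETS - 1)]; p; p = p->next)
            if (p->key == wanted)
                break;                           /* found; do something with p */
    }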


> It really says something about C that so many useful-but-small systems have a good chunk of code devoted to hash table implementations

For this speed-obsessed project, I wouldn't have it any other way:

  Benchmarking hash lookups in an integer-keyed hash table.

  Table size: 8, keys: 1-8 ====
  upb_inttable(seq): 410 M/s
  upb_inttable(rand): 334 M/s
  std::map<int32_t, int32_t>(seq): 149 M/s
  std::map<int32_t, int32_t>(rand): 170 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 69.9 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 68.0 M/s

  Table size: 64, keys: 1-64 ====
  upb_inttable(seq): 410 M/s
  upb_inttable(rand): 333 M/s
  std::map<int32_t, int32_t>(seq): 32.4 M/s
  std::map<int32_t, int32_t>(rand): 33.4 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 67.9 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 71.4 M/s

  Table size: 512, keys: 1-512 ====
  upb_inttable(seq): 407 M/s
  upb_inttable(rand): 330 M/s
  std::map<int32_t, int32_t>(seq): 20.8 M/s
  std::map<int32_t, int32_t>(rand): 17.2 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 70.8 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 64.6 M/s

  Table size: 64, keys: 1-32 and 10133-10164 ====
  upb_inttable(seq): 394 M/s
  upb_inttable(rand): 334 M/s
  std::map<int32_t, int32_t>(seq): 32.0 M/s
  std::map<int32_t, int32_t>(rand): 30.7 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(seq): 71.3 M/s
  __gnu_cxx::hash_map<uint32_t, uint32_t>(rand): 70.6 M/s


Summarizing your results, your hash table implementation is between five and ten times faster than STL and G++ generic maps, because G++, like most C++ compilers, imposes a substantial abstraction penalty on the use of templates — even though in theory that's not necessary. Is that right?


Yes, 5-10x faster. As to the reason, I can only speculate. I doubt it's template-imposed abstraction penalty per se, though it could be a result of having to write the algorithm in a generic way. In other words, if you did a sort of "manual instantiation" of the hash_map template, where you made the types specific but didn't change a thing besides that, I think you'd get the same performance as the templated algorithm.

I can't say for sure the reason for the difference. Previously I thought that hash_map might have been wasting time calculating a hash of the integer key (whereas I just use the key as the hash), but I just added another test that uses hash_map with an identity hash function and the performance was unchanged.

I bet that hash_map is using external chaining, whereas I'm using internal chaining which has better caching behavior. Besides that it's hard to say for sure why mine is so much faster.
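
For anyone unfamiliar with the distinction: external chaining follows a pointer to a separately malloc'd node on every collision, while internal chaining keeps the entries and the chain links in one flat array, so a lookup tends to stay within a few cache lines. A rough sketch of the internal-chaining lookup (illustrative layout, not upb's actual one; empty-bucket handling omitted):

    #include <stdint.h>

    typedef struct {
        uint32_t key, value;
        int32_t  next;              /* index of the next entry in this chain, or -1 */
    } entry;

    typedef struct {
        entry   *e;                 /* one contiguous allocation           */
        uint32_t mask;              /* table size - 1, size a power of two */
    } inttable;

    static int lookup(const inttable *t, uint32_t key, uint32_t *value) {
        for (int32_t i = (int32_t)(key & t->mask); i >= 0; i = t->e[i].next) {
            if (t->e[i].key == key) { *value = t->e[i].value; return 1; }
        }
        return 0;
    }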


Could be. I used internal chaining also in the program I mentioned in the other comment. And probably a power-of-2 size and another tweak or two -- I don't remember details.


See OP for hash_map. And std::map is dog slow because somebody thought a red-black tree is a good idea for a map.

So at least part of the penalty there is simply choosing the wrong data structure.

Also, IIRC, both hash_map and std::map don't support a reserve call. Which means you pay for quite a few allocations.


> And std::map is dog slow because somebody thought a Red-Black-Tree is a good idea for a map.

What's wrong with red-black trees for an ordered map?

> IIRC, both hash_map and std::map don't support a reserve call. Which means you pay for quite a few allocations.

This is absolutely an issue for insert performance, but I was only benchmarking lookup performance, which should not perform any allocations.


> What's wrong with red-black trees for an ordered map?

I was wondering that too, except without the "n ordered" part. The claim that trees are competitive with hash tables in the average case rests on the claim that comparison is much faster than hashing, because you only have to hash once in the average case, whereas you have to compare something like 1.5 lg N times in the average case, which might be between 4 and 22 in common cases. I just ran this microbenchmark to compare hashing integers with comparing them, and hashing seems to be actually slightly faster than comparing on my CPU; this implies that hash tables should be dramatically faster than red-black trees.

https://gist.github.com/814746
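
The comparison boils down to timing two loops roughly like these (a sketch, not the literal gist; the hash here is an arbitrary multiplicative one):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static uint32_t hash32(uint32_t x) {       /* arbitrary multiplicative hash */
        x *= 2654435761u;                      /* Knuth's golden-ratio constant */
        return x ^ (x >> 16);
    }

    int main(void) {
        enum { N = 100000000 };
        volatile uint32_t sink = 0;            /* keeps the loops from being optimized away */
        clock_t t0 = clock();
        for (uint32_t i = 0; i < N; i++) sink += hash32(i);
        clock_t t1 = clock();
        for (uint32_t i = 0; i < N; i++) sink += (i < 12345u);   /* one integer comparison */
        clock_t t2 = clock();
        printf("hash: %.2fs  compare: %.2fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }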

In theory, this forces an unpleasant dilemma on code that aspires to be real-time but wants to do a lookup in a finite map: use hash tables with their awful worst-case performance (typically O(N), although you can improve this by hanging trees off your hash buckets, but that would be stupid), or use trees with their awful average-case performance?

In practice, I've never written real-time code, so I don't know if this dilemma is real.


It won't perform allocations, but since the structure has been built from multiple disjoint allocations, you're likely to get worse spatial coherence. (I.e. more cache misses)

And yes, for an ordered map a red-black tree is not a bad choice. "I object to std::map being ordered" would probably have been the better wording.


In my last C++ program, 5 years ago, I did find it worthwhile to code a custom hashtable. Was a bit surprised. This was g++, probably an older version even then.


I'm working on a text indexing/retrieval program, like locate (http://www.openbsd.org/cgi-bin/man.cgi?query=locate) but for content and not just filenames, and with an index <=2% the size of the indexed data.

It's very nearly together (integrating individually working parts now), and is currently ~1,500 lines (according to sloccount).

Adding support for indexing Unicode text, more configuration, composite search queries (A and B near C and not D), etc. will no doubt make the source expand a bit, but it's still pretty small.

If you're interested in trying it out once it's ready, contact info is in my profile. I'm shooting for within a week or two for a beta vulgaris. (Requires Unix. ANSI C, strung together with sh and/or awk to avoid dependencies.)


Does the index include the dictionary too, in your calculation? I'd be interested in seeing this indexing/retrieval program of yours, I hope you release it soon!


Yes.


I wrote EasyEXIF: http://code.google.com/p/easyexif/

About 120 lines of C++ [see 1] that parses basic EXIF information out of a JPEG image. I found all the other EXIF parsing tools and libraries a little too heavyweight for something as simple as getting the date and time a picture was taken, or the f/stop or exposure time. It only uses string.h for memcpy and memset, and no other headers.

[1] http://code.google.com/p/easyexif/source/browse/trunk/exif.c... (note: Google's source view screws up my whitespace)
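
The first step - finding the EXIF APP1 segment inside the JPEG - looks roughly like this (a simplified sketch of the general approach, not the actual exif.cpp; minimal error handling):

    #include <stddef.h>
    #include <string.h>

    /* Returns the offset of the TIFF header inside the EXIF payload, or 0 if absent. */
    static size_t find_exif(const unsigned char *buf, size_t len) {
        if (len < 4 || buf[0] != 0xFF || buf[1] != 0xD8) return 0;     /* not a JPEG        */
        size_t off = 2;
        while (off + 4 <= len && buf[off] == 0xFF) {
            unsigned char marker = buf[off + 1];
            size_t seglen = (size_t)buf[off + 2] << 8 | buf[off + 3];  /* big-endian length */
            if (marker == 0xE1 && off + 10 <= len &&
                memcmp(buf + off + 4, "Exif\0\0", 6) == 0)
                return off + 10;        /* TIFF header follows: "II"/"MM", then IFD0 with the tags */
            if (marker == 0xDA) break;  /* start of scan: no EXIF past the image data */
            off += 2 + seglen;
        }
        return 0;
    }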


I was tempted to show off some toys I feel fatherly pride for, like my 500-line spreadsheet, but the only C program in this size range that I still use much is http://wry.me/~darius/software/req.tar.gz -- a rewrite-rule-based programmable calculator. Since it's >20 years old it's not at all what I'd write now.

(Toy spreadsheet at https://github.com/darius/vicissicalc)


Getting a standards-compliant XML parser into 2K lines is going to be a challenge, if you're not going to cheat on what a "line" is. You must be able to deal with both UTF-8 and UTF-16 [1] (and remember UTF-16 can be in either endian order), you have several tables of things like what chars are valid where, you've got data structures to declare, and there's a lot of edge cases that may not leap to mind but if you don't cover them you don't really have an XML parser, like CDATA handling, entity loading from a DTD, parsing DTDs at least well enough to get those entities, processing instructions, etc. A useful subset, certainly; something I'd actually call a real XML parser, I'm skeptical. Not quite ready to write the idea off, but skeptical. (It's saved from me writing it off entirely because if I read the spec correctly a parser is not required to resolve external DTD references; if it were, it would be stick-a-fork-in-it done, you'd eat hundreds of lines just using raw sockets to do HTTP requests and manage them even halfway properly.)

A JSON parser? Heck yes, even with the UTF-8 handling. It wouldn't even necessarily suck.

[1]: http://www.w3.org/TR/REC-xml/#charsets
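
Just the encoding-detection step already looks something like this (a sketch along the lines of the autodetection appendix in the XML spec; BOM and "<?" sniffing only):

    #include <stddef.h>

    enum xml_enc { ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UNKNOWN };

    static enum xml_enc sniff_encoding(const unsigned char *p, size_t n) {
        if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return ENC_UTF8;     /* UTF-8 BOM */
        if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return ENC_UTF16LE;                  /* BOM FF FE */
        if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return ENC_UTF16BE;                  /* BOM FE FF */
        if (n >= 4 && p[0] == '<' && p[1] == 0 && p[2] == '?' && p[3] == 0) return ENC_UTF16LE;  /* "<?" little-endian, no BOM */
        if (n >= 4 && p[0] == 0 && p[1] == '<' && p[2] == 0 && p[3] == '?') return ENC_UTF16BE;  /* "<?" big-endian, no BOM    */
        if (n >= 1 && p[0] == '<') return ENC_UTF8;    /* default: UTF-8 (also covers plain ASCII) */
        return ENC_UNKNOWN;
    }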


A JSON parser at ~2K lines: https://github.com/johnezang/JSONKit I'm the author, so I'm obviously biased :). It has strict Unicode standard UTF-8 parsing ("passes" http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt) and has a whiz-bang LRU cache with a clever aging replacement policy that saves lots of time by reusing already seen and instantiated immutable objects (think JSON dictionaries: the keys in "key": value pairs tend to repeat an awful lot). A further benefit is this saves lots of memory too. For converting JSON to 'native' ObjC/Foundation objects, it's faster than the other ObjC JSON libraries by 2-10x (HEAVILY dependent on the JSON being parsed, typical is 2-4x), and serializing ObjC -> JSON is (usually) even faster.

So, in short, it's possible to write a very high performance JSON serializer and deserializer in ObjC in ~2K lines of code. For JSONKit, most of it is actually pure C (which in turn uses Core Foundation, which is a pure C API for the equivalent native ObjC objects), with public API ObjC stub bindings making use of the C code.


> Getting a standards-compliant XML parser into 2K lines is going to be a challenge

This is a pretty good argument against XML.


Yes, that point may have had some influence on the way I phrased my post.


I would say it's an argument for using JSON in situations where you need stuff to be small, fast, and simple, and XML in situations where you need it (where you want to do queries easily over the document structure, which XPath is fairly decent at, interop with systems that insist on XML, and things that are more document-ish rather than data-ish).


I recently wrote a Turtle[1] parser and abbreviating serialiser[2] in just over 2K lines, which I'd say is roughly equivalent in complexity to doing the same for JSON. This is with UTF-8 support, full conformance, URI parsing/resolution, line/column error reporting, etc. A more kludgy job could be quite a bit smaller still...

[1]: http://www.w3.org/TeamSubmission/turtle/ [2]: http://drobilla.net/software/serd/


Sounds like a challenge (to someone ;) A good XML test suite to confirm "fullness" must exist?


A JSON parser (of sorts) in roughly 150 lines of C code. I'm not the author.

https://github.com/quartzjer/js0n


How about micro_httpd, an inetd-style HTTP/1.0 server in 200 lines of C. (http://acme.com/software/micro_httpd/)
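
That size is plausible because inetd does the socket work for you - the accepted connection arrives as stdin/stdout, so the core is just read-a-request, write-a-response. A toy sketch of the same style (not micro_httpd's code, which also handles a document root, directory listings, MIME types, etc.):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[1024];
        if (!fgets(line, sizeof line, stdin)) return 1;     /* e.g. "GET /foo HTTP/1.0" */
        char method[16], path[512];
        if (sscanf(line, "%15s %511s", method, path) != 2 || strcmp(method, "GET") != 0) {
            printf("HTTP/1.0 501 Not Implemented\r\n\r\n");
            return 0;
        }
        /* A real version would open `path` under a document root and stream it;
           here we just answer with a fixed body. */
        const char *body = "hello\n";
        printf("HTTP/1.0 200 OK\r\n"
               "Content-Type: text/plain\r\n"
               "Content-Length: %zu\r\n\r\n%s",
               strlen(body), body);
        return 0;
    }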


In 2000 lines of C, you can encode the HTML5 "named character reference" table.

http://dev.w3.org/html5/spec/Overview.html#named-character-r...
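
Most of those lines end up being one giant table plus a lookup, something like this (a made-up five-entry subset; the real table has well over 2000 names):

    #include <stdlib.h>
    #include <string.h>

    struct entity { const char *name; unsigned cp; };

    static const struct entity entities[] = {   /* sorted by name for bsearch */
        { "AElig",  0x00C6 },
        { "amp",    0x0026 },
        { "copy",   0x00A9 },
        { "mdash",  0x2014 },
        { "nbsp",   0x00A0 },
    };

    static int cmp(const void *k, const void *e) {
        return strcmp((const char *)k, ((const struct entity *)e)->name);
    }

    unsigned lookup_entity(const char *name) {   /* returns 0 if not found */
        const struct entity *e = bsearch(name, entities,
            sizeof entities / sizeof entities[0], sizeof entities[0], cmp);
        return e ? e->cp : 0;
    }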

I hate HTML.


O.. M.. F.. G.. I had no idea they did this. I mean, what is the point? Talk about "not invented here" mentality: "I've got a fantastic idea! Unicode already has an official name for every Unicode character, so let's throw all of that out and come up with our own names that have no relation to what those Unicode guys did! And while we're at it, HTML4, HTML5? Version numbers are for pu$$7$s, so lets drop that too, and then randomly throw in a few hundred extra character entities every two to eighteen months for a fresh 'What's new' bullet point!"


It's for backwards compatibility, like most of the HTML5 spec. If browsers suddenly start becoming HTML5-compliant and as a result most of the web stops working, then they've failed.

I agree with the reasoning. But damn, it makes things suck going forwards. This is why we can't have nice things. :-(


I wrote a cooperative multitasking RTOS for an ATMega processor in 1200 or so lines.


If you're willing to share the code, I'm sure many of us would love to see it.


Here you go!

https://bitbucket.org/pnathan/uirtos

Turned out it was a preemptive RTOS. I had forgotten that.


I wrote a preemptive multitasking one for an AVR in about the same amount of code :-)


A JSON / XML / YAML parser?

The fastest JSON parser that I know of is implemented in ~1800 lines of Objective-C: https://github.com/johnezang/JSONKit. It's about 2-3x faster than the yajl bindings for Objective-C are, in my testing.


Fabrice Bellard's IOCCC entry of 2002 was a C-subset compiler in 617 wc-lines (ELF version).

http://bellard.org/otcc/


It's an amazing piece of work: Those 617 lines are in its own C subset, so it can compile itself into a native code ELF version.

This evolved into Bellard's "tcc" compiler, which is the fastest (compile-time-wise) available on Linux, and the only one I'm aware of that can be used as a library to run code in memory (without having to generate an .so and load that). And its output performance, while lacking compared to -O3 gcc or clang, is not too shabby!
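
The library use looks roughly like this (API names per libtcc.h; the exact tcc_relocate() signature has changed between releases, so treat this as a sketch):

    #include <stdio.h>
    #include <libtcc.h>

    int main(void) {
        TCCState *s = tcc_new();
        tcc_set_output_type(s, TCC_OUTPUT_MEMORY);           /* compile straight into memory */
        tcc_compile_string(s, "int add(int a, int b) { return a + b; }");
        tcc_relocate(s);                                     /* some releases take a second argument here */
        int (*add)(int, int) = (int (*)(int, int))tcc_get_symbol(s, "add");
        printf("%d\n", add ? add(2, 3) : -1);                /* prints 5 */
        tcc_delete(s);
        return 0;
    }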


The next question is:

How many of these 2K sections can you string together to make something of further use? (using another 2k section, of course)


How about an X11 tiling window manager? http://dwm.suckless.org/


Write a basic Lisp interpreter/compiler and write everything else in Lisp? :)

Would compiling Lisp code count against the challenge's LOC limit?


I did once write a 2k-line bytecode interpreter and runtime for R4RS Scheme -- the compiler, library, and debugger were another 2-3kloc of Scheme. I even used it as my main hacking platform for years. But I didn't post it here because:

* It depends on the Boehm garbage collector.

* (This is embarrassing.) It stopped working after some version bump of GCC, and the cause is not at all obvious to me. Since I don't Scheme much anymore I gave up.

http://wry.me/~darius/software/uts.tar.gz if anyone cares.


No blame, but I'm curious "why you don't Scheme much anymore". Was it due to Python, Common Lisp, etc?

I switched from (Chicken) Scheme to Common Lisp (and, in parallel, from Python to Lua), and, having read a lot of old code of yours, I'm curious about your choices - you seem consistently sensible and pragmatic. (Sorry to put you on the spot.)


Thanks! Partly Python's gotten more tolerable as a language, partly I'm doing more things needing libraries, partly I mostly code inside https://github.com/darius/halp these days and it doesn't have a Scheme mode so far. I do have a couple of recent Scheme projects up on github though -- optilamb and selfcentered.

Very impressed with LuaJIT2, btw -- I'd like to do more with it.


Halp is fun. Thanks. Installed it already.


Glad you liked it!


Back in The Day (tm) I wrote a full Gnutella client in about 1600 lines of code.


Anyone know how big nginx is? I wouldn't at all be surprised if it's under 2k LOC.


You're probably joking, but sloccount says:

    ansic:        82714 (99.68%)
    perl:           124 (0.15%)
    sh:              78 (0.09%)
    asm:             48 (0.06%)
    cpp:             17 (0.02%)
(for nginx-0.8.53)


Why have I gotten 13 upvotes for this? Serious question.


It's a show of thanks for taking the trouble to give a solid answer backed by facts. Just as bad memes are publicly swatted down, good work (even small) should be praised to set an example for others.


If actual facts are that rare in software engineering, then that's one hell of a "code smell". It really wasn't hard to get.


Facts are pretty common in (good) software engineering, but they're rare in Internet discussion. Hence, solid facts in Internet discussion tend to rack up points.


I upvoted due to sloccount. I didn't know such a tool existed.


Umm..13 people found it insightful/interesting? I don't know..


I can think of plenty of comments that were /actually/ insightful, though. This feels like, "oh, wow, you know how to use science! +1". sad.


It's not sad at all. HN is, for me at least, primarily a place for learning. I scan comments for people speaking of facts and first-hand experience, or links to similar things.

You provided some hard data, and some people who appreciated it pressed the up arrow. It's nothing more than that, and certainly not sad.


Fair enough. I wish it was less rare though.


I don't think the upvotes were intended as a positive judgment of you as a person, but rather as a positive judgment of the value of your comment. A sort of thank-you for doing the legwork so that the rest of us can simply glance at your comment instead of downloading nginx source and running sloccount ourselves.


As of 0.9.4, the core module is almost 21,000 lines of code and the entire nginx source tree is about 118,000 lines.


How much memory does the code the blog poster described leak? How does it respond to edge cases and invalid input? Finally, is it portable beyond one specific OS? Beyond POSIX or Windows-based systems?

Those questions are especially pertinent in C.


The only dependencies are the C standard libraries (this is even mentioned in the blog post). So unless there's some terrible hackery going on (and skimming the source files, it doesn't look like there is), these are going to be more portable than most of the mega-libs you'd be choosing instead.

What makes you so incredulous about this?


Mine or stb_*? I know stb_image is quite careful not to leak. It may not handle all perverse input files gracefully though.

As for twok, I believe the error handling to be decent and I don't believe it leaks. I'd be happy to hear of any errors you find. It's only portable to Windows, OS X, and Linux because that's all I have access to. It may work on BSD also.


Right! Coming up with standalone C means coming up with your own versions of hard-to-write simple "system calls" like memmove.

I debugged a memcpy (taken from Linux! 10 years ago) on a RISC processor - it had 12 bugs in a dozen lines of code. Not designed to run on RISC, but it shows how hard it can be to get this stuff right.
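
Even the overlap handling alone is a classic source of bugs - here's a byte-at-a-time sketch of it (real implementations add word-sized copies and alignment fix-ups, which is where things usually go wrong):

    #include <stddef.h>

    void *my_memmove(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (d < s) {
            while (n--) *d++ = *s++;       /* copy forwards when dst is below src              */
        } else {
            d += n; s += n;
            while (n--) *--d = *--s;       /* copy backwards when regions overlap the other way */
        }
        return dst;
    }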


> simple "system calls" like memove

Minor, pet-peevy point: memmove(3) and memcpy(3) are not system calls. (http://en.wikipedia.org/wiki/System_call)


Right; runtime library.

Often they are not even that; they appear to be but compilers can replace them bodily with super-optimized code so they get better benchmarks.



