Ask HN: What are the “best” codebases that you've encountered?
392 points by 0mbre on July 29, 2019 | 269 comments
I am rather fond of the concepts described in "Clean Code" by Robert Martin but it seems that in real life, a really high-quality codebase is hard to come by.

While I am asking myself this question, the only one that pops into my mind would be Laravel: https://github.com/laravel/laravel (PHP)

One could think that a codebase as popular as React (https://github.com/facebook/react) would be a perfect example of "clean code" but at a glance, I personally don't find it very expressive.

This may all be very subjective but I would love to see examples of codebases that members of this community have enjoyed working with.

Perhaps I'm jaded, but I notice that all the examples given here are developer tools or otherwise things with well scoped functional inputs and outputs (e.g. ffmpeg).

Anyone have an example of a consumer application that has a good codebase? Chromium, GitLab, OpenOffice, etc? I feel like such applications inherently have more spaghetti because the human problems they're aiming to solve are less concretely scoped. Even something as simple as "Take the data from this form and send it to the project manager" ends up being insanely complex and nitpicky. In what format should the data be sent? How do we know who the project manager is? Via what format should the data be sent? How should we notify the project manager? When should we send the report? Some of these decisions are inherently inelegant, so I feel like you get inelegant code.

NetHack's source code is absolutely beautiful. It's probably the most well-kept, well-designed, well-structured and well-implemented K&R C program in history. Although they are switching to ANSI syntax soon. It's even more amazing because NetHack is (1) a video game, where generally speaking code quality is sacrificed for efficiency, and (2) it has been developed by a constantly changing team of volunteers over the course of over thirty years, and there still isn't any spaghetti.

Nethack suffers somewhat because the era in which it was birthed required a lot of portability, so there is a ton of #ifdef-ery which can be difficult to reason about.

#ifdefs are themselves spaghetti; windows.c [0a] and files.c [0b] have tons of platform-related ifdefs (but, mercifully, not too much nesting), hacklib.c [1] has some deep-ish ifdef nesting (though, again mercifully, well commented)

Ok on re-reading some of this, I guess it was worse in my mind than it actually is. Nethack was the first large codebase I ever made any changes to and tried to understand, so maybe the relative enormity at the time made a negative impression.

Check out the source for Brogue [2] for what I consider to be pretty readable game code.

[0a] https://github.com/NetHack/NetHack/blob/NetHack-3.6/src/wind...

[0b] https://github.com/NetHack/NetHack/blob/NetHack-3.6/src/file...

[1] https://github.com/NetHack/NetHack/blob/NetHack-3.6/src/hack...

[2] https://sites.google.com/site/broguegame/
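To make the #ifdef point concrete, here is a minimal sketch of the usual mitigation: keep the platform check inside one small function, comment the #endif, and let callers stay free of conditionals. Nothing here is taken from NetHack; `path_separator`, `join_path`, and the use of `_WIN32` as the test macro are assumptions for illustration only.

```c
#include <stdio.h>

/* Hypothetical helper: the only place that knows about the platform. */
static const char *path_separator(void)
{
#ifdef _WIN32
    return "\\";
#else
    return "/";
#endif /* _WIN32 */
}

/*
 * Hypothetical caller: contains no #ifdefs at all.
 * Returns whatever snprintf returns (the length it wanted to write).
 */
static int join_path(char *out, size_t size, const char *dir, const char *file)
{
    return snprintf(out, size, "%s%s%s", dir, path_separator(), file);
}
```

The conditional still exists, but a reader of `join_path` never has to reason about it.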

I have to agree with that. I once decided to hack in the "Drunken Master" after the Jackie Chan film. If you are a martial arts master, then every booze potion you drink increases your attack and your defence (though you still walk around mostly randomly). If you drink too many (I forget how many it was), then you fall asleep. I also added a fortune cookie that said, "A boat floats in water, but sinks in it too". I think it took me a couple of hours to go from never having seen the code base before to completing the hack. Really lovely stuff. Unfortunately, I don't have the code any more, but if anyone wanted to add it, it's easy ;-)

I bet they didn't have sprints and OKRs like increasing the number of active users by 50% in one quarter. Nethack is a product of love for rogues and hobby software development.

It seems like that's how this question is always answered. I'd also be interested to see some good application code.

Also curious about some good, not too hugely sized, game code (preferably something not written in C/C++, maybe like an indie game from the past decade or so). Anyone know something?

Keldon Jones’ Race for the Galaxy AI is brilliant. Every line is documented C. It trains and deploys a neural net. It has a fast and functional GUI. Most diehard RftG players have cut their teeth by losing to Keldon a few dozen times before really learning the game.

Keldon was contracted a few years later to develop the RftG apps on iOS and Android, which are easily worth $4.

Source: https://github.com/bnordli/rftg

Precompiled binaries: http://keldon.net/rftg/

> Every line is documented C

Every line? Oh dear [1]:

    /* Set verbose flag */
    /* Set advanced flag */
    advanced = 1;
    /* Set expansion level */
    expansion = atoi(argv[++i]);
    /* Initialize game */
I assume the verbose flag is incremented because there's multiple levels of verbosity, whereas the advanced flag is a boolean, and i is incremented because expansion is not a command line flag (rather, an option). But I wouldn't know that from the comments! Especially useful to know if like, verbosity actually has multiple levels, like -v and -v -v or -v -v -v and so on ... instead of knowing that init_game is ... init'ing the game.

Compare all those comments to this [2], which is actually a spectacularly useful comment.

Signal to noise ratio!

[1] https://github.com/bnordli/rftg/blob/master/src/learner.c

[2] https://github.com/bnordli/rftg/blob/master/src/replay.c#L94

Yikes, I've gotta agree:

    /* Exit */
Well, yes, I see that. It might have been helpful to extend that to "Exit with an error status", perhaps, but as-is it tells you literally nothing that the code doesn't.

IMHO coding guides should ban "SBO comments": Stating the Bleedin' Obvious. Let devs decide what SBO is; usually it's bleedin' obvious.

Actually, this line:

    /* Set expansion level */
    expansion = atoi(argv[++i]);
Is an example of ugly code. They're doing multiple things in one line, and in a confusing order.

# Increment i by 1.

# Grab the command line argument in position i

# Convert from string to integer.

# Assign to the variable expansion.

Note that there's no error checking to handle strings that don't convert to ints.

Arguably, any serious C developer can read this on the fly, but I seriously object to code with side-effects like the above ++i. It's often confusing to read and change.

I'm not criticizing the original author, more nit-picking on this as an example of great code.
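For comparison, a minimal sketch of what a checked conversion could look like, using `strtol` instead of `atoi`. The helper name `parse_int` and its return convention are made up for illustration; this is not code from the rftg source.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the rftg source): parse a base-10 int.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * string has no digits, has trailing characters, or does not fit in an int.
 */
static int parse_int(const char *text, int *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);

    if (end == text || *end != '\0')
        return -1;              /* no digits, or trailing characters */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return -1;              /* out of range for int */

    *out = (int) value;
    return 0;
}
```

With something like this, the four steps listed above become separate statements: advance `i`, fetch `argv[i]`, call `parse_int(argv[i], &expansion)`, and report an error if it returns non-zero.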

I don't have a problem with that code. It is idiomatic C code, like *p++ or flag &= ~mask. ++i is a bit less common than i++ but nothing unusual. The lack of error checking is more concerning though. It can be justified though, and it brings me to the next point.

The comment is the worst part. It is useless, it literally repeats the most obvious part of the code. A good comment would be something like "there is no error checking because arguments are already validated in the GUI, also 0 is an acceptable default".

For me, comments should not repeat what the code is doing. Comments are for expressing what is in the mind of the developer that cannot be expressed in code. Examples: "constant found using empirical testing", "Hermite spline formula" or "required on Windows". (Edit: the Hermite spline example is actually a bad one, a better idea would be to write a function called hermite_spline_formula instead)

If some coding rule requires you to comment, find something to say about your code that isn't your code. It will also help you think more about what you just wrote.

> The lack of error checking is more concerning though. It can be justified though...

How so, in this case? It could lead to undefined behavior.

> The comment is the worst part.

Useless comments may be irritating, but, as a bad practice, they are not comparable to skipping checks for possible errors and creating the possibility for undefined behavior.

It is not undefined behavior. atoi() returns 0 if no valid conversion could be found. Assuming argv[++i] is not out of bounds of course.

Useless comments are bad in part because they are sometimes not updated as the code changes, and no code analyzer will catch that. There is a reason some people say that comments are lies.

Most common example of what I've seen:

  // set foo to x
  foo = x;
  // set foo to x
  bar = x;

> Assuming argv[++i] is not out of bounds of course.

Well, you cannot assume that - look at the code (there are links to it) - there are several different ways in which it will go out of bounds.

What's more, even if it were the case that the worst that could happen is that the variable gets set to zero and the program terminates gracefully without doing anything, that would not amount to a justification for not checking for an error, it would merely be an observation that, fortuitously, this particular bug is relatively harmless, perhaps doing nothing worse than leaving the user unnecessarily puzzled over what went wrong.

Even the example you give here (which is not actually to be found in the code in question, and so could not possibly be a candidate for the worst thing about it) is relatively harmless compared to potentially causing undefined behavior, or (what is sometimes at least as bad) defined behavior that apparently works but which gives subtly wrong answers. And, by the way, code can lie too: identifiers can make claims about the semantics of a variable or function that are not always true. The only way to avoid that is to use meaningless identifiers.

The more you try to brush this off and argue that comments are worse than actual errors, the more obvious it becomes that your priorities are askew here.

Looking at the source, yeah, you are totally right. I was just looking at the lines out of context.

I was just thinking of cases where atoi(argv[++i]) was justified. Turns out I was overthinking it, it is just terrible code. No feedback on error, crash when the last argument is missing a parameter, static array with no bounds checking, a potentially leaky strdup. Comments are the least of the problems. It is a code smell though. A well justified one in that case.
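The "last argument is missing a parameter" crash in particular is cheap to rule out. A minimal sketch, assuming a made-up `next_arg` helper (the name and its contract are hypothetical, not something from the rftg source):

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the argument that follows argv[*i] and
 * advance *i, or return NULL (leaving *i alone) if argv[*i] is the last one.
 */
static const char *next_arg(int argc, char **argv, int *i)
{
    if (*i + 1 >= argc)
        return NULL;            /* option is missing its parameter */
    *i += 1;
    return argv[*i];
}
```

A caller would check the result for NULL and print a usage message before attempting any conversion, instead of indexing past the end of argv.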

I have to admit that until seeing this example, I thought that code commented this way did not actually exist, and that the possibility of it was just a straw man invented for the sake of argument!

It seems to follow the notion that more comments = better code. If the symbol names are explanatory enough, the code itself will be self-explanatory. Then all that must be commented about will be special cases handled/not handled and why a certain decision is taken about handling in a certain way.

Often, "every single line is commented" results from blindly following a certain "Standard" that demands it, which eventually generates meaningless comments all over the code. Then the important points never get mentioned in the comments or just get buried in the noise.

I almost feel like game code is a counter example. Unless you're making a massively multiplayer game, the code is throwaway because once you ship development effectively stops (or is limited for bugfixes and dlc). Would be interesting to see what the LoL or Warcraft codebase looks like though.

For C, how about ReactOS?

Redis would probably be better.

Any chance you could elaborate? How/why?

Redis is widely used in production environments. The code quality matters.

> Redis is widely used in production environments. The code quality matters.

That's how you judge source code quality? By measuring its popularity? And ignoring the actual code?

That's not what I said.

I actually find Chromium to be a lot more approachable than I assumed it would be. Have only really checked out the layout / graphics part of Blink, but it’s laid out pretty intuitively.

But yeah, Postgres also gets my vote. I guess there’s a bit of a bias there because devs are likely to read the code of the tools they use; either to track down bugs or just to understand how it works.

Yeah. I was amazed to find out it was designed so flexibly to inject C++ backed JS objects (eg write a request function which added requests to a queue after certain lengths based on multi-tenant fairness). Clients couldn’t circumvent because system code wasn’t necessarily in JS user space.

The latest version of NetNewsWire is a nice example of a modern MacOS application codebase.


It’s just a demo app, but I find the ClojureScript re-frame implementation of the real world app to be the cleanest of the bunch: https://github.com/gothinkster/realworld/blob/master/README....

I think everything from Hashicorp is absolutely top-notch, working on their stuff taught me to be a much better Ruby and Go programmer.

I'm honestly surprised to see Hashicorp mentioned here, not because I've read their code, but because I had so many hours of tear-my-hair-out misery using their products.

I found them so under-tested and buggy that finally I just gave up. They're now blacklisted in my mind -- I'll never use one of their products again.

I guess the lesson here is that good code doesn't always have a strong relationship to a usable or bug-free product.

I think you should reconsider, I use their products daily (terraform) and have absolutely no issues?

To do what, though? All of their products are available elsewhere, but better integrated into our stack and IDE.

Have you by any chance also used their Vault[0]? I'm looking at solutions for secrets management and theirs is mentioned often.

[0] https://www.vaultproject.io/

I haven't and am not sure who the target audience is exactly.

Azure and AWS have secrets management, and most stacks have some open-source way to do it if you're the roll-your-own-server type.

Yeah, cloud option isn't available for the product we're building (very specific needs with regards to data and services geo location).

But to me it looks like the target audience is anyone operating a Kubernetes cluster as it seems to be the most mature option for managing secrets that is also well integrated into Kubernetes. And their open source version already provides quite a few features.

At least that's the vibe I got from reading various comparisons and blog posts about secrets management.

Discourse is a wonderful code base to peruse. When I was just starting out building more complex web apps I repeatedly referred to it to learn common patterns such as how queues are used to offload tasks.

Just to drive the point home, I was developing in Python and with no knowledge of Ruby, I was able to go through code using just github and I got what I wanted every single time.

It shouldn't be a surprise that developers are more likely to read and understand the source code of development tools.

sqlite is the usual example of elegant code in a domain that often breeds inelegant code.


For a C codebase, Postgres[1] wins for me hands down. It's clean and suuuuuuper well commented, such that with a little context you can dive into something very complex and still get a feel for what is going on.

[1]: https://github.com/postgres/postgres

Second this; Postgres codebase is what got me out of the "good code is self documenting" nonsense. For those of us in the database space it is an incredible resource - and overall a great example of good code.

sqlite is much less complex, but similarly approachable.

In more recent examples, I think you see a lot of this same reader-centric pragmatic ethos in many Go projects. The Kubernetes codebase comes to mind as a very large tome that remains approachable. And the Go stdlib, of course.

Java generally falls on the opposite side, but there are counterexamples. A lot of Martin Thompson's code eschews Java "best practices" in favor of good code. Seeing competent people in the Java space "break the rules" helps.. though of course Java is forever hampered by having internalized illegible patterns as best practices in the first place.

It's a shame because at least the OpenJDK implementation of the standard library in Java is generally quite good, especially around the concurrency parts. Clean, easy to follow, reasonable comments. But of course that's Java written by C developers, mostly.

> Postgres codebase is what got me out of the "good code is self documenting" nonsense.

I'm a fervent believer in "good code is self-documenting", so I was curious to be proven wrong, clicked randomly until I found code and I saw this.

    /*
     * Round off to MAX_TIMESTAMP_PRECISION decimal places.
     * Note: this is also used for rounding off intervals.
     */
    #define TS_PREC_INV 1000000.0
    #define TSROUND(j) (rint(((double) (j)) * TS_PREC_INV) / TS_PREC_INV)
Usage of acronyms is one of the worst offenders in bad code. The context makes it clear that TS means timestamp, so that's not too bad (still bad though), but I'm still not sure what INV means, luckily I presume it's the only place it's used.

If it was named TIMESTAMP_ROUND, I wouldn't need to know "Round off to MAX_TIMESTAMP_PRECISION decimal places." Now that I've copy/pasted that, it seems like the comment is wrong too, it's rounded off based on TS_PREC_INV, so if I was to believe the comment, I wouldn't get the right behaviour.

I'm not saying Postgres codebase isn't good code, just that "good code is self-documenting" is still true. That code was pretty much self-documenting except for the acronyms, but considering it was all used together, it was fine and I was able to understand what they meant.

For me, comments should only be needed when something isn't clear. Defining what isn't clear is hard to determine for sure, but that's one thing for which code review helps quite a bit.

I mostly agree with this. Though as I wrote that sentence I realized Go has somewhat softened my position on abbreviations. I think the "note" portion is useful; ultimately a test would stop you breaking that secondary use, but the comment stops you spending time in that direction in the first place. But either way, overall I think you're right this would be fine without a comment.

I'm thinking more of examples like this: https://github.com/postgres/postgres/blob/master/src/backend...

I just picked this at random from the storage subsystem, but I think it highlights what I mean. The comments are mostly about context. The comment for the routine is about when and who calls it, so that someone that reads the routine has that in mind. The specific line I'm linking to highlights in English prose that the correctness checking on the page headers is just a minimum guard and should not be fully trusted.

Back in the day I would have argued, "oh, well, but you could break that into a function called provisional_page_header_check(..), or something." But.. there is nothing in the compiler that checks that function names stay in sync with their implementation any more than there is for comments. Writing it as a comment lets you use regular English sentences; breaking out a function takes that away and adds no compiler protection.

It's also.. friendly, somehow, to me. Working in this codebase is like participating in an ongoing and very slow conversation, which feels very pleasant.

That comment is exactly what I mean by when needed.

That does confirm that they are making pretty amazing code. I would have much preferred to get that file instead of the other one :P.

They do have redundant comments in some places, but those are clearly a tiny minority and they aren't costing anyone time.

Yep, I think we're in agreement :)

It's like in any sizable codebase with quite some history. There's substantial difference in quality between parts. Some of the worst parts go back to the early days of postgres - the priorities and resources available back then were just very different than today. Obviously there's also noticeable differences in more recent code, but I don't think to the same degree (although there've been definitely subsystems that worked out better and some that worked out worse).

E.g. the code above is essentially (although somewhat mechanically renamed and moved since), from:

   commit 41f1f5b76ad8e177a2b19116cbf41384f93f3851
   Author: Thomas G. Lockhart <lockhart@fourpalms.org>
   Date:   2000-02-16 17:26:26 +0000

       Implement "date/time grand unification".

> Usage of acronyms is one of the worst offenders in bad code.

Not really on board with that... It's a balance. Brevity does have its value too. Everybody is going to understand that TS stands for timestamp, that WAL stands for write ahead log, etc. Especially when dealing with a language that doesn't have namespaces etc, you're going to have to realistically deal with prefixes a good bit. There's plenty of bad abbreviations in postgres code, however, don't get me wrong.

In the above I'm more bothered by the inconsistent naming, which I think is probably one of postgres' bigger code quality issues.

> For me, comments should only be needed when something isn't clear.

I pretty strongly disagree. Most of the time comments shouldn't explicitly restate all that code is doing, sure (although there's clearly exceptions to that too). But e.g. stating why an algorithm is doing something, what the higher level goals of some checks are, why some shortcut is reasonable all make a code base a lot more maintainable in the medium to long run.

I work a lot on postgres, and occasionally dabble around the corners of linux. For me it's *the* defining difference making it much more painful to understand most linux subsystems.

Edit: formatting, typo

IMO good code needs to be readable at the call site much more than the function definition. That’s why I believe in the Google C/Go style rule that all mutable parameters must be passed by pointer. A call site with &arg communicates mutability quickly. Also, in my framework designs I consider the impact on code complete heavily:

E.g. in the above sample “TS” is commented to mean timestamp but that will be lost during scans of code complete options. Also, MAX_TIMESTAMP_PRECISION may not show up in code-complete for timestamp macros/consts, but TIMESTAMP_MAX_PRECISION will.
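A small sketch of the call-site argument, in plain C. The `stamp` type and both function names are hypothetical, chosen only to show how the two calling conventions read side by side:

```c
/* Hypothetical type for illustration only. */
typedef struct {
    long seconds;
    long micros;    /* may temporarily exceed 999999 */
} stamp;

/* Read-only input: passed by value, so the caller's copy cannot change. */
static long stamp_total_micros(stamp s)
{
    return s.seconds * 1000000L + s.micros;
}

/* Mutable parameter: passed by pointer, so the call site shows '&'. */
static void stamp_normalize(stamp *s)
{
    s->seconds += s->micros / 1000000L;
    s->micros %= 1000000L;
}
```

At the call site, `stamp_normalize(&t)` announces that `t` may change, while `stamp_total_micros(t)` cannot touch it, and a reader sees that without opening either definition.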

>If it was named TIMESTAMP_ROUND, I wouldn't need to know "Round off to MAX_TIMESTAMP_PRECISION decimal places."

TSROUND is already as obvious as TIMESTAMP_ROUND. TS is a very common abbreviation of timestamp.

And you would still need to know the decimal places.

The real issue is that it's based on TS_PREC_INV and not MAX_TIMESTAMP_PRECISION as per the comment (though MAX_TIMESTAMP_PRECISION might still agree with the decimals offered by TS_PREC_INV, it's not obvious here, and would need manual work to keep them in sync).

> TS is a very common abbreviation of timestamp.

Common != universal. It's known up until someone doesn't know it. We have pretty powerful autocompletes, let's use them instead, or just lose 3 seconds writing the 10 letters, it won't be so bad.

> And you would still need to know the decimal places.

Sure, by reading the code and understanding what it does and how it does it. You change a constant that will affect that code, it seems fine to see how it's affected either way.

> The real issue is that it's based on TS_PREC_INV and not MAX_TIMESTAMP_PRECISION as per the comment

Which is bound to happen when your documentation isn't the code directly.

> > TS is a very common abbreviation of timestamp.

> Common != universal. It's known up until someone doesn't know it. We have pretty powerful autocompletes, let's use them instead, or just lose 3 seconds writing the 10 letters, it won't be so bad.

The cost of that compounds though. There are plenty of times where there is not just one abbreviation in a symbol name, but multiples. And pretty soon the "logical" lines long enough to contain multiple references to such symbols get considerably slower to read (be it due to long lines, or being broken up into multiple lines).

I have an extremely hard time believing that the widespread use of ts, xact, wal, ... is a significant factor in how quickly somebody can get started with the postgres code base.

> > The real issue is that it's based on TS_PREC_INV and not MAX_TIMESTAMP_PRECISION as per the comment

> Which is bound to happen when your documentation isn't the code directly.

Hm? Those aren't out of sync? TS_PREC_INV is the relevant factor/divisor to round to MAX_TIMESTAMP_PRECISION here. It'd be nicer if that were explicit in the code by defining TS_PREC_INV based on MAX_TIMESTAMP_PRECISION, sure, but they're in sync. It's just not that trivial to state in C. But they're in sync. Note also that TS_PREC_INV really is just an implementation detail for TSROUND(), it's not used elsewhere. These days we'd just write this in a static inline function, in all likelihood.

>Common != universal. It's known up until someone doesn't know it.

Someone might also not know what a timestamp is, or a UNIX timestamp at least, so there's that.

>We have pretty powerful autocompletes, let's use them instead, or just lose 3 seconds writing the 10 letters, it won't be so bad.

The problem with the above idea is that it implies "spelling it out fully == better". Which is not necessarily the case, long variable names can make code hard to follow and verbose. Ask the Java community...

> For me, comments should only be needed when something isn't clear. Defining what isn't clear is hard to determine for sure, but that's one thing for which code review helps quite a bit.

Totally agree. Comments should be limited to sections in which the code is unexpected. For example, for a workaround for a bug in another part of the system. That you should comment because if the bug is ever fixed someone reading the code will understand why it looks wonky.

I don't agree with acronyms. They are fine to use as long as you are consistent. For example, if you write "ts" in one place, you can't write "timestamp", "tstamp", "tstmp" for the same data in some other code. In my code, I always use "n" for "length". Since I'm consistent about it, and never use "n" for anything else, it doesn't make the code harder to read.

> If it was named TIMESTAMP_ROUND, I wouldn't need to know "Round off to MAX_TIMESTAMP_PRECISION decimal places."

With only the name, how would you know how many decimal places? The comment isn't wrong/out of date here btw.

> With only the name, how would you know how many decimal places

I'm not saying the name is wrong because of the comment, I'm saying it's wrong because of the usage of the acronym.

> The comment isn't wrong/out of date here btw.

Isn't wrong? How? It's true in the sense that they expect TS_PREC_INV to be related to MAX_TIMESTAMP_PRECISION (which would be a perfect example to my mind of a needed comment, if actually it was a requirement), but it's actually false in the sense that it's not what that code does.

You wouldn't get a different rounding if you were to modify MAX_TIMESTAMP_PRECISION, which is what you would expect based on that comment.

> I'm not saying the name is wrong because of the comment, I'm saying it's wrong because of the usage of the acronym.

I get that. But you also said the comment would be unnecessary with a name change. The comment does communicate more information than your proposed name, IMO, hence is not replaceable by the name. (IMO).

> You wouldn't get a different rounding if you were to modify MAX_TIMESTAMP_PRECISION, which is what you would expect based on that comment.

Good point. The comment isn't wrong with the definition of MAX_TIMESTAMP_PRECISION as it is in the code. If you override it though, the code doesn't do what the comment says.

It's an interesting case: if you trust the comment indicates the desired behavior, then you can see the code may have room to be improved. If you distrust that the comment is correct or has value, then you might just remove the comment, and the code doesn't get better.

Are there guidelines for when abbreviations are ok in Python code? I tend to avoid them except popular abbreviations like admin and ts.

Possibly INterVal?


My guess was gonna be "invariant", but that makes more sense.

>the "good code is self documenting" nonsense

Man I hate dogma like that. My "common sense" comment style is always, "code tells me how, comments tell me why." The only exception is in hand optimized code where I'll non-doc comment what the reference implementation would be above the optimized version, which is _sometimes_ necessary when tests aren't in the same translation unit.
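A small sketch of that exception: the straightforward reference version lives in a plain comment directly above the optimized one. The function name `count_set_bits` and the choice of bit counting as the example are mine, purely for illustration.

```c
#include <stdint.h>

/*
 * Reference implementation (kept as a plain comment, one bit per step):
 *
 *     unsigned count = 0;
 *     for (; x != 0; x >>= 1)
 *         count += x & 1u;
 *     return count;
 *
 * The version below clears the lowest set bit on each iteration, so it
 * loops once per set bit rather than once per bit position.
 */
static unsigned count_set_bits(uint32_t x)
{
    unsigned count = 0;

    while (x != 0) {
        x &= x - 1;             /* drop the lowest set bit */
        count++;
    }
    return count;
}
```

The code says how; the comment says what the trick is equivalent to, which is the part a reader cannot recover as quickly from the loop alone.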

Comments tell me why if I happen to be asking the question the comment is answering. You can't answer every "why" question in a comment. If you take "good code is self documenting" as dogma and say, "therefore I won't write comments", then probably you deserve what you get in the end ;-) But good code reduces the number of questions I might try to ask. That way I can get down to one "why" or possibly even no "why"s.

For example, I can write "a - 1" or I can write "-1 + a". If I write the second form people are going to wonder why I did it that way. Is there some reason why "a - 1" wouldn't work? Perhaps I don't have a - operator for a for some reason. But if I have no reason to write it the second way, then I should avoid doing it because then I don't have to write a comment saying, "I did it this way for no reason".

That's a simple and contrived example, but it holds true for larger code as well. There are ways of doing things that follow the programming culture most of us share. We shouldn't comment everything we do -- only the things that are unusual. Otherwise we'll be lost in comments. I'd rather read code to understand what's going on than English if I can help it.

So, the lack of need to write comments can indicate a good code base. However, like many things, it's a poor metric. The lack of comments does not indicate a good code base ;-) Similarly, some things are just complex, or unusual even when you have simplified it as much as humanly possible. Generally speaking, though, if you have a choice between code that needs a comment and code that doesn't need a comment, choose the latter.

Totally agree. It is impossible to show why one tradeoff was chosen over another without comments. Is there a loop unrolled for performance reasons? Is there a design goal that looks complex on the inside, but yields a simple, easy to understand API on the outside? I feel like "good code is self documenting" is a pitfall that devs always fall into believing on their journey to becoming a "Senior", and then somewhere along the way run into their old "good" code and can understand what it does perfectly well enough, but can't recall the circumstances that led them to arrive at the choices they made in writing that code.

    /* don't even try to load if not enabled */
    if (!jit_enabled)
        return false;

I still think comments like these are super redundant and annoying.

It tells you the intent. The actual code does exactly that, because the intent matches the code, but sometimes the two aren't aligned and therefore the "redundant" comment can help you determine if the bug is the code or the intent.

Understanding the "Why?" of code is often the most valuable thing about comments. The code remains forever, but the "Why?" is often lost to time.

Except that it does not. The 'why' would be explaining why the jit needs to be enabled to be able to load. As it stands it explains the 'what' and therefore is a bad comment.

A point well made. I concede it could have been better written.

I still believe "redundant" comments can be of value though. This just may not be the best example.

> It tells you the intent.

A good unit test that tries to load with jit_enable=false and expects false in return will communicate intent even better.
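A minimal sketch of such a test, using plain `assert`. Only `jit_enabled` comes from the quoted snippet; `provider_init`, `load_attempts`, and the test name are hypothetical stand-ins, not the real Postgres functions.

```c
#include <assert.h>
#include <stdbool.h>

static bool jit_enabled = true;     /* stand-in for the real setting */
static int  load_attempts = 0;      /* hypothetical counter, for the test */

/* Hypothetical stand-in for the function containing the quoted check. */
static bool provider_init(void)
{
    if (!jit_enabled)
        return false;
    load_attempts++;                /* pretend to load the provider */
    return true;
}

/* The test states the intent: disabled means "do not even try to load". */
static void test_disabled_jit_is_never_loaded(void)
{
    jit_enabled = false;
    load_attempts = 0;
    assert(!provider_init());
    assert(load_attempts == 0);
}
```

Note that the second assertion is what captures "don't even try"; checking only the return value would pass even if the provider were loaded and then discarded.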

Comments are usually less prone to logic bugs. When fixing issues, if the code behaves opposite to what the comment says, it can be a very helpful clue.

An argument can be made that this is what Test Driven Development is here to solve. State intent in the test, not the comment.

That said, I agree it can be a helpful contextual clue.

Heh, that's one of mine. The point isn't so much to restate the if itself, it's to explain that we're explicitly testing before loading the JIT provider - which happens immediately below.

Which in turn is because we a) do not want a hard dependency on any JIT provider, in particular we don't want a hard dependency on LLVM b) LLVM is a really slow to load dependency. Those bits are expanded more upon in the README in the same directory.

Edit: expand.
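Folded into the comment itself, that reasoning might read something like the sketch below. The names load_provider_library(), provider_loaded and provider_load_calls are hypothetical, and the comment paraphrases the explanation above rather than quoting the actual source.

```c
#include <stdbool.h>

static bool jit_enabled = false;      /* name from the quoted snippet */
static bool provider_loaded = false;  /* hypothetical */
static int provider_load_calls = 0;   /* hypothetical: counts expensive loads */

/* Hypothetical stand-in for loading the provider's shared library. */
static bool load_provider_library(void) {
    provider_load_calls++;
    return true;
}

static bool provider_init(void) {
    /*
     * Check the setting before touching the provider at all: the provider
     * is an optional dependency (no hard dependency on LLVM), and loading
     * it is slow, so a disabled JIT must never pay that cost.
     */
    if (!jit_enabled)
        return false;

    /* Load at most once; later calls reuse the result. */
    if (!provider_loaded)
        provider_loaded = load_provider_library();
    return provider_loaded;
}
```

Here the comment answers "why is this check first?", which the if statement alone cannot.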

Agreed, it sounds like trying to convince people that imperative code is easy reading. But OK, for some people it may be.

I think that beauty of OpenJDK concurrency mostly seems to come down to Doug Lea being a genius of concurrency.

The best java is C.

Back when I maintained postgresql on cygwin I did it because of its clean codebase. But eventually I got stuck when trying to fix a build system bug when creating importlibs. On all other build systems it was easy to fix the bug, but not so with postgresql. I eventually gave up after years, and I think it's still broken.

A special mingw tool to create importlibs is/was broken on 64bit. I think it was called dlltool. Normally you'll just need to add a flag to the linker to create that.

So no, postgresql not.

Anything by burntsushi, but especially xsv and ripgrep:

xsv: https://github.com/BurntSushi/xsv

ripgrep: https://github.com/BurntSushi/ripgrep

His code typically has extensive tests, helpful comments, and logical structure. It was fun trying to imitate his style when writing a PR for xsv.

The Quake 2 engine was also pretty interesting: It was almost totally undocumented, and it had plenty of weird things going on. But I could count on the weird things being there for a reason, if only I thought about it long enough.

Seconding burntsushi. I learnt the basic Rust idioms by doing the advent of code last year and comparing my solution to his.

Note: this assumes you speak Rust...

Don't all the comments assume you speak the language the codebase is written in? I mean I can't tell if a given PHP/C/Perl/other language I don't know well codebase is good or not. I guess there could be people that can somehow do that...

Yes, but other comments actually mention the language, and the original post was using web languages as an example, so I thought I'd save someone the trouble of clicking through and realizing it's not what they expected.

NetBSD, hands down. Beautiful, simple to understand, consistent. The documentation is also top notch -- I wrote a trivial character-device kernel driver using only the man pages as a reference. And you can too.

Also -- the source code to Doom. Read it, marvel at its clarity and efficiency -- and then laugh when you realize that the recent console ports were completely rewritten in fucking Unity. And the Switch version chugs, despite the original running well on 486-class hardware.

> and then laugh when you realize that the recent console ports were completely rewritten in fucking Unity. And the Switch version chugs, despite the original running well on 486-class hardware.

I wonder why they didn't just write an emulator, then. Especially on the Switch if there are performance issues.

Unity sucks but as a hobbyist game developer it is a godsend. The only other way I can support all the platforms I want to (Android, web, PC) is through Web APIs directly, and nobody likes more Electron.

SDL be like "am I a joke to you?"

SDL is fine for cross platform development. Personally, I have used it for some commercial games in the past, but I think it's not comparable to Unity or Unreal Engine. SDL is not a game engine, it provides only lower level features (audio, OpenGL context initialization, joystick support, etc). Even for things like in-game menus and text rasterisation you are on your own.

Well there is also Unreal Engine!

I used NetBSD as a reference when I needed a cross platform strptime that behaved identically everywhere.

I found the source very approachable. Source was well laid out and fairly clear. Some of it was subjectively a bit ugly to just look at, but when you read it, it was very clear.

Couldn't use glibc as a reference because this is in a closed source commercial product and, well, GPL.

I completely agree. I've gotten to work with both of these at my summer job and it's been an absolute pleasure.

You've gotten to work with both NetBSD and Doom in one summer job? Now I'm really curious what you've been doing. I think I may want that too.

My project was to port DOOM to the seL4 microkernel (on rpi3b+). To continue with that I’ve been porting NetBSD Rump kernel components to seL4 as well.

DOOM itself is a breeze to port. NetBSD can get confusing at times but it’s nice to know there’s always a logical reason behind the design decisions. It rewards you for spending time to understand the code and I like that. Rump kernels are especially cool.

It’s a myth that you can’t get paid to work on BSD ;)

Think about it. When you stand up a new ARM-processor embedded widget for the first time, what are two of the first things you do to verify it works?

1) Port NetBSD to it

2) Port Doom to it

For canonical C code, without a doubt I would say Redis and Postgres. Redis is written and annotated in a way that even someone with a cursory knowledge of C can understand what's going on.

For Python, I really like how SQLAlchemy is written and designed.

For Rust, ripgrep stands out as a sterling example of how to write a powerful low-level utility like that.

> Redis is written and annotated in a way that even someone with a cursory knowledge of C can understand what's going on.

Strongly agree with this one about Redis.

I just opened up server.c in Redis (not familiar with Redis and never seen it before) and at first glance it seems... alright, but I'm not sure what's particularly outstanding about it, and it could definitely be better. Some of the logic seems questionable (exit() in a signal handler? [9] also what about thread safety?), and more superficially (but annoyingly) I see: names like "sdscatprintf" [1] which are a little cryptic, lack of attention to spacing (too little [2] or too much [3] or inconsistent [4] [8]), lack of braces for single-line blocks (which at least I consider bad) [5], irritatingly inconsistent line breaks [6], etc.

Overall it's still definitely on the more readable side compared to other C code I've seen, I like the thorough comments, and it's generally decent, but I'm not particularly coming away from it in awe like everyone else seems to have.

[1] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[2] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[3] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[4] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[5] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[6] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[7] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[8] https://github.com/antirez/redis/blob/unstable/src/server.c#...

[9] https://github.com/antirez/redis/blob/unstable/src/server.c#...

Redis code quality could be improved significantly, but you picked the wrong file to evaluate it: the first file created, and the one that by the nature of system software will be the least "overall designed", because it is the place where we call almost everything else sequentially. Still, if I had more time, I could improve it and other parts a lot. However to see how modern Redis was coded check the following:

* hyperloglog.c

* rax.c

* acl.c (unstable branch, the Github default)

* even cluster.c

Everything you'll pick will likely be a lot better than server.c

Other things you mentioned are a matter of taste. For instance things like:

    if (foo) bar();
Is my personal taste and I enforce it everywhere I can inside the Redis code, even modifying PRs received.

The line breaks are to stay under 80 cols. And so forth. A lot of the things you mentioned about the "style" are actually intentional. The weakness of server.c is in the overall design because it is the part of Redis that evolved by "summing" stuff into it in the course of 10 years, without ever getting a refactoring for some reason (it is one of the places where you usually don't have bugs or the like).

Ah I see. I didn't realize it was old, I just picked server.c since it just sounded like it'd have a decent mix of stuff. I think the only 'taste' things I mentioned are the lack of braces and possibly the function name, which I'm happy to ignore. But the rest were just inconsistencies, regardless of which way anyone's tastes lean -- the spaces after commas are entirely inconsistent (although I would argue if you took the stance that there shouldn't be a space at all, that's bordering on just being wrong, not merely a matter of taste!), and some lines are far longer than 80 columns [1] and some are broken at really inconvenient points just to fit them into 80 columns, which gets you the worst of both worlds.

Thankfully, like you said, the other files do seem better. :) However, they do have the same problems I just pointed out above: spaces are inconsistent everywhere (after commas, after 'while', after casts, etc.) and lacking in awful places (honestly, how am I supposed to read a function call with eight identifiers and zero spaces? [2]) and some lines are broken and others longer than 80 [3][4].

A couple other immediate issues I have on syntactic things (semantic analysis would take me a long time so it's hard for me to give feedback on that):

- One comment I have that some others would vehemently disagree with me on is: I would cut down on the early returns. Personally, I really hate them, for multiple reasons -- the most practical one of which is that they prevent you from inserting a single breakpoint in a debugger (or printf(), or whatever you feel like doing) and being able to see what the function returns easily. Instead, you have to hunt through the entire function and place a dozen breakpoints to make sure you didn't miss any return path. And it becomes the most confusing thing in the world when you inevitably miss one. To me that's bad enough, but some people still insist on them. But if you're going to do that, I would at least make it easier for people to debug them. On my first reading, for example, I completely missed the return on line [5] -- because it's right after the if condition and visually blends in. That also makes it hard if not impossible to put a breakpoint on it, which wouldn't be an issue if it were on a separate line. (Maybe you don't do that because you don't want to forget braces, but you know my opinion on that too. :P) That makes it even more painful to debug this function.

- I'm not a fan of the liberal macro usage. I'm not saying you should never use macros -- I know that sometimes it's outright impossible not to (especially in C), and sometimes they're the perfect tool (code deduplication) -- but you shouldn't be using them to define constants and perform normal function calls. [6] [7] Not only is it often possible to break them because of their textual nature, and not only do they lack type information that would be extremely useful, but they also make it harder to debug, since you can't just type them into a debugger at runtime and get their values. And you can't step through them like normal code either.
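Both points in the list above can be sketched in a few lines of C. Everything here is illustrative: SQUARE_MACRO, square(), MAX_NAME_LEN and name_status() are made-up names, not code from Redis.

```c
#include <stddef.h>
#include <string.h>

/* Textual macro: the argument is pasted in as-is, without parentheses. */
#define SQUARE_MACRO(x) x * x

/* Typed alternative: the argument is evaluated once and the call can be stepped into. */
static inline int square(int x) { return x * x; }

/* A constant with a real symbol and type instead of a #define. */
enum { MAX_NAME_LEN = 16 };

/* Single exit: one breakpoint on the final return sees every outcome. */
static int name_status(const char *name) {
    int status = 0;                         /* 0 = ok */
    if (name == NULL)
        status = -1;                        /* missing */
    else if (strlen(name) > MAX_NAME_LEN)
        status = -2;                        /* too long */
    return status;
}
```

With these definitions SQUARE_MACRO(1 + 2) expands to 1 + 2 * 1 + 2, which is 5, while square(1 + 2) is 9. Whether the single-exit shape reads better than early returns is exactly the taste question the comment acknowledges.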

Hope some of this is helpful. All of it said though, they're pretty minor things overall, and the code does seem good quality, at least as far as I can tell in this timespan. The fact that it's in C does always leave me with this nagging feeling that there's always going to be some resource leak somewhere (especially with early returns!), which I wouldn't have in most other languages, but that's not really an indictment of the code but of the language. And I love, love, love the comments. I don't leave comments nearly that good myself, and I will almost certainly refer back to them later.

[1] https://github.com/antirez/redis/blob/583933e2d6b4c2721554ab...

[2] https://github.com/antirez/redis/blob/unstable/src/rax.c#L10...

[3] https://github.com/antirez/redis/blob/unstable/src/rax.c#L50...

[4] https://github.com/antirez/redis/blob/unstable/src/rax.c#L45...

[5] https://github.com/antirez/redis/blob/unstable/src/rax.c#L13...

[6] https://github.com/antirez/redis/blob/unstable/src/hyperlogl...

[7] https://github.com/antirez/redis/blob/unstable/src/quicklist...

sdscatprintf is just mimicking the C style for a lot of string manipulation functions that have -printf suffixed. My guess would be the function just concatenates the data onto some sort of "SDS" string
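To make that guess concrete, here is a sketch of what a "concatenating printf" can look like on a plain heap-allocated C string. str_catprintf() is a hypothetical helper written for this comment, not the Redis function and not its implementation.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: append printf-style formatted text to a
 * heap-allocated string (or NULL) and return the possibly moved buffer.
 * On a formatting or allocation failure the original pointer is returned
 * unchanged, which is good enough for a sketch.
 */
static char *str_catprintf(char *s, const char *fmt, ...) {
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);

    /* First pass only measures how many characters the format produces. */
    int extra = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (extra < 0) {
        va_end(ap2);
        return s;
    }

    size_t old_len = s ? strlen(s) : 0;
    char *grown = realloc(s, old_len + (size_t)extra + 1);
    if (grown == NULL) {
        va_end(ap2);
        return s;
    }

    /* Second pass writes the formatted text right after the old contents. */
    vsnprintf(grown + old_len, (size_t)extra + 1, fmt, ap2);
    va_end(ap2);
    return grown;
}
```

Usage follows the same pattern as realloc: always assign the result back, e.g. s = str_catprintf(s, " count=%d", 3);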

The Windows operating system.

Windows is quite an engineering achievement. We didn't prioritize readability or "clean code". All the variables used hungarian notation, so you had horrible names like lpszFileName (lpsz = long pointer to a zero terminated string) or hwndSaveButton (window handle). You also had super long if(SUCCEEDED(hr)) chains that looked like your code was spilling down a staircase. Oh yeah, and pidls (pronounced "piddles" and short for "pointer to an id list") used for file operations.

What made the code base beautiful was the extreme lengths we went to to be fast and keep 3rd party software working. WndProcs seem clunky, but they are elegant in their own way and blazingly fast. All throughout the code base you would find stuff like "If application = Corel Draw, don't actually free the memory for this window handle because Corel uses it after sending a WM_DESTROY message."

The fact that thousands of people worked on the code base was mind boggling.

I worked on Windows for 3.5 years and hated most of the code I touched:

1. I think I counted 5 string implementations in active use and code at the boundary had to convert between them all.

2. The SUCCEEDED macro is a mask against HRESULT but who the hell actually uses non-zero HRESULTS to communicate domain-specific success codes? And don’t forget that POSIX APIs return 0-for-non-error ints and COM APIs can use S_OK (0, a non-error) and S_FALSE (1, also a non-error) so you have to flip them for real bools. Or have if (bResult == S_OK)

3. Nobody wanted to touch old codebases. I fixed an assert in Trident layout code because a whole library used upper-left, lower-right input (and params called ul, lr) but one function (contrary to docs) used upper-left, width, & height. When I fixed the library and 2/3 call sites I was called arrogant and told to revert the changes in the library and change the last 1/3 to also have the inverse bug in its call-site.

4. Another Trident API (written by an intern) had a tree where fastInsert() could only be called after slowLookup() but nothing in the api enforces this

5. Every COM object decides whether it’s faster or thread-safe by whether the refcount uses atomic ops or just --/++

6. Saw parallel arrays in files where a struct held an object which might have suffered the slicing problem in insert. Another struct field held an index into the sliced part array. Users rehydrated. This wouldn’t happen with an object pointer, but indirection was unacceptable because the author didn’t trust the small allocation heap’s locality.

7. My codebase included a whole C++ runtime because my core-OS team didn’t trust msvcrt.dll because the shell team wrote it.
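Point 2 in the list above is easy to trip over, so here is a small C sketch of it. HRESULT, S_OK, S_FALSE, E_FAIL and SUCCEEDED are redefined locally as stand-ins so it builds without the Windows headers (the values mirror the documented ones); is_item_selected() and hresult_to_bool() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the Windows definitions; values mirror winerror.h. */
typedef int32_t HRESULT;
#define S_OK    ((HRESULT)0)
#define S_FALSE ((HRESULT)1)
#define E_FAIL  ((HRESULT)0x80004005)   /* high bit set, so negative */
#define SUCCEEDED(hr) (((HRESULT)(hr)) >= 0)

/*
 * Hypothetical COM-style query: the call itself works either way, and a
 * "no" answer is reported as S_FALSE rather than as an error.
 */
static HRESULT is_item_selected(int index) {
    if (index < 0)
        return E_FAIL;                        /* the call failed */
    return (index % 2 == 0) ? S_OK : S_FALSE; /* the call worked: yes / no */
}

/* SUCCEEDED() is true for S_FALSE too, so a real bool needs an exact compare. */
static bool hresult_to_bool(HRESULT hr) {
    return hr == S_OK;
}
```

So SUCCEEDED(is_item_selected(1)) is true even though the answer is "no", which is why the explicit comparison against S_OK ends up in the calling code.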

I once tried to add a feature to Windows (back in the sd days), and it was a nightmare. I was in the C&E org so it was a side project thing, and I eventually postponed it (but then left MS so never got it finished). I imagine it's gotten much better since then, in large part from the shift to git alone. It certainly was a super impressive feat for the sheer scale and longevity. I have a lot of respect for the folks who make that beast work. And there are bits of it that are brilliant. But it was an ugly beast.

The number of cycles that have been wasted checking for Corel Draw must be astounding. I wonder if an environmental cost could be calculated for something like that.

I'm mega curious about this

> ...like lpszFileName (lpsz = long pointer to a zero terminated string)

I remember those.

AIUI hungarian gives you some kind of typing. The typing is done by humans using the names. The humans have to get it right; they are the typecheckers.

The first thing I'd do is offload the typechecking onto an automatic framework - the idea of letting people do a computer's job is madness. It would not have been too hard to do (relatively very cheap for a large codebase like an OS), I think, and would have allowed the hungarian prefixes to be dropped because they'd become redundant, and strengthened and speeded up typechecking. So where is the flaw in my thinking?

(aside: one of my first contract jobs was working in pascal (delphi actually). The company I worked for had coding standards cos you need standards, don't you. It was to prefix every integer with i_, every float with f_, every int array with ai_, et cetera. As pascal was strongly typed this was totally pointless).

According to Joel[1] it was originally about distinguishing variables that had the same type but had different semantics, e.g. a window's width (dx) vs its horizontal position (x). Both ints, but not really the same kind of data.

[1] https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...
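For what it's worth, that "same type, different meaning" distinction is something a C compiler can check if each meaning gets its own wrapper type. A minimal sketch, with XPos, XWidth and right_edge() being names made up for this comment:

```c
/* One struct per meaning: both wrap an int, but they are distinct types. */
typedef struct { int value; } XPos;    /* horizontal position (the "x") */
typedef struct { int value; } XWidth;  /* horizontal extent (the "dx") */

/* Position plus width gives the position of the right edge. */
static XPos right_edge(XPos x, XWidth dx) {
    XPos edge = { x.value + dx.value };
    return edge;
}

/* right_edge(dx, x) does not compile: an XWidth is not an XPos. */
```

The cost is the .value noise at every use, which may be part of why a naming convention looked attractive in a large existing codebase.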

> So where is the flaw in my thinking

The codebase has 34 years worth of code written already, 100 million LOC or more if you count Office, VS etc. The cost of typechecking is trivial but the cost of rewriting this much code to be consistent with any new convention is in the hundreds of millions of dollars. This legacy cost then of course becomes higher every year...

Oh hell. Where did I say anything about it having to be rewritten? I did not. I said "allowed the hungarian prefixes to be dropped", not required. And all I meant was "in subsequent work", of course changing existing code would be idiotic.

And this absolutely misses the vital point that typechecking can be offloaded from the human to an automated typechecker, so why wasn't it?

"The cost of typechecking is trivial" - not if it's done by humans.

I’ll do it for only 50 million dollars. Remove it with repeated regexes like s/^wec//g (remove all word prefixes for element counts). Of course, you’ll need another convention for all the collisions that Hungarian notation fixes like IFoo (Foo’s test fake-able interface) and CFoo (the concrete class)

Would whoever downvoted this please explain why. I laid out my thoughts for critiquing by others, but learn nothing unless I'm told why it's wrong.

I’ll save you with an upvote. Hungarian notation also helps avoid collisions that lead to confusingly similar names:

pArray (array pointer), ecArray (element count of array), bcArray (byte count of the full array)

The Windows API was very ugly... not to mention unnecessarily complicated. It does not belong in this thread.

I have quite the opposite opinion, backed by real life experience.

As an author of Sciter Engine that works on Windows, MacOS and Linux/GTK I have first hand experience working with all three API sets.

Windows API is the most logical, complete and stable API among all others.

It has everything that you really need to create performant and manageable UI.

MacOS is good but less good. It uses reference counting (which is not bad by itself) but in very strange manner. Name of the function determines need of [obj retain] / [obj release] and not all names that they use are consistent in that respect. Yet Apple changes API quite frequently and dramatically.

GTK, while is built on top of quite reliable Glib foundation is a mess to be honest. You have GtkWindow and GdkWindow, you have gtk_window_resize(), gtk_window_set_default_size() and 6 more functions that should allow to set size of the window but they may or may not work in particular situations.

> GTK, while is built on top of quite reliable Glib foundation is a mess to be honest. You have GtkWindow and GdkWindow, you have gtk_window_resize(), gtk_window_set_default_size() and 6 more functions that should allow to set size of the window but they may or may not work in particular situations.

You know, even as a guy who loves Windows, this example makes my head explode. We have MoveWindow, SetWindowPos, SetWindowPlacement, DeferWindowPos, ShowWindow, ShowWindowAsync etc. all of which overlap in functionality. Similarly SetActiveWindow and SetForegroundWindow, etc...

IIRC there were three different kinds of handles, the documentation was a crock, the api did minimal checking, it was horrible. Fast maybe, but delphi succeeded so well because it hid all the awfulness. MS just didn't get it for years.

Yes, there are problems, but compared with the others...

Windows is so popular just because of stability and completeness of its API. And that is non-disputable I think. Quite few simple and flexible abstractions (handles, WndProc, messages, etc).

macOS has had automatic retain counting for the last decade. You have to remember your weak/strong for cycles but that's all.

It's preferable over garbage collection because there are no unexpected pauses/memory scans and it's deterministic.

RE pauses, that’s only due to the system-injected calls to autoReleaseReturnValue and retainReturnValue which work in tandem to do assembly call stack analysis and set an extra register to keep return values out of the release pool. This allows pool runs to happen frequently enough with few objects in them and have smooth latency.

The autorelease optimization actually regularly breaks, because not even the compiler engineers remember it exists, and it relies on libraries tail-calling -autorelease which they also forget to do.

Even without it, there isn't a pausing problem because you can clear the pool at the end of the runloop tick, when the app is idle anyway.

But if you didn’t have the trick, CPU intensive runloop ticks could create lots of small objects that were put in the pool because they had a 0 refcount as a function returned the value. That’s why we had the ability to write our own pools.

Personally, COM gives me the heebie-jeebies. I get why it exists and think it's a cool idea. Under the hood it's just RTTI across shared library boundaries, and it's cool that it "just works" with the way that compilers generate vtables from pure virtual base classes.

But hell did it take me awhile to figure out how that worked because it's so poorly documented and auto-magical. Reverse engineering a COM DLL just to find out how the hell it has a stable ABI is not fun.
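A rough C sketch of the layout being described: the object starts with a pointer to a table of function pointers, and the order of that table is the binary contract. This is deliberately simplified (the real IUnknown also has QueryInterface as its first slot), and Unknown, UnknownVtbl and Counter are hypothetical names.

```c
#include <stddef.h>

typedef struct Unknown Unknown;

/* Table of function pointers; callers depend only on the slot order. */
typedef struct UnknownVtbl {
    unsigned long (*AddRef)(Unknown *self);
    unsigned long (*Release)(Unknown *self);
} UnknownVtbl;

/* Every object begins with a pointer to its table. */
struct Unknown {
    const UnknownVtbl *vtbl;
};

/* A concrete object: the interface part must come first. */
typedef struct Counter {
    Unknown base;
    unsigned long refs;
} Counter;

static unsigned long counter_addref(Unknown *self) {
    return ++((Counter *)self)->refs;   /* plain ++, so not thread-safe */
}

static unsigned long counter_release(Unknown *self) {
    return --((Counter *)self)->refs;   /* a real object would free itself at 0 */
}

static const UnknownVtbl counter_vtbl = { counter_addref, counter_release };

static Counter counter_make(void) {
    Counter c = { { &counter_vtbl }, 1 };
    return c;
}
```

A caller that only knows Unknown can do u->vtbl->AddRef(u) without seeing Counter at all, which is the stable-ABI part; on common C++ ABIs a class with only pure virtual methods ends up with the same shape.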

Also the whole point of COM was interface composition which led to the diamond problem (multiple inheritance) in any real C++ object that implemented two interfaces (which each had to inherit IUnknown if all implementations were to be COM compatible). Virtual inheritance was a hack fix to a platform whose primary language fought its primary purpose.

True. That. I've spent a fair amount of weekend time over the last couple months working on a non-standard COM API port to Rust, and the one thing that's struck me the most is how batshit crazy relying on virtual base classes is for an extensible API. It only works when you want users to deal with your C++ SDK, not the API itself.

Some parts are ugly, some are beautiful. You can't really lump it into one bucket.

I really liked the Go standard library (or at least from around 1.4-ish, it might have gotten more complicated now).

I liked that it was actually possible to read it and understand what was going on.

In a similar vein, P. J. Plauger's version of The Standard C Library is nice because even if it might not be especially optimized(?), you can actually read the code and understand the concepts that the standard library is based on.

Software Tools by Kernighan and Plauger would also be great except that you have to translate from the RatFor dialect of Fortran or Pascal to use the code examples.

Even so, I used its implementation of Ed, to create a partial clone in PowerShell that let me do remote file editing on Windows via Powershell when that was the only access that was available.

So even over 4 decades and various operating systems removed, there are still concepts in there that are useful.

Jonesforth is also a great and mind blowing code base although I'm not sure where the canonical repository is currently.

I suppose it's this one, but it's public domain and over the years has been forked many many times (far more than I ever expected).


> it seems that in real life, a really high-quality codebase is hard to come by

I think a common misconception amongst mid-experienced programmers is that they confuse look with quality. Reading cleanly written code gives you a feeling of control and also the feeling that someone must have thought about that program. It's reassuring. You have in front of you code that you feel you can trust.

When in fact, that code can be complete garbage.

The look of the code doesn't matter, what matters is the program. In the abstract meaning of the term. You don't judge code by reading it, but by running it in your head. Granted you have to understand it in order to do that. Once you understand the code, you run it in your head and that's when quality enters the scene because running it in your head is what you do all day when you code. Some say that you spend most of your time reading code. That's simply not true, the effort is definitely not in reading but in running the code in your head. Basically what I'm describing is a 2 by 2 matrix where there is one column for looks bad, one for looks good, one row for runs badly in the head and one for runs smoothly in the head. Granted, the best may be when both the code looks right and runs right, but don't be mistaken, the really important and difficult part is whether or not it runs well in the head.

A poor quality program may look good, but doesn't run well in the head. It's too complex or too confusing (in terms of logic, not in terms of presentation) or convoluted or simply wrong in terms of what it's supposed to do. On the other hand good quality code is code that surprises you by the way it runs. It's beautiful in terms of simplicity, it delivers a lot, it's small so that it fits well in the coder's head. And it may look like garbage which is not so important.

You may wonder how to know very quickly the quality of a code base. Run part of it in your head. Contemplate the machinery. Try not to think too much about the language and how it's constructed in this language, try instead to contemplate it in an abstract manner. Be critical, and critique your own critiques.

> by running it in your head

This is probably related to a factor named 'local reasoning': procedural programming tried to encourage this through procedures, OO tried to encourage this through encapsulation, and FP encourages this through purity.

Basically the goal is that when anyone looks into a function, the reader can easily make sense of the code without moving around.

For pure functional programming, to make sense of a function is to make sense of the branch under the acyclic calling tree of the function. The caller and sibling branches are always completely irrelevant.

So that it will be much easier to run in people's head.
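A tiny C illustration of the difference, with made-up names (tax_percent, gross_cents_global(), gross_cents()). C does not enforce purity, so the second function is pure only by discipline.

```c
/* Hidden input: to predict the result the reader must find every writer of tax_percent. */
static int tax_percent = 20;

static int gross_cents_global(int net_cents) {
    return net_cents * (100 + tax_percent) / 100;
}

/* Pure by convention: the result depends only on the arguments in front of you. */
static int gross_cents(int net_cents, int percent) {
    return net_cents * (100 + percent) / 100;
}
```

The same call to gross_cents_global(1000) can return 1200 or 1000 depending on what some other part of the program did to tax_percent, while gross_cents(1000, 25) is 1250 wherever it appears.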

> by running it in your head.

I have encountered far too many people who don't even realize this is a thing. I figure they must be doing it in some limited form and just aren't recognizing it as a skill that can be trained up, otherwise I'm not sure how they can do any development, but...

I generally intentionally avoid running the program in my head because that requires too much context: I prefer to think about the change I want to make and the minimal set of preconditions that make the change valid.

Running non-trivial code in your head is far too tedious for daily use. I'd much rather add debug checks for invariants when I want to confirm that the code works the way I expect it to.
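As a sketch of what such debug checks can look like in C: a small fixed-size queue that asserts its invariant on entry and exit of every operation. Queue, queue_check(), queue_push() and queue_pop() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { QUEUE_CAP = 4 };

typedef struct {
    int items[QUEUE_CAP];
    size_t head;   /* index of the oldest item */
    size_t count;  /* number of stored items */
} Queue;

/* Invariant: head is a valid index and count never exceeds the capacity. */
static void queue_check(const Queue *q) {
    assert(q->head < QUEUE_CAP);
    assert(q->count <= QUEUE_CAP);
}

/* Returns 1 on success, 0 if the queue is full. */
static int queue_push(Queue *q, int item) {
    queue_check(q);
    if (q->count == QUEUE_CAP)
        return 0;
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    queue_check(q);
    return 1;
}

/* Returns 1 and stores the oldest item in *out, or 0 if the queue is empty. */
static int queue_pop(Queue *q, int *out) {
    queue_check(q);
    if (q->count == 0)
        return 0;
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    queue_check(q);
    return 1;
}
```

The checks cost nothing in release builds, since assert compiles away when NDEBUG is defined.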

The human brain has, what, about 7 registers in medium term memory?

5±3, if I remember right, but of indeterminate size. You can expand it to a couple dozen by moving up and down abstraction levels.

An exercise: I have 36 letters for you to memorize, in order and without mistakes: "The quick brown fox jumped over the lazy dog"

The sentence as a whole only uses 1 of those memory slots. As necessary, you can jump down one abstraction level to get 9 words, or again on a word-by-word basis to get the full 36 letters.

Code is the same. For example, use multiple slots to understand loop conditions, then file it away in a single slot as the whole conceptual loop. Use multiple slots to understand the loop body, in the context of one element; file it away as well. Combine them into a one-slot conceptual understanding of the whole loop. So on and so forth.

As you work with the conceptual models, you'll also end up with them in long-term memory and not have to do this reconstruction each time.

This pattern can continue indefinitely, with conceptual models of whole packages and systems. Most will do it unconsciously at that level as they become familiar with a codebase, but it works at every level, and is the basis of "running code in your head".

I don't always run the code in my head exactly - I do something more important, which is consider the set of possible situations the code can be in, and look for problems.

I'd be curious to see an example of code that looks bad but yet is easy to grasp mentally. I do think these 2 features that you describe "easy to run mentally" and "look good" are correlated.

At least, I may agree with the following: a bad program implies bad presentation, but bad presentation does not imply a bad program.

In my career, I've seen many times peer programmers laugh at me when looking at my code and subsequently keep laughing, but in the other direction, as we were rolling over our competitors. Once, at a FANG, I had the perfect setup where our team was doing the same project as another team. We did it in 12 months with 8 people, it took them 4 years with 18. Both projects started from scratch.

Here follows an example of my own code. This is the perfect example because it looks like complete utter garbage and you will think I'm a beginner and I've never seen any well-designed code. You are gonna laugh. While it is certainly not flawless, I'm pretty sure you would like to be on my team with that code if you knew more. All the features, to the tiniest nice subtlety, are there, it has almost no bugs and it's dead simple. Really dead simple. Please note that I code with Sublime Text, which has multi cursor, making duplicated code not so bad. So there is a lot of duplicated code. Again, I'm not saying the following code is flawless, I'm saying that despite its flaws, or even because of them, from my experience you'd prefer to be on my team.


I love this description. Thank you.

dont u mean a 2x2 matrix

I wish there was a way to read the codebase where there is a tag that tells you what the folder does.

In github, rather than see what has changed, it would be interesting if there was a comment that told you what the folder contained.

edit: Relevant here because the best codebase for me is one where I can understand the folder structure, but that is a sort of 0th order effect that should be equalized with some tool.

In Go, a folder is a package and as soon as you write a comment before the `package foo` declaration, it's package documentation. And thus GoDoc automatically generates a nice webpage out of it.

See for example this package comment: https://github.com/golang/go/blob/master/src/net/http/doc.go...

Turns into this documentation (the beginning only): https://godoc.org/net/http

Out of the box.
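To make that concrete, here is a minimal sketch of such a comment. Everything in it is made up for illustration (the `greet` command and the `Greeting` helper are hypothetical), and it uses `package main` only so the file runs on its own; a library would start the comment with "Package foo ..." in the same position.

```go
// Command greet prints a greeting.
//
// Because this comment sits directly above the package clause, Go's
// documentation tools treat it as the documentation for the whole
// directory. In a library the first sentence would conventionally
// start with "Package foo ..." instead.
package main

import "fmt"

// Greeting returns the message printed by the command.
// (Hypothetical helper, only here so the sketch does something.)
func Greeting(name string) string {
	return "hello, " + name
}

func main() {
	fmt.Println(Greeting("HN"))
}
```

Larger packages often keep this comment in a separate doc.go file, as net/http does in the link above.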

Nice. I like the way it also shows you the functions inside a file/folder, and that clickable index is nice.

What's the point of all these empty comments throughout the code?

    // ...

It's just an ellipsis indicating that there would be more code after.

It's likely the Go linter requiring every public method / variable to be documented.

Yup. I saw a lot of codebases with things like:

    // EnqueueEvent ...
    func (q *Queue) EnqueueEvent(event Event)
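For contrast, here is a rough sketch of what a filled-in comment could look like. The `Queue` and `Event` types are hypothetical, guessed from the signature above, and the "comment starts with the identifier's name" form is the convention that linters such as golint check for on exported identifiers.

```go
package main

import "fmt"

// Event is a hypothetical payload type for this sketch.
type Event struct {
	Name string
}

// Queue is a hypothetical FIFO of events.
type Queue struct {
	events []Event
}

// EnqueueEvent appends event to the back of the queue.
// The comment begins with the method's name and then says what the
// method does, instead of trailing off into "...".
func (q *Queue) EnqueueEvent(event Event) {
	q.events = append(q.events, event)
}

// Len reports how many events are waiting in the queue.
func (q *Queue) Len() int {
	return len(q.events)
}

func main() {
	q := &Queue{}
	q.EnqueueEvent(Event{Name: "deploy"})
	fmt.Println(q.Len())
}
```

The placeholder version satisfies the linter just as well, which is presumably why it shows up so often.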

It sounds like you're describing a README file -- which by convention can exist as documentation for any folder and not just the project root. It's not adopted by all codebases but is becoming more common, as source browsers like GitHub will render the README as rich text when navigating to the folder.

Yep. For some of the larger projects that I've worked on, I've gotten into the habit of adding folder level READMEs. I don't know if anyone else has benefited from them, but I certainly have myself when I need to remind myself of some context or pitfalls.

Having a sensible folder structure and good folder names is nice, but taking a few minutes to write individual READMEs can make a repo even easier to understand.

readme files might be the natural place for them but in practice readme files sort of tell you about the project, the author, the purpose, examples of what it can do and maybe how to install it.

It rarely gives you what is in each folder, and what part of the functionality each folder handles, although perhaps we should try to change the conventions of readme files to include file structure.

edit: I mean the root readme might contain what is in each folder so you don't have to click on each one to see which one you want to start with.

Java did this with package-info.java. It is underused, in my experience.

Unfortunately, I've found that almost all developers are incapable of objectively judging the quality of code until they actually have to start working with it and then after a few months they can start to appreciate or despise the code.

It takes a lot of investment from a developer before they can appreciate the beauty of the code... To make matters more confusing, a lot of developers tend to become extremely attached to even horrible code if they spend enough time working with it; it must be some kind of Stockholm syndrome.

I think the problem is partly caused by a lack of diversity in experience; if a developer hasn't worked on enough different kinds of companies and projects, their understanding of coding is limited to a very narrow spectrum. They cannot judge if code is good or bad because they don't have clear values or philosophy to draw from to make such judgements. If you can't even separate what is important from what is not important, then you are not qualified to judge code quality.

If you think that the quality of a project is determined mostly by the use of static vs dynamic types, the kind of programming paradigm (e.g. FP vs OOP), the amount of unit test coverage and code linting, then you are not qualified to judge code quality.

I think that the best metric for code/project quality is simply how much time and effort it takes for a newcomer to be able to start making quality contributions to the project. This metric also tends to correlate with robustness/reliability of the code and also test quality (e.g. the tests make sense and they help newcomers to quickly adapt to the project).

As developers, we are familiar with very few projects. If a developer says that they like React or VueJS or Angular, etc... they usually have such a limited view of the whole ecosystem that their opinion is essentially worthless; and that's why no one ever seems to agree about anything. We are all constantly dumbing down everything to the lowest common denominator and regurgitating hype. Hype defies all reason.

It's the same with developers; most developers (especially junior and mid-level) are incapable of telling who is actually a good developer until they've worked with them for about 6 months to a year.

If you are not a good developer, you will not be able to accurately judge/rank someone who is better than you at coding until several months or years of working with them. Sometimes it can take several years after you've left the company to fully realize just how good they were.

> Unfortunately, I've found that almost all developers are incapable of objectively judging the quality of code until they actually have to start working with it and then after a few months they can start to appreciate or despise the code.

While I agree when evaluating a codebase by the broad architecture (which I often judge by cohesion and coupling), I feel evaluating details first requires learning to read code as well as prose. Then “bad” or “ugly” code is code that reads arcanely like olde English.

It’s like you’re reading my mind. This matches my experience closely and is exactly what I first thought of when I saw this topic.

.. for some meanings of "good" .. aside from that one over-generality, valid insights yes

Django! And django rest framework. To me, both codebases are so readable and so well put together that even if their documentation was bad (which it isn't), you could fully grasp their APIs and how to use their libraries by just reading through some of the code.

Also agree! I learned a lot about how Python works from digging through the inner guts of the Django ORM. I’ve been pleasantly surprised, getting back into Django after about 5 years, that the codebase is just as comprehensible as I remember it to be, and the documentation equally as comprehensive.

I would add Flask too. I guess everyone in that realm of Python learned the lesson from werkzeug, where after all these years some functions/classes still don't have docstrings and variable names are awful.

Second this; dove into Django and DRF source multiple times in the past, always got out with more than I wanted. Of course, on 95% of occasions the docs are good enough.

The Codemirror codebase [0] is simply written and richly commented, and using Codemirror itself in a project is a pleasure.

Tellingly, Marijn Haverbeke, Codemirror's creator, is also the author of the excellent 'Eloquent Javascript' [1].

[0] https://github.com/codemirror/codemirror

[1] http://eloquentjavascript.net/

It's unfortunate that the author doesn't appear to understand the value of a strict Content-Security-Policy.

For context, see this GitHub issue: https://github.com/codemirror/CodeMirror/issues/4937

Author is unwilling to change a handful of lines of code to make the package compatible with a strict style-src.

Why does this matter? You can exfiltrate data such as CSRF tokens using inline styles from an HTML injection vulnerability: https://medium.com/bugbountywriteup/exfiltration-via-css-inj...

See also

[1] Interesting Codebases: https://news.ycombinator.com/item?id=15371597

[2] Show HN: Awesome-code-reading - A curated list of high-quality codebases to read https://news.ycombinator.com/item?id=18293159

Thanks for the links!

I think the best codebases are the ones you immediately think “Oh, that’s easy. I can reimplement that in a couple of hours.”. When it's not. It’s never easy.

I've been enjoying reading Spectrum's codebase[0]. It's very simple: even with little documentation you can understand pretty much what is going on, and at the architecture level it is as simple as it needs to be. The first time I tried to open a PR, I had my feature up and running in a few minutes.

Small summary of the features I liked:

- Simple documentation

- Intuitive structure

- Lots of JS best practices, but still simple

- Event-driven architecture

- A simple API gateway that will just fire events to workers

- Properly divided workers (kind of microservices but with lots of shared code)

- Monorepo

It has recently been bought by GitHub(1) and was discussed here(2).

The author has written on his blog about some decisions he got wrong. Super interesting post(3).

0. https://github.com/withspectrum/spectrum


2. https://news.ycombinator.com/item?id=18570598

3. https://mxstbr.com/thoughts/tech-choice-regrets-at-spectrum/

Anything by John Carmack. (DOOM's open source release, for example)

I'd second that. Some years ago I had a look at the Quake 3 bot code, which I think was also written by John Carmack. The code was amazingly intuitive.

The Q3 bot AI was written by this man https://doomwiki.org/wiki/J.M.P_van_Waveren_(MrElusive)

Carmack said that he was "the best developer I ever worked with"

Tornado, a Python web framework. I doubt it is still popular today. But reading the source code is a pleasure. Its naming is clear; its encapsulation allows extensibility; its documentation teaches me how to write an industry-level library.


I used Tornado to build the WebSocket back end for this page: https://www.slickdns.com/live/. It's less than 100 lines of code and very clean.

For most web apps my default choice is Django, but for special purpose web servers Tornado or Flask are still useful.

I've always been impressed by the quality of Three.js - https://github.com/mrdoob/three.js/

Peter Norvig's Sudoku solver (Python) is excellent. It solves every Sudoku puzzle.


oh yeah, this one is so good. And his spell checker is really understandable, readable too https://norvig.com/spell-correct.html

For something a little different, the Clojure codebase – particularly `clojure.core` – is extraordinary in its elegance and simplicity.

Whenever I have to dig into Spring sources, they always look awesome. This is the best Java project I've ever seen. Clean concise code, versatile architecture using proper design patterns, lots of documentation.

Plan 9 codebase has been the best that I have encountered.


Kubernetes for Golang. This has been brought up as an example of fine documentation: https://github.com/kubernetes/kubernetes/blob/ec2e767e593953...

Requests for Python https://github.com/psf/requests

K8S was transpiled from Java, and it shows. Much of the code isn't idiomatic Go at all. So I wouldn't cite K8S as a shining example.

Comments like the one cited are fantastic, but we're interested in exemplars of good (i.e., elegant and readable) code, not ancillary matter.

Very interesting! I never knew that! Did a bit of googling to verify it. [1]

[1] https://fosdem.org/2019/schedule/event/kubernetesclusterfuck...

I’m not sure I’d call Space Shuttle Style appropriate for 99% of code, but it does make it easy to read and harder to mess up down the road.

Space Shuttle Style is appropriate when the design is fixed, the code is extremely tricky and it's very difficult to test - in other words, when it's not possible to follow normal good coding practices.

If you can comprehensively test, then the docs can live there, and you don't need quite so many warnings about introducing bugs.

The Stockfish chess engine [1] is very nice, especially so since chess engines tend to be very kludgy.

[1] https://github.com/official-stockfish/Stockfish

Weirdly I find the v8 source and jQuery (at least several years ago) to be the best. They're examples of real-world, multi-distributed-collaborator, mature projects. Harder, I think, than writing "clean" code is maintaining a large code base's organization, distribution, etc and these are really good examples of that.

(d3 and three.js are also very interesting to read, but they're not quite in the same class as the former.)

The compiler for the B language, by Arthur Whitney http://kparc.com/b/

I'd add to this family the J system, ver. 7 - https://github.com/jsoftware/jsource .

A good example of creating a DSL and then efficiently using that. "Never had a memory leak from day 1 (Roger Hui)" (written in C).

ha ha

My personal favorite is Sequel. https://github.com/jeremyevans/sequel

Beautiful codebase, rock solid, and way better option than ActiveRecord IMHO

Not only is it a high-quality codebase, support from Jeremy is top-notch: there are no outstanding issues, everything gets fixed promptly. It's been a great joy using both sequel and roda all these years.

In 2010, the nginx source was quite startling. It's very readable C, but also somewhat difficult in that it did not mechanize a lot of bookkeeping. If it wanted to construct a string of 8 components, there were 8 terms in getting the length right, then 8 appends to the string. This was toil, and error-prone. But it was always done with such precision, and so consistently, that it was still quite beautiful.
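For anyone who hasn't seen that style, here is a rough sketch of the pattern. It is written in Go rather than C, it is not nginx's code, and the `requestLine` function and its three components are invented for illustration. The point is that the length is summed by hand, one term per component, and then each component is written by hand in the same order.

```go
package main

import (
	"fmt"
	"strings"
)

// requestLine builds "METHOD PATH PROTO" with one up-front reservation.
func requestLine(method, path, proto string) string {
	// Pass 1: one term per component, including the two separators.
	n := len(method) + 1 + len(path) + 1 + len(proto)

	// Pass 2: one write per component, in exactly the same order.
	var b strings.Builder
	b.Grow(n) // reserve the full length before writing anything
	b.WriteString(method)
	b.WriteByte(' ')
	b.WriteString(path)
	b.WriteByte(' ')
	b.WriteString(proto)
	return b.String()
}

func main() {
	fmt.Println(requestLine("GET", "/index.html", "HTTP/1.1"))
}
```

Forget one term in the first pass and Go just grows the buffer again; in C the same slip is a buffer overrun, which is why the consistency mattered so much.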

The Lua source code is very easy to read, especially for an interpreter. It's a hand written recursive descent parser and incredibly tiny. http://www.lua.org

It's a toy, but I remember being very impressed with the implementation of Turing Drawings: http://maximecb.github.io/Turing-Drawings/


What do you like about Laravel and dislike about React?

I like the Linux kernel codebase: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

At some point in Martin's book, he mentions a forgotten codebase from one of his friends that was composed solely of tiny functions yet achieved a fairly complex effect. That's the kind of feeling I had working with Laravel (admittedly it's been a few years). The React codebase, on the other hand, seems less human-friendly; it seems to require some prior knowledge of the code to be able to dive in.

> I like the Linux kernel codebase

Same here, I was looking at linux kernel network code lately and I was surprised by how clean and easy to follow it was.

Although it's not a "codebase" per se, The Architecture of Open Source Applications is a great way to see how lots of major applications are structured. http://aosabook.org/en/index.html

Octave is amazingly written, in my opinion. I remember delving into its source expecting a mess made by mathematicians and instead found a very well organised and clean code base. Kudos to that team!

I think Faust looks like exemplary modern Python.


Also voted for Postgres, Redis, and NetBSD.

I was impressed by the code of Konstantin Knizhnik ( http://www.garret.ru/ ). All his database engines in particular.

Well, Symfony's codebase is really great: it is very readable, its behavior is logical, the documentation is great, and it uses common patterns (SOLID), so you are never lost.

Suckless tools (https://suckless.org/), clean, efficient, small (easy to know the whole codebase) and very hackable, hackable to the point where even configuration is in code: https://git.suckless.org/dwm/file/config.def.h.html

I would especially recommend reading through sbase and ubase:



Qt's codebase is very clean

I love Qt, and Qt has a great cross-platform API. However, Qt uses a lot of Qt-only idioms. In particular, it has simplified memory management based on parent/child relationships (all children are deleted with their parents), and also copy-on-write ("COW") containers that can simplify programming but make it harder to reason about memory allocation.

Further, all internal classes also include a PIMPL (d-pointer or data pointer) to hide internal details from API customers.

IMO, the d-pointer makes stuff much more difficult to read, and the Qt idioms are probably only useful if you are a Qt developer. So it might not be useful for you if you are on a non-Qt project.
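To illustrate the copy-on-write point for people who haven't used such containers, here is a minimal sketch of the idea in Go. It is not Qt's implementation or API: `IntList` and everything on it are hypothetical, and the reference count is not thread-safe.

```go
package main

import "fmt"

// shared is the backing store that several IntList values may point to.
type shared struct {
	data []int
	refs int // number of IntList values currently sharing data
}

// IntList is a hypothetical copy-on-write list; it is not Qt's API.
type IntList struct {
	s *shared
}

// NewIntList builds a list that owns a private copy of vals.
func NewIntList(vals ...int) IntList {
	return IntList{s: &shared{data: append([]int(nil), vals...), refs: 1}}
}

// Copy is cheap: it shares the backing store and bumps the count.
func (l IntList) Copy() IntList {
	l.s.refs++
	return IntList{s: l.s}
}

// Set writes v at index i, detaching first if the store is shared.
// This is where the hidden allocation happens.
func (l *IntList) Set(i, v int) {
	if l.s.refs > 1 {
		l.s.refs--
		l.s = &shared{data: append([]int(nil), l.s.data...), refs: 1}
	}
	l.s.data[i] = v
}

// At returns the element at index i.
func (l IntList) At(i int) int { return l.s.data[i] }

func main() {
	a := NewIntList(1, 2, 3)
	b := a.Copy() // no element is copied here
	b.Set(0, 99)  // b detaches into its own store; a is untouched
	fmt.Println(a.At(0), b.At(0)) // prints "1 99"
}
```

The copy looks free and the write looks trivial, but whether `Set` allocates depends on who else holds the data, which is the part that is hard to reason about from the call site.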

I thought Qt was a collection of libraries, even with different licenses. As such, it seems strange to apply a word like "clean" to the whole thing.

Qt core and most libraries are written by the same group of people in the same development process.

Splitting it up into libraries is good engineering practice as it enforces boundaries. (Whether each is packaged into its own dll/so/a/dylib is a deployment question; you can compile all parts statically together.)

The licensing is a political/business question independent from code being nice, good and clean. (Except one has to be careful during refactorings, as code might change license, but since afaik the Qt Company has CLAs ensuring copyright ownership they are able to do that)

I would say the Mattermost ("open source Slack") codebase: Go server: https://github.com/mattermost/mattermost-server

React frontend app: https://github.com/mattermost/mattermost-webapp

Not your question but I think the bash source code is the worst I've ever seen.

Sedgewick and Wayne algorithm implementations. Although, those are literally textbook implementations :-) I’m not quite sure what it would take to turn them into a production ready standard library (generics?), but they look like a good foundation for that purpose.


I enjoy reading Django's source code so much. I also love writing Django apps. Its elegance forces me to write a much cleaner code.

I think Brogue is quite nice to work with and well structured: https://github.com/tsadok/brogue

Kudos also to Postgres, SQLite, Lean (theorem prover), and the containers library for Haskell.

The build system and assembly system (x86asm) are very underrated. Open source went from autotools, which are awful, to cmake, which also seems to be awful. ffmpeg's configure/make system has the same interface as autotools but is actually good.

libavformat is rather difficult to use and difficult to fix bugs in - you'll never find the bugs. Same with the ffmpeg frontend, which makes it easy to ask for something it's near impossible to get right, like copying an mkv file to an avi, it'll just corrupt your data silently.

Everything about the video decoders is great, but encoding never worked as well, which is why nobody uses ffmpeg2/4/etc and x264 is a separate project.

> libavformat is rather difficult to use and difficult to fix bugs in - you'll never find the bugs.

Some examples?

> encoding never worked as well, which is why nobody uses ffmpeg2/4/etc and x264 is a separate project.

Most users use x264 _via_ ffmpeg since they may need to filter the video and/or filter/process/mux audio and other streams.

> Some examples?

Just try copying video between different containers (mkv, ts, avi for one) without reencoding.

> Most users use x264 _via_ ffmpeg since they may need to filter the video and/or filter/process/mux audio and other streams.

They don't have to do that; you can handle each track separately and mux them back afterward.

I'm talking about ffmpeg's builtin MPEG2/4/MP3 encoders, which nobody used when they were competitive because it leaves all the options at "go fast" instead of providing tunings.

They're also unmaintained and the code is hard to figure out - that's why x264 was a separate project instead of just another part of libavcodec.

> Just try copying video between different containers (mkv, ts, avi for one) without reencoding.

I do, all the time. AVI is an old container and it has issues with B-frames and also VFR, but those are container limitations, not a libavformat issue. All transmuxing between common modern containers with present-day common codecs works fine. There are always edge cases, but that's what they are, edge cases.

> you can handle each track separately and mux them back afterward.

Why do that? What's the benefit?

> I'm talking about ffmpeg's builtin MPEG2/4/MP3 encoders

You seem to be talking about the state more than a decade back. How's that relevant to 2019? BTW, there is no native MP3 encoder.

Recently had to make a few changes to ffmpeg codebase. It's pretty good.

prestodb is the best java codebase I've ever worked in, llvm is the best c++ codebase I've ever worked in.

For Java, below are my favorites:

1. Jersey - https://github.com/eclipse-ee4j/jersey

2. Jetty - https://github.com/eclipse/jetty.project

3. Guava - https://github.com/google/guava

Common theme is - Easy to follow, clean documentation & use of consistent patterns.

I found the Akka source code to be very clean and readable: https://github.com/akka/akka

I work at Citus (now Microsoft) so my opinion is biased but I think Citus [1] codebase is a really good example.

It borrows all the best practices from PostgreSQL, and the naming of variables and functions is more self-explanatory in general.

I also believe that the practices around PRs and code reviews are also good examples.

[1] https://github.com/citusdata/citus

arc.arc, core of the Arc language. Lots of lessons in there on how to write good Lisp code. I’d say the same thing about the core.clj of Clojure.

Many mature frameworks have good standards. For PHP you can look at any code base that abides by PSR standards and get good code.

Not only that, but Laravel is in a mature space where the problems are already solved. It's basically reinventing the wheel.

I'm not surprised that Laravel is written cleanly, but I hate its API. It reminds me of the bloat of Zend but with an obnoxious artsy style added to it.

I'm an engineer, not an artisan.

> I'm an engineer, not an artisan.

That was mostly branding: "I'm not a code monkey banging out the same thing as has been done 500 times before, I'm an artisan."

Meanwhile the actual framework breaks backwards compatibility regularly and frequently and only just with 6 picked a damn versioning system.

IMO, if you want to see a good framework that solves its problems mostly well and is properly decoupled, then Symfony kicks the shit out of Laravel on documentation, religious adherence to deprecation and backwards compatibility, and genuinely useful/genuinely decoupled components.

The author of Laravel knew that, as a massive chunk of Laravel depends on Symfony components; in fact the earlier versions were basically Ruby on Rails implemented via Symfony.

> look at any code base that abides by PSR standards and get good code.

No. PSR does not at all ensure good code, only standardized code style and some of the interfaces.

It does one thing: It indicates the author is willing to use "established patterns" instead of reinventing the wheel. (While sometimes it's good to reinvent the wheel - today's vehicles won't work with wooden wheels)

PSR is an abomination. You end up with more blank lines than code and about 5 lines of doc comments to every line of code. Factor in PHP's pseudo-Java verbosity and the actual business logic is drowning in a sea of boilerplate.

Anything by Tobias Lütke ( https://github.com/tobi )

Maybe I have a low-ish bar, but the Kodi Add-Ons in the official code base is a nice system of continuous integration with established acceptance and review procedures. I wrote a small add-on for a woodworking show and as soon as all the checks were passed and there was a final sign off, it was immediately available for installation on all platforms.


This Python code is responsible for the fairly recent imaging of the black hole (i.e. imaging, analysis, and simulation software for radio interferometry).

It's extremely easy to digest despite the complexity involved.

The cleanest real/nontrivial code I've seen is the plan9 implementation of the standard unix command line tools.

With some healthy spoonfuls of caveats, Quake 3 (https://github.com/id-Software/Quake-III-Arena). Skipped a lot of classes back in the day ripping that one apart and piecing it back together again.

I liked some Objective-C codebases that adhered to the muffin-man principle:

- (BOOL) doYouKnowTheMuffinMan:(TheMuffinMan *)theMuffinMan;

Also, lots of the Objective-C runtime code was clear enough to explain concepts like ARC hacks well enough that I could learn about and give a talk on the Objective-C runtime with a month’s notice.

Well, there was a post a few years back titled "Best-architected open-source business applications worth studying?": https://news.ycombinator.com/item?id=14836013

Unreal 4 is far from perfect, but considering the size of the project I would say it's pretty impressive.

I’m only learning Swift and iOS dev, but a fair number of people recommended that I take a deeper look at the Kickstarter iOS app: https://github.com/kickstarter/ios-oss

Came here to post this.

It's got some interesting usage of custom Swift operators to create almost diagrammatic code, like here: https://github.com/kickstarter/ios-oss/blob/master/Kickstart...

  _ = self.cardholderNameTextField
      |> formFieldStyle
      |> cardholderNameTextFieldStyle
      |> \.accessibilityLabel .~ self.cardholderNameLabel.text

And it's the first iOS codebase I've seen that puts test files right next to the files that define the things being tested. It's all there together.

Tons of other goodies to find.

2 of the guys that worked on that app are doing video series about functional programming now where they talk a lot about the ideas in the app, https://www.pointfree.co

Nice rec! They're the ones that wrote the tagged [0] repo I use, had no idea they were connected. Very cool.

[0]: https://github.com/pointfreeco/swift-tagged

ZeroMQ[1] for sure. It has a clean interface wrapped over low-level IO details and a well-designed unified polling mechanism. It's a piece of art.

[1] https://github.com/zeromq/libzmq

sel4 microkernel is pretty cool.

"The world's first operating-system kernel with an end-to-end proof of implementation correctness and security enforcement is available as open source."

for instance, look at strnlen:

  word_t strnlen(const char *s, word_t maxlen)
  {
      word_t len;
      /* Count bytes until the NUL terminator or maxlen, whichever comes first. */
      for (len = 0; len < maxlen && s[len]; len++);
      return len;
  }


I've had the pleasure of working on this: https://github.com/cloudify-cosmo. It was the cleanest I've seen.

Someone mentioned joda-time. Does anyone know of other exemplary java repos?

I think that the Twisted codebase has the best combination of sound and comprehensive metaphor, straightforward functionality, and fun.

I also enjoy diving into it when I hit a breakpoint that calls one of its methods.

Whatever you do ... do not read the guts of PonyORM (python db orm). Also I would strongly urge anyone who has ever considered it to run, not walk, in the opposite direction.

Any good Kotlin codebases to study? Or even Java (8+)?

I really like AllenNLP.


A side question. What is a good way to start understanding such code bases as mentioned on here, so that the learning is optimal and effective?

minitest. The testimonials speak volumes.


Apparently, some people think my toy channel implementation called Normandy is somehow good:


The easiest to read codebases I've found were all written in Common Lisp, like the clsql lib for example.
