Interesting Codebases (medium.com)
344 points by markpapadakis on Oct 2, 2017 | 45 comments

If you enjoy going through interesting code bases and learning some new tricks and patterns, you would probably enjoy "The Architecture of Open Source Applications" ([0]) - each chapter describes the history and architecture of a separate open source project. Whenever I'm learning a new technology, I usually try to find a chapter in this series about it.

[0] - http://aosabook.org/en/index.html

I've actually done undergraduate research with AOSA, and by far the most interesting part was analyzing the differences and commonalities in architectural approaches between FOSS projects with similar domains.

one of my biggest issues with looking through codebases is, where do i start?

if i'm not familiar at all with the language, or more specifically with how programs in that language are usually structured, i'm just going to spend a lot of time looking at stuff that probably isn't the meat and bones of the library/app.

take for example his first suggested codebase, seastar. i haven't done c++ in years (school, using turbo borland) so where do i look? "apps" maybe? nope, that just seems to be a folder of libraries. ah, probably the core folder. whoops, there are 20+ files/headers. should i dig into this assuming that's where most of the code for the app/library is?

i suppose it would be nice if there were a site that explains how projects in most programming languages lay out their code. e.g. javascript projects are generally laid out similarly now, as are ruby/rails ones. so a site to explain the general layout structure would be kinda cool. it's kinda late at night so maybe i'm overthinking this.

Search for "Code Review" posts by the author of the "Game Engine Black Book":


He walks you through his whole code review process starting from cloning/building the project:



Well there goes the rest of my productivity today. I love reading these.

For what it’s worth, this is how I usually go about exploring codebases (I am the author of that blog post).

Either I have something specific in mind that I want to understand (e.g. in the case of Seastar, I wanted to understand their reactor design and implementation), in which case I locate the respective file(s) that implement it and start from there, and then I just branch out to other files -- I usually keep multiple vim windows open, and I take notes.

When I am not looking for a specific answer, I choose a directory and sort its files by size (e.g. ls -lShr *.{h,cpp,hh,cc,java}). I usually sort by file size in ascending order (smallest first), but sometimes it makes more sense to sort by largest file first, and I start from there. I still map my way around, and if something stands out, I open the respective file in another vim window/tab, look it up, and then continue with the previous file.
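
A minimal sketch of the size-sorting step in Python, for anyone without a shell at hand. The function name `files_by_size` and the extension list are assumptions made for illustration (the extensions simply mirror the glob in the `ls` command), and it only looks at files directly inside the given directory, as `ls` would:

```python
import os

# Extensions mirror the glob in the ls command; adjust them for your project.
SOURCE_EXTENSIONS = (".h", ".cpp", ".hh", ".cc", ".java")


def files_by_size(directory, extensions=SOURCE_EXTENSIONS, largest_first=False):
    """Return (size_in_bytes, filename) pairs for source files directly in `directory`."""
    entries = []
    for entry in os.scandir(directory):
        # Skip subdirectories and files with other extensions.
        if entry.is_file() and entry.name.endswith(extensions):
            entries.append((entry.stat().st_size, entry.name))
    # Ascending (smallest first) by default; pass largest_first=True to flip it.
    return sorted(entries, reverse=largest_first)


if __name__ == "__main__":
    for size, name in files_by_size("."):
        print(f"{size:>10}  {name}")
```

Smallest-first tends to surface small headers and helper types before the big implementation files, which is one possible reason to prefer that order.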

You can also make a git script to check what files are the most modified.
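
One way such a script could look, as a rough sketch rather than the commenter's actual script: it counts how often each path shows up in the output of `git log --name-only --pretty=format:`. The function names `count_changes` and `most_modified` are made up for this example, and renamed files are counted under each name they had.

```python
import subprocess
from collections import Counter


def count_changes(log_text):
    """Count how often each path appears in `git log --name-only` output."""
    # Blank lines separate commits; every non-blank line is one touched path.
    return Counter(line.strip() for line in log_text.splitlines() if line.strip())


def most_modified(repo_path=".", top=20):
    """Return the `top` most frequently committed paths in the repository."""
    log_text = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_changes(log_text).most_common(top)


if __name__ == "__main__":
    for path, commits in most_modified("."):
        print(f"{commits:>6}  {path}")
```

Files that change often are usually where the active logic lives, so the top of this list is a reasonable place to start reading.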

I also have a script that checks which are the most crypto-y files in a directory.

> I also have a script that checks which are the most crypto-y files in a directory.

Could you share it? :)

First and foremost, clone and build it. If you cannot clone, build and get it running within an hour, it's generally crap and not worth your time[0].

Once you get it building and running, you can now start making simple changes in order to find your way around.

[0] Developers should pride themselves on how (relatively) simple this process is for their project(s).

That would disqualify every C++ codebase I've worked on.

One of the quality standards of my company is "devs have rebuilt the project from scratch on their machine, in < 15 min". It is checked weekly and works wonders.

This is where I find the robustness of Go particularly interesting, and it's probably one of the reasons why it has gained so much momentum recently.

If the project is using git, there's a git-extras package to show statistics on the git history where you can see the files that are most frequently worked on.


This is really cool. What are some of your favorite git extras commands?

When in doubt,

    grep -r -- 'main[ ]*(' .

The question to me is: do you really learn anything just by reading it, or is it just an exercise in self-aggrandizement and programmer posturing? You will be able to say: I've read that code and understood it. Some people claim that it's more important to also look at the "bad" code and find ways to improve it; that way you'll truly be learning the craft of programming and not just be on board with the latest engineering practices of a particular language and toolset.

You know the problem they are trying to solve. I think looking at how they solved it is the definition of learning. I don't think many people read code just for the bragging rights of saying they understood it. That seems silly.

You're right, such things should actually be in the README, but go figure, nobody does it.

Documentation should not just be a matter of "every function has a docstring"; a description of the architecture is always useful.

I'm with you on this! Clicked on the same suggestion and had no idea where to start. I expected to see a "src" folder, which is typically used to separate the source code of the project from the docs/objs/tests/utils. It was missing, so I closed the tab, not willing to search for where each part of the actual code was.

Generally, if it's an executable, you should look for the main function. Otherwise, it's a library, in which case you should look at the documentation and start from the handful of entry points into the library.
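
For the executable case, here is a small Python sketch that walks a source tree and lists lines that look like a `main(` definition or call. The function name `find_main_candidates` and the default extension list are assumptions for illustration, and a regex match is only a hint, so expect some false positives (comments, calls, strings).

```python
import os
import re

# Matches "main(" as a whole word, with optional whitespace before the parenthesis.
MAIN_PATTERN = re.compile(r"\bmain\s*\(")


def find_main_candidates(root, extensions=(".c", ".cc", ".cpp", ".java", ".go")):
    """Return (path, line_number, line) for every line that mentions main(...)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps the scan going on files with odd encodings.
            with open(path, encoding="utf-8", errors="replace") as source:
                for line_number, line in enumerate(source, start=1):
                    if MAIN_PATTERN.search(line):
                        hits.append((path, line_number, line.strip()))
    return hits


if __name__ == "__main__":
    for path, line_number, line in find_main_candidates("."):
        print(f"{path}:{line_number}: {line}")
```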

I start with finding entry points and analyzing data structures.

Redis [1] is a code base I've been impressed with.

Short, readable functions and good comments make it quite easy to follow.

[1]: https://github.com/antirez/redis

An important thing it has is an actual source code layout in the readme, something that is missing from nearly all open source projects, which makes it a hassle to map them out and start reading them...

After seeing your comment I decided to click through and check out the readme. That's a great amount of documentation. Wish the codebases I work with would be so detailed.

This is a fantastic example; thanks for sharing!

I guess I'll be that guy and say that I highly doubt the author has taken more than a cursory glance at more than half his list. He certainly hasn't had the deep epiphanies he's implying from each of them.

Seriously, he expects people to believe he's evaluated the code of the Linux kernel, the Chrome browser, Postgres, LLVM, Tensorflow (just to name a few, less than half of his list), deeply enough to be able to make statements like "finest codebase in [x, y category] that I've seen", while also being the CTO of a company?

I never implied that I studied the Linux kernel codebase inside out -- specifically, I was interested in the networking stack and the VFS layer, and from there I looked into the memory subsystem and the implementation of certain drivers. I did map the codebase to the extent that I can more or less find my way around easily.

The same is true for Chrome. I was interested in the code for the various UI components, but I looked into other bits here and there.

I've studied most of the Postgres, LLVM and Tensorflow codebases - as in, I went through most if not all files looking for interesting bits (and finding plenty). I guess there's no way to "prove" anything to anyone, but I don't really care or want to do that either -- it's not about bragging rights, if that's what you implied; I just thought I'd share a list of codebases I came across that I thought were interesting and worthy of other people's time.

As for my job, and my work, you may want to check out my Github profile (https://github.com/markpapadakis) -- though there's only public stuff there. I guess I like to spend my free time learning instead of, say, watching TV or wasting time elsewhere :)

Can you also suggest how much time one can expect to spend on a particular codebase (which are listed) to get any meaningful take-aways?

I am interested in exploring codebases now that I have sufficient experience and feel confident (and not intimidated). This will be useful because then I can start from the smaller (simpler) ones and move on to the complex ones later.

If you have a recommended path, please do suggest it.

It depends on the size and what you are trying to do. For example, lately I've spent maybe a week studying Lucene, and kept going back to it every other day, because I needed to understand pretty much everything about it, to get ideas for improving Trinity (https://github.com/phaistos-networks/Trinity).

Some other codebases are so vast that it takes a lot longer to understand them well enough to feel 'comfortable' navigating them (e.g. the Unreal Engine codebase).

Most codebases, however, are quite small, and within an hour or a few you'll be able to understand where to go to find what you need, which are the primary data structures and functions, etc.

You can definitely skim codebases and get a feel for things like "no comments" or "names are in English", but for large codebases I doubt any single person could say they "understand" one in its entirety.

It's hard to even make an observation like "names are meaningful" without a lot of context.

At least in regard to the Linux kernel, the interesting thing about this code base is that you can study the different sections separately. For instance, the networking code, the VFS and the scheduler are quite isolated from other stuff. However, this also means that there is a variety of authors writing a variety of unrelated parts; in practical terms it is a non-uniform code base, so it's hard to have an evaluation that works for everything. Btw, some of the parts I liked less are the ones written by Linus :-) because he apparently tends to make things a bit implicit in certain cases. The VFS layer is an example.

Exactly. This isolation (though obviously they all make use of many core APIs) makes it easier to focus on one component at a time. I didn’t like the VFS layer that much ;)

I'm interpreting "finest codebase in [x, y category] that I've seen" to mean only what it says - out of the codebases he's seen, which may not be many, it's the finest.

I tend to run into new large codebases pretty frequently just from having to work with them. I'd say that happens maybe once every month or two for me, and I'm not even prioritizing this. I literally didn't know Ruby two weeks ago and now I've read through RSpec because I needed to submit a patch to some RSpec tests that were doing something unusual. For a CTO who's listing all the things he's seen his entire career, is two dozen codebases really that surprising?

I'll echo the sentiment. In our working life we don't get to know many codebases in depth. And the ones we do know, it's not at all obvious if they are relevant, or a regular disposable mess.

Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp


Classic expert systems, but IMHO not outdated. I think they will make a comeback soon once we understand how to integrate probabilistic reasoning, logic and connectionist approaches.

There was some discussion on the very same topic in an Ask HN[0]. Copying my comment from there: I've found the google/leveldb[1] source code, authored by Jeff Dean and Sanjay Ghemawat, to be immensely educational. The implementation of leveldb is similar in spirit to the representation of a single Bigtable tablet[2].

[0]: https://news.ycombinator.com/item?id=13854431

[1]: https://github.com/google/leveldb

[2]: http://research.google.com/archive/bigtable.html, section 5.3

Great achievement to have read so much code. Honestly, I felt a little depressed for not having read this much code.

As an aside, how do you read codebases? Do you read every single line to understand what's going on? Or do you just get a general idea of the design/architecture? What are the proven strategies for reading code bases?

If you are interested in reading codebases but find some of the larger projects intimidating, I suggest checking out Timothy Davis' sparse matrix library CSparse: (http://people.sc.fsu.edu/~jburkardt/c_src/csparse/csparse.ht...).

It is used internally for sparse matrix representations in Python, R, and Matlab. The entire library fits into 2100 lines of concise yet well documented C code. It is now mostly installed bundled with SuiteSparse, but the link above has the 2006 codebase from the original standalone library.

Back in high school and college I spent a lot of time reading code. I was on the Amiga, and remember reading Matt Dillon's stuff (a C compiler with library, the DME editor, and Dnet), Tim Budd's smalltalk, David Betz' "advsys" (which inspired me to read a book on parsing and automata), and other stuff that I've since forgotten.

These days, it's hard to find time between work and an absolute deluge of interesting stuff to study and play with.

This is great! We need more of these sorts of posts. There is so much the community can learn from excellent code bases. Thx for posting!

Start by looking at the unit tests; they will help you split the code base into units and understand them one by one.

Adding golang/go to the list. It's interesting to read how an actual language is implemented, and it's also fairly well documented (most of the documentation is extracted from the codebase, so all the crucial bits have to be there).

There are a few parts of the codebase that were automatically transpiled from C, but the rest is usually very readable.

I've always found thefuck to be a great Python code base which is nicely organized and easy enough to wrap your head around. Also an easy project to contribute to.


The codebase I inherited and now maintain is "interesting" too, but more like this:


I always look to requests in how python should be used:

