Ask HN: How do you read and understand an open-source project?
105 points by amerf1 on July 6, 2019 | 24 comments
Looking for different approaches

Would love to hear your experiences

First, RTFM. If there's documentation, I read it. README, CONTRIBUTING, whatever is available really.

Then, I start hunting through the codebase. Sometimes, despite people's best efforts at compartmentalization, there are one or two files that are the heart of the project. Which files those are depends on the project. For instance, TypeScript has checker.ts, which contains the core typechecking logic. Ruby has vm.c, compile.c and parse.y. If that's the case, that's actually very helpful, as I can spend the majority of my time in one file.

To aid in this hunting, I use a few tools. Stuff like grep and find (although I prefer ripgrep and fd) is a huge help, because you can search through large codebases with relative ease. IDEs are great too. I particularly like being able to go to a definition, then go back, then go forwards, etc. Switching between call site and definition makes understanding functions easier.

I take notes on occasion, although I don't always reference them. It's more to process what I'm reading. I try to write notes about types, functions and files. I do it in org mode and embed urls so that I can link definitions together.

Definitely run the code as soon as possible. Then add print statements and see where they go. I've used flamegraphs on occasion to see the stack trace.
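
If the project happens to be Python, the print-statement approach can be partly automated. A minimal sketch, assuming a standard CPython interpreter: `sys.settrace` installs a hook that fires on every Python-level function call, so the call order can be logged without editing the code. `record_calls`, `parse` and `tokenize` are made-up names for illustration, not part of any project mentioned here.

```python
import sys


def record_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions called, in order)."""
    calls = []

    def tracer(frame, event, arg):
        if event == "call":
            calls.append(frame.f_code.co_name)
        return None  # no line-by-line tracing inside the called function

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever hook was there before
    return result, calls


# Hypothetical stand-ins for the code being explored.
def parse(text):
    return tokenize(text)


def tokenize(text):
    return text.split()


result, calls = record_calls(parse, "a b c")
print(calls)  # ['parse', 'tokenize']
```

Calls into C-implemented functions such as `str.split` do not show up, since only Python frames are traced.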

If the project is small (just a handful of files), then I will just jump directly to the code.

But if it's medium to large (several folders, tens or hundreds of files), then I prefer "working" with the code rather than just reading it.

The process usually follows this order:

- Check the documentation for how to build and run the tests. Make sure the tests pass before doing anything else.

- Once the tests pass, go back to the documentation and look for the program entry point (if it builds into a binary) or the exposed interface (if it's a library). Skim through that to get the overview.

- Load the project into my IDE, and try debugging the tests to understand the flow. Sometimes, I'll also write new code to check my understanding.

That said, if the goal of reading that OSS project is to hunt down a bug, I would just start from my own code and debug into the library itself, skipping the overview part.

After glancing over the README here are a few of the other things I take a look at.

The "contributors" tab on GitHub helps get an idea for the health of the project: how long has it been actively maintained? Who did most of the work? https://github.com/dgraph-io/dgraph/graphs/contributors

For software libraries, I like looking at the brand new "used by" tab - https://github.com/huge-success/sanic/network/dependents - in addition to indicating project health it's also a great source of examples to look at later when I'm trying to figure out how to do advanced things with the library.

I love reading through the CI configuration - .travis.yml or .circleci/config.yml - because at the very least it shows what it takes to run the test suite, and often I'll pick up some fun new CI and automation tricks.

I use GitHub code search extensively: sometimes for searching within the project, but I'll also search the whole of GitHub for examples of people doing something I want to do with the library: https://github.com/search?q=sanic+cookies&type=Code

Good tip about GitHub's code search. I always forget it's a thing and don't use it nearly as much as I should.

I read it like a book and Google all the tooling I don't understand. Usually the first page will be something like a Makefile. Understanding every detail of this file will reveal the many hacks needed to get the project to run, and that tells you a lot about the project.

After that, I look for the entry point and try to get a sense of how the code files have been organised. If it's good code, it will look like a curated library of small functions with few inter-module dependencies. It should be easy to add and remove functions without having to change countless files.

From this, you can start thinking of improvements and how to add them to the existing software. It's a lot easier if it's only for your use as you can get by with something that 'works' rather than something that 'works well.'

Seconding what another poster here said too: if there are too many files, it's easier to work with the code than to try to read all of it.

Following one feature through the code base can be one interesting way, especially if that feature is something you understand from previous experience. E.g. follow the life of a packet or a block read, say in a kernel, or pick one particular library call somewhere and just follow the path it takes. You do have to be a little wary that you might be following an unloved/old part of the project that needs work, so you might not be learning the ways that they want new contributions to be written.

This is pretty much what I do. Very few people do literate programming, so it's hard to read code like a book. Instead, I think, "How did they do X" and try to find out. This introduces you to the code. Another interesting way to read code is to ask yourself the question, "If I wanted to add feature X, what would I have to do"? This allows you to peruse the structure of the code. I find that searching code is much, much easier than reading it and once I've answered my questions, I usually have a better understanding.

The last way I read code is to add a unit test. Most code is poorly structured for unit testing (by "unit test" I mean testing some function with real collaborators -- the real collaborators are important for this exercise). Then I try to refactor the code so that it's easy to write the unit test. Just trying to create the collaborators usually leads to a lot of insight into the code (and is one of the reasons why I recommend to people that they avoid over-mocking their tests -- but that's a story for another time). You may not succeed in writing your test, but you will almost certainly understand how the code fits together and where the smooth and rough parts are.

In general:

1. Find key data structures, guess at their purpose, and examine frequently-called functions that operate upon them

2. Outline program entry points: command line, API calls, RPC, etc.

3. Identify major systems and interfaces, especially any module management code
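
For a Python codebase, steps 1 and 2 can be roughed out mechanically. A sketch under that assumption, using the standard `ast` module: list the classes (candidate data structures) and count how often each function name is called. The sample source and the names `Node` and `visit` are invented for illustration.

```python
import ast
from collections import Counter


def defined_classes(source):
    """Names of all classes defined in the given Python source text."""
    return [node.name for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.ClassDef)]


def call_frequencies(source):
    """Count call sites per function or method name (by name only, no type info)."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):         # plain call: visit(x)
                counts[target.id] += 1
            elif isinstance(target, ast.Attribute):  # method call: obj.visit(x)
                counts[target.attr] += 1
    return counts


# Invented sample standing in for a real project file.
SAMPLE = """
class Node:
    pass

def visit(node):
    return node

visit(Node())
visit(Node())
"""

print(defined_classes(SAMPLE))                   # ['Node']
print(call_frequencies(SAMPLE).most_common(2))
```

Because the count goes by name alone, methods that share a name across different classes are lumped together, so treat the ranking as a hint about where to look rather than a measurement.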

I usually get to the code when I have a purpose beyond merely understanding the project, like wanting to make a change or just understand why a certain behavior is like it is.

After I've downloaded the project, I'll think of a few words that are related to what I want. For example, in the program "sweep", an audio editing program, there's currently a bug where the arrow in the horizontal ruler gets redrawn, but the horizontal ruler as a whole doesn't. That causes overlapping drawings of the arrow to be drawn in a similar fashion as when a program with a window freezes and you move another window over it.

So, I'll think of the words "arrow", "cursor", "ruler", "point", etc. and grep for them in the code. The -C option of grep/ag is awesome for this, too. I'll look over the matches and visit the ones that look the most relevant to what I'm looking for.
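
To make concrete what -C buys you, here is a rough Python sketch of the same idea: find lines matching any of the keywords and keep a few lines of context around each hit. `grep_context` and the sample lines are hypothetical; real grep/ag also handle files, separators and much more.

```python
import re


def grep_context(lines, pattern, context=2):
    """Return (line_number, text) for each match plus `context` lines around it."""
    regex = re.compile(pattern)
    keep = set()
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(0, index - context)
            stop = min(len(lines), index + context + 1)
            keep.update(range(start, stop))
    # Line numbers are 1-based, as in grep -n output.
    return [(index + 1, lines[index]) for index in sorted(keep)]


# Invented lines standing in for a source file.
SOURCE = [
    "void redraw(void) {",
    "    clear_area();",
    "    draw_arrow(x);",
    "    flush();",
    "}",
    "int unrelated;",
]

for number, text in grep_context(SOURCE, r"arrow|cursor|ruler|point", context=1):
    print(number, text)
```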

This is easier when what you want is logically near text that the program outputs or that you otherwise know must be in the code. That way, you don't need to guess. For example, to modify gnupg so that you can change the directory it uses for socket files with an environment variable, you can just grep for something like the filename of a certain socket like "S.gpg-agent" and look at the code for where the directory path comes from. That's pretty much guaranteed to quickly take you to where you want to go.

grepping is awesome. It's simple and works with every language.

I get it up and running, build something trivial, read the API a bit, and maybe try running the test suite. You can learn a fair amount of stuff just by using a thing. Then if it's remotely interesting, I start the RTFM slog (slog meaning I don't really learn efficiently from reading). I also like to skim the Changelog, and fish for high quality video summaries on YouTube.

A recent example: I was curious how to write a PostgreSQL client. I used a PG client, skimmed the source code to see how it was wired up, read the public API docs, and then watched this video on the PG wire protocol... https://youtu.be/qa22SouCr5E.

If you're trying to learn and don't have much of a foundation for evaluating an open source project, try building it from scratch. When I wanted to better understand the DSL in a testing framework, I implemented a basic testing library of my own. It was really informative.

What is the git command to find the most frequently changed files? (Apart from compiled or temp files one forgot to gitignore.)

git log I guess, assuming you're on the most recently pushed-to branch
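
One way to turn `git log` into an actual ranking, as a sketch: `git log --name-only --pretty=format:` prints just the paths touched by each commit, and counting those lines gives a rough churn figure per file. The function names and the list of ignored suffixes below are assumptions for illustration; adjust the suffixes to whatever build debris the repo actually contains.

```python
import subprocess
from collections import Counter


def count_changes(log_output, ignore_suffixes=(".o", ".pyc", ".tmp")):
    """Count how many commits touched each path in `git log --name-only` output."""
    counts = Counter()
    for line in log_output.splitlines():
        path = line.strip()
        # Blank lines separate commits; skip them and any ignored build artifacts.
        if path and not path.endswith(ignore_suffixes):
            counts[path] += 1
    return counts


def most_changed_files(repo=".", top=10):
    """Run git in `repo` and return the `top` most frequently changed paths."""
    completed = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    )
    return count_changes(completed.stdout).most_common(top)
```

Calling `most_changed_files(".")` inside a clone returns a list of `(path, commit_count)` pairs for the branch that is currently checked out.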

The only approach I've found for understanding any non-trivial open source project is to have a specific goal of what you want to do.

Once you have a goal, it feels so much easier to understand anything, because you narrow down the scope and you have some keywords you can grep the code for.

Example: let's say you found a Redis driver for your language. Now Redis 6 includes some new commands that you want to add immediately instead of waiting for your driver. Now you know to search for a similar command (grep the heck out of it) and try adding breakpoints or just printf to see where the code path is.

I enjoy reading open source code and publish a newsletter[0] with a section called "Code to read" that has some repos you can try to read to see how they do things.


[0] https://betterdev.link

If you're starting from scratch, as other users have mentioned, the READMEs, CONTRIBUTING, etc. are all good sources.

After that, take a peek into the Issues. Many people open bugs and/or create pull requests to resolve something. It's quite plausible that you can glean a lot of information about what's going on behind the scenes from these, especially if they're open-source projects with a lot of public consumption.

If it's in a language I don't understand (presumably because I have never had the need to use it), I'll try to write the basic "Hello, world!" apps or something slightly more complicated, just to get the gist of the language. That helped a lot with Rust, for example.

Start with the available documentation.

After that, getting an abstract overview over the packages/modules and their responsibilities is key IMO. It helps you understand the structure and how the logic is tied together.

Developers often stay away from contributing to open source projects because of the initial hurdle of understanding a large code base that has grown over time and was written by someone else.

My team and I are currently building a developer productivity tool. The goal is to help developers grasp code quicker with visuals / graphs. If anyone’s interested — preferably OSS maintainers, contributors — feel free to reach out. We would love to get some feedback.

One trick I've learned is to browse through recent commits. Aside from recent activity, it's also a shortcut to where the source files are located. Some larger projects turn into a jungle of directories.

Adding to all that's been said, the first thing I do is count the lines of code using cloc and look at the contributors list. There is generally one key person who has the most commits, so it's good to know that person's philosophy and style. For example, for reading the redis source code, I learned a lot from its creator's blog posts and the redis manifesto (http://oldblog.antirez.com/post/redis-manifesto.html). By the way, I love Redis' style.
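
If cloc isn't installed, a crude stand-in is easy to sketch. This is only an approximation under stated assumptions: it counts raw lines per file extension, skips hidden directories such as .git, and does not separate code from comments or blank lines the way cloc does. `count_lines_by_extension` is a made-up helper name.

```python
import os
from collections import Counter


def count_lines_by_extension(root):
    """Total raw line count per file extension under `root`."""
    totals = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories (.git, .venv, ...) in place so os.walk skips them.
        dirnames[:] = [name for name in dirnames if not name.startswith(".")]
        for name in filenames:
            extension = os.path.splitext(name)[1] or "(none)"
            try:
                with open(os.path.join(dirpath, name), "rb") as handle:
                    totals[extension] += sum(1 for _ in handle)
            except OSError:
                continue  # unreadable file: ignore rather than abort the walk
    return totals
```

Something like `count_lines_by_extension(".").most_common(5)` gives a quick feel for which languages dominate a checkout. Binary files are counted too, so odd extensions near the top are worth ignoring.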

Visual debuggers are really nice. PyCharm's is mostly trivial to set up. One thing I've learned in debugging the framework I'm currently using (https://ckan.org/) is that even if I can trace code execution, legacy code might make no sense at all if the reasoning for a block isn't explained.

For me, if there's not comprehensive documentation, I will not try to understand it. I'm not smart enough to grasp it quickly and not willing to dedicate the time to doing so. I'm forever in debt to those who have the time and patience to make that documentation possible because I wouldn't have a career without them

1. Read/skim the API documentation to get the shape of the project

2. Pick an entry point to dig into, and read the code on GitHub. Octolinker is a lifesaver: https://octolinker.now.sh/

If I'm getting to the source, it's generally because I need to add/tweak a feature, so the first step generally involves a `grep` through the source tree, then pulling the string from there.

What I am missing in almost any project is a second README (the first should be an introduction to the project) where architecture and design decisions are discussed.

To use it? To find out how it works? To begin contributing to it? Different approaches to all these.

1. Clone the source

2. Try to build it

3. If there are tests, try to run them

Just from doing this many times over, I learned a lot about programming.

Also fork the repo, then do `git clone <url>` for the original repo, `cd repo`, then do `git remote add yourusername <yourgithubforkurl>`

To ease the above process, I created vcspull: https://vcspull.git-pull.com

Here is an example of my vcspull file: https://github.com/tony/.dot-config/blob/master/.vcspull.yam...

This also helps with studying, reading code, and doing open source in general, since it's easy to set up the original repo and the fork.

For generic open source: download the source and check README.md/rst to see if there are testing/development instructions. Check the commands in .travis.yml; they show which packages and steps are used to build and probably test the code.

If it's node: do `npm install` and check the "scripts" in the package.json. Those commands can be run like "npm run <task>"

If it has CMakeLists.txt, it uses CMake. Download and install cmake, then do `cmake .`. cmake will let you know if you're missing libraries and those are easy to google package names for. Then `make && [sudo] make install`

If it has Makefile.am/autogen.sh... download and install autotools/autoconf/automake. Run `./autogen.sh`, then `./configure` (google for package names of any libraries that show missing headers, .h, or symbols). Then `make && [sudo] make install`

If it is python, and there's a Pipfile, download and install pipenv. Then do `pipenv install .`. If it has requirements.txt, `pip install -r requirements.txt`

Carried forward, if the project has anything resembling a package manifest (e.g. Gemfile, composer.json), google it to find the appropriate package manager for your OS. That gets you 75%-100% of the way to running locally a lot of the time.
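
The "look for a marker file, then run the matching commands" routine above can be written down as a small lookup. A sketch only: the table repeats the commands from this comment and nothing more, `suggest_build_steps` and `suggest_for_directory` are hypothetical helper names, and a real project's README always takes precedence.

```python
from pathlib import Path

# Marker file -> commands, as listed in the comment above.
BUILD_HINTS = [
    ("package.json", ["npm install", "npm run <task>"]),
    ("CMakeLists.txt", ["cmake .", "make"]),
    ("autogen.sh", ["./autogen.sh", "./configure", "make"]),
    ("Pipfile", ["pipenv install ."]),
    ("requirements.txt", ["pip install -r requirements.txt"]),
]


def suggest_build_steps(filenames):
    """Map each recognised marker file in `filenames` to its suggested commands."""
    present = set(filenames)
    return {marker: steps for marker, steps in BUILD_HINTS if marker in present}


def suggest_for_directory(path="."):
    """Same as suggest_build_steps, but reads the names from a directory."""
    return suggest_build_steps(entry.name for entry in Path(path).iterdir())


print(suggest_build_steps(["README.md", "CMakeLists.txt", "requirements.txt"]))
```

A project with several markers (say, a Python package with a C extension built by CMake) gets several entries back, which matches how such builds tend to need more than one tool.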

For the more complex projects, they typically have dedicated setup instructions and sometimes very detailed overviews (e.g. https://github.com/OpenTTD/OpenTTD/blob/master/docs/Readme_W..., https://devguide.python.org/setup/, https://www.kernel.org/doc/html/latest/process/index.html)

Reading the source of your favorite interpreted programming language can be rewarding, e.g. https://github.com/python/cpython
