Ask HN: How to understand the large codebase of an open-source project?
188 points by maqbool on Feb 3, 2018 | 47 comments
Hello All!

What techniques do you use to learn and understand a large codebase? What tools do you use?

You can use the debugger on low-level API calls to get pretty much anywhere in the codebase. If you want to find what's changing a label to "foo", you can hook into every set_Text call and put a conditional breakpoint on all label changes to break on "foo", then just go up the call stack to find the logic. This strategy works on network interfaces and file interfaces as well. I abused this on our 2M+ SLOC legacy codebase and it has saved me many hours.
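For code where attaching a debugger is awkward, the same idea can be approximated in the program itself. This is only a sketch in Python, not the debugger workflow described above: `Label`, `set_text`, `handle_click` and `update_status` are hypothetical names invented for the illustration. The hook wraps the setter, checks the condition, and records the call stack when the value matches.

```python
import traceback


class Label:
    """Hypothetical stand-in for a UI label widget."""

    def __init__(self):
        self.text = ""

    def set_text(self, value):
        self.text = value


def break_on_value(cls, method_name, target):
    """Wrap cls.method_name so that calls passing `target` record the call stack."""
    original = getattr(cls, method_name)
    hits = []

    def wrapper(self, value):
        if value == target:
            # Function names on the stack, outermost first, without the wrapper itself.
            hits.append([frame.name for frame in traceback.extract_stack()[:-1]])
        return original(self, value)

    setattr(cls, method_name, wrapper)
    return hits


# Hypothetical application code that eventually sets the label.
def update_status(label):
    label.set_text("foo")


def handle_click(label):
    label.set_text("working")
    update_status(label)


hits = break_on_value(Label, "set_text", "foo")
handle_click(Label())
print(hits[0][-2:])  # the two innermost callers that led to the "foo" assignment
```

Only the call that sets "foo" is recorded; the earlier "working" assignment passes through untouched, which is the point of making the hook conditional.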

Also use version control to identify the most commonly edited files in the project. These are usually the files that are doing all the work (80/20 rule) and you likely need to know of them.

git log --pretty=format: --name-only | grep -v '^$' | sort | uniq -c | sort -rg | head -10
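If you would rather post-process the counts in a script, a rough Python equivalent might look like the following. The helper names `most_edited` and `git_name_only_log` and the sample log text are made up for this sketch; the git invocation is the same one as in the one-liner.

```python
import subprocess
from collections import Counter


def most_edited(log_output, top=10):
    """Count how often each path appears in `git log --name-only` output."""
    paths = (line.strip() for line in log_output.splitlines())
    # Blank lines separate commits, so they are skipped rather than counted.
    return Counter(path for path in paths if path).most_common(top)


def git_name_only_log(repo="."):
    """Run git in `repo` and return the touched paths, one per line."""
    return subprocess.run(
        ["git", "-C", repo, "log", "--pretty=format:", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout


# Made-up log output so the sketch runs without a repository.
sample = "src/core.py\nREADME.md\n\nsrc/core.py\n\nsrc/util.py\nsrc/core.py\n"
print(most_edited(sample, top=2))
```

On a real checkout you would call `most_edited(git_name_only_log("path/to/repo"))` instead of using the sample string.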

A variation on this is instrumenting a codebase with profiler flamegraphs, which I find a lot more straightforward to drill in/out with than stepping through functions one at a time.

I normally use manual profiling libraries - I need an excuse to try out orbit, which uses automatic instrumentation for similar purposes: https://www.youtube.com/watch?v=L8w0qI8qzvM

A little lower fidelity in some ways, but faster iteration than what I've been doing in others...

This isn’t abuse... it’s exactly how you’re supposed to use a debugger.

I suppose the difference is you'd normally use a debugger to find out why the code isn't doing what it's supposed to, rather than using it to find out what the code is supposed to be doing in the first place.

I don't consider it abuse either, though.

Agreed, and riffing on that a bit — I find the name "debugger" is actually troublesome when teaching newcomers (I work with kids of various ages).

I see the debugger much more like a "REPL for a compiled language" than a "bug removal tool". I try to teach people to think of it as an interactive inspection tool, not as (merely) a thing to fix broken programs.

"REPL for a compiled language", or "Binary REPL", I like that.

Besides that, it's times like these when I realise how useful IDEs are. Instead of needing to use grep (or something similar), I can simply right click on a variable and choose 'Find all references' (this is in VS, but I'm sure many of the leading IDEs will have this feature). When I use the command line it's to save myself time.

You can save even more time by pressing Shift + F12. CTRL + - goes really well with it (step back to previous code location). I'm a bit biased toward hotkeys as I'm using the fantastic vim extension, VSVim, so I barely ever have to take my hands off the keyboard. VS really is a great tool. Adding the Docker integration for dotnet core has really opened up deployment options for what was once a Windows only product for deploying to Windows only. It's still essentially a Windows only product (I've heard the Mac version isn't comparable), but you can deploy and debug in containers. Dotnet core is at v2 now, and seems stable enough to actually use in production (finally!). On a tangent here, but the point is it's a good time to be working with dotnet.

Appreciate the handy tip you tacked on at the end there. Thanks!

The straightforward way to understand code is to work with it. That is to say, look at open tickets and try to implement the suggested functionality or fixes (or some of your own ideas).

The analytic approach is a bit more awkward, since you have no specific goals and need to make them up yourself. So you could pose questions like how a specific behaviour of the application comes about ("why does it do that when this happens?", "how does it do X?") and then try to answer those comprehensively, systematically (a format that works well for me is short snippets of code interleaved with explanations and arguments).

A bottom-up approach is generally easier, because your questions will give you information at the bottom (like specific application messages), which are generally easy to find (ag, grep). A good IDE can be helpful for navigating the code and finding call sites, especially in projects written in dynamic languages where such analyses can become kinda annoying. (However, in more awkward code bases analysers like PyCharm are quickly overwhelmed and are unable to resolve indirections)

Top-down is in my experience less useful, because there are far too many choices on each level for most applications, and the first few layers are generally the least interesting and most arcane/fragile and difficult to follow along (things like initialization sequences).

The most difficult projects are typically those relying on multiple languages, code generation and runtime mutation (reflection, on-the-fly UI generation, overly dynamic Python code are typical examples). Another frequent obstacle is excessive abstraction and indirection (implementing something that could be done in a few lines of easy to understand and reason about C using multiple C++ templates spread out over a bunch of files and a healthy dozen of advanced language features is an almost archetypal example).

Use `tree | less` just to get an idea of where everything is and how it is structured and `tokei` to check out what the code actually is written in.

Then I try to go down the main code path of some examples or the primary binary if available and just check out how things are called/done around there.

Then run an example through callgrind and visualize the call graph in kcachegrind to get an idea of how often things are called and where and where the heavy lifting happens. That last step is optional and really depends on the type of project.
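callgrind and kcachegrind target native binaries. For a Python project, a comparable first look at "how often is what called" can be had from the standard-library profiler. This sketch swaps in cProfile for callgrind; the functions `parse`, `load` and `main` are hypothetical placeholders for a project's example program.

```python
import cProfile
import pstats


# Hypothetical example program standing in for the project's own example.
def parse(item):
    return item * 2


def load(n):
    return [parse(i) for i in range(n)]


def main():
    return sum(load(1000))


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, function name) to
# (primitive calls, total calls, own time, cumulative time, callers).
call_counts = {func[2]: entry[1] for func, entry in stats.stats.items()}
print(call_counts.get("parse"), call_counts.get("load"))
```

`stats.sort_stats("cumulative").print_stats(10)` prints the usual table if you prefer that over the dictionary.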

Then I use my code editor and lots of searching and call site lookups to get a better idea of how things are used.

This strategy takes time, but it works really well:

1. If you can, get an overview from a mentor. This will make the next steps a lot easier. Get: - history - philosophy - design style - high-level flow

2. Get a stack of white paper from the printer and put it in front of you along with a pen. Colored pens and scotch tape are a bonus (There may or may not have been a shortage of both printer paper and colored pens next to them when I started at my last job)

3. Open a debugger with a breakpoint on the first line of code

4. Pick a request flow and initiate a request. Let the debugger guide you through the entire request flow

5. Record the path of the flow as a sequence diagram on your paper

( BONUS ) Record the relationships between the components in the system in a class diagram
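If stepping through by hand gets tedious, the raw material for the sequence diagram in steps 4 and 5 can be collected automatically. The sketch below uses Python's `sys.setprofile` and assumes a Python codebase; `router`, `controller` and `query_db` are hypothetical stand-ins for a real request flow. You would still draw the diagram yourself, since drawing is the point of the exercise.

```python
import sys


def record_calls(func, *args):
    """Run func(*args) and return 'caller -> callee' edges in call order."""
    edges = []

    def profiler(frame, event, arg):
        # "call" fires for Python functions only; C-level calls are ignored here.
        if event == "call" and frame.f_back is not None:
            edges.append(f"{frame.f_back.f_code.co_name} -> {frame.f_code.co_name}")

    sys.setprofile(profiler)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    return edges


# Hypothetical request flow.
def query_db(user_id):
    return {"id": user_id}


def controller(user_id):
    return query_db(user_id)


def router(path):
    return controller(int(path.rsplit("/", 1)[-1]))


edges = record_calls(router, "/users/42")
for edge in edges:
    print(edge)
```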

Why does this work?

There's software out there for making these diagrams, so why draw them by hand? For most people, visual memory is the strongest. So, the idea is you use your strong visual and spatial memory to assist you in recalling random objects, facts. And hey, why not a codebase? And that’s why this works.

When you look at different files that the debugger guides you through, you are engaging your visual memory. You remember how the code is organized and what the files look like.

When you draw the sequence diagram you engage your spatial memory. E.g., the Router class doesn't interact with the Database class and so they are one sheet of paper apart. Visually, you can see what clusters of components work together to make larger structures. This allows you to mentally group the classes into a single concept.

The point is to get this information into your head, and not to produce a diagram on a piece of paper. If all you need is the latter, use software, of course. Seriously, when you're done, throw away the piece of paper; it will be outdated the next day anyway.

I'd like to point out that if it is truly large, then deep intimate knowledge of specific parts will never be truly had. Those who do know the entire project understand the flow and architecture of it but the details are blurred.

That's what I was thinking. My first job was dev on a C++ and Java project that was in the high hundreds of thousands of lines when I started, and grew from there. Client-side had about 5 core binaries, server side had about a dozen, and each of those was a sizeable file.

Starting to understand it involved reading our documentation on the data-flow between different components during operation, to know the purposes of the important binaries. For the really core components, we had a fair bit of documentation at the level of classes.

You'd usually end up learning the sections of particular programs that you worked on in great detail, the programs themselves as a whole in slightly less detail, getting fuzzier as you moved away from your areas of greatest experience.

To add, close mentorship with a core member of the team is one of the ways to get around this hurdle. With anything less formal than that, it is in many cases close to impossible (unless you have some very similar experience you can draw from).

This is one of the times when an IDE (hat-tip to Jetbrains, but many others are available) really comes into its own. Fast navigation facilitates understanding. If you really are a vim die-hard, "exuberant ctags" is an excellent tool in this area.

This is not open source specific, but it's what I do with any code base I am expected to understand and be productive with.

I usually start by running cloc and sloccount to get an idea of its metrics: languages, lines-of-code estimates, etc.

I progress to looking at the tests if there are any. They usually give an idea of how the authors expect things to work.

Once I have browsed some of the tests, in particular the integration tests, I start following how they work through the code. Your IDE of choice will help out here; failing that, use ripgrep, ack, the silver searcher, searchcode server (note I run this so I am biased), or sourcegraph.

One thing that I have found especially valuable is running something to determine the cyclomatic complexity of the code. Knowing which parts are complex is a good way to determine where you should focus your time.
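To make the idea concrete, here is a rough approximation of cyclomatic complexity for Python source using the standard `ast` module. It is a simplified count (one plus the number of decision points), not what any particular tool computes, and the `sample` source is invented for the illustration.

```python
import ast

# Node types counted as one decision point each. This is a simplification.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension)


def complexity(source):
    """Return {function name: approximate cyclomatic complexity}."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1
            for child in ast.walk(node):
                if isinstance(child, BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # "a and b and c" adds two extra paths.
                    score += len(child.values) - 1
            scores[node.name] = score
    return scores


# Invented sample source.
sample = '''
def simple(x):
    return x + 1

def tangled(items, flag):
    total = 0
    for item in items:
        if item > 0 and flag:
            total += item
        elif item < 0:
            total -= item
    return total
'''
print(complexity(sample))
```

Sorting the result by score gives a quick shortlist of functions worth reading slowly. Nested functions are also counted toward their enclosing function in this sketch.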

Here is what I don't do:

1. Read some code

2. Try to understand how it works

3. Repeat

Here is what I do:

1. Try to figure out what the code might be in advance using the information you have. (For example: I know nothing but the fact that it's a spreadsheet. Then figure out in your mind how the basics of a spreadsheet might work.)

2. Now read a little bit of the code. Compare with what you were thinking. If it matches, go to 3. If it doesn't match, figure out why by reading the code and by thinking more.

3. Repeat

Note the two processes are relatively similar because step 2 of the former process is a little bit like step 1 of the latter process. Just try to focus on figuring out first, reading second. Figure out first, read second. It's an active approach which makes you work more, and the more you work, the faster you go - or some benefits of that sort.

I actually wonder if people do that.

You have to be realistic about what you expect here. Suppose it takes you 1/100th of the time it took them to write the code to understand it. Some of these projects have had hundreds of people working on them for years. It may take you months to get a basic grasp of things.

At my current company it often takes 6 months for experienced people to become productive, and that's with a helping hand.

IDEs are nice, but grep remains the best tool. You tend to need to find things in XML files and config as well.

It depends heavily on code quality.

In general, I think that high quality code attracts high quality pull requests. That's because high quality code is easy to understand and easy to change because the core structure of the code is fundamentally sound and well suited to the problem that it is solving.

Run the examples to see what it does, try to build something with it to get a deeper understanding of the depth of its capability, then finally dive into the code itself by fixing a bug or adding a feature or even just playing around and changing stuff. Join the developer channels and ask questions. People usually love it when you show an interest in what they've built.

Assuming you know how to use the product: write down a path within the software that's intuitively familiar to you. Then follow that same path in the code, starting from main() or equivalent.

When you trace your own usage footsteps like this, it's often amazing how much goes on behind the scenes that you never realized.

Step by step, part by part, and IMHO it's no different than the onboarding process on any new project. I usually try first to understand the general idea of the whole project: where is what (the structure), a bird's eye view of the business logic and the supporting DB structures. And then with time you dive deeper in areas where work needs to be done. If the code is structured properly, usually you can start working on a few related parts without a need to know much about the rest of the system. And tests are there to give you the confidence to refactor and change things freely...

Find the entry point of the program then go from there. Use grep for function calls or event listeners. Get some background of the framework used, if there's any. Skim the issue tracker to add more perspective.

I use the program, find an idea of some detail to change or improve, and crawl my way into the code to achieve that.

Give this book a go: https://en.m.wikipedia.org/wiki/Code_Reading

I enjoyed reading it many years ago.

The biggest things that help me get started with any large codebase are this: First, use the software. What does it do? Learn it, learn what the buttons do, read the user docs, try to understand as much as you have time for. You can never hope to reason about the code behind something, if you don't understand what the code is trying to accomplish. From there, pick something to familiarize yourself with just a small portion of the codebase. This can be something from the issues or bugs list, or it could be some new feature that you want to or are told to add, or it could be something as simple as trying to figure out what the "correct" way would be to change the color of a button or background of a form. Be ready to throw away your work and start over multiple times as you learn the caveats of the codebase and read the other developers' code.

- git-extras has some nice features... "git summary" and "git effort". These commands show: most active users, most active files (by active days), etc.

- gource can be used to visualize the activity in a repository.

- sloccount and cloc can be used to count lines of code.

- For C/C++/C#, you can run Doxygen, and ask it to generate documentation for undocumented entities. This can give you another perspective on the code base.

- In runtime there are various tools you can use to audit what a program does... On Linux you've got strace, lsof, wireshark and many others... On Windows you've got Process Monitor from Sysinternals, as well as wireshark.

Primitive is a VR codebase visualizer tool. We use it to teach architecture for large open source projects: https://youtu.be/x6y14yAJ9rY

First steps for me are building and running the tests. Then I browse the code, look for what might interest me, explore some classes/etc that are intriguing, maybe refactor a bit to break the tests and fix them, etc

I find something in the UI, and then trace it back to the code. Do that a few times across features and you get a really solid start to where things are, which then starts to fill in the mental model blanks.

Ideally, you'd be able to:

* Read any developer contribution docs.

* Glean what info you can from the layout and naming of the source tree.

* Peruse the code and any comments and see what does what.

* Read the unit tests to see how things are expected to work.

* Peruse the issues list to see what's breaking.

* Try to get a feel for how the contributor(s) think by reading any public blog posts, etc.

If none of those approaches yield any insight, don't blame yourself; maybe instead look for a different OSS project to contribute to.

Check the documentation, if there is any. I've actually tried to add a small section on "where to start reading the code" to my larger projects. If it's a web application for example, you'd probably want to start where the routes are defined and go from there to whatever subsystem you're trying to modify or understand.

And if there's no documentation, write some! Another great way to learn, rubber duck it in writing.

My personal copy-paste summary of a similar topic on HN some time ago (https://news.ycombinator.com/item?id=9784008):

# Getting familiar with a new codebase

### Use the right tools

- grep, ack, ag, global search (Visual Assist)

- doxygen, javadocs

- sourcegraph, pfff (facebook), open-grok, SourceInsight

- Proper IDE, REPL

- chronon (dvr for java)

- SWAG (Software Architecture Group)

- Static code analysis

### Use the repository

- Find most relevant (frequently, recently edited) files

- Find dependency graphs

- Get basic information like which languages are used for what

- Use good source control so that you don't have to worry about breaking things

- Look at commits, in general or for specific issues

- Browse the directory structure, packages, modules, namespaces etc.

- Use "blame" to see when things changed

### Ask questions

- Talk to the customer, find out the purpose of the application

- Pair up with another developer who is more familiar with the code

### Read the documentation

- Look at use cases, diagrams describing architecture, call graphs, user docs

- Understand the problem domain

- Add more documentation as your knowledge grows

- Comments and docs might be wrong!

### Browse the code

- Skim around to get a general idea and a feeling for where things are

- Look at public interfaces, header files first

- Find out which libraries are used

- Take some important public API or function in the UI and follow the code from there. Find implementations of functions, dive into related functions and data structures until you understand how it's done. Then work your way back out.

- Use tools to quickly find declarations, definitions, references and calls of variables/functions/etc., usage patterns

- Find the entry point of the program

- Figure out the state machine of the program

- Focus on your particular issue

- Use a large, vertical screen with small font size with a pane to show file/class structure

### Take notes

- Use pencil and paper to write down summaries, relationships between classes, record definitions, core functions and methods

- Write a glossary: Function names, Datatypes, prefixes, filenames

- Document everything you understand and don't understand

- Use drawings to create a mental model

### Look at the data

- Find out how the data is stored in the database

### Build the project

- First make sure you can build it and run it

### Use the debugger, profiler and logging

- Set breakpoints, poke around the code, change variables, inspect local variables, stack traces, ...

- Watch the initialization process

- Start from main() and see where it goes

- Find hotspots with the profiler

- Set logging level to max/add logging and use the output to go through the code

### Edit the code

- Adopt the existing coding style

- Try to recreate and fix small bugs, make sure you understand the implications of the fix to the rest of the program first

- Tidy up the code according to the common standard after talking with the team

- Make the code clearer (best with tests)

- Add TODO comments

- Add comments describing what you think the code does

- Hack some feature into the code, then try to not break other stuff, build up a mental model over time, re-write the feature properly

### Use Tests

- Run the tests, make sure they are all passing

- Create new tests

- Browse the tests as a reference of examples

First, make sure you can build and run this code. Open Source is usually good about this. Next, pick a path and start tracing through the code. Let's say there's a GUI at the front and DB at the back. Find a simple form and trace the "save" button all the way back to the DB. Finally, just start making some small changes.

Pick something simple the program does and follow it through. Feel free to follow any side branches in the code that you come across as you read through. If there's one bit you're interested in, look at it with git (or whatever it's stored in) - see recent changes, and try and understand them.

The keyword here is "software maintenance". Search for software maintenance tools. There are tools which visualize the code base which should improve your understanding of the code.

Write the namespace outline out by hand on a whiteboard or a sheet of paper.

Use a static analyzer to build a graph of the codebase.

Build an adjacency list and a graph of the imports; and topologically + (…) sort.
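As a minimal sketch of the adjacency-list idea for a Python project: parse each module, keep only imports that point at other modules in the project, and let `graphlib` (Python 3.9+) produce a dependencies-first reading order. The three-module project below is hypothetical, and only the plain topological sort is shown.

```python
import ast
from graphlib import TopologicalSorter


def import_graph(modules):
    """Map each module name to the set of project-local modules it imports."""
    graph = {}
    for name, source in modules.items():
        deps = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                deps.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module.split(".")[0])
        # Drop standard-library and third-party imports, and self-references.
        graph[name] = {dep for dep in deps if dep in modules and dep != name}
    return graph


# Hypothetical project, given as {module name: source text}.
modules = {
    "app": "import db\nimport util\n",
    "db": "from util import log\nimport os\n",
    "util": "import sys\n",
}
graph = import_graph(modules)
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Reading modules in this order means each file only depends on files you have already seen. Import cycles make `static_order()` raise `graphlib.CycleError`, which is itself useful information about the codebase.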

If the project is not written in a framework with which I am already familiar, what I usually do is trying to find the application's entry point and start reading from there.

Try reading the test code, then write some tests yourself. It helps me a lot.

In C++ for the LibreOffice project I recently found an operator function I wanted swapped out for a GetColor() function.

I used =delete on the class function so that the compiler flagged every call site, which really helped!

Off the top of my head:

- Count the lines of code with find | wc, get a sense for what's there, and what language it's written in. The biggest file in the project is usually worth a look -- it is often where the "meat" is. Read the function names.

- Use the program. grep for strings that appear in the UI in the source code. That's a good place to start reading. Read function names.

- strace the program. What system calls does it make when? ltrace is also sometimes useful, although it also gives a ton of output.

- Look at header files. Understanding data structures is often easier than understanding code.

- Look at commit logs. Those are hidden "comments". And reading diffs can be easier than reading code.

- Do a "log" or "blame" on the file. How has it evolved?

- Start reading main(). This often reveals something about the structure of the program. Even just finding main() in many programs is a good exercise :) Sometimes it's a little hard to find.

- Make sure to build it. And if you can, look at the build system. How is it put together? Most build systems are pretty darn unreadable. I don't really know how to read autoconf, and GNU make is tough too. Forget about cmake :) But sometimes this can help.
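The first bullet (count lines, see which languages are present, find the biggest file) is easy to script when `find | wc` output gets unwieldy. A small Python sketch follows; the function name `survey` and the throwaway demo tree are assumptions made for the example.

```python
import os
import tempfile
from collections import Counter


def survey(root):
    """Return (line counts per file extension, path of the file with the most lines)."""
    per_ext = Counter()
    biggest = (0, None)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip VCS metadata
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8") as handle:
                    count = sum(1 for _ in handle)
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            per_ext[os.path.splitext(filename)[1] or "(none)"] += count
            biggest = max(biggest, (count, path), key=lambda pair: pair[0])
    return per_ext, biggest[1]


# Demo on a throwaway tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "main.c"), "w") as f:
        f.write("int main(void) {\n    return 0;\n}\n")
    with open(os.path.join(root, "util.h"), "w") as f:
        f.write("void log_msg(void);\n")
    per_ext, biggest = survey(root)
    biggest_name = os.path.basename(biggest)

print(dict(per_ext), biggest_name)
```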

I haven't gotten that far with this, but I tried uftrace recently and like it:


You can think of it like a dtrace that knows about every function in a C or C++ program.


I want to try some kind of code explorer thing. I saw this in a CppCon video and on HN:


And older ones like:


But somehow I get by with Unix tools. I think this is because I feel like building the project in a way to accommodate the source browsers might be a big pain.

Counterpoint: I think the hardest part of understanding a project is usually the build system :-) I don't have too much of a problem with reading C, C++, Python, or (sometimes) JS code. Volume is always a problem, but I can read a specific function pretty easily. But the build system is where things get ugly, in my experience.

Also, reading multi-threaded code requires some special consideration. grepping for every place that threads are started is a good idea.
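A scripted version of "grep for every place that threads are started" might look like this. The regular expression covers only a few common thread-creation idioms that I picked for the sketch (pthreads, `std::thread`, Python's `threading.Thread`, Java's `new Thread`); a real project may use wrappers that need their own patterns. The file contents are invented.

```python
import re

# Assumed patterns for a few common ways of starting threads; extend per project.
THREAD_START = re.compile(
    r"pthread_create\s*\(|std::thread\b|threading\.Thread\s*\(|new\s+Thread\s*\("
)


def find_thread_starts(files):
    """Return (filename, line number, stripped line) for each thread-start match."""
    hits = []
    for name, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if THREAD_START.search(line):
                hits.append((name, number, line.strip()))
    return hits


# Invented file contents, given as {filename: source text}.
files = {
    "worker.c": "int start(void) {\n    pthread_create(&tid, NULL, run, NULL);\n    return 0;\n}\n",
    "pool.cpp": "#include <thread>\nstd::thread t(run);\n",
    "math.c": "int add(int a, int b) { return a + b; }\n",
}
hits = find_thread_starts(files)
for name, number, line in hits:
    print(f"{name}:{number}: {line}")
```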

OpenGrok: it's very simple to set up and makes navigating a large codebase easy.

Review open patches

Go back to the first commit and work from there.
