
Ask HN: How to understand the large codebase of an open-source project? - maqbool
Hello All!

What techniques have you all used to learn and understand a large codebase? What tools do you use?
======
macromaniac
You can use the debugger on low level api calls to get pretty much anywhere in
the codebase. If you want to find what's changing a label to "foo" you can hook
into every set_Text call and put a conditional breakpoint on all label changes
to break on "foo", then just go up the callstack to find the logic. This
strategy works on network interfaces and file interfaces as well. I abused
this on our 2M+ SLOC legacy codebase and it has saved me many hours.
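
A minimal Python sketch of the same idea, for cases where no conditional breakpoint is at hand: wrap the low-level setter so it records the call stack whenever the watched value shows up. The `Label` class, `set_text` method and `update_status` function are hypothetical stand-ins, not any real toolkit's API.

```python
import functools
import traceback


class Label:
    """Hypothetical stand-in for a UI label widget."""

    def __init__(self):
        self.text = ""

    def set_text(self, value):
        self.text = value


def break_on_value(cls, method_name, wanted, hits):
    """Wrap cls.method_name so calls whose first argument equals `wanted`
    append the caller's stack to `hits`."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, value, *args, **kwargs):
        if value == wanted:
            # Drop the wrapper's own frame; use breakpoint() here instead
            # to stop interactively.
            hits.append(traceback.extract_stack()[:-1])
        return original(self, value, *args, **kwargs)

    setattr(cls, method_name, wrapper)


def update_status(label):
    # Hypothetical logic buried somewhere in the codebase.
    label.set_text("foo")


hits = []
break_on_value(Label, "set_text", "foo", hits)
label = Label()
label.set_text("bar")   # ignored: value does not match
update_status(label)    # recorded: value is "foo"

# The innermost recorded frame is whoever called set_text.
print(hits[0][-1].name)
```

Walking `hits[0]` from the end is the "go up the callstack" step.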

Also use version control to identify the most commonly edited files in the
project. These are usually the files that are doing all the work (80/20 rule)
and you likely need to know of them.

git log --pretty=format: --name-only | sort | uniq -c | sort -rg | head -10
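
If you'd rather post-process that output in a script, here is a rough Python equivalent of the counting step. The function names and the sample log text are made up for illustration. It also drops the blank separator lines that `--pretty=format:` leaves between commits, which the pipeline above may count as an entry of their own.

```python
import subprocess
from collections import Counter


def most_edited(log_text, top=10):
    """Count how often each path appears in `git log --name-only` output."""
    paths = (line.strip() for line in log_text.splitlines())
    # Skip the blank lines between commits.
    counts = Counter(path for path in paths if path)
    return counts.most_common(top)


def git_name_only_log(repo_dir="."):
    """Run the same git command as above and return its stdout."""
    result = subprocess.run(
        ["git", "log", "--pretty=format:", "--name-only"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical log output, so the sketch runs without a repository.
sample = "src/core.c\nREADME\n\nsrc/core.c\n\nsrc/core.c\nsrc/util.c\n"
print(most_edited(sample, top=1))
```

In a real checkout you would call `most_edited(git_name_only_log())`.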

~~~
to3m
This isn’t abuse... it’s exactly how you’re _supposed_ to use a debugger.

~~~
jstanley
I suppose the difference is you'd normally use a debugger to find out why the
code isn't doing what it's supposed to, rather than using it to find out what
the code is supposed to be doing in the first place.

I don't consider it abuse either, though.

~~~
_dps
Agreed, and riffing on that a bit — I find the name "debugger" is actually
troublesome when teaching newcomers (I work with kids of various ages).

I see the debugger much more like a "REPL for a compiled language" than a "bug
removal tool". I try to teach people to think of it as an interactive
inspection tool, not as (merely) a thing to fix broken programs.

~~~
ZenoArrow
"REPL for a compiled language", or "Binary REPL", I like that.

Besides that, it's times like these when I realise how useful IDEs are.
Instead of needing to use grep (or something similar), I can simply right
click on a variable and choose 'Find all references' (this is in VS, but I'm
sure many of the leading IDEs will have this feature). When I use the command
line it's to save myself time.

~~~
caseymarquis
You can save even more time by pressing Shift + F12. CTRL + - goes really well
with it (step back to previous code location). I'm a bit biased toward hotkeys
as I'm using the fantastic vim extension, VSVim, so I barely ever have to take
my hands off the keyboard. VS really is a great tool. Adding the Docker
integration for dotnet core has really opened up deployment options for what
was once a Windows only product for deploying to Windows only. It's still
essentially a Windows only product (I've heard the Mac version isn't
comparable), but you can deploy and debug in containers. Dotnet core is at v2
now, and seems stable enough to actually use in production (finally!). On a
tangent here, but the point is it's a good time to be working with dotnet.

------
blattimwind
The straightforward way to understand code is to work with it. That is to say,
look at open tickets and try to implement the suggested functionality or fixes
(or some of your own ideas).

The analytic approach is a bit more awkward, since you have no specific goals
and need to make them up yourself. So you could pose questions like how a
specific behaviour of the application comes about ("why does it do _that_ when
_this_ happens?", "how does it do _X_?") and then try to answer those
comprehensively, systematically (a format that works well for me is short
snippets of code interleaved with explanations and arguments).

A bottom-up approach is generally easier, because your questions will give you
information at the bottom (like specific application messages), which are
generally easy to find (ag, grep). A good IDE can be helpful for navigating
the code and finding call sites, especially in projects written in dynamic
languages where such analyses can become kinda annoying. (However, in more
awkward code bases analysers like PyCharm are quickly overwhelmed and are
unable to resolve indirections)

Top-down is in my experience less useful, because there are far too many
choices on each level for most applications, and the first few layers are
generally the least interesting and most arcane/fragile and difficult to
follow along (things like initialization sequences).

The most difficult projects are typically those relying on multiple languages,
code generation and runtime mutation (reflection, on-the-fly UI generation,
overly dynamic Python code are typical examples). Another frequent obstacle is
excessive abstraction and indirection (implementing something that could be
done in a few lines of easy to understand and reason about C using multiple
C++ templates spread out over a bunch of files and a healthy dozen of advanced
language features is an almost archetypal example).

------
Svenstaro
Use `tree | less` just to get an idea of where everything is and how it is
structured and `tokei` to check out what the code actually is written in.
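
If `tokei` isn't installed, a crude stand-in is easy to sketch in Python. This only counts raw lines per file extension (no comment or blank-line handling like the real tool), and the demo files are invented for the example.

```python
import os
import tempfile
from collections import Counter


def lines_by_extension(root):
    """Return a Counter mapping file extension -> total line count under root."""
    totals = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata in place so os.walk skips it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            ext = os.path.splitext(name)[1] or "(none)"
            try:
                with open(os.path.join(dirpath, name), "rb") as handle:
                    totals[ext] += sum(1 for _ in handle)
            except OSError:
                continue  # unreadable file: skip rather than abort the survey
    return totals


# Demo on a throwaway directory with two made-up files.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "main.c"), "w") as f:
        f.write("int main(void) {\n    return 0;\n}\n")
    with open(os.path.join(root, "build.py"), "w") as f:
        f.write("print('build')\n")
    demo_totals = lines_by_extension(root)

print(demo_totals.most_common())
```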

Then I try to go down the main code path of some examples or the primary
binary if available and just check out how things are called/done around
there.

Then run an example through callgrind and visualize the call graph in
kcachegrind to get an idea of how often things are called and where and where
the heavy lifting happens. That last step is optional and really depends on
the type of project.

Then I use my code editor and lots of searching and call site lookups to get a
better idea of how things are used.

------
georgecalm
This strategy takes time, but it works really well:

1\. If you can, get an overview from a mentor. This will make the next steps a
lot easier. Get:

\- history

\- philosophy

\- design style

\- high-level flow

2\. Get a stack of white paper from the printer and put it in front of you
along with a pen. Colored pens and scotch tape are a bonus (There may or may
not have been a shortage of both printer paper and colored pens next to them
when I started at my last job)

3\. Open a debugger with a breakpoint on the first line of code

4\. Pick a request flow and initiate a request. Let the debugger guide you
through the entire request flow

5\. Record the path of the flow as a sequence diagram on your paper

( BONUS ) Record the relationships between the components in the system in a
class diagram

Why does this work?

There's software out there for making these diagrams, so why draw them by hand?
For most people, visual memory is the strongest. So the idea is to use your
strong visual and spatial memory to help you recall random objects and
facts. And hey, why not a codebase? That's why this works.

When you look at different files that the debugger guides you through, you are
engaging your visual memory. You remember how the code is organized and what
the files look like.

When you draw the sequence diagram you engage your spatial memory. E.g., the
Router class doesn't interact with the Database class and so they are one
sheet of paper apart. Visually, you can see what clusters of components work
together to make larger structures. This allows you to mentally group the
classes into a single concept.

The point is to get this information into your head, and not to produce a
diagram on a piece of paper. If all you need is the latter, use software, of
course. Seriously, when you're done, throw away the piece of paper; it will be
outdated the next day anyway.

------
whb07
I'd like to point out that if it is truly large, then deep intimate knowledge
of specific parts will never be truly had. Those who do know the entire
project understand the flow and architecture of it but the details are
blurred.

~~~
khedoros1
That's what I was thinking. My first job was dev on a C++ and Java project
that was in the high hundreds of thousands of lines when I started, and grew
from there. Client-side had about 5 core binaries, server side had about a
dozen, and each of those was a sizeable file.

Starting to understand it involved reading our documentation on the data-flow
between different components during operation, to know the purposes of the
important binaries. For the really core components, we had a fair bit of
documentation at the level of classes.

You'd usually end up learning the sections of particular programs that you
worked on in great detail, the programs themselves as a whole in slightly less
detail, getting fuzzier as you moved away from your areas of greatest
experience.

------
gjvc
This is one of the times when an IDE (hat-tip to JetBrains, but many others are
available) really comes into its own. Fast navigation facilitates
understanding. If you really are a vim die-hard, "exuberant ctags" is an
excellent tool in this area.

------
boyter
This is not open source specific but what I do to any code base I am expected
to understand and be productive with.

I usually start by running cloc and sloccount to get an idea of its metrics:
languages, line-of-code estimates, etc.

I progress to looking at the tests if there are any. They usually give an idea
of how the authors expect things to work.

Once I have browsed some of the tests, in particular integration tests I start
following how they work through the code. Your IDE of choice will help out
here or failing that use ripgrep, ack, the silver searcher, searchcode server
(note I run this so I am biased), sourcegraph.

One thing that I have found especially valuable is running something to
determine the cyclomatic complexity of the code. Knowing which parts are
complex is a good way to determine where you should focus your time.
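
For a Python codebase, a rough version of that measurement fits in a few lines with the standard `ast` module. This is only an approximation of cyclomatic complexity (one plus the number of branch points it knows about), not what any particular tool computes, and the sample source is invented. Nested functions also count toward their enclosing function.

```python
import ast

# Node types treated as one extra decision point each (an approximation).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def complexity_by_function(source):
    """Return {function name: approximate cyclomatic complexity}."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1  # one path through a function with no branches
            for child in ast.walk(node):
                if isinstance(child, _BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # `a and b and c` adds two extra short-circuit paths.
                    score += len(child.values) - 1
            scores[node.name] = score
    return scores


sample = '''
def simple(x):
    return x + 1

def tangled(items, flag):
    total = 0
    for item in items:
        if item > 0 and flag:
            total += item
        elif item < 0:
            total -= item
    return total
'''

scores = complexity_by_function(sample)
# Highest score first: these are the functions to read most carefully.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(score, name)
```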

------
quadcore
Here is what I don't do:

1\. Read some code

2\. Try to understand how it works

3\. Repeat

Here is what I do:

1\. Try to figure out what the code might be in advance using the information
you have. (For example: I know nothing but the fact that it's a spreadsheet.
Then figure out in your mind how the basics of a spreadsheet might work.)

2\. Now read a little bit of the code. Compare with what you were thinking. If
it matches, go to 3. If it doesn't match, figure out why by reading the code
and by thinking more.

3\. Repeat

Note the two processes are relatively similar because step 2 of the former
process is a little bit like step 1 of the latter process. Just try to focus on
figuring out first, read second. Figure out first, read second. It's an active
approach which makes you work more, and the more you work, the faster you go -
or some benefits of that sort.

I actually wonder if people do that.

------
nitwit005
You have to be realistic about what you expect here. Suppose it takes you
1/100th of the time it took them to write the code to understand it. Some of
these projects have had hundreds of people working on them for years. It may
take you months to get a basic grasp of things.

At my current company it often takes 6 months for experienced people to become
productive, and that's with a helping hand.

IDEs are nice, but grep remains the best tool. You tend to need to find things
in XML files and config as well.

~~~
jondubois
It depends heavily on code quality.

In general, I think that high quality code attracts high quality pull
requests. That's because high quality code is easy to understand and easy to
change because the core structure of the code is fundamentally sound and well
suited to the problem that it is solving.

------
git_rancher
Run the examples to see what it does, try to build something with it to get a
deeper understanding of the depth of its capability, then finally dive into
the code itself by fixing a bug or adding a feature or even just playing
around and changing stuff. Join the developer channels and ask questions.
People usually love it when you show an interest in what they've built.

------
pavlov
Assuming you know how to use the product: write down a path within the
software that's intuitively familiar to you. Then follow that same path in the
code, starting from main() or equivalent.

When you trace your own usage footsteps like this, it's often amazing how much
goes on behind the scenes that you never realized.

------
ivanhoe
Step by step, part by part, and IMHO it's no different than the onboarding
process on any new project. I usually try first to understand the general idea of
the whole project, where is what (the structure), a bird's eye view of the
business logic and the supporting DB structures. And then with time you dive
deeper in areas where work needs to be done. If the code is structured
properly usually you can start working on a few related parts without a need
to know much about the rest of the system. And tests are there to give you the
confidence to refactor and change things freely...

------
phektus
Find the entry point of the program then go from there. Use grep for function
calls or event listeners. Get some background of the framework used, if
there's any. Skim the issue tracker to add more perspective.

------
NewEntryHN
I use the program, find an idea of some detail to change or improve, and crawl
my way into the code to achieve that.

------
nazri1
Give this book a go:
[https://en.m.wikipedia.org/wiki/Code_Reading](https://en.m.wikipedia.org/wiki/Code_Reading)

I enjoyed reading it many years ago.

------
camhenlin
The biggest things that help me get started with any large codebase are this:
First, use the software. What does it do? Learn it, learn what the buttons do,
read the user docs, try to understand as much as you have time for. You can
never hope to reason about the code behind something, if you don't understand
what the code is trying to accomplish. From there, pick something to
familiarize yourself with just a small portion of the codebase. This can be
something from the issues or bugs list, or it could be some new feature that
you want to or are told to add, or it could be something as simple as trying
to figure out what the "correct" way would be to change the color of a button
or background of a form. Be ready to throw away your work and start over
multiple times as you learn the caveats of the codebase and read the other
developers' code.

------
partycoder
\- git-extras has some nice features... "git summary" and "git effort". These
commands show: most active users, most active files (by active days), etc.

\- gource can be used to visualize the activity in a repository.

\- sloccount and cloc can be used to count lines of code.

\- For C/C++/C#, you can run Doxygen, and ask it to generate documentation for
undocumented entities. This can give you another perspective on the code
base.

\- In runtime there are various tools you can use to audit what a program
does... On Linux you've got strace, lsof, wireshark and many others... On
Windows you've got Process Monitor from Sysinternals, as well as wireshark.

------
atonalfreerider
Primitive is a VR codebase visualizer tool. We use it to teach architecture
for large open source projects:
[https://youtu.be/x6y14yAJ9rY](https://youtu.be/x6y14yAJ9rY)

------
RBerenguel
First steps for me are building and running the tests. Then I browse the code,
look for what might interest me, explore some classes/etc that are intriguing,
maybe refactor a bit to break the tests and fix them, etc

------
lbotos
I find something in the UI, and then trace it back to the code. Do that a few
times across features and you get a really solid start to _where_ things are,
which then starts to fill in the mental model blanks.

------
rodsenra
[https://pragprog.com/book/atcrime/your-code-as-a-crime-scene](https://pragprog.com/book/atcrime/your-code-as-a-crime-scene)

------
_greim_
Ideally, you'd be able to:

* Read any developer contribution docs.

* Glean what info you can from the layout and naming of the source tree.

* Peruse the code and any comments and see what does what.

* Read the unit tests to see how things are expected to work.

* Peruse the issues list to see what's breaking.

* Try to get a feel for how the contributor(s) think by reading any public blog posts, etc.

If none of those approaches yield any insight, don't blame yourself; maybe
instead look for a different OSS project to contribute to.

------
ioddly
Check the documentation, if there is any. I've actually tried to add a small
section on "where to start reading the code" to my larger projects. If it's a
web application for example, you'd probably want to start where the routes are
defined and go from there to whatever subsystem you're trying to modify or
understand.

~~~
kthejoker2
And if there's no documentation, write some! Another great way to learn,
rubber duck it in writing.

------
JustSomeNobody
First, make sure you can build and run this code. Open Source is usually good
about this. Next, pick a path and start tracing through the code. Let's say
there's a GUI at the front and DB at the back. Find a simple form and trace
the "save" button all the way back to the DB. Finally, just start making some
small changes.

------
trebligdivad
Pick something simple the program does and follow it through. Feel free to
follow any side branches in the code that you come across as you read through.
If there's one bit you're interested in, look at it with git (or whatever it's
stored in) - see recent changes, and try and understand them.

------
wibr
My personal copy-paste summary of a similar topic on HN some time ago
([https://news.ycombinator.com/item?id=9784008](https://news.ycombinator.com/item?id=9784008)):

# Getting familiar with a new codebase

### Use the right tools

\- grep, ack, ag, global search (Visual Assist)

\- doxygen, javadocs

\- sourcegraph, pfff (facebook), open-grok, SourceInsight

\- Proper IDE, REPL

\- chronon (dvr for java)

\- SWAG (Software Architecture Group)

\- Static code analysis

### Use the repository

\- Find most relevant (frequently, recently edited) files

\- Find dependency graphs

\- Get basic information like which languages are used for what

\- Use good source control so that you don't have to worry about breaking
things

\- Look at commits, in general or for specific issues

\- Browse the directory structure, packages, modules, namespaces etc.

\- Use "blame" to see when things changed

### Ask questions

\- Talk to the customer, find out the purpose of the application

\- Pair up with another developer who is more familiar with the code

### Read the documentation

\- Look at use cases, diagrams describing architecture, call graphs, user docs

\- Understand the problem domain

\- Add more documentation as your knowledge grows

\- Comments and docs might be wrong!

### Browse the code

\- Skim around to get a general idea and a feeling for where things are

\- Look at public interfaces, header files first

\- Find out which libraries are used

\- Take some important public API or function in the UI and follow the code
from there. Find implementations of functions, dive into related functions and
data structures until you understand how it's done. Then work your way
back out.

\- Use tools to quickly find declarations, definitions, references and calls
of variables/functions/etc., usage patterns

\- Find the entry point of the program

\- Figure out the state machine of the program

\- Focus on your particular issue

\- Use a large, vertical screen with small font size with a pane to show
file/class structure

### Take notes

\- Use pencil and paper to write down summaries, relationships between
classes, record definitions, core functions and methods

\- Write a glossary: Function names, Datatypes, prefixes, filenames

\- Document everything you understand and don't understand

\- Use drawings to create a mental model

### Look at the data

\- Find out how the data is stored in the database

### Build the project

\- First make sure you can build it and run it

### Use the debugger, profiler and logging

\- Set breakpoints, poke around the code, change variables, inspect local
variables, stack traces, ...

\- Watch the initialization process

\- Start from main() and see where it goes

\- Find hotspots with the profiler

\- Set logging level to max/add logging and use the output to go through the
code
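
As a small illustration of "start from main() and see where it goes" without stepping by hand, here is a Python sketch that records every Python-level function call made while a function runs, using `sys.setprofile`. The `main`, `load_config` and `render` functions are hypothetical placeholders for a real program.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions called)."""
    calls = []

    def profiler(frame, event, arg):
        # "call" fires for Python-level functions; C functions use "c_call".
        if event == "call":
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always restore, even if func raises
    return result, calls


def load_config():
    return {"name": "demo"}


def render(config):
    return f"hello {config['name']}"


def main():
    return render(load_config())


result, calls = trace_calls(main)
print(calls)
```

On a real program the list gets long quickly, so filtering by `frame.f_code.co_filename` to keep only the project's own files is usually worth adding.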

### Edit the code

\- Adopt the existing coding style

\- Try to recreate and fix small bugs, make sure you understand the
implications of the fix to the rest of the program first

\- Tidy up the code according to the common standard after talking with the
team

\- Make the code clearer (best with tests)

\- Add TODO comments

\- Add comments describing what you think the code does

\- Hack some feature into the code, then try to not break other stuff, build
up a mental model over time, re-write the feature properly

### Use Tests

\- Run the tests, make sure they are all passing

\- Create new tests

\- Browse the tests as an example reference

------
irundebian
The keyword here is "software maintenance". Search for software maintenance
tools. There are tools which visualize the code base which should improve your
understanding of the code.

------
westurner
Write the namespace outline out by hand on a whiteboard or a sheet of paper.

Use a static analyzer to build a graph of the codebase.

Build an adjacency list and a graph of the imports; and topologically + (…)
sort.
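
For a Python project, the adjacency list and a plain topological sort can be sketched with the standard library alone. This is one possible reading of the steps above, not a specific tool: the module names and sources are made up, relative-import levels are ignored, and `TopologicalSorter` raises `CycleError` if the imports are circular.

```python
import ast
from graphlib import TopologicalSorter


def import_graph(sources):
    """Map each module name to the set of project modules it imports.

    `sources` is {module name: source text}; imports of anything outside
    that dict (stdlib, third-party) are ignored.
    """
    graph = {}
    for module, text in sources.items():
        deps = set()
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        graph[module] = {d for d in deps if d in sources and d != module}
    return graph


# Hypothetical three-module project.
sources = {
    "app": "import db\nimport views\n",
    "views": "from db import query\nimport os\n",
    "db": "import sqlite3\n",
}
graph = import_graph(sources)

# Dependencies come first, so this is a reasonable reading order.
order = list(TopologicalSorter(graph).static_order())
print(order)
```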

------
lufte
If the project is not written in a framework with which I am already familiar,
what I usually do is trying to find the application's entry point and start
reading from there.

------
viach
Try reading the test code, then write some. It helps me a lot.

------
chris_wot
In C++ for the LibreOffice project I recently found an operator function I
wanted swapped out for a GetColor() function.

I used =delete on the class function so that every remaining call site became
a compile error, which really helped!

------
chubot
Off the top of my head:

\- Count the lines of code with find | wc, get a sense for what's there, and
what language it's written in. The biggest file in the project is usually
worth a look -- it is often where the "meat" is. Read the function names.

\- Use the program. grep for strings that appear in the UI in the source code.
That's a good place to start reading. Read function names.

\- strace the program. What system calls does it make when? ltrace is also
sometimes useful, although it also gives a ton of output.

\- Look at header files. Understanding data structures is often easier than
understanding code.

\- Look at commit logs. Those are hidden "comments". And reading diffs can be
easier than reading code.

\- Do a "log" or "blame" on the file. How has it evolved?

\- Start reading main(). This often reveals something about the structure of
the program. Even just finding main() in many programs is a good exercise :)
Sometimes it's a little hard to find.

\- Make sure to build it. And if you can, look at the build system. How is it
put together? Most build systems are pretty darn unreadable. I don't really
know how to read autoconf, and GNU make is tough too. Forget about cmake :)
But sometimes this can help.

I haven't gotten that far with this, but I tried uftrace recently and like it:

[https://github.com/namhyung/uftrace](https://github.com/namhyung/uftrace)

You can think of it like a dtrace that knows about every function in a C or
C++ program.

\-----

I want to try some kind of code explorer thing. I saw this in a CppCon video
and on HN:

[https://www.sourcetrail.com/](https://www.sourcetrail.com/)

And older ones like:

[https://www.sourceinsight.com/](https://www.sourceinsight.com/)

But somehow I get by with Unix tools. I think this is because I feel like
building the project in a way to accommodate the source browsers might be a big
pain.

Counterpoint: I think the hardest part of understanding a project is usually
the build system :-) I don't have too much of a problem with reading C, C++,
Python, or (sometimes) JS code. Volume is always a problem, but I can read a
specific function pretty easily. But the build system is where things get
ugly, in my experience.

Also, reading multi-threaded code requires some special consideration.
grepping for every place that threads are started is a good idea.

------
jeremiem
OpenGrok, it's very simple to set up and makes navigating a large codebase
easy.

------
jjirsa
Review open patches

------
kristianov
Go back to the first commit and work from there.

