
Ask HN: What is the best practice for reading large pieces of C code? - AlbertJWilliams
How do I read a large piece of code that has a lot of header files and source code files? Is there any method or is it just about reading it one by one? How do you do it efficiently?
Thanks in advance.
======
z3phyr
The basic steps to follow are:

1) Make a note of all the files. Well-named files tell you what they do. If
they do not, then it's probably in the comments. It's important that you know
what each file is for.

2) Find the entry point (main, WinMain, _main_)

3) Make a note of all the major macros, structs and functions you see. Note
them down. Do it recursively as you go from main towards the other defs.

4) (optional) Read the changelog and the documentation if available.

5) It will be easier to keep track of what you are reading if you note it
down. Then go over and read, and keep making notes. Keep detailed but easy to
find notes.

6) Please keep making notes.

7) Make notes as you read

8) Did I stress making notes?

This method helped me a lot :) Happy Hacking

~~~
ryanwaggoner
Strongly agree with everything here, except perhaps the bit about notes.

~~~
baud147258
Not enough notes for your taste?

------
hevi_jos
The most efficient way to read source code is not to read it. :-D

That means you have a very clear idea of what you want from the code. You
spend the minimal amount of energy needed to understand the forest before you
look at the trees, and you only spend time on the trees that you need.

What I do is not reading. I execute parts of the code and debug it, so I get
a "feeling" for what the code does before I know exactly how it does it.

If something moves like a dog, barks like a dog, eats like a dog and wags its
tail like a dog... odds are it is a dog. Humans identify things by global
behavior much better than by painfully studying a hair or something.

There is code out there that is not worth reading, as it is so complex. Take
TeX, for example: it was written to be so efficient on old machines that it is
almost impossible to read without spending a tremendous amount of time. But
understanding how it behaves is easy.

I use my own visual tools to inspect and debug the code visually at a high
level. For example, instead of just a number I can "see" a point, which is 3
coordinates, or a polygon, which is three points and a normal vector. Without
these tools I would be seeing individual numbers (or nodes in a tree) and
having to reconstruct the whole in my mind (not good).

Those visual tools are created in a fast prototyping environment (Dear ImGui)
with debugging scripts in order to see more globally.

------
mikekchar
The easiest way I've found is to try to find out how a specific piece of
functionality works. Try to modify it in some small way -- that way you can
test your understanding. It's a bit like putting together a big puzzle. You
just take it one piece at a time -- but try to break it up by functionality so
that you are forced to see how everything fits together. Often it's great to
try to write tests or fix bugs. Again, this gives you some focus. But don't
expect to learn it quickly. It takes time for things to sink in.

~~~
krishoog
In addition: if other people are working on the code base, I find it very
insightful to first read what task/issue they are working on and then look at
the commit they made. I then try to see if I understand why the commit
implements the described task or solves the issue.

You can do the same if nobody is actively working on the code but you have the
history in source control.

------
DanielBMarkham
At first you really don't want to understand the code, you want to understand
what it _does_. Nobody can suck in huge codebases and make sense out of them.

So pick some various things the program does and walk through them in the
debugger. Do you understand what's happening? Why? Is there a clear pattern at
work for how things are constructed, used, torn-down?

Do that a few times with various behaviors. A good place to start is whatever
the "init" functionality is, since everything else will use that.

After you're comfortable with the patterns you're seeing, begin to move
"vertically", looking at other methods in the module near the one you're
interested in. Are there patterns to the way the modules are organized? Does
it make sense?

At some point, you'll have a good idea of whether or not you're in a
strongly-patterned architecture. Strongly-patterned architectures may be huge
and complex, but a small number of patterns can take you far. If not? You got
a mess. Then you have a new question: how do I deal with this mess in front of
me?

ADD: An important point to understand about really complex and poorly written
codebases is that _they might not be understandable_, at least overall. At
some point when you're writing bad systems, you have a magic act, not a
program.

------
daniel-levin
Compile it with debugging symbols and run it with gdb in tui mode. This will
let you place breakpoints in source files, view back traces, find the
definitions of functions, structures, macros, etc. The learning curve is a tad
steep, but this video is pretty good:

[https://m.youtube.com/watch?v=PorfLSr3DDI](https://m.youtube.com/watch?v=PorfLSr3DDI)

~~~
baud147258
How does GDB differ from using the debugger integrated in an IDE, like Visual
Studio? Does it have more features?

~~~
majewsky
This question implies the assumption that the reader would be using an IDE. If
they do, sure, using the integrated debugger is probably easier. Otherwise,
GDB is probably easier to install than Visual Studio.

~~~
baud147258
Thanks. I'm using Visual Studio, since I'm developing on (and for) Windows
and Visual Studio is part of the build process (we're using its compiler). So
it's not really applicable to my situation, but I was interested in knowing
more.

------
osrec
To understand code quickly, I recommend grabbing the source repo and adding it
to an IDE so you can easily jump from a function call to its declaration, etc.,
and just hop around the code. Sometimes, however, an IDE can get in the way,
and it's easier to just use grep in a terminal to see where a particular
function call is used/defined.

Eventually, you do build up some intuitions about code style and flow within a
particular code base and will be able to comprehend it faster, but often it
can simply be a painstaking process.

~~~
svet_0
Another option is an indexing engine such as OpenGrok or Source Insight. These
are often simpler to set up than an IDE, and yet provide a good 'read-only'
environment for exploring a code base.

------
samblr
Source Insight is a little-known editor which does a great job of browsing
large C projects. Cannot recommend it enough.

It can also be used on Linux and Mac, but only through Wine or something like
that. It is Windows software and is not free. But a trial version is sometimes
enough just to understand the code flow.

~~~
qorrect
[https://www.sourcetrail.com/](https://www.sourcetrail.com/) is also a good
source code explorer. There is also the much older
[https://scitools.com/](https://scitools.com/) but it costs a fortune.

------
hemansan
Remember: GDB is your greatest friend in this endeavor.

1. Compile the code and try to run it for known use cases.
2. Traverse through high-level functions using GDB.
3. Note down the files and functions while you are at it.
4. Read all the comments from the files you have traversed so far.
5. Check if you have covered most of the files which constitute the majority
   of SOC.
6. If you don't understand a function, add a breakpoint to it and see the
   stack trace.

Repeat these steps until you have covered the whole set of files.

------
Nr7
This obviously doesn't apply to all C code, but one thing that I like to do is
to find some sort of endpoint. A piece of code that communicates with the
"outside world" and start from there. See what commands you can send to the
software and how they are handled. This is often a good way to get into
what the code does and how it does it.

------
saagarjha
A good editor/IDE that allows you to do something like "go to definition" is
often helpful in unraveling complexity.

------
vaylian
If you need to understand the entire project, the best place to start is the
main() function. If the code is reasonably structured this might give you a
good top-level view on what the main components are. And from there you can
basically dive into any direction of calls that you like.

I would really not bother with the headers. Headers are for the most part only
there for type declarations, so that a compiler can compile an isolated code
unit (.c file) with the information on types and function signatures in other
code units. This type information is needed for the compiler because otherwise
linking will fail after compilation.

------
donmcc
Doxygen can build fully hyperlinked docs with complete source listing. It’ll
give you an index of all the functions, structs and global variables, and
makes it easy to drill down.

If the project uses CMake, you might look into JetBrains’ CLion IDE, which
has really great tools for code navigation, finding uses of a symbol, etc.

As others have said, find the main function or other primary entry points and
start reading. It can seem hopeless at first if the codebase is large, but
persistence pays off.

I often find making flow charts and graphing data relationships of key parts
of the code to be really helpful.

------
acln
I've used OpenGrok for this in the past. These days, I like to run it in a
container, so I don't have to deal with Java and Tomcat myself.

------
dana321
Start with the tests, examples and APIs; at least exercise a couple of parts
of the API, whether by writing a plugin or by using the code as a library,
depending on what the code does.

Once you understand those, read the headers of the APIs; from there you can
follow the code through to the rest of the application.

Usually, a project will use 3rd party deps so you will have to understand at
least the APIs of those as well.

If the project uses CMake or some other build system generator, learn it
inside out. If it's just a Makefile then that's simple enough to understand;
follow how everything is compiled and linked together.

------
zygotic12
You're reading an encyclopedia. Isolate the bits you are interested in. But in
the end it's grunt work. If you aren't prepared to see EVERY line of code
executing you will never know what it intends/does. With JavaScript you can
get away with black box knowledge. C by definition is a thin wrapper over
machine implementations. If you went to uni, make notes. If you've been
programming for
40 years you'll derive your own system of boxes, lines and scribbles that fits
on the back of a cigarette packet.

------
jolmg
grep is your friend.

Pick some aspect you want to learn about, select some text that's output or
input in the user interface that should be relatively close in the code, and
grep it. Follow from there. Check what it does, appreciating the variable and
function names that the writer picked so you wouldn't have to follow every
function definition to understand what it does in a broad sense. When the
writer didn't care to pick quickly understandable names, it's probably because
those things handle arcane details that are of no interest for understanding
the greater picture of what's happening. In other words, skim for the
informative names and deduce what's happening from them. Check what calls
that code, where
that code leads to, etc. When you get bored of these features, rinse and
repeat from the top, selecting another feature by its UI output.

Another technique is to just grep random things related to the program. For
example, if it's a scroll shooter game, grep "bullet", "ship" and things like
that. Checking the results gives you a good idea of where the most
identifiable, high-level code is and how it's organized.

When I started out, I thought the only sensible way was to find main() and
read line by line, depth-first, following every function call, understanding
every detail. It's not effective. You don't need every detail. You don't need
to parse it like a computer.

It's like sight. A computer will work, no problem, pixel by pixel, left to
right, top to bottom. For us, it's more useful to just look and identify the
most distinctive characteristics that we can work with. The ones that relate
the most to how we can interact with the object.

------
ecesena
I usually run doxygen enabling graphs for includes (INCLUDE_GRAPH,
INCLUDED_BY_GRAPH) and function calls (CALL_GRAPH, CALLER_GRAPH), and generate
html docs. This way you get a "nice" clickable visualization, that helps you
move around the code. Of course, you also need an editor; I'm not saying to
just use doxygen. But for me at least it's very helpful.

------
drelihan
No really easy way to do it. Just step through in a debugger a bunch of times
and pay attention to the call stack and you'll start to get a feeling of the
flow and structure of the code.

That and try to tackle a number of smaller projects / fixes / improvements.
That helped me the most and I was marginally productive while I was doing it.

------
egraether
You should give Sourcetrail a try! It's a tool for browsing and understanding
unfamiliar source code based on LLVM/Clang LibTooling.

[https://www.sourcetrail.com/](https://www.sourcetrail.com/)

------
notacoward
As it happens, I wrote a blog post about this not too long ago. Basic summary:

(1) Write things down as you go.

(2) Learn your tools (e.g. IDE or cscope).

(3) Learn the core data structures and RPC definitions first.

(4) Trace through important code flows.

(5) Experiment with small changes to see how they affect things.

The full post is here:
[http://obdurodon.silvrback.com/navigating-a-large-codebase](http://obdurodon.silvrback.com/navigating-a-large-codebase)

BTW, a lot of folks here need to stop talking about "the entry point" or
"main". For many kinds of code, from kernel code to network and storage
servers, main will not exist or will be intensely uninteresting. Finding the
_real_ entry points is part of the investigative process, not something to be
assumed.

------
edoo
Scan the functions to get an idea of what is possible and a rough estimate of
how everything fits together. Follow the code flow through an example to see
how it is actually used.

------
frederikvs
Pray that there's some decent software architecture documentation. If there
is, this should be your starting point, to help you understand which files do
what.

------
sys_64738
Emacs and TAGS.

------
meh2frdf
Same way you eat an elephant!

------
rramadass
This actually is something which requires both careful thought and a
systematic approach. It can be frustrating in the beginning but with enough
persistence, a mass of code will start making sense slowly but surely. The
process is not always linear and is often sped up by prior experience and
intuition. IMO, this is a most important skill to cultivate for a
"professional" programmer since most of the time one spends far more time
reading other people's code, understanding it, fixing bugs and adding features.
Rarely does one get an opportunity to build everything from ground up.

The following is an approach I try to follow for C/C++ code:

0) DO NOT try to understand the details of how exactly something works in the
beginning. Work top-down, iterating and gradually getting into the details as
needed. The key is to get a firm idea of the system as a whole before diving
into the nitty-gritties.

1) We need to focus on three main aspects:

- Physical Structure: How is the code distributed across files and
  directories?
- Static Structure: What are the top-level logical Subsystems and Modules in
  the codebase? What are the dependencies amongst them? What are the major
  data structures and static call-trees?
- Dynamic Structure: What are the major use-case scenarios? For a given
  use-case scenario, what is the actual call flow at run-time? How does a
  relevant data structure change?

2) First, try and find some old-timer in the group/company/wherever who has
worked with the code for a while. Set up a few whiteboard sessions and pick
his/her brain to get a good overview of the system. Also go through any and
all available documentation. Take notes as needed.

3) Sit with QA/Testing and try out the system as an "end-user". This will
identify the major features/use-case scenarios.

4) The above would have given us a good overview and now we can drill down
into the codebase. You can use your favourite IDE/tools (Visual Studio,
Eclipse etc.) but you still need to keep the above-mentioned three aspects of
the system in mind. I tend to use the following tools:

a) Doxygen, CScope, CFlow, etags, GLOBAL, grep to cross-reference data
structures, symbols and follow static call flows.

b) Call graph using gprof for Dynamic call flows (ignore performance data
initially).

c) Most large codebases have some sort of Trace/Debug scaffolding which you
can turn on in the build system. This often provides us with a lot of insight
into the runtime behaviour of the system since programmers
output/verify/check important state data using these statements.

5) For each major use-case/feature scenario peruse the static call graph
generated using the above tools. Note carefully the major data structures
whose state is changed. Pay particular attention to "asserts" in the code
since they verify pre/post-conditions and invariants thus clarifying "what"
the code is supposed to do. The "how" is the code itself.

6) Now we execute the use-case with specific inputs and take a look at the
Dynamic call graph generated using gprof or Trace/Debug statements and match
it to the static call graph. Depending upon the system, it is often easier to
generate the Dynamic call graph and then lookup its corresponding Static call
graph.

Finally, we should now have the following:

- A list of all major subsystems and modules and where physically they are
  located in the source tree.
- A list of the major data structures in each module. For common/shared data
  structures, note which modules depend on them.
- Static call graphs for major use-case scenarios. You can annotate this with
  the main data structures that a function touches.
- Dynamic call graphs. Note the input used. You can merge this with the above
  static call graph.

With the above in hand, we should be well on our way towards "grokking" the
system.

------
spkhaira
vim + cscope + ctags

