Hacker News
Ask HN: What is the best practice for reading large pieces of C code?
74 points by AlbertJWilliams 10 days ago | 37 comments
How do I read a large piece of code that has a lot of header files and source code files? Is there any method or is it just about reading it one by one? How do you do it efficiently? Thanks in advance.





The basic steps to follow are:

1) Make a note of all the files. Well-named files tell you what they do. If they do not, then it's probably in the comments. It's important that you know what each file is for.

2) Find the entry point (main, WinMain, _main_)

3) Make a note of all the major macros, structs and functions you see. Note them down. Do it recursively as you go from main towards the other defs.

4) (optional) Read the changelog and the documentation if available.

5) It will be easier to keep track of what you are reading if you note it down. Then go over and read, and keep making notes. Keep detailed but easy to find notes.

6) Please keep making notes.

7) Make notes as you read

8) Did I stress making notes?

This method helped me a lot :) Happy Hacking


Strongly agree with everything here, except perhaps the bit about notes.

Not enough notes for your taste?

> Make notes as you read

Make executable notes as you read.

FTFY

i.e., add in asserts or logs that check what you believe, and log (or abort) if what you believe is false.

Otherwise it's idle speculation that never gets falsified.
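To make "executable notes" concrete, here is a minimal sketch. The `BELIEVE` macro, the `belief_failures` counter and the `clamp_percent` function are all made up for illustration; in practice you would drop the macro into whatever function you are reading.

```c
#include <stdio.h>

/* Number of beliefs that turned out to be false (easy to inspect later). */
static int belief_failures = 0;

/* Executable note: if cond is false, log where the belief was written
 * down and what it claimed. Replace the fprintf with abort() if a wrong
 * belief should stop the program instead. */
#define BELIEVE(cond, note)                                         \
    do {                                                            \
        if (!(cond)) {                                              \
            belief_failures++;                                      \
            fprintf(stderr, "%s:%d: belief failed: %s (%s)\n",      \
                    __FILE__, __LINE__, #cond, (note));             \
        }                                                           \
    } while (0)

/* Hypothetical function from the code being studied. */
int clamp_percent(int value)
{
    if (value < 0)   return 0;
    if (value > 100) return 100;
    return value;
}

/* The reader's wrapper, recording two beliefs about clamp_percent. */
int checked_clamp_percent(int value)
{
    int result = clamp_percent(value);
    BELIEVE(result >= 0 && result <= 100, "result is always a valid percentage");
    BELIEVE(result == value, "callers never pass out-of-range values");
    return result;
}
```

The second belief is the interesting kind: it is a guess about the callers, and the first out-of-range call will falsify it in the log rather than in your head.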


In a big project, reading recursively from the entry point essentially amounts to saying "read it all".

With code generation or libs there might even be no entry point in the repo source code.

I would say go from the edges of the call tree towards the root but not all the way to it. Otherwise I get lost. Just guess which edges are interesting and start from them.


The most efficient way to read source code is not to read it. :-D

That means you have a very clear idea of what you want from the code. You spend the minimal amount of energy to understand the forest before you look at the trees, and you only spend time on the trees that you need.

What I do is not reading. I execute parts of the code and debug it, so I get a "feeling" for what the code does before I know exactly how it does it.

If something moves like a dog, barks like a dog, eats like a dog and wags its tail like a dog... odds are it is a dog. Humans identify things by global behavior much better than by painfully studying a single hair.

There is code out there that is not worth reading because it is so complex. Take TeX, for example: it was written to be so efficient on old machines that it is almost impossible to read without spending a tremendous amount of time. But understanding how it behaves is easy.

I use my own visual tools to inspect and debug the code at a high level. For example, instead of just a number I can "see" a point (that is, 3 coordinates), or a polygon (three points and a normal vector). Without these tools I would be seeing individual numbers (or nodes in a tree) and having to reconstruct the whole in my head (not good).

Those visual tools are built in a fast prototyping environment (Dear ImGui) with debugging scripts, in order to see things more globally.
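A text-only stand-in for the same idea, for anyone without a GUI toolkit at hand: a helper that shows a whole triangle at once instead of twelve loose numbers. This is only a sketch; the `Vec3` and `Triangle` types and both function names are assumptions, not taken from any particular codebase.

```c
#include <stdio.h>

typedef struct { double x, y, z; } Vec3;

typedef struct {
    Vec3 points[3];   /* the three corners */
    Vec3 normal;      /* surface normal */
} Triangle;

/* Formats the triangle as one readable line into buf. Returns what
 * snprintf returns: the length the full text needs, excluding the NUL. */
int triangle_describe(const Triangle *t, char *buf, size_t size)
{
    return snprintf(buf, size,
        "tri[(%.1f,%.1f,%.1f) (%.1f,%.1f,%.1f) (%.1f,%.1f,%.1f)] "
        "n=(%.1f,%.1f,%.1f)",
        t->points[0].x, t->points[0].y, t->points[0].z,
        t->points[1].x, t->points[1].y, t->points[1].z,
        t->points[2].x, t->points[2].y, t->points[2].z,
        t->normal.x, t->normal.y, t->normal.z);
}

/* Convenience wrapper that prints to stderr, meant to be called by hand
 * from a debugger while stopped at a breakpoint. */
void triangle_dump(const Triangle *t)
{
    char buf[256];
    triangle_describe(t, buf, sizeof buf);
    fprintf(stderr, "%s\n", buf);
}
```

In gdb, something like `call triangle_dump(&tri)` at a breakpoint (with `tri` being whatever variable is in scope) prints the whole object in one line.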


The easiest way I've found is to try to find out how a specific piece of functionality works. Try to modify it in some small way -- that way you can test your understanding. It's a bit like putting together a big puzzle. You just take it one piece at a time -- but try to break it up by functionality so that you are forced to see how everything fits together. Often it's great to try to write tests or fix bugs. Again, this gives you some focus. But don't expect to learn it quickly. It takes time for things to sink in.
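One way to "test your understanding" is a small characterization test: write down what you think a function does as asserts, run them, and see which ones fail. A minimal sketch, where `parse_port` is a hypothetical stand-in for a function you found in the codebase:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a function from the code being studied (hypothetical). */
int parse_port(const char *text)
{
    char *end;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0') return -1;   /* not a clean number */
    if (value < 1 || value > 65535)  return -1;   /* outside port range */
    return (int)value;
}

/* Each assert records one belief about the function's behavior.
 * A failing assert means the belief was wrong, not necessarily the code. */
void test_parse_port(void)
{
    assert(parse_port("80") == 80);
    assert(parse_port("65535") == 65535);
    assert(parse_port("0") == -1);
    assert(parse_port("65536") == -1);
    assert(parse_port("") == -1);
    assert(parse_port("80abc") == -1);
}
```

Once the asserts pass, they double as a safety net for the "modify it in some small way" step.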

In addition: if other people are working on the code base, I find it very insightful to first read what task/issue they are working on and then look at the commit they made. I then try to see if I understand why the commit implements the described task or solves the issue.

You can do the same if nobody is actively working on the code but you have the history in source control.


Good points. It's also the way to incrementally add unit tests to an existing codebase.

At first you really don't want to understand the code, you want to understand what it does. Nobody can suck in huge codebases and make sense out of them.

So pick some various things the program does and walk through them in the debugger. Do you understand what's happening? Why? Is there a clear pattern at work for how things are constructed, used, torn-down?

Do that a few times with various behaviors. A good place to start is whatever the "init" functionality is, since everything else will use that.

After you're comfortable with the patterns you're seeing, begin to move "vertically", looking at other methods in the module near the one you're interested in. Are there patterns to the way the modules are organized? Does it make sense?

At some point, you'll have a good idea of whether or not you're in a strongly-patterned architecture or not. Strongly-patterned architectures may be huge and complex, but a small number of patterns can take you far. If not? You got a mess. Then you have a new question: how do I deal with this mess in front of me?

ADD: An important point to understand about really complex and poorly written codebases is that they might not be understandable, at least overall. At some point when you're writing bad systems, you have a magic act, not a program.
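For illustration, this is the kind of construct/use/tear-down pattern worth looking for while stepping through. It is a made-up `Buffer` module, not code from any real project; in a strongly patterned C codebase, many modules tend to repeat the same three-part shape.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;
    size_t len;   /* bytes currently stored */
    size_t cap;   /* bytes allocated */
} Buffer;

/* Construct: returns NULL if either allocation fails. */
Buffer *buffer_create(size_t cap)
{
    Buffer *b = malloc(sizeof *b);
    if (!b) return NULL;
    b->data = malloc(cap);
    if (!b->data) { free(b); return NULL; }
    b->len = 0;
    b->cap = cap;
    return b;
}

/* Use: appends n bytes; returns 0 on success, -1 if they do not fit. */
int buffer_append(Buffer *b, const char *src, size_t n)
{
    if (n > b->cap - b->len) return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Tear down: accepts NULL, like free(). */
void buffer_destroy(Buffer *b)
{
    if (!b) return;
    free(b->data);
    free(b);
}
```

Once you have recognized one such triple, breakpoints on the create and destroy functions of other modules are a cheap way to see object lifetimes in the debugger.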


Compile it with debugging symbols and run it with gdb in tui mode. This will let you place breakpoints in source files, view back traces, find the definitions of functions, structures, macros, etc. The learning curve is a tad steep, but this video is pretty good:

https://m.youtube.com/watch?v=PorfLSr3DDI


How does GDB differ from using the debugger integrated in an IDE, like Visual Studio? Does it have more features?

This question implies the assumption that the reader would be using an IDE. If they do, sure, using the integrated debugger is probably easier. Otherwise, GDB is probably easier to install than Visual Studio.

Thanks. I'm using Visual Studio, since I'm developing on (and for) Windows and Visual Studio is part of the build process (we're using its compiler). So it's not really applicable to my situation, but I was interested in knowing more.

To understand code quickly, I recommend grabbing the source repo and adding it to an IDE so you can easily jump from a function call to its declaration etc. and just hop around the code. Sometimes, however, an IDE can get in the way, and it's easier to just use grep in a terminal to see where a particular function is used or defined.

Eventually, you do build up some intuitions about code style and flow within a particular code base and will be able to comprehend it faster, but often it can simply be a painstaking process.


Another option is an indexing engine such as OpenGrok or Source Insight. These are often simpler to set up than an IDE, and yet provide a good 'read-only' environment for exploring a code base.

Remember: GDB is your greatest friend in this endeavor.

1. Compile the code and try to run it for known use cases.

2. Traverse through high-level functions using GDB.

3. Note down the files and functions while you are at it.

4. Read all the comments from the files you have traversed so far.

5. Check if you have covered most of the files which constitute the majority of SOC.

6. If you don't understand a function, add a breakpoint to it and see the stacktrace.

Repeat these steps until you have covered the whole set of files.


Source Insight is a little-known editor which does a great job of browsing large C projects. Cannot recommend it enough.

It can also be used on Linux and Mac, but only through Wine or something like that. It is Windows software and is not free, but a trial version is sometimes enough just to understand the code flow.


https://www.sourcetrail.com/ is also a good source code explorer. There is also the much older https://scitools.com/ but it costs a fortune.

This obviously doesn't apply to all C code, but one thing that I like to do is to find some sort of endpoint: a piece of code that communicates with the "outside world", and start from there. See what commands you can send to the software and how they are handled. This is often a good way to get into what the code does and how it does it.
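In C, such an endpoint often shows up as a table mapping command names to handler functions. A minimal sketch of what to look for; the command names and handlers here are invented for the example:

```c
#include <stddef.h>
#include <string.h>

typedef int (*CommandHandler)(const char *arg);

/* Hypothetical handlers; in a real codebase these are the functions
 * worth reading first. */
int handle_ping(const char *arg) { (void)arg; return 0; }
int handle_echo(const char *arg) { return arg ? (int)strlen(arg) : -1; }

typedef struct {
    const char    *name;
    CommandHandler handler;
} Command;

/* The table is a ready-made index of everything the outside world can ask for. */
static const Command commands[] = {
    { "PING", handle_ping },
    { "ECHO", handle_echo },
};

/* Finds the handler for name and calls it; returns -2 for unknown commands. */
int dispatch(const char *name, const char *arg)
{
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++) {
        if (strcmp(commands[i].name, name) == 0)
            return commands[i].handler(arg);
    }
    return -2;
}
```

Grepping for a command string you can see from the outside (here `"ECHO"`) usually lands you on a table like this, and from there each handler is one jump away.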

A good editor/IDE that allows you to do something like "go to definition" is often helpful in unraveling complexity.

If you need to understand the entire project, the best place to start is the main() function. If the code is reasonably structured this might give you a good top-level view on what the main components are. And from there you can basically dive into any direction of calls that you like.

I would really not bother with the headers. Headers are for the most part only there for type declarations, so that the compiler can compile an isolated code unit (.c file) with the information on types and function signatures from other code units. The compiler needs this information to check and generate calls into those other units; the linker then resolves the actual symbols after compilation.
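To show what that split typically looks like, here is a sketch with both halves in one listing. The `counter` module is hypothetical: the first part is what would sit in a header, the second is the .c file where the behavior actually lives.

```c
/* ---- what would live in counter.h: declarations only ---- */
typedef struct Counter Counter;           /* opaque type, layout hidden */
Counter *counter_new(void);
void     counter_add(Counter *c, int n);
int      counter_value(const Counter *c);
void     counter_free(Counter *c);

/* ---- what would live in counter.c: the definitions ---- */
#include <stdlib.h>

struct Counter { int total; };

/* calloc zero-initializes, so a new counter starts at 0. */
Counter *counter_new(void)               { return calloc(1, sizeof(Counter)); }
void     counter_add(Counter *c, int n)  { c->total += n; }
int      counter_value(const Counter *c) { return c->total; }
void     counter_free(Counter *c)        { free(c); }
```

The header half tells you the names and signatures, which is handy as a table of contents, but nothing about what the functions do.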


I've used OpenGrok for this in the past. These days, I like to run it in a container, so I don't have to deal with Java and Tomcat myself.

Doxygen can build fully hyperlinked docs with complete source listing. It’ll give you an index of all the functions, structs and global variables, and makes it easy to drill down.

If the project uses CMake, you might look into JetBrains' CLion IDE, which has really great tools for code navigation, finding uses of a symbol, etc.

As others have said, find the main function or other primary entry points and start reading. It can seem hopeless at first if the codebase is large, but persistence pays off.

I often find making flow charts and graphing data relationships of key parts of the code to be really helpful.


Start with the tests, examples and APIs; at least implement a couple of parts of the API, whether as a plugin or by using the code as a library, depending on what the code does.

Once you understand those, read the headers of the APIs; from there you can follow the code through to the rest of the application.

Usually a project will use third-party dependencies, so you will have to understand at least the APIs of those as well.

If the project uses CMake or some other build system generator, learn it inside out. If it's just a Makefile, then that's simple enough to understand; follow how everything is compiled and linked together.


You're reading an encyclopedia. Isolate the bits you are interested in. But in the end it's grunt work. If you aren't prepared to see EVERY line of code executing, you will never know what it intends/does. In JavaScript you can get away with black-box knowledge. C by definition is a thin wrapper over machine implementations. If you went to uni, make notes. If you've been programming for 40 years, you'll derive your own system of boxes, lines and scribbles that fits on the back of a cigarette packet.

grep is your friend.

Pick some aspect you want to learn about, select some text that's output or input in the user interface and should be relatively close in the code, and grep for it. Follow from there. Check what it does, appreciating the variable and function names that the writer picked so you wouldn't have to follow every function definition to understand what it does in a broad sense. When the writer didn't care to pick quickly understandable names, it's probably because those things deal with arcane details that are of no interest for understanding the greater picture of what's happening. In other words, skim for the informative names and deduce what's happening from them. Check what calls that code, where that code leads to, etc. When you get bored of this feature, rinse and repeat from the top, selecting another feature by its UI output.

Another technique is to just grep random things related to the program. For example, if it's a scroll shooter game, grep "bullet", "ship" and things like that. Checking the results gives you a good idea of where the most identifiable, high-level code is and how it's organized.

When I started out, I thought the only sensible way was to find main() and read line by line, depth-first, following every function call, understanding every detail. It's not effective. You don't need every detail. You don't need to parse it like a computer.

It's like sight. A computer will work, no problem, pixel by pixel, left to right, top to bottom. For us, it's more useful to just look and identify the most distinctive characteristics that we can work with. The ones that relate the most to how we can interact with the object.


I usually run Doxygen with graphs enabled for includes (INCLUDE_GRAPH, INCLUDED_BY_GRAPH) and function calls (CALL_GRAPH, CALLER_GRAPH), and generate HTML docs. This way you get a "nice" clickable visualization that helps you move around the code. Of course, you also need an editor; I'm not saying to just use Doxygen. But for me at least it's very helpful.

No really easy way to do it. Just step through in a debugger a bunch of times and pay attention to the call stack and you'll start to get a feeling of the flow and structure of the code.

That and try to tackle a number of smaller projects / fixes / improvements. That helped me the most and I was marginally productive while I was doing it.


You should give Sourcetrail a try! It's a tool for browsing and understanding unfamiliar source code based on LLVM/Clang LibTooling.

https://www.sourcetrail.com/


As it happens, I wrote a blog post about this not too long ago. Basic summary:

(1) Write things down as you go.

(2) Learn your tools (e.g. IDE or cscope).

(3) Learn the core data structures and RPC definitions first.

(4) Trace through important code flows.

(5) Experiment with small changes to see how they affect things.

The full post is here: http://obdurodon.silvrback.com/navigating-a-large-codebase

BTW, a lot of folks here need to stop talking about "the entry point" or "main". For many kinds of code, from kernel code to network and storage servers, main will not exist or will be intensely uninteresting. Finding the real entry points is part of the investigative process, not something to be assumed.


Scan the functions to get an idea of what is possible and a rough estimate of how everything fits together. Follow the code flow through an example to see how it is actually used.

Pray that there's some decent software architecture documentation. If there is, this should be your starting point, to help you understand which files do what.

Emacs and TAGS.

Same way you eat an elephant!

vim + cscope + ctags

This actually is something which requires both careful thought and a systematic approach. It can be frustrating in the beginning but with enough persistence, a mass of code will start making sense slowly but surely. The process is not always linear and is often sped up by prior experience and intuition. IMO, this is a most important skill to cultivate for a "professional" programmer, since most of the time one spends far more time reading other people's code, understanding it, fixing bugs and adding features. Rarely does one get an opportunity to build everything from the ground up.

The following is an approach I try to follow for C/C++ code:

0) DO NOT try to understand the details of how exactly something works in the beginning. Work top-down, iterating and gradually getting into the details as needed. The key is to get a firm idea of the system as a whole before diving into the nitty-gritties.

1) We need to focus on three main aspects:
- Physical Structure: How is the code distributed across files and directories?
- Static Structure: What are the top-level logical Subsystems and Modules in the codebase? What are the dependencies amongst them? What are the major data structures and static call-trees?
- Dynamic Structure: What are the major use-case scenarios? For a given use-case scenario, what is the actual call flow at run-time? How does a relevant data structure change?

2) First, try and find some oldtimer in the group/company/wherever who has worked with the code for a while. Set up a few whiteboard sessions and pick his/her brain to get a good overview of the system. Also go through any available documentation. Take notes as needed.

3) Sit with QA/Testing and try out the system as an "end-user". This will identify the major features/use-case scenarios.

4) The above would have given us a good overview and now we can drill down into the codebase. You can use your favourite IDE/tools (Visual Studio, Eclipse etc.) but you still need to keep the above-mentioned three aspects of the system in mind. I tend to use the following tools:
a) Doxygen, CScope, CFlow, etags, GLOBAL, grep to cross-reference data structures, symbols and follow static call flows.
b) Call graph using gprof for Dynamic call flows (ignore performance data initially).
c) Most large codebases have some sort of Trace/Debug scaffolding which you can turn on in the build system. This often provides us with a lot of insight into the runtime behaviour of the system since programmers output/verify/check important state data using these statements.

5) For each major use-case/feature scenario peruse the static call graph generated using the above tools. Note carefully the major data structures whose state is changed. Pay particular attention to "asserts" in the code since they verify pre/post-conditions and invariants thus clarifying "what" the code is supposed to do. The "how" is the code itself.

6) Now we execute the use-case with specific inputs and take a look at the Dynamic call graph generated using gprof or Trace/Debug statements and match it to the static call graph. Depending upon the system, it is often easier to generate the Dynamic call graph first and then look up its corresponding Static call graph.

Finally, we should now have the following:
- A list of all major subsystems and modules and where they are physically located in the source tree.
- A list of the major data structures in each module. For common/shared data structures, which modules depend on them.
- Static call graphs for major use-case scenarios. You can annotate these with the main data structures that each function touches.
- Dynamic call graphs. Note the input used. You can merge these with the above static call graphs.

With the above in hand, we should be well on our way towards "grokking" the system.
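For readers who have not met the Trace/Debug scaffolding mentioned in step 4c, here is a minimal sketch of what it often looks like. The `ENABLE_TRACE` flag, the `TRACE` macro and the `Queue` example are assumptions made for illustration; real projects use their own names and usually a richer logging layer.

```c
#include <stdio.h>

/* Build with -DENABLE_TRACE to switch the trace statements on
 * (hypothetical flag name). Without it, TRACE compiles to nothing. */
#ifdef ENABLE_TRACE
#define TRACE(...)                                                   \
    do {                                                             \
        fprintf(stderr, "[trace] %s:%d %s: ",                        \
                __FILE__, __LINE__, __func__);                       \
        fprintf(stderr, __VA_ARGS__);                                \
        fputc('\n', stderr);                                         \
    } while (0)
#else
#define TRACE(...) do { } while (0)
#endif

typedef struct { int pending; int done; } Queue;

/* Moves one item from pending to done; returns -1 if nothing is pending. */
int queue_complete_one(Queue *q)
{
    TRACE("enter pending=%d done=%d", q->pending, q->done);
    if (q->pending == 0) {
        TRACE("nothing to do");
        return -1;
    }
    q->pending--;
    q->done++;
    TRACE("leave pending=%d done=%d", q->pending, q->done);
    return 0;
}
```

With tracing switched on, the stderr log gives a rough dynamic call flow plus the state changes of the data structure, which is the raw material for the dynamic call graph in step 6.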



