1) Make a note of all the files. Well-named files tell you what they do. If they do not, then it's probably in the comments. It's important that you know what each file is for.
2) Find the entry point (main, WinMain, _main_)
3) Make a note of all the major macros, structs and functions you see. Note them down. Do it recursively as you go from main toward the other definitions.
4) (optional) Read the changelog and the documentation if available.
5) It will be easier to keep track of what you are reading if you note it down. Then go over and read, and keep making notes. Keep detailed but easy-to-find notes.
6) Please keep making notes.
7) Make notes as you read
8) Did I stress making notes enough?
This method helped me a lot :) Happy Hacking
Make executable notes as you read.
i.e. add asserts or logs that check what you believe, and log (or abort) if what you believe is false.
Otherwise it's idle speculation that never gets falsified.
With code generation or libs there might even be no entry point in the repo source code.
I would say go from the edges of the call tree towards the root but not all the way to it. Otherwise I get lost. Just guess which edges are interesting and start from them.
That means having a very clear idea of what you want from the code. You spend the minimal amount of energy understanding the forest before you look at the trees, and you only spend time on the trees you need.
What I do is not reading: I execute parts of the code and debug it, so I get a "feeling" for what the code does before I know exactly how it does it.
If something moves like a dog, barks like a dog, eats and wags its tail like a dog... odds are it is a dog. Humans identify things by global behavior much better than by painfully studying a hair or something.
There is code out there that is not worth reading, as it is so complex. Take TeX, for example: it was created to be so efficient on old machines that it is almost impossible to read without spending a tremendous amount of time. But understanding how it behaves is easy.
I use my own visual tools to inspect and debug the code visually in a high-level way. For example, instead of just a number I can "see" a point (three coordinates), or a polygon (three points and a normal vector). Without these tools I would be looking at individual numbers (or nodes in a tree) and having to reconstruct the whole in my mind (not good).
Those visual tools are created in a fast prototyping environment (Dear ImGui) with debugging scripts in order to see more globally.
You can do the same if nobody is actively working on the code but you have the history in source control.
So pick some various things the program does and walk through them in the debugger. Do you understand what's happening? Why? Is there a clear pattern at work for how things are constructed, used, torn-down?
Do that a few times with various behaviors. A good place to start is whatever the "init" functionality is, since everything else will use that.
After you're comfortable with the patterns you're seeing, begin to move "vertically", looking at other methods in the module near the one you're interested in. Are there patterns to the way the modules are organized? Does it make sense?
At some point, you'll have a good idea of whether or not you're in a strongly-patterned architecture. Strongly-patterned architectures may be huge and complex, but a small number of patterns can take you far. If not? You've got a mess. Then you have a new question: how do I deal with this mess in front of me?
ADD: An important point to understand about really complex and poorly-written codebases is that they might not be understandable, at least overall. At some point, when you're writing bad systems, you have a magic act, not a program.
Eventually, you do build up some intuitions about code style and flow within a particular code base and will be able to comprehend it faster, but often it can simply be a painstaking process.
1. Compile the code and try to run it for known use cases.
2. Traverse through high-level functions using GDB.
3. Note down the files and functions while you are at it.
4. Read all the comments from the files you have traversed so far.
5. Check if you have covered most of the files which constitute the majority of the source lines of code.
6. If you don't understand a function, add a breakpoint to it and see the stacktrace.
Repeat these steps until you have covered the whole set of files.
Can also be used on linux and mac but with wine or something like that. It is a windows software and is not free. But a trial version is enough sometimes just to understand codeflow.
I would really not bother with the headers. Headers are, for the most part, only there for type declarations, so that a compiler can compile an isolated code unit (.c file) with the information on types and function signatures in other code units. This type information is needed by the compiler because otherwise linking will fail after compilation.
If the project uses CMake, you might look into JetBrains' CLion IDE, which has really great tools for code navigation, finding uses of a symbol, etc.
As others have said, find the main function or other primary entry points and start reading. It can seem hopeless at first if the codebase is large, but persistence pays off.
I often find making flow charts and graphing data relationships of key parts of the code to be really helpful.
Once you understand those, read the headers of the APIs so you can follow the code through to the rest of the application.
Usually a project will use third-party dependencies, so you will have to understand at least the APIs of those as well.
If the project uses CMake or some other build-system generator, learn it inside out. If it's just a Makefile, that's simple enough to understand; follow how everything is compiled and linked together.
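For the Makefile case, the whole compile-and-link story is often just a few lines to follow (file names here are invented):

```make
CC      = cc
CFLAGS  = -g -Wall
OBJS    = main.o parser.o render.o

# Link step: every object file in the program comes together here.
game: $(OBJS)
	$(CC) -o $@ $(OBJS) -lm

# Compile step: each .c becomes a .o on its own.
%.o: %.c
	$(CC) $(CFLAGS) -c $<
```

Reading just the link rule tells you the unit boundaries of the program and which external libraries (`-lm` here) it leans on.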
Pick some aspect you want to learn about, select some text that's output or input in the user interface and should be relatively close in the code, and grep for it. Follow from there. Check what it does, appreciating the variable and function names the writer picked so you don't have to follow every function definition to understand what it does in a broad sense. When the writer didn't care to pick quickly understandable names, those things are probably arcane details that are of no interest for the greater picture of what's happening. In other words, skim for the informative names and deduce what's happening from them. Check what calls that code, where that code leads to, etc. When you get bored of that feature, rinse and repeat from the top, selecting another feature by its UI output.
Another technique is to just grep random things related to the program. For example, if it's a scroll shooter game, grep "bullet", "ship" and things like that. Checking the results gives you a good idea of where the most identifiable, high-level code is and how it's organized.
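A sketch of that grep pass against an imaginary scroll-shooter tree (all paths and names made up for the example):

```shell
# Lay out a toy source tree to grep against.
mkdir -p shmup/src
cat > shmup/src/bullet.c <<'EOF'
void bullet_spawn(void) {}
void bullet_update(void) {}
EOF
cat > shmup/src/ship.c <<'EOF'
void ship_move(void) {}
EOF

# -r: recurse, -n: line numbers. The hits map out where the
# most identifiable, high-level code lives.
grep -rn "bullet" shmup/src
```

Each hit is a thread to pull: the file it lands in tells you how the code is organized, and the surrounding names tell you what else lives nearby.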
When I started out, I thought the only sensible way was to find main() and read line by line, depth-first, following every function call, understanding every detail. It's not effective. You don't need every detail. You don't need to parse it like a computer.
It's like sight. A computer will work, no problem, pixel by pixel, left to right, top to bottom. For us, it's more useful to just look and identify the most distinctive characteristics that we can work with. The ones that relate the most to how we can interact with the object.
That and try to tackle a number of smaller projects / fixes / improvements. That helped me the most and I was marginally productive while I was doing it.
(1) Write things down as you go.
(2) Learn your tools (e.g. IDE or cscope).
(3) Learn the core data structures and RPC definitions first.
(4) Trace through important code flows.
(5) Experiment with small changes to see how they affect things.
The full post is here: http://obdurodon.silvrback.com/navigating-a-large-codebase
BTW, a lot of folks here need to stop talking about "the entry point" or "main". For many kinds of code, from kernel code to network and storage servers, main will not exist or will be intensely uninteresting. Finding the real entry points is part of the investigative process, not something to be assumed.
The following is an approach I try to follow for C/C++ code:
0) DO NOT try to understand the details of how exactly something works in the beginning. Work top-down, iterating and gradually getting into the details as needed. The key is to get a firm idea of the system as a whole before diving into the nitty-gritties.
1) We need to focus on three main aspects:
- Physical Structure: How is the code distributed across files and directories?
- Static Structure: What are the top-level logical Subsystems and Modules in the codebase? What are the dependencies amongst them? What are the major data structures and static call-trees?
- Dynamic Structure: What are the major use-case scenarios? For a given use-case scenario, what is the actual call flow at run-time? How does a relevant data structure change?
2) First, try to find some oldtimer in the group/company/wherever who has worked with the code for a while. Set up a few whiteboard sessions and pick their brain to get a good overview of the system. Also go through any available documentation. Take notes as needed.
3) Sit with QA/Testing and try out the system as an "end-user". This will identify the major features/use-case scenarios.
4) The above will have given us a good overview, and now we can drill down into the codebase. You can use your favourite IDE/tools (Visual Studio, Eclipse, etc.) but you still need to keep the above-mentioned three aspects of the system in mind. I tend to use the following tools:
a) Doxygen, CScope, CFlow, etags, GLOBAL, grep to cross-reference data structures, symbols and follow static call flows.
b) Call graph using gprof for Dynamic call flows (ignore performance data initially).
c) Most large codebases have some sort of Trace/Debug scaffolding which you can turn on in the build system. This often provides us with a lot of insight into the runtime behaviour of the system since programmers output/verify/check important state data using these statements.
5) For each major use-case/feature scenario peruse the static call graph generated using the above tools. Note carefully the major data structures whose state is changed. Pay particular attention to "asserts" in the code since they verify pre/post-conditions and invariants thus clarifying "what" the code is supposed to do. The "how" is the code itself.
6) Now we execute the use-case with specific inputs and take a look at the Dynamic call graph generated using gprof or Trace/Debug statements and match it to the static call graph. Depending upon the system, it is often easier to generate the Dynamic call graph and then lookup its corresponding Static call graph.
Finally, we should now have the following;
- A list of all major subsystems and modules and where physically they are located in the source tree.
- A list of the major data structures in each module. In the case of common/shared data structures, a note of their dependent modules.
- Static call graphs for major use-case scenarios. You can annotate this with the main data structures that a function touches.
- Dynamic call graphs. Note the input used. You can merge this with the above static call graph.
With the above in hand, we should be well on our way towards "grokking" the system.