What techniques do you all use to learn and understand a large codebase? What tools do you use?
Also use version control to identify the most commonly edited files in the project. These are usually the files that are doing all the work (80/20 rule) and you likely need to know of them.
git log --pretty=format: --name-only | grep -v '^$' | sort | uniq -c | sort -rg | head -10
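If you'd rather do the counting in a script (say, to filter by directory afterwards), here is a minimal Python sketch of the same idea. The function names `git_log_names` and `most_edited_files` are made up for illustration; the only thing taken from the command above is the `git log --pretty=format: --name-only` invocation.

```python
import subprocess
from collections import Counter


def git_log_names(repo: str = ".") -> str:
    """Return the raw output of `git log --pretty=format: --name-only` for a repo."""
    return subprocess.run(
        ["git", "log", "--pretty=format:", "--name-only"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout


def most_edited_files(git_log_output: str, top: int = 10) -> list[tuple[str, int]]:
    """Count how often each path appears in the log output, most frequent first."""
    paths = (line.strip() for line in git_log_output.splitlines())
    # Blank lines separate commits in this output format, so skip them.
    counts = Counter(p for p in paths if p)
    return counts.most_common(top)
```

Calling `most_edited_files(git_log_names())` from inside a checkout should give roughly what the shell pipeline prints, as (path, count) pairs.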
I normally use manual profiling libraries - I need an excuse to try out Orbit, which uses automatic instrumentation for similar purposes: https://www.youtube.com/watch?v=L8w0qI8qzvM
A little lower fidelity in some ways, but faster iteration than what I've been doing in others...
I don't consider it abuse either, though.
I see the debugger much more like a "REPL for a compiled language" than a "bug removal tool". I try to teach people to think of it as an interactive inspection tool, not as (merely) a thing to fix broken programs.
Besides that, it's times like these when I realise how useful IDEs are. Instead of needing to use grep (or something similar), I can simply right click on a variable and choose 'Find all references' (this is in VS, but I'm sure many of the leading IDEs will have this feature). When I use the command line it's to save myself time.
The analytic approach is a bit more awkward, since you have no specific goals and need to make them up yourself. So you could pose questions like how a specific behaviour of the application comes about ("why does it do that when this happens?", "how does it do X?") and then try to answer those comprehensively, systematically (a format that works well for me is short snippets of code interleaved with explanations and arguments).
A bottom-up approach is generally easier, because your questions will give you information at the bottom (like specific application messages), which are generally easy to find (ag, grep). A good IDE can be helpful for navigating the code and finding call sites, especially in projects written in dynamic languages where such analyses can become kinda annoying. (However, in more awkward code bases analysers like PyCharm are quickly overwhelmed and are unable to resolve indirections)
Top-down is in my experience less useful, because there are far too many choices on each level for most applications, and the first few layers are generally the least interesting and most arcane/fragile and difficult to follow along (things like initialization sequences).
The most difficult projects are typically those relying on multiple languages, code generation and runtime mutation (reflection, on-the-fly UI generation, overly dynamic Python code are typical examples). Another frequent obstacle is excessive abstraction and indirection (implementing something that could be done in a few lines of easy to understand and reason about C using multiple C++ templates spread out over a bunch of files and a healthy dozen of advanced language features is an almost archetypal example).
Then I try to go down the main code path of some examples or the primary binary if available and just check out how things are called/done around there.
Then run an example through callgrind and visualize the call graph in kcachegrind to get an idea of how often things are called, from where, and where the heavy lifting happens. That last step is optional and really depends on the type of project.
Then I use my code editor and lots of searching and call site lookups to get a better idea of how things are used.
1. If you can, get an overview from a mentor. This will make the next steps a lot easier. Get:
- design style
- high-level flow
2. Get a stack of white paper from the printer and put it in front of you along with a pen. Colored pens and scotch tape are a bonus (There may or may not have been a shortage of both printer paper and colored pens next to them when I started at my last job)
3. Open a debugger with a breakpoint on the first line of code
4. Pick a request flow and initiate a request. Let the debugger guide you through the entire request flow
5. Record the path of the flow as a sequence diagram on your paper
( BONUS ) Record the relationships between the components in the system in a class diagram
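If the codebase happens to be Python, you can also let the interpreter list the caller/callee pairs for you and then draw the diagram from that list. This is only a rough sketch using `sys.setprofile`; `trace_calls` and the three toy functions are hypothetical stand-ins for a real request flow, not anything from an actual project.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and return (result, [(caller, callee), ...]) for Python-level calls."""
    calls = []

    def profiler(frame, event, arg):
        # "call" fires when a Python function is entered; C calls use "c_call".
        if event == "call":
            caller = frame.f_back.f_code.co_name if frame.f_back else "<root>"
            calls.append((caller, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always switch the profiler off again
    return result, calls


# Toy "request flow" (hypothetical), just to have something to trace.
def load_user(uid):
    return {"id": uid}


def render(user):
    return f"user {user['id']}"


def handle_request(uid):
    return render(load_user(uid))
```

`trace_calls(handle_request, 7)` returns the result plus the recorded pairs, which map fairly directly onto the arrows of a sequence diagram. It doesn't replace stepping through with the debugger, since you don't see the code while it runs.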
Why does this work?
There's software out there for making these diagrams, so why draw them by hand? For most people, visual memory is the strongest. So the idea is to use your strong visual and spatial memory to help you recall random objects and facts. And hey, why not a codebase? That's why this works.
When you look at different files that the debugger guides you through, you are engaging your visual memory. You remember how the code is organized and what the files look like.
When you draw the sequence diagram you engage your spatial memory. E.g., the Router class doesn't interact with the Database class and so they are one sheet of paper apart. Visually, you can see what clusters of components work together to make larger structures. This allows you to mentally group the classes into a single concept.
The point is to get this information into your head, and not to produce a diagram on a piece of paper. If all you need is the latter, use software, of course. Seriously, when you're done, throw away the piece of paper; it will be outdated the next day anyway.
Starting to understand it involved reading our documentation on the data-flow between different components during operation, to know the purposes of the important binaries. For the really core components, we had a fair bit of documentation at the level of classes.
You'd usually end up learning the sections of particular programs that you worked on in great detail, the programs themselves as a whole in slightly less detail, getting fuzzier as you moved away from your areas of greatest experience.
I usually start by running cloc and sloccount to get an idea of its metrics: languages, lines-of-code estimates, etc.
I progress to looking at the tests if there are any. They usually give an idea of how the authors expect things to work.
Once I have browsed some of the tests, in particular integration tests, I start following how they work through the code. Your IDE of choice will help out here, or failing that, use ripgrep, ack, the silver searcher, searchcode server (note I run this so I am biased), or sourcegraph.
One thing that I have found especially valuable is running something to determine the cyclomatic complexity of the code. Knowing which parts are complex is a good way to determine where you should focus your time.
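For a feel of what such a tool computes, here is a rough sketch for Python sources using the standard `ast` module. It is an approximation of McCabe-style complexity (1 plus the number of decision points), not the exact metric any particular tool reports, and `complexity_by_function` is a made-up name.

```python
import ast

# Node types treated as one decision point each (an approximation).
_BRANCH_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While,
    ast.ExceptHandler, ast.IfExp, ast.Assert, ast.comprehension,
)


def complexity_by_function(source: str) -> dict[str, int]:
    """Return a rough complexity score per function name in the given source."""
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1
            for child in ast.walk(node):
                if isinstance(child, _BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # `a and b and c` adds two extra paths
                    score += len(child.values) - 1
            # Note: nested functions also count towards their enclosing function.
            scores[node.name] = score
    return scores
```

Sorting the result by score gives a quick list of the functions that probably deserve a slower read.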
1. Read some code
2. Try to understand how it works
Here is what I do:
1. Try to figure out what the code might be in advance using the information you have. (For example: I know nothing but the fact that it's a spreadsheet. Then figure out in your mind how the basics of a spreadsheet might work.)
2. Now read a little bit of the code. Compare with what you were thinking. If it matches, go to 3. If it doesn't match, figure out why by reading the code and by thinking more.
Note the two processes are relatively similar because step 2 of the former process is a little bit like step 1 of the latter process. Just try to focus on figuring out first, read second. Figure out first, read second. It's an active approach which makes you work more, and the more you work, the faster you go - or some benefits of that sort.
I actually wonder if people do that.
At my current company it often takes 6 months for experienced people to become productive, and that's with a helping hand.
IDEs are nice, but grep remains the best tool. You tend to need to find things in XML files and config as well.
In general, I think that high quality code attracts high quality pull requests.
That's because high quality code is easy to understand and easy to change because the core structure of the code is fundamentally sound and well suited to the problem that it is solving.
When you trace your own usage footsteps like this, it's often amazing how much goes on behind the scenes that you never realized.
I enjoyed reading it many years ago.
- gource can be used to visualize the activity in a repository.
- sloccount and cloc can be used to count lines of code.
- For C/C++/C#, you can run Doxygen, and ask it to generate documentation for undocumented entities. This can give you another perspective on the code base.
- At runtime there are various tools you can use to audit what a program does... On Linux you've got strace, lsof, Wireshark and many others... On Windows you've got Process Monitor from Sysinternals, as well as Wireshark.
* Read any developer contribution docs.
* Glean what info you can from the layout and naming of the source tree.
* Peruse the code and any comments and see what does what.
* Read the unit tests to see how things are expected to work.
* Peruse the issues list to see what's breaking.
* Try to get a feel for how the contributor(s) think by reading any public blog posts, etc.
If none of those approaches yield any insight, don't blame yourself; maybe instead look for a different OSS project to contribute to.
# Getting familiar with a new codebase
### Use the right tools
- grep, ack, ag, global search (Visual Assist)
- doxygen, javadocs
- sourcegraph, pfff (facebook), open-grok, SourceInsight
- Proper IDE, REPL
- chronon (dvr for java)
- SWAG (Software Architecture Group)
- Static code analysis
### Use the repository
- Find most relevant (frequently, recently edited) files
- Find dependency graphs
- Get basic information like which languages are used for what
- Use good source control so that you don't have to worry about breaking things
- Look at commits, in general or for specific issues
- Browse the directory structure, packages, modules, namespaces etc.
- Use "blame" to see when things changed
### Ask questions
- Talk to the customer, find out the purpose of the application
- Pair up with another developer who is more familiar with the code
### Read the documentation
- Look at use cases, diagrams describing architecture, call graphs, user docs
- Understand the problem domain
- Add more documentation as your knowledge grows
- Comments and docs might be wrong!
### Browse the code
- Skim around to get a general idea and a feeling for where things are
- Look at public interfaces, header files first
- Find out which libraries are used
- Take some important public API or function in the UI and follow the code from there. Find implementations of functions, dive into related functions and data structures until you understand how it's done. Then work your way back out.
- Use tools to quickly find declarations, definitions, references and calls of variables/functions/etc., usage patterns
- Find the entry point of the program
- Figure out the state machine of the program
- Focus on your particular issue
- Use a large, vertical screen with small font size with a pane to show file/class structure
### Take notes
- Use pencil and paper to write down summaries, relationships between classes, record definitions, core functions and methods
- Write a glossary: Function names, Datatypes, prefixes, filenames
- Document everything you understand and don't understand
- Use drawings to create a mental model
### Look at the data
- Find out how the data is stored in the database
### Build the project
- First make sure you can build it and run it
### Use the debugger, profiler and logging
- Set breakpoints, poke around the code, change variables, inspect local variables, stack traces, ...
- Watch the initialization process
- Start from main() and see where it goes
- Find hotspots with the profiler
- Set logging level to max/add logging and use the output to go through the code
### Edit the code
- Adopt the existing coding style
- Try to recreate and fix small bugs, make sure you understand the implications of the fix to the rest of the program first
- Tidy up the code according to the common standard after talking with the team
- Make the code clearer (best with tests)
- Add TODO comments
- Add comments describing what you think the code does
- Hack some feature into the code, then try to not break other stuff, build up a mental model over time, re-write the feature properly
### Use Tests
- Run the tests, make sure they are all passing
- Create new tests
- Browse the tests as an example reference
Use a static analyzer to build a graph of the codebase.
Build an adjacency list and a graph of the imports; and topologically + (…) sort.
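A minimal sketch of the import-graph part, assuming a Python project and covering only the plain topological sort (the elided part above is not reconstructed here). It uses `ast` to find imports and `graphlib.TopologicalSorter` (Python 3.9+) for the ordering; `import_graph` and `dependency_order` are hypothetical names, and relative imports are handled only crudely.

```python
import ast
from graphlib import TopologicalSorter


def import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Adjacency list: module name -> project-internal modules it imports."""
    graph = {name: set() for name in sources}
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            # Keep only edges that point at modules inside the project.
            graph[name].update(t for t in targets if t in sources and t != name)
    return graph


def dependency_order(graph: dict[str, set[str]]) -> list[str]:
    """Order modules so each one comes after everything it imports."""
    # Raises graphlib.CycleError if the imports are circular.
    return list(TopologicalSorter(graph).static_order())
```

Reading the modules in that order means you meet the leaf modules first and the ones that tie everything together last. A `CycleError` is itself useful information about the codebase.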
I used =delete on the class function, really helped!
- Count the lines of code with find | wc, get a sense for what's there, and what language it's written in. The biggest file in the project is usually worth a look -- it is often where the "meat" is. Read the function names.
- Use the program. grep for strings that appear in the UI in the source code. That's a good place to start reading. Read function names.
- strace the program. What system calls does it make when?
ltrace is also sometimes useful, although it also gives a ton of output.
- Look at header files. Understanding data structures is often easier than understanding code.
- Look at commit logs. Those are hidden "comments". And reading diffs can be easier than reading code.
- Do a "log" or "blame" on the file. How has it evolved?
- Start reading main(). This often reveals something about the structure of the program. Even just finding main() in many programs is a good exercise :) Sometimes it's a little hard to find.
- Make sure to build it. And if you can, look at the build system. How is it put together? Most build systems are pretty darn unreadable. I don't really know how to read autoconf, and GNU make is tough too. Forget about cmake :) But sometimes this can help.
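The first bullet (line counts, languages, biggest file) can be sketched in Python for machines where the Unix tools aren't handy. This is a rough stand-in, not a replacement for cloc; the function names are made up, and it counts every text file, including generated ones.

```python
from collections import Counter
from pathlib import Path


def line_counts(root) -> dict[Path, int]:
    """Map every readable text file under root to its number of lines."""
    counts = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and ".git" not in path.parts:
            try:
                counts[path] = len(path.read_text(encoding="utf-8").splitlines())
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
    return counts


def lines_by_extension(counts: dict[Path, int]) -> Counter:
    """Total lines per file extension, a crude proxy for 'which languages'."""
    totals = Counter()
    for path, n in counts.items():
        totals[path.suffix or "<none>"] += n
    return totals


def largest_files(counts: dict[Path, int], top: int = 5) -> list[tuple[Path, int]]:
    """The biggest files by line count, which is often where the 'meat' is."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]
```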
I haven't gotten that far with this, but I tried uftrace recently and like it:
You can think of it like a dtrace that knows about every function in a C or C++ program.
I want to try some kind of code explorer thing. I saw this in a CppCon video and on HN:
And older ones like:
But somehow I get by with Unix tools. I think this is because I feel like building the project in a way to accommodate the source browsers might be a big pain.
Counterpoint: I think the hardest part of understanding a project is usually the build system :-) I don't have too much of a problem with reading C, C++, Python, or (sometimes) JS code. Volume is always a problem, but I can read a specific function pretty easily. But the build system is where things get ugly, in my experience.
Also, reading multi-threaded code requires some special consideration. grepping for every place that threads are started is a good idea.
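A sketch of that last grep in Python. The pattern list is an assumption covering a few common ways to start a thread (pthreads, C++ `std::thread`/`std::async`, Python `threading.Thread`, Java `new Thread(`); a real project may well use its own wrappers, so extend it as needed.

```python
import re
from pathlib import Path

# Assumed patterns; add project-specific thread wrappers here.
THREAD_START = re.compile(
    r"pthread_create|std::thread|std::async|threading\.Thread|new Thread\("
)

SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".h", ".hpp", ".py", ".java")


def find_thread_starts(root) -> list[tuple[str, int, str]]:
    """Return (file, line number, line) for every line that seems to start a thread."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if THREAD_START.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits
```

This only matches text, so it will also flag comments and miss threads started through a library or thread pool.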