
Ask HN: How to start learning large 20 year old code base? - Jiig
I've been tasked with learning a very old C++ code base (30k+ lines) at my job. I've heard that it originally dates from the late 90s and has been updated periodically, since it performs some very crucial processing.

The code is commented fairly well but is missing any sort of top-level architecture documentation or explanation.

I've started by just following the flow, but am getting very, very lost. Are there any tips you have to help me wrap my head around this?
======
davismwfl
Hard to give a lot of detail without understanding a little more, like whether
this is a GUI tool, a utility library, an API, an algorithm, etc.

But here is some general advice.

1\. Find all the inputs and map their usage/effect through the code. That will
help you understand what happens when an input changes.

2\. Find all the outputs and do the same as #1. Now you understand which
inputs produce which expected outputs.

3\. Trace through the calls with a given input and do a function & class map,
that'll help you see how the code interacts.

Where this gets harder is if the code is multi-threaded, or if it is a huge
monolith where there are lots of places to start. In the case of threads,
document each thread and the functionality it provides, which inputs are
shared, accessed or needed, and what outputs come out. Also check the timing
against other dependencies. In the case of a GUI monolith-type application,
pick a piece of functionality, for example login or app startup, and trace
everything that happens; do this for a bunch of different smaller pieces of
functionality until you understand how the code is put together.

As a consultant I used to walk into weird shit all the time, and whether it's
that I just have a knack for it or it's my process (the basics described
above), I can learn code bases quickly and be productive very rapidly. Things
that make it harder are lots of DI (especially when it is totally unnecessary
for the problem), third-party dependencies you can't access and that aren't
well documented, and multi-threading or micro-services where they are not done
well and there are lots of interdependencies. Also, event systems that are
poorly architected can make debugging issues and understanding flow super
hard, so I have other methods I use for systems like that, but it basically
follows the same pattern above.

The sure fire way to get lost fast is to try and map it all out at once. You
have to pick small pieces of functionality, map them, and build it out from
there.

~~~
Jiig
Thank you, this is very helpful. It's a decompression algorithm with a thin
CLI around it, lots of parallelization, and lots of different config options.

------
Someone
First things to check:

\- can you build it and if so, does the produced binary work? If so, look at
the makefile (or equivalent) to hunt for compilation switches. If not, spend
some time trying to make it build. If you don’t succeed, tell your manager
that this will be a lot harder (if the code doesn’t build and run, you don’t
even know whether you have all the code, IDEs may have trouble analysing it,
etc.)

\- And, given that this is a decompressor, do you have access to the
compressor, too? Chances are the makefile will give it to you. If you don’t
have one, that isn’t a showstopper, but may make things more difficult, so
inform your manager.

\- is the code under source control? If so, look at the history. Going back to
older releases may give you an easier code base to work with (given that,
elsewhere, you say _“It's a decompression algorithm with a thin CLI around
it”_
, that may help a lot, getting rid of various optimisations and config
options)

You can use various tools to visualise the call graph, but this being a
decompressor, there likely are many low-level functions whose purpose you
can’t tell from the code alone. If you aren’t familiar with compression
algorithms, or with this algorithm in particular, try googling the names of
various functions, fields, or variables.

In the end, 30k lines of C++ isn’t _that_ much. It may just be a matter of
grinding through. If you browse 1,000 lines an hour (3½ seconds per line),
that’s only 30 hours, doable in a week (and a week is not much if you
inherited the code base and aren’t just visiting it). Just dive in, and by the
time you’ve spent 10 hours, you will probably have generated some questions
you want answered, discovered some #define’s that control compilation, etc.
Eventually you will have to read every line, but don’t feel obliged to,
initially; just follow your instincts (and, in case the business side has some
short-term priorities, let that guide you).

On the one hand, decompression algorithms typically are of above-average
complexity, making that harder, but on the other hand, it is highly likely
that there are various CPU-specific and/or OS-specific code paths that you
(initially) can ignore, significantly decreasing your line count.

------
AnimalMuppet
You might use a tool that would document (and help you visualize) the call
graph. From that, you might get a better idea what parts are most important.

Another approach is to run it in a debugger, and just step through it,
watching how it does what it does.

30K lines isn't horrible, but you shouldn't expect to understand it overnight.
You should count on it taking at least a few weeks.

And, when you're done, you should leave behind a top-level architecture
document, and an explanation of how it operates.

------
thedevindevops
If they've taken the C++ 'interface' paradigm on, find the folder with the
abstract classes and map out all the virtual methods and what calls them; that
_should_ give you an insight into who the major players are and how they
interact.

------
Irishsteve
Maybe not applicable, but reading the unit and integration tests is usually
where I start.

------
probinso
There are ways to turn Makefiles into Graphviz plots. I've had value
generating a plot, then systematically reducing complexity from the `.dot`
files. It takes a long time.

