
Ask HN: How to dive into large codebase? - quantos
I have seen lots of tutorials and posts teaching different programming languages and frameworks, but I have never seen a tutorial on how to dive into a large codebase. I am somewhere between a beginner and an intermediate programmer, and I have always wondered how programmers familiarize themselves with large codebases. There are tons of open-source projects on GitHub, SourceForge, Bitbucket, you name it. Most of them have contributors who started contributing after the project matured, and it doesn't seem that they would have read and understood the whole source code of the project before contributing. Or have they?
Now my question is:

Are there any tutorials, books, or other materials that teach how to dive into, familiarize yourself with, read, understand, and contribute to large codebases efficiently?
======
arandr0
I actually wrote a short tutorial on this (geared towards enterprise though) a
while back[1].

For open source, I would amend it by adding that you should hang out in the
chat room -- you will pick up some answers to easy getting started questions
by osmosis and be a better known name when you are ready to submit a patch --
and read any contributing guides that outline procedures for contributors. You
should do this if you ever want to contribute even if you don't think you have
the understanding/skill level to do so yet.

Over time I've begun to understand that what really causes issues with junior
devs and established codebases is at least 50% psychological -- young good
coders think code should be like a math problem: there are a number of
formulas they know, they identify the problem, apply the formula, and solve
everything. When faced with a lot of code where the problem is not
well-characterized, they start thinking they are not smart enough for the
problem, and the ensuing anxiety spiral usually makes them _actually_ not
smart enough to see the forest for the trees.

[1]: [http://arandr.github.io/2015/01/17/how-i-learned-to-stop-
wor...](http://arandr.github.io/2015/01/17/how-i-learned-to-stop-worrying-and-
debug-other-peoples-code.html)

------
gjvc
This is one scenario where even the most die-hard vim/ctags ninja would
benefit from using something like a JetBrains-level IDE to navigate the class
hierarchies.

~~~
halpme
I have the best of both worlds: Eclipse IDE with a vim plugin (vrapper). Not
sure why people choose strictly one of the two.

------
tdubhro1
I've been compiling some notes on exactly this topic for the past few weeks;
your question prompted me to post them here:
[http://dubhrosa.blogspot.com/2016/07/how-to-dive-into-
largec...](http://dubhrosa.blogspot.com/2016/07/how-to-dive-into-
largecodebase-getting.html)

TL;DR: don't "dive" in, figure out the inputs and outputs and work from the
outside in; make lots of notes as you go; don't browse the code aimlessly,
write down a question you want to answer and focus on that, then write down
the answer when you think you've got it; don't get too hung up on the
editor/IDE; possibly controversial/YMMV: turn off syntax highlighting; page
through a file from beginning to end spending 5 seconds per screen when you
open it for the first time.

------
corecoder
It depends on the codebase.

Open source projects are probably a bad example, as they tend to be well (or
at least decently) written: code has cohesion and is organised in a rational
way, so that you can usually find pretty quickly the piece you are looking
for, and you can often just change the functionality you are interested in
without examining every single file.

Other codebases, like the ones that you may find in your typical enterprise
environment, often have much less cohesion, so they require shotgun surgery,
and behaviour is spread randomly throughout the codebase. If this is the case,
you'll be forced to look at more code, figure out the relationships between
the pieces, and maybe proceed by trial and error.

This is not a rule, as of course there are lots and lots of well-written
private enterprise codebases, but I haven't seen a lot of them to date.

------
Artlav
I tend to try to get the big picture. What is the overall structure of the
program?

If it's a game, for example, then what layers are there between the user and
the display driver? Where does it set the screen mode? What is the progression
of the drawing functions, etc.

Sometimes you only need to figure out one thing, e.g. how does Bitcoin encrypt
its keys? Trace the user input into the bowels of the client and out into
OpenSSL. In the process you will get an idea of how the onion is layered.

If it's something truly massive like ReactOS or the Linux kernel, then... No
idea. In my case I just dived in and kept crashing it until I stopped crashing
it.

The best way to do it is to get ahold of a developer who knows the code, of
course.

------
zzzcpan
Generally, for an unfamiliar problem domain, you just try to do it how you can
and if it feels hard - try to come up with another way to do it until it
doesn't.

For an intermediate level programmer there is unlikely to exist a problem
domain where you are proficient enough to just read the sources and understand
how everything works right away. Everything is pretty much new and unfamiliar.
So, you'll need a good high level view on how everything works, step by step.
Go with a call tracer. It will take some time to get it to work, but even so,
nothing beats it for efficiency when digging into a new codebase.
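
To make the call tracer idea concrete, here is a minimal sketch in Python, assuming the codebase is Python and that the standard `sys.setprofile` hook is good enough for a first look. `trace_calls`, `parse` and `count_words` are hypothetical names made up for this illustration; a real project would more likely use a dedicated tracing tool.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and return (result, calls); calls is a list of (depth, name)."""
    calls = []
    depth = 0

    def profiler(frame, event, arg):
        # Only Python-level calls are recorded; C-level events are ignored.
        nonlocal depth
        if event == "call":
            calls.append((depth, frame.f_code.co_name))
            depth += 1
        elif event == "return":
            depth -= 1

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always remove the hook, even on errors
    return result, calls


# Hypothetical stand-ins for functions in the codebase being explored.
def parse(text):
    return text.split()


def count_words(text):
    return len(parse(text))


if __name__ == "__main__":
    result, calls = trace_calls(count_words, "how to dive in")
    for depth, name in calls:
        print("  " * depth + name)  # indented call tree
```

The indented output gives the step-by-step, high-level view of who calls whom, without reading every file first.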

------
runT1ME
>Are there any tutorials,books or other materials teaching about how to
dive,familiarize,read,understand and contribute to large codebases
efficiently?

No, but I would say this is where you really want to know your tools. Be it a
debugger so you can step through a codebase to find a problem, using an IDE to
inspect parts/jump around, CTAGS/Grep for Vim/Emacs users, etc.
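
As a rough illustration of what a tags-style tool gives you, here is a small sketch that builds a "where is it defined" index for Python source using the standard `ast` module. It is not how ctags itself works, and `index_definitions` and `SAMPLE` are hypothetical names invented for the example.

```python
import ast


def index_definitions(source, filename="<unknown>"):
    """Return (name, kind, line) for every function and class in source."""
    tree = ast.parse(source, filename=filename)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append((node.name, "function", node.lineno))
        elif isinstance(node, ast.ClassDef):
            found.append((node.name, "class", node.lineno))
    # Sort by line number so the index reads top to bottom.
    return sorted(found, key=lambda item: item[2])


# Hypothetical sample file contents, made up for illustration.
SAMPLE = '''\
class Wallet:
    def encrypt(self, key):
        return key[::-1]

def main():
    return Wallet().encrypt("abc")
'''

if __name__ == "__main__":
    for name, kind, line in index_definitions(SAMPLE, "sample.py"):
        print(f"{line:4d}  {kind:8s}  {name}")
```

Running something like this over each file gives you a jump table of definitions, which is the part of an IDE or ctags you lean on most in an unfamiliar codebase.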

------
Guyag
This isn't an answer to your question, but I have recently thought that an
IDE/editor plugin to manually 'tag' lines/sections of code with a comment
would be useful when starting with a new codebase, rather than having to
remember everything. That said, I don't know how much that would negatively
impact the end goal of internalising these things.

------
entelechy
\- read api-documentation

\- check how the api on the highest level is used

\- datamine source control, e.g. git log

[https://www.youtube.com/watch?v=KaLROwp-
VDY](https://www.youtube.com/watch?v=KaLROwp-VDY)
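
One way to datamine the history, sketched under the assumption that `git` is on your PATH and the project is a git repository: count how often each file shows up in `git log --name-only`. Files that change most often tend to be where the action is. `count_file_changes` and `most_changed_files` are hypothetical helper names, not part of any tool.

```python
import subprocess
from collections import Counter


def count_file_changes(log_text):
    """Count how often each path appears in `git log --name-only` output."""
    return Counter(
        line.strip() for line in log_text.splitlines() if line.strip()
    )


def most_changed_files(repo_path=".", top=10):
    """Run git in repo_path and return the most frequently changed files."""
    # An empty --pretty format leaves only the file names in the output.
    log_text = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_file_changes(log_text).most_common(top)


if __name__ == "__main__":
    for path, changes in most_changed_files():
        print(f"{changes:5d}  {path}")
```

Run it from the root of a checkout; the top of the list is usually a good place to start reading.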

------
bo_Olean
I usually start from the presentation layer and dig down into the data models.
Sometimes I go in the reverse order, if the codebase is fairly big.

