
Ask HN: Studying large code bases - rvalue
Hi HN,

Suppose you had to understand a large distributed system. The modules that
interact can be written in different programming languages, and each exposes a
REST endpoint along with documentation of what that endpoint can do.

For simplicity of discussion, let's consider these languages: Java, C++,
Scala, Python and PHP. They could use a distributed database like Solr or
Couchbase.

Some of these modules may follow 12-factor practices and some may have pretty
bad code practices. There is no existing mechanism for monitoring other than
logging.

How would you go about analyzing this vast system? (Not just one module at a
time; elaborate on your techniques.)

To what degree would you trust the documentation, infrastructure, existing
testing tools, and environment setup tools in the process of understanding a
module?

What would you make sure you definitely do NOT do?

What language-specific strategy would you suggest? (Entry points, configs,
static analysis, etc.)

Would you think of creating a mechanism for cloud debugging?

How do you keep track of the knowledge you learn from one module and apply it
as you go along?

How do you keep your life together and still swim in so much complexity?

Share your stack! http://stackshare.io/

How do you apply these techniques to learn, update and maintain this system?

PS: I am asking this question in general and not for an existing system. But
if you would like to give an answer with some open-source project as an
example, that would be great.
======
zgm
1\. Start at the simplest subsystem, and work from there.

This is especially true when working with third-party, legacy, or undocumented
code. In the absence of documentation, your only way forward is to read the
source code[0]. Find the main() of the simplest/smallest module in the system,
and begin tracing through the code. This could be with a debugger, print
statements, or the search functionality in your editor/IDE (Find Usages and Go
to Declaration in Intellij are lifesavers).

2\. Don't be afraid to break things apart.

If the simplest module in the system is overwhelmingly complex, start
commenting out parts of the code. Go until you have effectively reduced it to
"Hello, World", if you have to. From there, you can gradually add features
back.

3\. Constantly test your assumptions.

Don't assume the code does what its comments say it does.

Don't assume that a config flag produces the behavior the documentation claims
it does.

Do take inventory of your assumptions whenever the behavior of the system
contradicts your current understanding of how it works.

You should be able to back up any claims about the system with empirical
evidence e.g. when I change A to B, X happens; if I change A to C, Y happens.

[0] [http://blog.codinghorror.com/learn-to-read-the-source-luke/](http://blog.codinghorror.com/learn-to-read-the-source-luke/)
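
The "when I change A to B, X happens" habit from point 3 can be turned into a
small, repeatable experiment. What follows is only a sketch in Python:
`legacy_timeout`, the `retries` setting, and the documented claim are all
invented for illustration and stand in for whatever part of the real system is
being probed.

```python
from typing import Any, Callable, Dict, Iterable


def run_experiment(probe: Callable[[Any], Any], values: Iterable[Any]) -> Dict[Any, Any]:
    """Run the probe once per input value and record what actually happened."""
    observed = {}
    for value in values:
        try:
            observed[value] = probe(value)
        except Exception as exc:  # a crash is also an observation worth recording
            observed[value] = f"raised {type(exc).__name__}"
    return observed


def contradictions(observed: Dict[Any, Any], claimed: Dict[Any, Any]) -> Dict[Any, tuple]:
    """Return every input where the documented claim and the observation differ."""
    return {
        value: (claimed[value], observed.get(value))
        for value in claimed
        if observed.get(value) != claimed[value]
    }


# Hypothetical stand-in for the code under study: the (imaginary) docs say the
# timeout is `retries * 10` seconds, but the code silently caps it at 30.
def legacy_timeout(retries: int) -> int:
    return min(retries * 10, 30)


if __name__ == "__main__":
    observed = run_experiment(legacy_timeout, [1, 2, 5])
    claimed = {1: 10, 2: 20, 5: 50}  # what the imaginary documentation promises
    for value, (doc, real) in contradictions(observed, claimed).items():
        print(f"retries={value}: docs say {doc}, system does {real}")
```

Keeping the claimed values next to the observed ones makes it obvious which
assumptions need to be revisited when the system surprises you.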

~~~
rvalue
Intellij does a very good job for most of my programming and I use it daily.

When learning a distributed system, treating the code as its own documentation
is not ideal. One could spend an enormous amount of time just to understand a
simple business use case.

But just because it takes time doesn't make it untrue.

I would rather have a system that documents itself with every change.

Breaking things apart is all well and good, but to truly understand
everything, one also needs to put the pieces back together one by one and
understand their interactions. That way, if any one module in the pool fails,
one knows exactly what caused it and how to be more resilient towards
failures.

~~~
joeevans1000
What do you use in IntelliJ to self document? I have a hard time reviewing
code effectively, so I'm curious about your strategies there.

~~~
rvalue
I think there is some confusion here.

What I meant was that applications should be self-documenting; something like
Swagger comes to mind.

------
chubot
Shell scripts!

I would write a crapload of shell scripts to start every component locally, in
the foreground. Hopefully you can just pass flags to specify ports, rather
than having to specify config files. (Though you can easily generate config
files from shell scripts)

I use tmux, so I can have all servers up in the foreground and logging to
stdout. You can arrange the windows so it's easy to get a global view.

Identify which servers are stateful and which are stateless. For the stateful
ones, it's nice to be able to save and restore their state (hopefully just
with cp and mv).

Then I would write some more shell scripts to poke at every port. For REST
services, just use curl, and possibly some JSON/XML helpers. Stateless servers
should be testable exhaustively.

Then pick some code change you want to make. Write the shell snippets to
tickle that code path.

Then write some more shell scripts to rebuild every component binary from
source. Make your code change, build, restart, and then test to see if it
works. Look at the logs for clues.

This is basically the "dynamic" approach to debugging systems, rather than the
"static" approach of staring at code. Sometimes staring at code is the right
thing to do, but if as you say the sheer volume of code is too much, then
figuring out the behavior as a black box -- tickling it with shell scripts --
is often a faster way to understand the system.

They should be scripts because you need to be able to repeat this process many
times.
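
For anyone who would rather keep the port-poking step in a script that also
parses the responses, here is a rough Python equivalent of the curl loop. It is
a sketch under assumptions: the service names, ports, and the `/health` path
are hypothetical, and it only issues plain GET requests.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, Tuple


def poke(url: str, timeout: float = 2.0) -> Tuple[int, str]:
    """GET a URL and return (status, start of body); status 0 means unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read(200).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:  # the server answered with 4xx/5xx
        return err.code, str(err.reason)
    except (urllib.error.URLError, OSError) as err:  # nothing listening, timeout
        return 0, str(err)


def poke_all(services: Dict[str, str]) -> Dict[str, Tuple[int, str]]:
    """Poke every service once and collect the results by service name."""
    return {name: poke(url) for name, url in services.items()}


if __name__ == "__main__":
    # Hypothetical local ports; replace with whatever your start scripts use.
    services = {
        "search": "http://localhost:8983/",
        "profile": "http://localhost:8080/health",
    }
    print(json.dumps(poke_all(services), indent=2))
```

Running it before and after a restart gives a quick global view of which
components are actually up.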

Other strategies are:

\- Understand the architecture from the build system. Don't look at the source
code, look at how it's compiled. What libraries go in what binaries is a
strong indicator of their function.

\- Understand the architecture from the deployment. In this case I'm
suggesting to rewrite the deployment yourself with shell scripts (to a single
host, your dev box). But you can also try to just look at whatever the
ops/devops team does to bring up the service in a new cluster.

\- Look at any existing system tests / regression tests.

~~~
rvalue
I am completely with you on the idea of replicating the whole system locally
with the help of Vagrant, making sure it is as close to production as
possible, and then starting and stopping things and observing the changes.

------
warcode
To understand the system start by ignoring the code.

1\. Start with all the components as black boxes.

2\. Draw the system design/connections/interfaces between them.

3\. Follow the data and uncover black boxes by reading the code.

4\. (If possible) Write tests and refactor code so that testing is possible.
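
Step 2 does not need a drawing tool. One possible approach is to keep the
connections in a plain dictionary and emit Graphviz DOT text from it. The
component names below are hypothetical placeholders, not part of any real
system.

```python
from typing import Dict, List


def to_dot(connections: Dict[str, List[str]]) -> str:
    """Render 'caller -> callee' edges as Graphviz DOT text."""
    lines = ["digraph system {"]
    for caller in sorted(connections):
        for callee in sorted(connections[caller]):
            lines.append(f'  "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical components; add edges as you discover real calls.
    connections = {
        "web-php": ["profile-java", "search-scala"],
        "profile-java": ["couchbase"],
        "search-scala": ["solr"],
    }
    print(to_dot(connections))
```

The output can be pasted into any DOT renderer, and the dictionary can live in
version control next to your notes so the diagram grows as black boxes are
uncovered.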

------
thewarrior
Try to understand the data model. Dump all the SQL schemas, diagram them if
required, and study them.
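
As a rough illustration of dumping a schema programmatically, the sketch below
uses SQLite from the Python standard library. SQLite is an assumption chosen
only because it needs no server; other databases expose the same information
through their own catalogs. The `users` and `photos` tables are invented.

```python
import sqlite3
from typing import Dict, List


def dump_schema(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Map each table name to its column descriptions ('name TYPE')."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [f"{col[1]} {col[2]}" for col in columns]
    return schema


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # throwaway database for the demo
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER, url TEXT)")
    for table, columns in dump_schema(conn).items():
        print(table, "->", ", ".join(columns))
```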

------
devnonymous
That's a whole lot of questions, I'll attempt to answer just the primary one
-- how I approach a new large code base.

For me, the absolute first thing to do is _not_ look at the codebase. I prefer
to understand the system itself (i.e. understand the stack and how everything
fits in). IOW, understand what layers exist and how exactly they are stacked.
Again, _without_ looking at the codebase, but rather by interacting with the
system.

I do this by trying to understand the main entry points for a specific task
and following the 'request' (for lack of a better term; it doesn't have to be
an HTTP request) along the stack. In real terms, this usually means that I hit
the logs and see what happens when I do something.
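
If the logs happen to carry some identifier that survives across services,
following a 'request' can be partly automated. The sketch below assumes a
made-up line format, `<ISO timestamp> <service> req=<id> <message>`; real logs
will almost certainly differ, so the regular expression is the part to adapt.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed line shape: "<ISO timestamp> <service> req=<id> <message>"
LINE = re.compile(r"^(\S+) (\S+) req=(\S+) (.*)$")


def trace_requests(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group log lines by request id, ordered by timestamp within each request."""
    by_request = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # lines that do not follow the assumed format are skipped
            timestamp, service, request_id, message = match.groups()
            by_request[request_id].append((timestamp, service, message))
    # ISO timestamps in one consistent format sort correctly as plain strings.
    return {rid: sorted(events) for rid, events in by_request.items()}


if __name__ == "__main__":
    merged_logs = [
        "2015-03-01T10:00:02Z profile req=42 stored photo",
        "2015-03-01T10:00:01Z web req=42 POST /profile/photo",
        "2015-03-01T10:00:01Z web req=43 GET /search",
    ]
    for request_id, events in trace_requests(merged_logs).items():
        print(request_id, "->", " | ".join(f"{svc}: {msg}" for _, svc, msg in events))
```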

Next (ie: by which time I know at least the major layers), I would look at the
codebase and identify sections/modules that seem like they provide the entry-
points discovered above. Next, depending on the language and tools that are
available, I start 'playing' around with the code.

I know this sounds a bit vague but all that I can tell you is this is what
works for me and I have gotten better at it over time (ie: experience counts).

------
joeld42
How I do it is play with the system as a user first, and think of small stuff
to change. It might be real features or bugfixes, but at this stage, it's
often throwaways like "make that button report the current frommet count" or
"add a new token to the file format". Small changes that require communication
between disparate parts of the system, or that require touching deeply buried
parts of the codebase, are best. Often in a large codebase trivial changes can
be quite hard to track down. The act of digging through and finding and making
these changes gives me an overall map of the codebase.

Fixing bugs is a great way to do this. If I'm new to a project, I will spend a
few days just going through the bug backlog if I can. Half the time (at least)
the bugs are not even real, no longer applicable or otherwise non-fixable, but
just tracking that down is useful (and it feels good to close tickets)

Also, learn quickly which parts of the codebase you can simply ignore and
treat as a black box, and which parts you need to change.

------
sudeepj
I recently changed jobs and I am trying to figure out how the product works
there, so here is my strategy:

1\. Knowing the actual purpose of the product (or system) is a must.

2\. Using the product yourself helps a lot. This gives you end-user
perspective.

3\. Reading the code in a context is far more effective. A context here could
be a test case or a use case. The more granular the scenario, the better.

4\. If you are not able to understand a piece of code, move on and revisit it
later. In the meantime, note the input and output of "that specific" code.
Many times, after you move on and read other parts, you start to connect the
dots.

5\. Never assume anything!

Example: I do not know much about the Linux kernel, so let's say I want to
know where (and how) the kernel makes entries in its own data structures when
I call "fopen" in C.

------
parados
Have a look at David Beazley's video for a real life example:
[http://www.pyvideo.org/video/2645/discovering-python](http://www.pyvideo.org/video/2645/discovering-python)

~~~
rvalue
thanks :)

------
zzzcpan
Even though you are asking general questions I'm going to assume your goal is
to work on a large unfamiliar code base.

Go with a call tracing tool, write one yourself if there is no such tool. This
will speed up your learning process tremendously compared to any other
approach.
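
For the Python modules in such a system, "write one yourself" can be quite
short. The sketch below uses the standard `sys.setprofile` hook to record
which Python functions a call reaches. It is only an illustration: the
`handle_request`, `load_config`, and `connect` functions are invented, and
the other languages mentioned would need their own tracing tools.

```python
import sys
from typing import Any, Callable, List


def trace_calls(func: Callable[..., Any], *args: Any, **kwargs: Any) -> List[str]:
    """Run func and return an indented list of the Python functions it called."""
    calls: List[str] = []
    depth = 0

    def profiler(frame, event, arg):
        nonlocal depth
        if event == "call":        # a Python function was entered
            calls.append("  " * depth + frame.f_code.co_name)
            depth += 1
        elif event == "return":    # a Python function exited (also on exceptions)
            depth -= 1

    sys.setprofile(profiler)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(None)       # always remove the hook again
    return calls


# Invented example code to trace.
def load_config():
    return {"port": 8080}


def connect(port):
    return f"connected:{port}"


def handle_request():
    cfg = load_config()
    return connect(cfg["port"])


if __name__ == "__main__":
    print("\n".join(trace_calls(handle_request)))
```

The indentation in the result mirrors call depth, so the printed list reads
like a rough call tree for one scenario.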

And in general, if learning feels hard - you are doing it wrong, try something
different.

------
Nate75Sanders
OpenGrok has been very useful when I want to understand a new codebase. The
distributed / REST aspect is going to make it harder to use code tools,
though.

~~~
amirouche
OpenGrok seems like a light version of DMS, as I understand it. DMS was
presented at Google in 2010 [1]. It's a software analysis tool (control/data
flow, copy/paste detection, AST Levenshtein distance), refactoring tool, and
translator between several languages. IIRC they do distributed systems.

[1]
[https://www.youtube.com/watch?v=C-_dw9iEzhA](https://www.youtube.com/watch?v=C-_dw9iEzhA)

------
Stephn_R
I would trust documentation with a grain of salt. Very rarely do I find
intensive and accurate documentation for distributed systems. However,
documentation may be helpful for things such as stand-alone components and
REST endpoints. Those would be taken with possibly 2 grains of salt.

The one thing that you would never want to do is actually read through the
code line by line*. HOWEVER, let it be noted that I mean this as an initial
point of insertion. As a Computer Scientist, your first step should be in
attempting to ask the right questions.

Think of it like a game of 20 questions (N questions in this case...). To do
this, one of the best things I have done is to follow the process. All
distributed systems attempt to solve one or several use cases. Define these
use cases and then analyze the code base from there. For example, a use case
that involves a REST endpoint could be updating your profile photo. The use
case would define a point of entry (where/how to submit a new photo) and a
point of submission (the REST endpoint). From this, we can search for the two
components in the system that facilitate this task.
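
One simple way to run that search across modules written in different
languages is a plain text scan for the endpoint path. The sketch below is a
minimal, grep-like helper; the `/profile/photo` path and the list of file
extensions are assumptions made for the example.

```python
import os
from typing import Iterable, List, Tuple

SOURCE_EXTENSIONS = (".java", ".cpp", ".h", ".scala", ".py", ".php")


def find_mentions(root: str, needle: str,
                  extensions: Iterable[str] = SOURCE_EXTENSIONS) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for each source line containing needle."""
    hits = []
    extensions = tuple(extensions)
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(extensions):
                continue  # skip files that are not source code
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((os.path.relpath(path, root), number, line.strip()))
    return sorted(hits)


if __name__ == "__main__":
    # "/profile/photo" is a made-up endpoint path for the photo-upload use case.
    for path, number, line in find_mentions(".", "/profile/photo"):
        print(f"{path}:{number}: {line}")
```

The files that mention the path on the calling side and on the serving side
are the two components to read first.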

=== A good word of advice in understanding any code base is that, nine times
out of 10, it is ALWAYS easier to read code once you understand its purpose.
===

> Language Specific Strategies

As for some good language-specific strategies, know which paradigm the
language falls into. For example, Java code tends to have a single point of
entry in most cases, whereas JavaScript (as a functional language) can have
several entry points (or even none at all). For functional languages, be
mindful of how they are structured. Functional languages are not
object-oriented by nature, but they can be if implemented properly. As such,
try to notice the key differences (i.e. prototyping).

> Cloud Debugging?

I would not consider it. Debugging should be a constant process of
development. Not even the best developers can write perfect code from the
start but with enough work and constant QA testing/Unit testing during the
development process, any code base can become easily manageable for debugging.
In addition, learning how to read Stack Traces can save HOURS of debugging.

> Tracking knowledge from Module to Module

Personally, I keep notes. I pretend as if I am writing an API document. Nine
times out of ten, most modules take an input and provide an output after some
calculation. So long as the calculation is sound, I do not worry about the
calculation and only record the call to the module itself and its return.
Other cases like classes can be tracked by the same notion.
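
For modules that can be called from Python, this kind of note-taking can be
partly automated. The sketch below records each call and its return value
while treating the calculation itself as a black box. The `noted` decorator,
the `NOTES` list, and `resize_photo` are hypothetical names for illustration.

```python
import functools
from typing import Any, Callable, Dict, List

NOTES: List[Dict[str, Any]] = []  # one entry per observed call


def noted(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a function so every call appends its inputs and output to NOTES."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)
        NOTES.append({"module": func.__name__, "args": args,
                      "kwargs": kwargs, "returned": result})
        return result
    return wrapper


# Invented module under study; only its inputs and outputs are recorded.
@noted
def resize_photo(width: int, height: int, scale: float = 0.5) -> tuple:
    return (int(width * scale), int(height * scale))


if __name__ == "__main__":
    resize_photo(800, 600)
    resize_photo(100, 50, scale=2.0)
    for note in NOTES:
        print(f"{note['module']}{note['args']} {note['kwargs']} -> {note['returned']}")
```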

> OH THE COMPLEXITY

It's a necessary evil. No one ever said being a developer was easy. And no one
ever said the best things in life are free and easy. I work hard for what I do
and for what I want. I love my job. I don't think I would ever consider
changing careers. Sometimes the complexity does become overwhelming and it's
hard to handle. But if you wish to know my little secret: schedule yourself
some mandatory free time, and stick to the schedule. This doesn't have to mean
breaking away from the computer; just do things that help you relieve
stress. Talk to a friend and have a drink or two. Developing is tough on the
mind, but that is an occupational hazard.

> Updating/Maintenance

If it's open source, let the users dictate when the updates should happen. I
believe in Agile development and trusting the users. But when it comes to
updates and maintenance on a corporate system, always try and set some goals
for the change. Why are you updating/performing maintenance? Is it to improve
the speed/response time? Is it worth the cost of time/labor?

> GOLDEN RULE: Ask Questions. Be Skeptical and always tread carefully. We are
> Developers. We are Problem Solvers. Not Problem Makers.

BG: I have 6 years of experience working with corporate-level architectures
and distributed systems, and I earned my BA in CS with a minor in Information
Systems. In my time, I have worked with multiple functional, object-oriented,
and web-based languages. I have also worked as a full-stack developer for
roughly 3 years now.

~~~
rvalue
Thanks Stephn_R

I did not think anyone would take the time to answer so many of my questions.
I agree with you on a lot of the points you mentioned above. Regarding the
documentation, I usually trust it if the module or the system is new, but if
it's old enough, I assume it's outdated and that architectural drift exists,
and I only take it as a reference for how the system was planned to be.

This also gives a lot of use cases to begin with and helps in understanding
the control flow. Regarding test cases, integration and regression tests help
a lot in finding failures, but I feel more confident about them if they were
written by the same developer who made the code change, since the original
intent behind the change is clearest to that developer and the reviewer.

For knowledge tracking, I guess there is no easy way, one learns from
experience and time, but it would be great if there were some great tools to
capture thoughts and map them to a definite state in the application without
making any change in the codebase.

I spent 6 years in academia doing my bachelor's and master's, and I have less
than a year in my professional career. The sheer amount of stuff one needs to
know about building scalable and distributed applications is just amazing and
nerve-wracking at the same time, and I will keep your secret in mind :)

~~~
Stephn_R
Thank you for the kind words! It is daunting to see how much anyone would need
to know in order to set up distributed systems. But it is powerful to have.
And with great power comes great responsibility :)

------
adamnemecek
Set up a debugger and step through common scenarios.

~~~
justincormack
"debugger" in the broadest sense. I would start by intercepting the network
traffic (at appropriate protocol level) and getting a feel for what is going
on, and if it matches the docs.

------
igorgue
Also, read this book: [http://www.amazon.com/Working-Effectively-Legacy-
Michael-Fea...](http://www.amazon.com/Working-Effectively-Legacy-Michael-
Feathers/dp/0131177052)

It helps a lot and teaches you how to use grep and other tools (many more that
I no longer remember) to search and find your way through legacy code.

------
igorgue
My regular way to go would be bug fixing: the more bugs you fix, the more
familiar you become with the source code. This is what most open source
projects do, though there the code quality is better, so it takes less time to
get familiar with it.

Then, if I don't have any bugs to fix, I would get very familiar with the
application and read the logs for every action I perform, take note of where
the log lines say they come from, study those places, and go down rabbit
holes.

Whatever you do, don't try to rewrite it; rewrites almost never work unless
you have way too much time on your hands.

