Ask HN: How do you familiarize yourself with a new codebase?
407 points by roflc0ptic on June 26, 2015 | 239 comments
My question is pretty straightforward: how do you, Hacker News enthusiast, familiarize yourself with a new codebase? Obviously your answer is going to be contingent on the kind of work that you do.

Some background: What's motivating me to ask is that I am flirting with the idea of trying to add a couple of features to SlickGrid (https://github.com/mleibman/SlickGrid), Michael Leibman's phenomenal JavaScript grid widget. Unfortunately, Leibman got busy and isn't actively supporting it anymore.

The codebase is something like 8k lines of JavaScript, so it's not ludicrously big, but I'm kind of intimidated thinking about trying to make sense of it. My first strategy is just to open up important-looking JavaScript files (slick.core.js, slick.grid.js) and read through for comprehension. This seems like a pretty slow way to build a mental model of the code, though. Some features I want to implement are 1. an ajax data source that doesn't require paging, and 2. frozen columns. Someone else has implemented a buggy version of frozen columns (and since abandoned the project), and I might like to use it, but I can't tell if it's buggy because it's a hard problem, or because their implementation strategy was poor (or both!). So at the moment I can't evaluate whether I should implement my own or try to fix the issues with theirs.

Picking up other people's code seems to be one of the harder tasks developers face, as evidenced by how much code gets abandoned, so I wondered if the voices of experience on here could point me in the right direction, either by talking about this problem in particular, or more generally, how you build knowledge about a new codebase.

Thanks!




I wrote some simple bash scripts around git which allow me to very quickly identify the most frequently-edited files, the most recently-edited files, the largest files, etc.

https://github.com/gilesbowkett/rewind

it's for assessing a project on day one, when you join, especially for "rescue mission" consulting. it's most useful for large projects.

the idea is, you need to know as much as possible right away. so you run these scripts and you get a map which immediately identifies which files are most significant. if it's edited frequently, it was edited yesterday, it was edited on the day the project began, and it's a much bigger file than any other, that's obviously the file to look at first.
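
for reference, the frequency part of that map can be sketched as a short git pipeline. this is a rough approximation, not the actual rewind scripts:

```shell
#!/bin/sh
# Most frequently edited files: count how often each path appears
# across all commits in the history. Run inside a git repository.
git log --pretty=format: --name-only \
  | grep -v '^$' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -20

# Most recently edited files: paths touched in the last 30 days.
git log --since='30 days ago' --pretty=format: --name-only \
  | grep -v '^$' \
  | sort -u
```

the counts come out highest-first, so the top of the list is where you start reading.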

we tend to view files in a list, but in reality, some files are very central, some files are out on the periphery and only interact with a few other files. you could actually draw that map, by analyzing "require" and "import" statements, but I didn't go that far with this. those vary tremendously on a language-by-language basis and would require much cleverer code. this is just a good way to hit the ground running with a basic understanding which you will very probably revise, re-evaluate, or throw away completely once you have more context.

but to answer your actual question, you do some analysis like this every time you go into an unfamiliar code base. you also need to get an idea of the basic paradigms involved, the coding style, etc. -- stuff which would be much harder to capture in a format as simple as bash scripts.

one of the best places to start is of course writing tests. Michael Feathers wrote a great book about this called "Working Effectively with Legacy Code." brudgers's comment on this is good too but I have some small disagreements with it.


Thanks for sharing. I often get lost in large projects. Blindly jumping around is quite inefficient and frustrating.

How hard do you think it is to write a tool that draws a dependency map for a specific language?

Maybe there are built-in code analysis tools in compilers for popular languages that I'm not aware of?


For the Go standard library, since packages have no circular deps, the map can be drawn automatically, like this:

http://lonnie.io/gostd/dagvis/

If you write your project so that there are no circular deps even among files, and all the files are small (as I did in https://github.com/h8liu/e8vm), you can draw a similar graph at a much finer granularity, like this:

http://8k.lonnie.io/


Wow! What did you use to make these? They look awesome!


Thanks.

Here are my hand-made tools for generating that stuff: https://github.com/h8liu/e8tools

JavaScript version (I actually wrote this first, for the Go std lib): https://github.com/h8liu/dagvis


dependency graphs and code analysis are huge topics. this is probably the hard part with libraries like npm or Bundler.

the graphs can have cycles, multiple references to the same dependency, multiple distinct versions of the same dependency, et cetera. it's also all very, very language-specific. in many languages, the order of the require or import statements matters; in others, it doesn't. you can also have something like Clojure, with several different ways of bringing in a dependency, which makes the whole issue much more fine-grained.

prior to writing these scripts, I tried to do something more ambitious with auto-refactoring tools. these exist in Eclipse and can even be graceful in Smalltalk but my own results were not so amazing. I got somewhere with regular expressions, code generation, and shell scripting. but I also built a thing in Ruby which could auto-refactor a tiny, TINY subset of the most obvious refactorings in JavaScript. it took me months, maybe harmed my sanity, and was definitely not the best code I ever wrote.

TLDR: compilers, zomg.

edit: forgot to say you're welcome. :-)



Great answer. I can't begin to estimate how many times I've considered writing something along these lines.

It might be nice to surface files that are frequently edited together as well.
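
A hedged sketch of how that might work: for each commit, emit every pair of changed paths, then count pairs across the whole history. The '@@@' marker and the cutoff of 20 are arbitrary choices for illustration:

```shell
#!/bin/sh
# Files frequently edited together: high pair counts hint at hidden
# coupling between files. Run inside a git repository.
git log --pretty=format:'@@@' --name-only | awk '
  function flush(  i, j) {      # emit all pairs for the finished commit
    for (i = 0; i < n; i++)
      for (j = i + 1; j < n; j++)
        pairs[files[i] " <-> " files[j]]++
    n = 0
  }
  /^@@@/ { flush(); next }      # marker line: a new commit begins
  NF     { files[n++] = $0 }    # a changed path in the current commit
  END    { flush(); for (p in pairs) print pairs[p], p }
' | sort -rn | head -20
```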


I made a crappy visualization of the commits of our University project, http://codepen.io/Azeirah/pen/bdawBm

It was a 3D pong game, along with the game engine. Pretty neat to see this visualized.


Does anybody know of a similar analysis tool for an SVN project? One option would be to convert the SVN repo to git for analysis purposes, but I'd be interested in a better solution.


never tried it myself, but you can take a look at the code accompanying "Your Code as a Crime Scene":

https://github.com/adamtornhill/code-maat


Thanks! Gonna give this a try.


Is it normal that it takes a while (> 2 minutes) to run your program on a big Java repo?

edit: took like 4 minutes, worked well!


quoting from the readme:

Caveat: Running this code against extremely large projects with very long histories (e.g. Rails) might be very slow.


That doesn't seem very long at all.


Well, it doesn't; wth.


Thanks for sharing. For someone in their early career like me, this is very useful.


Hmm... this is an interesting project. No License info though.


Added a pull request to use git ls-tree for filenames, so you only check files in the repo instead of using find. Hopefully it will end up under an open source license of some sort.
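
For the curious, the difference in a nutshell (illustrative commands, not the actual pull request):

```shell
#!/bin/sh
# find returns everything on disk: untracked files, build artifacts,
# editor backups, even .git internals.
find . -type f

# git ls-tree lists only the files tracked in a given revision.
git ls-tree -r --name-only HEAD
```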


Very simple & quite effective. Thanks for sharing.


Very cool! Thanks for sharing your scripts.


thanks for making this and sharing!


A post from last year, "Strategies to quickly become productive in an unfamiliar codebase": https://news.ycombinator.com/item?id=8263402

My comment from that thread:

I do the deep-dive.

I start with a relatively high level interface point, such as an important function in a public API. Such functions and methods tend to accomplish easily understandable things. And by "important" I mean something that is fundamental to what the system accomplishes.

Then you dive.

Your goal is to have a decent understanding of how this fundamental thing is accomplished. You start at the public facing function, then find the actual implementation of that function, and start reading code. If things make sense, you keep going. If you can't make sense of it, then you will probably need to start diving into related APIs and - most importantly - data structures.

This process will tend to have a point where you have dozens of files open, which have non-trivial relationships with each other, and they are a variety of interfaces and data structures. That's okay. You're just trying to get a feel for all of it; you're not necessarily going for total, complete understanding.

What you're going for is that Aha! moment where you can feel confident in saying, "Oh, that's how it's done." This will tend to happen once you find those fundamental data structures, and have finally pieced together some understanding of how they all fit together. Once you've had the Aha! moment, you can start to trace the results back out, to make sure that is how the thing is accomplished, or what is returned. I do this with all large codebases I encounter that I want to understand. It's quite fun to do this with the Linux source code.

My philosophy is that "It's all just code", which means that with enough patience, it's all understandable. Sometimes a good strategy is to just start diving into it.


I find it frustrating that language features actively work against you when you're trying to understand something.

Wide inheritance hierarchies and heavy macro usage are probably the worst. Good naming can aid understanding, but basic things like searchability are harmed by both.

Of those two, macros are the most trouble. You can't take anything for granted, and must look at every expression with an eye for detail. Taking notes becomes essential.


This is my approach too. I like to understand the entire flow, from beginning to end. To me this is the best way to get familiar, because once you dive in from different entry points you start noticing the patterns and similar paths in the code, to the point where you don't need to dive into those areas again; you assimilate them quickly just by going over them multiple times.


Interesting. I take a similar approach, but then I add testing (which usually coincides with fixing some bug).

Find an entry point to the system, then make it compile, then make the test run, then just keep on nudging the code until I'm satisfied I've covered what I'm interested in.

If I stay true to my mantra "only add test code when it is absolutely necessary" (is this argument needed? pass null and find out), I end up with an accurate (albeit not pretty) description of the flow through that procedure.

Then you commit your test and save your discovery for posterity.


I also like to find a high-level interface or function and follow it down. Once I get to the bottom, I start following the important data. This is particularly helpful nowadays, when data frequently moves between multiple systems before producing easily visible results.


1. I make sure I can build and run it. I don't move past this step until I can. Period.

After that, if I don't have a particular bug to fix or feature to add, I just go spelunking. I pick out some interesting feature and study it. I use pencil and paper to make copious notes. If there's a UI, I may start tracing through what happens when I click on things, again with pencil and paper first. This helps me use my mind to reason about what the code is doing instead of relying on the computer to tell me.

If I'm working on a bug, I'll first try to recreate it, taking copious notes documenting what I've tried. Once I've found how to recreate it, I clean up my notes into legible recreate steps and make sure I can reproduce it using those steps. These steps are later included in the bug tracker. Then I start tracing through the code, taking copious notes, and so on. You get the picture.


Debugger! Surprised no one has mentioned it yet. I work in JS and PHP, and in both I use the debugger a lot.

Set a breakpoint and burn through the code. Chrome has some really nice features: you can tell it to skip over files (like jQuery), and you can open the console, poke around, and set variables to see what happens.

Stepping through the code line by line for a few hours will soon show you the basics.


Debugging through the test cases in particular is a good way to decipher/dissect things, at least in my experience. Usually you can find a test case that covers only the specific component you're interested in, and then the test exercises just those pieces, so there isn't an overwhelming amount of information all at once.


I am surprised how few younger programmers use a debugger these days.


Debuggers and profilers are essential tools, but I was surprised when I read the book "Coders at Work: Reflections on the Craft of Programming" that a lot of the greatest programmers debugged just with a printf.


Maybe this is just my lack of experience speaking, but I find debuggers incredibly difficult to use when dealing with interactive software (which is most software I work on). Hitting a breakpoint freezes my interaction unless I configure it to print something, and configuring it to break conditionally requires knowing some obscure incantations that I can never remember off the top of my head (and that in turn require their own debugging). So it boils down to: should I spend 5 minutes figuring out how to set this conditional, print-only breakpoint? Or should I just put in a printf? The solution, at least to me, is obvious.

(With that said, when I have a nasty bug that requires squishing around in the guts of my program, breakpoints are invaluable. But it's a pretty huge hammer that only comes out on occasion.)

EDIT: I wonder how useful an editor that allows inline debugging code would be? Instead of setting breakpoints in your IDE and then configuring them in an individual window, you'd enter a "debugging editor" mode that would allow you to add debugger-only code for things like conditional breakpoints and printf statements right alongside your normal code. The original source files would not be edited, but while in this mode, it would appear as if they were. (Maybe the debugging code would show up in red.) That way, you could easily access all your local state and implement complicated queries without ever leaving the context of your code. In other words, it would be just like printf debugging, but once you no longer need to debug you'd just collapse those statements out by leaving the debugging editor mode and your original code would be unchanged. Perhaps this debugging code would not compile along with the rest of the code but would instead use the debugging hooks that normal breakpoints use, unless the conditions are particularly complex or something. Just a thought.


Not sure if you know this, but if you're in JavaScript you can just drop a 'debugger;' statement anywhere in your code and it will function as a breakpoint.


Ah yes, I should have mentioned that this is coming from a native application developer!


I'm not a younger programmer and I don't use a debugger. Tried it several times and found it counterproductive.


Like with most useful tools, you have to use it more than "several times" to gain fluency.


Indeed, and learning to use a debugger will let you gain fluency in new languages and frameworks much faster than the simple, albeit powerful, printf.

I remember needing to learn Ruby and Rails on an existing closed codebase, and having the debugger take me right into the core of Rails several times to understand why certain things were done a certain way in the top-level code. This let me get acquainted with the internals of Rails and Ruby's inheritance system, as well as the existing codebase built on top, much faster.


How is it counterproductive exactly? The setup of the debugger itself? Or trying to figure out how it works?


It depends on what you're working on.

If you're working on, say, a huge Java codebase, then a debugger is practically essential because you've got a lot of code to navigate, and you probably want to see what flavour of objects are being passed around and how their state is being updated.

On the other hand, if you're working on, ooh, a Node.js codebase, you're probably looking at less code, with a debugger that's much slower, and probably more functional code that operates on data directly. Using a debugger in that case is often slower than just using print statements.


I'd argue that print statements are a debugger, albeit a custom, one-off debugger. Sounds like the actual debugger needs work (...or, in my experience, it's me who doesn't understand the debugger correctly).


That argument is a bit of a stretch. Just because you are debugging something doesn't mean you're using a debugger...


It's a complicated explanation, I suppose, but I would say that among other things, I've seen my colleagues abuse debugging, and since the debugger is a "single-threaded", sequential approach, if you will, they were losing the big picture. I try to understand the code and keep it in my memory so that I can predict the behaviour without the debugger. On a better day this would be a better explanation.


I'm not sure how you "abuse" debugging -- can you elaborate on that?

The debugger runs in as many threads as the app itself. If you're debugging C#, and your program is in 37 threads, you can debug all of them simultaneously or debug one and ignore the others.

Understanding the code and keeping it in memory sounds like a great practice, but it also seems entirely orthogonal to using a debugger.

I have the exact opposite attitude: I live in the debugger. Most of my resistance to trying "new hot" languages is their universally terrible debuggers. (Usually they either don't exist at all, or only exist in CLI form.) IMO, if you don't have a working, stable, graphical debugger, your programming language has no business being 1.x.


The debugger being sequential is a consequence of your code also being sequential at the CPU level.

Understanding the code is often a luxury we don't have, either because we didn't write it in the first place or because there are just too many moving parts each tracking their own state and interacting with everything under the sun.

This is especially true for long-running programs that do much more than just transform inputs into outputs. For example a video game would be complete hell to develop without a solid debugger. Even if you believe you understand the complete code and all its behavior the testers will always find a way to make the game crash after 2 hours of playing leaving the game's state completely broken and without a debugger you'll be scratching your head endlessly trying to replay how that happened.

In these cases the number of different states and behaviors you can get out of the system is incredibly high, easily a few orders of magnitude more than the human brain can handle.


> or because there are just too many moving parts each tracking their own state and interacting with everything under the sun.

For me at least, that's typically my cue to start refactoring. If I can't model at least the general behavior of my program in my head, then it's probably needlessly complicated.

This isn't always possible, of course (such as in the case of long-running programs, as you mentioned; it's still possible in a lot of cases, though, and splitting off functionality into smaller, easier-to-digest pieces can make troubleshooting much easier), but it's been a useful mentality for me, and has significantly reduced the need for me to use some sort of debugger or even 'print' statements to make sense of code.

This doesn't address the original question of understanding new codebases, though.


> I try to understand the code and keep it in my memory so that I could predict the behaviour without the debugger.

That's the theme I hear when prodding people who (sometimes loudly) say they don't use debuggers. They work on codebases that are either small, solo projects, or change very slowly. That way they can keep an accurate simulation of the entire program in their heads.

Meanwhile, I've always worked on codebases that were changing faster than I can keep up. The majority of my debugger time has been in code I've never seen before.


Not necessarily. For instance, my colleagues work with a fairly large codebase in Java which runs on top of a framework written in Scala. When they tried to hook up a debugger, it naturally drilled through the whole stack, top to bottom, and showed them a whole lot of Scala code. Problem is, these guys don't know Scala, only Java.


My debugger shows a full stack trace too. I usually ignore the Django code that it's written on top of and use the debugger in the sections I have written.

I don't find it a hindrance in any way seeing what part of the Django code called my code.


I just use printf and other stuff to dump critical variables. I also use unit tests a lot.


Ah yes, the good ol' printf-debug-polluting-the-codebase method. What's fun after that is when programmers all start competing to make their printfs more visible in the sea of prints... I find this method primitive at best. How do you manage it?


> I find this method primitive at best. How do you manage it?

Don’t be primitive about it. :-)

The serious answer is: use a good logging framework. In particular, I’d suggest something that can log at selective levels of information for different parts of your system.

If your project’s culture is to instrument your code like this routinely, it can be a very useful asset, and it’s no more disruptive in practice than say adding comments or writing tests. In each case, with experience you get better at judging where to focus your efforts and how much detail is worth including by default, and if that turns out not to be enough you can always add more while you’re working in that area.

With modern tools for recording and analysing logs, there is relatively little useful information that you can easily find using a debugger but can’t easily find from good log output. On the other hand, logging has some big advantages: you can capture how your system changes over time, you can record and review concurrent behaviour cleanly, you can capture information from different parts of distributed systems that might be written in different programming languages or running on different devices.
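
As a toy illustration of the selective-levels idea (not any particular framework, and the level numbers are an arbitrary choice), a logger can be as simple as a numeric threshold check; real frameworks add per-module levels, handlers, and structured output:

```shell
#!/bin/sh
# Toy leveled logger (sketch): messages below LOG_LEVEL are suppressed.
# Levels: 0=DEBUG 1=INFO 2=WARN 3=ERROR; LOG_LEVEL defaults to INFO.
LOG_LEVEL=${LOG_LEVEL:-1}

log() { # usage: log <numeric-level> <label> <message...>
  _lvl=$1; _label=$2; shift 2
  if [ "$_lvl" -ge "$LOG_LEVEL" ]; then
    echo "[$_label] $*" >&2
  fi
}

debug() { log 0 DEBUG "$@"; }
info()  { log 1 INFO  "$@"; }
warn()  { log 2 WARN  "$@"; }
error() { log 3 ERROR "$@"; }

debug "suppressed at the default level"
info  "connecting to database"
error "connection failed"
```

Running with LOG_LEVEL=0 turns the debug chatter back on without touching the code, which is the point: the instrumentation stays, the noise is configurable.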


You remove the printfs when you have fixed the problem.


It's so tempting to do this. Don't. Problems may occur again in the same area. Leave your logging statements around; configure your logging setup to skip them when you don't want them.

You're saying to delete commented-out code once the new code works. Put your code in version control instead. Logging is to one-off printfs as version control is to commented-out code.


Not so sure. Once I get to the stage where I have to stick printfs in, the problem tends to be so localized and specific that, once fixed, I will never need to look at that area of code in that way again.


I do this far more than I should. The code ends up a mess after enough debugging like that.


In my case, aggressively reject any printf statements in code review. Get that shit out of my repo.


Without trace statements, how do you diagnose issues that don't occur locally and only appear in the production environment?


Traces are the kind of thing that should be automated anyway... You shouldn't have to write explicit trace statements in your source code, instead your language or framework should have an automated way of inserting these statements at function boundaries. Or even better, attach a debugger and reproduce the issue yourself on the prod server.

Just plain "debug log statements" I have yet to hear a good use case for. Every time people have put them in, it's because some bug called for their inclusion, then the bug got fixed, then the statements got left around after the bug was fixed. Or the bug didn't get fixed and it's a matter of "why don't you fix the bug?"

There's rare cases where there's an ongoing issue with a known bug that you don't know the fix for yet, so you drop debug statements in production code hoping to catch it. But this is supposed to be a rare, rare case.


That would be the difference between printing and logging.


I think he meant to add print statements locally. You remove them before committing the code.


printf = breakpoint expression evaluation (but without the flexibility)

You're using a primitive debugger, you just don't know it.


It does have the advantage of both immediacy and continuation.

You can dump statements through a function and see its progression when run, without having to interact with it.

Throw in something like FirePHP (which allows dumping pretty much anything as a viewable/collapsible trace; other languages have similar tools) and much of the use case for a full-blown debugger goes away (though it's still incredibly powerful when needed).

So I use both.


Codebug + Xdebug, setting a breakpoint and getting a full REPL at that execution point is far more useful to me when I'm in PHP land :)


I tend to do that mostly for concurrent programs, as I want to see what happens without blocking the entire system like a debugger would.

For medium/large programs I'll often prefer a debugger because I can easily and quickly try to diagnose the problem without having to wait for a new build to complete with the added traces.


What debugger do you use for PHP? I've yet to find one I really like.


Xdebug is de facto the only debugger, AFAIK. The integration in PhpStorm is great.


http://codebugapp.com is a fantastic front-end for Xdebug. I get a lot of mileage out of this, not least because I only had to set it up once and now it works with whichever editor I'm trying out this week.


NuSphere is better than Xdebug; it allows more actions and is generally more polished.

But I use Xdebug and PhpStorm. I find it better than NetBeans. Search in NetBeans is good, but the whole thing has a weird GUI and is slow.

Best bet is to try a bunch.


Xdebug, and with PhpStorm 9 you get inline display of the values held in a variable, which you can expand with a click.

That is incredibly powerful (PyCharm has it as well).


Xdebug and PhpStorm is great.


Xdebug and NetBeans; tons of tutorials on how to set it up on Google/YouTube.


xdebug and Eclipse is good enough.

Configuration got a lot easier since the installation wizard has been introduced: http://xdebug.org/wizard.php


Codebug. Phenomenal Xdebug client.


Debugger++

Without a debugger you're a sitting duck!


I just crack open the source base with Emacs, and start writing stuff down.

I use a large format (8x11 inch) notebook and start going through the abstractions file by file, filling up pages with summaries of things. I'll often copy out the major classes with a summary of their methods, and arrows to reflect class relationships. If there's a database involved, understanding what's being stored is usually pretty crucial, so I'll copy out the record definitions and make notes about fields. Call graphs and event diagrams go here, too.

After identifying the important stuff, I read code, and make notes about what the core functions and methods are doing. Here, a very fast global search is your friend, and "where is this declared?" and "who calls this?" are best answered in seconds. A source-base-wide grep works okay, but tools like Visual Assist's global search work better; I want answers fast.

Why use pen and paper? I find that this manual process helps my memory, and I can rapidly flip around in summaries that I've written in my own hand and fill in my understanding quite quickly. Usually, after a week or so I never refer to the notes again, but the initial phase of boosting my short term memory with paper, global searches and "getting my hands to know the code" works pretty well.

Also, I try to get the code running and fix a bug (or add a small feature) and check the change in, day one. I get anxious if I've been in a new code base for more than a few days without doing this.


Totally agree with the point of pen/paper.

Something that complements that approach is in-code annotation. I've recently been trying out https://github.com/bastibe/annotate.el which is pretty sweet. Check it out!


Off topic, but anyone know what font and theme (it looks like the default theme but I'm not sure) are used in the project's screenshots?


The font is PragmataPro, which I am also using. Best font ever, but expensive.


annotate.el looks pretty interesting, thank you.


I work similar to this. I love writing things in notebooks. I also like making diagrams on draw.io and printing them out for reference/writing on.


I go as far as having a dedicated project notebook for big new projects; I write down everything I come across that I need to remember or need to question.

I've often been dropped into codebases where there is only a month to question the previous maintainer before all the business knowledge is lost as they move on to bigger and better things. So getting all the questions/queries down asap is the fastest step to get the undocumented business logic documented.

Even when you can't ask the questions I like to turn all the unknown unknowns into known unknowns. :)


Great points. Also, a list of the kinds of broad abstractions to look for might be useful:

* Each module and its purpose.

* Every global resource (whether global variables, message names, anything that the entire system has to deal with).

* The "style" that each coder used. Even "terrible" programmers tend to have a consistent approach, and understanding that approach can make code much less opaque.

I also like to page through documents more quickly than my conscious mind can follow, so as to get an unconscious feel for a code base. That might be just me.


There are a significant number of answers that may interest you on Stack Overflow. Specifically: http://stackoverflow.com/questions/215076/whats-the-best-way...

Two things I do to familiarize myself with a code base: First, look at how the data is stored. Particularly if it's using a database with well-named tables, I can get a rough idea of how the system works. From there I look at other data objects. Data is easier to understand than behavior.

The other is watching the initialization process of the application with a debugger or logger. Along those lines, if you're lucky (in my opinion) and the application uses dependency injection of some sort, you can look to see how the components are wired together. Generally there is an underlying framework to how the code pieces work together, and it tends to reveal itself in the initialization process if it's not self-evident.


Side rant:

I just cannot believe people praising unit testing. Fellow programmers, how exactly do you unit test a method/function which draws something on a canvas, for example? You assert that it doesn't break the code?!

I see some really talented people out there who write unit tests as proof that their code works without issues, that it's awesome and it cooks eggs and bacon, etc. They write such laughable tests you cannot even tell if they are joking or not. They test whether the properties/attributes they use in methods are set at various points in the setup routine, or whether some function is called after an event is triggered.

My point is this: unit testing can only cover such tiny, tiny scenarios, and mostly logic, that it is almost useless for understanding what is going on in the big picture. Take for example a Backbone application like the Media Manager in WordPress. Please tell me how somebody can even begin to unit test something like that.

Unit testing is a joke. And sometimes a massive time consuming joke with a fraction of a benefit considering the obvious limitation(s).


WebKit does it like this: https://www.webkit.org/quality/testwriting.html

Adobe's WebKit repository with layout tests: https://github.com/adobe/webkit/tree/master/LayoutTests

Example URLs from that repository:

https://github.com/adobe/webkit/blob/master/LayoutTests/css3...

https://github.com/adobe/webkit/blob/master/LayoutTests/css3...

https://github.com/adobe/webkit/blob/master/LayoutTests/css3...

Other browsers will have similar strategies.

I think this is essential for something with the complexity of a browser engine.


We've used an image comparison tool, which produces a pixel-wise diff against the expected image, exactly to verify these kinds of things (we were developing the rendering tool). In addition, unit tests combined with coverage tools allow you to find potential problems, crashes, etc. in your code. Different levels of testing are for different things; unit tests are just one piece of the equation.

Your point is just for some tiny tiny scenarios of the software you are working on.

You don't need to think about 'how could I write a unit test', you need to think about how could you improve the quality of the code, and unit tests are just one of your tools available to solve this problem.


It's pretty awesome that you've written such a tool (although I can only imagine how long it took to create and how it affected the project time frame).

From a web developer's mind: the cool thing is that the tool can be further developed and taken in new directions. For example, implement the capability to take snapshots of pages, see if their layout has changed, and notify the user of changes (pretty awesome for scraping).

I totally agree, unit testing is such a small cog in the wheel of software quality that it is truly a shame how something like this takes all the scene.


> Fellow programmers, how exactly do you unit test a method / function which draws something on the canvas for example?

We don't. Canvas drawing routines are hopefully unit-tested already by their authors. We do write unit tests for calculations and logic to make sure that the values passed to some canvas function are as expected.


Some unit testing might be a joke, but not all unit testing. If you have small units-of-work they should be tested, or at least testable.

Integration testing often makes a lot more sense, though.


I believe I'm in the minority, but I think unit tests are nearly universally worthless - the exceptions being those for well-defined APIs (e.g. math). Good unit tests must have an independent oracle of truth or else you aren't testing anything. As a practical matter you should only write tests for code that materially impacts the business (or you are just wasting everyone's time). Instead of writing regression or integration tests, which are hard (hence the need for testing), people absent-mindedly write unit tests and point at code coverage.


I'll need concrete examples of good/bad tests to have any idea what you're talking about at this point. You say they're worthless but then say they're good for well-defined APIs, which is precisely what unit tests are for.


This may or may not apply to you, since i work with Perl. Typically i'm in a situation where i'm supposed to improve on code written by developers with less time under their belt.

As such my first steps are:

1. tidy/beautify all the code in accordance with a common standard

2. read through all of it, while making the code clearer (split up if/elsif/else christmas trees, make functions smaller, replace for loops with list processing)

While doing that i add todo comments, which usually come with questions like "what the fuck is this?" and make myself tickets with future tasks to do to clean up the codebase.

By the end of it i've looked at everything once, got a whole bunch of stuff to do, and have at least a rough understanding of what it does.


Please don't take this as a criticism, but how long have you been programming? I'm asking because I used to have an opinion like this when I was just starting, but after a few years I realized that changing all of the code as the first thing is one of the worst things to do.


>> i work with Perl

>> user: Mithaldu

>> created: 2099 days ago

> how long have you been programming?


For pay since 2005. Do keep in mind that the code i am working with usually has some sort of test suite available, and that over time i have become very good at transforming code between different forms of expression without changing the effects it causes. (Excluding memory use and performance, which is not something one usually has to consider much in Perl.)


Ah, so long enough for the advice to be based on real experience.

I was really surprised to see it, because it's exactly the way I was learning C. I switched to Linux around the same time, so I'd take some abandoned DOS program that had source code available and port it to Linux. During the process, I'd read the entire source code, make sure I understood it and reformatted everything. A few years later, the original author of one of the programs released a new version and thanks to my reformatting, it was pretty much impossible to merge. After a few similar experiences at work, I have decided to always stick with the original style of any code I touch, because any unnecessary changes are just going to make life harder for me in the future.


Three things to keep in mind here:

1. Perl is MUCH more concise than C, since we have institutionalized code sharing (see CPAN), whereas most C devs i know (and maybe i don't know very good ones) seem to at most reuse code others wrote by way of ctrl+c/ctrl+v.

2. Perl's reformatting tools are automatic. I just have a little config file that says how long my lines are supposed to be and where i'd like the spaces on my parens (inside, between the parens and arguments) and then i hit ctrl+e and boom it's done. If i need to do it to many files, find + perltidy. In your case i would've just taken his new version, automatically formatted everything in less than 5 minutes, and merged on top of that.

3. When i do this, i'm doing it with team lead consent and team consensus, in an authority position, not with random code maintained by people i never even talked to. :)


The big problem (apart from the fact that ill-formatted codebases have often much more serious problems...) is that reformatting is an excellent way of messing up your VCS' diff ability, which is extremely precious in trying to understand why things have been done the way they have.


Neither point is possible with large codebases.

It will take months merely to tidy the code, with the effect of making the rest of the team hate you for committing thousands of files with superficial changes. It's much more productive for everyone to simply adapt to the existing style guidelines.

Reading all of the code is only an option for the smallest of codebases. Reading code will only get you so far before you get lost in the complexity of how all the parts interact with each other.

A better approach would be to limit yourself to a subset of the codebase and start poking around with a debugger while the system is running. Then you can gradually work your way through the codebase starting from the core functionality.


If i take longer than a week to do it, then other measures are called for, like, as you mention, subsets. However you seem to underestimate just how much functionality can be implemented in how little code with Perl.


Oh I wasn't talking about Perl specifically but about languages in general. I agree that Perl projects tend to stay small enough to be quickly understood and refactored. In fact I don't remember seeing a Perl script more than a few thousand lines long.

However, don't underestimate how hard it can be to understand your own Perl code when you've been away from it for 6 months :)


> However, don't underestimate how hard it can be to understand your own Perl code when you've been away from it for 6 months :)

For anyone who cares about their craft and has a certain amount of experience with Perl this is entirely untrue. I'd rather people not repeat this tired old meme. :(

Thank you! :)



Sorry, you're absolutely right, I've seen modern perl and it was actually quite clean. I didn't mean to offend by saying this!


Do you follow this process even on code without tests? How do you make sure you're not introducing bugs in the process? Or you don't commit your changes afterwards?


It works even on code without tests, simply due to lots of experience. It helps that i also have a lot of tools available in terms of realtime syntax and sanity checkers. And sometimes, if the code is sufficiently insane, i'll write tests to make sure.


A while back I worked with a team that had just brought on some new developers. Some of the developers were eager and would learn without human intervention. Others would require and ask for some mentoring.

Neither way of learning is right or wrong, and I appreciated both groups, but one thing I do remember was one guy who did exactly what you did. It was highly irritating as a project lead to see relatively random files (given whatever sprint we were on) that were considered stable show up in the source control change list without any consideration or discussion. I would have to waste time looking at his changes to see if they were ok.


Understandable. I'm also operating quite differently. Typically i am called into teams who are floundering already. So what i do is create one big commit, with the goal of making the code easier to work with and learning about the code base. Never do i just change random files, and never do i do it without the team lead having agreed and some sort of consensus from the rest of the team.


Well, that doesn't scale at all. I don't remember the last time I've worked on a project where even skimming all of the code would be possible in a reasonable amount of time, much less actually reading it and refactoring it.

That said, it probably does scale to the OP's 8k line code base.

Also, running the code through a formatter and refactoring it all right off the bat is a sure fire way to piss off everybody else working on the project.

In any case, my 2 cents for the OP is to not bother trying to learn the whole codebase, but instead focus on just the areas you want to enhance. For example, in the case of the new data source, find out how the existing data sources are implemented, and use them as examples for adding a new one.


I studied a lot of people doing this as part of my PhD. The thing is that there are not many answers that work well in a lot of situations. Given that, though, my suggestion is to iterate on developing three views of the code:

1. The Mile High View: A layered architectural diagram can be really helpful for knowing how the main concepts in a project are related to one another.

2. The Core: Try to figure out how the code works with regard to these main concepts. Box-and-arrow diagrams on paper work really well.

3. Key Use Cases: I would suggest tracing at least one key use case for your app.


I usually work on more traditional command line applications and daemons so my approach might be a little different to a web developer.

I always start by gauging how much source code there is and how it's structured. The *nix utility "tree" and the source code line counter "cloc" are usually the first 2 things I run on a codebase. This tells me what languages the application uses, how much of each, how well commented it is, and where those files are.
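If `tree` and `cloc` aren't installed, plain `find` and `wc` get you most of the way. A minimal sketch of that survey against a toy project (all file and directory names here are invented):

```shell
# Toy repo standing in for a real project, so the survey commands
# have something to chew on.
mkdir -p demo/src demo/plugins
printf 'var a = 1;\nvar b = 2;\n' > demo/src/core.js
printf 'var c = 3;\n' > demo/plugins/ext.js

# Where does the code live, and how much of it is there?
find demo -type d                             # directory layout
find demo -name '*.js' | wc -l                # number of source files
cat demo/src/core.js demo/plugins/ext.js | wc -l   # total line count
```

A language with many files but few lines each often hints at a plugin architecture; one giant file (as with SlickGrid's slick.grid.js) hints at a monolithic core.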

The next thing I usually do is find the entry point of the program. In my case this is usually an executable that calls into the core of the library and sets up the initial application state and starts the core loop and routine that does the guts of the work.

Once I have found said core routine, I try to get a grasp of what the state machine of the program looks like. If it's a complicated program this step takes quite a while, but it is very important for gaining an intuitive understanding of how to either add new features or fix bugs. I like to use pen and paper to help me explore this part, as I often have to backtrack over source files and re-evaluate what portions mean.

Once I have what I think is the state machine worked out, I like to understand how the program takes input or is configured. In the case of a daemon that often means understanding how configuration files are loaded and how the configuration is represented in memory. Important to cover here is how default values are handled, etc. I actually prioritise this over exploring the core loop's ancillary functions (the bits that do the "real" work), as I find it hard to progress to that stage without understanding how the initial state is set up.

Which brings us to said "real" work. Hanging off of the core loop will be all the functions/modules that are called to do the various parts of the program's function. By this time you should already know what these do even if you don't know how they work. Because you already have a good high-level understanding at this point, you can pick and choose which modules you need to cover and when to cover them.
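The "find the entry point, then follow the core routine" steps above can be approximated with plain grep before reaching for a debugger. A sketch on an invented two-file C project:

```shell
# Invented project layout: main() is the conventional entry point
# for a command-line program or daemon.
mkdir -p proj/src
printf 'int main(void) { run_loop(); return 0; }\n' > proj/src/main.c
printf 'void run_loop(void) { /* core loop */ }\n'  > proj/src/loop.c

# Locate the entry point, then follow calls outward from it.
grep -rn "int main" proj/src
grep -rn "run_loop" proj/src   # every caller plus the definition
```

Each symbol found in the entry point becomes the next grep, which is the manual version of walking the call graph.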


Whatever your IDE/editor of choice is, I think having these three functions is critical to learning a new codebase, or even developing for that matter:

1. Go to definition

2. Find all references

3. Navigate back

This allows you to go down any code rabbit hole, figure stuff out, then get back to where you were. If you can't do those things it will take much longer to understand how things are interconnected.
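Without IDE support, grep approximates "find all references", and filtering on the definition form approximates "go to definition" (ctags/etags do this properly for vim and emacs). A sketch using an invented symbol name:

```shell
# Invented mini-codebase: one definition, two call sites.
mkdir -p lib
printf 'function getCellNode() {}\n'        > lib/grid.js
printf 'getCellNode();\ngetCellNode();\n'   > lib/render.js

# "Find all references": every occurrence of the symbol.
grep -rn "getCellNode" lib
# "Go to definition": filter for the definition form only.
grep -rn "function getCellNode" lib
```

For "navigate back", a real tags setup (e.g. `ctags -R .` plus CTRL-] / CTRL-T in vim) maintains the jump stack for you.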


Absolutely. In Emacs, I depend heavily on etags and the occasional rgrep to find my way around a fairly large project written mostly in C.

I haven't dealt with a JavaScript project large enough that I've bothered setting up tags, but I imagine something similar is available.


I start with running the tests if there are any. Typically peeling layers of the onion starting with the boundary. If there are no tests, then I'll try to write them. Then running tests in debug mode helps step through the code. If I have the luxury of asking questions to an engineer experienced with the codebase, I request a high level whiteboarding session all the while being cognizant of their time.

Some others have mentioned recency/touchTime as another signal. For large complex codebases, that may or may not always work.


When you think you understand something write a test and test your belief. If the test passes then both your knowledge and the code base are better for it. If the test fails then rewrite the test to the failure and write another test. Again you will know more and the code base will be better.

Good luck.


I feel your comment should be the top one, but I disagree with this bit:

> the code base will be better.

I'd change "will be" to "might get," because this is true if you're doing unit tests that the code base can use. But sometimes you do characterization tests, which are not worth keeping around. Or you might build a couple of variations on "hello world" with the unfamiliar code base, just to be sure it works the way you think it does.


When writing a test exacerbates a too-many-tests problem, that's both a rare and a good problem to have, because reading and running such tests is a more expeditious route to understanding than writing tests blind.


What if you wrote a test that passes at the time of writing because of how something is implemented at that time, but it's not actually an invariant?


Then you will learn that later when the test fails. That is better because the reason for writing the test was your belief that it was invariant. Without the test you are more likely to continue holding the mistaken belief.


Sure, but that leaves an incorrect test for other developers to worry about when it fails.


I agree with what many others on here have said. It's also a personal thing. In general I like to try to force myself to learn only the minimum required to do what I need to do. If that philosophy sounds good to you, I would recommend taking the buggy version of frozen columns and try to fix the bugs. You may learn that the bugs are structural and you need to implement it differently, or you might be able to fix it with minimal changes. You will certainly get an understanding of the parts of slickgrid that you need to interact with to add this feature.

For the ajax data source thing, I would try to modify or extend the existing data source code to add the behavior you are looking for. As you mess around with it trying to figure out what you need to change, you will encounter the areas of the code that you need to understand.

With this sort of strategy you can avoid having to fully understand all the code while still being able to modify it. You might end up implementing stuff in a way which is not the best, but you will probably be able to implement it faster. It's the classic technical debt dilemma: understanding the complete codebase will allow you to design features that fit in better and are easier to maintain and enhance, but it will take a lot longer than just hacking something together that works.


I work a lot with huge legacy codebases in C/C++. Here is some advice:

1. Be sure that you can compile and run the program

2. Have good tools to navigate around the code (I use git grep mostly)

3. Most apps contain some user or service interaction - try to pick some easy bit (like a request for capabilities or some simple operation) and follow it to the end. You don't need a debugger for this - grep/git grep is enough; these simple tools will force you to understand the codebase deeply.

4. Sometimes writing UML diagrams works -

- Draw the diagrams (class diagrams, sequence diagrams) of the current state of things

- Draw the diagrams with a proposal of how you would like to change

5. If it is possible, use a debugger, start with the main() function.
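Point 3 - following one simple operation from entry to exit with git grep - might look like this on a toy repository (the file and function names are made up):

```shell
# Invented two-file "service": a declared handler that calls a helper.
git init -q svc
printf 'void handle_capabilities(void);\n' > svc/proto.h
printf 'void handle_capabilities(void) { reply_caps(); }\nvoid reply_caps(void) { }\n' > svc/proto.c
git -C svc add .
git -C svc -c user.email=demo@example.com -c user.name=demo commit -qm "init"

# Start from the request's name and walk the call chain by hand.
git -C svc grep -n "handle_capabilities"
git -C svc grep -n "reply_caps"
```

`git grep` only searches tracked files, which keeps build artifacts and vendored code out of the results - one reason it beats plain grep on large trees.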


> 4. Sometimes writing UML diagrams works

I found myself doing this more often and find it very useful. I've been using Freemind and it seems to do the trick. http://freemind.sourceforge.net/wiki/index.php/Main_Page


Yes, this could work for personal things. But often you need to represent your understanding of the current state of affairs and propose your ideas, and UML is a good tool for that - everybody understands it or can learn it in half an hour.


I wish I had a better answer, but I honestly just stumble around it. I typically start by trying to understand how they structured their files, then I'll start diving into the code. I wouldn't try to "understand" it completely. Just look over it until you feel comfortable enough to try to make some modifications.

Michael's code looks clean and well organized. Shouldn't be terribly difficult for someone proficient at JS.


Glad I'm not the only one who stumbles around :)

Another thing I do is try and replicate one core feature of the thing I'm trying to understand. Like others have suggested, I like using debuggers. Recently, I wondered how debuggers work. So I built the following to find out.

https://github.com/amorphid/rails-debugger_example


My approach is to break stuff. If I can break it (and I am good at finding bugs, so I usually can), then I have a narrow focus, which keeps me from getting "lost" in the code base.

Once I've found and fixed a few things, or if the code base is particularly small or clean that I can't find bugs to fix, I'll set about hacking in the feature I'd like.

I usually start by doing it in the most hacky way possible. That sounds like a bad approach but it narrows the search of how to implement it and means I'm not constraining myself to fit the code base that I don't yet appreciate.

In hacking that feature I'll often break a few things through my carelessness. In then trying to alter my hacked approach so it no longer breaks stuff I'll become more aware of the wider code base from the point of view of my initial narrow focus. This lets me build up the mental model.

Eventually I'll be comfortable enough I can re-write the feature in a way more consistent with the wider code base.

I don't normally start by trying to "read all the code", because that guarantees I won't understand much of it (I'm not quick at picking up function from code). I might have a skim if it is well organised, but I find that the "better"-written a lot of stuff is, the harder it is to grok what it is actually doing from reading it. To me, reading good code is often like trying to read the FizzBuzz Enterprise Edition[1].

I've worked on many legacy systems: I was last year implementing new features into a VB6 code base, this year (at a different job) I am helping migrate from asp webforms to a more modern system. I've found that starting with trying to fix an issue to be the best way to dive into the code base.

Use good source control so you're never "worried" about changing anything or worrying that you might lose your current state. Commit early, commit often, even when "playing around".

[1] https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...


I tend to use a hybrid approach, but in general I try to identify the entry point of the code, which will lead me to the core data structures and possibly event loops that act as a central hub for any other code that is called. That is, I look for some kind of dispatch pattern that integrates the rest of the system, routing and calling different code when needed. Once you identify this "hub" you will have a good mental model of the system and its high-level components. From there you can delve into different subsystems and slowly tweak and make changes to be sure a code path does what you conjecture it does. Using a debugger is helpful at certain points to explore the depths of the code. When you can get a small tweak working as expected, you probably have a decent starting model of the code base that you can easily add to.


Another thing that is helpful, especially if you don't even have knowledge of the problem domain of the codebase: Write a glossary.

As you read the code and encounter terms/words you don't know, write them down. Try to explain what they mean and how they relate to other terms. Make it a hyperlinked document (markdown #links plus headings on github works pretty well), that way you can constantly refresh your memory of previous items while writing

Items in the glossary can range from class names / function names to datatype names to common prefixes to parts of the file names (what is `core`? what belongs there?)

Bonus: parts of the end result can be contributed back to the project as documentation.


Some good pointers and links here, surprisingly they miss both my favourite approaches.

1. If it's on Github, find an issue that seems up your alley and check the commits against it. Or the commit log in general for some interesting commits. I often use this approach to guide other devs to implement a new feature using nothing more than a previous commit or issue as a reference and starting point.

2. Unit tests are a great way to get jump-started. They function as a comprehensive examples reference - having both simple and complex examples and workflows. Not only will they contain API examples, but they will also let you experiment with the library, using the unit test code as a sandbox.
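Mining the history for a feature or issue, as in point 1, can be as simple as the following (the repository and commit messages here are invented; "frozen" stands in for whatever feature you're chasing):

```shell
# Invented history: one commit implements the feature, one is noise.
git init -q hist
git -C hist -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "Add frozen columns (#42)"
git -C hist -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "Fix scroll jitter"

# Find the commits behind a feature or issue number, then read them.
git -C hist log --oneline -i --grep="frozen"
git -C hist log --oneline --grep="#42"
```

Once you have a hash, `git show <hash>` gives you the full diff - effectively a worked example of how a past contributor threaded a feature through the codebase.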


For clientside JavaScript, one useful way in is to run the Chrome profiler on it. That will produce a treeview of the calling hierarchy, and give you an idea of what are the code's 'hotspots' - the functions that are called from everywhere, or the functionality which dispatches everything.

This can be especially useful for event driven code (looks like SlickGrid is jQuery-based, so that definitely applies here); you can start a recording profile, carry out the action you're interested in, then stop recording, and you can then find out exactly which anonymous function is handling that particular click or scroll or drag.


This is usually how I do it for libraries:

* Read the README.

* Install it and start using it with a couple of sample cases. That will give you an idea of what it does.

* Read the test suite. This will give you a better idea of what the library does.

* Look at the directory structure. This should tell you where things are.

* Start reading the core files.

* Start looking at open issues. Try to solve one by adding a test and changing the code.

* Submit a pull request.


I think a top-down approach is pretty much the only way to do it: Start at a high level of abstraction: packages, modules, namespaces, etc and their relations. Pick one that seems related to some core functionality or central to the change you intend to make and dive deeper: interfaces and data structures within that unit and possibly other related units they depend on. Ideally, up to this point you shouldn't even have to worry about function definitions and algorithms, just declarations, types and relations.

While static typing helps a lot with this kind of exploration and navigation, I don't know of any IDEs or other tooling for any language that would really help you with it. Sure, you can probably generate UMLs or something, but it usually requires some additional tool and the output is pretty static. You can't just zoom in from a package-level view to an interface-level and then keep zooming until you are eventually shown line-by-line implementation of a specific function.

I've been thinking about this lately, and I've come to the conclusion that the way we think and reason about code is pretty far from the way our tools present it to us. I tend to think in terms of various levels of abstraction and relations between units, yet the tools just show me walls of text in some file system structure (that may or may not mirror the abstractions) and hardly any relationships.


Well, I'm not very good at this either, but here's what I do. I usually work on modular projects where there are hundreds of files in the project. I usually skip directly to locating the file where I have to make changes (using a lot of grep: grep for function and object definitions, grep for usage patterns, grep to check how to implement something). Thus, I learn about the codebase as I go along.

Sure, this is not the best practice, and unsuitable for many, but it's what works for me.


If this is your modus operandi your life will be much improved with Ack. It's specifically designed for searching codebases.


I think you should also check out cscope, it can search for struct definitions and function calls easily (better than just grepping text or using ctags).


Thanks. I'm looking at it right now, and it looks really cool.


Amen. Once you've had ack, you never go back.


Isn't ag a better option these days?


My typical workflow for checking out new open source projects:

- find . -type f

- find . -name \*.ext | wc -l (get an idea of complexity)

- git log (is this thing maintained?)

- find . -name \*.ext | xargs ctags

- find main entry points, depending on platform and language

- vim with NerdTree enabled

- CTRL-], CTRL-T to jump to browse tags in vim

Generally a lot of find, grep and vim gets me started.
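One more git helper in the same spirit - ranking files by how often they've been changed, which tends to point at the hot spots of a codebase - might look like this (demonstrated on a throwaway repository with invented file names):

```shell
# Invented history: core.js is touched twice, util.js once.
git init -q stats
printf 'a\n' > stats/core.js
git -C stats add .
git -C stats -c user.email=demo@example.com -c user.name=demo commit -qm "first"
printf 'b\n' >> stats/core.js
printf 'c\n' > stats/util.js
git -C stats add .
git -C stats -c user.email=demo@example.com -c user.name=demo commit -qm "second"

# Rank files by how often they've been touched, hottest first.
git -C stats log --name-only --pretty=format: | sed '/^$/d' | sort | uniq -c | sort -rn
```

The most frequently edited files are usually either the core of the system or its biggest source of bugs - both worth reading early.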


Get it running locally and then see what happens when you delete some stuff, especially stuff that you don't understand when reading through the code.


I work on AOSP, which is a fairly large code base. During the early years, documentation on the internals of Android was close to non-existent. Plenty of tutorials in Mandarin/Cantonese, but not many in English.

A good way to get the hang of the code base was to read it (usually using a tool like sourcegraph [0], pfff [1], OpenGrok [2], doxygen [3], or javadocs [4]). A lot of people have argued that code is not to be treated like literature [5], but in this case there was no choice.

The second step was to see whether my assumptions about what the code does were correct. This is usually achieved by adding log statements, writing sample apps, and debugging in general.

Repeat the steps above, over and over again.

Checklist:

1. No matter what you do, you absolutely need to document everything you understand / misunderstand about the code base.

2. Never underestimate value of having a different pair of eyes look at code you have hard time reasoning about.

3. Be in constant search for resources (like books and blogs) available on the code / topic of your interest. You'd learn an amazing amount by reading through other people's analysis. Stack Overflow is a great start. Heck, you can even ask well-thought-out questions on Quora/Stack Overflow.

4. Hang out on related IRC channels / community mailing lists. For things written in esoteric languages such as OCaml, I found these to be pretty helpful.

5. You could blog about it, share the information you know over email lists, or set up wikis; people who know better will correct you. It's a win-win.

Good luck.

[0] http://sourcegraph.com/

[1] https://github.com/facebook/pfff

[2] https://opengrok.github.io/OpenGrok/

[3] http://www.stack.nl/~dimitri/doxygen/

[4] http://www.oracle.com/technetwork/articles/java/index-jsp-13...

[5] http://www.gigamonkeys.com/code-reading/


You got a little bit lucky with this project because there's a decently built-out test suite. I would start by digesting the tests, because if they're good, you'll be able to see the mechanics of how the exposed interfaces in the code work, and this should also give you a good idea of whether the changes you're making are breaking the expected workflow.

From my experience, there are really two ways that learning a new codebase can happen. One is that there's an existing test suite that's fairly comprehensive, and you can learn a lot by examining the tests, making changes to add features / make bug fixes, and then validate that work by rerunning the tests and adding new ones. That's really a great place to be as someone unfamiliar with a new codebase. The other is that there are no tests, and you inevitably need to rely on people familiar with the code, and make peace with the idea that you're going to write bad code that breaks things as you learn the depth of how the project works.


I'm working with somebody else's code more often than writing something new from scratch. It takes some time to get used to that, but it's very far from the hardest tasks developers face.

A couple of things that I typically do:

- Start with a fully working state, i.e. setup your environment, make sure tests (if there are any) are passing. If you can't get things to work properly, that's your first issue to investigate and fix.

- Don't try to understand all of the code at once. You don't need it yet. I'm assuming you want to take over the project for a particular issue. So just focus on that and ignore the rest of the code. If you ask any senior developer about something in their project, there is a great chance they will not remember the exact details, but know where in the code to look at. Aim to get at that level, not memorizing how everything works on the lowest level.

- Don't make any changes to the code that you don't understand. I have a recent example of this. Yesterday I was trying to find a bug in the Phoenix database, which was failing to start after an upgrade. I had never seen the code in my life. After some debugging I realized it was doing something with an empty string that shouldn't be empty. The obvious "solution" is to add a check for an empty string and be done. Don't do that. Understand exactly why the problem is happening and only make a change like that after you are sure of all the implications. This has two effects: you are not introducing new bugs, and you are learning about the codebase. In the end, the fix from my example was just a simple "if", but without understanding how it was ending up with an empty string, I might have caused more problems than I fixed.

- Use the VCS a lot when figuring out why something is done the way it's done. Use "blame" to see when things have been changed, read through the logs, etc. This is one of the main reasons why I don't like people rebasing/squashing their commits before merging. There is so much information they are throwing away that way.

- Adopt the coding style of the existing code. Don't try to push your style, either by having inconsistent style in different parts of the code or re-formatting everything. It's just not worth it.

- Don't be afraid to change things that need changing. There is nothing worse than making a copy of some module, calling it v2, and then having to maintain two versions. If you are afraid to make a change in the existing code, familiarize yourself with that part of the code first.


I probably won't say anything new here. For the last five years, I have done the following to get my feet wet with a new project (some projects I've worked with contain more than two million lines of code):

1. Just make sure I can build the project;

2. Play around with services/application (just run, send some requests, get response);

3. Pick up simplest case (for example, some request/response);

4. Find places for breakpoints connected with this simplest case (for example, code that is hit when I send the request) and set them in a debugger. Usually, I find where to put a breakpoint by just searching for a keyword associated with my request;

5. Play around with these breakpoints while performing the simplest case (for example, sending a request) and try to work out the call graph;

6. Try to change code and see what happens;

After doing this for several days or weeks, I become more and more familiar with the project.
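The keyword search in step 4 can be as simple as `grep -rn`; a toy sketch (the source tree and function names below are invented):

```shell
# Fake a tiny source tree to search, standing in for a real project.
set -e
src=$(mktemp -d)
cat > "$src/handler.js" <<'EOF'
function handleLogin(req) { return authorize(req.user); }
EOF
cat > "$src/auth.js" <<'EOF'
function authorize(user) { return user != null; }
EOF

# The request I'm playing with mentions "login" -- find that keyword first,
# with file names and line numbers (candidate breakpoint locations):
grep -rn -i "login" "$src"

# Then chase the call one level down: where is authorize() defined?
grep -rn "function authorize" "$src"
```

Each hit is a `file:line` pair you can paste straight into a debugger as a breakpoint.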


A very simple method that helps me is to make sure I tackle a new code base on a large monitor, vertically oriented, with small font size. Add to that a pane that shows the file/class structure. Seeing more at once helps ground me in the types of interactions in the code and the code landscape.


I take it out for dinner and drinks. Spend some time getting know about it and where it comes from and what it does for a living. Then after we're a few cocktails in we get all philosophical. Really start asking the hard questions like, "Why do I even exist? Is any of this real or is it all some weird virtual world?"

We become fast friends and feel like we really understand each other.

But days pass, and each encounter feels less magical. It's almost like we have nothing in common. Like we're from two completely different worlds. One of us is stuck in the past, and one is ambitious and excited about the future.

After a while we don't really speak to each other anymore, and after some pretty ugly fights at work that get too personal... I rewrite it.


I've worked in a lot of legacy code bases. Here's my approach:

* Skim around to get a general idea of what components are involved.

* Try to understand that one module/class that keeps getting used a lot or is really important.

* Mentally trace through that code, as if I'm a debugger.

* Most importantly, write down my discoveries/understanding as I go to help me retain it.

* Re-skim with my new understanding and/or reorganize the code to be more concise or simpler. Depending on how ambitious you are, you might try to keep these changes. But with legacy code, it typically breaks as a result.

Every code base takes time to digest all the information. Sure the information passed your eyes, but is it committed to memory?


Drawings will help tremendously. Extract the big masses, their respective interfaces to each others and the means through which they communicate. This will help build a mental map of the code and reduce the cognitive load needed to understand each separate part.


If the project does not have circular dependencies, it can be automatically drawn from the code like this:

http://lonnie.io/gostd/dagvis/

or this:

http://8k.lonnie.io/


What did you use to generate the graphs?


This answer is going to be rather unorthodox and might get downvoted, but this is how I do it:

I just skim through all the sources; then, somehow, I am able to point to the approximate file and line of code where a specific question might be answered.

This might sound "out there", but I realized during college that I had the ability to recall the approximate location of specific information I needed from a textbook if I just skimmed through the whole book at the start of the semester.

For years I did this out of intuition, then about 10 years ago I took a course named "photoreading" and to my surprise they were teaching my "ability" but with clear steps so anybody could use it effectively.


This is underrated - folks are scared to just read the whole thing. Most of our thinking isn't conscious. The sooner you have an impression of the whole code, the sooner you can start having insights. I read the whole thing, every time I start with a new code base that I'm going to be spending time with.


I personally like to reverse engineer functions within a certain codebase to better understand what is happening.

For example, I would start by looking up a basic example of that codebase and, for each of the function calls, go through the files and see what is happening. This gives me an idea of how the code base is written and how it works. It also gives a clear understanding of the level of separation/specificity of the different functions.

Disclaimer: not very experienced, so there might be better ways of familiarising oneself with a new codebase; this is just one way of doing it, and it has worked for me in the past.


I generally just start by fixing small bugs in different areas of the system. I find that debugging various areas of the system help me understand them better and allow me to start forming a cohesive picture in my mind.


The most critical step is to get the lib into your workflow, preferably with build-introspect-debug capabilities. This increases the upfront time to start, but leads to much quicker "code understanding", in my opinion.

TL;DR: Start with the minimum exposed surface area of the project (the API) and dig through those functions first. Definitely know the initialization sequences the library needs.

This is my approach concerning JS projects or for dealing with other peoples code in general.

First, I make a mental model of what I want to do. !important. Then I write the smallest wrapper needed to start fleshing out the points where "separation-of-concern" happens.

At this point I should have an idea of what the other person's library exposes as an API. I also should have an idea of what can be done with an unmodified library, and what would need patching.

Then comes monkey-patching the lib at the individual function level, with a healthy dose of TODO markers and NotImplemented method signatures.

By this point I should have a good picture of what goes on in the library apart from what gets exposed and would probably have forked a branch by now.

This strategy has been useful not just for JS projects but for bigger codebases of Java/Scala libraries like Lucene Core/Solr or the Play framework, Django in the Python realm, and, with limited success, for research code releases like Stanford Core NLP.


I like to use interactive debuggers like gdb (for C) or pdb (for python) for that.

You first have to localize a region (function) you want to study, then you catch one of its executions with a breakpoint, or a conditional breakpoint.

Then, you inspect:

- the callstack: the conditions under which the function was called

- the parameters / local variables

- the subfunctions: in both tools, you can manually call any (reachable) function, try different parameter values, and check the result. Pay attention, though, to the side effects!
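That inspection loop can even be scripted with pdb's `-c` flag, which queues debugger commands up front (the demo script below is invented; gdb's `-ex` flag works similarly):

```shell
# Sketch: a non-interactive pdb session -- break inside a function, dump the
# callstack and parameters, then let the program run to completion.
set -e
dir=$(mktemp -d)
cat > "$dir/demo.py" <<'EOF'
def sub(a, b):
    return a - b

def compute(x):
    return sub(x, 3) * 2

print(compute(10))
EOF

# break 2 = breakpoint on "return a - b"; where = show the callstack;
# p (a, b) = inspect the parameters at the stop.
out=$(python3 -m pdb \
    -c 'break 2' \
    -c 'continue' \
    -c 'where' \
    -c 'p (a, b)' \
    -c 'continue' \
    -c 'quit' \
    "$dir/demo.py" < /dev/null)
echo "$out"
```

The same commands typed interactively give you time to poke at locals and call subfunctions by hand between the `continue`s.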


I'm currently involved in a project (with 3 others) for my MSc in Computer Science in which we aim to take Google Native Client (a browser extension for Chrome which sandboxes untrusted, native code, downloaded from the web and executed inside your browser) and use it on the server-side to sandbox an HTTP server. Since almost all documentation is catered towards developers who wish to write untrusted native code that runs inside the browser, or browser vendors (this part of the documentation is quite incomplete) who wish to include Native Client in their browser, we're pretty much stuck in the dark.

First, we read the Native Client papers (http://www.chromium.org/nativeclient/reference/research-pape...) to understand how Native Client sandboxes untrusted code. We then looked at the tests in the Native Client source repository to see how to run untrusted code within a Unix process. We're yet to be able to debug executables via GDB for reasons we don't quite understand - so at present we:

1. Set NaClVerbosity to 10 and trace the system calls and functions invoked in the tests

2. Run "grep -r" in the src folder to find the source files for each of the functions invoked, then read and understand the code for each

3. Insert our own calls to NaClLog in the source code to read the state of variables and to validate our hypotheses about paths of execution within Native Client

For example, just this afternoon we found out how to send data via inter-module communication instantiated from the trusted code to the untrusted code. We first thought this wasn't possible - and that communication had to be initiated from the untrusted code, handled in the form of a callback function in the trusted code. However it simply turned out we had set the headers incorrectly in that the first four bytes of the header should be 0xd3c0de01. What's crazy is that we haven't yet understood what these bytes mean - so we're back in the Native Client source code to try and see why it works.

This probably sounds like a rant about Native Client and the Native Client developers. However, the complete opposite is true. The folks on the Native Client Discuss forum have been very helpful and have been more than happy to answer our questions. Quick shoutout to mseaborn: thank you for your help!!!


Build it, if there is something to build. Scripting languages usually don't have builds, but JS minification and dependency installation can count as one. Find and read the code paths that perform some recognizable action. Run the tests, read them. Add a new feature with tests, or pick an open issue and fix it. You're going to have to debug something, and that will give you more insight into the inner workings of the code.


I see a lot of comments talking about code that is in a repository. And that is great, if you have it available. There have been many, many times where our team is handed an application that is broken (or has a bug) and asked to fix it. In many, many of those cases we don't have access to the original repository, or there wasn't one.

We generally approach it with heavy customer/owner involvement at first. We need to know what the application's intended purpose is. It is sort of like a lightning BA session. We get what the application should do, and what it isn't doing properly, out of this session (and more importantly, what it should be doing instead).

Our first step: get it into a repo.

Now that we have an understanding of what the application's intended purpose is, we can dive into the code. We don't have any analysis tools (but if there are some that people could recommend, I'm all ears) outside of our IDE (Visual Studio). We generally look for the last-modified date as an indicator of what needed work most recently. Of course, we don't have file history so we don't know exactly what changed, but it gives us a rough idea of what was worked on and when.
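The last-modified heuristic can be scripted when there is no repo history to lean on; a small sketch with invented file names (the recursive variant assumes GNU find):

```shell
# Fake two source files with different modification times, standing in for a
# codebase handed over without version control.
set -e
app=$(mktemp -d)
touch -t 201401010000 "$app/old_module.cs"   # backdated: untouched for years
touch "$app/recently_patched.cs"             # modified just now

# Most recently modified first -- a rough proxy for "what was worked on last":
ls -lt "$app"

# Recursively, GNU find can rank every file by mtime:
find "$app" -type f -printf '%T@ %p\n' 2>/dev/null | sort -rn
```

It's a blunt instrument (deploy tooling can rewrite mtimes wholesale), but on an untouched handover it points you at the hot spots quickly.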

Next we usually try and use the application in our development environment. We chase each action a user takes in the code to determine what is the core/central part of the application. After that, we try to determine the cause of the problem (and while we are at it, we generally do a security review of the code).

It takes time, and is painstakingly nuanced and very boring. But I'm not sure what other options we have in such cases. As I said, I'm all ears as to what others might do in these situations.


The first thing I do is try to get a handle on the libraries it pulls in (maybe spend a day just going through the high-level readme material for each one). That will usually tell me where to start looking for the entry points where I might want to start modifying things. After that, I give myself a series of small functionality changes to implement, kind of like capturing a bunch of little flags. After doing that for a bit I usually have a decent idea of how things work, and it's easier to go forward, at which point I can dig into the relevant parts of the codebase with more confidence.

The first few mods are inevitably disgusting hacks, so don't pick anything you want to keep for your first couple of goals. It is pretty easy to go back and do them right once you've got your head around the rest of the project if you do end up wanting to keep them though.

I've used this method on some decently large C++ and javascript projects (around 100k-200k lines) and it works pretty well for me. I don't learn very well by just reading the code, but doing the little mods seems to make it stick.


If it's not obvious just by looking at how the directories are structured, and files are named, generally I find that everything is (or should be) relatively easy to understand if you start from the perspective of a user.

1) Read docs for how to USE the library if they exist

2) Review example code that describe how a person would use the library to accomplish tasks.

3) In order to start diving in, find a specific example that does something interesting, then hop in from there. Read the code within the methods / functions the user calls, then the functions / methods called inside those, etc.

4) As you dig deeper you may start finding that you understand, or you'll start building up your own hypotheses like "If I change X to Y in this function then something different should happen when I call it". Try it out, and see if your hypothesis is correct.

After a few iterations of doing something like this you'll probably start getting an idea of how the code is structured and where you'd need to go in order to make the changes you'd like to make, or add the features you want to add.


I pick a function or an outcome and type out the pseudocode stack traces leading to that in notepad.

I include function names and the names of the variables passed as parameters, but no braces or other syntax. I almost always omit branches/variable decls/error checking. I include all interesting function calls along the path, but omit any branches/function bodies that lead off the desired path. I inline callbacks as function calls with additional notation. If the process has separate steps that aren't a single call/callback tree, I start a new tree with the note "then later..."

To do this, I have to start from the line of code that enacts the outcome and determine the backtrace with a combo of debugger stack traces and examining the code for branches/callbacks of interest.

But, when it's complete, I'll have the start-to-finish process of some complicated task in the code --usually on a single screen of text. It's a tremendously better use of my short term memory to scan over that than to constantly bounce around the actual code base.


When I have a new code base that I'm unfamiliar with and need to understand quickly, I'll go line-by-line and add comments about what I believe to be the intended behavior. As I gain more knowledge I'll update the comments. For me, explaining something I've learned helps me commit it to memory better, and makes sure I really did comprehend what I just read.


For a large C/C++ code base, I use an editor called SourceInsight. This is the most invaluable tool for navigating code I've come across in my 3-year career as a software developer. I work at a very large software company, and there are several code bases running into millions of lines of C/C++ code. My previous team had 60,000+ files, with the largest file being about 12k LOC.

If you have access to logs from a production service/component, I find TextAnalyzer.net invaluable. I take an example 500 MB log dump, open it in TextAnalyzer.net, and just scroll through the logs (often jumping around, following code paths, etc.) while keeping the source code side by side. This lets me understand the execution flow, and is typically faster than attaching a debugger. If it's a multi-threaded program, the debugger is hard to work with, and logs are your best friend. You are lucky if the log has thread information (like a threadId, etc.).


I love wrapping my brain around large codebases in my spare time. I wrote an application to help me download source code repositories in git, svn and mercurial and keep them in sync:

http://vcspull.readthedocs.org/en/latest/

I keep the applications I want to study in a YAML file (https://github.com/tony/.dot-config/blob/master/.vcspull.yam...) and type "vcspull" to get the latest changes.

You can read my ~/.vcspull.yaml to see some of the projects I look over, organized by programming language. You can set up your config any way you want (perhaps you want to study programming language implementations, so have ~/work/langs with cpython, lua, ruby, etc. inside it).


I don't do code reading or comprehension study. Reading code is boring. I typically create a list of small tasks that I want to achieve with the project. If the task is big, I break it down into smaller tasks. Then I rank the tasks from easy to hard. This way, I can start learning about the codebase and achieve my tasks.

In your case, frozen columns seems to be the hard feature, so I would start with the ajax data source. I'd start with a simple SlickGrid example and get it to run, then go find how SlickGrid sets up its data source and expand that piece of code to add an ajax data source. Once I'd finished the ajax data source, I'd dig into frozen columns.

If you are working on a new codebase and worry about bugs, you just give yourself more stress. Bugs (that are not yours) are expected. If they aren't blocking your task, ignore them. Most likely, they aren't relevant to what you are trying to do.


Document the codebase; in my experience, it helps.

In case of JavaScript you’d probably use something like JSDoc. Describe your units and make the tool automatically create beautiful HTML out of that. You don’t have to document everything at once but be sure to lay the groundwork, automate documentation build process, and in general try to make maintaining the docs effortless (for yourself and for others). Take some existing well-documented JavaScript codebase as an example.

This’d make a great contribution already: SlickGrid’s codebase is somewhat poorly documented, which is a barrier to the involvement of interested developers.

As you write the docs, weak spots in the existing implementation will come to your attention, helping you figure out what to fix first.

One downside is that writing down and structuring your knowledge in a way that is easy for others to grasp is a challenge in itself, though arguably a useful exercise.


It depends on the language, the libraries, the tooling, etc.

My dayjob is with a Ruby on Rails consultancy. Said dayjob involves familiarizing myself with a lot of different codebases. My strategy here is rarely to try and digest the whole codebase all at once, but rather to focus on the portions of code specific to my task, mapping out which models, controllers, views, helpers, config files, etc. I need to manipulate in order to achieve my goal.

The above strategy tends to be my preference for most complex projects. The less I have to juggle in my brain to do something, the better. I tend towards compartmentalizing my more complex programs as a result. For simpler programs (and portions of compartmentalized complex programs), I just start at the entry point and go from there.

Languages with a REPL or some equivalent are really nice for me, especially if they support hot-reloading of code without throwing out too much state. Firing up a Rails console, for example, is generally my first step when it comes to really understanding the functionality of some Rails app. For non-interactive languages, this typically means having to resort to a debugger or writing some toy miniprogram that pulls in the code I'm trying to grok and pokes it with function calls.

For some non-interactive languages, like C or Ada, I'll start by looking at declaration files (.h for C and friends; .ads for Ada) to get a sense of what sorts of things are being publicly exposed, then find their definitions in body files (.c/.cpp/etc. for C and friends; .adb for Ada) and map things out from there. Proper separation of specification from implementation is a godsend for understanding a large codebase quickly.

For a rigorously-tested codebase, I'll often look at the test suite, too, for good measure. When done right, a test suite can provide benefits similar to specification files as described above; giving me some idea of what the code is supposed to do and where the entry points are.


I've written scripts to read files, match function calls to their definitions/bodies, and output text "trees", but the process deserves better visualization: navigation of the dependency graph and comprehension-specific highlighting. I'd be interested in trying an IDE that can do this.


First, have in your mind what the function of the chunk of code is. If it's not important to the system, skip it; don't read it. If it is important to the system, take a guess at how you think it should work, how you would probably implement it if you were the original developer. Then begin reading it.


For me, at least, there is no specific method. I work mainly with Java; since your specific case is JavaScript, it may not even apply.

If the problem is a bug and there are stack traces, that is my starting point: a debugger and a few breakpoints chosen from the trace. I follow the stack, and from there I start learning how the code is structured; then the next bug, and so on (fixing them, of course). For code where I need to add features, things get a little trickier, but there is always some entry point, a web-service invocation, some web page, and I try to understand what it is currently doing, again using the debugger to follow the calls and see how the data is changed (sometimes even stepping into libraries).

Reading the docs if there are any is also a good place to start.

Once again, use the debugger a lot, makes it easier to understand than just reading the code.

(edit: formatting)


I try to seek out the data structures first. If I need help doing it, I either run a profiler or insert some debug prints to get an idea of what parts of the callstack are "hot" and then progress from that to discovering the data. (Languages that don't require type signatures everywhere often have this problem of hidden structures.)

Once I know what the data is I can look at the code with an eye towards maintenance of data integrity. I might still need some "playtime" to grok the system but the one truism of large software is that data is always getting shoved from one big complicated system to another, and I can usually identify boundaries on those systems to narrow the search space.

(the exception to this is if you have code that leaks global state across the boundaries. Then much swearing will occur.)


I enjoyed this presentation by Allison Kaptur on how to understand CPython better:

http://pyvideo.org/video/3465/exploring-is-never-boring-unde...

While it is focused on CPython, most of the techniques are applicable elsewhere. It also mentions a great article by Peter Seibel (http://www.gigamonkeys.com/code-reading/) that discusses why we don't often read code in the same way we would literature.

Essentially, as the complexity of software has grown people have been forced to take a more experimental approach to understand software even though it was created by other people.


Try to fix/change/adjust something in the front-end and work backwards from there... This can be frustrating depending on the codebase, but your best bet for learning something new is to try to do something... even something small. If you want to go the extra mile, add comments to stuff that doesn't make sense as you go, and tag things for refactoring with TODOs and corresponding tickets.

Going a step farther still would be to add to the user documentation as you go...

Do something small and iterative, and go out from there... for that matter, just getting a proper build environment is hard enough for some projects... automate the environment setup if it's complex. I've seen applications with 60+ step processes for getting all the corresponding pieces set up.


The first thing I try to do is understand the directory structure, i.e. where should I be looking for files? Hopefully there's a standard structure in use. After that I'll typically try to dig in and fix a minor bug or two. This is especially helpful if you can narrow down the part of the codebase you're working on. I also recommend using an IDE like WebStorm, which will give you the ability to jump to a function definition and will help you find the functions you're calling.

One thing I do NOT recommend is changing the code style, unless you're ready to take full ownership of the project. It can make it much harder for the project owner to merge in and if there are any lingering PRs those will typically need work to merge in properly.


I use debuggers a lot for that purpose. It really helps to find the code paths for specific operations. Instead of reading the code file by file, just set up a debugger, set a few breakpoints, perform an operation, and follow the application code through the paths.


A proper IDE can go a long way towards understanding a large codebase. It will be able to index everything so you can really quickly jump around the project–being able to jump directly from a method call to its declaration without momentarily context switching to search for where it lives is very valuable.

As you start to add to a project the IDE can also prove valuable in discovering how everything fits together, since it will provide smart and helpful completions with docstrings, method signatures, types etc. This can really help you start writing new code a lot faster.

Also, an IDE will usually also have a decent UI for running the code with a debugger attached, which can be incredibly useful for understanding the changing state of a running program.


As someone who hates debuggers and is a fan of "learning by doing", I make heavy use of console.log() or similar: I start putting statements all over the code that print out sentinels ("hey, I'm in this part of the code") and data ("the contents of this variable are: XXXX").

Then I run the app and put it through its paces, while watching the output in another console.

If there's some code that doesn't make sense, I use console.log() more heavily in that section to help me fully understand what it does. Once I have that level of understanding, I write some comments in the code and commit them so that other contributors may benefit in the future.


This codebase is documented and well structured. I would simply begin by tackling the issues on GitHub first and sending pull requests. No need to take over it right away. After you feel comfortable reading the code and knowing where everything is, you can ask to become a maintainer.

I'd try to fix things using the same style used in the codebase. This way anybody else reading, maintaining, or using it won't have to make sense of a new style. Pay attention to how each method is defined. They are very readable, with very few traces of complex one-line statements.

Most importantly, be patient. You won't be any good with it in less than 2 weeks of constant tinkering. Good luck.


This is not a comprehensive answer, but it's additive.

If you're looking at a large Go codebase with many packages, I find it helpful to visualize their import graph with a little command [0].

Here are results of running it on consul codebase:

    $ goimportgraph github.com/hashicorp/consul/...
http://virtivia.com:27080/cehy9dnqaq92.html

[0] https://github.com/shurcooL/cmd/tree/master/goimportgraph


I use CodeMap [1], which is a kind of Google Maps where the countries are the code, and CodeGraph [2], which helps to understand code dependencies at different granularities (package, module, file, function).

[1] https://github.com/facebook/pfff/wiki/CodeMap

[2] https://github.com/facebook/pfff/wiki/CodeGraph

disclaimer: I am the author of those tools.


I do divide-and-conquer. Find some part or feature of the tool you know from an outsider perspective, and then try to find it within the code. Then work backwards from there. Maybe even try to fiddle with it to change how it works, and see what happens.

I think reading each file or reading the data structures is more difficult because you have no familiarity as to what is going on and you have no knowledge of why things are structured as they are, so it'd end up like reading a math paper straight down: memorize a ton of definitions without knowing why, until you finally get to the gist of it.


I first try to familiarize myself with the high level design/org of the code base, going through the README, other docs, looking at the test code if any and just generally scanning the important files/modules etc.

Then I prefer to jump into fixing any existing issue. Working on fixing an issue teaches a lot, more fixes, then features, rinse, lather, repeat.

While this post talks about fixing compiler bugs, the overall steps are broadly replicable: http://random-state.net/log/3522555395.html


I like to try two impractical tasks (impractical in the sense that they might not be possible, which is fine).

1. Access some data in the highest level component from one of the lowest level components

2. Access some data in one of the lowest level components from one of the highest level components

In a lot of cases, good architecture will prevent one or both of these from being possible, but identifying how data flows through the app seems to be a good way to understand the general architecture, limitations and strengths of most apps. These two tasks give concrete starting points for tracing the data flow.


I usually skim the code to get an idea of patterns and organization, get it working in a local environment and then run/step-through the code. This usually gives a good idea of what different pieces do.


I've had to ramp up quickly on a number of projects so far during my career, and I can tell you there's no substitute for simply reading the heck out of the code. Yes, it takes discipline to go through code line-by-line, and at times it may seem pointless or like it's "not sticking". But persistence here pays dividends.

The first read-through is not about comprehending everything. It's about exposing your mind to the codebase and getting it to start sinking into your subconscious. It's kinda like learning a new piece on the piano.


First build and run. See what it does. Check what it does and what I think it does, see how they differ.

Start from main() and start from the one click event (or any end-game action). Try to connect the two.


I try to compose a formal model and algebra of the codebase - quite informally, mind you. Takes a bit of pen and paper and a few caffeinated drinks usually.

People really do learn quite differently and everyone needs to find their mode of learning - there is no one single true way. This is one of the most important skills in software development, IMO. Once you learn how you learn you can apply it to most new contexts.

I write stuff down because, for me, the process of writing seems to be the most effective way to learn.


While this generally works best for larger code bases, I tend to start reading through open bugs/tickets and find things that appear easy. Then I will assign them to myself and do what I can to fix it or at least track it down.

Generally I find it hard to just start reading through packages, source, functions, etc., and find it much easier to try to solve some sort of problem. By tracking and debugging a particular issue through to the end, I find I learn a lot about the codebase.


git grep.

I search for strings that appear in the frontend (or generated HTML source, or whatever), and then I use a search tool (git grep) to find where they come from. And then I use the same search tool again to trace my way backwards from there to where it's called, until I find the code that interests me.

And then I form a hypothesis how it works, and test it by patching the code in a small way, and observe the result.

Oh, and don't forget 'git grep'. Or ack, or ag, or your IDE's search feature.


I find Sublime's Cmd-Alt-↓ (goto definition) to be very useful since you jump straight to the source code for that function/class/method. When you grep you may also get all the usage instances which can be quite a lot of noise.


Other than the many great answers here I will frequently start by doing cleanups of the codebase.

I'll start reading the files using any of the strategies mentioned here and looking for things I can cleanup. Formatting, Simple Refactors, Normalizing Names.

These are all things that are comparatively easy and safe to do, but they force you to reason about the code you are reading. Asking yourself what you can refactor or fix the naming for is a decent forcing function for actually understanding the code.


I found call tracers to be the most efficient way to do this kind of thing. It could be as simple as a perl script inserting printfs on every call and every return, since not every compiler supports instrumentation.

Simply digging through code, tests or reading commit messages in an unfamiliar code base takes at least an order of magnitude more time.

EDIT: Tried call graphs too - better than reading through the code, but they still require you to understand and filter out a lot of unnecessary information.
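For a JavaScript codebase this kind of call tracing can be done at runtime, without compiler support or printf insertion. Here's a minimal sketch (all names are hypothetical) that wraps every function-valued property of an object so each call and return is logged, indented by call depth:

```javascript
// Minimal runtime call tracer: wraps every function-valued property
// of an object so each call and return is logged, indented by depth.
let depth = 0;
function traceCalls(obj, label) {
  for (const key of Object.keys(obj)) {
    const original = obj[key];
    if (typeof original !== "function") continue;
    obj[key] = function (...args) {
      console.log("  ".repeat(depth) + `-> ${label}.${key}`, args);
      depth++;
      try {
        const result = original.apply(this, args);
        console.log("  ".repeat(depth - 1) + `<- ${label}.${key}`, result);
        return result;
      } finally {
        depth--;
      }
    };
  }
}

// Usage on a toy module object: double() calls add(), so the log
// shows a two-level, indented call tree.
const math = {
  add(a, b) { return a + b; },
  double(n) { return this.add(n, n); },
};
traceCalls(math, "math");
math.double(21);
```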


I recently jumped into a very large codebase at work. A few tricks that helped me:

1) Look at the unit tests to see the flow of the code.

2) Make a mental picture of how the code is organized (doing it on paper is even more helpful).

3) Every codebase has a few core classes that do a lot of the heavy lifting; talk to other contributors and ask them to point you to these. 2) also helps you achieve this.

Good luck.


I only recently developed this skill a little.

The Ruby application server I looked at was for doing social network feeds. Posts/Likes/Comments go in, feeds come out.

I followed some common code paths for things such as posting a comment and getting a feed. I would write the stack trace down on paper as I went.

It also helped that I happen to know that this ruby server used wisper and sidekiq. This way I didn't overlook single lines of code such as 'publish: yada yada'


On that note, could someone recommend a tool for automatically generating the graph that shows the class dependencies/hierarchies in a Java code base? I'm sure there are good tools out there, but all the ones I tried so far (JArchitect, CodePro Analytix, SonarQube) don't seem to have a good graph layout engine.

I'd like to print out a big graph and stick it to the office walls so I'll have a good view of the logical structure.


If it's code that I need to understand in intimate detail, I actually trace through the code keeping notes with pen and paper. I complement a simple reading of the code with actually exercising the code with test data and a debugger. I go through a few iterations, each time learning a little more about what is important and what can be safely ignored, until I eventually build up a Gliffy diagram of the important parts.


If it's in Haskell, I start cleaning up and refactoring datatypes.

Like changing some function like:

   Text -> Text -> IO ()
into:

   ServerHost -> Path -> IO ()
Changing the types will naturally lead you through the codebase and help you learn how everything fits together via the type errors.

In any language I'll try to read the project like the Tractatus.

In stuff that isn't Haskell? Break stuff and run the tests.


When you find interesting pieces of code, look at the commit that brought it to life. Commits contain precious gems of information: you'll understand what files are related, who worked on which parts of the codebase, how the commit was tested, related discussions, etc.

Some people use graphical tools to visualize a codebase (e.g. codegraph). It can help you understand what pieces of code are related to each other.


This is one of the reasons I've always thought that each project should have minimal developer documentation covering the project's scope, how it's structured, what its main components are and how they are connected, etc. This would help future developers start working on the actual project much faster and reduce the initial time spent figuring out what it's all about.


When adding support for small new features or fixing bugs on large codebases the answer is: you don't [1].

You do not need to familiarize yourself with the full codebase at the start. It's too time-consuming and mostly not worth the effort. Set up an objective and go for it slashing your coding axe around until it works.

[1]: Unless you have a special interest or you are expected to familiarize yourself with the codebase.


I'd speak to the last person who worked on it face to face with a whiteboard and a marker handy. Get a brain dump ASAP. Even if the person no longer works there, you can take some time to contact them for a lunch. Most people would not say no to this type of request (especially if you're buying). Just make sure you have questions ready so you don't waste their time.


One idea is to use Linux's `perf` to sample stack traces, as the program is running, over a minute or so and see where the code flows.


IMO, the one tool you can't do without is grep.

My typical strategy is to get the project running, then just get to work. Start fixing bugs, and adding requested features. Use the code around you as a guide on what is right and wrong within that company, and forge forward. When you are unsure of something turn to grep, find some examples, and keep going.


I try to work backwards from the public api to get a sense of the operations that are supported by the system. A trick I picked up from a thoughtbot training video a couple years ago for Rails applications is to look at the routes file. If you work with webapps, the routes generally define the things that people can do.


This routes-trick is my starting point as well on web apps.

The next place I try to understand is the persistence layer - be it a database schema or models working against remote APIs. Building a mental model of the data the app revolves around serves as a map for the rest of the code tour.


I gave a talk on this subject at the At The Frontend conference in Denmark recently. Take a look: https://vimeo.com/129469530 It goes over general techniques and then drills down into the React JS code base.


Related question on programmers.se: http://programmers.stackexchange.com/q/6395/436

> What tools and techniques do you use for exploring and learning an unknown code base?


Assuming there is some form of bug list associated with it, that is often my preferred way to learn a new code base.

Try to fix a bug and you'll soon find yourself having to learn how the code involved works, and with a goal your focus will be better than just reading through the code flow.


I use Source Navigator to understand a code base. I wish someone would keep improving it - the fonts etc. under Linux don't look great, though under Windows it's all I need. I'm not sure other tools provide as many functions for code base analysis.


One thing helps me enormously: I sketch a class diagram as I explore the code. Here's an example:

https://s1.whiteboardfox.com/s/494b923d01d7ad05.png


Brute force: Choose a new feature to implement and start looking for the place to write your first line of code.

This is probably not the best way to approach it, but I'm somewhat ADHD-ish and I need a clear task to avoid endless aimless diving in the codebase.


Write characterization tests for modules, see what inputs produce which outputs. Then you have the start of unit tests.

Programming with unit tests really helps. And it points out where certain parts are too entangled and bound to implementation.


"You cannot understand a system until you try to change it." ~ Kurt Lewin


Back when I did Java, using static analysis tools like findbugs, then going and fixing all the issues found was a good way to get coverage of the codebase... I'm sure for JS there must be similar analysis tools.


Read it until I can identify which fad of the moment the author was following.


This is a particularly insightful (if snarky) comment.

The project I just had to refactor had a DSL that was completely unnecessary and had a ton of business logic tangled up within the domain language itself. I ended up being able to remove the DSL and parser completely in favor of simple config files, extracting the business logic into middleware.

The DSL was probably originally created due to some Slideshare presentation that had just hit the top of HackerNews 3 years ago. The original devs molded the problem to fit the ability to use a DSL and cool parsing library rather than figure out what the most suitable design for the problem was.


I don't.

Take the extreme programming approach. Don't try to familiarize yourself with a new codebase all at once. Start small. Work on a small ticket. It will, organically, help you assimilate what's happening.


If there is a bug list handy, I find tackling a few small ones is often an excellent way to get to know a codebase. It also gives some good insight into the codebase's quirks and oddities.


Start with smaller bugs & try to fix them. Bugs help you to focus your understanding on very small parts of code/paths. This helps in time spent vs output vs confidence.


Best thing by far is to find someone familiar with the code and spend 15-30 minutes with them in person or by phone. That should be possible in the vast majority of situations.


Try doing some profiling. It'll take you through some of the more heavily used parts of the code, is useful in and of itself, and provides a target / some focus.


Good resource for Code Spelunking: http://www.codespelunking.com/


Pick a class and new it up from a unit test. You will quickly find out where the dependencies are, and how tightly coupled things are.


The first thing I do is turn on db and http request logging. Sometimes this alone can be quite a challenge.


I always read the tests first. (If there are no tests, I don't take the job. Life is too short.)


Pick couple of bugs and fix them. Best way to familiarize yourself with a new codebase.


Answer: with great difficulty.


Break it one line at a time.


sourcegraph.com can help.


Looks useful, thanks.


read tests, and then start writing tests for things.

something usually comes up.


Build and run the project locally

Then I write unit tests


Fix a bug. Repeat.


I write unit tests


grep -r "function()" .


I've spent the last year rebuilding a huge business-critical system from scratch (along with one other engineer). Yes, usually complete rewrites are a Bad Idea®, but in this case product and business decided it was the only way to move forward because the system was in maintenance hell and it was way too difficult and risky to add new features. I discovered why as I learned the architecture, business logic and features of this behemoth pile of spaghetti. Here's what I recommend to do if you're in a similar situation, whether it be a large and great project or a large and horrible project...

- Get a functional dev environment set up where you can mess around with things in a risk-free manner. This includes setting up any dev databases and other external dependencies so that you can add, update and delete data at will. There's nothing that gives more insight than changing a piece of code and seeing what it breaks or alters. Change a lot of things, one at a time.

- Dive deep. This is time consuming, but don't be satisfied with understanding a surface feature only. You must recursively learn the functions, modules and architecture those surface features are using as well until you get to the bottom of the stack. Once you know where the bottom is you know what everything else is based on. This knowledge will help you uncover tricky bugs later if you truly grok what's going on. It will also give you insight as to the complexity of the project (and whether it's inherent to the problem or unnecessary). This can take a lot of time, but it pays off the most.

- Read and run the tests (if any). The tests are (usually) a very clear and simple insight into otherwise complex functionality. This method should do this, this class should do that, we need to mock this other external dependency, etc.

- Read the documentation and comments (if any). This can really help you understand the how's and why's depending on the conscientiousness of the prior engineers.

- If there's something that you really can't untangle, contact the source. Tell him what you're attempting, what you tried, exactly why and how it's not working as you expect, and ask if there's a simple resolution (I don't want to waste your time if there's not). You may not get an answer, but if you've done a lot of digging already and communicate the issue clearly you might get a "Oh yeah, there's a bug with XYZ due to the interaction with the ABC library. I haven't had time to fix it but the problem is in the foo/bar file." You may be able to find a workaround or fix the bug yourself.

- When you do become comfortable enough to add features or fix issues, put forward the effort to find the right place in the code to do this. If you think it requires refactoring other things first, do this in as atomic a manner as possible and consult first with other contributors.

- Pick a simple task to attack first, even if it's imaginary. Get to more complicated stuff after you've done some legwork already.

There are other minor things but this is generally my approach.


Overwrite functions in dynamic languages (like JavaScript) with some "dump all arguments" code that then calls the original function and returns its result, to get a quick glimpse into the code. Though this doesn't work with closures without some extra eval tricks.
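For functions reachable through an object, the override can be sketched like this (names are hypothetical); as the comment above notes, functions hidden inside closures can't be patched this way:

```javascript
// Replace a function on a reachable object with a version that dumps
// its arguments, then delegates to the original implementation.
function spyOn(obj, name) {
  const original = obj[name];
  obj[name] = function (...args) {
    console.log(`${name} called with:`, args);
    return original.apply(this, args);
  };
}

// Hypothetical target: a grid object with a render method.
const grid = { render(rows, cols) { return rows * cols; } };
spyOn(grid, "render");
grid.render(10, 4); // logs the arguments, then returns 40 as before
```

Because the wrapper forwards `this` and the return value, the patched function stays a drop-in replacement while you observe how the rest of the code calls it.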



