
Ask HN: I have to analyze 100M lines of Java – where do I start? - user1241320
As part of a huge "let's see what's going on here and re-build this from scratch" effort, they dumped the whole code repository on me and my team.

We've started parsing it and tried to work on extracting abstract syntax trees and all that.

Any idea would help us a great deal.

Thanks.
======
sergiosgc
For rewrite from scratch projects, I always start by identifying the use cases
covered by the application. You don't need the code for that. Just run the
application and identify what it is that it does. Then, work backwards. For
each use case, use the existing code as specification of the use case
behavior.

At 100 million lines, I'd suspect this is either an extremely large project,
where a rewrite from scratch is inadvisable, or that there is a code generator
at work. If it is the latter, you want to analyze the code generating source,
not the end result.

Anyhow, generically, for a first contact with a new code base, code coverage
tools are a good start, as is a call graph debug run of the project. It'll let
you spot dead code as well as hot code (code being called at every run of the
application). It'll highlight the important and non-important code parts,
allowing you to read less code and get a grasp on the architecture.
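Before reaching for heavier tooling, even a throwaway script that counts lines per top-level package gives you that first coarse map of where the mass of the code lives. A minimal sketch - the directory names and sample files below are fabricated so it runs standalone; point `root` at the real source tree instead:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Toy "where does the mass of the code live?" map: lines of .java source
// per top-level package directory. The sample tree is fabricated so the
// sketch is runnable; in practice, point `root` at the real source tree.
public class LocMap {
    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("src");
        write(root.resolve("billing/Invoice.java"), "class Invoice {\n  int total;\n}\n");
        write(root.resolve("billing/Tax.java"), "class Tax {}\n");
        write(root.resolve("reporting/Report.java"), "class Report {}\n");

        Map<String, Long> locPerPackage = new TreeMap<>();
        try (var paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (!p.toString().endsWith(".java")) continue;
                String topLevel = root.relativize(p).getName(0).toString();
                long lines = Files.readAllLines(p).size();
                locPerPackage.merge(topLevel, lines, Long::sum);
            }
        }
        locPerPackage.forEach((pkg, loc) -> System.out.println(pkg + ": " + loc));
    }

    private static void write(Path p, String body) throws IOException {
        Files.createDirectories(p.getParent());
        Files.writeString(p, body);
    }
}
```

Sorting the result by size usually identifies the handful of subsystems worth reading first.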

~~~
fleitz
Yup, start the profiler up, use the application, check the call graph.

At 100 MLOC the code base is probably a complete mess.

Also, Simian is probably your friend, as it will identify large chunks of
duplicate code.

Source control can be your friend as well: the older the source is, the more
likely it is to contain useful code. The files with the most changes will
usually be where the bugs are.
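If Simian isn't an option, the core of a copy/paste detector is simple enough to sketch yourself: slide a fixed-size window over each file, normalize whitespace, and hash the chunks; anything seen twice is candidate duplication. A toy version over in-memory strings - real tools like Simian and PMD's CPD work on token streams and are far more robust:

```java
import java.util.*;

// Toy copy/paste detector: record every 3-line window of whitespace-
// normalized source and report windows that occur more than once. The two
// "files" are invented strings; a real run would read files from disk.
public class DupFinder {
    public static void main(String[] args) {
        String[] fileA = {"int x = 0;", "x += 1;", "save(x);", "log(x);"};
        String[] fileB = {"int x = 0;", "x  += 1;", "save(x);", "done();"};
        Map<String, Integer> seen = new HashMap<>();
        for (String[] file : List.of(fileA, fileB)) {
            for (int i = 0; i + 3 <= file.length; i++) {
                StringBuilder chunk = new StringBuilder();
                for (int j = i; j < i + 3; j++) {
                    chunk.append(file[j].replaceAll("\\s+", " ").trim()).append('\n');
                }
                seen.merge(chunk.toString(), 1, Integer::sum);
            }
        }
        seen.forEach((chunk, count) -> {
            if (count > 1) System.out.println("duplicated " + count + "x:\n" + chunk);
        });
    }
}
```

Even this crude version finds the shared 3-line block despite the differing whitespace; token-based tools also survive renamed variables.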

~~~
jacquesm
Why did this get downvoted? It's actionable and on topic.

------
goshx
10 years ago I worked on a large project to re-write a code base written in C.
Our approach was to forget about the code and document everything it did from
the user's perspective. Once everything was mapped out we decided on what we
were going to keep, modify or remove, and then started building everything
from scratch. You can always go back to the original code to see how a
particular feature was implemented and perhaps re-use the same logic.

~~~
burnte
Having done projects like this before as well, this is the best method.
Knowing WHAT needs to be done today and tomorrow is far more important than
knowing HOW it was done before. The how is only important once you know what
you need to do.

~~~
k__
lol, so true...

Funny thing is, often the users tell you "I used it for XYZ" and it was...

...never written for XYZ

...never DID XYZ; all the results/numbers were trash, but no one noticed

------
michaelvkpdx
With a codebase like that, it's better to look at it through the users' eyes,
rather than trying to reverse engineer the business from the code. Things that
look like bugs in the code may actually be features for the users, or may have
been absorbed so long ago that they've fundamentally changed the nature of the
business.

You don't need to understand the whole codebase. It will take years. Best to
focus on what the users need and analyze small chunks. If it's truly 100M
lines, there's not going to be any semblance of consistency in the code.

You can also slap New Relic on it and you may be amazed at what you learn,
right away.

Don't waste too much time trying to understand all the code. Focus on a couple
of issues first, make some hypotheses, and then see how well your
understanding of the code fits the bigger picture. Refactor and repeat.

~~~
user1241320
Any human line-by-line/application-by-application analysis is (for this
particular discussion) out of scope.

Given the size of the thing, the way we thought we were going to work is
quite different.

For instance, suppose we produce ASTs for all the routines/pieces of
logic/you_name_it; we then want to find similar patterns or clusters that
would give us hints, so we can work in a "Pareto-like" way.

As already stated, it's not ONE project; it's an old (but still running),
poorly-documented codebase produced over decades around this big firm we work
for.

~~~
jerven
Don't try to figure out how the code does what it does yet. Figure out what
systems exist inside it:

    
    
      1.  What kind of modules?
      2.  Which servers/hardware?
      3.  Which databases/datastores?
      4.  What systems talk to what?
      5.  What test systems exist or existed?
      6.  Which APIs/frameworks were used?
      7.  Who is currently working on them/maintaining it?
      8.  Is anyone left who used to?
      9.  Why is a rewrite on the table?
      10. Is there any way you can work on smaller pieces at a time?
      11. What are the pain points of the current users (will tell you what area to focus on)?
      12. Can you document what comes in and out?
    

In my experience with such large code bases, there is never just one way to
do things: I once worked on a smaller system with 4 ways to talk to the same
database. On one with 100 million lines I would expect even more ways to
Rome ;)

If you do want to go down the static analysis path, start with existing tools
before trying to build your own. If needed get external help for this.

A 100 Million lines of code is not so bizarre. The project I work on is
currently about 300,000 lines and a project some 300 times larger is quite
imaginable for me.

~~~
jacquesm
Code complexity does not increase linearly.

100 M lines is stupendous.

~~~
jerven
Not really; my project is 3 FTEs for 8 years. Double it to 16 years and then
multiply the number of developers by 100.

Consider a large enterprise having 300 developers in multiple teams, I am not
at all surprised that they can manage to write 100 million lines or so. Also I
think this is really a system of systems, and in my experience probably has
large parts developed by the lowest bidding firms. Which when software is
developed for 10 years or more means more than one way to do the same thing.

Also, they have to deal with lots of ancient systems and work around weird
bugs probably fixed years ago. You know, things like bugs in Java 1.2 on
HP-UX and stuff like that, or errors in Oracle 7i etc...

Plus functional duplication, because team A did not know subteam C2 built the
same thing...

Editing my comment instead of replying to the excellent comment by @jacquesm
as hn does not allow me to reply to the reply.

Actually completing the transfer of a codebase like that to a new team is
unlikely without much, much more of a handover. But some high- or
middle-level manager, frustrated with the current system, asking a team to
start rebuilding before it gets shut down a few months/years later is very
possible. Another plausible option is a corporate takeover... but then I
would expect a very experienced team to work on it, who would not need to
ask HN for this kind of thing.

I personally have been in a situation where code moved between companies and
no documentation or old developers were available. Not as large as this, only
1 or 2 million lines of code/XML. But I am no longer surprised by the stupid
acts that large corporations can perform.

And this must be multiple systems, not just one. You can't build a single jar
file of 50+ million lines of code that could have been loaded in a JVM around
1.3.1, even on high-end hardware for the time.

~~~
jacquesm
How realistic does it sound to you that a codebase that size would be
transferred to a new team without any of the old team, outside of a hack of a
bank or a reversal of some outsourcing decision or something like that?

Typically the value of such a codebase is determined by the quality of the
team maintaining it and the degree to which it is documented.

Complexity of software constructs is not linear, and enough books have been
written about naively multiplying and dividing man-years and lines of code
that I don't think we need to hash that out all over again. See 'The Mythical
Man-Month' and many similar books and articles.

------
logn
Source to UML: [http://www.architexa.com/](http://www.architexa.com/)

Getting call paths: [https://github.com/gousiosg/java-
callgraph](https://github.com/gousiosg/java-callgraph)

Line coverage from instrumented jars:
[http://emma.sourceforge.net/](http://emma.sourceforge.net/)

For this type of request, I'd push back and say, let's identify very small
parts of this and begin rewriting those one at a time in an isolated project.
Kind of an agile rewrite that will combine the legacy project with the slowly
rewritten one. Use the tools to identify parts of the project than can be
isolated. Build new interfaces or services to let the old project communicate
with the new one. Get a history of the source repository to see where recent
edits are and prioritize those to be rewritten first (presuming they want a
rewrite to lower maintenance costs).
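The churn ranking mentioned above needs nothing fancier than counting filenames in `git log --name-only` output. A sketch that works on a captured log - the log text here is invented; in practice you'd feed in the real command's output:

```java
import java.util.*;
import java.util.stream.*;

// Rank files by change count, given the kind of output produced by
// `git log --name-only --pretty=format:`. The log below is fabricated;
// in practice, capture the real command's output and feed it in.
public class ChurnRank {
    public static void main(String[] args) {
        String log = String.join("\n",
            "src/Billing.java",
            "src/Util.java",
            "",
            "src/Billing.java",
            "",
            "src/Billing.java",
            "src/Report.java");
        Map<String, Long> churn = Arrays.stream(log.split("\n"))
            .filter(line -> line.endsWith(".java"))
            .collect(Collectors.groupingBy(f -> f, Collectors.counting()));
        churn.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

The files at the top of the ranking are where recent maintenance effort (and likely bug-fixing) is concentrated.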

~~~
ramon
nice links! :)

------
bjackman
You haven't really described your goals: What do you want to extract from your
analysis? Metrics to tell you what's "wrong" with the existing code base? Some
sort of model of the system's semantics?

~~~
user1241320
We'd love to know what these lines do. For example what part of this codebase
deals with the DB and what part does not. And then go deeper.

The final goal is to re-do what these lines do :(

~~~
grey-area
100,000,000 lines of code is a huge amount and would take you over 1000 days
just to read at 1 line per second, and 1000 man-years to fully understand. If
your final goal is to rewrite all of it you are probably doomed to fail. You
should first ask yourself (and your clients) some simple questions about why
this insane project has been dumped on you and what the goal is:

What is the order of priority of services - which services/apps are critical,
and which are not very important?

Which services actually need to be rewritten and which are working just fine?

Which services have a clearly defined interface and can be rewritten?

Which tests are in place to test the existing services, and which will you
have to write?

I wouldn't touch the code till you have answered those questions, and once you
have those answers, having some sort of overview of code coverage etc is going
to seem less important, because it will become obvious which bits need to be
touched first (the ones that are both mission critical and broken), and which
bits you can easily isolate.

You will find it very very hard to show concrete progress if you try to change
all of this code at once, in a global way (for example by tidying up every
single reference to a db to use a new db interface, or things like that). If
you do, you'll never reach your final goal, and end up spending months tidying
up without actually delivering value to the business.

~~~
johan_larson
Yes, a "rewrite" is the wrong way to think about this project. A typical
programmer might be able to produce 10K lines of production code in a year,
which means it would take 10,000 engineer-years to rewrite it all.

The OP needs to think in terms of stewardship, not complete reconstruction,
and make improvements by small steps as a gradual process.

I talked about a similar problem here: [http://short-
sharp.blogspot.ca/2012/08/fixing-broken-codebas...](http://short-
sharp.blogspot.ca/2012/08/fixing-broken-codebase-part-i.html)

------
rashthedude
What type of application has 100 million LOC? Windows 7 has 40 million lines
of code, so I'm wondering what type of application/software it is.

~~~
gerhardi
I have a quite strong feeling that certain complex automated systems within
financial services / insurances domain could reach those LOC levels. Including
all the frontend side, internal backend logic, possible web services, internal
tools, tests, tens to hundreds of interfaces to different kinds of external
services, report generation, libraries, etc.

~~~
user1241320
BINGO!

~~~
jacquesm
BINGO all you want but if you're at liberty to disclose such things (or to
confirm them) you should have included it in your original write up.

~~~
thound
Sure, 'cause if he/she had, you would already have found a solution! No need
to be rude or arrogant, especially pointlessly so.

~~~
jacquesm
No, it's just that such information makes a huge difference to those that try
to make sense of the question, especially a poorly defined and somewhat
confusing question as posed here.

~~~
kjs3
How exactly does knowing it's an insurance/finance application _magically_
change the answer to "how to get my head around 100m lines of Java code"? Is
there some "financial java code analysis" tool or technique that's so
completely distinct from "engineering java code analysis" or "healthcare java
code analysis" that it deserves a snotty, condescending retort?

~~~
jacquesm
An 'Ask HN' with this level of input is not a guessing game where the people
that try to help have to play 'bingo' with the asker for information that may
or may not be important for the answer. If there is context available and the
asker is free to talk about that context then it should be supplied up-front.

There is nothing snotty or condescending about that, but there is something
very weird about this whole thread, I wished I could put my finger on it.

------
cyrillevincey
Before rebuilding any piece of software from scratch, I would give a serious
look at this amazing bunch of wisdom:
[http://www.joelonsoftware.com/articles/fog0000000069.html](http://www.joelonsoftware.com/articles/fog0000000069.html)

~~~
jebblue
Great article, it hits at the core point. There's an old saying, don't throw
the baby out with the bathwater.

------
EtienneK
1) Focus on the functional use-cases and not code.

2) Identify integration points to other systems and ask why they are there

3) Realize that a "big-bang" rebuild never works and that it's better to break
up the system into smaller pieces and replace them piece by piece.

------
mml
Funny, I have 50,000 individual Java apps to analyze. I started with a
copy/paste detector. Pmd has a free one. Good luck!

~~~
tcopeland
Wow, a blast from my distant past... a link to the copy/paste detector:

[http://pmd.sourceforge.net/pmd-5.1.3/cpd-
usage.html](http://pmd.sourceforge.net/pmd-5.1.3/cpd-usage.html)

------
sp332
As a first pass, try deleting as much code as possible :) If there are files
or whole projects that aren't needed anymore, they're just slowing down your
analysis. Also some dead-code analysis could be helpful, at least in broad
strokes. You could instrument the code with a test coverage tool, then run the
code instead of the tests to see what code gets reached.

Edit: You could also look for duplicated code, and quickly refactor that to
just be in one place.
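The run-it-under-coverage trick can be illustrated with a toy version: have every method register itself on entry, exercise the application, and treat anything never registered as a dead-code candidate. All names below are made up; a real setup would use JaCoCo or EMMA instrumentation instead of hand-written registration:

```java
import java.util.*;

// Toy "coverage as dead-code detector": every method registers itself on
// entry; after exercising the app, anything never registered is a dead-code
// candidate. Real setups instrument bytecode (JaCoCo/EMMA) instead.
public class ReachRecorder {
    static final Set<String> reached = new HashSet<>();
    static final List<String> allMethods = List.of("start", "process", "legacyExport");

    static void start()        { reached.add("start"); process(); }
    static void process()      { reached.add("process"); }
    static void legacyExport() { reached.add("legacyExport"); }  // never invoked

    public static void main(String[] args) {
        start();  // "run the code instead of the tests"
        for (String m : allMethods) {
            if (!reached.contains(m)) System.out.println("dead-code candidate: " + m);
        }
    }
}
```

The caveat is the same as for real coverage runs: code not reached in one session isn't necessarily dead, just not exercised yet.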

------
radicalbyte
Do you and your team have experience of Java development?

Your question sounds like something someone with either no real experience or
no experience of an object-oriented language would ask.

100 million lines is a lot of code. Why do you need to "parse it to extract
the AST"? That's crazy.

Do you have the original design documents and architectural documentation? If
you do, read it.

~~~
klibertp
Downvoted.

Even if the design docs existed, it would take months to read them, without
any guarantee that they correspond to reality.

Meanwhile, automated analysis of the actual code can give you at least a
high-level overview of the codebase and maybe a hint where to start digging.
Getting the AST is the first step required for most automated tools to do
their work.

EDIT: I acted too rashly and downvoted your post before I realized what we
are really talking about. Sorry about this. I am still convinced that
automatic, static analysis of the code is the way to go, but you obviously
don't deserve a downvote for having a different opinion. I'll try to make it
up to you by being more careful in the future :)

~~~
radicalbyte
Hahaha, no problem.

Static Analysis would be my second step, but first I'd have a look at the
architectural documentation. I can't imagine that a project of this size
wouldn't at least have a Powerpoint explaining the structure and concepts of
the code.

Then it's time to start using tools.

------
xradionut
Here's a suggestion I haven't seen: Unless you have full management support, a
skilled team, valid business reasons for this conversion, and expectations of
succeeding, consider moving to another company/job.

You've been given the task of digital archeology/septic cleanup. Unless you
like the tedium and stank, it's not going to bode well...

------
dugmartin
Understanding the "shape" of a codebase is something I've always been
interested in and I started building a tool to help me understand and traverse
code here:

[http://sherlockcode.com/](http://sherlockcode.com/)

However I don't think it would scale to 100M lines of code. I have run Linux
through it and it was acceptable (both in run times and browse times). At 100M
lines of code you need some way to see an overall "map" of the codebase and
then drill in to the bits you are interested in. Just linking via symbols like
SherlockCode does is too micro of a view.

There are a lot of interesting visualization tools out there both commercial
and academic. I don't have any Java specific ones to recommend but a quick
Google search for "java code visualization tools" shows a lot of promise.

~~~
dprice1
This thing seems like a good start but I have a bug report for you. For me at
least, it's a deal-breaker. When browsing a source file, (firefox 32.0 on
macos) pageup/pagedown/spacebar and up/down arrows do not scroll the code,
even when the code pane has focus. Pressing any of these gives focus to the
search box. I need to be able to use keyboard navigation at least for
scrolling.

------
myang
A very rough estimate: assume you have 10 experienced developers on the team,
each able to read and comprehend 1000 lines of code per hour. Given a 10-hour
workday, the team can digest 100,000 lines of code per day. Just reviewing
the code will take 1000 days, about 2 years and 9 months.
Not sure how much time you have and when you would expect to deliver the final
product. On the other hand, if you can find out the use cases and even take a
look at the current product, you then may not have to review the source code
but just go ahead implementing the features.

------
FollowSteph3
I don't think you can just do a cold re-write of that size without domain
knowledge. I would first try to refactor the existing system just to reduce
the code size. That big a system probably has horrific code, and you can
easily shrink it quickly. Just finding duplicate code will have an impact, as
will pulling out code that could be replaced by open-source libraries, such
as file-utility code.

Basically I would first try to reduce the size of the problem while trying to
get domain expertise. I wouldn't consider a rewrite at this stage...

------
Koziolek
1\. Configure Jenkins builds.

2\. Add PMD for code analysis.

3\. Add Sonar for code analysis (it has different rules than PMD).

4\. Use Archeology 3d >
[https://github.com/pslusarz/archeology3d](https://github.com/pslusarz/archeology3d)
to visualize your code stats.

But before you start, just pray to the Omnissiah
([http://warhammer40k.wikia.com/wiki/Machine_God](http://warhammer40k.wikia.com/wiki/Machine_God)).

------
jacquesm
Callgraph. Then document the larger chunks, working your way down.

It's like having a map versus having no map at all.

And 100M lines? Are you sure there is no code generator at work here?

~~~
user1241320
It's code that's been developed and it's been running for decades, I'm afraid.

~~~
venkyk
developed in java and running for decades? how long has the language been
around?

~~~
arethuza
I started using Java in early '95 (having begged a copy from someone at Sun)
and I seemed to be one of the first people writing stuff outside of Sun.

~~~
jebblue
I started in August 1995 when I first heard about it, which I thought was when
they released it but it may have had a variety of trickle releases.

~~~
arethuza
I was working on something _slightly_ similar (embedding a VM into a browser)
during '94/'95 and was a bit miffed when I first heard of Java....

However, I did think Java was rather good and when I co-founded a start-up in
mid '95 we positioned ourselves as a "Java company" \- which was no bad thing
in the long term as we were in a reasonable position when Netscape, Novell and
IBM later decided they wanted to support it. Indeed, our 2nd round of VC
investment was led by Novell - quite unusual for a UK company at the time...

~~~
sgt101
I was doing Java in a very mainstream UK company in 1998; but we were leading
edge and it became "standard" in 2000ish. By 2003 there were 100's of
developers using it where I worked.

------
sgt101
First thing: go find the key users (start from the CEO and work down) and
find out what it does that is important to them. Map that.

Find anyone technical who is still around and can talk sensibly about it and
find out what they think is important. Map that.

Use anything automated to map what it's up to (calling...) and find out where
the core of it is.

You may know what is important by this time, you will be able to make some
sort of start...

------
weinzierl
Large projects tend to accumulate lots of unused cruft. Coverage tools like
EMMA/JaCoCo can help here, but I have successfully used UCDetector [1]. It's
not bulletproof, but it helped me a lot when analyzing code to remove the
unused parts, even if there are false positives sometimes.

[1] [http://www.ucdetector.org/](http://www.ucdetector.org/)

------
fiatmoney
The hardest part is often figuring out what the inner loop actually looks
like. The best way to find it is to hook up a profiler, and look at a bunch of
stack traces. That'll let you find the most common entry points and calling
patterns, which will go a long way towards understanding it.
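That stack-trace idea can be prototyped in-process before committing to a full profiler: periodically grab a worker thread's stack and tally the frames you see. A crude sampling sketch - the `hotLoop` workload is fabricated, and real profilers such as JFR or async-profiler do this with far less bias:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Crude stack-sampling "profiler": tally which method sits on top of a
// worker thread's stack across repeated samples. Hot methods dominate the
// tally, which is exactly how sampling profilers surface the inner loop.
public class StackSampler {
    static volatile boolean running = true;

    static void hotLoop() {
        long x = 1;
        while (running) { x = x * 31 + 7; }   // deliberately busy
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(StackSampler::hotLoop, "worker");
        worker.start();

        Map<String, Integer> topFrames = new ConcurrentHashMap<>();
        for (int i = 0; i < 200; i++) {
            StackTraceElement[] stack = worker.getStackTrace();
            if (stack.length > 0) {
                topFrames.merge(stack[0].getMethodName(), 1, Integer::sum);
            }
            Thread.sleep(5);
        }
        running = false;
        worker.join();
        topFrames.forEach((m, n) -> System.out.println(m + ": " + n + " samples"));
    }
}
```

On a 100M-line system you'd point a real profiler at production traffic instead, but the principle - frequency of appearance approximates time spent - is the same.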

------
mokeefe
Read this, then try to convince them not to re-build from scratch:
[http://www.informit.com/articles/article.aspx?p=1235624&seqN...](http://www.informit.com/articles/article.aspx?p=1235624&seqNum=3)

------
ufmace
A few ideas that have worked for me in the past:

Map the control flow. This code/app/whatever is doing something in production
right now. What tells it to start? How does the control flow from the start
point to the stuff that takes in data to the stuff that writes the output or
does whatever this app does? Whatever the options for how it works are, where
are they set, how do they make it into the core of the application to affect
whatever it does?

Map the data flow. Input must be coming into this thing somewhere. Find where
it reads it in, where it writes it out, and how it gets from one to the other,
what data structures and methods it passes through on the way.

------
atlantic
1) determine the use cases of the different applications involved (ie, what is
each one used for, how does it fit into the company's workflow)

2) treat each app as a black box, understand the major data flows involved
(which data sources is it interacting with? is it doing reads or writes? which
tables)

3) treat each app as a black box, and try to understand the interactions
between each app and any external components (other apps, web services, etc)

4) identify the overall architecture, determine the class hierarchy for each
app, identify the major classes and functionality

By this stage you should be ready for a rewrite. At no point do you need to go
into the code in any great depth.

------
mping
There is no programmatic way of doing this. You need people with domain
expertise to help you through. Obviously, the effort to fix/migrate will
always be proportional to the time it took to create such a mess.

This is 100M lines; it will never be easy. I would take some time to create
the tooling to do this. Say, create a tool to inject some bytecode to
generate a pretty callgraph. Then, I'd run the use cases or functionalities
individually and save the callgraph somewhere. But in the end, you will
always need domain expertise to guide you through the logic of it.

------
aragot
From the "decentralized web"/Agile spirit: Keep the original app online,
separate it in several functional domains, and replace them progressively,
month after month. This way, each iteration is a small manageable chunk,
functional experts can have a complete understanding of their own scope, and
the result is a set of independent, scalable webapps with a clearly defined
scope.

... assuming you have webapps.

------
darrelld
Start at the main function. See how things get setup and walk through the code
from there. Keep notes on the structure and flow of things (if any). There
isn't really an easy way to do this unless it had been documented properly
before.

AstroGrep is a good Windows-based tool that allows you to search within
files, so you could use it to find which files spit out a particular output
to the screen.

Not sure what you mean by using ASTs though.

~~~
jacquesm
AST = abstract syntax tree.

With 100M lines your process will take many years.

~~~
eklavya
Am I missing something or are you saying that making an AST like a compiler
will help you understand a huge codebase better and faster?

~~~
klibertp
"making an AST like a compiler" will give you a semantic - as contrasted with
textual - understanding of the code. Especially with how much data Java
encodes in its source, this makes it a very good base for running automated
analysis and visualisation tools, perhaps written by yourself.

In general, having an AST is always better than having a plain text file,
unless you want to read it. But then you can easily dump the AST back to text
whenever you want.

Yeah, building an AST will help you analyse your codebase programmatically,
which in turn will let you understand the codebase better and faster. This is
some very basic programming knowledge, I think. Or is it not? Some comments
here don't even know what an AST is - is this the state of PL knowledge in
the mainstream? Lisp and Smalltalk people would be very, very sad if it were
so.
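As a concrete starting point, the JDK can already hand you an AST without a third-party parser, via the `com.sun.source` compiler tree API (JDK 9+). A small sketch that parses an invented source string and counts method invocations - a query that is painful on text but trivial on a tree:

```java
import com.sun.source.tree.MethodInvocationTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import javax.tools.*;
import java.net.URI;
import java.util.List;

// Parse Java source into an AST with the JDK's own compiler API and count
// method invocations. The sample source string is fabricated; in practice
// you would feed in files from the real codebase.
public class AstDemo {
    static class StringSource extends SimpleJavaFileObject {
        final String code;
        StringSource(String name, String code) {
            super(URI.create("string:///" + name + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignore) { return code; }
    }

    public static void main(String[] args) throws Exception {
        String src = "class Demo { void run() { System.out.println(helper()); } "
                   + "String helper() { return \"hi\"; } }";
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        JavacTask task = (JavacTask) compiler.getTask(null, null, null, null, null,
                List.of(new StringSource("Demo", src)));

        int[] calls = {0};
        for (var unit : task.parse()) {
            new TreeScanner<Void, Void>() {
                @Override public Void visitMethodInvocation(MethodInvocationTree t, Void p) {
                    calls[0]++;
                    return super.visitMethodInvocation(t, p);
                }
            }.scan(unit, null);
        }
        System.out.println("method invocations: " + calls[0]);
    }
}
```

The same `TreeScanner` pattern extends to the clustering idea upthread: visit the trees, extract features per method, and compare them across the codebase.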

~~~
jacquesm
I disagree an AST would help with a project this size, it's just unmanageable.

You'd be better off to start with just the build scripts and build tools.

ASTs are great for increasing understanding of _much_ smaller projects but for
something this size you'd likely end up with very little to show for your
effort except the crashlogs of your tools.

You need to go 'coarse' before you can go 'fine' on something this magnitude.

This is not a 3 week project, just mapping the thing properly will take
(man)years.

~~~
klibertp
> something this magnitude

Yeah, I started commenting before the realization of how HUGE this thing would
be hit me, sorry :)

~~~
jacquesm
np.

I think your sorry would be better directed at this guy:

[https://news.ycombinator.com/item?id=8257519](https://news.ycombinator.com/item?id=8257519)

------
PeterisP
Divide and conquer.

Find a way to split it into something like 10 pieces of 10m LOC each in a way
where you can understand and [re-]document the data and control flow between
them.

Repeat with further subdivision, as much as you have people.

Then, if you really need to re-build this from scratch, do it per component -
first, make automated integration tests for its functionality, and only then
attempt to rebuild that part of the system.

------
joshdance
I don't know if you want to rewrite it. Having to identify all the use cases
for a system that large will be horrific, especially since people have built
onto the broken ways. And a large customer will require that it be exactly
like it was before, so you have to recreate the old broken way as well as the
new shiny way. Who decided it needed to be rewritten?

------
bilalhusain
Please elaborate. Why is there a need for extracting ASTs? Is it a single
100M-line Java source file? AFAIK Java has limitations on method size. If the
code is already organized into files and methods, try to come up with some
sort of UML representation. (I am assuming you are trying to understand the
code base, not profiling or doing code analysis.)

~~~
MaxBarraclough
Seconded.

> I have to analyse [...]

> We've started parsing it and tried to work on extracting abstract syntax
> trees and all that.

Why? How will this help? What are you really after?

~~~
user1241320
Not a single line. The whole Java codebase.

~~~
MaxBarraclough
That doesn't answer my question.

How does parsing ASTs help accomplish the goal? What _is_ the actual goal
here? What is meant by 'analyse'?

------
markc
Several years ago I saw an impressive demo of an analysis and refactoring tool
for large Java codebases called SonarJ (now Sonargraph) by hello2morrow. There
are a few other tools in this category (jdepend, agilej, jarchitect). They can
give you dependency graph visualizations to help untangle the spaghetti and
grok the higher-level structure.

------
V-2
As for your first step - with a project that size (and thus of substantial
age, I assume), there are surely tons and tons of dead code. I'd throw it
away first. Slim the thing down. The very process of identifying unused code
will already familiarize you roughly with the conceptual "shape" of the
application.

------
ramon
If you're a UML guy here's a great option [http://www.altova.com/umodel/uml-
reverse-engineering.html](http://www.altova.com/umodel/uml-reverse-
engineering.html)

------
Igglyboo
What are you actually trying to accomplish? "Analyze" is a broad term.

------
bettynormal
With 100 million lines of code I would:

1\. Find out what is still used, remove the rest.

2\. Split code into stand-alone supportable units: applications/libraries
etc.

3\. Rank units in order of new requirements and what code will need to be
changed.

4\. Divide code between teams.

5\. Get code to build, pass any tests, and match the last released versions.

6\. Go back to management and get them to let you hire lots of people. A
person per million lines would be very low...

7\. Learn code in order of need.

------
stuaxo
Start with a static analysis tool - it will find lots of small possible bugs,
by fixing them you will get good coverage of the code and insight into the
structure + it will be better at the end.

------
andrewljohnson
The code length has to be overstated, by including libraries, generated files,
or data files. Is any real code that long?

I bet the core java code the team actually wrote is 2 orders of magnitude
smaller.

~~~
pacofvf
Yeah, the LOC count in one of the biggest banks in the world is close to 10M
(including: trading [eq & ficc], credit, mortgages, wholesale, wealth
management, quant, risk, and also UIs, tests, reports, recons, interfaces). I
don't think a task like that was given to a single person; usually with that
base of code a big IT consulting company is hired.

------
jhawk28
Look into Structure101 ([http://structure101.com/](http://structure101.com/))
to use static analysis to see the structure of the application.

------
bosky101
If you can run the program, I highly recommend searching HN for strace.

eg: "whats that program actually doing. start with strace"
[https://blogs.oracle.com/ksplice/entry/strace_the_sysadmin_s...](https://blogs.oracle.com/ksplice/entry/strace_the_sysadmin_s_microscope)

I've successfully reverse engineered messaging protocols, written drivers for
a different language, and ported large projects just by watching what a
program does over the file system and network.

~B

------
mattgibson
Just curious: what application needs 100M lines of anything?

~~~
user1241320
It's not just one application. It's many of them.

~~~
eCa
I agree with kubiiii.

Since a rewrite is in the cards (hopefully), there is something wrong with
the entire system.

* Identify where the applications interact with each other.

* Identify the most problematic applications.

* Rewrite those (starting with the smallest) while trying to keep the interfaces between applications constant.

And involve end users as much as possible.

------
ramon
On Eclipse: [http://www.nwiresoftware.com/products/nwire-java](http://www.nwiresoftware.com/products/nwire-java)

------
jebblue
jvisualvm comes with the JDK; I'd start there with profiling:

[http://visualvm.java.net/profiler.html](http://visualvm.java.net/profiler.html)

Edit: adding to this, I'd set some judicious breakpoints in the hot spot areas
identified through profiling, along with some System.out.println's (or,
better, dump to a flat-file database; SQL can be used to work wonders for
analysis even on flat-file data).
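A minimal version of the dump-to-flat-file idea might look like the sketch
below; the file name and CSV layout are arbitrary choices, not any standard,
and a real setup would use a proper logging framework.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Poor man's profiler: time a code section and append a CSV row
// (timestamp,tag,nanos) that can later be loaded into any SQL-capable tool.
public class TraceLog {
    private static final Path LOG = Paths.get("trace.csv");

    // Appends one measurement; returns false instead of throwing so it
    // never disturbs the code under investigation.
    public static boolean record(String tag, long nanos) {
        String row = System.currentTimeMillis() + "," + tag + "," + nanos + "\n";
        try {
            Files.write(LOG, row.getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return true;
        } catch (IOException e) {
            return false; // in real code, at least report this somewhere
        }
    }

    public static void main(String[] args) throws IOException {
        long t0 = System.nanoTime();
        // ... code under investigation would run here ...
        record("hotspot-42", System.nanoTime() - t0);
        System.out.println(Files.readAllLines(LOG).size() + " row(s) logged");
    }
}
```

The CSV can then be imported into SQLite or any database and queried for
counts, averages, and outliers per tag.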

------
mschaef
You need to have a clear understanding of the point of the analysis before you
analyze anything. What, specifically, does your team have to produce? How much
time and how many people do you have to complete the work? If you're the
leading edge of an effort to rewrite 100MLoc, my presumption is that your
deliverable is mainly a 'gross anatomy' of the system... a basic description
of the major structural components and how they interact with each other. If
that's the case, I'd start by looking at the build scripts and the modules
they build. Try to make a comprehensive list of major components. You'll get
it wrong initially, but you'll need a starting point.

The next thing I'd do is take the top level list of modules and start
assigning it to individual people within the team. Their responsibility is to
produce some kind of top level description of how the individual modules work.
A big part of this phase of the effort should be meetings or informal
conversations as the per-module analysis progresses. As your team talks among
itself, you should be able to find commonality between modules, communication
links, etc. The key at this point is to keep it high level, and avoid getting
too bogged down in the details. With this much code, there are plenty of
details to get bogged down in. As a result, you'll probably have some
mysteries about how the code actually works beneath various abstraction
layers. Make and update a list of these 'mysteries' and keep it next to your
team's list of modules. As you work through the list of modules, some of these
will solve themselves, and some will be so obviously important that it's worth
a detailed deep dive to really understand what's happening. Either way, there
will be times that you have no idea what's going on in the codebase and you'll
just have to trust that you'll figure it out later.

One final comment I'd like to make is that, as silly as SLoC is as a measure
of the size of a software system, you're looking at a large software package.
(Bigger than Windows, Facebook, Linux, OSX, etc.) If you take each line of
code to have cost $5-10, then the system arguably cost on the order of
$0.5-1B to build in the first place.

Because of the size of the system, you shouldn't expect your analysis work to
be easy, fast, or cheap. Buy the tools you need to do the work. This means
technical and domain training, software, hardware, process development, new
staff,... basically whatever you need to make the work happen. You're at the
point where long term investments are highly likely to pay off, because your
scope is so large and your timeline is entirely in front of you.

I'd also highly recommend working this problem from two angles. You can
understand the existing system by looking at the code, but you also need to
clearly understand the system requirements from the 'business' point of view.
If you're doing bottom-up analysis, then some other group needs to be doing
top-down. Along those lines, you should also start thinking about
deployment strategies. I highly recommend avoiding a big bang deployment of
that large of a system, so there will be some period of time when you're
liable to be running both the 'old world' and the 'new world' systems at the
same time. Think about how you want to do that...

There is lots to think about here, because this is a complex problem.
Hopefully, I've given you at least a little bit to think about. Good luck.

------
andrewchambers
Rebuild 100M lines from scratch? Sounds impossible to me.

You could re-implement the applications that are causing problems, maybe one
at a time.

I don't understand what you want the syntax trees for, but it sounds like you
are going to need to store them in a database and run queries against them if
there really is information in there that you need.
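As a sketch of that store-facts-then-query idea: instead of full ASTs, a first
pass can extract coarse facts into an index. Here a crude regex and an
in-memory map stand in for what would really be a parser (Eclipse JDT,
JavaParser) feeding a database; the class and file names are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract declared class names from source text into a queryable index.
public class FactIndex {
    private static final Pattern CLASS_DECL =
        Pattern.compile("\\bclass\\s+(\\w+)");
    private final Map<String, List<String>> classToFiles = new HashMap<>();

    // Record every class declared in this file.
    public void index(String fileName, String source) {
        Matcher m = CLASS_DECL.matcher(source);
        while (m.find()) {
            classToFiles.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                        .add(fileName);
        }
    }

    // Query: in which files is this class declared? (multiple hits can
    // mean copy-pasted duplicates, a red flag worth investigating)
    public List<String> filesDeclaring(String className) {
        return classToFiles.getOrDefault(className, Collections.emptyList());
    }

    public static void main(String[] args) {
        FactIndex idx = new FactIndex();
        idx.index("A.java", "public class Foo { int x; }");
        idx.index("B.java", "class Foo {} class Bar {}");
        System.out.println(idx.filesDeclaring("Foo")); // declared in both files
    }
}
```

At 100M lines the same shape works, just with a real parser producing the
facts and a real database holding them.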

------
dmead
Why would you need the AST?

If given a large chunk of code to maintain, I'll usually run doxygen on it to
generate the XML-ish kind of chart that it makes. At least it gives me a
roadmap to start from, but it's not super great.

~~~
jebblue
I've used this on some projects; it works well. Building with the Dot
diagrams gives a good visualization, and in the final HTML documents,
choosing Classes | Class Hierarchy shows them.

------
mbrodersen
Read "Working Effectively with Legacy Code".

------
sgt101
Also: what is the history of how things got into this mess?

------
DonPellegrino
check out [https://github.com/facebook/pfff](https://github.com/facebook/pfff)

------
clavalle
I would build a general profile of the application and then drill in as needed
rather than try to grok the whole buffet at once.

If the idea is to rebuild the application I would start at the beginning: what
is the input and output? What does the user see? What are the various service
hooks? How are they called? When are they called? Why are they called?

Then I would look at how the overall code is organized. What modules are
there? Are there core utility modules that seem to be called by everything
else? What are those doing? What are the most used business function modules?

Then I would look at the build process. What external dependencies are there?
What are they used for? Are there modern alternatives? What about internal
dependencies? Does the build process look organized and sane or a chaotic mess
cobbled together over the years?

Do you have logs? What is the most utilized part of the application?

Then I would look at the database. What tables seem to be the most important
(if you could get usage stats from a running and used application that could
help, but otherwise you could look at which tables are keyed off of the most)?
What data is most critical? What modules interact with that data? What tables
are essential for supporting this data?
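When usage stats from a live system aren't available, a crude static proxy is
to count how often each known table name is mentioned across the source (SQL
strings, DAO classes, mapping files). The table names below are made up for
illustration; this is a rough heuristic, not a substitute for real usage data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count mentions of each known table name across source strings as a
// rough signal of which tables the code leans on most.
public class TableUsage {
    public static Map<String, Integer> countMentions(
            List<String> sources, List<String> tables) {
        Map<String, Integer> counts = new HashMap<>();
        for (String table : tables) {
            int n = 0;
            for (String src : sources) {
                int idx = 0;
                while ((idx = src.indexOf(table, idx)) >= 0) {
                    n++;
                    idx += table.length(); // continue past this occurrence
                }
            }
            counts.put(table, n);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countMentions(
            Arrays.asList("SELECT * FROM ACCOUNTS a", "UPDATE ACCOUNTS SET ..."),
            Arrays.asList("ACCOUNTS", "TRADES")));
    }
}
```

Sorting the resulting counts gives a first-pass ranking of which tables (and
the modules touching them) to look at first.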

Answering these questions will start to fill out a nice 30,000 ft view of the
application and how it is actually used.

You are going to get the most bang for your re-implementation buck by
identifying and replacing often used utilities (especially if they are custom
built or built before a good de-facto standard was formed for that particular
task) with modern, well known, alternatives. Then follow the execution path of
the most often used modules and the modules that work with the most critical
data and work down the list.

With a 100 million line application, you are looking at many years to
understand all of it and many years to re-implement. To get anything useful in
a reasonable amount of time you are going to have to boil it down as much as
possible, then break what's left down into independent functional areas and
tackle it an area at a time.

The code is important, but if it were me, I'd try to analyze how the users
and processes work before digging into the nitty-gritty of the code too much,
if at all possible. I'd build the smallest functional unit from what I deem to
be the most important and critical module(s), trying to cut as much cruft from
the application and database as possible. I'd get users and processes banging
on the new app as soon as possible. I'd keep the old application up and
running and available to analyze (not for the users but for the developers
and analysts) as the team works down the most often used parts. I would not
try to analyze the whole mess in one go beyond finding waypoints as described
above.

If possible, I'd also try to get users to understand that the old way is not
necessarily the right way. Much pain has been caused by trying to make new
systems work exactly like the old systems when the new systems don't face the
same constraints. It is just too tempting to say 'make it work like it did'.

------
mathusan
box

------
slermukka
You could try to reduce the lines of code by removing all the useless
whitespace.

