Ask HN: I have to analyze 100M lines of Java – where do I start?
87 points by user1241320 on Sept 2, 2014 | 122 comments
As part of a huge "let's see what's going on here and rebuild this from scratch" effort, they dumped the whole code repository on me and my team.

We've started parsing it and tried to work on extracting abstract syntax trees and all that.

Any idea would help us a great deal.

Thanks.




For rewrite from scratch projects, I always start by identifying the use cases covered by the application. You don't need the code for that. Just run the application and identify what it is that it does. Then, work backwards. For each use case, use the existing code as specification of the use case behavior.

At 100 million lines, I'd suspect this is either an extremely large project, where a rewrite from scratch is inadvisable, or that there is a code generator at work. If it is the latter, you want to analyze the code generating source, not the end result.

Anyhow, generically, for a first contact with a new code base, code coverage tools are a good start, as is a call-graph debug run of the project. It'll let you spot dead code as well as hot code (code called on every run of the application). It'll highlight the important and unimportant code parts, allowing you to read less code and get a grasp of the architecture.
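If firing up a real profiler feels like overkill for a first pass, even a crude stack sampler surfaces the hot code. A toy sketch of the idea (use JFR/VisualVM/YourKit for real work; class and method names here are mine):

```java
import java.util.HashMap;
import java.util.Map;

// Crude stack-sampling "profiler": periodically grab every thread's stack
// and count how often each method shows up. Real tools do this properly;
// this only illustrates how hot code surfaces from repeated samples.
public class HotCodeSampler {
    public static Map<String, Integer> sample(int rounds, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                for (StackTraceElement frame : stack) {
                    hits.merge(frame.getClassName() + "." + frame.getMethodName(),
                               1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample whatever this JVM is doing and print the 10 hottest frames.
        sample(50, 10).entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Methods that dominate the counts are your hot paths; methods that never appear across long runs are candidates for the dead-code pile.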


This, 100x this.

You and your team don't just have to build an understanding of the code (e.g. frameworks, patterns, DBs, etc.) but of the app itself, so you can truly understand its purpose.

Your rewrite won't (hopefully?) also be 100mm lines, so if you understand the high-level purpose of the app completely and then dive deep from there, you may (hopefully) find many places where the system can be simplified.

Are you really not exaggerating? 100mm lines of source? Yiiiikes...


Yup, start the profiler up, use the application, check the call graph.

At 100 MLOC the code base is probably a complete mess.

Also, simian is probably your friend as it will identify large chunks of duplicate code.

Source control can also be your friend: the older the source is, the more likely it is to contain useful code. The files with the most changes will usually be where the bugs are.
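Ranking files by change count is a one-liner against the history. A rough sketch that assumes you've fed it the output of `git log --name-only --pretty=format:` (i.e. only file names and blank separator lines):

```java
import java.util.HashMap;
import java.util.Map;

// Rank files by how often they appear in `git log --name-only --pretty=format:`
// output. High-churn files are a good place to start looking for bugs.
public class ChurnRanker {
    public static Map<String, Integer> countChanges(String gitLogNameOnly) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : gitLogNameOnly.split("\n")) {
            String file = line.trim();
            // Blank lines separate commits in this log format; skip them.
            if (!file.isEmpty()) {
                counts.merge(file, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String log = "src/Billing.java\nsrc/Util.java\n\nsrc/Billing.java\n";
        // Billing.java appears in two commits, Util.java in one.
        System.out.println(countChanges(log));
    }
}
```

Sort the resulting map by value and you have a first-cut "look here first" list.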


Why did this get downvoted? It's actionable and on topic.


I could be wrong, but I think the "M" means thousands in this case. I know it means thousands when dealing with inventory UOMs. 100 million lines just seems like a lot, even for a code generator, and not something that would be turned over to a single team.


I don't think there are that many companies with ~10^8 LOC code bases in the world, let alone for a single product[1]:

Facebook (webapp): ~3*10^7 LOC. Linux Kernel: ~2*10^7 LOC. Windows XP: ~5*10^7 LOC.

My impression is that a 10 million LOC codebase is relatively common... but past that, the size of the organization (company or volunteers) needed becomes a major sorting criterion.

[1] http://www.wired.com/2013/04/facebook-windows/


Yes, but these projects are not dumped on a team to rewrite them all that often, are they?

The way I read this, his team is limited in personnel and resources and kind of overwhelmed by the project. I'm sure Facebook/Linux (OS devs)/Microsoft are not exactly out of developers.


All the more reason to sharpen the good old resume... if there wasn't a typo in the OP's numbers, this smells a lot like an intentional career killer.


Errata. Numbers should read as: Facebook(webapp): ~3 * 10^7 LOC. Linux Kernel: ~2 * 10^7 LOC. Windows XP: ~5 * 10^7 LOC.


I realize Facebook is not trivial software, but seriously - this large? I've heard this before and I just don't get it.


Identify the users. Figure out who actually does anything with the system. They'll be invaluable as you try to determine what parts of the system are active.

Look for a test suite. You'll need one once you start making changes, to keep from breaking anything. If necessary, create one based on actual jobs run through the system. You want integration tests for big parts of the system, rather than unit tests.

Once you have a test suite, start with dead code analysis. Any codebase as big as this one will have a lot of accumulated cruft that is just getting in the way. Delete it.


10 years ago I worked on a large project to re-write a code base written in C. Our approach was to forget about the code and document everything it did from the user's perspective. Once everything was mapped out we decided on what we were going to keep, modify or remove, and then started building everything from scratch. You can always go back to the original code to see how a particular feature was implemented and perhaps re-use the same logic.


Having done projects like this before as well, this is the best method. Knowing WHAT needs to be done today and tomorrow is far more important than knowing HOW it was done before. The how is only important once you know what you need to do.


lol, so true...

Funny thing is, often the users tell you "I used it for XYZ" and it was...

...never written for XYZ

...never DID XYZ, all the results/numbers were trash, but no one noticed


I will totally second that approach. Rather than spending time reading code line by line, you should have your team spend time (as testers) figuring out use cases, one by one. The end result of those use cases should be a use-case/requirement document which you can feed into development cycles to start building from scratch.

You can break the steps in more agile method and have 3 sub-teams.

1. Figuring out use cases

2. Developers

3. Testers


This. Also from a business- (and CYA-) perspective, this gives you a punchlist of functionality to give to executive management, which can be used for anything from doing a proper scoping exercise to actually giving you a metric to show progress against.


Indeed. Code is useless without the people who wrote it being around. It's not the code that's valuable, it's the experience of the programmers who wrote it that's valuable.


With a codebase like that, it's better to look at it through the users' eyes, rather than trying to reverse engineer the business from the code. Things that look like bugs in the code may actually be features for the users, or may have been absorbed so long ago that they've fundamentally changed the nature of the business.

You don't need to understand the whole codebase. It will take years. Best to focus on what the users need and analyze small chunks. If it's truly 100M lines, there's not going to be any semblance of consistency in the code.

You can also slap New Relic on it and you may be amazed at what you learn, right away.

Don't waste too much time trying to understand all the code. Focus on a couple of issues first, make some hypotheses, and then see how well your understanding of the code fits the bigger picture. Refactor and repeat.


Any human line-by-line/application-by-application analysis is (for this particular discussion) out of scope.

The size of the thing and the way we thought we were going to work is quite different.

For instance, suppose we produce ASTs for all the routines/pieces of logic/you_name_it; we wanted to then find similar patterns or clusters that would give us hints to then work in a "Pareto-like" way.

As already stated, it's not ONE project; it's an old (but still running), poorly documented codebase produced over decades around this big firm we work for.


Don't try to figure out how the code does what it does yet. Figure out what systems exist inside it:

  1.  What kind of modules?
  2.  Which servers/hardware?
  3.  Which databases/datastores?
  4.  What systems talk to what?
  5.  What test systems exist or existed?
  6.  Which APIs/frameworks were used?
  7.  Who is currently working on them/maintaining it?
  8.  Is anyone left who used to?
  9.  Why is a rewrite on the table?
  10. Is there any way you can work on smaller pieces at a time?
  11. What are the pain points of the current users (will tell you what area to focus on)?
  12. Can you document what comes in and out?
In my experience with such large code bases, there is never just one way to do things; e.g. I once worked on a smaller system with 4 ways to talk to the same database. In one with 100 million lines I would expect even more roads to Rome ;)

If you do want to go down the static analysis path, start with existing tools before trying to build your own. If needed get external help for this.

100 million lines of code is not so bizarre. The project I work on is currently about 300,000 lines, and a project some 300 times larger is quite imaginable to me.


Code complexity does not increase linearly.

100 M lines is stupendous.


Not really; my project is 3 FTE for 8 years. Double it to 16 years and then multiply the number of developers by 100.

Consider a large enterprise having 300 developers in multiple teams; I am not at all surprised that they can manage to write 100 million lines or so. Also, I think this is really a system of systems, and in my experience probably has large parts developed by the lowest-bidding firms. Which, when software is developed over 10 years or more, means more than one way to do the same thing.

There's also having to deal with lots of ancient systems and working around weird bugs probably fixed years ago. You know, things like bugs in Java 1.2 on HP-UX and stuff like that, or errors in Oracle 7i, etc...

Plus functional duplication because team A did not know subteam C2 built the same thing...

Editing my comment instead of replying to the excellent comment by @jacquesm as hn does not allow me to reply to the reply.

Actually completing the transfer of a codebase like that to a new team is unlikely without much, much more of a handover. But some high- or middle-level manager, frustrated with the current system, asking a team to start rebuilding before it gets shut down a few months/years later is very possible. Another plausible option is a corporate takeover... but then I would expect a very experienced team to work on it, who do not need to ask HN for this kind of thing.

I personally have been in a situation where code moved between companies and no documentation or old developers were available. Not as large as this, only 1 or 2 million lines of code/XML. But I am no longer surprised by the stupid acts that large corporations can perform.

And this must be multiple systems, not just one. You can't build a single jar file of 50+ million lines of code that could have been loaded in a JVM around 1.3.1 on even high-end hardware of the time.


How realistic does it sound to you that a codebase that size would be transferred to a new team without any of the old team, outside of a hack of a bank or a reversal of some outsourcing decision or something like that?

Typically the value of such a codebase is determined by the quality of the team maintaining it and the degree to which it is documented.

Complexity of software constructs is not linear and enough books have been written about simply multiplying and dividing manyears and lines of code that I don't think we need to hash that out all over again. See 'the mythical man-month' and many similar books and articles.


Productivity in LOC/developer/day drops dramatically as the team and code size grow. Every change requires extensive testing, figuring out interdependencies, etc. If the codebase is bad, it can take multiple days to make a 10-line change.


When you go above the 1M-or-so LOC mark, it gets much easier to move to more data-driven designs.

Of course the OP could be including all test cases, data etc in his LOC; in which case you could easily reach the hundred-million LOC mark...


Source to UML: http://www.architexa.com/

Getting call paths: https://github.com/gousiosg/java-callgraph

Line coverage from instrumented jars: http://emma.sourceforge.net/

For this type of request, I'd push back and say, let's identify very small parts of this and begin rewriting those one at a time in an isolated project. Kind of an agile rewrite that will combine the legacy project with the slowly rewritten one. Use the tools to identify parts of the project that can be isolated. Build new interfaces or services to let the old project communicate with the new one. Get a history of the source repository to see where recent edits are and prioritize those to be rewritten first (presuming they want a rewrite to lower maintenance costs).
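One way to structure that agile rewrite is the classic strangler approach: put an interface in front of each isolated piece, keep the legacy implementation behind it, and flip pieces over one at a time. A toy sketch (all the names here are made up for illustration):

```java
// Strangler-style migration: the legacy code and the rewrite both
// implement one interface, and a switch decides which one serves.
interface TaxCalculator {
    long taxCents(long amountCents);
}

class LegacyTaxCalculator implements TaxCalculator {
    public long taxCents(long amountCents) { return amountCents * 20 / 100; }
}

class NewTaxCalculator implements TaxCalculator {
    public long taxCents(long amountCents) { return amountCents / 5; }
}

public class TaxRouter {
    public static TaxCalculator pick(boolean useRewrite) {
        // In a real system this flag comes from config, per feature,
        // so you can roll back instantly if the rewrite misbehaves.
        return useRewrite ? new NewTaxCalculator() : new LegacyTaxCalculator();
    }

    public static void main(String[] args) {
        // Cross-check the rewrite against legacy behaviour before switching.
        for (long cents : new long[]{0, 1000, 12345}) {
            System.out.println(cents + ": legacy=" + pick(false).taxCents(cents)
                               + " new=" + pick(true).taxCents(cents));
        }
    }
}
```

The cross-check loop in `main` is the important part: you only flip the flag once old and new agree on representative inputs.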


nice links! :)


You haven't really described your goals: What do you want to extract from your analysis? Metrics to tell you what's "wrong" with the existing code base? Some sort of model of the system's semantics?


We'd love to know what these lines do. For example what part of this codebase deals with the DB and what part does not. And then go deeper.

The final goal is to re-do what these lines do :(


100,000,000 lines of code is a huge amount; it would take you over 1000 days just to read at 1 line per second, and 1000 man-years to fully understand. If your final goal is to rewrite all of it, you are probably doomed to fail. You should first ask yourself (and your clients) some simple questions about why this insane project has been dumped on you and what the goal is:

What is the order of priority of services - which services/apps are critical, and which are not very important?

Which services actually need to be rewritten and which are working just fine?

Which services have a clearly defined interface and can be rewritten?

Which tests are in place to test the existing services, and which will you have to write?

I wouldn't touch the code till you have answered those questions, and once you have those answers, having some sort of overview of code coverage etc is going to seem less important, because it will become obvious which bits need to be touched first (the ones that are both mission critical and broken), and which bits you can easily isolate.

You will find it very very hard to show concrete progress if you try to change all of this code at once, in a global way (for example by tidying up every single reference to a db to use a new db interface, or things like that). If you do, you'll never reach your final goal, and end up spending months tidying up without actually delivering value to the business.


Yes, a "rewrite" is the wrong way to think about this project. A typical programmer might be able to produce 10K lines of production code in a year, which means it would take 10,000 engineer-years to rewrite it all.

The OP needs to think in terms of stewardship, not complete reconstruction, and make improvements by small steps as a gradual process.

I talked about a similar problem here: http://short-sharp.blogspot.ca/2012/08/fixing-broken-codebas...


> The final goal is to re-do what these lines do

That is quite possibly a huge mistake. (And a very costly one too!)


At a guess, understanding and ability to fix/change, that's the usual with projects like these, bringing the code back into a state of maintenance which could be described as 'under control'.


What type of application has 100 million LOC? Windows 7 has 40 million lines of code, so I'm wondering what type of application/software it is.


I have a quite strong feeling that certain complex automated systems within the financial services / insurance domain could reach those LOC levels. Including all the frontend side, internal backend logic, possible web services, internal tools, tests, tens to hundreds of interfaces to different kinds of external services, report generation, libraries, etc.


BINGO!


BINGO all you want but if you're at liberty to disclose such things (or to confirm them) you should have included it in your original write up.


Sure, 'cause if he/she had, you would already have found a solution! No need to be rude or arrogant, especially pointlessly so.


No, it's just that such information makes a huge difference to those that try to make sense of the question, especially a poorly defined and somewhat confusing question as posed here.


How exactly does knowing it's an insurance/finance application magically change the answer to "how to get my head around 100m lines of Java code"? Is there some "financial java code analysis" tool or technique that's so completely distinct from "engineering java code analysis" or "healthcare java code analysis" that it deserves a snotty, condescending retort?


An 'Ask HN' with this level of input is not a guessing game where the people that try to help have to play 'bingo' with the asker for information that may or may not be important for the answer. If there is context available and the asker is free to talk about that context then it should be supplied up-front.

There is nothing snotty or condescending about that, but there is something very weird about this whole thread; I wish I could put my finger on it.


Before rebuilding any piece of software from scratch, I would give a serious look at this amazing bunch of wisdom: http://www.joelonsoftware.com/articles/fog0000000069.html


Great article, it hits at the core point. There's an old saying, don't throw the baby out with the bathwater.


1) Focus on the functional use-cases and not code.

2) Identify integration points to other systems and ask why they are there

3) Realize that a "big-bang" rebuild never works and that it's better to break up the system into smaller pieces and replace them piece by piece.


Funny, I have 50,000 individual Java apps to analyze. I started with a copy/paste detector. PMD has a free one. Good luck!


Wow, a blast from my distant past... a link to the copy/paste detector:

http://pmd.sourceforge.net/pmd-5.1.3/cpd-usage.html
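For a feel of what CPD/Simian do under the hood: normalize the source, slide a window over it, and flag windows that occur more than once. A toy version (real tools tokenize properly instead of comparing trimmed lines):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy copy/paste detector: slide a window of N lines over the source,
// normalize whitespace, and report windows that appear more than once.
public class ToyCpd {
    public static Set<String> duplicateWindows(String source, int windowSize) {
        String[] lines = source.split("\n");
        Map<String, Integer> seen = new HashMap<>();
        for (int i = 0; i + windowSize <= lines.length; i++) {
            StringBuilder window = new StringBuilder();
            for (int j = i; j < i + windowSize; j++) {
                window.append(lines[j].trim()).append('\n');
            }
            seen.merge(window.toString(), 1, Integer::sum);
        }
        Set<String> dups = new HashSet<>();
        for (Map.Entry<String, Integer> e : seen.entrySet()) {
            if (e.getValue() > 1) dups.add(e.getKey());
        }
        return dups;
    }

    public static void main(String[] args) {
        // The 3-line block "a = 1; b = 2; c = 3" appears twice.
        String src = "a = 1\nb = 2\nc = 3\na = 1\nb = 2\nc = 3\n";
        System.out.println(duplicateWindows(src, 3));
    }
}
```

On a 100M-line codebase, the point of running the real tools first is that every duplicated block you fold away is code you never have to read again.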


As a first pass, try deleting as much code as possible :) If there are files or whole projects that aren't needed anymore, they're just slowing down your analysis. Also some dead-code analysis could be helpful, at least in broad strokes. You could instrument the code with a test coverage tool, then run the code instead of the tests to see what code gets reached.

Edit: You could also look for duplicated code, and quickly refactor that to just be in one place.


Do you and your team have experience of Java development?

Your question sounds like something that someone with either no real experience and/or no experience of an object-oriented language would ask.

100 million lines is a lot of code. Why do you need to "parse it to extract the AST"? That's crazy.

Do you have the original design documents and architectural documentation? If you do, read it.


Downvoted.

Even if the design docs did exist, it would take months to read them, with no guarantee that they correspond to reality.

Meanwhile, automated analysis of actual code can give you at least high-level overview of the codebase and maybe a hint where to start digging. Getting AST is a first step required for most automated tools to do their work.

EDIT: I acted too rashly and downvoted your post before I realized what we are really talking about. Sorry about this. I am still convinced that automatic, static analysis of the code is the way to go, but you obviously don't deserve a downvote for having a different opinion. I'll try to make it up to you by being more careful in the future :)


Hahaha, no problem.

Static Analysis would be my second step, but first I'd have a look at the architectural documentation. I can't imagine that a project of this size wouldn't at least have a Powerpoint explaining the structure and concepts of the code.

Then it's time to start using tools.


Here's a suggestion I haven't seen: Unless you have full management support, a skilled team, valid business reasons for this conversion, and expectations of succeeding, consider moving to another company/job.

You've been given the task of digital archeology/septic cleanup. Unless you like the tedium and stank, it's not going to bode well...


Understanding the "shape" of a codebase is something I've always been interested in and I started building a tool to help me understand and traverse code here:

http://sherlockcode.com/

However I don't think it would scale to 100M lines of code. I have run Linux through it and it was acceptable (both in run times and browse times). At 100M lines of code you need some way to see an overall "map" of the codebase and then drill in to the bits you are interested in. Just linking via symbols like SherlockCode does is too micro of a view.

There are a lot of interesting visualization tools out there both commercial and academic. I don't have any Java specific ones to recommend but a quick Google search for "java code visualization tools" shows a lot of promise.


This thing seems like a good start but I have a bug report for you. For me at least, it's a deal-breaker. When browsing a source file, (firefox 32.0 on macos) pageup/pagedown/spacebar and up/down arrows do not scroll the code, even when the code pane has focus. Pressing any of these give focus to the search box. I need to be able to use keyboard navigation at least for scrolling.


A very rough estimate: assuming you have 10 experienced developers on the team, each can read and comprehend 1,000 lines of code per hour. Given a 10-hour workday, the team can digest 100,000 lines of code per day. Just reviewing the code will take 1,000 days, about 2 years and 8 months. Not sure how much time you have or when you are expected to deliver the final product. On the other hand, if you can find out the use cases and even take a look at the current product, you may not have to review the source code at all and can just go ahead implementing the features.
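The arithmetic above, spelled out so you can plug in your own team size:

```java
// Back-of-the-envelope estimate from the comment above:
// total lines / (devs * lines-per-hour * hours-per-day) = days of reading.
public class ReadingEstimate {
    public static long daysToRead(long totalLines, int devs,
                                  int linesPerHour, int hoursPerDay) {
        long linesPerDay = (long) devs * linesPerHour * hoursPerDay;
        return totalLines / linesPerDay;
    }

    public static void main(String[] args) {
        long days = daysToRead(100_000_000L, 10, 1000, 10);
        // 100M / (10 * 1000 * 10) = 1000 days
        System.out.println(days + " days, roughly " + (days / 365.0) + " years");
    }
}
```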


I don't think you can just do a cold rewrite of that size without domain knowledge. I would first try to refactor the existing system just to reduce the code size. That big a system probably has horrific code, and you can easily shrink it quickly. Just finding duplicate code will have an impact, as will pulling out code that can be replaced by open-source libraries, like file-utility code.

Basically I would first try to reduce the size of the problem while trying to get domain expertise. I wouldn't consider a rewrite at this stage...


1. Configure Jenkins builds

2. Add PMD - code analysis

3. Add Sonar - code analysis (it has different rules than PMD)

4. Use Archeology 3d > https://github.com/pslusarz/archeology3d to visualize your code stats.

But before you start just pray to Omnissiah (http://warhammer40k.wikia.com/wiki/Machine_God).


Callgraph. Then document the larger chunks, working your way down.

It's like having a map versus having no map at all.

And 100M lines? Are you sure there is no code generator at work here?


It's code that's been developed and it's been running for decades, I'm afraid.


developed in java and running for decades? how long has the language been around?


19 years. But Sun was pushing it quite aggressively at enterprises, so it is not rare to see big Java installations that are now 15+ years old.

And those are not the most fun to work on.


I started using Java in early '95 (having begged a copy from someone at Sun) and I seemed to be one of the first people writing stuff outside of Sun.


I started in August 1995 when I first heard about it, which I thought was when they released it but it may have had a variety of trickle releases.


I was working on something slightly similar (embedding a VM into a browser) during '94/'95 and was a bit miffed when I first heard of Java....

However, I did think Java was rather good, and when I co-founded a start-up in mid '95 we positioned ourselves as a "Java company" - which was no bad thing in the long term, as we were in a reasonable position when Netscape, Novell and IBM later decided they wanted to support it. Indeed, our 2nd round of VC investment was led by Novell - quite unusual for a UK company at the time...


I was doing Java in a very mainstream UK company in 1998; but we were leading edge and it became "standard" in 2000ish. By 2003 there were 100's of developers using it where I worked.


How does that stop you from making a callgraph?

(On the off chance that you don't know what that is: http://en.wikipedia.org/wiki/Call_graph)


It doesn't. I was just answering about the code generator thing you mentioned.


100M lines of java code developed 'line-by-line' would make it one of the largest software projects that I've ever heard about.

Without telling you directly that you should disqualify yourself (after all I don't know you), if you don't have the knowledge about the tools employed to deal with medium sized projects (say up to 1M lines) how on earth will you deal with 100 times as much?


> 100M lines of java code developed 'line-by-line'

If it's not generated, then it may be "versioned" that way. I saw projects where the entire codebase was copied over to new directories tens of times, and no previous "version" was ever deleted.

But, after giving it some thought, 100 MLOC is really a LOT, like 50x more than anything I've directly worked with. It does sound kind of improbable, but hey, it could happen :)


So ... is (some of) this 100M lines the output of a code generator which has since itself been lost? If so, then it seems like a good idea to take a good hard second look for it: scour every nook and cranny for source code, object code, documentation or any surviving evidence (old developer emails?) of how it worked.


Java was apparently introduced in 1995, and thus has not quite been around for "decades". Close, but 19 years is not quite "decades". LOL

Presumably somebody didn't write 100M lines of code on Day 1.


First thing is to go find the key users (start from the CEO and work down) and find out what it does that is important to them. Map that.

Find anyone technical who is still around and can talk sensibly about it and find out what they think is important. Map that.

Use anything automated to map what it's up to (calls...) and find out where the core of it is.

You may know what is important by this time, you will be able to make some sort of start...


Large projects tend to accumulate lots of unused cruft. Coverage tools like EMMA/JaCoCo can help, but I have successfully used UCDetector [1]. It's not bulletproof, but it helped me a lot when analyzing code to remove the unused parts, even if there are sometimes false positives. [1] http://www.ucdetector.org/


The hardest part is often figuring out what the inner loop actually looks like. The best way to find it is to hook up a profiler, and look at a bunch of stack traces. That'll let you find the most common entry points and calling patterns, which will go a long way towards understanding it.


Read this, then try to convince them not to re-build from scratch: http://www.informit.com/articles/article.aspx?p=1235624&seqN...


A few ideas that have worked for me in the past:

Map the control flow. This code/app/whatever is doing something in production right now. What tells it to start? How does the control flow from the start point to the stuff that takes in data to the stuff that writes the output or does whatever this app does? Whatever the options for how it works are, where are they set, how do they make it into the core of the application to affect whatever it does?

Map the data flow. Input must be coming into this thing somewhere. Find where it reads it in, where it writes it out, and how it gets from one to the other, what data structures and methods it passes through on the way.


1) determine the use cases of the different applications involved (ie, what is each one used for, how does it fit into the company's workflow)

2) treat each app as a black box, understand the major data flows involved (which data sources is it interacting with? is it doing reads or writes? which tables)

3) treat each app as a black box, and try to understand the interactions between each app and any external components (other apps, web services, etc)

4) identify the overall architecture, determine the class hierarchy for each app, identify the major classes and functionality

By this stage you should be ready for a rewrite. At no point do you need to go into the code in any great depth.


There is no programmatic way of doing this. You need to have guys with domain expertise to help you through. Obviously, the effort to fix/migrate will always be proportional to the time it took to create such a mess.

This is 100MM lines; it will never be easy. I would take some time to create the tooling to do this. Say, create a tool to add some bytecode to generate a pretty call graph. Then I'd run the use cases or functionalities individually, and save the call graph somewhere. But in the end, you will always need domain expertise to guide you through the logic of it.


From the "decentralized web"/Agile spirit: keep the original app online, separate it into several functional domains, and replace them progressively, month after month. This way, each iteration is a small manageable chunk, functional experts can have a complete understanding of their own scope, and the result is a set of independent scalable webapps with a clearly defined scope.

... assuming you have webapps.


Start at the main function. See how things get setup and walk through the code from there. Keep notes on the structure and flow of things (if any). There isn't really an easy way to do this unless it had been documented properly before.

AstroGrep is a good Windows-based tool that allows you to search within files, so you could use it to find which files spit out a particular output to the screen.

Not sure what you mean by using ASTs though.


AST = abstract syntax tree.

With 100M lines your process will take many years.


Am I missing something or are you saying that making an AST like a compiler will help you understand a huge codebase better and faster?


"making an AST like a compiler" will give you a semantic - as contrasted with textual - understanding of the code. Especially with how much data Java encodes in its source, this makes it a very good base for running automated analysis and visualisation tools, perhaps written by yourself.

In general, having an AST is always better than having a plain text file, unless you want to read it. But then you can easily dump the AST back to text whenever you want.

Yeah, making an AST will help you analyse your codebase programmatically, which in turn will let you understand the codebase better and faster. This is some very basic programming knowledge, I think. Or is it not? Some commenters here don't know what an AST even is - is this the state of PL knowledge in the mainstream? Lisp and Smalltalk people would be very, very sad if it were so.
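As a concrete starting point: the JDK ships its own parser (the Compiler Tree API in the jdk.compiler module, JDK 9+), so you can get an AST with no external dependencies. A sketch that counts method declarations in a source string (a stand-in for whatever metric you'd actually compute):

```java
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Parse Java source into an AST using the JDK's built-in compiler API
// and walk the top-level type declarations counting methods.
public class AstDemo {
    static class StringSource extends SimpleJavaFileObject {
        final String code;
        StringSource(String name, String code) {
            super(URI.create("string:///" + name + ".java"), Kind.SOURCE);
            this.code = code;
        }
        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    public static int countMethods(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                List.of(new StringSource(className, source)));
        int methods = 0;
        try {
            for (CompilationUnitTree unit : task.parse()) {
                for (Tree type : unit.getTypeDecls()) {
                    if (type instanceof ClassTree) {
                        for (Tree member : ((ClassTree) type).getMembers()) {
                            if (member instanceof MethodTree) methods++;
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return methods;
    }

    public static void main(String[] args) {
        System.out.println(countMethods("Foo",
                "class Foo { void a() {} int b() { return 1; } }"));
    }
}
```

Swap the counting loop for a TreeScanner and you have the skeleton of a call-graph extractor, duplicate finder, or whatever custom analysis the codebase needs.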


I disagree an AST would help with a project this size, it's just unmanageable.

You'd be better off to start with just the build scripts and build tools.

ASTs are great for increasing understanding of much smaller projects but for something this size you'd likely end up with very little to show for your effort except the crashlogs of your tools.

You need to go 'coarse' before you can go 'fine' on something this magnitude.

This is not a 3 week project, just mapping the thing properly will take (man)years.


> something this magnitude

Yeah, I started commenting before the realization of how HUGE this thing would be hit me, sorry :)


np.

I think your sorry would be better directed at this guy:

https://news.ycombinator.com/item?id=8257519


Yeah, I only came to know about ASTs around a year back. I guess I am just not that competent, but I cannot imagine someone trying to make sense of a codebase using ASTs. Mostly it's read the code/modify the code/debug the code for me.

I have written some Scheme and I still can't say I need to screw around with ASTs. May be I will be enlightened some day?

A UML diagram for this level of hugeness would be a really useful thing according to me, much much better than an AST.


I probably overreacted, sorry for this - working on programming languages and programming tools is not what most of programmers do (obviously) and that's the only domain where you need to know about what you can use AST for.

As for this:

> A UML diagram for this level of hugeness would be a really useful thing

we actually agree 100% here. What I mean is that having an AST is meaningless by itself, but you need an AST if you want to generate a UML diagram from the code. Or generate a call graph. Or find similarities or duplication in the code. Or indeed perform any kind of automatic code transformation.

So extracting an AST is a first step to developing your own tools for working with a codebase. And with a codebase of this size you just have to write your own tools, adapted to the nature of this particular codebase. So while "trying to make sense using ASTs" really is a bit hard to imagine, trying to make sense of a codebase using all the tools an AST enables you to write is what I had in mind.


No, an AST will not help, but a call graph certainly would (it shows you how the various routines are organized in graph form: who calls whom).

An AST for 100M lines would be absolute madness, a call graph just might work and I'm somewhat hoping that it turns out to be either a ton of generated or duplicated code.
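A first pass at that call graph doesn't even need full parsing. A deliberately crude sketch along these lines (regexes, toy inputs, all names invented; a real tool would use a parser) can give the coarse who-calls-whom picture before you invest in heavier tooling:

```python
import re
from collections import defaultdict

# Toy stand-ins for two Java compilation units.
sources = {
    "OrderService": 'public void placeOrder() { validate(); Billing.charge(); } '
                    'private void validate() { }',
    "Billing":      'public void charge() { Logger.log("charged"); }',
}

def_pat  = re.compile(r'\bvoid\s+(\w+)\s*\(')       # method definitions (crude)
call_pat = re.compile(r'(\w+)\s*\(\s*[^)]*\)\s*;')  # call statements (crude)

calls_out = defaultdict(set)
for unit, text in sources.items():
    defined = set(def_pat.findall(text))
    for callee in call_pat.findall(text):
        if callee not in defined:        # keep only cross-unit edges
            calls_out[unit].add(callee)

print(dict(calls_out))  # {'OrderService': {'charge'}, 'Billing': {'log'}}
```

Dump edges like these into Graphviz and you get exactly the coarse map the parent is describing, without ever touching an AST.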

I also wonder if the OP isn't out of his depth based on the question(s) asked.


Divide and conquer.

Find a way to split it into something like 10 pieces of 10m LOC each in a way where you can understand and [re-]document the data and control flow between them.

Repeat with further subdivision, as much as you have people.

Then, if you really need to re-build this from scratch, do it per component - first, make automated integration tests for its functionality, and only then attempt to rebuild that part of the system.
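For that first subdivision pass, even a dumb line count per top-level package can reveal the natural seams. A sketch (the directory layout and file extension filter are assumptions, not anything from the thread):

```python
import os
from collections import Counter

def loc_by_package(root):
    """Count lines of .java source per top-level directory under root,
    as a cheap way to spot candidate components for the first split."""
    counts = Counter()
    for dirpath, _, files in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "(root)" if rel == "." else rel.split(os.sep)[0]
        for fname in files:
            if fname.endswith(".java"):
                with open(os.path.join(dirpath, fname), errors="ignore") as fh:
                    counts[top] += sum(1 for _ in fh)
    return counts
```

`loc_by_package("src").most_common(10)` then tells you where the mass is, which is usually a decent first guess at the 10m-LOC pieces.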


I don't know if you want to re-write it. Having to identify all the use cases for a system that large will be horrific. Especially since people have built onto the broken ways. And a large customer will require that it be exactly like it was before, so you have to recreate the old broken way and the new shiny way. Who decided it needed to be re-written?


Please elaborate. Why is there a need for extracting ASTs? Is it a single 100M-line Java source file? AFAIK Java has limitations on method size. If the code is already organized into files and methods, try to come up with some sort of UML representation. (I am assuming you are trying to understand the code base, not profiling or doing code analysis.)


Seconded.

> I have to analyse [...]

> We've started parsing it and tried to work on extracting abstract syntax trees and all that.

Why? How will this help? What are you really after?


Not a single line. The whole Java codebase.


That doesn't answer my question.

How does parsing ASTs help accomplish the goal? What is the actual goal here? What is meant by 'analyse'?


Several years ago I saw an impressive demo of an analysis and refactoring tool for large Java codebases called SonarJ (now Sonargraph) by hello2morrow. There are a few other tools in this category (jdepend, agilej, jarchitect). They can give you dependency graph visualizations to help untangle the spaghetti and grok the higher-level structure.


As for your first step - with a project that size (and thus of substantial age, I assume), there's surely tons and tons of dead code. I'd throw it away first. Slim the thing down. The very process of identifying unused code will already familiarize you roughly with the conceptual "shape" of the application.
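A very rough first cut at finding dead code can be mechanical: flag classes whose names never appear outside their own file (reflection, DI config files, and JSPs being the big caveats that make these candidates, not certainties). Toy sketch with invented file contents:

```python
import re

# Toy stand-ins for three source files.
files = {
    "Invoice.java":   "public class Invoice { }",
    "LegacyFax.java": "public class LegacyFax { }",
    "Main.java":      "Invoice i = new Invoice();",
}

# Map each declared class to the file declaring it.
classes = {}
for fname, text in files.items():
    m = re.search(r'class\s+(\w+)', text)
    if m:
        classes[m.group(1)] = fname

# A class is a dead-code *candidate* if no other file mentions its name.
dead = [c for c, f in classes.items()
        if not any(c in text for g, text in files.items() if g != f)]
print(dead)  # ['LegacyFax']
```

On a real tree you'd run this over the filesystem and then confirm candidates with coverage data before deleting anything.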


If you're a UML guy here's a great option http://www.altova.com/umodel/uml-reverse-engineering.html


What are you actually trying to accomplish? "Analyze" is a broad term.


With 100 million lines of code I would:

1. Find out what is still used; remove the rest.

2. Split the code into standalone, supportable units: applications, libraries, etc.

3. Rank the units in order of new requirements and what code will need to be changed.

4. Divide the code between teams.

5. Get the code to build, pass any tests, and match the last released versions.

6. Go back to management and get them to let you hire lots of people. A person per million lines would be very low...

7. Learn the code in order of need.


Start with a static analysis tool - it will find lots of small possible bugs, by fixing them you will get good coverage of the code and insight into the structure + it will be better at the end.


The code length has to be overstated by including libraries, generated files, or data files. Is any real code that long?

I bet the core Java code the team actually wrote is 2 orders of magnitude smaller.
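One cheap way to test the generated-code theory: generator output usually announces itself in a header comment, so a marker scan separates it from hand-written code quickly. The marker strings below are just the common conventions, not anything known about this particular codebase:

```python
# Common header markers found in generator output.
MARKERS = ("@Generated", "DO NOT EDIT", "Auto-generated", "generated by")

def looks_generated(text):
    head = text[:2000]  # markers live near the top of the file
    return any(m in head for m in MARKERS)

print(looks_generated("// Auto-generated by wsdl2java. DO NOT EDIT.\nclass Stub {}"))  # True
print(looks_generated("public class OrderService { }"))                               # False
```

Run that over the tree and tally LOC in each bucket; if 90M of the 100M lines land in the "generated" bucket, the real analysis target just shrank by an order of magnitude.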


Yeah, the LOC count in one of the biggest banks in the world is close to 10M (including: Trading [eq & ficc], Credit, Mortgages, Wholesale, Wealth Management, Quant, Risk, and also UIs, tests, reports, recons, interfaces). I don't think a task like that was given to a single person; usually with that base of code a big IT consulting company is hired.


99M loc for dealing with timezones. Before it was a library call.


And they knew what they were doing.


Look into Structure101 (http://structure101.com/) to use static analysis to see the structure of the application.


if you can run the program, i highly recommend searching HN for strace.

eg: "whats that program actually doing. start with strace" https://blogs.oracle.com/ksplice/entry/strace_the_sysadmin_s...

i've successfully reverse engineered messaging protocols, written drivers for a different language, and ported large projects just by trying to see what it does over the file system and network.

~B


Just curious: what application needs 100M lines of anything?


It's not just one application. It's many of them.


I agree with kubiiii.

Since a rewrite is in the cards, there is (hopefully) something wrong with the entire system.

* Identify where the applications interact with each other.

* Identify the most problematic applications.

* Rewrite those (starting with the smallest) while trying to keep the interfaces between applications constant.

And involve end users as much as possible.


Why not (sounds easy, eh?) go one application at a time? Are there interactions between apps? Maybe you can start with documenting all the interactions (~ the API). Then you'd go deeper into each app.



jvisualvm comes with the JDK; I'd start there with profiling:

http://visualvm.java.net/profiler.html

Edit: Adding that I'd set some judicious breakpoints in the hot-spot areas identified through profiling, along with some System.out.println's (or better, dump to a flat file; SQL can work wonders for analysis even on flat-file data).
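That flat-file-plus-SQL trick is worth spelling out. For example, if each instrumented method appends "method,millis" records to a log, SQLite turns hot-spot questions into one-liners (the method names and timings below are made up):

```python
import sqlite3

# Pretend these rows were parsed from an instrumentation log file.
rows = [("Invoice.total", 12), ("Invoice.total", 15), ("Report.render", 340)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calls (method TEXT, millis INTEGER)")
db.executemany("INSERT INTO calls VALUES (?, ?)", rows)

hot = db.execute("""SELECT method, COUNT(*) AS n, SUM(millis) AS total
                    FROM calls GROUP BY method
                    ORDER BY total DESC""").fetchall()
print(hot)  # [('Report.render', 1, 340), ('Invoice.total', 2, 27)]
```

Call counts tell you what the application spends its life doing, which is exactly the "hot code vs. dead code" split several comments above are after.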


You need to have a clear understanding of the point of the analysis before you analyze anything. What, specifically, does your team have to produce? How much time and how many people do you have to complete the work? If you're the leading edge of an effort to rewrite 100MLoc, my presumption is that your deliverable is mainly a 'gross anatomy' of the system... a basic description of the major structural components and how they interact with each other. If that's the case, I'd start by looking at the build scripts and the modules they build. Try to make a comprehensive list of major components. You'll get it wrong initially, but you'll need a starting point.

The next thing I'd do is take the top level list of modules and start assigning it to individual people within the team. Their responsibility is to produce some kind of top level description of how the individual modules work. A big part of this phase of the effort should be meetings or informal conversations as the per-module analysis progresses. As your team talks among itself, you should be able to find commonality between modules, communication links, etc. The key at this point is to keep it high level, and avoid getting too bogged down in the details. With this much code, there are plenty of details to get bogged down in. As a result, you'll probably have some mysteries about how the code actually works beneath various abstraction layers. Make and update a list of these 'mysteries' and keep it next to your team's list of modules. As you work through the list of modules, some of these will solve themselves, and some will be so obviously important that it's worth a detailed deep dive to really understand what's happening. Either way, there will be times that you have no idea what's going on in the codebase and you'll just have to trust that you'll figure it out later.

One final comment I'd like to make is that, as silly as SLoC is as a measure of the size of a software system, you're looking at a large software package. (Bigger than Windows, Facebook, Linux, OSX, etc.) If you take each line of code to have cost $5-10, then the system arguably cost $1B to build in the first place.

Because of the size of the system, you shouldn't expect your analysis work to be easy, fast, or cheap. Buy the tools you need to do the work. This means technical and domain training, software, hardware, process development, new staff... basically whatever you need to make the work happen. You're at the point where long-term investments are highly likely to pay off, because your scope is so large and your timeline is entirely in front of you.

I'd also highly recommend working this problem from two angles. You can understand the existing system by looking at the code, but you also need to clearly understand the system requirements from the 'business' point of view. If you're doing bottom-up analysis, then some other group needs to be doing top-down. Along those lines, you should also start thinking about deployment strategies. I highly recommend avoiding a big bang deployment of that large of a system, so there will be some period of time when you're liable to be running both the 'old world' and the 'new world' systems at the same time. Think about how you want to do that...

There is lots to think about here, because this is a complex problem. Hopefully, I've given you at least a little bit to think about. Good luck.


Rebuild 100M lines from scratch? Sounds impossible to me.

You can reimplement the applications that are causing problems, maybe one at a time.

I don't understand what you want to get syntax trees for, but it sounds like you are gonna need to store them in a database and do queries on it if there really is info that you need.


why would you need the AST?

if given a large chunk of code to maintain i'll usually run doxygen on it to generate the xml-ish kind of chart that it makes. at least it gives me a roadmap to start, but it's not super great.


I used this on some projects; it works well. Building with the Dot diagrams gives a good visualization, and in the final HTML documents, choosing Classes | Class Hierarchy shows them.


Read "Working Effectively with Legacy Code".


Also - what is the history of getting into this mess?



I would build a general profile of the application and then drill in as needed rather than try to grok the whole buffet at once.

If the idea is to rebuild the application I would start at the beginning: what is the input and output? What does the user see? What are the various service hooks? How are they called? When are they called? Why are they called?

Then I would look at how the overall code is organized. What modules are there? Are there core utility modules that seem to be called by everything else? What are those doing? What are the most used business function modules?

Then I would look at the build process. What external dependencies are there? What are they used for? Are there modern alternatives? What about internal dependencies? Does the build process look organized and sane or a chaotic mess cobbled together over the years?

Do you have logs? What is the most utilized part of the application?

Then I would look at the database. What tables seem to be the most important (if you could get usage stats from a running and used application that could help, but otherwise you could look at which tables are keyed off of the most)? What data is most critical? What modules interact with that data? What tables are essential for supporting this data?

Answering these questions will start to fill out a nice 30,000 ft view of the application and how it is actually used.

You are going to get the most bang for your re-implementation buck by identifying and replacing often used utilities (especially if they are custom built or built before a good de-facto standard was formed for that particular task) with modern, well known, alternatives. Then follow the execution path of the most often used modules and the modules that work with the most critical data and work down the list.

With a 100 million line application, you are looking at many years to understand all of it and many years to re-implement. To get anything useful in a reasonable amount of time you are going to have to boil it down as much as possible, then break what's left down into independent functional areas and tackle it an area at a time.

The code is important, but if it were me, I'd try to analyze how the users and processes work before digging into the nitty gritty of the code too much, if at all possible.

I'd build the smallest functional unit from what I deem to be the most important and critical module(s), trying to cut as much cruft from the application and database as possible. I'd get users and processes to start banging on the new app as soon as possible. I'd keep the old application up and running and available to analyze (not for the users, but for the developers and analysts) as the team works down the most often used parts.

I would not try to analyze the whole mess in one go beyond finding waypoints as described above. If possible I'd also try to get users to understand that the old way is not necessarily the right way. Much pain has been caused trying to make new systems work exactly like the old systems when the new systems don't face the same constraints. It is just too tempting to say 'make it work like it did'.


box


You could try to reduce the lines of code by removing all the useless whitespace.



