Undebt: How We Refactored 3M Lines of Code (yelp.com)
245 points by vivagn on Aug 24, 2016 | hide | past | web | favorite | 138 comments

For Java, IntelliJ has a built-in version of this called "structural search and replace" [0]. This is incredibly useful when a library changes an API or you need to refactor a lot of similar code.

This feels relatively safe in Java because tooling can statically know a lot about your code (and can know for sure that a particular call site is the method or class you're targeting). I'd be terrified to do it in Python without a very thorough test suite.

[0] https://www.jetbrains.com/help/idea/2016.2/structural-search...

IMO this is the main reason the Python standard library is so wildly inconsistent. They don't really have the tools to migrate stuff painlessly, and the 'batteries included' approach with weak versioning means you can't change stuff without breaking everyone who upgrades to a new Python version.

It's not about tooling. For example, one issue you can't solve with better tooling: how would you update all the textbooks that have working code examples? Do they all need to monitor for changes and release minor versions as well? That'd be terrible for the community.

The impact of backwards-incompatibility is often worse than you expect. If anyone should know that by now, it's the Python core. Or perhaps the folks that faithfully waited for Perl 6.

Even if they changed the standard library from version to version... the result would be that people would stop using the standard library, migration tools or no tools. Nobody really wants to deal with being pinned to a specific minor version and no older/newer - especially libraries.

The correct solution is to version the standard library separately from the language and allow for versioned dependencies, then a new language/VM update doesn't imply a new library that breaks everything, and vice versa.

But Python grew up in an environment where this sort of thing was not practical, and batteries-included is actually a good approach for what Python tries to do: scripting. It just doesn't scale well into maintainability.

Once you have the standard library split apart, you might as well split it up, though, and then you don't really have a standard library any more. You could go the Haskell Platform route... which isn't a wonderful idea.

> Haskell Platform route... which isn't a wonderful idea

It's a bad idea, on short timeframes. But it's a hell of an adaptability bonus that ensures the language will keep improving.

Just like Haskell's extension system, or its loose dependency on Prelude, or its multiparadigm emulation.

I keep hoping something better than Haskell appears and people move on - but it probably won't happen any time soon; as soon as something better gets traction, Haskell will simply devour it and keep growing.

I think if Python were developed from scratch now, you would have a very small classic standard library, and stuff like HTTP servers/clients and JSON parsers would be separate libraries handled by the package manager. But since Python is a scripting language, it would make sense to ship some packages by default - so not a standard library, but say "core packages". This would let you version the core packages like any other package, while still letting you run scripts without internet access or the need to pull random dependencies for a single script run.

That is pretty much the Haskell Platform. Haskell people have issues with it because it contains a whole bunch of really useful packages, but essentially pins them at old versions globally. Upgrading them, then, is a global thing, and if someone depends on an older version... you're stuck.

The solution to this, of course, is sandboxing - never install libraries globally, only on a project-specific basis, and then their dependencies can override global ones. But it's fiddly to get the UX right - in Python, managing that involves two separate tools. And you'd need to create a project directory to get a repl with some library in it, unless you had extensions to the repl to install things temporarily - and then you'd likely have two separate UXes for installing things, whether you're doing it for a REPL or a project.

Then people will complain that you need to specify dependencies just to use a leftPad function.

Even with a consistent standard library, you can't really be confident that you've properly refactored all usages of a class or function unless you have 100% test coverage -- and even if you did, it would be difficult to automate the refactoring the way you could with a static language.

I don't think your theory is right (I simply don't think it has been given enough polish), but even so, the other side of the coin is that you can relatively easily hack in temporary migration paths. For instance, a function can examine the parameters it is given and convert them to the latest API, spewing out a warning.
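In Python that kind of temporary migration shim is only a few lines. A minimal sketch (the function and the old/new parameter names are made up for illustration):

```python
import warnings

def fetch_user(user_id=None, *, uid=None):
    """New API takes `user_id`; the old `uid` keyword still works, with a warning."""
    if uid is not None:
        warnings.warn(
            "the `uid` keyword is deprecated; use `user_id` instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this shim
        )
        user_id = uid
    return {"id": user_id}
```

Callers on the old API keep working and get told where to migrate, which buys time to run an Undebt-style sweep over the call sites later.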

Backwards compatibility is mostly an attitude problem.

Yeah - if all you have is straight calls to functions it's doable, but as soon as you do stuff like assigning a function to a variable, or method calls, you need sophisticated code analysis. And that's not even touching the untraceable stuff like string/dynamic access, monkeypatching, etc. etc.

Refactoring in Python is bad even if you restrict yourself to "sane" code (no metaprogramming or abusing dynamic stuff, so the tools can follow what you're doing). If you need something that will work for everything out there, it's just impossible to do.
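A contrived illustration of the problem: all three call sites below invoke the same method, but only the first one is visible to a simple textual or static search for `.old_method(`:

```python
class Service:
    def old_method(self):
        return "called"

svc = Service()

# 1. Direct call -- easy for tooling to find and rewrite.
a = svc.old_method()

# 2. The bound method assigned to a variable first.
fn = svc.old_method
b = fn()

# 3. Dynamic, string-based access -- invisible to grep and most tools.
name = "old_" + "method"
c = getattr(svc, name)()

assert a == b == c == "called"
```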

"...time that could be better spent working on new features and shipping new code"

Can we please stop putting forth this idea that features >>> reliable product? The amount of dev time that a company will save from removing technical debt will likely be more than the extra sales the company will get from a new feature. I look forward to the day when the executive team comes to the developers and asks why they are working on features instead of cutting down technical debt.

> I look forward to the day when the executive team comes to the developers and asks why they are working on features instead of cutting down technical debt.

I had a Filipino friend who told me of his early experience in a Japanese development outfit - the team got praised by management (maybe not the CEO) for performance improvements and code reduction. At the time I thought those priorities would never have flown in the US (though I was of like mind).

So this mindset does/did exist.

Do you know what sort of project/product the team produced? I find that to be a key indicator of what sort of work management expects.

If you are building something that will be sold, particularly in a competitive market, management will almost always believe that new features trump reducing technical debt. To a product manager, even four weeks of technical debt reduction sounds like "no new features for a month."

>To a product manager, even four weeks of technical debt reduction sounds like "no new features for a month."

Yeah, but if you can't implement new features in under a month because the code is such a mess, then maybe spending a month cleaning it up will allow you to do more features in a shorter amount of time, thus increasing revenue faster than just hacking more features into the existing crappy codebase. Obviously it depends on the codebase, but you have to look at both long and short term goals to make good decisions.

Sometimes it's a struggle to get blessing for four days of technical debt reduction...

There is an old Joel on Software blog post I was reading recently that talked about a methodology at Microsoft they called "zero defects", where fixing known bugs _always_ had priority over working on new features.

<google google google>

Here - point 5 in this post: http://www.joelonsoftware.com/articles/fog0000000043.html

Interesting. I recently read an article[1] where John Romero talks about the culture of early id Software and he mentions something similar:

> As soon as you see a bug, you fix it. Do not continue on. If you don’t fix your bugs your new code will be built on a buggy codebase and ensure an unstable foundation.

Looking back at the codebases I've worked with, this advice seems extremely wise.

[1] http://www.gamasutra.com/view/news/279357/Programming_princi...

I guess he forgot about this when making Daikatana? :-) Good advice nonetheless. It's a bit far-fetched, but it reminds me of the "if you can do it in less than five minutes, do it now" thing.

Obligatory mention of 'Masters of Doom'. I think Romero may have been a better programmer than first-time large project manager, but he wouldn't be the only one :)

That's exactly what "Lean" is all about. When you spot an issue "pull the Andon cord" and stop the line, deal with the issue then and there. You then potentially conduct Kaizen to address the process to ensure the issue doesn't happen again.

At core all of Agile Software Development is aiming to mirror Lean in the manufacturing world. At the core of which is the concept of minimizing work-in-progress. Because WIP may result in waste if you subsequently find a problem that renders the WIP null and void. Hence the drive for continuous flow to minimize WIP and a "stop the line" mentality to spot and address issues when they arise.

I'm often amazed at how many people "doing" agile through sprints don't understand that the sprint is just an artificial mechanism to minimize WIP.

Not dealing with technical debt when it occurs is just increasing WIP.

That's by far the best way to work.

It's hard to explain to someone who hasn't experienced it how much easier it is to develop in a bug free code base.

I challenge that; I don't think a "bug free code base" actually exists. Joshua Bloch has a great article about this which I think may be of interest to other readers: https://research.googleblog.com/2006/06/extra-extra-read-all....

To paraphrase: We programmers need all the help we can get, and we should never assume otherwise. Careful design is great. Testing is great. Formal methods are great. Code reviews are great. Static analysis is great. But none of these things alone are sufficient to eliminate bugs: They will always be with us. A bug can exist for half a century despite our best efforts to exterminate it. We must program carefully, defensively, and remain ever vigilant.

I don't think Spolsky was saying their product is bug free, just that they won't tolerate having known bugs.

Not even quite "won't tolerate having known bugs" - it's fundamentally acknowledging that you will at times find bugs in your code base - but that when you do discover bugs you'll prioritise fixing them above adding new features or hitting pre-set project deadlines.

It's a great philosophy or methodology, but it can be very hard to execute on 100%. In the real world there are deadlines and resource requirements and impatient clients and product launch plans with set dates and trade shows starting next week and the Thanksgiving/Xmas launch opportunity and and and.

There exist, however, code bases where any refactoring or feature addition exposes a latent bug that has to be fixed before proceeding. Additionally, fixing that bug often exposes another one.

We didn't have a bug free code base, but it was rare that someone found an actual bug. As I remember it, it happened maybe once or twice a month in an 8 person team.

At other places I've worked, I wouldn't raise an eyebrow if I found 3 bugs in a day, just trying to get other stuff done.

That bug exists because it's written in a language where + is permitted to silently do surprising things, for reasons that made sense as a performance optimization for general-purpose computers in the '70s and embedded systems in the '90s (the original target of Java) but do not make sense for general-purpose computers today.
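The bug being referenced is the overflowing midpoint in binary search. Python's integers don't overflow, but Java's 32-bit `int` arithmetic can be simulated to show it (the `to_int32` helper here is purely illustrative):

```python
def to_int32(x):
    """Wrap a Python int to signed 32-bit, approximating Java's int arithmetic."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

low, high = 1 << 30, 2**31 - 1

# The textbook formula: (low + high) / 2. In 32-bit arithmetic the sum
# overflows and the "midpoint" comes out negative.
buggy_mid = to_int32(low + high) // 2
assert buggy_mid < 0

# The fix Bloch describes: low + (high - low) / 2 never overflows
# when low <= high.
safe_mid = low + (high - low) // 2
assert low <= safe_mid <= high
```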

Better languages are possible. Provably correct software is possible. We really can eliminate bugs.

> Provably correct software is possible.

Even then, the software is provably correct according to some specification, which can still have bugs. No technology can fully prevent logic bugs.

You're technically correct, but in practice, proving that software conforms to a formal specification is, empirically, a good way of drastically reducing defects.

Often when people mention provably correct software, what they mean is software that doesn't crash and always produces output for correct input. It usually doesn't mean that the output is correct...

> Provably correct software is possible.

Ah cool, so you've solved the halting problem then?

You don't need to solve the halting problem to write provably correct software (that is, software that conforms to a formal specification). The halting problem (and Rice's theorem) come into play when you are trying to deduce things about arbitrary programs written in turing complete languages.

It's like the difference between being able to express proofs in maths, versus having a method of deciding the validity of arbitrary propositions. We can certainly write programs that we can prove things about, but given an arbitrary program, we can't always tell whether it will do the right thing.

It was "solved" in 1940 (proven in 1967) by the simply typed lambda calculus.

They didn't say it had to be Turing complete.

It's not about Turing completeness. Deducing arbitrary properties of programs written in primitive recursive languages is also undecidable. This is why type inference in dependently typed languages is undecidable.

Joel also wrote another article where he says that rewriting code from scratch (removing technical debt or switching to a new popular framework) is generally a bad idea: http://www.joelonsoftware.com/articles/fog0000000069.html

Improving and perfecting code is a task that can take an infinite amount of time.

I like this policy, but it doesn't directly help to reduce the technical debt.

I want to believe we can someday have the same policy for bad abstractions.

Bad abstractions are what really makes code difficult to maintain in my opinion.

And what allows many of those bugs to exist in the first place. The best interface is one which cannot be used incorrectly.

"The amount of dev time that a company will save from removing technical debt will likely be more than the extra sales the company will get from a new feature."

Not always true (though it often is!), but particularly not always true in the timespan that the company needs to get sales in...

The problem is that this is very hard to measure. You can't really measure how much time you save on new features after you fixed the old spaghetti-code.

technical debt will not always manifest itself as instability/bad product.

It might just affect development time of things interacting with the bad, old code - or make bugs harder to fix. Changes to old, nasty code will probably be hacky, under the mentality "we're going to refactor this anyway, no point in writing clean code /now/".

Most old code (that I've worked with) will function fine - it has for a while, after all.

There's a pervasive idea that software is just a laundry list of features. This idea isn't just wrong, it's dangerous. You can pick lots of examples of software where the competitors are in theory feature comparable but the experience of using one is vastly superior to the other, with a consequent huge impact on market success.

How best would you go about objectively convincing the executive team that it's better in the long run to invest a bit in stability now?

Very new to talking with execs and I don't know where to begin.

Agreed. If you want to spend more time working on features, don't accrue technical debt - or pay it off early.

This may or may not be true. Just depends.

Yes! 3M is way way way too much, and too few viscerally feel this. Yelp is far from the only offender, of course.

yes, please! can we then also get rid of the notion that there exists a linear design space called "features", and a linear error space called "bugs"? architecture, protocols, security, etc. can all be considered dimensions.

How do web applications explode out to 3 Million lines of code? Yelp, to me, looks like a typical CRUD app and I would have been surprised if it were more than 100,000 lines of code. The software I develop is pretty large and typically doesn't surpass 40,000 sloc written in-house (i.e. excluding third party libs).

Does anyone here maintain such large codebases? Are they truly that big or are people just counting third party code and generated stuff?

I think you underestimate the complexity of things that happen under the hood or simply out of sight: back-office apps, integration systems for data import/export, backups, alerting. I'm not so familiar with Yelp, but I'd bet they're interfacing with a crap load of additional stuff, including probably in-house tools to follow leads, facilitate reviews, handle their ad programs, etc...

I've commonly seen codebases explode the 1M LoC mark. Not necessarily for a single component, but if you have multiple systems interfacing with each other it's really quite common for business applications. That's obviously excluding libs.

Frameworks are a bit responsible for this in my opinion (note: not saying frameworks are bad, but it's a side-effect), as you'd often either have some additional configuration, boilerplate code, or generated code.

Also, if your application is long-lived (think decades), it's even less surprising: new engineers come and go, and it gets harder and harder to touch the things that were maintained by the previous key-holders... So you add a new stone here and there, polish a turd here and there, but you don't really untangle the mess that sits right at the middle, because it's just way too dangerous (or so you think). And it goes on and on. It's aggravated by the fact that, as the project grows larger, the barrier to entry for new developers gets higher: it takes longer and longer to understand the system/platform in depth, and many never even try to get there.

Processes and security concerns affect this too: you're often only allowed to fix something which has a ticket assigned to it, originating from a business user. Touching anything else is a big no-no, as it would mean QA has to re-test all the things that could be impacted. Of course we can argue whether that's actually the case and how proper testing would mitigate this, but you see my point...

I maintain a line-of-business webapp that could be mistaken for a typical CRUD app, but actually has a lot of business logic enforced in code.

That's stuff you don't have to hardcode -- you can pull it out into a 'rules engine' (at the expense of an additional runtime dependency) or push it further down into, say, database stored procedures (I can hear some of you shudder). But for us, the rules rarely change, or change at a pace that's acceptable to keep up with.

Also, there are a lot of views and specialized interfaces tailored for particular workflows. In several cases, the data underneath of them is the same, but there are different UIs -- that adds LOCs considerably.

I wouldn't draw the distinction at 'hardcode' or some alternative. Rather the question is which language expresses the business rules well enough (or even best).

If your application language also does a decent job of expressing the business logic, that's great.

I don't get the rules engine thing. It's still code. I'd rather have the rules hardcoded with proper source control and do frequent releases.

Maybe when you have rules that change back and forth every hour.

The difference is, I think, that you might want to be able to ad-hoc configure the rules without changing code. Going beyond that, you might want to allow people who are not developers to add/remove/change rules around business logic. Building all of this up can be quite an effort depending on how complicated the rules get.
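A toy sketch of that idea - rules as data rather than branches, editable (or loadable from a database) without touching the evaluation code. The rule format here is invented for illustration:

```python
# Each rule is pure data: a field, an operator, a threshold, and an outcome.
RULES = [
    {"field": "order_total", "op": "gte", "value": 500, "action": "require_review"},
    {"field": "country",     "op": "eq",  "value": "US", "action": "standard_tax"},
]

# The engine only knows a small vocabulary of operators.
OPS = {"gte": lambda a, b: a >= b, "eq": lambda a, b: a == b}

def evaluate(record, rules=RULES):
    """Return the actions triggered by a record. Changing business behavior
    means editing RULES, not this function."""
    return [r["action"] for r in rules
            if OPS[r["op"]](record.get(r["field"]), r["value"])]
```

The real effort, as the parent says, is everything around this core: validation, versioning of rules, and a UI safe enough for non-developers.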

I don't know exactly what Yelp does, but assuming that they have listings for restaurants that aren't their customers, one cause could be that they take in data from lots of sources. Ideally, that is all in the same format, with an enforced data scheme.

However, if you require that, you'll notice that very little data manages to make it through your entry port. Few suppliers will want to bend their system for you to comply with your rules, and those who do will find inconsistencies in your specs or just ignore them (different encoding, not sending state abbreviations separately from city names, forget to encode HTML entities, encode them twice or thrice (my personal record find here is four levels of HTML encoding in data stored in an excel file), etc)

So, you end up with quick and dirty hacks that convert the suppliers format (or, rather, what you _think_ it is, as the supplier won't be able to tell what their format is, either) to yours, fix some egregious errors, etc.

Hundreds of data suppliers; code with tables mapping their codes to yours, or mapping known errors in their input to corrections; "smart" programmers who notice that they can replace that 10,000-entry correction table with 20 lines of code and a 1,000-entry table - except that, a few months on, those 20 lines have ballooned into 1,000 that nobody understands, so the code can't be used for handling the data from a new supplier anymore. And the line count starts increasing rapidly.

On top of that, once you operate world-wide, you'll learn the joy of differences in addresses. Does a country have states? Zip codes? If so, where does one specify them in an address? If you want to localize that in your app (in yelp's case, people may want to show an address to a taxi driver. For that, it would help if the address followed local conventions) line count skyrockets.

> On top of that, once you operate world-wide, you'll learn the joy of differences in addresses. Does a country have states? Zip codes? If so, where does one specify them in an address? If you want to localize that in your app (in yelp's case, people may want to show an address to a taxi driver. For that, it would help if the address followed local conventions) line count skyrockets.

This is not the first time this issue has been faced. Why would they reinvent the wheel instead of just using libraries and conventions?

This is something that creeps up on you slowly, and it is a very hairy problem. By the time you realize the scope of your problem, finding a library that can be shoe-horned into your code and doesn't have huge regressions on your data is hard.

Also, assuming a robust generic solution exists, it will almost certainly be slower than a customized one. It is very tempting to think "but our data is relatively clean, we don't need the full feature set of that library that makes imports take an hour longer to run".

Yelp would likely wind up forking such a library, if they used one, when it shows bugs.

There's a layer on top of a number of i18n tools at Yelp, yep.

When you don't spend the time to refactor and groom the codebase, actively seek to reduce the complexity, it grows. And it grows exponentially. Once you reach the point where you're afraid to change something because it might break something unrelated, you just add new code all the time.

I've worked at a couple of places that went this way (joined after they already had millions of lines).

It was code duplication. One example I saw was a many thousand line css file included 4 times in a row. They were all different generations and some were customer specific so it was an impossible task to try to consolidate.

That was a common theme. Enterprise stuff where one customer needs slightly different behaviour so code is duplicated (maybe thousands of lines).

The first time you take that road, you've doomed yourself. With a good / small team they can manage it- but imagine when you've got 20 devs sitting there of varying experience. The precedent is set and the solution for anything mildly complex will forever be, make a copy.

Oh man, I would love to work at a place like that. If only there was a "Sandi Metz" specialization in software development. The worst part might simply be managing the egos of people whose work is refactored.

You can find these places everywhere. Look for jobs where the software department isn't the core business and you'll find small groups of devs bogged down trying to implement another customization among code that looks like

    case 42: //todo: Make sure database agrees
before each call to what are essentially copies of otherwise identical thousand-line functions.

Because the "risk" of actually changing code for any client where things already "just work" is a testing risk, and testing is nearly always the bottleneck at these places in my experience, if they've even managed to hire a dedicated tester yet.

Of course if you apply the Joel Test you're likely avoiding these places. But they can be rewarding places to work.

Rewarding in which way? Getting to really improve the codebase? Compensation?

I work at exactly one of those places right now, and "rewarding" is absolutely the last word I would use.

Truly that big. Imagine even 25 full time developers working on a website for several years.

If you look at Yelp and see a simple crud app you're not looking hard enough.

Is the 3M across all applications they developed? That may be much more than just the customer facing part. Plus like others have said, it's also years of development; I've heard a lot of stories that a lot of the high-profile companies and applications (paypal, ebay, linkedin, skype, etc) have huge and rather bad codebases.

I work on a product in payments, specifically a client-side checkout application. It's essentially responsible for collecting data from the user (billing and shipping addresses, affordability information, etc.), presenting the various payment methods and displaying up sell stuff after completing the purchase. Just the client side application is about 250000 lines of code, not including third party stuff.

It feels like applications grow with the team that develops it. At our peak, we were around 10 people working on it, and it's just so hard to do meaningful refactoring when large parts of the codebase are being modified by 10 people in parallel. Not only the technical challenge, but also getting everyone to agree on and embrace patterns and structure.

They probably only add code and never refactor or delete it (and by the way I don't think that's always a bad idea, if the code is organized). So a large team can write a lot in several years.

This code might also include HTML templates and CSS or JS files that can easily grow large.

This could be including backend tools not seen by simple user...

definitely also tests, deployment and generic tools

I'm looking at one right now with 2m+ lines. It's more surprising that they don't have closer to 10m to 20m lines by now (3m feels smallish for their scale).

Also it's easy for business apps that have been around for 10+ years and continually worked on during that time to get to 1m-2m+ lines of code... and jump from 2m to 3m is easy too.

Take a look at yelp.github.io - that might help illustrate how much more is going on behind the scenes.

We have an in-house configuration system that generates configurations for about ~14k heterogeneously configured applications (1-3 config files of 3-20 lines per app), and that system has about 451k lines in it.

Of course not all lines are code. About 140k lines are flat file database of configurations, and the other ~300k lines is code/templates.

300k lines of code to manage 14k applications. lol.


Actually did a more thorough analysis cleaning out comments and what not. The database is more manageable ~86k lines and template generation code is around ~47k lines. There's a crap ton of whitespace and comments.

Yelp has 4154 employees and probably 50% of them write code. Imagine you have 100,000 customers who all want 2 custom features. Large organisations naturally develop large codebases. Yelp probably has 50M LOC or more. Probably most startups could do with 10x smaller codebases: http://www.paulgraham.com/avg.html

I have been watching an e-commerce open source product grow over the last four years. The product is getting really big. Part is the amount of features but the other part is that there is a lot of code duplication. Developers are focused on adding features. No one seems to care about removing code. I suppose they do a lot of copy and paste. It just grows and grows.

I am responsible for a 1.1m LOC application. It's a whole bunch of separate DLLs.

Previous gig I worked at had a decade old codebase for just the web app that was a bit over 2.5 million according to one of the chief engineers. That wasn't even the complete product just a small piece.

Lol I had exactly the same question :)

It would be nice if some research institution would pay for the rehabilitation of some huge, bloated, ancient, but relatively unimportant app. Ideally by independent teams in parallel.

Just to get some real data on what works, rather than anecdotes from veterans.

Your comment opened my eyes. Considering the importance of software development in today's world (and tomorrow's), the fact that these best practices and code-management techniques are found mostly in blogs, instead of in scientific papers with proper experiments, tells a lot.

There is a strong field exploring this subject, but it's called 'software verification'. Here's a paper on automated defect detection [0], which is effectively a study on how to write good tests.

[0] https://homes.cs.washington.edu/~mernst/pubs/mutation-effect...

I don't follow. What does it tell?

Have you heard of the term/phrase "bro science" with respect to weight lifting? I'd draw an analogy between them.

Clarification: what I mean is there is a lot of ad-hoc stuff in these blogs that may be narrowly true or applicable to a given project or subset of a domain, but generally there is either significant evidence against some of these blogs' contents or the contents themselves have absolutely no formal, rigorous support.

I see your point, thanks. But the fact that not a lot of research is done in this field doesn't tell you much about the validity of the non-scientific blogosphere findings, right?

Nice analogy btw.

I switched to Scrum, check out my GAINZ!

I enjoy refactoring, I'd help with something like this.

I think the pain in refactoring is not the refactoring itself; it's the weeks spent creating a mental map of a foreign and architecturally broken codebase (which sorta works, though). Even to write the non-regression tests before considering the refactor, much time must be spent becoming an expert in something that will die soon.

Where do I sign up?

Seriously, this sounds like exactly my kinda gig.

Unfortunately I envision something like that just being bikeshedded to all hell by crowds like this which would undoubtedly trivialize the results.

I have my doubts that a refactor that consists merely of more complex search and replace actions is a true refactor. Also, this reads more like an ad for their Python tool than about any lessons learned during this refactoring.

It seems to me this is only going to handle the most trivial kind of technical debt. This kind of tool can't manage the way you organized your codebase, for instance. There's more to refactoring than find-and-replace.

Well even if it just gave you some reasonable abstraction over regex, at this kind of scale it seems useful. Plus I bet they had excellent test coverage, which really makes refactoring much easier.
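Even a thin abstraction over regex helps at that scale: patterns and replacements as data, applied uniformly and counted, so you can dry-run a sweep and eyeball the result before committing. A bare-bones sketch (this is not Undebt's actual API; the `get_conn` migration is a hypothetical example):

```python
import re

# Each entry: (compiled pattern, replacement). Hypothetical migration from
# an old `get_conn(name)` helper to a newer `connections[name]` mapping.
PATTERNS = [
    (re.compile(r"\bget_conn\(\s*([^)]+?)\s*\)"), r"connections[\1]"),
]

def apply_patterns(source):
    """Apply every (pattern, replacement) pair to a source string.
    Returns the rewritten text and how many substitutions were made,
    so a driver script can report per-file counts for review."""
    total = 0
    for pattern, repl in PATTERNS:
        source, n = pattern.subn(repl, source)
        total += n
    return source, total
```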

I would also be interested in the thought process behind deciding what functionality to refactor. Did you review the code and identify areas before unleashing your tool on it?

With 3M lines of code gone, it must be terrifying to feel that it may have broken something. How did you ensure that it is still working as before?

Edit: grammatical corrections.

I've had to deal with many sizable legacy codebases over the years and even answered a similar question once on StackExchange. Apparently people liked the answer: http://programmers.stackexchange.com/questions/155488/ive-in...

The question was within a somewhat specific context (team of scientists, visual programming environment, etc...) but I use the same approach for all projects. I've never reached 3M LoC for a single project or system component, though; more like 1.5M. Anyways, that process works for me. Maybe it will for others too.

One approach that's probably quite easy to implement is to generate test suite code coverage by file, and only do the substitutions in files with sufficient coverage (where your definition of "sufficient" might vary, of course).
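Sketching the filtering step described above: given per-file coverage numbers (the dict shape and file names here are assumed; in practice you'd derive them from a coverage.py report), keep only the files whose coverage clears your threshold:

```python
# Select only files safe enough to run automated substitutions on,
# based on per-file line-coverage ratios (shape of the data assumed).
def files_safe_to_edit(coverage_by_file, threshold=0.8):
    """Return the files whose line coverage meets the threshold."""
    return sorted(
        path for path, covered in coverage_by_file.items()
        if covered >= threshold
    )

# Hypothetical report: ratio of covered lines per file.
report = {
    "yelp/web/views.py": 0.92,
    "yelp/batch/cron.py": 0.15,
    "yelp/lib/util.py": 0.81,
}
print(files_safe_to_edit(report))
# ['yelp/lib/util.py', 'yelp/web/views.py']
```

The poorly covered files then become a worklist: either write tests first, or apply those substitutions by hand with extra review.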

This smells like a need that arises from using a dynamically typed language, because the example given is just replacing usage of one method with another. In a statically typed language, e.g. C#, you just mark the old method with an [Obsolete] attribute and go fix all the warnings. (Granted, a tool that replaces all these usages is also useful, but to me there are much more complex forms of technical debt than obsolete methods.)
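For comparison, Python's closest analogue to [Obsolete] is a runtime DeprecationWarning; the key difference is that callers only find out when the code actually runs, not at compile time (minimal sketch; `old_method`/`new_method` are hypothetical names):

```python
import warnings

# Runtime deprecation: the dynamic-language stand-in for a
# compile-time [Obsolete] warning. Callers see it only on execution.
def old_method(x):
    warnings.warn(
        "old_method is deprecated; use new_method",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller's line
    )
    return new_method(x)

def new_method(x):
    return x * 2
```

This is precisely why a static find-and-replace tool is more valuable in Python: you can't rely on the compiler to enumerate every obsolete call site for you.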

In VS2015, showing all references to methods, props, classes, etc. is just one click away. Unfortunately it's not available in the community version.


Use MonoDevelop/XamarinStudio.

Exactly, this is just basic functionality of Visual Studio now, this whole operation could be done in a few seconds without issue. Fix any build errors, run tests, job done.

IMO a much better approach is JSCodeShift, which works based on the AST: https://github.com/facebook/jscodeshift
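The same AST-based idea can be sketched in Python with the stdlib `ast` module (the `RenameCall` transformer and the `fetch_user`/`get_user` names here are illustrative, not jscodeshift's API; `ast.unparse` needs Python 3.9+):

```python
import ast

class RenameCall(ast.NodeTransformer):
    """Rewrite calls to `old_name(...)` as `new_name(...)`."""
    def __init__(self, old_name, new_name):
        self.old_name, self.new_name = old_name, new_name

    def visit_Call(self, node):
        self.generic_visit(node)  # recurse into nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            node.func = ast.Name(id=self.new_name, ctx=ast.Load())
        return node

source = "result = fetch_user(uid, cache=True)"
tree = RenameCall("fetch_user", "get_user").visit(ast.parse(source))
print(ast.unparse(ast.fix_missing_locations(tree)))
# result = get_user(uid, cache=True)
```

Because the transform operates on nodes rather than text, it can't accidentally rewrite a string literal or a similarly named attribute, which is the usual failure mode of regex-based codemods. The trade-off is that round-tripping through the AST loses comments and original formatting.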

“Let a 1,000 flowers bloom. Then rip 999 of them out by the roots.”

This is a paraphrase of Chairman Mao:

"The policy of letting a hundred flowers bloom and a hundred schools of thought contend is designed to promote the flourishing of the arts and the progress of science"

And the "ripping out by the roots" part brings labor camps to mind:

"After this brief period of liberalization, Mao abruptly changed course. The crackdown continued through 1957 as an Anti-Rightist Campaign against those who were critical of the regime and its ideology. Those targeted were publicly criticized and condemned to prison labor camps."


Related: http://coccinelle.lip6.fr/

Coccinelle is used extensively by Linux kernel developers for a whole tonne of things like this.

* Newline at EOF

* Double quoted docstring

* Remove unused imports.

These things are largely cosmetic.

Indeed, that's what I was thinking. A big refactoring is often structural and changes the program in a bigger way. In Java you can actually easily move classes around by drag-and-drop and the code will refactor accordingly. In Python this is impossible.

A big refactoring often means reducing the total amount of code you write to make it more readable. In Java it's usually going to be about 2x what the equivalent in python would be.

Most Python IDEs (PyCharm and PyDev, for example) do that for you, even if it's not "drag-n-drop" like in Java...

This is a big part of using java for me. I use the refactor tools all the time.

Having an intern write a good blog post for the engineering blog is a great recruiting move.

Or a horrible statement about the company's level of interest in tech debt: a problem handed off to the interns.

Refactoring... "puts a massive drain on developer time; time that could be better spent working on new features and shipping new code"

This is the wrong way to think. Refactoring will save time by making code faster, more reliable and making it easier to build those new features in the first place. Looks like their biggest issue is bad technical management, not deprecated code.

Was the 3M LOC refactoring for yelp.com? The article doesn't say. How could a review site possibly have 3M LOC?

The front-facing system is the smallest part

The admin and back-office system is deeper. Beyond the reviews you also have: events, mailing list management, profile management, messages, i18n, search, etc

Example: http://engineeringblog.yelp.com/2015/10/how-we-use-deep-lear...

Edit: still, 3M lines is massive. However, I think there's something that contributes significantly: HTML and CSS.

Hmm, well even with all those, the amount of code just blows my mind!

Once I was talking to someone from SAP. He told me their NetWeaver framework is 1 billion LOC and 40k SQL tables. That does not include any applications yet.

The biggest reason for the bloat: people are not allowed to change code, only add it. Now try to add a button to some GUI without changing a single line of the existing code. At that moment I understood why enterprises produce that overengineered AdapterFactorySingletonDecoratorBridge stuff.

Reminds me of Ubisoft's claim that the Assassin's Creed crouching mechanic took 6M LOC.

If you pay coders to code, they will give you code, and if you track performance based on their output, the code remains to justify their cost. And it's just hard to delete good work. So everything grows in businesses that hire tons of (good) coders. Pay them to refactor it and they'll do that too :)

The amount of code you have in a company is usually a function of the number of engineers you have.

Sad. Hopefully that function is sub-linear.

Google has a similar tool called Refaster


The patterns that Undebt removes could be non-existent in the code base, but the conceptual, algorithmic design decisions and architecture could be all bad and that is what the actual technical debt is. Bad code patterns are just a tiny slice of the problem in most cases.

If any of you guys are interested in plotting total lines-of-code changes, do check out https://github.com/kaihendry/graphsloc

It's like codesearch [1].

[1] https://github.com/google/codesearch

The fact that you need a find and replace tool to make sensible changes to your codebase indicates that it's hugely out of control already.

Interesting, I had always (wrongly) remembered Yelp as a Ruby on Rails shop. Does anyone know the stack behind Yelp?

It is mostly Python, both the monolith and the SOA. A few services are Java, e.g. for talking to Lucene. The GitHub site actually gives a reasonably complete view of the stack via the various tools we have for integrating with its various parts: https://yelp.github.io/

Is there a Haskell version for this?

Haskell code (as with LISP code) does not need refactoring. It always comes out as a distilled, pure and crystalline elixir of wisdom. :)

Why pyparsing, not ply or regex?

Never used ply, but why not regex? Regex cannot easily match recursive structures, and can't be easily composed. Pyparsing gives you more structure and ready building blocks. It actually can use regex as one of the term objects.

Ply tries to implement only lex/yacc, so it's far from a nice interface, really...

Or specifically: with pyparsing it's trivial to match `specific_func(any_expression, {capture, these, values})`, but it's practically impossible with regex and painful to post-process with ply.
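To illustrate why regex struggles here, the same match can be done with the stdlib `ast` module instead of pyparsing: the first argument may be an arbitrarily nested expression, which a tree walk handles trivially (a sketch; `find_captures` and the sample code are made up for the example):

```python
import ast

# Find calls of the form specific_func(<any expression>, {a, b, c})
# and capture the set's elements, regardless of how deeply nested
# the first argument is -- the part regex can't do.
def find_captures(source):
    captures = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "specific_func"
                and len(node.args) == 2
                and isinstance(node.args[1], ast.Set)):
            captures.append([ast.unparse(e) for e in node.args[1].elts])
    return captures

code = "x = specific_func(f(g(1), h()), {capture, these, values})"
print(find_captures(code))
# [['capture', 'these', 'values']]
```

A regex has no way to know where `f(g(1), h())` ends, because balanced parentheses are not a regular language; both pyparsing grammars and AST walks sidestep that entirely.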

Jamie Zawinski <jwz@netscape.com> wrote on Tue, 12 Aug 1997 13:16:22 -0700:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

"I've noticed that when I get quoted in .sig files, it's never any of the actual clever things I say all the time. Usually it's something dumb." --Jamie Zawinski

Regexps are error-prone; better to use ASTs. See Facebook's jscodeshift.
