
Undebt: How We Refactored 3M Lines of Code - vivagn
http://engineeringblog.yelp.com/2016/08/undebt-how-we-refactored-3-million-lines-of-code.html
======
necubi
For Java, IntelliJ has a built-in version of this called "structural search
and replace" [0]. This is incredibly useful when a library changes an API or
you need to refactor a lot of similar code.

This feels relatively safe in Java because tooling can statically know a lot
about your code (and can know for sure that a particular call site is the
method or class you're targeting). I'd be terrified to do it in Python
without a very thorough test suite.
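
A rough illustration of the worry, using only Python's standard `ast` module.
The classes and the `handler` function are made up for this sketch; the point
is that a syntactic search finds the call but cannot say which class it
belongs to.

```python
import ast

# Hypothetical code under refactoring: two unrelated classes share a method name.
SOURCE = '''
class Cache:
    def get(self, key): ...

class HttpClient:
    def get(self, url): ...

def handler(thing):
    return thing.get("x")   # Cache.get or HttpClient.get? Unknown statically.
'''

def find_method_calls(source, method_name):
    """Return line numbers of every `<anything>.method_name(...)` call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == method_name):
            hits.append(node.lineno)
    return sorted(hits)

calls = find_method_calls(SOURCE, "get")
```

The search reports the call inside `handler`, but nothing in the syntax tree
says whether `thing` is a `Cache` or an `HttpClient`, so a blanket rewrite of
`.get(` would touch both.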

[0] [https://www.jetbrains.com/help/idea/2016.2/structural-
search...](https://www.jetbrains.com/help/idea/2016.2/structural-search-and-
replace.html)

~~~
rubber_duck
IMO this is the main reason the Python standard library is so wildly
inconsistent. They don't really have the tools to migrate stuff painlessly,
and the 'batteries included' approach with weak versioning means you can't
change stuff without breaking everyone who upgrades a Python version.

~~~
vertex-four
Even if they changed the standard library from version to version... the
result would be that people would stop using the standard library, migration
tools or no tools. Nobody really wants to deal with being pinned to a specific
minor version and no older/newer - especially libraries.

~~~
rubber_duck
The correct solution is to version the standard library separately from the
language and allow for versioned dependencies, then a new language/VM update
doesn't imply a new library that breaks everything, and vice versa.

But Python grew up in an environment where this sort of thing was not
practical, and batteries included is actually a good approach for what
Python tries to do - scripting. It just doesn't scale well into
maintainability.

~~~
vertex-four
Once you have the standard library split apart, you might as well split it up,
though, and then you don't really have a standard library any more. You could
go the Haskell Platform route... which isn't a wonderful idea.

~~~
rubber_duck
I think if Python was developed from scratch now you would have a very small
classic standard library and then stuff like HTTP server/client and JSON
parsers would be separate libraries handled by package manager, but since
python is a scripting language it would make sense to ship with some packages
by default (so not a standard library, but say core packages) - this would let
you version the core packages like any other package, but it would still let
you run scripts without internet access/need to pull random dependencies for a
script run.

~~~
vertex-four
That is pretty much the Haskell Platform. Haskell people have issues with it
because it contains a whole bunch of really useful packages, but essentially
pins them at old versions globally. Upgrading them, then, is a global thing,
and if someone depends on an older version... you're stuck.

The solution to this, of course, is sandboxing - never install libraries
globally, only on a project-specific basis, and then their dependencies can
override global ones. But it's fiddly to get the UX right - in Python,
managing that involves two separate tools. And you'd need to create a project
directory to get a repl with some library in it, unless you had extensions to
the repl to install things temporarily - and then you'd likely have two
separate UXes for installing things, whether you're doing it for a REPL or a
project.

------
wRastel27
"...time that could be better spent working on new features and shipping new
code"

Can we please stop putting forth this idea that features >>> reliable product?
The amount of dev time that a company will save from removing technical debt
will likely be more than the extra sales the company will get from a new
feature. I look forward to the day where the executive team comes to the
developers and ask why they are working on features instead of cutting down
technical debt.

~~~
bigiain
There is an old Joel On Software blog post I was reading recently that talked
about a methodology at Microsoft they called "Zero defects", where fixing
known bugs _always_ had priority over working on new features.

<google google google>

Here - point 5 in this post:
[http://www.joelonsoftware.com/articles/fog0000000043.html](http://www.joelonsoftware.com/articles/fog0000000043.html)

~~~
epidemian
Interesting. I recently read an article[1] where John Romero talks about the
culture of early id Software and he mentions something similar:

> As soon as you see a bug, you fix it. Do not continue on. If you don’t fix
> your bugs your new code will be built on a buggy codebase and ensure an
> unstable foundation.

Looking back at the codebases I've worked with, this advice seems extremely
wise.

[1]
[http://www.gamasutra.com/view/news/279357/Programming_princi...](http://www.gamasutra.com/view/news/279357/Programming_principles_from_the_early_days_of_id_Software.php)

~~~
sotojuan
I guess he forgot about this when making Daikatana? :-) Good advice
nonetheless. It's a bit far-fetched, but reminds me of the "if you can do it
in less than five minutes, do it now" thing.

~~~
pjmorris
Obligatory mention of 'Masters of Doom'. I think Romero may have been a better
programmer than first-time large project manager, but he wouldn't be the only
one :)

------
Kendrick2
How do web applications explode out to 3 Million lines of code? Yelp, to me,
looks like a typical CRUD app and I would have been surprised if it were more
than 100,000 lines of code. The software I develop is pretty large and
typically doesn't surpass 40,000 sloc written in-house (i.e. excluding third
party libs).

Does anyone here maintain such large codebases? Are they truly that big or are
people just counting third party code and generated stuff?

~~~
niftich
I maintain a line-of-business webapp that could be mistaken for a typical CRUD
app, but actually has a lot of business logic enforced in code.

That's stuff you don't _have_ to hardcode -- you can pull it out into a 'rules
engine' (at the expense of an additional runtime dependency) or push it
further down into, say, database stored procedures (I can hear some of you
shudder). But for us, the rules rarely change, or change at a pace that's
acceptable to keep up with.

Also, there are a lot of views and specialized interfaces tailored for
particular workflows. In several cases, the data underneath of them is the
same, but there are different UIs -- that adds LOCs considerably.

~~~
huherto
I don't get the rules engine thing. It is still code. I'd rather have the
rules hardcoded with proper source control and do frequent releases.

Maybe it makes sense when you have rules that change every hour back and forth.

~~~
manacit
The difference is, I think, that you might want to be able to ad-hoc configure
the rules without changing code. Going beyond that, you might want to allow
people who are not developers to add/remove/change rules around business
logic. Building all of this up can be quite an effort depending on how
complicated the rules get.
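
A minimal sketch of what "rules as data" might look like, just to make the
distinction concrete. The field names, actions and rule format are invented
for illustration; a real rules engine would need validation, ordering,
auditing and a UI on top.

```python
import operator

# Supported comparison operators, looked up by the symbol stored in a rule.
OPS = {"<": operator.lt, ">=": operator.ge, "==": operator.eq}

# Rules as plain data (hypothetical). These could just as well be loaded
# from a JSON file or a database table that non-developers edit.
RULES = [
    {"field": "order_total", "op": ">=", "value": 100, "action": "free_shipping"},
    {"field": "country", "op": "==", "value": "NO", "action": "add_vat"},
]

def matching_actions(record, rules):
    """Return the actions of every rule whose condition holds for `record`."""
    return [
        rule["action"]
        for rule in rules
        if rule["field"] in record
        and OPS[rule["op"]](record[rule["field"]], rule["value"])
    ]

actions = matching_actions({"order_total": 120, "country": "SE"}, RULES)
```

Changing a threshold here means editing `RULES`, not `matching_actions`, which
is the part that can be handed to people who are not developers.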

------
vintermann
It would be nice if some research institution would pay for the rehabilitation
of some huge, bloated, ancient, but relatively unimportant app. Ideally by
independent teams in parallel.

Just to get some real data on what works, rather than anecdotes from veterans.

~~~
bsaul
Your comment opened my eyes. Considering the importance of software
development in today's world (and tomorrow's), the fact that those best
practices or code management techniques are found mostly in blogs, instead of
scientific papers with proper experiments, tells a lot.

~~~
skrebbel
I don't follow. What does it tell?

~~~
sidlls
Have you heard of the term/phrase "bro science" with respect to weight
lifting? I'd draw an analogy between them.

Clarification: what I mean is there is a lot of ad-hoc stuff in these blogs
that may be narrowly true or applicable to a given project or subset of a
domain, but generally there is either significant evidence against some of
these blogs' contents or the contents themselves have absolutely no formal,
rigorous support.

~~~
skrebbel
I see your point, thanks. But the fact that not a lot of research _is done_ in
this field doesn't tell you much about the validity of the non-scientific
blogosphere findings, right?

Nice analogy btw.

------
pmarreck
I have my doubts that a refactor that consists merely of more complex search
and replace actions is a true refactor. Also, this reads more like an ad for
their Python tool than about any lessons learned during this refactoring.

------
wry_discontent
It seems to me this is only going to handle the most trivial kind of technical
debt. This kind of tool can't manage the way you organized your codebase, for
instance. There's more to refactoring than find-and-replace.

~~~
sfrailsdev
Well even if it just gave you some reasonable abstraction over regex, at this
kind of scale it seems useful. Plus I bet they had excellent test coverage,
which really makes refactoring much easier.

------
yitchelle
I would also be interested in the thought process in deciding what functionality
to refactor. Did you review the code and identify areas before unleashing your
tool on it?

With 3M lines of code gone, it must be terrifying to feel that it may have
broken something. How did you ensure that it is still working as before?

Edit: Grammatic corrections.

~~~
haylem
I had to deal with many sizable codebases of legacy code over the years and
even answered a similar question once on StackExchange. Apparently people
liked the answer: [http://programmers.stackexchange.com/questions/155488/ive-
in...](http://programmers.stackexchange.com/questions/155488/ive-
inherited-200k-lines-of-spaghetti-code-what-now)

The question was within a somewhat specific context (team of scientists,
visual programming environment, etc...) but I use the same approach for all
projects. Never reached 3M LoC for a single project or system component though.
More like 1.5M. Anyways, that process works for me. Maybe it does for others.

------
knocte
This smells like a need that arises as a consequence of using a
dynamically-typed language, because the example given seems to be just getting
rid of the usage of a certain method and replacing it with a new one. In a
statically typed language, e.g. C#, you just mark the old method with an
[Obsolete] attribute and go fix all the warnings. (Granted, a tool that
replaces all these usages is also useful, but to me, there are much more
complex forms of technical debt than just obsolete methods.)
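
For comparison, a sketch of the closest everyday Python equivalent, built on
the standard `warnings` module. The `obsolete` decorator and the two
functions are hypothetical names for this example. Note the key difference
from C#: the warning only fires when the call actually runs, not at compile
time.

```python
import functools
import warnings

def obsolete(replacement):
    """Mark a function as deprecated, pointing callers at `replacement`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is obsolete; use {replacement} instead",
                category=DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator

def new_total(items):
    return sum(items)

@obsolete("new_total")
def old_total(items):
    return new_total(items)

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_total([1, 2, 3])
```

Code paths that the test suite never exercises never emit the warning, which
is presumably part of why a static rewrite tool is attractive for Python.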

~~~
smilekzs
In VS2015, showing all references to methods, props, classes, etc. is just one
click away. Unfortunately it's not available in the community version.

[https://msdn.microsoft.com/en-
us/library/dn269218.aspx?f=255...](https://msdn.microsoft.com/en-
us/library/dn269218.aspx?f=255&MSPPError=-2147217396)

~~~
knocte
Use MonoDevelop/XamarinStudio.

------
NikhilVerma
IMO a much better approach is JSCodeShift, which works based on the AST:
[https://github.com/facebook/jscodeshift](https://github.com/facebook/jscodeshift)
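
To show what "works based on the AST" buys you, here is a small sketch of the
same idea in Python with the standard `ast` module. This is not jscodeshift
and not Undebt; the function names `fetch` and `load` are made up.

```python
import ast

class RenameCall(ast.NodeTransformer):
    """Rewrite calls to the bare function `old` into calls to `new`."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            renamed = ast.Name(id=self.new, ctx=ast.Load())
            node.func = ast.copy_location(renamed, node.func)
        return node

def rename_calls(source, old, new):
    tree = RenameCall(old, new).visit(ast.parse(source))
    # ast.unparse needs Python 3.9+ and drops comments and original formatting.
    return ast.unparse(tree)

before = 'print(fetch(fetch(1)))\nlabel = "fetch(this stays)"\n'
after = rename_calls(before, "fetch", "load")
```

Both real calls are renamed while the string literal that merely contains
`fetch(` is left alone, which a plain text replace would not guarantee. The
loss of comments and formatting in `ast.unparse` is a real drawback of this
particular sketch.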

------
karlheinz
“Let a 1,000 flowers bloom. Then rip 999 of them out by the roots.”

This is a paraphrasing of Chairman Mao:

"The policy of letting a hundred flowers bloom and a hundred schools of
thought contend is designed to promote the flourishing of the arts and the
progress of science"

And the ripping out by roots part brings labor camps to mind:

"After this brief period of liberalization, Mao abruptly changed course. The
crackdown continued through 1957 as an Anti-Rightist Campaign against those
who were critical of the regime and its ideology. Those targeted were publicly
criticized and condemned to prison labor camps."

[https://en.wikipedia.org/wiki/Hundred_Flowers_Campaign](https://en.wikipedia.org/wiki/Hundred_Flowers_Campaign)

------
ajdlinux
Related: [http://coccinelle.lip6.fr/](http://coccinelle.lip6.fr/)

Coccinelle is used extensively by Linux kernel developers for a whole tonne of
things like this.

------
crdoconnor
* Newline at EOF

* Double quoted docstring

* Remove unused imports.

These things are largely cosmetic.

~~~
TeeWEE
Indeed, that's what I was thinking. A big refactoring is often structural, and
changes the program in a bigger way. In Java you can actually easily move
classes around by drag-n-drop and the code will refactor. In Python this is
impossible.

~~~
crdoconnor
A big refactoring often means reducing the total amount of code you write to
make it more readable. In Java it's usually going to be about 2x what the
equivalent in python would be.

------
impish19
Having an intern write a good blog post for the engineering blog is a great
recruiting move.

~~~
yarper
Or a horrible statement about the level of company interest in tech debt - a
problem given to the interns.

------
manigandham
Refactoring... "puts a massive drain on developer time; time that could be
better spent working on new features and shipping new code"

This is the wrong way to think. Refactoring will save time by making code
faster, more reliable and making it easier to build those new features in the
first place. Looks like their biggest issue is bad technical management, not
deprecated code.

------
spapas82
Was the 3M LOC refactoring for yelp.com? The article doesn't say. How could
a review site possibly have 3M LOC?

~~~
raverbashing
The front-facing system is the smallest part

The admin and back-office system is deeper. Beyond the reviews you also have:
events, mailing list management, profile management, messages, i18n, search,
etc

Example: [http://engineeringblog.yelp.com/2015/10/how-we-use-deep-
lear...](http://engineeringblog.yelp.com/2015/10/how-we-use-deep-learning-to-
classify-business-photos-at-yelp.html)

Edit: still, 3Mi lines is massive. However, I think there's something that
contributes significantly: HTML and CSS

~~~
spapas82
Hmm, well even with all those, the amount of code just blows my mind!

~~~
qznc
Once I was talking to someone from SAP. He told me their Netweaver
framework is 1 billion LOC and 40k SQL tables. That does not include any
application, yet.

The biggest reason for the bloat: People are not allowed to change code. You
only add code. Now try to add a button to some GUI without changing a single
line of the existing code. At that moment I understood why enterprises
produce that overengineered AdapterFactorySingletonDecoratorBridge stuff.

------
mahyarm
The amount of code you have in a company is usually a function of the number
of engineers you have.

~~~
jobigoud
Sad. Hopefully that function is sub-linear.

------
pmarreck
A funny and relevant tweet:

[https://twitter.com/php_ceo/status/765298072691806209](https://twitter.com/php_ceo/status/765298072691806209)

------
fleaflicker
Google has a similar tool called Refaster

[https://github.com/google/Refaster](https://github.com/google/Refaster)

------
ilostmykeys
The patterns that Undebt removes could be non-existent in the code base, but
the conceptual, algorithmic design decisions and architecture could be all bad
and that is what the actual technical debt is. Bad code patterns are just a
tiny slice of the problem in most cases.

------
hendry
If any of you guys are interested in plotting total lines of code changes, do
check out
[https://github.com/kaihendry/graphsloc](https://github.com/kaihendry/graphsloc)

------
kctess5
It's like codesearch [1].

[1]
[https://github.com/google/codesearch](https://github.com/google/codesearch)

------
yarper
The fact that you need a find and replace tool to make sensible changes to
your codebase indicates that it's hugely out of control already.

------
ksec
Interesting, I have always (wrongly) remembered Yelp as a Ruby on Rails shop.
Does anyone know the stack behind Yelp?

~~~
bpicolo
It is mostly Python, both the monolith and SoA. A few services are Java, e.g.
for talking to Lucene. The github site actually gives a reasonably complete
view of the stack via the various tools we have for integrating with various
parts of the stack. [https://yelp.github.io/](https://yelp.github.io/)

------
avindroth
Is there a Haskell version for this?

~~~
kelvin0
Haskell code (as with LISP code) does not need refactoring. It always comes
out as a distilled, pure and crystalline elixir of wisdom. :)

------
denfromufa
Why pyparsing, not ply or regex?

~~~
coredog64
Jamie Zawinski <jwz@netscape.com> wrote on Tue, 12 Aug 1997 13:16:22 -0700:

Some people, when confronted with a problem, think “I know, I'll use regular
expressions.” Now they have two problems.
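
A small stdlib-only sketch of the usual "second problem" when regexes meet
source code: nested parentheses. The snippet being matched and the
`balanced_args` helper are invented for illustration, and the helper itself
is naive (it ignores parentheses inside string literals), which is roughly
where a real parser starts to pay off.

```python
import re

SOURCE = "old_call(inner(1), 2) + other(3)"

# A lazy regex stops at the first ")", cutting the argument list short.
lazy = re.search(r"old_call\((.*?)\)", SOURCE).group(1)

# A greedy regex runs to the last ")", swallowing unrelated code.
greedy = re.search(r"old_call\((.*)\)", SOURCE).group(1)

def balanced_args(text, name):
    """Return the argument text of the first `name(...)` call, tracking paren depth."""
    start = text.index(name + "(") + len(name) + 1
    depth = 1
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start:i]
    raise ValueError("unbalanced parentheses")

exact = balanced_args(SOURCE, "old_call")
```

Here `lazy` comes back as `inner(1`, `greedy` as `inner(1), 2) + other(3`, and
only the depth-tracking version returns `inner(1), 2`.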

~~~
pkroll
"I've noticed that when I get quoted in .sig files, it's never any of the
actual clever things I say all the time. Usually it's something dumb."
--Jamie Zawinski

