
We decided to go for the big rewrite - duijf
https://tech.channable.com/posts/2019-10-04-why-we-decided-to-go-for-the-big-rewrite.html
======
royjacobs
Reading this article it seems like yet another example of "you don't have big
data". Most of the features that are unique to Spark (or Spark-like setups)
were not needed, so in the end it's mostly...just an app talking to Postgres?

I'm not sure, but reading other articles[0] on the blog, it seems like they've
been jumping on bandwagons before, so it's probably good to revisit those
decisions every now and again.

Edit: Not trying to come off as too snarky, though I've found that this type
of thing is pretty common in startups where everyone from the CTO on down has
_some_ experience but not _a lot_ of experience. I've fallen into that trap
too, at some point saying "Sure, Scala will work great! It's future-proof and
everyone will love it!" _cue crickets_

[0] [https://tech.channable.com/posts/2017-02-24-how-we-secretly-...](https://tech.channable.com/posts/2017-02-24-how-we-secretly-introduced-haskell-and-got-away-with-it.html)

~~~
kod
> "Sure, Scala will work great! It's future-proof and everyone will love it!"

Scala is still the least-bad option for a JVM language.

Anyone who can't be productive in Scala (not "better java", not "worse
haskell", Scala) isn't someone you want on your team anyway.

~~~
nemothekid
> _Anyone who can't be productive in Scala (not "better java", not "worse
> haskell", Scala) isn't someone you want on your team anyway._

It’s funny that you gloss over one of the big productivity issues with Scala:
somehow everyone who joins your team is expected to be acutely aware of _your_
flavor of Scala.

If the community decided it wanted to be a better Java or a worse Haskell,
then I bet more people would be productive in Scala.

~~~
kod
It's not "my" flavor of Scala, it's what it's designed for. It's a pragmatic,
ML family language.

Read Odersky's book, write code, still have access to all JVM libraries and a
lot less brain damage.

There are people that try to write worse Haskell in everything from Perl to
Kotlin too. The reputation is overblown and has little to do with the actual
language.

------
alexpotato
Rewriting an application can mean different things:

1. "We are going to start over from scratch and rewrite the whole thing!"

Joel Spolsky famously said to "Never do this!"

2. "We are going to slowly refactor the whole codebase."

This can, eventually, lead you to a place where none of the original code is
there so it's like a rewrite but much simpler.

3. "We are going to slowly add new pieces to replace the old system till
there is no old system left."

This is called the "Strangler App" model as described by Martin Fowler
([https://martinfowler.com/bliki/StranglerFigApplication.html](https://martinfowler.com/bliki/StranglerFigApplication.html))
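The Strangler Fig idea can be sketched as a thin routing layer that prefers new implementations where they exist and falls back to the legacy system everywhere else. This is a toy illustration; all handler and endpoint names here are made up, not from Fowler's article:

```python
# Hypothetical strangler-fig router: migrated endpoints go to the new
# system, everything else still hits the legacy code.
LEGACY_HANDLERS = {
    "orders": lambda req: f"legacy:orders:{req}",
    "billing": lambda req: f"legacy:billing:{req}",
}
NEW_HANDLERS = {
    "orders": lambda req: f"new:orders:{req}",  # already migrated
}

def route(endpoint, request):
    # Prefer the new implementation once it exists; fall back to legacy.
    handler = NEW_HANDLERS.get(endpoint) or LEGACY_HANDLERS[endpoint]
    return handler(request)
```

As each endpoint is migrated, it moves into the "new" table; once the legacy table is empty, the old system can be deleted.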

Granted, for some reason, it seems that "I retired a legacy system and rolled
out a brand new one" looks better on a resume than "I refactored a legacy
system into a better system," so YMMV.

~~~
steve_adams_86
I'm not sure about the resume part. Refactoring legacy stuff into more modern
stuff has actually been a big talking point on my resume. People are always
interested in why I did it, how I did it, why I didn't just build anew, etc.

Maybe that's because most jobs I apply for tend to have at least one 'legacy'
system lying around that needs some help. But this is somewhat common outside
of the startup scene.

Legacy in this case means an app that's maybe 7-10 years old, often written
like everything happened on the back of a napkin. Sprawling, inconsistent,
monolithic stuff. But not true legacy where it's written in Fortran or
something in the 1500s.

At any rate, I think it's worth having refactors on your resume. I know I'd be
interested in people who have done it - it's always such an educational
experience.

~~~
alexpotato
I should have been more specific about which industry I was referring to with
the resume comment.

I worked for a big bank for 5 years supporting large trading systems. After
the 3rd attempt to retire a legacy trading system by rolling out a new system,
someone pointed out that if you are a mid level manager you don't get any
resume points for refactoring old systems.

Hence, you keep trying to roll out new systems even if those systems have bugs
like "switch all buy orders to sells" (true story).

------
gtsteve
This doesn't really sound like a big-bang rewrite as such but an incremental
development process. The sort of rewrite that would be a major strategic
mistake is where you start with a new repository and begin re-implementing the
entire product, but this seems all perfectly sensible to me.

This just sounds like the sort of incremental Ship of Theseus [0] development
that many of us are doing. The product I'm working on has had enough key
internal portions rewritten over a long enough time (including interestingly
the job management system) that you could say it's a rewrite compared to the
product from 2 years ago.

[0]
[https://en.wikipedia.org/wiki/Ship_of_Theseus](https://en.wikipedia.org/wiki/Ship_of_Theseus)

~~~
barking
I'd like to see some idea of what they mean by big. An order of magnitude in
terms of lines of code and the number of engineers they had for the job and
how long it all took versus how long the original took to write etc. One of
the things that scares me about something like this is all the seemingly
illogical bits of code that were added over time and are actually that way for
a reason, because they fix some edge case or whatever. It's hard to see those
not getting swept away in a rewrite, along with a whole bunch of new, rarely
occurring issues being added. A complete rewrite strikes me as something that's
almost guaranteed to be buggier than what it replaces, at least for a while.

~~~
LandR
> I'd like to see some idea of what they mean by big. An order of magnitude in
> terms of lines of code and the number of engineers they had for the job and
> how long it all took versus how long the original took to write etc.

Yeah, at my last place before I left, they were looking to do a rewrite of
their main project, which was the result of years of hundreds of developers
working on it. The timescale was woolly, but it was expected to take between 5
and 10 years and probably require ~200 developers and testers.

This was rewriting millions of lines of code, very little would be reusable.

I don't know if they went ahead with it or not, but it was looking like it
would descend into chaos.

~~~
eej71
What prompted everyone to want to rewrite so much code?

~~~
LandR
They were on a system where, if bugs remained open past a certain date after
being logged, there were financial penalties that had to be paid.

Also, adding new features was incredibly slow, painful and dangerous. Even a
minor change resulted in a huge amount of manual regression testing. It was
absurd.

------
Darkstryder
> Prematurely designing systems “for scale” is just another instance of
> premature optimization

> Examples abound: (...) using a distributed database when Postgres would do

This is the only part of the article that bugged me a little, because in my
experience the choice between single-machine and distributed databases is not
so much about scale as it is about availability and avoiding a single point of
failure.

Even if your database server is fairly stable (a VM in a robust cloud, for
instance), if you use Postgres or MySQL and you need to upgrade to a newer
version of the database (say, for an urgent security update), you have no
choice but to completely stop the service for a few seconds or minutes
(assuming the service cannot work without its database).

Depending on the service and its users, this mandatory down-time might or
might not be acceptable.

Anecdotally I suspect services requiring high SLAs are more common than ones
requiring petabyte scale storage.
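One common way applications ride out a short failover or restart window is a retry wrapper with exponential backoff, which turns brief unavailability into extra latency rather than user-facing errors. A generic sketch, not tied to any particular database driver:

```python
import time

def with_retry(fn, attempts=5, base_delay=0.01):
    # Retry a database call with exponential backoff, so a short
    # maintenance window (failover, minor-version restart) shows up
    # as latency instead of errors surfaced to users.
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))
```

A real client would also cap the total wait time and only retry idempotent operations, but the principle is the same.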

~~~
duijf
Re availability: We had a hard time keeping the Spark-based system available.
Some days the cluster would freak out multiple times. The 'fix' would be to
restart a bunch of Spark workers. We spent a lot of time debugging/finding
this out (some parts documented in [1]) but couldn't work out what the problem
was. (EDIT: Assuming there even was a single problem.)

In this particular case, I'd take the single point of failure over the
previous situation.

That being said: we have successfully used PostgreSQL's fail-overs multiple
times. In my experience, they work quite alright.

[1]: [https://tech.channable.com/posts/2018-04-10-debugging-a-long...](https://tech.channable.com/posts/2018-04-10-debugging-a-long-running-apache-spark-application.html)

~~~
Darkstryder
Yeah, I agree. It was more of a general comment, because you seem to have one
Postgres instance for every client, which is already a big step against SPOF.

At $previous_job we had a "one service" = "one MySQL instance" policy. Every
time a MySQL server went down, all clients would lose access to that service
at the same time. It was stressful and much less robust than your setup.

------
jermaustin1
This is a story of one of my first product launches, and the inevitable
rewrites that ensued. You can read more of it here:
[https://jeremyaboyd.micro.blog/2016/11/05/my-first-product.h...](https://jeremyaboyd.micro.blog/2016/11/05/my-first-product.html)

Years ago I was building SEO software. One of the products was originally
written as an internal tool, and handled our work load without skipping a
beat. Then we decided to release it to the public, so I did a small refactor
to implement accounts. We launched with it hosted on a small Dell PC under
my desk (where it had been running as our internal tool). Within 2 hours of
launch, it was completely overwhelmed and shutting down due to overheating.

It was "rewrite" time.

While doing that, I had to come up with SOME workaround. So I opened the case
and stuck a box fan on it to try and exhaust some of the heat. That lasted
about 8 more hours before the server shut down and I got a call from the boss.

I went into the office in the middle of the night and started profiling the
application. I found a VPS host, quickly spun up the largest Windows VM they
offered, and that helped for a few days while I rewrote large swaths of the
application. Even after a ~80% rewrite and splitting the application in two,
we had more users than we'd ever anticipated and I was out of my depth with
scaling. So we got a few (much larger) physical servers at Softlayer.

This was the setup the website ran under for the next couple of years with
minor tweaks: more space with an iSCSI array, more RAM, more CPUs, etc., but
all staying at Softlayer. Eventually, when the hosting bill was getting into
the high four figures a month, we reevaluated and decided a rewrite was in
order to switch to Microsoft Azure, utilizing Azure SQL, Azure Table Storage,
Azure Queue Service, and offloading all of the complicated tasks from the web
server onto the Azure infrastructure. For all I know it is still on Azure.

------
goto11
Is the current system so badly architected that it cannot be refactored
gradually or rewritten piecemeal? Then the forces which caused these problems
will also be in effect during the rewrite, so it will end up in the same place
by the time it reaches feature parity.

I can only think of a few places where a full rewrite is justified:

* You lost the source code

* The application is almost purely integration with some third-party platform or component, and you need to replace that platform. (E.g. you are developing a registry cleaner and need to port it to Mac)

* You don't have any customers or users yet and time-to-market is not a concern.

* You are not a business and are writing the code purely for your own enjoyment

But these are business level considerations. For individual developers there
may be compelling reasons to push for a rewrite:

* You find it more fun to work on green-field projects than to perform maintenance development.

* The new platform is more exciting or looks better on the CV than the old one.

~~~
ryanelfman
That's not necessarily true. Those forces will still be there, but the team
has also learned what not to do. People can improve and learn over time.

~~~
goto11
Yes but if you have improved over time, then at least the most recently
developed parts of the code would be high-quality and would not need
rewriting. So you would only need to rewrite some encapsulated legacy parts of
the codebase, which is completely different from a full rewrite.

------
mark_l_watson
I have also found Spark (and Hadoop before that) a little clunky to prototype
and develop on, but when you need to handle very large datasets with good
throughput, systems like Spark/Hadoop are great. One problem they had was
maintaining infrastructure; to be honest, when I used MapReduce as a
contractor at Google, or AWS Elastic MapReduce as a consultant, I didn’t have
to deal too much with infrastructure.

Anyway, it makes sense that they backed off from Spark and HDFS, given the
size of their datasets.

The original poster mentioned that their data analytics software is written in
Haskell. I would like to see a write up on that.

EDIT: I see that they do have two articles on their blog on their use of
Haskell.

------
z3t4
"A creature from another dimension would see us as noise as our atoms are
constantly being replaced"

Don't wait for a big rewrite. Constantly keep deleting and rewriting. Just
make sure you are solving real problems while doing it.

~~~
barking
This feels like a far safer path to take

------
lifeisstillgood
>>> One of our main reasons for choosing Apache Spark had been its ability to
handle very large datasets (larger than what you can fit into memory on a
single node) and its ability to distribute computations over a whole cluster
of machines ... We cannot fit all of our datasets in memory on one node, but
that is also not necessary, since we can trivially shard datasets of different
projects over different servers, because they are all independent of one
another.

So this seems to be the massive takeaway: if you need to operate on a _whole_
dataset that is larger than one node's memory capacity, then you have to go
distributed. Otherwise, it still seems like an overhead barely worth the
effort.
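The "trivially shard independent datasets over different servers" approach from the quote can be sketched with a stable hash. The server names are hypothetical; the point is that independent projects need no coordination at all:

```python
import hashlib

SERVERS = ["db-0", "db-1", "db-2"]  # hypothetical shard targets

def shard_for(project_id: str) -> str:
    # A stable hash means a project always maps to the same server;
    # because projects are independent, no cross-server queries or
    # distributed coordination are ever needed.
    digest = hashlib.sha256(project_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]
```

Plain modulo remaps most projects when a server is added; consistent hashing avoids that, but isn't needed if individual datasets are small enough to move easily.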

So Google: the dataset is all web pages on the internet. Yes, that's too
large; go distributed.

Tesco / Walmart: the dataset might be all the sales for a year. Probably too
large. But could you do with sales per week? Per day?

Having the raw data of all your transactions etc. lying around waiting for
your spiffo business query sounds good, but... is it?

I would be interested in hearing folks' cut-off points for going full Big Data
vs "we don't really need this"

------
jimbokun
We decided to go for a Big Rewrite for a completely different reason. The
initial license for the proprietary NoSQL database we had negotiated was about
to expire, and the company was going to charge us an order of magnitude more
to renew.

So we immediately set out redesigning our system to use other, fully open
source technologies. It also gave us an opportunity to reconsider architecture
decisions that had not scaled well. In our case, moving from a monolith to
microservices has had major benefits, maybe the biggest being the ability to quickly
see which microservice is the bottleneck and needs to be scaled up to handle
the load. With the monolith, if it got slow, it was very difficult to figure
out which part of the workload was making it slow.

------
bluedino
Our company runs on a pile of VBA/Access. At least it talks to a MariaDB
server on Linux.

One of the biggest problems is trying to run/develop this code on machines
made in the last ten years; the other is that it's a horrible, horrible
codebase. Code practices from the early 90's.

To make things worse, objects are 'evil', all HTML/SQL/XML is built by
appending strings, and there are no data sanity checks.....
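For contrast, here is roughly what the string-appending anti-pattern looks like next to a parameterized query, sketched with Python's stdlib sqlite3 (the schema is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

def add_user_unsafe(name):
    # Builds SQL by appending strings: breaks on quotes and is
    # wide open to SQL injection.
    conn.execute("INSERT INTO users VALUES ('" + name + "')")

def add_user_safe(name):
    # Parameterized query: the driver handles quoting and escaping.
    conn.execute("INSERT INTO users VALUES (?)", (name,))

add_user_safe("O'Brien")  # works fine
# add_user_unsafe("O'Brien") raises sqlite3.OperationalError,
# because the embedded quote breaks the hand-built statement.
```

The same distinction applies to HTML and XML: use a templating engine or document builder that escapes for you, not string concatenation.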

I started a proof of concept replacement system that was written in Python and
ran on the web.

It was met with "We can't go with a web-based system, since if anything
changes in the browser we'll be up shit creek."

:-/

~~~
viraptor
Ah... The fine line between "better the devil you know" and "I don't
understand the options so every change is too scary".

------
sandGorgon
In case you are using PySpark, a good framework to move to is Dask:
[https://docs.dask.org/en/latest/](https://docs.dask.org/en/latest/)

it is also natively integrated with K8s - [https://github.com/dask/dask-kubernetes](https://github.com/dask/dask-kubernetes) \- and Yarn - [https://yarn.dask.org/en/latest/](https://yarn.dask.org/en/latest/)
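The core idea that Dask (and Spark) generalize — split data into independent partitions, reduce each one, then combine the partial results — can be sketched with nothing but the stdlib. This is a toy illustration of the concept, not Dask's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def partitioned_sum(data, n_partitions=4):
    # Split into independent partitions, reduce each in parallel,
    # then combine the partial results into the final answer.
    chunks = [data[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)
```

Dask's value is doing this over partitions that don't fit in one process's memory, with a scheduler deciding where each chunk runs.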

------
roland35
I am currently evaluating whether we should rewrite a large chunk of our
embedded controller, which handles motion control right now, so this is a
timely write-up! I think the lessons here are the same in embedded code: we
can keep the existing black-box (consultant-written!) code for now while the
new motion control code is written in a parallel branch. Luckily, the more
modular the project, the easier this is!

------
sebastianconcpt
_This begs the question: in which situations is it appropriate to decide on a
full rewrite?

In theory, there is an easy answer to this question: If the cost of the
rewrite, in terms of money, time, and opportunity cost, is less than the cost
of fixing the issues with the old system, then one should go for the rewrite._

 _In our case, the answer to all of these questions was yes.

One of our original mistakes (back in 2014) had been that we had tried to
“future-proof” our system by trying to predict our future requirements. One of
our main reasons for choosing Apache Spark had been its ability to handle very
large datasets (larger than what you can fit into memory on a single node) and
its ability to distribute computations over a whole cluster of machines[4]. At
the time, we did not have any datasets that were this large. In fact, 5 years
later, we still do not. Our datasets have grown by a lot for sure, both in
size and quantity, but we can still easily fit each individual dataset into
memory on a single node, and this is unlikely to change any time soon[5]. We
cannot fit all of our datasets in memory on one node, but that is also not
necessary, since we can trivially shard datasets of different projects over
different servers, because they are all independent of one another.

With hindsight, it seems obvious that divining future requirements is a fool’s
errand. Prematurely designing systems “for scale” is just another instance of
premature optimization, which many development teams seem to run into at one
point or another_

------
AzzieElbab
This article is insane. I wouldn't even know how to begin configuring an
HDFS/Spark cluster for 10 GB of data.

------
apta
Next up: why we decided to re-write our Haskell re-write in $HYPED_LANGUAGE.

~~~
duijf
Rust looks pretty cool.

