
Saving a Project and a Company - andrewaylett
http://jacquesmattheij.com/saving-a-project-and-a-company
======
austenallred
> A single clueless person in a position of trust with non technical
> management and a huge budget

This is terrifying to me, especially as the CEO (I use the term because it's
technically accurate in this case, not to be a douchebag) of a well-funded
startup who is semi-technical. I can build basic apps and scripts and started
out building things for the company, but now that we have real talent I would
slow things down if I were in the code every day.

As such, I end up in a semi-product manager role, and consider myself
responsible to make sure we're quickly shipping product in line with our
goals, but at the end of the day I have to trust the people I work with, all
of whom are experts in their respective areas. Luckily I can.

I have no idea how a non-technical product manager who has hired a contractor
from an Eastern Bloc country would know when they're being reasonable and when
they're just trying to screw him. That's just a recipe for disaster. It's a
blind man in a new city taking an unofficial taxi cab. Makes me cringe.

~~~
venomsnake
I think that the Eastern Bloc is largely irrelevant. The programmers here are
from top notch to complete morons. From anecdotal evidence - this is the same
everywhere in the world.

I know of similar projects where the contractors who fucked up were from all
around the world.

Incompetence knows no national borders.

~~~
austenallred
I lived in eastern Ukraine for a couple years, and some of the people there
were the best engineers/hackers I've ever met; I wouldn't dispute that there
are shops just as competent if not more so in the Eastern Bloc.

But it is relevant because there is no real legal recourse, so you're pretty
much crossing your fingers and hoping that they won't screw you.

~~~
alex_hitchins
I would spend a proportionate amount of time looking into the firm and talk to
past customers (where possible) before putting money anywhere, regardless of
where they are located. Plenty of legal loopholes for domestic firms to
deliver a less than desirable product and run away with the fees.

I too will say that all of the Eastern Bloc developers I have worked with,
both on site and remotely have been excellent.

------
mootothemax
Great write-up, Jacques!

I have a question about how you find such work - or possibly, how such work
finds you :)

For pretty much my entire career, since 1998 or so, I've been the guy you
go to when something's gone wrong, and no-one knows why. Unknown
codebase using an unknown language on an unknown platform? I don't know _how_
I do it, but I have a talent that lets me figure such things out when all
others have failed, fast.

It feels like this should be a valuable skill; is it simply through meeting
enough people that you're the go-to guy when situations like this flare up?

~~~
jacquesm
As a rule the work finds me. I've been at this for very long (fixing things,
working under pressure, good reputation) and that really helps.

~~~
joshu
I do similar things - DD, system repair (though I am usually much further up
the stack to include product, organizational, and business structure issues),
etc. Unfortunately it is often on a favor basis rather than paid.

I've been tempted several times to form the League of Extraordinary
Gentlepersons or something similar. You broke it, we fix it.

~~~
fold
What is DD short for? My google powers are not strong enough...

~~~
GVRV
Due Diligence?

~~~
lutorm
It's also the hull class designation for a destroyer.

------
njs12345
Nice article - this kind of retrospective is always really interesting. One
little thing that caught my eye -

 _It turns out that postgres has an ‘auto vacuum’ setting that when it is
enabled will cause the database to go on some introspective tour every hour
which was the cause of the enormous periodical loads. Disabling auto vacuum
and running it once nightly when the system is very quiet anyway solved that
problem._

Often vacuum problems can also be fixed by running auto vacuum more often too
- this means it has less to do per run, so should be able to keep up a little
more easily. Loads of stuff on vacuum on the postgres wiki:
[https://wiki.postgresql.org/wiki/VacuumHeadaches#Perverse_Fe...](https://wiki.postgresql.org/wiki/VacuumHeadaches#Perverse_Feedback)

~~~
getsat
Yeah, this was one of the eyebrow raisers in the article. The other was a load
average of 0.6 being "high".

For what it's worth, autovacuum can be enabled/disabled on a per table basis,
too. Some tables need frequent vacuuming, others less frequent, and others
none at all. If you manually VACUUM a table, don't forget to also ANALYZE it!
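
A minimal sketch of the per-table approach, assuming the PostgreSQL storage parameter `autovacuum_enabled` and the `VACUUM (ANALYZE)` form of the command. The helper function names and the table names are made up for illustration, and the snippet only builds the statements as strings; it does not connect to a database.

```python
# Hypothetical helpers: they only assemble SQL text for PostgreSQL.
# Table names are interpolated directly, so pass trusted identifiers only.

def autovacuum_toggle_sql(table: str, enabled: bool) -> str:
    """Build the statement that switches autovacuum on or off for one table."""
    flag = "true" if enabled else "false"
    return f"ALTER TABLE {table} SET (autovacuum_enabled = {flag});"


def manual_vacuum_sql(table: str) -> str:
    """Build a manual vacuum that also refreshes the planner statistics."""
    return f"VACUUM (ANALYZE) {table};"


# 'country_codes' and 'sessions' are invented table names.
statements = [
    autovacuum_toggle_sql("country_codes", enabled=False),  # rarely changes
    manual_vacuum_sql("sessions"),                          # churns heavily
]

for statement in statements:
    print(statement)
```

Which tables deserve which treatment depends on their write pattern, so treat the two examples above as placeholders rather than a recommendation.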

~~~
jacquesm
High relative to the traffic. I could probably bring that down much further
but it's pointless.

~~~
getsat
Okay, that makes more sense. The number in isolation seemed odd.

------
panozzaj
@jacquesm: I am curious how you approached charging for this project due to
the uncertainty of what was wrong and how long it might take to fix it, and of
the importance of this to the company. As you say, this may have saved the
company. Did you work on an hourly / weekly / project basis? If you can talk
about this, that is. Thanks for the writeup, this seemed like a great
challenge!

~~~
jacquesm
No-cure, no-pay, daily rate. It had to be working by Christmas and we _barely_
made that deadline. (Two days to spare...)

~~~
skrebbel
Given that they were pretty desperate and had been referred to you on personal
references, why did you agree to no-cure, no-pay?

Is it something they proposed or you? Seems like you could lose a lot and win
little - and for them, the biggest risk was not your fee, but whether or not
the system got up and running, so why bother?

Is it something you do a lot?

~~~
jacquesm
It's a point of honour with me. Why send an invoice if it doesn't save their
bacon? Better to align my goals with theirs. Make money _with_ the customer,
not _off_ the customer is one of my mottos and that has worked well for me
over the years.

~~~
dasil003
I'm guessing you might find this hard to answer but, is your day rate higher
accordingly? Or is it based on other independent factors?

~~~
unreal37
Of course it is.

~~~
dasil003
Thank you for your random mind-bending insight.

------
Decade
The flip side is that this is helping those developers stay employable.

I thought I was being all nice and responsible with money: keep it on a single
server, watch out for memory and CPU, minimize harm to the environment,
outsource to maximize use of our limited resources. Now I'm looking elsewhere
for employment and I have no "relevant" experience.

In the meanwhile, the clowns who made this mess get to claim J2EE, cloud, HA,
VMWare, Redis, Angular.js, Symfony2, and a living client for their resumes,
and their product didn't even work correctly.

~~~
kansface
The article does not lay the blame on the devs:

| A single clueless person in a position of trust with non technical
management, an outsourced project and a huge budget, what could possibly go
wrong...

~~~
debacle
If you are in a position to make technical decisions and you choose Symfony2
and don't even consider that you should have a cache or three in place, you
should be dragged behind a van by your teeth. Caching in PHP has reached a
point where turning it on is as simple as installing a package and setting a
flag.

------
jcr
> _" Instrumental in all this work was a system that we set up very early in
> the project that tracked interaction between users and the system in a fine
> grained manner using a large number of counters."_

I know you might still have some degree of an NDA pinch preventing you from
giving too many details, but if possible, can you give some more info on how
you went about setting up the tracking instrumentation?

As always, a fun read. Thanks!

~~~
jacquesm
That's tricky to answer without making this identifiable but let me try to
transpose it a bit hoping that still makes sense.

If you're running a store at any one point in time the store contains the
number of people that have ever entered - the number of people that have left.
So by just adding two counters (person entering, person leaving) you can
validate the current state of the store by subtracting the second from the
first and doing a quick count of the aisles. If you have more (or fewer)
people in the store than you think you should have you have either another
door somewhere that you're not aware of, people are being born or dying on the
premises (that might work for a hospital ;) or they're climbing out through
the roof.

If the counters match there is no guarantee that that is not the case but it
certainly helps to gain confidence that you know where your entrances and
exits are and that people aren't keeling over while shopping in your store.

Adding a large number of checks like that will eventually give you a very
quick way to test your assumptions about how things should work and to
determine the impact of a change on the system. We logged all those counters
on a minute-to-minute basis (1440 records per day is peanuts), and have
established a number of baselines indicating what 'normal' behavior is, what
'perfect' behavior should be and this in turn (over time) gives you a goal to
shoot for.

If after a change you're below normal you've probably messed something up and
should roll back, if after a change you're doing better than before then good,
don't change, establish a new 'normal' in a couple of days time and strive for
'perfect'.
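
As a rough sketch of that decision rule: compare the per-minute samples after a change against the 'normal' baseline and decide whether to roll back. The 5% tolerance, the assumption that a higher value is better, and all names below are assumptions made for the example, not details from the project.

```python
from statistics import mean
from typing import Sequence


def classify_change(
    baseline: Sequence[float],
    after: Sequence[float],
    tolerance: float = 0.05,
) -> str:
    """Compare post-change samples of one counter against its baseline.

    Assumes a higher value is better for this counter. The tolerance band
    keeps ordinary minute-to-minute noise from triggering a decision.
    """
    normal = mean(baseline)
    observed = mean(after)
    if observed < normal * (1 - tolerance):
        return "roll back"
    if observed > normal * (1 + tolerance):
        return "keep, then re-baseline"
    return "no clear change"


# Invented sample values for one counter, logged once per minute.
print(classify_change([100, 102, 98], [80, 82, 79]))
print(classify_change([100, 102, 98], [120, 118, 122]))
```

In practice each counter would need its own direction and tolerance, and the new 'normal' would only be set after a few days of data, as described above.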

This trick has made it fairly easy to steer the project in the right direction
and saved us from making stupid mistakes a number of times (most notably: at
some point we realized the sessions weren't cleaned up at all, but cleaning
them up too fanatically caused some of the relationships between the counters
to indicate that we had a problem, it didn't take too long before we realized
that the session cleanup routine was the culprit, without having that system
in place this would have taken _much_ longer and would have done a lot more
damage).

------
pbjorkm
Nice writeup! Interesting to see another perspective. I do this for a living
too but very seldom get to hear others' experiences (due to the nature of the
work). My niche is enterprise CRM software so quite the same. Some
comments/questions:

- Amazes me as well how quickly you can turn things around by just changing
out parts of the team
- Also nice to see how you kept parts of the old team. There are most always
skilled people even in massively failing teams. Reminds me to be humble (could
be me sitting in the wrong project the next time)
- Rewrite is almost never necessary. You tend to want to do this starting out
but normally that feeling comes in good part from not being in control and not
having knowledge of the ins and outs of the current software. Once you get to
know it, it usually turns out to be not as bad (just more complex than it
needs to be)

I have done this so many times now that I have my own little mental model for
what to look at when I get airdropped in:

- Project management. Do they have one dedicated project manager who is
reporting status correctly and frequently to the stakeholders? Are plans
available and followed up, etc.?
- Product management. Are requirements from the business gathered and
negotiated down to clear and concise things that can be built?
- Technical leadership. Do they use suitable technology, proper infrastructure
setup (in your case not), is technical design simple to understand and not
overly complex?
- Change management. Is the team communicating the coming changes effectively
to the end users? Is training done and being planned correctly?
- Work process. Is there a good process with good flow from requirements to
created and tested feature?

My theory goes that if one of them fails, the project usually survives anyway,
the others compensate. If two or more fail, the project fails.

Finally a question. You write that it is usually not a good engagement for you
financially. Would be curious to know your business model here. I end up doing
these projects on an hourly rate for the most part.

~~~
jacquesm
> You write that it is usually not a good engagement for you financially.

Not 'not a good engagement financially', rather the opposite. Just more risky.
Typically I'll do these for daily rates depending on the perceived risk but
I'm pretty flexible.

------
faragon
Just 10,000 visitors per _day_ on a 64 core machine with 256GB of RAM? If it were
per _minute_ (or even _hour_ ) it would make sense, otherwise, it seems poorly
designed.

~~~
alecco
Being a PHP + VMWare system (culture-wise, not the technology itself), I was
not surprised about all the madness on DB level. They didn't use indices in
Postgres but they added Redis. It's incredible how many, perhaps most, systems
do crazy things like that.

~~~
chris_wot
Of course, the flip side is too many indexes.

~~~
themonk
I have seen both extremes: no index v/s all columns indexed, no cache at all v/s
cache at every level.

I have seen caches having more inserts v/s reads.

I have seen them replacing MySQL by NoSQL as it does not scale for them.

This is common for early stage funded startups founded by non tech founders.

~~~
raverbashing
Oh I can imagine the pain...

"Let's make our change password system handle 100k requests per minute but the
front page starts to get wonky at 1000 req/min"

~~~
themonk
This sounds familiar. A few days ago I overheard this: "Our user
authentication system is on MySQL, it may not scale, let's move it to NoSQL."

The only time this table is touched is to check whether the user entered the
correct password.

------
alrs
Be glad there was no cache infrastructure. Ripping out an ill-conceived
caching layer is usually a nasty and unpopular step early in an architecture
rescue.

------
russnewcomer
A question instantly springs to my mind -

Why all the virtualization? Lack of experience? I haven't done a large amount
of work with virtualization, but stopping and thinking about it would seem to
have indicated a problem with the design. Did no one look at this and say,
"That's a bad idea..."?

~~~
jacquesm
I have my suspicions about that but I don't want to voice those here. For one,
that's possibly actionable; second, there is a lot more to this story than what
I can talk about publicly. I'm already very grateful they let me publish as
much as I just did.

So, yes: someone did say 'that's a bad idea' and got sidelined for his effort.

~~~
danielki
I know you can't say anything, but I'm guessing that there's a nonzero chance
that the guy who made the hardware decisions is a friend of the guy who sold
the hardware.

------
driverdan
This is the kind of freelance work I love doing, it's win-win. You get to
solve a complex puzzle and the people who hired you are happy that you saved
their ass. These jobs often have a lot of simple things you can fix early to
eliminate the immediate danger (eg opcode caching) and give you time to really
dig in and fix everything.

------
mcguire
There is a considerable amount of wisdom in this article. But there's one
thing in particular that caught my eye:

" _The job ended up being team work, it was way too much for a single person
and I’m very fortunate to have found at least one kindred spirit at the
company as well as a network of friends who jumped to my aid at first call.
Quite an amazing experience to see a team of such quality materialize out of
thin air and go to work as if they had been working together for years._ "

I have done similar sorts of things for my current employer and at previous
jobs. One thing we discuss here, since the opportunity keeps popping up, is
forming a specific team of people to parachute into a flailing project and get
it back on track. That frequently seems to involve taking it away from the
then-current developers, paring it down to "the good parts", and then
rewriting the rest, but it does yield results.

~~~
jacquesm
I try to be as fair as possible to those involved in earlier stages of the
project (management is usually not too levelheaded at this stage). In this
case the earlier developers for the most part had but one single failing, they
didn't put their foot down when they realized things were going pear shaped.
If they had done that they might have been able to stop the disaster before it
happened, but as it was they were too intimidated to draw a line in the sand.
I'm pretty sure they learned a bunch of lessons on this project, they're no
angels but they're definitely not the bad guys.

~~~
UK-AL
Putting your foot down can end up badly as well, as you can simply be removed from
the project if you don't agree to deadlines.

Then management wonder why it failed.

~~~
jacquesm
I'd rather be removed from a project than to agree to something that can't be
done.

~~~
jamesknelson
I've also come to that conclusion after a number of years of freelancing, but
I understand the other point of view as well.

While freelancing, especially for people in another country, it isn't uncommon
for customers to just disappear without any trace (or pay) half-way through a
project. As such, when confronted with some problem mid-project, finishing the
job in a half-assed manner can be a way of ensuring you get paid. Of course, a
better method is making sure you have a solid plan before agreeing to the job,
and trying to avoid unreliable customers - but accomplishing this can be
difficult in any setting, even non-freelancing.

On the customer side, making sure you have a number of reasonably sized
milestones and pay for them immediately on delivery can help keep freelancers
confident, and thus encourage better quality work.

------
sokoloff
_I don’t mind doing these jobs, they take a lot of energy and they are pretty
risky for me financially but in the end when - if - you can turn the thing
around it is very satisfying._

What about the jobs is financially risky? Do you have a downside beyond "might
not get paid if the company fails"?

~~~
jacquesm
It concentrates a lot of time on a single customer which means I may have to
say 'no' to my other, repeat customers for shorter jobs. My line of business
is normally technical due diligence (helping investors to make savvy decisions
about where to invest and where definitely not to invest). That work is
usually very short term (typically a week) and there is no guessing ahead of
time when a job will come up. So when one of my customers calls I'm supposed
to be up and running within a day or so.

Another risk is that when I call my friends in to assist I assume their risk
of not getting paid, in other words, if the company would not be able to meet
its obligations I would make sure my friends and colleagues would be made
whole (those relationships are worth more to me than any job ever would be).

------
hkarthik
Great write up.

There's likely an alternative scenario where a consultant runs into a
different but similar set of problems with a company that has mis-configured
their Rails app across multiple AWS EC2 machines, in the wrong security
groups, with their EBS settings tuned improperly for their MySQL instances.
All resulting in extremely poor performance of their flagship application
which is costing them a lot of business.

~~~
themonk
Was the innodb buffer pool set to default as well?

------
thisone
thanks for the write up. Having been through this on the inside, company
implosion, the few of us who stayed needed to save the software from all the
poor decisions made for all the right reasons, I can say it's made me a better
programmer.

Not a job for the faint of heart, especially when it's your own history you
are now fixing, and I appreciate seeing the experience of someone else.

------
digital-rubber
Nice read Jacques :-)

Though it's the typical story of any company/person that assumes a framework
is great for their problem or product, not realising what does and does not
happen in the background. One has to perfectly understand which cogs, axles
and wheels turn when an operation is done. Know which wheels always do the
exact same thing (apply caches) etc.

But more important, best wishes for 2015 from nearby your office,

RB

~~~
jacquesm
Hehe. So that was him :)

------
garry
This happens quite a lot actually. Premature optimization with the basics not
being figured out. Always get your database indices figured out first, and
then cache after that. Pick a reasonable place to start scaling horizontally,
but only after you've reached the sweet spot of what one fairly powerful
instance can deal with.
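
To make the "indices first" point concrete, here is a small sketch that asks the database for its query plan before and after adding an index. It uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere; the system in the article ran Postgres, where `EXPLAIN` plays the same role. The table, column, and index names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def query_plan(sql: str) -> str:
    """Return the human-readable plan SQLite would use for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the plan description.
    return " | ".join(row[3] for row in rows)


lookup = "SELECT name FROM users WHERE email = 'user500@example.com'"

plan_before = query_plan(lookup)  # no index on email yet: full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup)   # the planner can now use the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of the table and the second names the index. Checking the plan like this is far cheaper than bolting on a cache to hide a missing index.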

------
mcguire
" _The traffic levels were incredibly low for a system this size ( < 10K
visitors daily) and still it wouldn’t perform._"

This kind of thing irritates me. _User_ numbers are important, financially,
because "10,000 users daily" can tell an investor or manager how much money is
involved. But technically? That number doesn't mean anything to me. Are the
visitors making one request or a hundred? Are they clustered into the five
minutes before and after a horserace or are they spread out?
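
A quick back-of-the-envelope sketch of why the daily number alone says so little. The per-visitor request counts and the ten minute window are illustrative assumptions taken from the questions above, not figures from the article.

```python
def requests_per_second(
    visitors: int, requests_per_visitor: int, window_seconds: int
) -> float:
    """Average request rate if all traffic falls inside the given window."""
    return visitors * requests_per_visitor / window_seconds


DAY = 24 * 60 * 60       # traffic spread evenly over a whole day
TEN_MINUTES = 10 * 60    # traffic clustered around a single event

# Same 10,000 daily visitors, two very different loads.
spread_light = requests_per_second(10_000, 1, DAY)
burst_heavy = requests_per_second(10_000, 100, TEN_MINUTES)

print(f"one request each, spread over a day: {spread_light:.2f} req/s")
print(f"100 requests each, within 10 minutes: {burst_heavy:.2f} req/s")
```

The two cases come out at roughly 0.12 and 1666.67 requests per second, a difference of more than four orders of magnitude for the same "10,000 users daily".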

~~~
jacquesm
Being more specific would risk allowing the company to be identified, but
you're absolutely right that just quoting user numbers by themselves is not
going to be much help. Consider adding the words 'within the context of this
application' wherever such metrics are used.

As far as interaction goes I would qualify this particular product as halfway
between twitter and a social bookmarking site. More interaction than HN but
significantly less complex than twitter. Both twitter and HN are deceptively
simple on the outside but remarkably complex underneath, so maybe I'm
overstating the complexity level but it's not too far off the mark. By my
estimate and using my own websites as a benchmark they should be able to run
their current product on a single machine up to or over 100K users daily
(using their current set of technologies), session times and concurrency of
course play into that heavily.

~~~
mcguire
I understand, and I'm sorry I seemed to be specifically targeting you. It's
more of a general complaint: I've seen too many people using that kind of
measure in a context where it's really not appropriate.

------
oldpond
Great story! Thanks for this. I'm glad it had a happy ending as so many of
these situations do not. It takes courage and vision to admit you have a
really big mess on your hands and need expert help.

As for the clueless PM, I have met far too many of these in my travels. If you
can't write software, what makes you think you can 'manage' a software
development project?

------
lifeisstillgood
My favourite part was the use of (graphite-like?) counters to monitor changes
and make implicit assertions about relationships in the system (ie if we push
that metric down, that metric will go up by same amount)

It's a really useful trick to stop yourself believing that the system works
the way you think it does just because you think it.

------
davidw
The key quote for me was "First you scale ‘up’ as far as you can, then you
scale ‘out’". I see so many job postings involving "web scale" tech that make
me kind of suspicious. Do they really need it?

------
mc_hammer
this is a really good article; novices can almost use it as a "how to scale x"
or "scalability and optimizations: how to". good read.

------
emmanueloga_
Not to downplay the work of the OP, but the system he talks about seemed like
a feast of low hanging fruit :)

------
icedchai
Sounds like the original developers were either incredibly incompetent, or
wanted to guarantee themselves future work.... 100+ VMs? ridiculous. No op
code cache? No memcache or other forms of caching? Stock DB settings? No
indexes?

This is all stuff that is so basic. I gotta laugh.

And I have to wonder how bad the actual code was...

~~~
Ixiaus
He even said the developers _were_ competent...at developing. They'd released
some not-so-good software towards the end of the project due to "manager
pressure".

It sounds like they didn't have a good systems person though nor good (and
general) software leadership, often good software leadership is also your
early-stage systems person. Jacques here acted as their systems integrator to
save the day along with what sounds like mild programming support to clean up
some of the unfinished software product that got pushed too early.

~~~
jacquesm
Spot on.

