Saving a Project and a Company (jacquesmattheij.com)
425 points by andrewaylett on Dec 30, 2014 | 130 comments



> A single clueless person in a position of trust with non technical management and a huge budget

This is terrifying to me, especially as the CEO (I use the term because it's technically accurate in this case, not to be a douchebag) of a well-funded startup who is semi-technical. I can build basic apps and scripts and started out building things for the company, but now that we have real talent I would slow things down if I were in the code every day.

As such, I end up in a semi-product-manager role, and consider myself responsible for making sure we're quickly shipping product in line with our goals, but at the end of the day I have to trust the people I work with, all of whom are experts in their respective areas. Luckily I can.

I have no idea how a non-technical product manager who has hired a contractor from an Eastern Bloc country would know when they're being reasonable and when they're just trying to screw him. That's just a recipe for disaster. It's a blind man in a new city taking an unofficial taxi cab. Makes me cringe.


I think that the Eastern Bloc is largely irrelevant. The programmers here range from top notch to complete morons. From anecdotal evidence, this is the same everywhere in the world.

I know of similar projects where the contractors who fucked up were from all around the world.

Incompetence knows no national borders.


Probably better put as "contractors from a different country, speaking a different language, and with a different legal system." The specific country doesn't matter; what does is that you have no idea what they're doing and they have no idea what you want.


I lived in eastern Ukraine for a couple years, and some of the people there were the best engineers/hackers I've ever met; I wouldn't dispute that there are shops just as competent if not more so in the Eastern Bloc.

But it is relevant because there is no real legal recourse, so you're pretty much crossing your fingers and hoping that they won't screw you.


I would spend a proportionate amount of time looking into the firm and talking to past customers (where possible) before putting money anywhere, regardless of where they are located. There are plenty of legal loopholes for domestic firms to deliver a less than desirable product and run away with the fees.

I too will say that all of the Eastern Bloc developers I have worked with, both on site and remotely, have been excellent.


The original comment seems to have misquoted this part:

"I don’t blame them for this, they were placed in a nearly impossible situation by their customer, the (dutch, not associated with the company that built the whole thing) project manager was dictating what hardware they were going to have to run their project on (I have a suspicion why but that’s another story)"

Sounds like the project manager was also outsourced to a 3rd entity and may have had some vested interest in the hardware part of the project. This just goes to show that the financial interests of those steering the "ship" better be aligned with those of the "ship" itself, or hilarity/misery ensues.


In case that wasn't clear from the article, the project manager was a local. I'll update to make this clearer.


Perhaps he was mentioning that not to disparage them, but to explain why trying to communicate with them would be somewhat more difficult than a shop located in the same time zone?

Certainly, many EB programmers are very highly-skilled.


Definitely not to disparage them.


I didn't take it that way. It is just the systemic risk of outsourcing. If a person is not in the same building as you, you are making a leap of faith (and sometimes even with a team in the same building).

I am sure that if we ran a "Poll HN: At what distance away from you was the team/person that turned your project into a nightmare?" we would find a pretty even distribution across the brackets.


There are great coders and bad coders everywhere, but the more distant the team is (geographically and culturally) the more difficult it gets to evaluate them: It's more difficult to do interviews, it's more difficult to get background information and friends-of-friends recommendations etc.


Not limited to software even. Whether it is a building contractor, an employee/r, the government, etc, if you do not have accountability in the system people will simply exploit it.

It is a truth of human nature, we are min-maxers to the core.


This is all I see in IT over the last 20+ years. People that are IT directors (CIO, VP of "Engineering", etc.) that really have no idea what they're doing either on the financial side (budgeting, accounting, etc.) or the IT side.

It's how you end up with a guy from marketing as the CTO of the Sony branch that got hacked. This is every single company I've been in that's not a software provider. The thing is, they are ALL software providers now, they just don't know it yet.


To be fair, it's incredibly difficult to find people who are both technically and financially competent.


To my mind, this is a sign we're organizing things poorly.

If we design our organizations such that one person can make or break each project and we know there are way too few people to fill those roles, then frequent failures like this are the expected outcome.

In particular, I think the role of "project manager" is mostly wrongheaded. The theory is that you can take a person who knows nothing about the details, give them total power and perverse incentives, and then expect things to turn out well. I think it's only popular because it's an artifact of our current managerialist business culture.

I much prefer cross-functional teams where the team as a whole is accountable for results. I also think we technical people need to stop thinking of ourselves as minions and instead act as professionals. If a "project manager" told a professional (like a doctor or a structural engineer) to do something unsafe or wrongheaded, they'd say no. But software developers routinely go on building something terrible after token protest. I'd love to see that change.


Yep, that is usually why they get paid the big bucks, for their competency.


Then don't rely on a single person to be both. Don't give full control without oversight to a single person for such an important project.


Oh god, that's me! I was offered a position as CTO two years ago, couldn't turn it down. Coming from a very strong technical background, the financial side of things has been challenging. I have never had to plan budgets before. I'm learning. CFO has been hands-on with me, helping as much as he can. First year I overestimated by 10%, second year by 20%. We are doing well, so that was a non-issue and my overshooting went into bonuses + new software licenses for the devs.

How in the world do you get more financial experience, short of bashing your head against it like I'm doing? Are there books you can buy that don't suck? We're going to double our development staff in Q1 (10->20) which is sort of terrifying....wish I had more knowledge!


The key challenge is that if you're very good at technology and finance, you probably aren't in an internal IT job. You're much more likely to be in financial services or a technology company. This is why so many internal projects become a mess.


May I suggest the book "Waltzing with Bears" by Tom DeMarco and Tim Lister. Its simple tagline is "risk management is project management for grown ups".

As CEO your job (IMHO!) is not to "manage" the tasks so the project hits its deadlines. It's to ensure that the risks in the project are recognised, treated as likely to actually occur, and that some action is taken beyond crossing your fingers, so that the project survives the risk becoming reality.

That might mean tests, or in-house developers, or encouraging people to give bad news; it might mean hiring jacques before launch, or building two teams to compete for launch, whatever.

Just focus on the risks. Not the nice optimum path of perfectly executed tasks.


Antifragile by Nassim Taleb is also on the same theme. He argues that humans, and life in general, are (or should be) designed for robustness, not for optimality. Note that he is not arguing against taking risk, but for better preparation for risks (both known and unknown).


Read the article again: it appears that the contractors said over and over "Hey, this software isn't ready yet, please please please let us fix this", and the PM said, "lol no fuckit shipit".


Ok, what does it take to enable an opcode cache? It is not only about the software, it is about many no-brainers.


> Ok, what does it take to enable an opcode cache? It is not only about the software, it is about many no-brainers.

From having worked on timelines like this, it was probably less about the time to enable it and more about "if we turn on the opcode cache and that breaks part of our code, we don't have time to fix it."

Solution w/ nebulous oversight and looming deadlines? Punt it to the ops team.


Agreed, there might not be an ops team either; it is a devops culture. No one on the team was brave enough to take the risk. I am sure there are no unit tests either to ensure nothing breaks.


I would venture to guess the pressure to release an immature product was not dictated by the PM but came from the highest level of management. The PM was just a conduit for pressure from the business side. A good PM would have the backbone to say no and shield the development teams from the release pressure. The technically clueless party is not a single person but the whole group.


Except that in this case that wasn't true. But that's usually how it would go, you're right about that.


Great write-up, Jacques!

I have a question about how you find such work - or possibly, how such work finds you :)

For pretty much my entire career, since 1998 or so, I've been the guy you go to when something's gone wrong and no one knows why. Unknown codebase using an unknown language on an unknown platform? I don't know how I do it, but I have a talent that lets me figure such things out when all others have failed, fast.

It feels like this should be a valuable skill; is it simply through meeting enough people that you're the go-to guy when situations like this flare up?


As a rule the work finds me. I've been at this for a very long time (fixing things, working under pressure, good reputation) and that really helps.


I do similar things - DD, system repair (though I am usually much further up the stack to include product, organizational, and business structure issues), etc. Unfortunately it is often on a favor basis rather than paid.

I've been tempted several times to form the League of Extraordinary Gentlepersons or something similar. You broke it, we fix it.


That's more or less what I've been doing but informally.

Making a corporate structure for something like this is hard; I came up with a model a couple of years ago called 'The Modular Company', but the bookkeeping is a nightmare if you want it to be fair.

Still, I keep coming back to it and one of these days, who knows...


I'm in an odd position of trying to form something like this (focused specifically on python development in finance). 2015 is hopefully going to be a good year - I actually have sort-of backing and if I can find time it might take off (god, just reading those caveats makes me wonder)

Anyway - I would be interested in any thoughts or traps to avoid - if indeed we are talking about the same structures.

Do drop me a line (contact details in profile)


Like in coding, complexity in a business partnership might be a warning. The book Managing the Professional Service Firm by David Maister has a lot of lessons for anyone running a traditional partner/associates firm, and the last part talks about splitting the profits. My read is that a "dumb" approach (equal split or simple seniority-based) is okay, and a "judgment" approach (publish criteria, but have a committee make the final decisions) is okay, but a metrics-based approach is risky. I agree, and for small operations I think the simpler the better. Anyway, if it's something you're thinking about, the book is a solid, meaty read.


ok, I'll do that tomorrow. 1:44 am now, bedtime for me.


Yeah, maybe just consulting? I dunno. Getting paid would be nice, I usually just have an investor-type relationship.


> I usually just have an investor-type relationship.

Then in a way you do get paid (eventually, I hope...).

> Yeah, maybe just consulting?

That's one way to keep it simple.

But there has to be a better way, especially because the nicest flowers grow at the edge of the abyss.


Then necessarily that requires taking risk. Do you want shares in the companies that get to the point where they need your help?


That really is the key question, isn't it? I can see some ways out of that and some sets of companies (not many) where that would be the case, but it's going to be hard to turn them around once they reach that stage. Maximum risk = maximum potential gain, it's never been any different.


Sure. But consider adverse selection: the ones that need the help are in a different part of the spectrum than the ones that don't.


I would really love to see that, along with the occasional writeup from your adventures. (And don't forget to invite rachelbythebay, her troubleshooting posts are some of my favorite HN content.)


If you haven't read it yet this is the book you might enjoy: The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win.


Hmm, the book description makes it sound more like Yet Another Process Management book (e.g. "Lean *", "Waterfall *", "Six Sigma *", etc.) rather than a collection of nuts-and-bolts war stories. Is that not accurate?

I leave the process management stuff up to other folks, they're better salespeople than I am. They get paid to beat other people into submission; I get paid to beat servers into submission. :-) (Though not in Jacques' or Rachel's league.)


That would be awesome. I've occasionally done rescues and it can be really fun. So often people aren't willing to listen about organizational problems. But once you have solved some ugly, expensive technical problem, they are suddenly much more willing to consider changes that will prevent them from having the same problem in the future.

Perhaps some sort of league would solve some of the problems in doing that sort of work. E.g., the burstiness, the sudden need for specialized skills, the pipeline issues.


What is DD short for? My google powers are not strong enough...


Due Diligence?


It's also the hull class designation for a destroyer.


Due diligence.


Old Unix tool. See also ddrescue, a (couple of) nifty variation(s) on the theme.


This is true on a lot of levels. There are very few "Mr Fixits" so when problems happen, people look to who fixed the last one.

I used to be pulled into projects where the managerial side was messed up a lot. Usually the rep from one project is what would get me pulled in to the next.

An old non-technical version of this story is at http://www.amazon.com/Calumet-K-Samuel-Merwin/dp/1561141453


>>how such work finds you :)

I think you need to work towards a system and network which will help you get that kind of work. I know of a person who has worked and works only on what he considers premium projects.

One piece of advice I got from him was that I should think about work in terms of projects and not companies. All companies have great projects and routine 'keep the wheels turning' kind of work. Your chances of ending up working on the routine, boring projects, even in big successful companies, are very high. Plus most companies have closed allocation policies, and tend to execute critical projects from one specific geographical location.

So it helps to look at one's career in terms of which projects you wish to work on, and not which company you wish to work for.

Another factor is to seek out and work with smart people. Once you've been part of a good team, proven your worth and are actively seeking out good work, you are always going to find someone in your network who will get you a good project, and then you use the new opportunity and the new connections to get more.


I have the same "talent". Exploited it for a while via a former employer who runs an agency for temporary contract workers and who regularly ran into such jobs, also because he was in a network with several other such agencies.

Nice gigs, usually no more than a few days, and working with tech I had no real experience in and otherwise wouldn't encounter.

Usually very similar problems: system slows down to a crawl or completely stops working because it hits some bottleneck (often after years of working flawlessly), original supplier no longer exists. Fun things to figure out if you have no immediate stake in it.


I absolutely love this kind of work. I've been trying to think of a way to convert this into a semi-reliable job.

'Professional Troubleshooter' or something similar...


Looks like an interesting career path


It is, can't recommend it enough. You swoop in like batman & pull their bacon from the fire. Makes for good feelings all around.


I've saved many projects like this; it is work that always hunts me down, because of ex-coworkers and reputation.


I've been doing a similar kind of job for last six months or so. Except that I'm doing it for everything front-end related including UX and UI design.


I really don't envy you, there is absolutely no way I could consistently work this hard for 6 months at a stretch. Consider me impressed.


At least in the video game industry (or my section), that is a typical stretch. I crunched in a similar capacity for about three years straight. The burnout level is pretty significant.


Nice article - this kind of retrospective is always really interesting. One little thing that caught my eye -

It turns out that postgres has an ‘auto vacuum’ setting that when it is enabled will cause the database to go on some introspective tour every hour which was the cause of the enormous periodical loads. Disabling auto vacuum and running it once nightly when the system is very quiet anyway solved that problem.

Often vacuum problems can also be fixed by running autovacuum more often instead - this means it has less to do per run, so it should be able to keep up a little more easily. Loads of stuff on vacuum on the postgres wiki: https://wiki.postgresql.org/wiki/VacuumHeadaches#Perverse_Fe...


Yeah, this was one of the eyebrow raisers in the article. The other was a load average of 0.6 being "high".

For what it's worth, autovacuum can be enabled/disabled on a per table basis, too. Some tables need frequent vacuuming, others less frequent, and others none at all. If you manually VACUUM a table, don't forget to also ANALYZE it!
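
For anyone curious what that looks like in practice, here's a minimal sketch (assuming psycopg2 and a hypothetical table called 'events', not anything from the actual project): tune autovacuum per table so a busy table gets vacuumed more often, and pair any manual VACUUM with an ANALYZE.

    # Sketch only; 'events' is hypothetical and the settings are illustrative.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    conn.autocommit = True            # VACUUM can't run inside a transaction block
    cur = conn.cursor()

    # Vacuum 'events' once ~2% of its rows are dead (the global default is 20%),
    # so each autovacuum run has less to do and keeps up more easily.
    cur.execute("""
        ALTER TABLE events SET (
            autovacuum_vacuum_scale_factor = 0.02,
            autovacuum_analyze_scale_factor = 0.01
        )
    """)

    # Manual maintenance during a quiet window: vacuum and refresh planner stats.
    cur.execute("VACUUM ANALYZE events")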


High relative to the traffic. I could probably bring that down much further but it's pointless.


Okay, that makes more sense. The number in isolation seemed odd.


Can you tell me what load is measured in? I didn't find units anywhere in the article.


It's a little complicated - see https://en.wikipedia.org/wiki/Load_(computing)#Unix-style_lo...

Essentially, if (Load average / number of CPUs) > 1 (for CPU bound work) then your system is overloaded.
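
A quick Python sketch of that rule of thumb (assuming a Unix-like box; os.getloadavg() isn't available on Windows):

    import os

    load1, load5, load15 = os.getloadavg()   # runnable tasks averaged over 1/5/15 min
    per_core = load5 / os.cpu_count()

    # For CPU-bound work: more than 1 runnable task per core means the box is overloaded.
    print("load per core: %.2f -> %s" % (per_core, "overloaded" if per_core > 1 else "ok"))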


@jacquesm: I am curious how you approached charging for this project, given the uncertainty of what was wrong and how long it might take to fix, and the importance of this to the company. As you say, this may have saved the company. Did you work on an hourly / weekly / project basis? If you can talk about this, that is. Thanks for the writeup, this seemed like a great challenge!


No-cure, no-pay, daily rate. It had to be working by Christmas and we barely made that deadline. (Two days to spare...)


Since you normally do due-diligence: at what level of information about the project would you be willing to enter into a no-cure, no-pay contract?

That seems like an insanely risky proposition on top of the risk you already assume through consulting, failing company, hiring colleagues as temp workers etc.


This was one three-hour, one-on-one interview plus some email follow-up. Having done a lot of DD really helps though, it makes it easier to get the bigger picture clear in a hurry.

That said, there were quite a few details that made this project harder than it should have been.


Given that they were pretty desperate and had been referred to you on personal references, why did you agree to no-cure, no-pay?

Is it something they proposed or you? Seems like you could lose a lot and win little - and for them, the biggest risk was not your fee, but whether or not the system got up and running, so why bother?

Is it something you do a lot?


It's a point of honour with me. Why send an invoice if it doesn't save their bacon? Better to align my goals with theirs. Make money with the customer, not off the customer is one of my mottos and that has worked well for me over the years.


I'm guessing you might find this hard to answer but, is your day rate higher accordingly? Or is it based on other independent factors?


Depending on the perceived risks, the degree to which things have gone to pot already, how realistic the deadline is and so on, I'll be happy to adjust (both ways). In the end what matters is that they get value for their money and that I am compensated relative to the value created (or saved).


Of course it is.


Thank you for your random mind-bending insight.


Given that he works off reputation and recommendation, he probably doesn't want the reputation of getting paid for not fixing a system. The odd loss on a project now and then probably costs less than the loss of recommendations would; it creates a good feeling with the client, and I'm sure his risk analysis / due diligence before a project minimises the risk.


How detailed do you have to be on the 'cure' part? I can easily imagine some previous clients arguing the toss on something not being "fixed" to their satisfaction, and refusing to pay.


That's a really good question. If I feel that there may be a dispute over this then I'll ask for an escrow and detailed release instructions.

In all the years that I've been doing stuff like this professionally I've had one customer that didn't want to pay the full amount (they asked for a discount after the work was already done) and I told them to tear up the invoice but never call again. Everybody else was more than happy to pay. Maybe I've been lucky in that respect but I think that it's more of a way business is conducted here than anything else. You stand by your agreements, it's a small scene and word really does get around.


Personally, I want to do more stuff like this - I thrive in these sorts of projects, but haven't been able to find too many. How did you get connected in that 'small scene' to start with?


Thanks!


The flip side is that this is helping those developers stay employable.

I thought I was being all nice and responsible with money: keep it on a single server, watch out for memory and CPU, minimize harm to the environment, outsource to maximize use of our limited resources. Now I'm looking elsewhere for employment and I have no "relevant" experience.

Meanwhile, the clowns who made this mess get to claim J2EE, cloud, HA, VMWare, Redis, Angular.js, Symfony2, and a living client for their resumes, and their product didn't even work correctly.


The article does not lay the blame on the devs:

| A single clueless person in a position of trust with non technical management, an outsourced project and a huge budget, what could possibly go wrong...


If you are in a position to make technical decisions and you choose Symfony2 and don't even consider that you should have a cache or three in place, you should be dragged behind a van by your teeth. Caching in PHP has reached a point where turning it on is as simple as installing a package and setting a flag.


>"Instrumental in all this work was a system that we set up very early in the project that tracked interaction between users and the system in a fine grained manner using a large number of counters."

I know you might still have some degree of an NDA pinch preventing you from giving too many details, but if possible, can you give some more info on how you went about setting up the tracking instrumentation?

As always, a fun read. Thanks!


That's tricky to answer without making this identifiable but let me try to transpose it a bit hoping that still makes sense.

If you're running a store then at any one point in time the store contains the number of people that have ever entered minus the number of people that have left. So by just adding two counters (person entering, person leaving) you can validate the current state of the store by subtracting the second from the first and doing a quick count of the aisles. If you have more (or fewer) people in the store than you think you should have, you have either another door somewhere that you're not aware of, people are being born or dying on the premises (that might work for a hospital ;) or they're climbing out through the roof.

If the counters match there is no guarantee that that is not the case but it certainly helps to gain confidence that you know where your entrances and exits are and that people aren't keeling over while shopping in your store.

Adding a large number of checks like that will eventually give you a very quick way to test your assumptions about how things should work and to determine the impact of a change on the system. We logged all those counters on a minute-to-minute basis (1440 records per day is peanuts), and have established a number of baselines indicating what 'normal' behavior is and what 'perfect' behavior should be, and this in turn (over time) gives you a goal to shoot for.

If after a change you're below normal you've probably messed something up and should roll back; if after a change you're doing better than before then good, don't change, establish a new 'normal' in a couple of days' time and strive for 'perfect'.

This trick has made it fairly easy to steer the project in the right direction and saved us from making stupid mistakes a number of times (most notably: at some point we realized the sessions weren't cleaned up at all, but cleaning them up too fanatically caused some of the relationships between the counters to indicate that we had a problem, it didn't take too long before we realized that the session cleanup routine was the culprit, without having that system in place this would have taken much longer and would have done a lot more damage).
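
A minimal sketch of the idea in Python, sticking with the store analogy and made-up names rather than anything from the real system: bump a counter at every entry and exit point, snapshot once a minute, and check that the relationships you believe in actually hold.

    import time
    from collections import Counter

    counters = Counter()

    def person_entered():
        counters['entered'] += 1

    def person_left():
        counters['left'] += 1

    def minute_snapshot(observed_in_store):
        # Invariant: people in the store == entered - left.
        expected = counters['entered'] - counters['left']
        if expected != observed_in_store:
            print("mismatch at", int(time.time()),
                  "- unknown door, leak, or broken cleanup routine?")
        # One row per minute; 1440 records a day is peanuts.
        return {'ts': int(time.time()), 'expected': expected, 'observed': observed_in_store}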


Nice writeup! Interesting to see another perspective. I do this for a living too but very seldom get to hear others' experiences (due to the nature of the work). My niche is enterprise CRM software, so quite similar. Some comments/questions:

- It amazes me as well how quickly you can turn things around by just changing out parts of the team.

- Also nice to see how you kept parts of the old team. There are almost always skilled people even in massively failing teams. Reminds me to be humble (it could be me sitting in the wrong project the next time).

- A rewrite is almost never necessary. You tend to want to do this starting out, but normally that feeling comes in good part from not being in control and not knowing the ins and outs of the current software. Once you get to know it, it usually turns out to be not as bad (just more complex than it needs to be).

I have done this so many times now that I have my own little mental model for what to look at when I get airdropped in:

- Project management. Do they have one dedicated project manager who is reporting status correctly and frequently to the stakeholders? Are plans available and followed up, etc.?

- Product management. Are requirements from the business gathered and negotiated down to clear and concise things that can be built?

- Technical leadership. Do they use suitable technology and a proper infrastructure setup (in your case not)? Is the technical design simple to understand and not overly complex?

- Change management. Is the team communicating the coming changes effectively to the end users? Is training done and being planned correctly?

- Work process. Is there a good process with good flow from requirements to created and tested feature?

My theory goes that if one of them fails, the project usually survives anyway, the others compensate. If two or more fail, the project fails.

Finally a question. You write that it is usually not a good engagement for you financially. I would be curious to know your business model here. I end up doing these projects on an hourly rate for the most part.


> You write that it is usually not a good engagement for you financially.

Not 'not a good engagement financially', rather the opposite. Just more risky. Typically I'll do these for daily rates depending on the perceived risk but I'm pretty flexible.


Just 10,000 visitors per day on a 64-core machine with 256GB of RAM? If it were per minute (or even per hour) it would make sense; otherwise, it seems poorly designed.


I think you missed a '0' there. And not all applications are created equal, some really do have more business logic and systems requirements than others. 100K visitors on one machine is usually doable, 200K as well if the website isn't all that complex and the interactions between users aren't all that complex. Above that you're (usually, not always) going to see some clustering.

If it is just static content then you should be able to saturate your uplink from one single machine.

The 10K number applied to the whole setup, and that's now comfortably served from one box (as it should be). It could probably handle 10 times that number now without too much in terms of additional tuning (if any), above that it might require more work.


It being a PHP + VMWare system (culture-wise, not the technology itself), I was not surprised about all the madness at the DB level. They didn't use indices in Postgres but they added Redis. It's incredible how many, perhaps most, systems do crazy things like that.


No. Most systems do not do crazy things like that. That system by itself could run Google search for lots of queries/s, and it was having trouble delivering a few operations per day. Come on, that server is faster than anything that was serving worldwide massive access in the late nineties (!)


Of course, the flip side is too many indexes.


I have seen both extremes: no indexes vs. every column indexed, no cache at all vs. a cache at every level.

I have seen caches with more inserts than reads.

I have seen teams replace MySQL with NoSQL because MySQL "does not scale" for them.

This is common in early-stage funded startups with non-technical founders.


Oh I can imagine the pain...

"Let's make our change password system handle 100k requests per minute but the front page starts to get wonky at 1000 req/min"


This sounds familiar. A few days ago I overheard this: "Our user authentication system is on MySQL, it may not scale, let's move it to NoSQL."

The only time this table is touched is to check whether the user entered the correct password or not.


True. Maybe. Depends on what the users are doing, how long they are logged in...

Are they just logging in twice a day to punch in/out, like an online timeclock? That would certainly be too many resources for that amount of traffic.

But are each of the 10,000 users logged in all day, each working with large files or data sets, and doing intensive tasks?


I have built projects that had 100 users per day but was told by the CEO that they needed to be able to handle 100,000 concurrent users, and part of the acceptance testing was proving that.


Yup, story of my life building BI platforms lately. Everybody inflates their concurrency estimates, and you can question them all you want, just don't be wrong in the opposite direction.


Yup. Estimation inflation syndrome is a thing. One of my previous jobs suffered from that all the time, on database sizing.

How much space does the client need for this trading app? The DBA figures 1k transactions per day at 1 KB/record is about 1 MB of storage per day; round that up to 1 GB per year, so tell them 10 GB should do for 10 years.

CTO hears 10 GB from the DBA, adds his own safety margin factor of 10x, tells the client 100 GB.

Client hears 100 GB, adds his own safety margin factor of 10x, tells their operations 1 TB.

So now we have a client building a giant SCSI terabyte array (this was when 72 GB SCSI disks were the high end server standard) to hold a database that's a year away from reaching even one gigabyte.
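
The compounding is easy to miss, so here it is as a back-of-the-envelope Python calculation (numbers taken straight from the anecdote above):

    tx_per_day, bytes_per_record, years = 1000, 1024, 10

    actual = tx_per_day * bytes_per_record * 365 * years   # real need over 10 years
    dba = 10 * 2**30                                        # "10 GB should do"
    cto = dba * 10                                          # adds a 10x margin -> 100 GB
    client = cto * 10                                       # adds another 10x -> 1 TB

    print("actual: %.1f GB, provisioned: %d GB" % (actual / 2**30, client / 2**30))
    # -> actual: 3.5 GB, provisioned: 1024 GB (roughly 300x more than needed)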


Be glad there was no cache infrastructure. Ripping out an ill-conceived caching layer is usually a nasty and unpopular step early in an architecture rescue.


A question instantly springs to my mind -

Why all the virtualization? Lack of experience? I haven't done a large amount of work with virtualization, but stopping and thinking about it would seem to have indicated a problem with the design. Did no one look at this and say, "That's a bad idea..."?


I have my suspicions about that but I don't want to voice those here. For one, that's possibly actionable; second, there is a lot more to this story than what I can talk about publicly. I'm already very grateful they let me publish as much as I just did.

So, yes: someone did say 'that's a bad idea' and got sidelined for his effort.


I know you can't say anything, but I'm guessing that there's a nonzero chance that the guy who made the hardware decisions is a friend of the guy who sold the hardware.


That's too bad for them, good for you.


This is the kind of freelance work I love doing, it's win-win. You get to solve a complex puzzle and the people who hired you are happy that you saved their ass. These jobs often have a lot of simple things you can fix early to eliminate the immediate danger (eg opcode caching) and give you time to really dig in and fix everything.


There is a considerable amount of wisdom in this article. But there's one thing in particular that caught my eye:

"The job ended up being team work, it was way too much for a single person and I’m very fortunate to have found at least one kindred spirit at the company as well as a network of friends who jumped to my aid at first call. Quite an amazing experience to see a team of such quality materialize out of thin air and go to work as if they had been working together for years."

I have done similar sorts of things for my current employer and at previous jobs. One thing we discuss here, since the opportunity keeps popping up, is forming a specific team of people to parachute into a flailing project and get it back on track. That frequently seems to involve taking it away from the then-current developers, paring it down to "the good parts", and then rewriting the rest, but it does yield results.


I try to be as fair as possible to those involved in earlier stages of the project (management is usually not too levelheaded at this stage). In this case the earlier developers for the most part had but one single failing, they didn't put their foot down when they realized things were going pear shaped. If they had done that they might have been able to stop the disaster before it happened, but as it was they were too intimidated to draw a line in the sand. I'm pretty sure they learned a bunch of lessons on this project, they're no angels but they're definitely not the bad guys.


Putting your foot down can end badly as well, as you may simply be removed from the project if you don't agree to the deadlines.

Then management wonder why it failed.


I'd rather be removed from a project than to agree to something that can't be done.


I've also come to that conclusion after a number of years of freelancing, but I understand the other point of view as well.

While freelancing, especially for people in another country, it isn't uncommon for customers to just disappear without any trace (or pay) half-way through a project. As such, when confronted with some problem mid-project, finishing the job in a half-assed manner can be a way of ensuring you get paid. Of course, a better method is making sure you have a solid plan before agreeing to the job, and trying to avoid unreliable customers - but accomplishing this can be difficult in any setting, even non-freelancing.

On the customer side, making sure you have a number of reasonably sized milestones and pay for them immediately on delivery can help keep freelancers confident, and thus encourage better quality work.


I don’t mind doing these jobs, they take a lot of energy and they are pretty risky for me financially but in the end when - if - you can turn the thing around it is very satisfying.

What about the jobs is financially risky? Do you have a downside beyond "might not get paid if the company fails"?


It concentrates a lot of time on a single customer which means I may have to say 'no' to my other, repeat customers for shorter jobs. My line of business is normally technical due diligence (helping investors to make savvy decisions about where to invest and where definitely not to invest). That work is usually very short term (typically a week) and there is no guessing ahead of time when a job will come up. So when one of my customers calls I'm supposed to be up and running within a day or so.

Another risk is that when I call my friends in to assist I assume their risk of not getting paid, in other words, if the company would not be able to meet its obligations I would make sure my friends and colleagues would be made whole (those relationships are worth more to me than any job ever would be).


Great write up.

There's likely an alternative scenario where a consultant runs into a different but similar set of problems with a company that has mis-configured their Rails app across multiple AWS EC2 machines, in the wrong security groups, with their EBS settings tuned improperly for their MySQL instances. All resulting in extremely poor performance of their flagship application which is costing them a lot of business.


Was the innodb buffer pool set to the default as well?


Thanks for the write-up. Having been through this on the inside - a company implosion where the few of us who stayed needed to save the software from all the poor decisions made for all the right reasons - I can say it's made me a better programmer.

Not a job for the faint of heart, especially when it's your own history you are now fixing, and I appreciate seeing the experience of someone else.


Nice read Jacques :-)

Though it's the typical story of any company/person that assumes a framework is great for their problem or product, not realising what does and does not happen in the background. One has to understand perfectly which cogs, axles and wheels turn when an operation is done, and know which wheels always do the exact same thing (apply caches there), etc. etc.

But more importantly, best wishes for 2015 from near your office,

RB


Hehe. So that was him, then :)


This happens quite a lot actually. Premature optimization with the basics not being figured out. Always get your database indices figured out first, and then cache after that. Pick a reasonable place to start scaling horizontally, but only after you've reached the sweet spot of what one fairly powerful instance can deal with.
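
In the Postgres case that can be as simple as checking the query plan before reaching for a cache. A hedged sketch, assuming psycopg2 and a hypothetical 'posts' table with a hot query on user_id (nothing here is from the actual system):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    # Add the missing index for the hot query path.
    cur.execute("CREATE INDEX posts_user_id_idx ON posts (user_id)")
    conn.commit()

    # Confirm the planner actually uses it before adding any caching layer.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM posts WHERE user_id = %s", (42,))
    for (line,) in cur.fetchall():
        print(line)   # expect 'Index Scan using posts_user_id_idx', not 'Seq Scan'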


"The traffic levels were incredibly low for a system this size (< 10K visitors daily) and still it wouldn’t perform."

This kind of thing irritates me. User numbers are important, financially, because "10,000 users daily" can tell an investor or manager how much money is involved. But technically? That number doesn't mean anything to me. Are the visitors making one request or a hundred? Are they clustered into the five minutes before and after a horserace or are they spread out?


Being more specific would risk allowing the company to be identified, but you're absolutely right that just quoting user numbers by themselves is not going to be much help. Consider adding the words 'within the context of this application' wherever such metrics are used.

As far as interaction goes I would qualify this particular product as halfway between twitter and a social bookmarking site. More interaction than HN but significantly less complex than twitter. Both twitter and HN are deceptively simple on the outside but remarkably complex underneath, so maybe I'm overstating the complexity level but it's not too far off the mark. By my estimate, and using my own websites as a benchmark, they should be able to run their current product on a single machine up to or over 100K users daily (using their current set of technologies); session times and concurrency of course play into that heavily.


I understand, and I'm sorry I seemed to be specifically targeting you. It's more of a general complaint: I've seen too many people using that kind of measure in a context where it's really not appropriate.


Great story! Thanks for this. I'm glad it had a happy ending as so many of these situations do not. It takes courage and vision to admit you have a really big mess on your hands and need expert help.

As for the clueless PM, I have met far too many of these in my travels. If you can't write software, what makes you think you can 'manage' a software development project?


My favourite part was the use of (graphite-like?) counters to monitor changes and make implicit assertions about relationships in the system (i.e. if we push this metric down, that metric will go up by the same amount).

It's a really useful trick to stop yourself believing that the system works the way you think it does just because you think it does.


The key quote for me was "First you scale ‘up’ as far as you can, then you scale ‘out’". I see so many job postings involving "web scale" tech that make me kind of suspicious. Do they really need it?


[deleted]


< 100K users / day or so?


This is a really good article; novices can almost use it as a "how to scale X" or "scalability and optimization how-to". Good read.


Not to downplay the work of the OP, but the system he talks about seemed like a feast of low hanging fruit :)


Sounds like the original developers were either incredibly incompetent, or wanted to guarantee themselves future work.... 100+ VMs? ridiculous. No op code cache? No memcache or other forms of caching? Stock DB settings? No indexes?

This is all stuff that is so basic. I gotta laugh.

And I have to wonder how bad the actual code was...


He even said the developers were competent...at developing. They'd released some not-so-good software towards the end of the project due to "manager pressure".

It sounds like they didn't have a good systems person, though, nor good (and general) software leadership; often good software leadership is also your early-stage systems person. Jacques here acted as their systems integrator to save the day, along with what sounds like mild programming support to clean up some of the unfinished software product that got pushed out too early.


Spot on.


I'm skeptical about the developer competency. A good developer knows at least something about indexes and how to create them. A good developer would recognize that 100 VMs is almost certainly unnecessary for a system handling such a small amount of traffic, etc. etc.



