
Startup Fuck-ups: How we lost 25% of our monthly revenue overnight - noellep
https://medium.com/@insync/startup-fuck-ups-how-we-lost-25-of-our-monthly-revenue-overnight-1e3529aa9e01
======
shizcakes
This could be solved an even more fundamental way: Don't run your own mailer
as a startup. There are lots of companies that will be responsible for email
deliverability on your behalf, via an API. If it took them 3 months to notice
no mail was being sent at all, imagine how long it's going to take them to
figure out that their IP is blacklisted in Spamhaus or any number of other
deliverability issues?

~~~
spacefight
Not everyone wants to outsource transactional mail handling to a third party.

~~~
rtpg
Just because you don't want to doesn't mean you shouldn't.

The major advantage to outsourcing it is that the people you're handing the
job over to actually know what they're doing and are experts at it. Plus they
can spend 24 hours a day checking this stuff instead of you.

The arguments for being in-house on absolutely anything but your core
competency when you're an early-stage startup are really hard to justify.

~~~
Cyranix
Especially with the number of services that offer zero-cost plans at low usage
levels! You can rarely even make a fiscal argument for keeping this kind of
stuff in-house, much less a skills argument.

------
wmt
_What we found was that a number of failed jobs were being kept by the system,
which meant that these were taking up a ton of space that they shouldn’t have
been.

To fix the issue, we put together a script to delete the failed items, since
any retries to send them didn’t appear to work._

At this point my head was screaming "NOOOOooooo!" and I felt bad for the
author for whatever disaster would soon follow.

Not only was the problem not fixed, it wasn't even understood. Hiding the
problem by fixing its symptoms will rarely get you far. I don’t think I'm even
Captain Hindsighting here, as I've learned over and over that not
understanding the root cause of any issue means you will be screwed by the
issue sooner or later, and it likely will not be pretty.

Sure, sometimes you don't have the time to get to the bottom of an issue, but
even then you cannot pretend that it's fixed. It'll be back with a vengeance.

~~~
Bluecobra
Well said. This is the end result of letting developers perform system
administration tasks. I doubt that they were able to Google the root cause on
Stack Overflow. :)

------
rbadaro
I don't think terms like "Fuck-ups" or "screwed" belong in corporate
communications, start-up or not. It's cool they are talking about this openly,
but unfortunately what I took from their write-up is that their communication
style is less than professional.

~~~
protonfish
I don't care what language they use, personally. However, I am at work and
having the F-word in 72pt font blaring from the top of my browser made me
scroll down very quickly. I doubt my bosses would be thrilled to see that.

~~~
meritt
Man, I'm sorry, but what sort of job do you have where seeing the word "Fuck"
is like an actual job risk?

~~~
protonfish
It pays awesome and is easy work. If the handbook prohibited saying the word
"zucchini" under penalty of dismissal I'd happily strike the word from my
vocabulary.

------
wrs
Further evidence that "if you aren't monitoring it, it isn't happening"!
Ensuring that code is monitorable needs to be right up there with ensuring
it's testable.

------
derwiki
"Small" is relative -- at the "small startup" I run, our email volume is low
enough that I BCC myself on every email the site sends. Poor man's monitoring
and it won't scale, but I usually notice within hours if something has
broken.

~~~
aqeel
Where I work, BCC-ing on all mails is not scalable. Instead a random sample of
the mails is sent to ourselves.
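
A rough sketch of that sampling idea in Python, for anyone who wants to try it. The function name `maybe_bcc_monitor`, the monitor address, and the 1% default rate are all made up for illustration; nothing here is from the article or from the setup described above. It assumes the message has no Bcc header yet and that it is later sent with something like `smtplib.SMTP.send_message`, which delivers to Bcc recipients without transmitting the header.

```python
import random
from email.message import EmailMessage


def maybe_bcc_monitor(msg, monitor_addr, sample_rate=0.01, rng=random):
    """Add a Bcc to roughly `sample_rate` of outgoing messages.

    Hypothetical helper: assumes `msg` does not already carry a Bcc header.
    """
    if rng.random() < sample_rate:
        msg["Bcc"] = monitor_addr
    return msg


# Example: build a message and let the sampler decide whether to copy us.
msg = EmailMessage()
msg["From"] = "site@example.com"
msg["To"] = "user@example.com"
msg["Subject"] = "Checking in"
msg.set_content("Hello!")
maybe_bcc_monitor(msg, "monitor@example.com", sample_rate=0.01)
```

The sample rate is a trade-off: too low and a total outage takes a while to show up as "no monitor mail today", too high and you are back to BCC-ing everything.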

------
Someone1234
As someone who has created quite a few of these mailers, the queue getting
stuck on a single piece of mail and hanging indefinitely is incredibly common.
As time has gone on my solutions have become simpler and more pragmatic, since
additional complexity breeds additional problems.

For example, if I was going to design an emailer today:

\- Grab the email from a database, save it to a file (likely one or several XML
files), and place it in an "Outgoing" directory (ye olde file system).

\- Then have a process which grabs an atomic lock (only one running at a
time!), gets the directory listings, and launches the actual "sender" for
every file individually (concurrently).

\- When the launcher launches the sender it records the PIDs of the process
against the actual emails/XML files internally.

\- After a set wait period if any processes are still running, the launcher
kills them, and moves the email/XML into a "Failed" directory which we monitor
independently.

\- Every email which is sent gets moved to an "Archive" directory by the
sender process, and we monitor that to see if no emails have been archived for
a long time (e.g. 30 minutes).

You can accomplish the same thing using a database (Outgoing, Archive, and
Failed tables), but frankly with so many awesome file system tools already
around it doesn't make sense to reinvent that wheel. Plus people intuitively
understand that if a file is sitting in the "Outgoing" or "Failed" directories
then it hasn't been sent yet (just like your client would!).
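
A minimal sketch of the launcher steps above in Python, assuming a POSIX system (it uses `fcntl.flock` for the lock). The names `run_launcher`, `archive_is_stale`, `launcher.lock`, and the `sender_cmd` argument are hypothetical; the design above doesn't name a language or any tooling. It also assumes that a sender which exits without moving its file to "Archive" has failed, which the design doesn't spell out.

```python
import fcntl
import shutil
import subprocess
import time
from pathlib import Path


def run_launcher(base, sender_cmd, timeout=60.0):
    """One launcher pass over base/Outgoing.

    Returns the file names moved to Failed, or None if another
    launcher already holds the lock.
    """
    base = Path(base)
    outgoing, failed = base / "Outgoing", base / "Failed"
    for d in (outgoing, failed, base / "Archive"):
        d.mkdir(parents=True, exist_ok=True)

    with open(base / "launcher.lock", "w") as lock:
        try:
            # Non-blocking exclusive lock: only one launcher at a time.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None

        # Start one sender per file and remember which process owns it.
        procs = {}
        for path in sorted(outgoing.iterdir()):
            if path.is_file():
                procs[path] = subprocess.Popen([*sender_cmd, str(path)])

        deadline = time.monotonic() + timeout
        failures = []
        for path, proc in procs.items():
            try:
                proc.wait(timeout=max(0.0, deadline - time.monotonic()))
            except subprocess.TimeoutExpired:
                proc.kill()  # sender is hung: kill it
                proc.wait()
            # Assumption: a successful sender already moved the file to
            # Archive, so anything still in Outgoing counts as failed.
            if path.exists():
                shutil.move(str(path), str(failed / path.name))
                failures.append(path.name)
        return failures


def archive_is_stale(base, max_age=30 * 60, now=None):
    """True if nothing was added to Archive in the last max_age seconds.

    Uses the directory's own mtime, which changes when an entry is added.
    """
    mtime = (Path(base) / "Archive").stat().st_mtime
    return (time.time() if now is None else now) - mtime > max_age
```

You would run `run_launcher` from cron or a loop, with `sender_cmd` pointing at whatever script sends one file and moves it to "Archive", and alert when "Failed" is non-empty or `archive_is_stale` returns True.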

~~~
brusch64
Sounds like you just recreated Microsoft Biztalk.

But to sound less like a dick - communicating with a low probability of errors
is hard!

~~~
specialist
Having implemented, deployed, and supported "workflows" using BizTalk,
ICAN/JCAPS, Orion, homegrown JMS-based stacks, Cache & Ensemble, etc...

Someone1234's solution is the antithesis of BizTalk, the complete refutation
of the ultimate futility of using workflow engines.

------
lazyant
If you don't outsource this type of service (preferred solution imho) then
from the very beginning you have to monitor internally (the solution done
after the fact) and also and most importantly externally, in this case having
one or more monitored client-like email accounts.

------
nasalgoat
I have a script that sets up 90% of a full nagios/icinga server automatically
in about 5 minutes.

Why, in 2014, are people still not monitoring everything as job #1?

Why isn't this being taught in schools? How do people with tech jobs not know
this?

------
meritt
Make sure your mistakes never directly affect your consumers. Don't spam them,
don't overload them with ads, and don't leak their PII. Quickest way to lose
customers.

------
baudehlo
I'd really like to know what the mail server software was that failed this
way.

------
pbhjpbhj
No one in the company was subscribed to their mailings?

Hope they're monitoring their backups.

------
Animats
_scheduled emails to check in with our users._

That's not "transactional email". That's spam.

You're a spammer. Die.

------
general_failure
"I am single, 41 and pregnant"...

I don't know when pregnancy became a part of a person's identity. I guess this
adds to the coolness factor these days because you have to try harder? Sigh.

~~~
amouat
Where does it say this? (I guess it's been removed?)

And why does it matter to you whether or not the author considers being
pregnant to be part of her identity?

IMO it would be more healthy to comment on the content of the article than
the author's identity.

