
Scaling lessons learned at Dropbox, part 1 - eranki
http://eranki.tumblr.com/post/27076431887/scaling-lessons-learned-at-dropbox-part-1
======
jgannonjr
Great post, but this part scares me a bit...

 _I think a lot of services (even banks!) have serious security problems and
seem to be able to weather a small PR storm. So figure it out if it really is
important to you (are you worth hacking? do you actually care if you’re
hacked? is it worth the engineering or product cost?) before you go and lock
down everything._

Just because you can "afford" to be hacked, doesn't mean you shouldn't take
all the steps necessary to proactively protect your data. In the end, security
is not about you, it is about your users. This is exactly the type of attitude
that leads to all the massive breaches we have been seeing recently. Sure your
company is "hurt" with bad PR, but really your users are the ones who are the
real victims. You should consider their risk (especially with something as
sensitive as people's files!) before you consider your own company's well
being.

Edit: formatting

~~~
jandrewrogers
Yeah, that point significantly underestimates the cost of cleaning up once
your systems have been penetrated. By the time you notice that one system has
been compromised, there is no guarantee that every system at your company is
not compromised, particularly if so little effort is put into a robust
security architecture. I've seen companies that took the attitude the author
does and ended up paying for it down the road.

Systems get compromised, it happens. Organizations with weak security
architectures can become so compromised that cleanup becomes a nightmare
because it is difficult to isolate the threat(s) without serious disruption in
services. A strong security architecture is not so much to ensure breaches
never happen but to limit the amount of damage likely to occur when breaches
do happen.

And yes, this happens even to organizations that think they have nothing worth
hacking.

~~~
jgannonjr
You are absolutely correct; I have consulted with several companies, large and
medium sized, who have had this exact thing happen. Just to quote the article
again:

 _Having internal firewalls between servers that don’t need to talk to each
other — again a good idea. But if your service doesn’t actually need this,
don’t necessarily do it_

I can not think of any reason why "your service doesn't actually need this"
and "don't necessarily do it". I understand that it costs money to do these
things, but setting up a firewall is relatively cheap, significantly less than
the cost of the additional cleanup if the breach is not contained.

Security, in a way, can be compared to insurance. Sure, if you are young and
live a healthy lifestyle you may not necessarily see the need to spend $100+
a month for a health insurance policy, you can save a bunch of money... but if
an accident does happen, you can rest assured it will cost you _significantly_
more than if you had just bought the insurance in the first place.

This, in a sense, is the security tradeoff.

I think really smart engineers who are well versed in security can know where
security needs to be, and yes it is possible to go overboard, but I think this
is the exception rather than the rule. Advising readers that it's ok to not
worry too much about security because:

 _lot of services (even banks!) have serious security problems_

is absolutely ridiculous and is horrible advice.

------
brc
The idea of running extra load - it sounds good in theory but I can't help
thinking that it's a bit like setting your watch forwards to try and stop
being late for things. Eventually you know your watch is 5 minutes fast so
start compensating for it. I wonder if this strategy starts to have the same
effect - putting fixes off because you know you can pull the extra load before
it becomes critical. In the same way you leave for the train a couple of
minutes later because you know your watch is actually running fast.

~~~
apu
I actually purposely used to set all the various clocks at home ahead by
anywhere from 0 - 15 minutes. At first, I could remember which ones were ahead
by how much, but then soon I started to forget and had to just assume they
were running at the right time. It worked great.

After a few years of this, I set them all back to right time and found that I
had trained myself to just leave at the right time, with no more trickery
needed.

~~~
inerte
I only have one alarm. If it fails I am late. I found out that depending on
complex systems works against you.

Once I had three wake up alarms, at different points in the bedroom. Didn't
work.

Being late is lame. Suffering its consequences is the best teacher one can
have.

~~~
chipsy
Were you the college roommate I had who suspended the alarm on a string over
the bed so that standing was required to turn it off?

~~~
pestaa
I saw an alarm that had a propeller on top of it. The clock made it fly and
you had to hunt it down (even if it falls behind the furniture) to shut the
alarm.

~~~
andrewflnr
I've heard of alarms on wheels that run away from you but that takes the cake.

------
nl
I wish he'd left the security advice out.

The whole post was excellent, but all the useful points will now be
overshadowed by the armchair quarterbacking about security by people who
mostly don't understand that _ALL_ security is a compromise, and it is as
important to _understand_ and make deliberate decisions about your security as
it is to try to make a secure system in the first place.

~~~
bestes
I'm glad he put the security notes in. It is so hard to get true facts about
how things are _actually_ done.

~~~
mturmon
Looking again at the post, I think the author was in fact rather careful to
_not_ give away anything about security practices at Dropbox when he was
there, for obvious reasons.

He keeps many comments at a high level (security/convenience) and refers to a
few non-Dropbox examples.

------
dools
_but I really hate ORM’s and this was just a giant nuisance to deal with_

I like object relational mapping as a theory (ie. I have an object of type
Author which has 1 or more books I can loop over), but I hate ActiveRecord
implementations. Eventually, they just end up implementing almost all of SQL
but in some arcane bullshit syntax or sequence of method calls that you have
to spend a bunch of time learning.

I also seriously doubt that anyone has ever written a production system of any
reasonable complexity and been able to use the exact same ORM code with
absolutely any backend (if you have an example please correct me on this).
This barely even works with something like PDO in PHP which is a bare bones
abstraction across multiple SQL backends.

When it comes down to it, the benefits of ActiveRecord are all but dead on
about the third day of development. The data mapper pattern adopted by
SQLAlchemy (et al.) takes all of the shitness of ActiveRecord and adds mind
bending complexity to it.

SQL is easy to learn and very expressive. Why try and abstract it?

I spent years working with an ActiveRecord ORM I wrote myself in my feckless
youth and thought that it was the answer to the world's problems. I didn't
really understand why it was so terrible until I did a large project in Django
and had to use someone _else's_ ORM.

When I really analysed it, there were only three things that I really wanted
out of an ORM:

1) Make the task of writing complex join statements a bit less tedious

2) Make the task of writing a sub-set of very basic where clauses slightly
less tedious

3) Obviate the need for me to detect primary key changes when iterating over a
joined result set to detect changes in an object (for example, looping over a
list of Authors and their Books)

To that end, I wrote this:

<https://github.com/iaindooley/PluSQL>

It's written in PHP because I like and use PHP but it's a very simple pattern
that I would like to see elaborated upon/taken to other languages as I think
it provides just the bare minimum amount of functionality to give some real
productivity gains without creating a steep learning curve, performance trade-
off or any barrier to just writing out SQL statements if that's the fastest
way to solve the problem at hand.
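
For anyone wondering what point 3 looks like in practice, here is a minimal sketch in Python rather than PHP. It is not PluSQL's API; the rows, column order, and the `group_books_by_author` helper are made up for illustration, and it assumes the query already orders rows by author.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows from something like:
#   SELECT a.id, a.name, b.title FROM author a JOIN book b ON ... ORDER BY a.id
rows = [
    (1, "Ada", "Notes"),
    (1, "Ada", "Letters"),
    (2, "Bob", "Memoir"),
]

def group_books_by_author(rows):
    """Collapse a joined, author-ordered result set into (name, [titles]).

    groupby starts a new group whenever the (id, name) key changes, which
    replaces the manual "did the primary key change?" check in the loop.
    """
    grouped = []
    for (author_id, name), author_rows in groupby(rows, key=itemgetter(0, 1)):
        grouped.append((name, [title for _, _, title in author_rows]))
    return grouped

authors = group_books_by_author(rows)
```

The grouping only works because the rows arrive sorted by author; an unordered join would split one author into several groups.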

~~~
arohner
> I also seriously doubt that anyone has ever written a production system of
> any reasonable complexity and been able to use the exact same ORM code with
> absolutely any backend (if you have an example please correct me on this).

You're entirely right here, because databases are different. For example, (I
forget the exact details), "select count(*)..." in MySQL is O(1), but it's
O(log n) or O(n) in Postgres, depending on indices. That's a detail no ORM is
going to save you from.

> SQL is easy to learn and very expressive.

Strongly disagree. The reason everyone keeps trying to write ORMs is because
1) SQL is a shitty language and 2) it's not the language that programmers want
to use. Write a better frontend language for Postgres, and the ORMs would
disappear.

I strongly suspect that would take some of the wind out of the NoSQL crowd.
There are certainly NoSQL deployments that would have a hard time on
traditional RDBMS, but there are a lot of other places that use Mongo just
because they don't like SQL-the-language, rather than Postgres-the-DB.

~~~
batista
No, actually the only reason is "it's not a language programmers want to use".

It is very much non shitty.

It's just that lots of programmers, especially OO-minded ones, cannot get into its
mindset, and use it for what it is, they have to put a lame OO abstraction on
top.

Functional programmers should fare better in this regard (or Prolog
programmers, if they still exist).

If you really want to abstract it, something like LINQ is a better way.

~~~
jmathai
I agree. I see SQL similarly to regular expressions. There's a handful of
commands which let you do a lot of stuff.

The hard part in SQL is optimization which requires really understanding how
the underlying database engine optimizes and executes the query.

Optimizing complex queries is no joke. It's one of the reasons noSQL seems
nice at a glance. You can do the optimizations by adding lots of indexes or
using application logic. In reality, it's a tradeoff for other problems.

------
misiti3780
Great advice:

"pick lightweight things that are known to work and see a lot of use outside
your company, or else be prepared to become the “primary contributor” to the
project."

------
prayag
Fabulous post. Thanks for writing.

One point it misses though is to test your backup strategy often. When you
scale fast things break very often and it's good to be in the practice of
restoring from backups every now and then.

~~~
mirkules
Just started reading a book called "High Performance MySQL" and in one of the
early pages, the following advice appears:

"It's an excellent idea to run a realistic load simulation on a test server
and then literally pull the power plug. The firsthand experience of recovering
from a crash is priceless. It saves nasty surprises later."

Same goes for testing network connectivity and failover. I can't tell you how
many times I've heard things like "The automatic recovery _should_ have kicked
in but..."

Having a recovery procedure and backup strategy is _completely_ different from
having actually restored a backup and recovered from a failure.
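
A rough sketch of what a scripted restore drill could look like, using SQLite from the Python standard library as a stand-in for MySQL (an assumption made only so the example is self-contained). The `files` table and the `restore_drill` helper are hypothetical.

```python
import sqlite3

def restore_drill(live):
    """Copy `live` into a fresh database and check that the copy is usable."""
    restored = sqlite3.connect(":memory:")
    live.backup(restored)  # sqlite3's online backup API (Python 3.7+)

    # Two cheap checks: SQLite's own integrity check, and a row-count
    # comparison on a hypothetical `files` table.
    ok = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    live_rows = live.execute("SELECT COUNT(*) FROM files").fetchone()[0]
    restored_rows = restored.execute("SELECT COUNT(*) FROM files").fetchone()[0]
    restored.close()
    return ok and live_rows == restored_rows

# Stand-in for the production database.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT)")
live.executemany("INSERT INTO files (name) VALUES (?)", [("a.txt",), ("b.txt",)])
live.commit()

drill_passed = restore_drill(live)
```

A real drill would restore onto a separate machine from the actual backup files; the point of the sketch is only that the check is run, not assumed.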

~~~
RegEx
Reading High Performance MySQL as well. Loving it so far!

------
akent
_I noticed that a particular “FUUUCCKKKKKasdjkfnff” wasn’t getting printed
where it should have_

Why not take the extra half a second to make those random strings meaningful
and hidden behind a DEBUG log level?
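
Something along these lines, sketched with Python's standard `logging` module. The logger name, the `commit_block` function, and the message text are invented for the example and say nothing about how Dropbox's code is actually written.

```python
import io
import logging

def make_logger(level, stream):
    """Build a logger that writes to `stream` at the given level."""
    logger = logging.getLogger("upload_demo")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger

def commit_block(logger, block_id):
    # A meaningful breadcrumb instead of a random string; it costs nothing
    # to read later and is only emitted when DEBUG is enabled.
    logger.debug("reached commit path for block %s", block_id)
    return True

quiet = io.StringIO()
commit_block(make_logger(logging.INFO, quiet), "abc123")

verbose = io.StringIO()
commit_block(make_logger(logging.DEBUG, verbose), "abc123")
```

At INFO level the breadcrumb is dropped; at DEBUG level it shows up, so the "is this line being reached?" question can still be answered when needed.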

~~~
ephemeralgomi
Probably most of their logging _is_ meaningful, but deciding how to
professionally phrase each and every log message will eventually get you to
decision fatigue.

The point that he was making with this was that over-logging is a good thing -
this probably wasn't something the initial author thought was going to be
terribly informative, hence the random string. And yet it ended up diagnosing
a real world problem.

In a perfect world, by all means properly write out your messages - but if
you're stalling on a log message because you're not sure how to phrase it, you
may get concrete benefit from just dropping a FUUUCCKKKKKasdjkfnff and moving
on.

~~~
vosper
I don't know how many new logging statements you commit to production code
every day, but I can't imagine it averages out to more than one or two. If you
can't take the time to phrase them both professionally and meaningfully then
you're doing yourself and your team a disservice.

~~~
flatline3
Moreover, you have, in your head, the log message that should be written.

At the time of writing the code, you're hopefully thinking through "how could
this fail?"

There's your log message.

------
elefont2
'Even memcached, which is the conceptually simplest of these technologies and
used by so many other companies, had some REALLY nasty memory corruption bugs
we had to deal with, so I shudder to think about using stuff that’s newer and
more complicated'

Does anyone know what memory corruption bugs they are referring to?

------
acslater00
For the record, I use sqlalchemy 0.6.6 regularly under fairly heavy load, and
have never had a problem with it. Any 'sqlalchemy bugs' are inevitably coding
mistakes on my part.

~~~
kennywinker
Yeah, I found that bit quite vague. Are they using SQLAlchemy's object layer,
but just not the high level query stuff? Or are they using only the low-level
query stuff and nothing else?

I'd love to know more about how their system works, if they are indeed not
using an ORM.

Every time I've tried to build something without an ORM, I just end up writing
my own shitty one accidentally.

------
ivankirigin
Rajiv is awesome, you should listen to him

~~~
akent
Says an ex "Product Manager at Dropbox".

Edit: Thanks for the downvotes. My point is, just make it unambiguous to
everyone in your comment so we don't have to click through your profile.
Context matters. e.g.:

"I was Product Manager at Dropbox and worked with Rajiv (the OP). He's
awesome, you should listen to him."

Much better.

~~~
stratos2
which means his opinion counts at least 100 times more than yours does.

~~~
carb
What he's saying is that ivankirigin should have said that himself. I don't
know that he has any credibility to his statement and wasn't going to give it
any merit until akent made me realize that ivankirigin had first-hand
experience.

------
JohnGB
I believe that the section on "The security-convenience tradeoff" is
fundamentally flawed.

A username and password represent a pair. Neither one has meaning in terms of
authentication without the other.

Take the example where I have forgotten my username (JohnGB), but try with
what I think it is (Say JohnB), and enter the correct password for my actual
username. The system would then tell me that my username is fine, but that my
password isn't. From then on, I would be trying to reset the password for a
different user as the system has already told me that my username was correct.

Please, for the sake of sane UX, don't do this!

~~~
dudeguy
No way, sir. Saying 'you entered the wrong password' in that case is not any
more confusing than the ambiguous error that says 'you got one of them wrong
but I'm not gonna tell you which.' Most reset password systems are keyed to
your email address anyway.

------
opminion
A topic usually left out in scaling discussions is: how much can one predict?
Or is it mostly trial and error? Is it mostly about good "reactive"
engineering, would it have benefited from good mathematical modeling?

------
crazygringo
> _I noticed that a particular “FUUUCCKKKKKasdjkfnff” wasn’t getting printed
> where it should have_

:)

I've never seen a shorter description of real-world software development.
That's it in a nutshell!

------
wulczer
Great article! Small nitpick from someone who just tried this on his server
logs :)

    
    
      * on my machine xargs -I implies -L1, so you can drop that
      * use gnuplot -p or the graphic will disappear immediately after rendering

~~~
ralph
I agree, good article.

A sort -n is also required before the uniq since server logs have the time of
the request but are printed when the response is complete so they're not
necessarily increasing.
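
To see why the sort matters, here is a small Python sketch. `itertools.groupby` only merges adjacent equal values, much like `uniq -c`, so it shows the same failure on out-of-order input. The timestamps are made up.

```python
from itertools import groupby

def uniq_count(values):
    """Count runs of adjacent equal values, similar to `uniq -c`."""
    return [(key, len(list(run))) for key, run in groupby(values)]

# Hypothetical request timestamps (seconds), logged slightly out of order
# because each line is written when the response completes.
timestamps = [100, 101, 100, 101, 102]

unsorted_counts = uniq_count(timestamps)        # the same second appears twice
sorted_counts = uniq_count(sorted(timestamps))  # one entry per second
```

Without the sort, seconds 100 and 101 each show up as two separate runs of one request; with it, each second gets a single correct count.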

------
anamax
There's a talk about Dropbox scaling at
[http://www.stanford.edu/class/ee380/winter-schedule-20112012...](http://www.stanford.edu/class/ee380/winter-schedule-20112012.html) .

------
gallerytungsten
Great article. Rajiv made it easy to understand the conceptual framework. The
lesson is: always strive to be robust. Test your failure points deliberately.
Applicable to more than just server scaling.

------
matt
Nice, love the idea of running with extra load to predict breaking points.

------
lobster_johnson
I'm surprised that Dropbox actually uses S3 internally to store data. All
along I had assumed, wrongly, that Dropbox had built their own distributed
storage cluster.

------
philfreo
Can you explain the nginx/HAproxy config a little more?

~~~
emmett
HAproxy is great at exactly one thing: load balancing. It's better than nginx
for that one use, because it's more flexible, has better controls for
flapping, is smarter about queuing, gives you cool stats pages, etc.

Nginx is great for...pretty much everything else.

------
kevinburke

        MySQL has a huge network of support and we were 
        pretty sure if we had a problem, Google, Yahoo, 
        or Facebook would have to deal with it and patch 
        it before we did. :)
    

I am fairly certain Google is running its own (patched) version that's fairly
different than the off-the-shelf MySQL.

~~~
nl
You mean using the Google Mysql5patches[1]?

[1] [http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patch...](http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches)

------
mistercow
Running with extra load seems inefficient in terms of energy consumption.
Would it be possible to achieve the same thing by inserting delays or
something that can be turned off?

------
stratos2
All security is a balancing act, which is the point he is making. There is
always a tradeoff.

