I think a lot of services (even banks!) have serious security problems and seem to be able to weather a small PR storm. So figure it out if it really is important to you (are you worth hacking? do you actually care if you’re hacked? is it worth the engineering or product cost?) before you go and lock down everything.
Just because you can "afford" to be hacked doesn't mean you shouldn't take all the steps necessary to proactively protect your data. In the end, security is not about you; it is about your users. This is exactly the type of attitude that leads to all the massive breaches we have been seeing recently. Sure, your company is "hurt" by bad PR, but your users are the ones who are the real victims. You should consider their risk (especially with something as sensitive as people's files!) before you consider your own company's well-being.
Systems get compromised, it happens. Organizations with weak security architectures can become so compromised that cleanup becomes a nightmare because it is difficult to isolate the threat(s) without serious disruption in services. A strong security architecture is not so much to ensure breaches never happen but to limit the amount of damage likely to occur when breaches do happen.
And yes, this happens even to organizations that think they have nothing worth hacking.
Having internal firewalls between servers that don’t need to talk to each other — again a good idea. But if your service doesn’t actually need this, don’t necessarily do it
I cannot think of any case where "your service doesn't actually need this" justifies "don't necessarily do it". I understand that it costs money to do these things, but setting up a firewall is relatively cheap, and significantly less than the cost of the additional cleanup if a breach is not contained.
Security, in a way, can be compared to insurance. Sure, if you are young and live a healthy lifestyle you may not see the need to spend $100+ a month on a health insurance policy, and you can save a bunch of money... but if an accident does happen, rest assured it will cost you significantly more than if you had just bought the insurance in the first place.
This, in a sense, is the security tradeoff.
I think really smart engineers who are well versed in security can know where security needs to be, and yes it is possible to go overboard, but I think this is the exception rather than the rule. Advising readers that it's ok to not worry too much about security because:
lot of services (even banks!) have serious security problems
is absolutely ridiculous and is horrible advice.
Sucks, but that's capitalism. However, there are a few states now which allow you to have some charitable clauses in your corporate charter.
After a few years of this, I set them all back to right time and found that I had trained myself to just leave at the right time, with no more trickery needed.
Once I had three wake-up alarms, at different points in the bedroom. Didn't work.
Being late is lame. Suffering its consequences is the best teacher one can have.
For example, if you are designing an aircraft, your first design is never perfect. So when you do the initial design, you do it as if the aircraft has to weigh 70% of what it really will. As errors are corrected in your original design (or features creep in), you slowly eat away at that 30% margin. Hopefully by the time you finish you have some left, or the aircraft will never get off the ground.
OP's "extra reads" is dumb because he could have had normal metrics for memcached load and planned to only support like 70% capacity or somesuch, and when load hit that number, he would immediately increase capacity. Instead he's running with a handicap. It's just useless.
Secondly, you should know what your capacity is. Stress testing exists for a reason.
Stress-testing, shared-nothing and dollar-scalable are platonic ideals, and they're not always achievable. If Dropbox had three infrastructure engineers, they probably weren't able to build proper capacity planning models, and probably couldn't afford to build a full production work-alike for stress testing anyway. (And at some scales, that's literally impossible. Our vendors couldn't physically manufacture enough servers to build a full test environment, cost aside.) I'm sure they did some simulated tests as well, but those won't tell you the whole story.
You're focused on IOPS, but you have no idea if that's what Dropbox's bottlenecks were. (Not to mention: What does IOPS mean on an EBS and S3 infrastructure?) Complex systems fall over in complex ways. You can predict the next bottleneck, but not the one after that; by the time you get there, your fix for the first bottleneck will have changed the dynamics.
It sounds like they did do stress testing, using real-world loads, on a system that was 100% similar to their production system. They ran continuous just-in-time stress tests in the Big Lab.
That being said, trends in user visits are of course great numbers for capacity planning because you have an idea how much growth to expect in the near future. But it's only a vague multiplier; you need to know how beefy a box to get (by stress testing to determine capacity) and then multiply by the growth factor. But it's usually more complicated than this.
Stress testing doesn't have to be a formal process in all environments. You might just have a developer with a new chat server and they want to get a benchmark of how many users can join and chat before CPU peaks. An hour or two of coding should provide a workable test on like-hardware, which can then be generalized with tests of other software to give an idea of the capacity when a certain number of users are logged in and performing the same operations. The point isn't to know 100% when you will fall over, but to have at least an idea when you're going to fall over, so you don't have to actually fall over to figure out when and where to scale.
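That "hour or two of coding" benchmark can be sketched in a few dozen lines. The snippet below is a hypothetical stand-in: a trivial echo-style "chat" server plus N concurrent clients, measuring rough message throughput. The server, client counts, and message counts are all made-up parameters, not anything from the Dropbox setup.

```python
import socket
import socketserver
import threading
import time

class EchoHandler(socketserver.BaseRequestHandler):
    """Stand-in for the chat server: echoes each line back to the sender."""
    def handle(self):
        f = self.request.makefile("rb")
        for line in f:
            self.request.sendall(line)

def run_client(addr, n_messages, results, idx):
    """One simulated user: send messages, wait for each echo, time the run."""
    with socket.create_connection(addr) as s:
        f = s.makefile("rb")
        start = time.perf_counter()
        for _ in range(n_messages):
            s.sendall(b"hello\n")
            f.readline()  # block until the server echoes back
        results[idx] = time.perf_counter() - start

def benchmark(n_clients=20, n_messages=50):
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    results = [None] * n_clients
    threads = [threading.Thread(target=run_client,
                                args=(server.server_address, n_messages, results, i))
               for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.shutdown()
    # Rough messages/sec under concurrent load; generalize across runs
    # with different client counts to get a capacity curve.
    return (n_clients * n_messages) / max(results)

print(f"throughput: {benchmark():.0f} msg/s")
```

Run it with increasing `n_clients` on like-hardware and watch CPU; the knee in the curve is your "at least an idea" of when you fall over.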
I have no problems with very-short-term big lab stress testing. We had the same issue at my last place, and with lots of caution, it worked fine. But jesus christ, if I told my bosses "I think we should run all the servers with extra load until they fall over, then re-evaluate", they'd look at me like I had antlers growing out of my head.
Incidentally, fuel dump systems were initially added due to a rule by the FAA that a plane's structural landing weight not be exceeded by its takeoff weight. Many commercial planes never had this problem, so dumping systems were not installed. As a result, most planes just circle until they've burned up enough fuel, or land anyway overweight. You could dump fuel to lessen the chance of explosion, but only if your plane is equipped with a fuel dump system, and such incidents are so rare it's not even a safety consideration.
"Why not just plan ahead? Because most of the time, it was a very abrupt failure that we couldn’t detect with monitoring."
So you have a system, and you have monitoring in place. Let's say the monitors were set up for 1 minute polls, because somebody thought that was a good idea. Suddenly you find out one of your servers is down. Oh noes! There's 45 seconds until the monitor finds this out, which would be horrible.
Since we have doubled the reads on the existing servers, we now no longer have capacity and connections are stacking up. Shit :'( But not to worry! Let's just quickly kill the extra reads - now we have more capacity! Hooray!
Except, if the extra reads weren't happening, they would have already had extra fucking capacity and not had to flip a switch in the first place.
Now you see why I'm mad, bro?
They actually do this kind of stuff (except for the "let's dump the lead" part) in stress tests, especially on cargo and military planes. And they do similar tests not only in aviation, but in most kinds of engineering.
So maybe misplaced sarcasm?
Don't you mean "it sounds good in practice"? This entire post is about practical experience.
I don't think this is like setting your watch forward 5 mins. I think it's more like RAID. When you get a warning that one of your drives has died, you know you have to get in and replace it.
Depending on how critical the machine is, the cost of getting to the data centre etc. you might leave now, in the middle of the night and drive like a bat out of hell, or you might leave it til next week when you'll be in there anyway.
Either way you know your risk just went up a hell of a lot. Depending on how risk averse you are, you will act accordingly.
In theory, you'd think you can do load-testing and simulations and capacity planning, and find these breaking points ahead of time. In practice, it's not always feasible, and this seems like a simple-enough hack that gets you much of the way there.
The whole post was excellent, but all the useful points will now be overshadowed by the armchair quarterbacking about security by people who mostly don't understand that ALL security is a compromise, and it is as important to understand and make deliberate decisions about your security as it is to try to make a secure system in the first place.
He keeps many comments at a high level (security/convenience) and refers to a few non-Dropbox examples.
I like object relational mapping as a theory (ie. I have an object of type Author which has 1 or more books I can loop over), but I hate ActiveRecord implementations. Eventually, they just end up implementing almost all of SQL but in some arcane bullshit syntax or sequence of method calls that you have to spend a bunch of time learning.
I also seriously doubt that anyone has ever written a production system of any reasonable complexity and been able to use the exact same ORM code with absolutely any backend (if you have an example please correct me on this). This barely even works with something like PDO in PHP which is a bare bones abstraction across multiple SQL backends.
When it comes down to it, the benefits of ActiveRecord are all but dead by about the third day of development. The data mapper pattern adopted by SQLAlchemy (et al.) takes all of the shitness of ActiveRecord and adds mind-bending complexity to it.
SQL is easy to learn and very expressive. Why try and abstract it?
I spent years working with an ActiveRecord ORM I wrote myself in my feckless youth and thought that it was the answer to the world's problems. I didn't really understand why it was so terrible until I did a large project in Django and had to use someone else's ORM.
When I really analysed it, there were only three things that I really wanted out of an ORM:
1) Make the task of writing complex join statements a bit less tedious
2) Make the task of writing a sub-set of very basic where clauses slightly less tedious
3) Obviate the need for me to detect primary key changes when iterating over a joined result set to detect changes in an object (for example, looping over a list of Authors and their Books)
To that end, I wrote this:
It's written in PHP because I like and use PHP, but it's a very simple pattern that I would like to see elaborated upon/taken to other languages, as I think it provides just the bare minimum amount of functionality to give some real productivity gains without creating a steep learning curve, a performance trade-off, or any barrier to just writing out SQL statements if that's the fastest way to solve the problem at hand.
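Item 3 above is the one ORMs rarely make cheap. A minimal sketch of the idea in Python (the schema, names, and `hydrate` helper are all hypothetical, not the PHP library mentioned above): detect primary-key changes while iterating a joined result set and group the Books under each Author.

```python
import itertools
import sqlite3

def hydrate(rows, pk="author_id"):
    """Group a flat joined result set into (author, books) pairs by
    detecting primary-key changes; assumes rows are ordered by pk."""
    for key, group in itertools.groupby(rows, key=lambda r: r[pk]):
        group = list(group)
        author = {"id": key, "name": group[0]["name"]}
        books = [{"title": r["title"]} for r in group if r["title"] is not None]
        yield author, books

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Le Guin'), (2, 'Banks');
    INSERT INTO books VALUES (1, 1, 'The Dispossessed'),
                             (2, 1, 'The Lathe of Heaven'),
                             (3, 2, 'Excession');
""")
rows = conn.execute("""
    SELECT a.author_id, a.name, b.title
    FROM authors a LEFT JOIN books b ON b.author_id = a.author_id
    ORDER BY a.author_id, b.book_id
""")
for author, books in hydrate(rows):
    print(author["name"], [b["title"] for b in books])
```

The SQL stays plain SQL; the only "ORM" is ten lines of grouping logic on the way out.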
You're entirely right here, because databases are different. For example, (I forget the exact details), "select count(*)..." in MySQL is O(1), but it's O(log n) or O(n) in Postgres, depending on indices. That's a detail no ORM is going to save you from.
> SQL is easy to learn and very expressive.
Strongly disagree. The reason everyone keeps trying to write ORMs is because 1) SQL is a shitty language and 2) it's not the language that programmers want to use. Write a better frontend language for Postgres, and the ORMs would disappear.
I strongly suspect that would take some of the wind out of the NoSQL crowd. There are certainly NoSQL deployments that would have a hard time on a traditional RDBMS, but there are a lot of other places that use Mongo just because they don't like SQL-the-language, rather than Postgres-the-DB.
It is very much not shitty.
It's just that lots of programmers, especially OO-minded ones, cannot get into its mindset and use it for what it is; they have to put a lame OO abstraction on top.
Functional programmers should fare better in this regard (or Prolog programmers, if they still exist).
If you really want to abstract it, something like LINQ is a better way.
The hard part in SQL is optimization which requires really understanding how the underlying database engine optimizes and executes the query.
Optimizing complex queries is no joke. It's one of the reasons noSQL seems nice at glance. You can do the optimizations by adding lots of indexes or using application logic. In reality, it's a tradeoff for other problems.
I think you are exaggerating quite a bit when you refer to SQLAlchemy's patterns adding "mind-bending complexity". Object relational mapping is a complex affair to start with. Have you much experience with modern versions of SQLAlchemy directly (and if not, how fair are comments like that)?
There are only two hard things in Computer Science:
cache invalidation and naming things.
-- Phil Karlton
At any rate what I'm saying really is that reducing the amount of keystrokes writing and maintaining joins is the only part of SQL where I see there can be significant gains in productivity through automation of the task.
Most ORMs implement where clauses, from clauses, aggregate functions, grouping, having, etc.; that is, they wind up basically re-implementing SQL and abstracting it, so your previous knowledge of SQL is obsoleted. To debug problems or create complex queries, you either have to switch entirely to SQL (in which case you lose all the query-building functionality) or map, in your head, the SQL you want to achieve onto the arbitrary syntax provided by the ORM software.
That alone is most of the usefulness of SQLAlchemy, as it lets you write subqueries and joins extremely easily.
On top of that, the (optional) ORM is built as models on top of SQLAlchemy's table/relationship API. These models can be queried almost exactly like the raw tables.
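For the sake of concreteness, here is a small sketch of that expression layer (modern, 1.4+ style; the schema and data are made up). The join condition is inferred from the ForeignKey, and the query is composed as Python expressions rather than string-pasted SQL:

```python
import sqlalchemy as sa  # assumes SQLAlchemy 1.4+ is installed

metadata = sa.MetaData()
authors = sa.Table(
    "authors", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("name", sa.String),
)
books = sa.Table(
    "books", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("author_id", sa.ForeignKey("authors.id")),
    sa.Column("title", sa.String),
)

engine = sa.create_engine("sqlite://")  # throwaway in-memory DB
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(authors.insert(), [{"id": 1, "name": "Le Guin"}])
    conn.execute(books.insert(), [
        {"author_id": 1, "title": "The Dispossessed"},
        {"author_id": 1, "title": "The Lathe of Heaven"},
    ])

# The join's onclause is inferred from the ForeignKey; the ORM's mapped
# models can be queried in almost exactly the same way.
query = (
    sa.select(authors.c.name, sa.func.count(books.c.id).label("n_books"))
    .join_from(authors, books)
    .group_by(authors.c.name)
)

with engine.connect() as conn:
    for name, n_books in conn.execute(query):
        print(name, n_books)
```

The point is that the query stays structurally SQL (select, join, group by), so SQL knowledge transfers instead of being obsoleted.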
"pick lightweight things that are known to work and see a lot of use outside your company, or else be prepared to become the “primary contributor” to the project."
One point it misses, though, is to test your backup strategy often. When you scale fast, things break very often, and it's good to be in the practice of restoring from backups every now and then.
"It's an excellent idea to run a realistic load simulation on a test server and then literally pull the power plug. The firsthand experience of recovering from a crash is priceless. It saves nasty surprises later."
Same goes for testing network connectivity and failover. I can't tell you how many times I've heard things like "The automatic recovery _should_ have kicked in but..."
Having a recovery procedure and backup strategy is completely different from having actually restored a backup and recovered from a failure.
Why not take the extra half a second to make those random strings meaningful and hidden behind a DEBUG log level?
The point that he was making with this was that over-logging is a good thing - this probably wasn't something the initial author thought was going to be terribly informative, hence the random string. And yet it ended up diagnosing a real world problem.
In a perfect world, by all means properly write out your messages - but if you're stalling on a log message because you're not sure how to phrase it, you may get concrete benefit from just dropping a FUUUCCKKKKKasdjkfnff and moving on.
When the problem occurs, it's pretty quick for the guy who needs to fix it at 2am, to find where it exploded in the code base, while the original developer is (maybe) passed out in a bar somewhere.
Not much else matters. He could have just done :( x 10 and had the same result. The main thing is, it's easily traceable!
At the time of writing the code, you're hopefully thinking through "how could this fail?"
There's your log message.
"FUUUCK" is awfully good at conveying the seriousness of the error, and "aslfkhsdf37" ensures that the string is unique, so you can pinpoint it instantly in your gigantic codebase.
The fact is, it kind of works. Something like "missing record (line 38)" doesn't indicate the severity, there might be 10 different "missing record" error strings in your codebase, and somehow in real life, line numbers and filenames never seem to quite match up like they should (transcompilation, async callbacks, and so on.)
Let he who has never written a frustrated, nonsense print statement throw the first stone, if you will.
Plus, the statement is not only meaningful, but also very expressive.
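You can get both properties deliberately, without the profanity: a short made-up token that is as grep-able as "aslfkhsdf37" but also says what went wrong, behind a proper log level. A hypothetical sketch (the token and function names are invented for illustration):

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("sync")

def fetch_record(record_id, table):
    row = table.get(record_id)
    if row is None:
        # "ERR_SYNC_MISSING_RECORD" is a made-up token: unique enough to
        # pinpoint instantly in a gigantic codebase, and the ERROR level
        # carries the severity that "FUUUCK" was conveying.
        log.error("ERR_SYNC_MISSING_RECORD id=%s", record_id)
    return row

fetch_record(42, {})  # logs: ERROR ERR_SYNC_MISSING_RECORD id=42
```

Unlike line numbers, the token survives transcompilation and async callbacks, and unlike "missing record" it can't collide with ten other strings.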
Does anyone know what memory corruption bugs they are referring to?
I'd love to know more about how their system works, if they are indeed not using an ORM.
Every time I've tried to build something without an ORM, I just end up writing my own shitty one accidentally.
My strategy with SQLAlchemy has always been to under-promote it. If you have lots of big players adopting you early and hitting all the pointy edges, it can damage your rep. There's a group of major folks out there who will never use my library due to old experiences. Others like Reddit and Yelp have hung on, and apparently Dropbox is still using the core, hooray!
That's why I'm always amazed at how aggressively MongoDB is promoted, when it seems like they're still going through a lot of growing pains. I guess they sort of have to, given that they're a business and all.
Edit: Thanks for the downvotes. My point is, just make it unambiguous to everyone in your comment so we don't have to click through your profile. Context matters. e.g.:
"I was Product Manager at Dropbox and worked with Rajiv (the OP). He's awesome, you should listen to him."
A username and password represent a pair. Neither one has meaning in terms of authentication without the other.
Take the example where I have forgotten my username (JohnGB), try what I think it is (say JohnB, which may well be some other user's username), and enter the correct password for my actual account. A system that validates the fields separately would then tell me that my username is fine but my password isn't. From then on, I would be trying to reset the password for a different user, because the system has already told me my username was correct.
Please, for the sake of sane UX, don't do this!
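The fix is to validate the pair as a pair and report one combined failure. A minimal sketch (the user store and messages are made up; real code compares salted password hashes, not plaintext):

```python
# Made-up user store; real systems store salted password hashes.
USERS = {"JohnGB": "correct-horse"}

def login(username, password):
    stored = USERS.get(username)
    if stored is None or stored != password:
        # One combined message: the caller cannot tell which half was wrong,
        # so a mistyped username never gets "confirmed" as valid.
        return "Invalid username or password."
    return "Welcome!"

print(login("JohnB", "correct-horse"))  # wrong username, right password
print(login("JohnGB", "wrong"))         # right username, wrong password
```

Both failure modes produce the identical message, which also denies attackers a username-enumeration oracle.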
I've never seen a shorter description of real-world software development. That's it in a nutshell!
* on my machine xargs -I implies -L1, so you can drop that
* use gnuplot -p or the graphic will disappear immediately after rendering
A sort -n is also required before the uniq since server logs have the time of the request but are printed when the response is complete so they're not necessarily increasing.
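The pitfall is that `uniq -c` only collapses adjacent duplicates. A small Python illustration with made-up epoch-second timestamps:

```python
import itertools

# Made-up log timestamps (epoch seconds): each line carries the request's
# start time but is written when the response completes, so the column is
# not monotonically increasing.
log = ["1311111113", "1311111111", "1311111112", "1311111111"]

def uniq_c(lines):
    """Like `uniq -c`: counts runs of *adjacent* identical lines only."""
    return [(len(list(g)), key) for key, g in itertools.groupby(lines)]

print(uniq_c(log))          # unsorted: "1311111111" is split across two rows
print(uniq_c(sorted(log)))  # sorted first: one row per second, correct counts
```

Without the sort, the same second shows up as multiple rows and the per-second request counts (and hence the graph) are wrong.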
Nginx is great for...pretty much everything else.
MySQL has a huge network of support and we were pretty sure if we had a problem, Google, Yahoo, or Facebook would have to deal with it and patch it before we did. :)