When "clever" goes wrong: how Etsy overcame poor architectural choices (arstechnica.com)
163 points by llambda on Oct 3, 2011 | 86 comments

I've learned to never ever badmouth a weird architecture in a working and profitable company.

Simply put - the people who threw it together probably had significant constraints: lack of money, lack of knowledge, lack of time. To then come in and poo-poo their work (work that is paying the bills) is bad form.


We can all sit back and say it sucks, but I don't necessarily think that sticking business logic in DB stored procedures is always the wrong thing: if you think of the DB storage system as separate from the DB procedure system, then there's the split right there - it just happens to be running in the same process, and you have to be extra careful.

"I've learned to never ever badmouth a weird architecture in a working and profitable company."

I learned that lesson myself earlier this year. A client had a system that was just so ... wrong, I found myself thinking I'd made a terrible choice of project, and initially marvelled at the fact that the company simply survived.

I stuck it out because it was a short-term engagement, and eventually learned that they'd solved an "unsolvable" problem, one that I'd tried myself and seen dozens of other attempts at, all eventual and apparently inevitable failures. But this place, by ignorance or genius, seemed to make the wrong decision at every step, and did it.

Not to say that I agree with "if it ain't broke, don't fix it" all the time, but when you see something done wrong in a new way, make sure that you extract what you can from it and you might be surprised at what you'll learn.

> they'd solved an "unsolvable" problem

Without giving away any proprietary information, could you elaborate on that? It sounds very interesting.

Ross's talk is significantly more deferential in tone than the Arstechnica writeup may suggest.

I'm glad to hear that. The more I think about this article (and a lot of the comments here) the more uncomfortable I become. There's a difference between "don't be clever", "don't build it yourself", and "don't ever optimize". I find it hilarious that the other high ranking posts today are about Haskell and NoSQL, both of which might be reclassified into the "clever" category in a few years.

While this cleverness probably crimped Etsy at some point in their growth, it also probably saved a lot in hardware costs at a critical time in the business. It's worth noting that they are still in business, that it was possible to throw hardware at the problem to buy time, and that they now have a solution they are comfortable with. While one can always learn from mistakes, I wouldn't argue too hard with that sort of success.

I think you could argue that using postgres and using logic-heavy stored procedures was positive technical debt, in the sense that it was the approach that one of the founders was familiar with. (Note that I'm definitely not saying that I think doing things this way was an objectively good or even easy path: just that given that particular founder's experience, it probably worked out better than spending the time to learn something totally new would have.)

So was using a monolithic master database, despite all of the pain we have migrating off of it now. The way we work with MySQL now is great at scale, but it does require more work to do many things. It relies on denormalization and presumes the existence of some kind of job server to keep things consistent. If we started out trying to build the architecture we have now, we never would have gotten anywhere.

The python middle layer is kind of a different story. I think it was kind of a blind attempt at scaling to 1000x when we should have been worried about 10x or maybe 100x, and since it wasn't a reaction to a real problem it was closer to being just totally wrong. Rather than allowing us to iterate quickly for a while at our own future expense, building/deploying and dealing with it actually froze new feature development for an embarrassingly long time.

Sounds like the path to madness…

Just because something is successful doesn't mean it's right. I get what you're saying, but it doesn't mean that successful systems are exempt from intelligent discussion about where things went wrong. A lot of poor decisions are propagated because people are building on "successful" technology.

A lot of this makes sense (DBAs gating features! shudder).

But I'm not totally clear on why a switch from Postgres to MySQL was warranted.

Just because Postgres can do stored procedures well doesn't mean that it's effectively a stored procedure server. It also happens to handle SQL pretty well too. :)

The answer to that is simple, even with the little information given: Flickr used MySQL; the CTO driving the changes at Etsy came from Flickr; therefore, Etsy now uses MySQL.

Given their existing investment in Postgresql, I can't imagine they suddenly had more experience internally with MySQL, though the CTO may have been the only person with any horizontal DB scaling experience since Etsy didn't scale before.

I've seen the pattern of "CTO changes tech to something he/she is familiar with" many times. Sometimes that's good; mostly, it annoys the teams who are now less efficient. We certainly don't have enough info here to know. The scaling argument (his knowledge thereof, not Postgresql's "inability") is pretty compelling.

If you read the comments that preceded yours in responding to my question, you'll see that there are perfectly apolitical reasons for them to have made this move; they wanted a scale-out, less DBA-controlled database, and at the time there wasn't a built-in Postgres way to do that.

Also, knowing how to scale DB X and not knowing how to scale DB Y often means scaling DB Y is a science project.

I feel bad for baiting people into psychoanalyzing Etsy's team.

There still isn't a built-in way in postgres to do master-master replication.

Neither is master-master built into MySQL.

You can apparently do that with Galera, but I haven't seen that in production.

What they're referring to is most likely MySQL's half-baked "circular replication". Some people claim it works for them. Personally I tried it once and the usual MySQL replication issues (desync, InnoDB deadlock) would create much more uncomfortable situations than in a Master/Slave setup.

With master/slave you at least stand a chance of having one half of the pair survive through a problem. With circular replication the cluster would lock up hard almost every time for us, and - to add insult to injury - leave the pair inconsistent after recovery.

I think the confusion here is based on the difference between "multi-master" replication and "master-master" replication. MySQL doesn't support the former, where a machine can be a slave of more than one master, but it does support (like you alluded to) having two masters (usually one writable, one read-only) in a loop, which MySQL folks call "master-master" replication.

Whatever your opinion of MySQL's replication, the MySQL community calls it master-master.

Based on the zero information I have I'm going to assume that there was probably more knowledge on the team about how to make MySQL do what they needed than Postgres. That's usually all the reason anyone really needs.

That's not my take on it. Their old PostgreSQL system was largely monolithic, they needed something that scaled well horizontally, so they went with MySQL. Makes perfect sense to me.

PostgreSQL has many, many strengths over MySQL, but historically replication wasn't one of them. Built-in replication didn't come to PostgreSQL until version 9, and until then various poorly scalable trigger based log shipping schemes (like Slony) were used. I wouldn't be surprised if PostgreSQL's built-in replication system with version >= 9 works quite well nowadays, but I think at the time Etsy were considering the switch there was still quite a lot of FUD around replication.

You are confusing triggers with log shipping.

They say they use master-master replication with MySQL so they have no single points of failure. Back then, I think the only option with Postgres was to use Slony.

This is correct, master-master is necessary to shard as we have implemented it.

..and if you've used slony in anger, you know using slony isn't really an option.

Interesting, but do we have enough information? To quote Spolsky...

A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has because it’s in your lab where you’re endlessly polishing the damn thing. Shipping is a feature. A really important feature. Your product must have it.


This article doesn't go into details about the history of why Etsy's architecture is the way it is. It's entirely possible that this architecture enabled them to launch and iterate faster at a critical point in their history, without which they'd be nowhere.

Do what you need to now that will work. If your project is successful enough to experience a few orders of magnitude growth it'll need rearchitecting in some way anyway; build the heavy engineering when you know what you need, not think you know what might be the pain point.

It doesn't feel like this applies here. 6-7 years ago a decision was made to go with stored procedures over using an orm or at least implementing db logic in the application code.

Along with this, they also chose to split the responsibility for code and sql across two teams.

6-7 years ago, plenty of other options were available. In fact, the sharded MySQL solution that they use now was already possible back then.

It sounds more like they made some architectural decisions based on the company's org chart, and it came back to bite them in the ass hard.


The constraint may have been knowledge. They may have made a very reasonable tradeoff to forego learning "best practices" and "proper architecture" to just get something out the door. The original developer may have just graduated with his philosophy degree and never have written more than a bash script. Hard to say.

Given the size of the company, the architecture didn't grow along org charts (that is a very real phenomenon in large companies!). Rather, the org chart grew along architectural lines. Regardless of cause, it is a pretty significant smell and can tell you something is (or will be) wrong.

The corollary to that theorem is to make changes when you identify that what you are doing isn't working anymore, as that pain grows exponentially.

Good case of avoiding early optimization but hard to believe in 2005 someone thought it was a good idea to build your business rules in your database layer.

I don't find it to be all that surprising. I remember when Ruby on Rails was just starting to gain in popularity around the same timeframe. One of the big arguments against it was that it pushed you to put all of your business logic in the application. "How will you write another application to use the same database?" they would cry.

The answer turned out to be write one application tier that provides a web API to the database for all other applications to use. Interestingly, that turned out to be a pretty popular solution and is now probably the most prevailing method to get data to your user-facing applications. Looking back, it seems pretty obvious especially with the rise of mobile apps using that technology. It wasn't clear to most people in 2005, however.
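A minimal sketch of that API-tier pattern (hypothetical names; Python's stdlib with SQLite standing in for the shared database): one thin handler owns all database access and returns JSON, so every client, web or mobile, goes through the same tier.

```python
import json
import sqlite3

# Hypothetical single-table store; in practice this is the shared database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO listings (title) VALUES ('hand-knit scarf')")

def get_listing(listing_id):
    """API-tier handler: the only code that touches the database."""
    row = conn.execute(
        "SELECT id, title FROM listings WHERE id = ?", (listing_id,)
    ).fetchone()
    if row is None:
        return json.dumps({"error": "not found"})
    return json.dumps({"id": row[0], "title": row[1]})

print(get_listing(1))  # every client consumes the same JSON contract
```

The point isn't the handler itself but the boundary: applications written in any language consume the JSON contract instead of sharing a schema.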

This seems like an ongoing debate, I remember similar discussions between Java developers and DBAs before the rise of RoR. The answer back then was to migrate business logic into business logic servers, with application servers talking to the business logic serves over a queue. Architectures like that allow organizations to have heterogenous application architectures: One company had .NET for internal apps and Java for customer-facing web systems.

Everything talked happily to a business logic server written in Java, and that server talked to a couple of different databases and some legacy systems, one of which was written in MUMPS.

There is a lot more to programming life than Rails, ActiveRecord, and web APIs.

It's incredibly arrogant to find it hard to believe that someone thought stored procedures are a good idea.

SPs are one possible solution among many to some common programming problems.

It's also not an either/or choice. You can have a web API and still use some stored procedures.

In fact, I'm working on a Rails app right now facing this choice. We're a small site, barely a thousand users, but we have some complex queries for relevancy ranking that are just too slow in ActiveRecord - it ends up creating lots of expensive objects which kill you on GC. Whether this code turns into a SQL-based view or a full-on PL/pgSQL procedure is largely dependent on my mood tomorrow morning, but the fact remains that ActiveRecord can't solve all your problems. In fact, putting this logic in the database will help us make other application logic cleaner and more ActiveRecord-y; all of a sudden, we'll be able to do Content.order(:relevance).paginate, instead of having to pass around unwieldy arrays of content.
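The database-side version of that pattern, illustrative only (SQLite standing in for Postgres, and a made-up relevance formula), pushes the ranking into a view so the application can just order by it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT,
                      views INTEGER, likes INTEGER);
INSERT INTO content (title, views, likes)
  VALUES ('a', 10, 1), ('b', 5, 9), ('c', 7, 3);

-- The relevance formula lives in the database, so no per-request
-- objects are built just to compute a ranking.
CREATE VIEW ranked_content AS
  SELECT id, title, views + 10 * likes AS relevance FROM content;
""")

rows = conn.execute(
    "SELECT title FROM ranked_content ORDER BY relevance DESC"
).fetchall()
print([r[0] for r in rows])
```

With the view in place, the ORM side reduces to something like Content.order(:relevance), and no expensive object graph is materialized just to sort.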

Many Postgres folks seem to think putting your rules in your database layer is a good idea, because you (1) get to deploy them automagically, (2) the commit logic is easier when there is no round trip, and (3) the PL/pgSQL language is actually pretty good.

I don't work on things big enough to think about this, and I would be curious if there were a summary of why / why not to put business logic in the DB, and where it SHOULD go.

but hard to believe in 2005 someone thought it was a good idea to build your business rules in your database layer.

I remember having that exact argument in 2005. Someone not only thought it was a good idea, but thought it was a good enough idea to aggressively argue with me for.

This brought up some repressed memories. As part of the same argument, people were advocating for keeping executable code (VbScript) as well as HTML strings in the database layer.

The whole codebase was this Lovecraftian nightmare where I kept fearing I would go insane if I kept reading.

Don't ever get a job at a bank. It's all stored procedures with no mocked out unit tests. Like others below mentioned, it's the "Enterprise-y" way of doing things. Coding with GUIs.

Didn't Microsoft with ASP.NET Pet shop still advocate stored procedures around that time?

I burned myself in that time frame, trying to implement SPs, because some people advocated for them.

Like Etsy, I discovered you needed a lot of them, and worse, they weren't very easy to debug or version control. I think I decided to pull them back into the application layer in the face of those problems, and I probably would have avoided the SP path entirely if I'd known about sharding back then.

And a lot of people still do. It's a rather 'Enterprise-y' and database-centric view of the world, but a lot of people (wrongly IMO) think that way.

I just went to a talk last weekend where the speakers advocated storing business logic in the db.

I suspect the real problem was social or political. To do anything that involved the database, the developers had to request that the DBAs write the SPs. This kind of bureaucracy invariably leads to developers advocating NoSQL/schemaless databases, storing data structures as XML in a blob field, or whatever else they can come up with to route around the obstacle to development.

They chose a new architecture and platform which made the broken process impossible for technical reasons, but I wonder if they would have gotten the same benefit cheaper if they had just fixed the process in the first place?

But I guess it is easier politically to shift platform than changing a process which reflects turfs and hierarchies in the organization.

I honestly don't understand all the hate on stored procedures. Tracking changes is trivial, since they're a part of the schema -- use version control. It's quite possible to use them with a properly sharded DB architecture, and still enjoy their performance and security benefits.

As a quick example, how else would you do an upsert (update or create) with a single trip to the database server?

I didn't see the presentation but it looked to me like the hate was really focused on the dev team structure that had DBAs owning all the sprocs. I've seen this on other teams I've worked with and it is indeed a nightmare.

For whatever it's worth, you shouldn't need sprocs for security; in fact, sprocs provide one more way for SQLI to sneak into your code (dynamic queries based on tainted input inside the procedure) --- something that comes up a lot when DBAs are forced to be a first line of defense against attacks.

There's nothing wrong with stored procedures per se, but the architecture where every database call is a stored procedure matched to a use case from a higher tier is error-prone. PL/pgSQL is not a great programming language.

> PL/pgSQL is not a great programming language.

Its main feature is that it resembles PL/SQL.

Otherwise, you can choose to write stored procedures in Ruby, Perl, Python, Lua, C, R ...

> There's nothing wrong with stored procedures per se, but the architecture where every database call is a stored procedure matched to a use case from a higher tier is error-prone.

Agreed in general; I prefer to use stored procs tactically. The worst stored proc architecture I've seen is the idea that every table should be wrapped in a stored proc API. It's hideous.

> Tracking changes is trivial, since they're a part of the schema -- use version control.

It's not quite that simple. You can't just "install" the new schema onto an existing database with data like you can by just copying files. You have to write your own set of alter statements to keep things synchronized (sometimes with complicated data migrations to new tables).

It works ok for 1 or 2 databases since you just do it manually - but then you lose many of the benefits of automatic source control.

And if you have a large number of database servers - or many unrelated (client) installations - you need a much more complicated system, and it's far from trivial.
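One common workaround for that synchronization problem, purely illustrative here (SQLite's user_version pragma standing in for a real migrations table), is to version the schema and replay only the migrations a given database hasn't seen:

```python
import sqlite3

# Each migration is an ordered step; the database records which version
# it has reached, so any copy can be brought forward mechanically.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)",
    "ALTER TABLE users ADD COLUMN created_at TEXT",
]

def migrate(conn):
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(statement)
            conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # re-running is a no-op: already at the latest version
```

This is essentially what migration frameworks automate; across many servers or client installs it still needs orchestration, which is the "far from trivial" part.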

Putting the SQL commands in the code is a lot simpler. There are no security benefits to stored procedures vs bound SQL statements. Perhaps it's a bit faster, but I'm not so convinced; you are trading parsing time for execution time, since the stored procedure is now a program instead of a data update command.

Stored procedures have different security semantics than bound sql statements. See the security definer attributes.

Briefly, a stored procedure can run with the calling user's permissions, or the definer's permissions. If you set up a function as a security definer, you can do things with data that protect it from disclosure in a way that you can't in a sql statement. You can do it similarly with views, but they're more of a read only case.

(sometimes with complicated data migrations to new tables).

To be fair this is an issue with or without sprocs.

> As a quick example, how else would you do an upsert (update or create) with a single trip to the database server?


And that right there is why they are generally a bad idea. You can write a condition to check how to save your data or you can write the code necessary to verify a user's email address. Assuming you have time for one task, which is more valuable?

Taking the example further, would you hand write how to verify that email address? Probably not, someone has obviously done that before and you shouldn't reinvent the wheel. Why would you hand write persisting a record to the database then?

I apologize if this comes off as attacking, but stored procedures are one of the greatest time sinks ever created, wasting countless dollars that could have been spent fixing real problems. Like this guy's: http://news.ycombinator.com/item?id=3067740

DB2 has a MERGE INTO command that lets you accomplish this - I expect lots of other DBs have similar functionality.
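For concreteness, most databases now have a single-round-trip answer without a stored procedure. A sketch using SQLite's ON CONFLICT clause (SQLite 3.24+; Postgres 9.5+ has the same syntax, MySQL spells it INSERT ... ON DUPLICATE KEY UPDATE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")

def upsert_setting(conn, key, value):
    # One statement, one round trip: insert the row, or overwrite it
    # if the key already exists.
    conn.execute(
        """INSERT INTO settings (key, value) VALUES (?, ?)
           ON CONFLICT(key) DO UPDATE SET value = excluded.value""",
        (key, value),
    )

upsert_setting(conn, "theme", "light")
upsert_setting(conn, "theme", "dark")  # updates, doesn't duplicate
```

Of course, in 2005 the portable options were thinner, which is part of why stored procedures looked attractive for this.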

What are the institutional reasons that "frequent, small" software releases are never the first thing companies turn to? I mean, iterative development and TDD have been around since the 50s, but it seems that every single company has to rediscover it before they try it.

I think it is because more "control" and "testing" over changes sounds better, and safer, at first glance than speed-to-deploy does. It's easier to choose the option that sounds stable.

But then over time, companies realize that their tendency to add bureaucratic gate on top of gate to the process has begun to harm them because they can't release simple updates in anything less than a few months.

Frequent and small deployments sound riskier, and tougher, at first.

Frequent and small deployments sound riskier, and tougher, at first.

Helping people re-think this perverse risk assessment has become a sort of personal mission for me. One of my most rewarding experiences was setting up automated testing and deployment from day one for a new startup.

Hmmm... that last sentence doesn't jibe totally with me. As a business guy, I'd rather diversify my risk over many deployments than over one deployment every six months.

The word "control" probably has more to do with it. The option that makes management more important (big requirements documents, for example) is probably the one management will choose.

I can tell you what the problem is: overzealous postmortems.

The head of engineering wants to be able to say what went wrong and how to fix it next time. That's probably fine, if the solution is better automation and failure detection. But once you go into postmorteming a situation where you dropped a few requests to a minor service, the whole process turns into a pile of shit where the engineers and operations team are scared, management feels in control, and the product managers are left screwed because engineering and operations no longer take risks and everything takes forever.

Engineering management doesn't stop and look at the cost of what they're doing and when software takes forever to release, they don't look in the mirror for the blame. I received a weekly technical email once containing info that 10 requests were "throttled" during a deployment. That's great! We don't need any "next steps", our failover worked fine.

Control is part of it, but it's also due to the fact that it is hard to implement features to work in a heterogeneous environment. The temptation is to dev and test against a fully deployed environment, when really you should test probable configurations with several pieces either missing or performing badly.

Frequent, smaller releases take more work upfront. You have to build the process first - lots of automated testing, a way to deploy automatically, etc.

dBaseII - sales people don't want $SOFTWAREv1.05, they want The New Version. They can then sell that upgrade, and the extra features. PHBs get caught up in that. Kaizen and iteration doesn't feel exciting; you don't have a release buzz.

Sprouter also 'created substantial developer friction,' Snyder added, because it required DBAs to write stored procedures for nearly every piece of site functionality—and created a bureaucracy developers had to go through to get functionality made.

Sounds like a horror story.

And to reiterate the last statement of the article: if you're doing something "clever" you're probably doing it wrong. Or in other words, if you don't know what the hell you are doing, then copy someone who probably knows what they are doing.

The insane part is that a startup has bureaucracy, not that code is in a RDBMS.

Bureaucracy is just process that you don't like, isn't it?

It goes beyond that. Process is rules that stop people from being able to do things they shouldn't; bureaucracy is process that stops people from being able to do their jobs.

Process is rules that stop people from being able to do things they shouldn't

People tend to disagree on who should or shouldn't be able to do what. Should a developer be able to release code to production servers? Should a developer be able to change the database schema? Should a developer be able to access the database tables through an ORM or ad-hoc SQL, or must everything go through stored procedures?

A healthier way to look at process is that it's about facilitation, not control.

Bureaucracy is a kind of process group think. It involves adherence to fixed, externally defined, rules. Each contributor is doing things in a way they think are cumbersome and difficult to change.

The alternative to fixed rules, is to discuss issues and tradeoffs, and trust developers to make good decisions based on knowledge.

and trust developers to make good decisions based on knowledge

I would add to that, "give individuals enough room to make mistakes that everyone, especially themselves, can learn from"

I'm a little surprised this sentence hasn't gotten more attention or turned more heads:

The presentation side was driven mostly by PHP on Lighttpd web servers, chosen at the time because the Etsy team felt Lighttpd was less common and less likely to be hacked.

Me too. Security through obscurity reasoning.

Yeah, I'm going to assume the technology change was not what improved things, but the engineers who were working on the problem.

At the end of the day, people solve problems, not technology. Someone with sufficient skill could probably rewrite the lion's share of the environment in Perl, had they been given the task and enough resources.

It's all about your engineers, and the resources you give them!

>Yeah, I'm going to assume the technology change was not what improved things, but the engineers who were working on the problem.

I agree, this sounded like a technical solution to a social problem. The real issue sounded like the disconnect between the application folks and the db folks. If trashing one decent architecture for another was the catalyst they needed to bring around development culture change, it was a good move, but it hardly sounds like they had a technology problem.

There is so, so much more to this story that isn't here that it is practically not worth considering stuff like "mysql vs postgres" etc.


To sum up, how Etsy fixed their technical problems: they hired a bunch of people from flickr.

I wonder why the developers couldn't write the stored procedures.

I'm hoping that the postgresql projects for easier master-master replication get done soon!

In my experience it's usually because the DBAs end up "owning" all database development, both because the DBAs consider it their domain, and developers don't feel comfortable with PL/pgSQL and they don't want to anger the DBAs. It shouldn't have to be this way, though.

I learned at PgWest last week that in August 2011 TransLattice (commercial multi-master replication system) became available. http://www.translattice.com/

It's based on Postgres 9.0 and Postgres-R.

Also, there is a talk "Total Availability With PostgreSQL & Multi Master Replication" on Friday October 21st at PgDay Denver: https://pgday.consistentstate.com/node/51

While I agree that MySQL is great and probably all most anyone really needs, it does become harder to shard when you're dealing with large scale. I feel you can further extend your agile team with the benefits of NoSQL (speed, ease of use like quickly adding/dropping columns). Some big data sites I know use both.

MySQL and Postgres are awesome for a lot of what websites do. They have advantages and disadvantages. Things like sharding tend to be premature optimizations except for write-heavy sites running at high scale. I prefer a mixture of SQL and NoSQL for the projects I work on. Redis, especially, is great at things like atomic updates and counters that become cumbersome in sql. With a well architected model-layer the complexity of multiple datastores can be minimized significantly. By using the right tools for the job, we are often able to simplify even versus using one database and a hacky solution.

Isn't updating a counter in SQL just: update counters set amount = amount + 1 where id = something?
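Essentially, yes. A runnable sketch (SQLite here, for illustration): the read-modify-write happens inside one statement, so concurrent clients can't lose increments the way a SELECT-then-UPDATE would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (id TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO counters VALUES ('hits', 0)")

# The increment is atomic within the statement; three calls yield 3,
# regardless of interleaving.
for _ in range(3):
    conn.execute("UPDATE counters SET amount = amount + 1 WHERE id = 'hits'")
```

The objection in the thread is not that this statement is hard to write, but that on a sharded cluster the row's location and write contention make it more expensive than a dedicated counter store.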

In a basic application, yes, this is true. But if you have a large cluster and data is spread out, it becomes more difficult, not to mention time-consuming, to run that query. Redis and other NoSQL solutions are designed with this in mind, so they are much faster and easier.

You didn't check for overflow.

Don't check for overflow. Simply use a data type so large that it would take centuries to overflow.

If you were doing a million increments per second it would take almost 600,000 years to overflow a 64 bit int.
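That arithmetic checks out for an unsigned 64-bit integer (a signed one still buys roughly 292,000 years):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~31.5 million

unsigned_max = 2**64 - 1
signed_max = 2**63 - 1
increments_per_second = 1_000_000

years_unsigned = unsigned_max / increments_per_second / SECONDS_PER_YEAR
years_signed = signed_max / increments_per_second / SECONDS_PER_YEAR

print(round(years_unsigned))  # ~585,000 years
print(round(years_signed))    # ~292,000 years
```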

Check your assumptions. The field could be updated at any time to any value by another connection.

And who's to say we started at zero.

My background is business-first, hacker-second, and I mainly read everything on HN to pick up little places not to fuck up when architecting a software business.

The benefit of NoSQL's ease of dropping/adding schema...that's going to be one of the better ones.

I hope your code is compatible with mixed schemas. Otherwise your NoSQL is just moving the problem somewhere harder to test.
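The usual mitigation (hypothetical field names here) is to make every reader tolerant of every schema version that ever wrote a record, since the store itself enforces nothing:

```python
# Records written under different code versions coexist in a
# schemaless store; the reader, not the database, reconciles them.
old_record = {"name": "scarf", "price": 12}        # v1: price in dollars
new_record = {"name": "hat", "price_cents": 1500}  # v2: price in cents

def price_cents(record):
    """Normalize whichever schema version this record was written under."""
    if "price_cents" in record:
        return record["price_cents"]
    return record["price"] * 100  # fall back to the v1 field

print(price_cents(old_record))  # 1200
print(price_cents(new_record))  # 1500
```

Every such shim is a test obligation that a migrated relational schema would not have.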

Can't vote this up enough.

The use of NoSQL is for very specific scenarios, don't make the mistake of thinking it's a good object store replacement. As a business guy this is definitely not a decision you should be making, let your team use what they're happy with, they'll get a lot more done that way.

I refer to this as the database making you eat your vegetables.

"I just want to change this tiny thing, what could possibly go wrong?"

"Nein!" shouts the database. "It will violate this and that and the other rule you told me to enforce!"

"Stupid database, always getting in my way. I'm going to install MongoDB/Redis/Couch ..."

12 months later: "Hang on ... which version of the code were these 40,000 records entered under? Does that integer field refer to the protocol 2 or protocol 3 headers? Why is there some XML in the email field?"

LOL. Awesome analogy. Muchas gracias to your reply and everyone else's.

