"Moving a message needs to move all of the message's comments, and all of the comments' files, and all of the comments' files' versions."
Why can't you just change some top level reference in the database? I'm imagining a Projects table and a TodoLists table. Each TodoList has something like a projectID foreign key right? Why can't you just change that and automatically have all of the messages, etc, come along with the TodoList?
"But I hope you will see that sometimes even the simplest feature can be much more complicated than it looks from the outside."
I agree there, just wondering why the simple solution doesn't work in this case.
I'm going to take Sam's word on this that there is not a simpler way, because he's the one working on it, and I trust the competence of the 37s dev team. That's not to say that some armchair analysis here on HN might not have valuable insight, but without actually seeing the code, and in light of the full list of points Sam mentioned, I doubt very much that there is a much simpler solution of any form.
How is that for a mixed metaphor?
That's certainly true for a low-level, high-precision estimate, which you can only make when you've got detailed requirements. Prior to that you still need to be able to deliver an estimate, and the best approach to take is to provide an estimate with Quantified Uncertainty. The phrase is important, because the estimate is for managers and managers are all about quantifying risks, and uncertainty in a development estimate is definitely a risk.
When you say "I don't know how long this will take, because I don't know how much I don't know about the solution" you're giving an estimate with uncertainty, but you're not quantifying the uncertainty. It's got no bounds, it's limitless. Managers can't make any decisions with that, so it's useless information. But if you say "This will take 4 to 8 effort weeks, and I'll be able to give a more precise estimate after 2 effort weeks of investigation" that gives your manager something to work with. You're still uncertain, but you've quantified the bounds on the uncertainty, and decisions can be made. For example, solving the problem might not be worthwhile unless it happens within the next two weeks (eg: to meet a release deadline) so the manager can just delay the work until later.
The trick with this approach is setting the bounds. The man who taught me this style of estimating called it a Surprise Range. The lower bound should be an estimate such that you'd be surprised if it took less effort than that, and the upper bound should be an estimate such that you'd be surprised if it took more effort than that. Given whatever information you have about the requirements, you just keep pushing your bounds until you can honestly say "I really don't think it would take less/more effort than that."
As you get more experienced, you'll naturally start accounting for the unknown unknowns, because your past experiences with them will be nagging in the back of your mind while you're trying to come up with an estimate for some new task. This is especially true if you're estimating work on a project you've worked on before; you'll have a feel for where the trouble spots are and whether or not the task in question is going to get into those spots.
The other trick with this approach is the "more precise estimate after some investigation" bit of the statement. You need to provide some estimate now, buy some investigation time, and commit to another estimate later. Most of the time the second estimate will fall somewhere within the bounds of the first one, but not always, and your manager should be aware of that possibility. But with experience, you'll usually wind up within the original bounds and narrower than the original bounds too, which is quantifiably less uncertainty than before so you can show progress.
A month later my peer, a senior engineer with the company, thought "We should run 32-bit! How long can it take?", checked out the source, and started changing files. Gradually people noticed what he was doing, resources got added under the table, and finally, a year later, he had something working. Lots of kudos, smiles, what a hero!
I showed him the report I filed a year ago - and he said "yeah, that's about what I had to do. I'm glad I didn't see that to begin with, I would never have started".
Also, "two man-years" isn't a complete estimate using my approach. "One to two man-years" would be a complete estimate, providing a range. Sometimes you really do come in near the bottom of the range, and maybe a month into it if you re-estimated you would have come up with a lower range.
This brings up another point of contention between development and project management. When developers say "estimate", they mean "best guess in the face of uncertainty", but project managers tend to think of developer estimates the same way as plumber, electrician, and car mechanic estimates. Those guys are saying "I'm not sure exactly how much work I'm going to have to do, but if you pay me $X I'll do it." They inflate $X so that most of the time they wind up overcharging, in order to cover the occasions when they wind up doing much more work than expected. The $X is a commitment; the estimate of how much work is needed is a separate thing.
It doesn't work this way in software development, because there are factors out of the developer's control. The developer can say how much effort a task is likely to take (the ranged estimate) but not how much time (the commitment). The time depends on other things the developer has to work on, both expected and unexpected, dependencies on other developers and tasks, holidays and days off for the developer and others the developer depends on, etc. That's all stuff the project manager is supposed to keep track of, so the developer doesn't have a full picture. Unfortunately, most project managers don't recognize this, so they treat the effort estimate as a time commitment, set an arbitrary start date, and expect the developer to be done X days later.
I don't, in large part because this shouldn't be a complicated problem to solve.
Your confidence probably serves you well (let me guess, early 20s?), but in this case you literally don't know what you're talking about.
But that's from personal observation, and I haven't made a point of recording my findings for analysis. I suspect you haven't either.
> but in this case you literally don't know what you're talking about.
I don't need to know what exact decisions they made to be able to tell that they were poor decisions. Easy things should be easy, and when they're hard, they're hard because of incompetence somewhere in the process.
FWIW, I've worked on a site extremely similar to basecamp (but with far more traffic, at least according to Alexa). I'm not speaking from inexperience: if this problem is hard for 37signals, it's because of failures of the implementation, not because the problem is intrinsically hard. Engineers who don't recognize that are not ones whose competence I would put trust in.
The problem is easy. Seriously, it really is. Most of the programmers here have probably done something almost exactly like it ten or more times in their relatively short careers; I know I have. There is no real intrinsic difficulty to the problem.
I'm not saying that the programmers at 37signals were incompetent. I'm saying that I do not trust that they are competent. I have no reason, especially in the face of this evidence, to believe that they're competent. They made a successful software product: so have lots of other programmers of questionable competency.
The fact remains that at least in this example, one of their programmers points at an intrinsically easy problem and says, "This is really hard because I'd have to do a lot of stuff that I really shouldn't have to do", except he doesn't seem to realize that he shouldn't have to do those things. So tell me, in the face of that obvious fact, why should I trust 37signals' engineers' competence?
It's easy in the simplified mental model that you have constructed devoid of real world context. Any engineer worth their salt knows the devil's in the details. I'm going to stop arguing now because you're not even responding to my actual argument. It's not a zen thing, you should be able to get it if you actually read my comments, but just in case you need a koan, ponder this fact that I would bet my life savings on:
There exists a potential feature which could be implemented in Basecamp faster than in your product (and vice versa).
I constructed my mental model based on the detailed description by a BaseCamp engineer of what he'd have to do in order to solve the problem in his system!
> I'm going to stop arguing now because you're not even responding to my actual argument.
Welcome to my world: you've been arguing this whole time as if I were claiming that BaseCamp engineers were incompetent when I've only been saying that I don't trust that they are competent.
> There exists a potential feature which could be implemented in Basecamp faster than in your product (and vice versa).
There are probably many such features. That's really beside the point, unless you're claiming some additional knowledge here that the decisions made in this particular design contributed to the ease of those potential features. If you're only assuming that because you trust the competence of 37signals, your entire argument is circular and depends on the very claim I'm contesting.
Do you realize how weaselly that is? What does that mean? Connotatively it takes a huge swipe at the 37s team without actually making any commitment. You might as well have said nothing at all if you're not going to take a real stand.
There's a distinct difference between saying that you know someone is incompetent, and saying that you don't know that someone is competent. It's the same as the difference between agnosticism and atheism.
What I am also saying, and what you're probably really arguing with, is that the linked article constitutes evidence against 37s' competence. That's still very different than saying that it shows that they're incompetent, which is how you've been mischaracterizing my posts.
Understood, but it's still weaselly because you're offhandedly casting aspersions on a group of people's ability ("I have no reason to believe those people are competent") and then pretending like it doesn't hurt their reputation because of the exact words you spoke ("I never said they were incompetent, just that I have no reason to believe they are competent").
Agnosticism vs. atheism, while logically the same distinction, does not have this libelous aspect to it.
Why the h... doesn't their ORM, or even better their database, already take care of cascading updates?
Why does it have to be implemented manually for every code that updates something? (... which is of course expensive and prone to errors)
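For context, this is roughly what the parent comment is asking for. A minimal sketch using SQLite from Python (the schema and names are my invention, not Basecamp's): when comments reference only `message_id`, a move is one field change, and `ON DELETE CASCADE` lets the database handle dependent cleanup with no hand-written code. (Note that MySQL's MyISAM engine silently ignores foreign key constraints, so this can't be assumed on every stack.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite

conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    message_id INTEGER NOT NULL REFERENCES messages(id) ON DELETE CASCADE
);
INSERT INTO projects VALUES (1), (2);
INSERT INTO messages VALUES (10, 1);
INSERT INTO comments VALUES (100, 10), (101, 10);
""")

# Moving the message is one UPDATE; the comments reference message_id,
# not project_id, so they come along for free.
conn.execute("UPDATE messages SET project_id = 2 WHERE id = 10")

# Deletes cascade automatically: removing the project takes the message
# and both of its comments with it, with no manual cleanup queries.
conn.execute("DELETE FROM projects WHERE id = 2")
remaining = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(remaining)  # 0
```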
So I guess the main reason people pay for Oracle is because they only know about MySQL and not about PostgreSQL.
The irony, of course, is that a single Postgres server is frequently (IME) more reliable than multiple MySQL servers, even when the latter is set up for HA.
Also note that if you want to trade some of the ACID properties for better performance and replication, the so-called NoSQL databases (CouchDB etc) seem to be a better trade-off than, say, MySQL with MyISAM instead of InnoDB.
While I fully agree that trade offs like yours do occur in our profession, their mere existence is no argument against my initial non-trust of 37signals' competence: you need to actually demonstrate that this problem is one of those cases, and I think that would be difficult to do.
The problem of associating entities in a way that their associations can be modified easily and consistently (i.e., normalization) is effectively a solved problem. Just as a programmer complaining that his program is slow because he's doing a linear search through a linked list would not inspire confidence, a BaseCamp programmer complaining that a requested feature is difficult to implement because their data is in an unnormalized form does not inspire confidence, and certainly gives no reason to trust his competence.
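To make the contrast concrete (a toy sketch; the tables and ids are hypothetical): once `project_id` is duplicated onto child rows, a "move" fans out across every descendant, whereas the normalized form touches a single row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE comments_denorm (
    id INTEGER PRIMARY KEY,
    message_id INTEGER NOT NULL,
    project_id INTEGER NOT NULL  -- duplicated fact: derivable from the message
);
INSERT INTO comments_denorm VALUES (1, 10, 1), (2, 10, 1), (3, 11, 1);
""")

# Moving message 10 to project 2 now means rewriting all of its comments...
cur = db.execute(
    "UPDATE comments_denorm SET project_id = 2 WHERE message_id = 10")
print(cur.rowcount)  # 2 rows had to change for one logical move

# ...while in the normalized form, comments carry only message_id and are
# untouched by a move; project membership is recovered via a join.
```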
In Erlang, the most common data structure is the linked list. There's no way to search them other than linearly. You are claiming that in general we should suspect that Erlang programmers are incompetent?
In the case of sharding by project, it actually seems like a decent move given that a single Basecamp account has potentially large space requirements (75 GB), and you can't realistically provision enough space on a box to meet all the needs of a bunch of expanding accounts on that box. Being able to throw an entire project onto a new server and assuming that everything associated with a project is on that one server are both nice simplifying assumptions which seem like they would strike a nice balance.
Please stop painting everything with broad black and white strokes, programming and business in general are frequently more about striking the right balance of compromises than about knowing the "best" way to do things. There's usually not a universal "best". 37Signals seem to be doing a marvelous job balancing those compromises between their various disciplines, especially given their massive success with customers.
Edit: This is assuming they're storing attachment files on the same machine, which might very well be wrong. It would be pretty hard for one group to get up to 75GB of todo's and other text content...
That said, if they have their database sharded by project_id then it would not be that big an issue, but it may be that their database structure is very complex or messed up...
Millions of users is not really that many. It's certainly within the realm of what can be reasonably vertically scaled.
Hundreds of the top websites in the world (the majority, I would hazard, though without the data to support it) have scaled vertically to millions and tens of millions of entities just fine. Far more than have scaled horizontally. It works. It's been done. And it doesn't give up the sort of transactional niceties that make problems like this easier.
> most other web companies like Google, Facebook, Yahoo and Microsoft
You're confusing "the very biggest web companies" with "most other web companies." Most other web companies continue to use commodity products, and Google/Facebook/Yahoo!/MS certainly would (and do) insofar as it's possible at that scale. Expending resources now to be as horizontally scalable as Google is wasteful premature optimization.
Notably, Yahoo! runs the largest PostgreSQL installation in the world, and Google and Facebook both continue to use MySQL.
> horizontally scaling without big iron is the way to go.
You can get 32-core machines with 128GB of ram from Dell (a mildly tweaked R910) for $30k these days. Is that big iron? How does its price compare with the amount of developer salary and benefits you'll have to spend to grok a non-relational data store, migrate your data to it, and reimplement the ACID features of a relational store in the code for your app? How many developer-days will you spend maintaining that code and how many developer-nights will you spend triaging a crashed site because of the complexity and likely bugginess of that reimplementation? How many users' feature requests will you have to reject as "too difficult to implement" because you feel the need to scale to Google/Facebook levels despite having only a few million users now and predicted growth which shows you'll never in a million years catch up to them?
> Currently the only way to scale a relational database such as MySQL or Postgre is by sharding and partitioning
It will be years before the vast majority of startups exhaust reasonable, cost-effective options for vertical scaling. The recent fervor for non-relational, horizontally scalable data stores is simply the new way of scratching the intellectually masturbatory premature optimization itch that programmers have had since ENIAC.
For what it's worth, I'm not the only crank who thinks this; Dennis Forbes has argued it much more eloquently and compellingly on his blog, e.g. http://blog.yafla.com/Getting_Real_about_NoSQL_and_the_SQL_P... .
Do note that I am not saying that small websites should shard or scale horizontally -- but big sites with millions of users and tons of data should not scale vertically (it can't pay off and at some point they'll hit the limit).
No doubt, but 37Signals shouldn't have a lot of relational data. The bulk of their per-project bytes, it seems likely, is non-relational stuff like attachments.
> Given the size of Basecamp and 37 Signal's future projection I doubt it would be wise to hope that they can scale vertically
Isn't it less wise to pay a cost you don't yet need and may never need to pay? You pay a significant price in development velocity by forgoing a relational database and using a non-relational data store. Certainly any reasonable organization should be able to project when they will actually need to pay that cost.
> big sites with millions of users
Single digit millions of users isn't that big.
> and tons of data should not scale vertically (it can't pay off and at some point they'll hit the limit).
It can certainly pay off if you never actually need to convert to a non-relational data store. The limit is a lot higher than you seem to think: banks and financial institutions process billions of transactions for hundreds of millions of users daily on the same ACID, relational data stores that you're saying a site like BaseCamp will hit the limit of. I don't buy it.
So this is more an example of "How to make simple tasks difficult" rather than "Even simple tasks have their pitfalls".
I once inherited a large enterprise system that had been designed from the ground up to allow that kind of flexibility. The problem was that there were so many levels of abstraction and indirection in the database and the ORM model that performance was abysmal.
"We can't use database transactions because performing a big move would slow Basecamp down for everyone. So we have to log the process of each step of the move, and make it so any failure in the move can be rolled back gracefully. That means a move is actually a series of copies and deletions instead of just changing a field for each moved item"
Now consider that Basecamp allows comments on messages, todo items, and milestones. What does your schema look like now? Add a feature to show a user all recent comments across all of his projects. What does your query look like? How does it perform? Get all of this (not to mention the other features) working at scale.
Maybe you can build something like Basecamp that works at Basecamp's scale without resorting to denormalization, sharding, and/or partitioning. But I doubt it.
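To sketch the schema question above (all names are mine, not Basecamp's): the usual Rails answer is a polymorphic comments table, and the "recent comments across all projects" feed is cheap only if `project_id` is denormalized onto every comment row -- which is exactly the duplication a later move has to chase down.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Rails-style polymorphic association: one comments table whose
-- (commentable_type, commentable_id) pair points at a message,
-- todo item, or milestone.
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    commentable_type TEXT NOT NULL,
    commentable_id INTEGER NOT NULL,
    project_id INTEGER NOT NULL,   -- denormalized to keep the feed query cheap
    created_at TEXT NOT NULL
);
CREATE INDEX comments_feed ON comments (project_id, created_at);
INSERT INTO comments VALUES
    (1, 'Message',   10, 1, '2010-03-01'),
    (2, 'TodoItem',  20, 2, '2010-03-02'),
    (3, 'Milestone', 30, 3, '2010-03-03');
""")

# "All recent comments across all of this user's projects": one indexed
# scan, thanks to project_id living on every comment row.
rows = db.execute("""
    SELECT id FROM comments
    WHERE project_id IN (1, 2)
    ORDER BY created_at DESC
    LIMIT 10
""").fetchall()
print([r[0] for r in rows])  # [2, 1]
```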
That's a small piece of the picture. With partitioning, a move could entail removing data from one table and inserting it into another. With sharding, a move could mean removing the data from one database and inserting it into another, which probably means manually ensuring consistency because your typical transaction won't span databases.
What's all this talk about "re-implementing" and "if" they have transactions in their DB? They explicitly state that they can't use transactions because it would slow things down, not because they don't have them. They also state that "moving one milestone could potentially result in hundreds of database operations..."
This means they have a transaction supporting database. There's no re-implementing onto a different system. There's an apparently-broken implementation.
I see this all the time; developers who have "grown up" in an environment where locks and cursors are expensive pick up some odd habits that don't translate well when they code in an environment (such as Oracle) where locks are cheap and cursors are free.
 Second paragraph: http://37signals.com/svn/posts/2479-nuts-bolts-database-serv...
Edit: I found this:
> With that in mind, we went looking for an option to host the Basecamp database, which is becoming a monster. As of this writing, the database is 325GB and handles several thousand queries per second at peak times.
325GB RAM?! But now they have multiple servers. Read more: http://37signals.com/svn/posts/2479-nuts-bolts-database-serv...
- Moving a message needs to move all of the message's comments, and all of the comments' files, and all of the comments' files' versions.
Not really. A comment should only be 'tied' to its message, i.e. there should be a comments.message_id column, so moving a message somewhere else shouldn't require moving its comments (unless you fucked up the db schema to begin with).
- Moving any file needs to move its thumbnail, too, if it's an image.
No need to move the file to begin with, see above.
- Moving a milestone needs to move any associated to-do list and messages. And of course those to-do lists can have to-do items with comments, and attached files, and multiple versions of those files.
This is probably the one that indeed needs special treatment even in a reasonably designed db. There are ways to skip this if you anticipate the need at the very beginning, but it would require some convoluted db hacks to allow a simple O(1) move (i.e. make everything, including the project, just an item with a parent_id, then have messages be children of the milestone; but this has other problems, performance and complexity etc., so I wouldn't go this way).
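A sketch of that "everything is an item" design (all names are mine), showing both the O(1) move and the cost it shifts onto reads:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- One table, one parent_id: the whole account is a single tree.
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES items(id),
    kind TEXT  -- 'project' | 'milestone' | 'message' | 'comment' | ...
);
INSERT INTO items VALUES
    (1, NULL, 'project'),
    (2, NULL, 'project'),
    (3, 1,    'milestone'),
    (4, 3,    'message'),
    (5, 4,    'comment');
""")

# The O(1) move: re-parent the milestone; everything beneath it follows.
db.execute("UPDATE items SET parent_id = 2 WHERE id = 3")

# The price: "which project does comment 5 belong to?" is now a walk up
# the tree (or a recursive CTE), paid on every read.
node = 5
while True:
    parent, = db.execute(
        "SELECT parent_id FROM items WHERE id = ?", (node,)).fetchone()
    if parent is None:
        break
    node = parent
print(node)  # 2
```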
- Moving a to-do list whose to-do items have associated time tracking entries needs to move those time entries to the destination project too.
Not really. Time tracking entries should not have a project_id in them, just a todo_id.
- Moving a message or file needs to re-create its category in the destination project if it doesn't already exist.
- If a moved file is backed up on S3 we need to rename it there. If it's not, we need to make sure it doesn't get backed up with its old filename.
Why does the file path on S3 have the project id in it? attachments.id would suffice to uniquely identify stuff. It's not like you need to invent multilevel directories on S3 like you do on a file system.
- When someone posts a message to the wrong project, then moves it to the right place, we need to make sure that everyone who received an email notification from the original message can still reply to the message via email.
If the email link just has the message_id and not the project_id, it solves itself. You might need to play with some ACLs, though.
- Similarly we need to make sure that when you follow a URL in an email notification for a moved message, comment or files, you are redirected to its new location.
Don't include the project_id in the URL to begin with. Use 'flat' routes.
- Since moving one milestone could potentially result in hundreds of database operations, we need to perform the move asynchronously. This means storing information about the move, pushing it into a queue, and processing it with a pool of background workers.
Not really. Even moving a milestone should only require two UPDATE statements: move the milestone, and move all the milestone's children. That might be a non-trivial UPDATE SQL operation, but it's not hundreds of queries.
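The two-statement version of that move, sketched with a toy schema (names are hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE milestones (id INTEGER PRIMARY KEY, project_id INTEGER);
CREATE TABLE messages (id INTEGER PRIMARY KEY, project_id INTEGER,
                       milestone_id INTEGER);
INSERT INTO milestones VALUES (7, 1);
INSERT INTO messages VALUES (10, 1, 7), (11, 1, 7), (12, 1, NULL);
""")

# The whole move as two set-based statements, not hundreds of
# row-by-row operations pushed through a background queue.
db.execute("UPDATE milestones SET project_id = 2 WHERE id = 7")
moved = db.execute(
    "UPDATE messages SET project_id = 2 WHERE milestone_id = 7").rowcount
print(moved)  # 2 child rows moved in one statement
```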
- We also have to build a new UI for displaying the progress of a move. It needs to poll the Basecamp servers periodically in the background to check to see if the move is done yet, and take you to the right place afterwards.
- We can't use database transactions because performing a big move would slow Basecamp down for everyone. So we have to log the process of each step of the move, and make it so any failure in the move can be rolled back gracefully. That means a move is actually a series of copies and deletions instead of just changing a field for each moved item.
If you eliminate most of the complications above then you can use transactions.
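And with the schema simplified, the move fits in an ordinary transaction. A sketch (toy schema, my names) of the all-or-nothing behavior that replaces the hand-rolled copy/log/delete machinery:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE milestones (id INTEGER PRIMARY KEY, project_id INTEGER);
CREATE TABLE todo_lists (id INTEGER PRIMARY KEY, project_id INTEGER,
                         milestone_id INTEGER);
INSERT INTO milestones VALUES (7, 1);
INSERT INTO todo_lists VALUES (20, 1, 7);
""")

try:
    with db:  # one transaction: either every UPDATE lands or none do
        db.execute("UPDATE milestones SET project_id = 2 WHERE id = 7")
        db.execute("UPDATE todo_lists SET project_id = 2 "
                   "WHERE milestone_id = 7")
        raise RuntimeError("simulated mid-move failure")
except RuntimeError:
    pass  # the context manager rolled the transaction back for us

# Nothing is half-moved: the milestone is still in project 1.
where_now = db.execute(
    "SELECT project_id FROM milestones WHERE id = 7").fetchone()[0]
print(where_now)  # 1
```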
I doubt 37signals wanted to be in a place where an apparently simple change would involve so much work, but that's where they found themselves. They did what they had to do. There's no point snarking about their starting place without knowing how and why they got there.
- "Can you tell me how to get to Paris?"
- "Well, if I were you, I wouldn't start from here!"
Very French in its non-helpful but matter-of-fact way... but maybe the Irish are the same!
If you've spent more than 20 minutes of your life writing code, you've probably discovered a good rule of thumb by now: nothing is as simple as it seems.
Unless they're denormalized for optimization purposes? Threaded comments can result in some massively deep queries.
This almost invariably leads to a challenge that seems easy in theory but whose practical execution is plagued with unforeseen hurdles, crap, unexpected mishaps, and other random elements.
It's not that hard to draw parallels between software development and the crazy Top Gear challenges, which is probably a reason that many coders who've never been behind a wheel enjoy watching them.
You're right, under the right conditions. Namely: small site, early in lifecycle, and a perfectly normalized and non-sharded/non-partitioned database.
But I suspect it's more complicated because:
1. they have a ton of traffic
2. they want to maintain a seamlessly perfect UX everywhere
3. their database is denormalized, sharded and/or partitioned
4. it's gotten at least a little crufty with age (shrug, it happens)
I agree that even a seemingly simple feature can be hard to deliver, but on the flip side there are dials a developer can turn to adjust his LOE up or down as desired. Tradeoffs as always.
Link to article: http://blogs.msdn.com/b/ericlippert/archive/2003/10/28/53298...