I think rewriting from scratch is the core of their problem, not really Cassandra. Gradually migrating to Cassandra would have been a much better idea.
Solaris is another example of a rewrite that seems to have worked, though the rewrite did derive from a different set of existing code, not a total from-scratch job. But the classic SunOS 1.x-4.x codebase was ditched, and SunOS 5.x / "Solaris 2" replaced it.
Release to production dozens, if not hundreds, of times. Releases are non-events; rollbacks are non-events.
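That only works with some machinery behind it. Here's a minimal sketch of the kind of gate that makes rollbacks non-events; every name here is hypothetical, it's just the shape of the thing:

```python
import hashlib

# Hypothetical flag table; in a real system this would live in config or
# a runtime-changeable store so traffic can be dialed up and down
# without a deploy.
ROLLOUT_PERCENT = {"new_comment_backend": 5}

def is_enabled(flag, user_id):
    """Deterministically bucket a user into the rollout slice."""
    digest = hashlib.md5(("%s:%s" % (flag, user_id)).encode()).hexdigest()
    return int(digest, 16) % 100 < ROLLOUT_PERCENT.get(flag, 0)

def fetch_comments(user_id, story_id, old_backend, new_backend):
    # Rollback = set the percentage to 0; no code change, no redeploy.
    if is_enabled("new_comment_backend", user_id):
        return new_backend(story_id)
    return old_backend(story_id)
```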
A system-wide ground-up rewrite with a big-bang switchover at the end is a classic clusterfuck recipe. It's a shame that so many people think it's a good idea, even in 2010.
Sounds good in theory. In practice? Part of the problem with many big-ball-of-mud systems is that all the parts depend on and talk to all the other parts. Want to fix that horrid DB schema? You'll have to rewrite all the code that talks to it, or rewrite it to talk to an intermediary. Want to rewrite that horrid bit of code called foobar_20040623 ("foobar" has been changed, but yes, I saw this in some PHP code...)? You'll have to find out everything it interacts with, and likely redo that too.
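To make the intermediary idea concrete, here's a minimal sketch; the table and column names are invented stand-ins for a horrid legacy schema:

```python
class CommentStore:
    """Thin intermediary over the legacy schema. Callers depend on this
    interface; only this class knows the horrid table layout, so fixing
    the schema later means changing one class, not the whole codebase."""

    def __init__(self, db):
        self.db = db  # any DB-API 2.0 connection

    def comments_for_story(self, story_id):
        # The ugly join lives in exactly one place now.
        cur = self.db.cursor()
        cur.execute(
            "SELECT c.id, c.body, u.name"
            " FROM cmt_tbl_2004 c JOIN usr_tbl u ON u.uid = c.usr_fk"
            " WHERE c.story_fk = %s",
            (story_id,),
        )
        return cur.fetchall()
```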
1. Wallow in it (work with existing structure). Sadly, this is what a lot of people do. I left a job once because that was the only way out of the ball of mud. I was afraid I would turn into a mud-person.
2. Slowly crawl out of it (incremental rewrite). This is hard, but doable. It involves setting up barriers to mitigate ripple effects, automated testing so you can be comfortable with frequent releases (there's a sketch of that after this list), and tolerance for temporary imperfection: you need to be willing to frequently release things that are only a tiny bit better than the status quo. Not everyone is willing to accept that persistent imperfection and lack of conceptual consistency, especially when option #3 is more exciting and fun.
3. Try to leap out of it and land wherever (total rewrite). This is very, very hard, and prone to total failure or spending valuable money and time to effectively stand still in the market.
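On the automated-testing piece of option #2: with a ball of mud you usually can't write clean unit tests up front, but you can pin down what the code does today with characterization tests and then refactor behind them. A rough sketch, where the function under test is a stand-in:

```python
import unittest

def legacy_ranking(scores):
    """Stand-in for some horrid legacy function you dare not touch yet."""
    return sorted(scores, reverse=True)[:10]

class CharacterizationTest(unittest.TestCase):
    """Pin down what the code DOES, not what it should do. Rerun after
    every small release to prove behavior hasn't drifted."""

    def test_known_inputs_keep_known_outputs(self):
        # Expected values recorded from current production behavior.
        self.assertEqual(legacy_ranking([3, 1, 2]), [3, 2, 1])
        self.assertEqual(legacy_ranking([]), [])

if __name__ == "__main__":
    unittest.main()
```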
I've seen development teams leap directly out of one ball of mud into a different one. One where nobody even knew their way around anymore. How is that anything but a huge waste of time and resources?
In terms of "standing still in the market", I think the incremental rewrite contains a lot of that too, no? It's just spread over more time. You're still rewriting, and during that time you're not adding new features. An example might be retrofitting test code to a system that's never had any. That could be a fair amount of work, and given a constant pool of resources, it will take time away from "new stuff". It's just not such a quantum leap: you can drop your new testing work and go implement some must-have feature if you need to, without having to wait for the whole thing to be ready.
Sadly though, my experience in this is that the reason there's a ball of mud in the first place is a political/social one, so that any "dead time" is frowned upon.
You just have to plan things carefully, work hard, and keep your head screwed on. (Just like with many things in life ...)
I guess I could "change the names to protect the innocent" and tell some stories about digging out of tight places incrementally. If it would convince even one development team that they didn't absolutely have to do a total rewrite, it would be worth it.
Rewriting the code is fine; if it goes wrong, just back up to the old code, and users won't notice the difference. You can do it on a page-by-page basis: just route URLs selectively to the new install.
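The routing piece is genuinely small. A sketch of the idea as WSGI middleware, with made-up URL prefixes; the point is that migrating a page is one line, and so is the rollback:

```python
def selective_router(old_app, new_app, migrated=("/profile", "/search")):
    """Dispatch migrated URL prefixes to the rewrite, everything else
    to the legacy app. Rolling a page back = removing its prefix."""
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        target = new_app if path.startswith(migrated) else old_app
        return target(environ, start_response)
    return app
```

The same switch can of course live in the load balancer or front-end proxy instead of application code.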
Migrating data is something that should be done only under the most extreme circumstances, and something will inevitably go horribly wrong, so be prepared to roll back.
The approach I would take is to get the minimum set of engineers who know the most about each major aspect of the code and put their heads together on what the ideal architecture would be. But rather than building it from scratch, figure out how to implement just one of those pieces now. That way you can decrease entropy in the codebase piecemeal, without chucking out all the old code at once; that code is no doubt full of forgotten assumptions that no one will remember until it's too late.
Hello, Flash? (via Gruber)
Never mind the fact that the user base is confused by the rollout.
And never mind the fact that Google has forever damaged the term "Beta" in the minds of general Web consumers.
The general web consumer still thinks betas are a type of fish. Not really relevant to the median Digg user, though, since they are not the general Web user.
Implementing this should not be too hard, and it allows them to get some feedback.
In this case, you would not just have to write scripts to one-time migrate all the needed data from v3 MySQL to v4 Cassandra.
No. You would have to build a mechanism that doesn't just run once: it would have to sync in both directions, and it would also have to work at near-realtime speed.
If you then need consistency of the transferred data, this quickly gets impossible. Try finding a way to ensure consistency between these two completely different architectures.
In the end, most of v3 would have needed to be rewritten for a parallel use to be possible, at which point you don't really gain much.
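To see why, here's just the easy half sketched out: dual-writing each mutation so the Cassandra side stays warm. The client objects are placeholders, and the reverse direction plus repair of missed writes is omitted, because that's exactly the part that becomes impossible:

```python
import logging

log = logging.getLogger("dualwrite")

class DualWriter:
    """Mirror each write from the v3 store (source of truth) to the v4
    store. There is no transaction spanning both stores, so every
    failure branch below leaves them inconsistent until something
    replays it, at near-realtime speed, forever."""

    def __init__(self, mysql, cassandra):
        self.mysql = mysql          # placeholder v3 client
        self.cassandra = cassandra  # placeholder v4 client

    def save_digg(self, user_id, story_id):
        self.mysql.insert_digg(user_id, story_id)
        try:
            self.cassandra.insert_digg(user_id, story_id)
        except Exception:
            # Record for replay; until then, readers of the two stores
            # disagree about whether this digg exists.
            log.exception("mirror write failed: %s/%s", user_id, story_id)
```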
"not hard to implement" - sigh
Disclaimer: I don't work at Digg, and I don't know more about their backend than the rest of the public. I did, however, just get around to doing something like what you describe, and there it was "just" a different schema on the same database backend, and even just that was hell.
Now, in the current case, most of the issues apparently come from the non-working backend rather than the changed feature set.
So while they could have run the two versions in parallel, they would not have gained anything. Likely, this was their rationale behind not doing so in the first place.
It's not a system that would be ideal for something transactional like a bank, but it may have been possible for an organization like Digg.
> Implementing this should not be too hard...
Don't ever develop yourself into a one-way street!!
One way to "safely" do a big-bang change would have been to use both architectures in parallel and sync back and forth. That way, if the new architecture fails, you still have the old one.
Of course, that is lots of effort, and restricts new features (which may be a good thing anyway).
I think Cassandra is pretty well tested. There have been lots of super-large-scale deployments. It just seems lame to blame it on that, but I guess maybe their anonymous sources inside Digg revealed it? But then we'd hope they'd know if the problem was with the datastore or the implementation.
My guess is that some VP suggested a new buzzwordy technology, they gave him enough rope to hang himself, he did, and he left the company with a broken pile of crap. It could happen; if you have a healthy company, you give trust to people. That it got this far doesn't speak well of the rest of management. It doesn't speak really well of the rest of the team either. Shouldn't there have been some circuit breakers or something?
Digg isn't a poor, bring-your-own-laptop startup. They've got resources; they've had substantial investment. They can afford to build and test software, and I know of no real marketing reason they had to push something untested out. Rose could have gone out and said it wasn't done and was going to take more time.
How does Rose keep his job? Wasn't he this VP's boss?
And it's a single technology? One that couldn't have been vetted and tested independently of all of Digg? Really? And MongoDB, or Hadoop, or one of the dozens of other NoSQL stores wouldn't have worked either? You do what you have to do, and there is never a 'truth' when VPs and CxOs are canned, but it all doesn't float with me. It just looks like another overvalued and under-talented company that got lucky (the blind squirrel found the nut); there isn't gonna be a second act.
Maybe it just seems lowbrow to me: name and finger the guy, blame the open-source tool you use, and never explain or elaborate on why you launched anyway when you were fixing bugs in the tool at the 11th hour.
* the main marketing tool that digg has left is "Kevin Rose as genius"
* Kevin and the people he drinks with pushed for the adoption of new cool technology.
* the entire thing was a giant clusterfuck because of Kevin and the people he drinks with.
* but once that became obvious there was a need for a scapegoat so that digg could keep its primary marketing tool.
Also I said this was a "rough guess." Comments on TC from people I know who worked there (there's been a mass exodus for the last two years) suggest I'm mostly correct.
Finally, I think "sour grapes" is a poor rejoinder. Especially since it's the standard PR line at digg.
He calls it "still beta software" and states they were fixing bugs in Cassandra during the days leading up to the release of v4.
original blog post: http://about.digg.com/blog/looking-future-cassandra
Threads on HN: http://news.ycombinator.com/item?id=813528
Argh, I really want more insight; maybe Quinn will elaborate now that he's gone.
Back then they seemed to have a rather sensible migration strategy (i.e., basically running the new Cassandra backend in parallel with the MySQL backend).
It seems to me that it was the v4 upgrade that broke, not Cassandra alone. It's possible their frustrations with Cassandra were longer-term, and the v4 upgrade going badly was the last straw.
For example, Digg has done a lot of work on Cassandra internals and tools. If you are using a new open-source product you kind of expect that, but it's possible the expense of that didn't seem like good value once v4 started to get into trouble.
Unless you are owned by a larger company that runs you at a loss as part of their strategy. Or stragedy.
Evidently Digg is better at the monetization game than Reddit has been, so more sales staff I can understand... but devs?
Personally I have no idea, nor do I see any reason for that many people to be working for Digg, but my understanding is that Digg (or rather Kevin Rose) has always been about having the perception of being big without actually being that big.
Personally, I don't consider Digg a startup.
Do you have a source for that?
> it is used as a makeshift k-v store
That is very true, they should switch. It just seemed like they had no plan to do so.
> Do you have a source for that?
I screwed up our Cassandra deployment, and wrote about how I screwed it up. We were under-provisioned, and the version we were using didn't deal with the case of being overloaded in a graceful way. We're no longer under-provisioned, so I don't know if more recent versions deal with it better.
We've never claimed to have performance issues, I don't know where you're getting that one.
No malicious intent; after all, I did ask you for verification on Twitter, linking here.
Cassandra looks to be working fine for reddit now: http://blog.reddit.com/2010/08/everything-went-better-than-e...
These are orthogonal.
From the article:
> Quinn was the main champion of moving over to Cassandra, say our sources. Now the site is taking a huge hit, at least in the short term, because of that decision and/or how it was implemented, and Quinn is paying for it with his job.
It's always a toss-up whether the problem was the technology or how it was implemented. The correct course of action, of course, would have been to slowly move the site over to the new technology piece by piece rather than doing a wholesale switchover. The risk is in the migration strategy, not the technology picked. They could have been equally stupid switching over to a new architecture on MySQL.
By doing this, they are laying the blame on one person so the media can stop hating Digg and start hating the ex-VP of Engineering who "killed Digg".
Of course, whether he was actually responsible in some way is something we may never know. For all we know, he may have been completely against releasing v4 but was vetoed by Rose et al. Or, on the other hand, he may have overpromised and under-delivered, putting the company in jeopardy, in which case he deserves to be let go.
It's all speculation until we hear an official comment from either side.
It's a change of features/product rather than technology that is the problem.
They haven't said much about the details of why they are having trouble.
It could be a core Cassandra problem, something they added, or something completely unrelated to Cassandra; but the internet doesn't care. It's drama at its finest.
Digg made a big deal about their move to Cassandra (just as Digg's move to Cassandra was used to legitimize Cassandra, and by extension NoSQL, among a wide range of zealots), going back over a year.
The thing about talking big like that is that it often comes back to bite you in the ass if things don't go well.
If Digg quietly released a new version that worked more reliably and provided a better experience, they would have been in a perfect position to pontificate on technology.
For most users, the change in features is what drove them away, not necessarily the spotty QoS (though that was pretty bad for a while).
In theory, Java is great and Cassandra is great. In practice, Java under heavy load is a disaster, because it was never designed for it, and Cassandra is just hype and propaganda.
Face reality: it doesn't work in production as it's supposed to, as a primary storage engine.
People at Digg aren't amateur idiots, so I think they did everything as described in the docs, but the damn thing just doesn't work.
Google's heavy use of server-side Java would indicate otherwise.
> because it was never designed for it
Yes it was. Java has a lot of problems, but one thing that isn't a problem is heavy load.
> Cassandra is a just a hype and propaganda
Facebook seems to be using it ok.
A system that was designed to be isolated from the OS (let alone the hardware) will have a bottleneck at exactly that level.
Facebook doesn't use it as primary storage under high load.
Most languages used in web programming are divorced from the hardware, and either use a virtual machine (e.g. Java, .NET) or an interpreter (PHP, Python, Perl, Ruby).
Virtually no one uses a low-level language for web programming, and the benchmarks say that virtual machines are faster than interpreters. Java is very fast and efficient once it's running; initial start-up is often slower than with interpreted languages because of JIT compilation, but servers rarely "start up". If you're really hitting a barrier with Java's performance, then, just as with other non-native languages, you can write the performance-critical section in C.
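For what that escape hatch looks like in practice, here's the Python flavor of it (the Java equivalent would be JNI). The shared library and function are hypothetical; the pattern is to compile the hot loop as C, load it, declare the signature, and call it:

```python
import ctypes

# Hypothetical hot loop, compiled with something like:
#   cc -O2 -shared -fPIC -o libhotloop.so hotloop.c
# where hotloop.c defines: long sum_squares(const long *xs, long n);
lib = ctypes.CDLL("./libhotloop.so")
lib.sum_squares.argtypes = (ctypes.POINTER(ctypes.c_long), ctypes.c_long)
lib.sum_squares.restype = ctypes.c_long

def sum_squares(values):
    """Marshal a Python list into a C array and call the native code."""
    arr = (ctypes.c_long * len(values))(*values)
    return lib.sum_squares(arr, len(values))
```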