Good on Apple for fixing this quickly. Even the best team can let a bug like this slip through, and the best solution is a fast response. However, I'm confused by the blog post implying that the solution to unbricking the apps was somehow novel or praiseworthy. This is pretty much textbook for an auto-update system. You just bump the revision to force an application update; there shouldn't be any need to reinstall or muck with the data files. I'd honestly have been shocked if they hadn't been able to handle it just the way they did (which would have been noteworthy, but not in a good way).
>However, I'm confused by the blog post implying that the solution to unbricking the apps was somehow novel or praiseworthy.
In the 4 years of the App Store we've never seen a distribution problem of this magnitude (it would be an absolute nightmare for any devs affected) and we really have no idea how Apple would respond to such a problem.
We've also never seen Apple unilaterally update a specific group of apps like this. Many App Store devs likely didn't think Apple would or could take such action, and those of us familiar with what it takes for a device to accept and install new bits are now wondering just how this is being done.
Are they manually bumping each affected app's version string? Is there some hidden field that forces a device to reinstall the same version of an app? It's curious.
There's a difference between being curious about the exact mechanics and considering the activity itself novel. Apple manages the updates and signs app bundles that developers upload. They need to be able to redeploy packages and trigger updates just to deal with normal operational issues (e.g., malformed bundles, versioning mistakes, or reversions). Those are just the operating expectations of any system like this. There's nothing unusual about Apple being able to competently manage a relatively straightforward system that's such a big part of their business.
My theory: there's a version number that devs set on an app, and there's another version (say an integer starting from zero) that Apple puts on apps. They bumped their version, leaving the app's version intact.
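A rough sketch of that theory in Python. Everything here is hypothetical: `AppRecord`, `store_revision`, `needs_update`, and `republish` are made-up names for illustration, and nothing is known about how Apple actually implements this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppRecord:
    bundle_id: str
    dev_version: str     # version string set by the developer
    store_revision: int  # hypothetical counter owned by the store, starts at 0


def needs_update(installed: AppRecord, published: AppRecord) -> bool:
    """Update when the developer version changed OR the store revision was bumped."""
    if installed.bundle_id != published.bundle_id:
        raise ValueError("records describe different apps")
    return (
        installed.dev_version != published.dev_version
        or published.store_revision > installed.store_revision
    )


def republish(record: AppRecord) -> AppRecord:
    """Re-issue the same developer version with a bumped store revision."""
    return AppRecord(record.bundle_id, record.dev_version, record.store_revision + 1)
```

With something like this, `republish` leaves the developer's version string untouched, yet any device still holding the old store revision sees an update.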
I'd say that's a good theory, and if they baked it in from the beginning, it's a great solution to this sort of problem.
The fact that Apple is so strict about monotonically increasing dotted numbers for version strings led many of us to believe that they were being parsed for important things inside the App Store publishing platform.
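For anyone wondering what "monotonically increasing dotted numbers" amounts to, here is a minimal Python sketch of such a check. The exact rules Apple applies are not known to me; `parse_version` and `is_monotonic_bump` are invented names, and padding missing fields with zeros is an assumption.

```python
def parse_version(s: str) -> tuple[int, ...]:
    """Turn '1.10.2' into (1, 10, 2). Raises ValueError on non-numeric fields."""
    return tuple(int(part) for part in s.split("."))


def is_monotonic_bump(old: str, new: str) -> bool:
    """True if `new` is strictly greater than `old`, comparing field by field."""
    a, b = parse_version(old), parse_version(new)
    # Assumption: missing trailing fields count as zero, so 1.2 == 1.2.0.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return b > a
```

The point of comparing integer tuples rather than strings is that "1.10" correctly sorts after "1.9".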
But maybe they are just part of the UI. Apple is known to have strong opinions and strict adherence requirements about that, too.
Do they need to update any part of the app on the devices for this? I think it could be something as simple as "if a device checks for updates, check whether it (might have) gotten a faulty binary. If so, lie to it that there is a new version, and send it the version it already has, but use a correctly signed binary this time".
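In Python, that server-side idea might look roughly like the sketch below. This is only a guess at the shape of it: `Publication`, `check_for_update`, and the idea that the server knows when a device downloaded its copy are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Publication:
    version: str
    binary_id: str  # identifies the correctly signed binary
    # Hypothetical window during which a faulty binary was being served.
    bad_from: Optional[datetime] = None
    bad_until: Optional[datetime] = None


def check_for_update(pub: Publication, installed_version: str,
                     downloaded_at: datetime) -> Optional[str]:
    """Return the binary to send, or None if the device is up to date."""
    if installed_version != pub.version:
        return pub.binary_id  # ordinary update
    in_bad_window = (
        pub.bad_from is not None
        and pub.bad_until is not None
        and pub.bad_from <= downloaded_at <= pub.bad_until
    )
    if in_bad_window:
        # Same version, but the device might hold a faulty copy: resend it.
        return pub.binary_id
    return None
```

Erring on the side of caution just means widening the window; a device with a good copy only pays for one redundant download.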
Marco used the word "interesting", not "novel". Nevertheless, it is "novel" in the context of the App Store because, as far as we know, it has never been used before, and none of us know how it works yet or even knew that it could be done at all.
And it is "interesting". Apple did not bump the user-visible version strings (which are set by the developer), which would have been easy and expected, but would have caused some minor confusion for devs.
They didn't even bump the semi-invisible fourth field in the version string, which almost no devs use, and would have solved the problem quickly.
They also did not just mark transfers as having failed and requeue the downloads, which would have been fairly unsurprising.
They were able to determine which apps were corrupted, of all the updates in the affected period. This shouldn't have been too hard, but it shows that they are optimizing their solution pretty well.
They were apparently unable to determine who got corrupt copies of the affected apps and who did not, or maybe they are just erring on the side of caution here. Since their reupdate mechanism seems to work so well, the extra caution costs nothing.
I don't know how many kinds of catastrophic failures you've recovered from, but at App Store scale, recovery is often hard. They were either prepared for this sort of problem, or figured something out quickly that resolves the issue quite cleanly, taking care of all the details.
I guess this might seem impressive if you're unfamiliar with automated client software updates. But this really is the norm when you're packaging, signing, distributing, and updating software bundles. You're going to treat the bundle you get as relatively opaque and not rely on its data. First, it's not reliable to trust third-party data. Second, it's just easier because you own the metadata wrapper (for the signatures, etc.) and the update channel.
It's a special case of automated client software updates, though.
Most of us are probably more familiar with systems like Firefox, Chrome, or Sparkle. The App Store mechanism is more complicated, and interesting.
The third party data (version string) that you don't want to trust is, in this case, very carefully screened by Apple as part of the submission process. It is trustworthy by the time it lands at the App Store -- but since Apple owns the whole workflow, they have much better options.
Does anybody have any insight on how a process like this is debugged at Apple? I've heard, for example, that at Amazon there is a process of blame-finding after a major issue/outage like this, whereas at Google I hear things are more post-mortem: let's fix the process that led to the human error involved in the outage, rather than blame the person who made the faulty commit.
Anybody know how things work inside Apple's culture?
"that at Amazon there is a process of blame-finding"
Ex-Amazonian here. It's important to note that Amazon's Correction of Error (COE) process is not about blame. It is about determining what happened, why it happened, and what concrete steps are being taken so that it does not happen again. Individuals are not blamed as part of this process and that's in the official rules. The goal is to iterate and avoid making the same mistakes again.
I've heard from a lot of ex-Amazonians that in practice there's a lot of blaming as part of the process, at least in part because of the compensation/promotional processes. But maybe that's changed recently?
Of course, I've also heard that there is a wide diversity of culture between teams, so maybe that plays into it, too.
Totally happens, even though it is not supposed to. If a dev didn't outright break the rules then they shouldn't be blamed. The rules in this case would be something like "a peer must review before deploy" - if you bring down the site after breaking that rule, you're going to get fired most likely.
I've never had COEs come up in my or others' performance reviews.
If the boss who owns the COE "gets it" and has internalized the old-school Amazon culture then there won't be blame. The bosses really take the hits here, they do get personally blamed for these things. If they can't stand between the team and the more senior management then they are not doing it right.
If you and your boss mutually hate each other (which unfortunately I have seen) then it won't go well.
I'm sure it depends a lot on your team and your manager. Also, some people see a process like that and automatically assume blame is being assigned even if it isn't.
This is Apple culture at its best: No one ever talks about it. All employees are so loyal to this company that you can't even imagine one speaking about such insights.
Loyalty? More like fear. I'm not saying that we should be privy to the inner-workings of every company, but a lack of transparency is hardly a culture worthy of praise.
Why shouldn't it be? Just because people don't like it doesn't mean it is a bad thing... You don't have to put all your efforts and insights in blog posts!
>All employees are so loyal to this company that you can't even imagine one speaking about such insights.
I think it is more a fear of losing their job than loyalty. It is not uncommon to find that any leak will be tracked down and the leaker summarily fired.
There was an error with the FairPlay DRM signing process that led to a lot of app binaries becoming corrupt; people would go to download updates from the App Store, and after doing so would be completely unable to launch the app. Not even a splash screen or a display of the freeze-dried screenshot from multitasking. Half a second of black, then kicked back out to Springboard.
How did they fix/change the version string, or did they? I don't see how simply re-releasing a previously released version would cause affected users to update their binaries.
This snafu, the Galaxy Nexus thing and the fact that the simple update to my highly rated and heavily used iOS app has been "waiting for review" for ten days made me start Android Dev. Sorry Apple. You took too long. Hire more reviewers already.
Come on... Shit happens. And it was the first glitch in the App Store after 5 years of operations! They've sold 30 billion apps (and updated probably well over 200 billion).
Well, many, many things suck about the App Store and the submission process, but they are not that important. This one really ruined a lot of people's holiday, with tons and tons of angry 1-star reviews, and if I'm not mistaken, people lost local data. If it had been a little more widespread, it could've been a real fiasco.
Minor issue, but the App Store did not launch when the iPhone did. According to someone on Wikipedia, the App Store isn't quite 4 years old yet. Still an impressive record if it truly is the first issue they've had.