
Saving Data: Reducing the Size of App Updates by 65% - jor-el
http://android-developers.blogspot.com/2016/12/saving-data-reducing-the-size-of-app-updates-by-65-percent.html
======
cperciva
It's a bit humbling to think that my "quick hack" is yielding bandwidth
savings measured in _petabytes per day_.

~~~
TorKlingberg
I just tried bsdiff, and can I complain a little? I tried to diff two 1 GB
files, and within a few seconds bsdiff was using enough memory to lock up my
Ubuntu desktop. 10 minutes later I managed to get a text console and kill it.
Maybe I should blame Ubuntu for letting a single process hose the system, but
it was still annoying.

Edit: bsdiff is documented to use memory 17x the file size.

~~~
derefr
> Maybe I should blame Ubuntu for letting a single process hose the system

Speaking of: is there a good way to ask an OS to "reserve memory and CPU-time"
for emergency interactive maintenance (e.g. spawning a console on a thrashing
box), in a similar way to how ext2/3/4 "reserves space" for the root user for
emergency maintenance?

The only real way I can think of to do this (that doesn't involve already
knowing what processes could thrash and cgroup'ing them) is to run a
hypervisor and treat the dom0 as the emergency console. That still doesn't let
you get _inside_ the thrashing instance to fix it, though.

~~~
chousuke
If you use systemd you can assign services to slices which can be assigned
cgroup resource reservations / limits. I don't think any distro does too much
with this out of the box yet (IIRC by default all services run under a single
"system" slice, and user sessions under a user slice), but it should be
possible to run an "emergency environment" slice in which the processes are
always guaranteed some small amount of resources necessary to recover.
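
For illustration, a minimal sketch of what that might look like (unit names and values are invented, and exact directive names vary across systemd versions):

    # emergency.slice -- hypothetical slice unit
    [Slice]
    # Guarantee a generous share of CPU under contention
    # (CPUShares= is the cgroup-v1 name; newer systemd uses CPUWeight=).
    CPUShares=1024

    # rescue-getty.service -- an emergency console pinned to that slice
    [Unit]
    Description=Emergency console on tty9

    [Service]
    Slice=emergency.slice
    ExecStart=/sbin/agetty tty9 38400 linux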

~~~
JdeBP
The LinkedIn people report that it's not that simple.

* [https://engineering.linkedin.com/blog/2016/08/don_t-let-linu...](https://engineering.linkedin.com/blog/2016/08/don_t-let-linux-control-groups-uncontrolled)

------
i336_
I'm reminded of Courgette: [https://www.chromium.org/developers/design-
documents/softwar...](https://www.chromium.org/developers/design-
documents/software-updates-courgette)

It's used to deliver unbelievably small Chromium OS updates.

To quote the link above:

      Full update:      10,385,920
      bsdiff update:    704,512
      Courgette update: 78,848

The reason it's not being used here is that Courgette's disassembler/symbol
resolver only works on x86/x86_64/ARM machine code.

But it got me thinking - how much bulk does binary code take up in an Android
APK? If it's noteworthy, a JVM disassembler could produce really interesting
results.

If it's mostly images and sounds, then things are already pretty much as
optimized as they can get. (I have no idea myself.)

~~~
KMag
> a JVM disassembler

Note that Dalvik bytecode is very very different from the Java bytecode read
by JVMs. For one, Dalvik is a register machine (like LuaJIT's internal
bytecode) and the JVM is a stack machine (like Python bytecode).
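
Since Python bytecode is mentioned, CPython's own disassembler makes the stack-vs-register contrast easy to see (the annotations and the Dalvik line are illustrative):

    import dis

    def add(a, b):
        return a + b

    dis.dis(add)
    # On CPython 3.x this prints something like:
    #     LOAD_FAST    a     (push a onto the value stack)
    #     LOAD_FAST    b     (push b)
    #     BINARY_ADD         (pop both, push a + b)
    #     RETURN_VALUE
    #
    # A register machine such as Dalvik expresses the same addition as
    # one instruction that names its operands directly, roughly:
    #     add-int v0, v1, v2   ; v0 = v1 + v2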

~~~
i336_
Oooh, that might be one of the reasons why LuaJIT is so fast.

I did some reading about Forth some time ago, and learned that stack-based
interpreters carry a small performance penalty on modern CPUs. I understand
the impact is almost negligible, but it's there. (Forth is stack-based.)

~~~
masklinn
There are many reasons why LuaJIT is false, Mike Pall has talked pretty
extensively about it (e.g. [http://lambda-the-
ultimate.org/node/3851](http://lambda-the-ultimate.org/node/3851)).

I don't recall the register-based design coming up as a first-order reason,
though it may have allowed optimisations which a stack-based machine would not
(e.g. the JIT doing register hinting and renaming).

~~~
i336_
What do you mean by "LuaJIT is false"?

Thanks heaps for that link though, reading through that thread has been really
interesting.

~~~
masklinn
> What do you mean by "LuaJIT is false"?

Meant to write "LuaJIT is fast", but was probably thinking about something
else and the fingers went haywire, sorry about that.

> Thanks heaps for that link though, reading through that thread has been
> really interesting.

Pall has written in other venues as well (mostly LuaJIT mailing lists, though
I think I remember comments of his on reddit or HN, but I might be mistaken);
his comments are always interesting reads.

~~~
i336_
Ah, I've had the same kinds of finger-"autocorrect" issues myself. Thanks!

I've been trying to find more info about Mike, he seems to be a really
private/mysterious type. What he writes about is definitely interesting
though.

------
kolistivra
Disclaimer: I used to work at Google on exactly this project.

I'm so happy this finally made it to daylight! =) I was the one who originally
researched the feasibility of this, but at the time (3.5 years ago) the
re-compression burden made it a no-go even for the most modern phones, so we
decided to table it.

~~~
schiffern
Possibly dumb question: why doesn't the hash verification take place on the
_uncompressed_ contents, moving recompression from the updating-critical-path
to the nice-to-have-later-to-save-disk-space category?

My dream updater would just be a bunch of stream processors, so that
downloading, decompression, disassembly (a la Courgette), patching and
recompression all happen at once, as the data packets come in (thus utilizing
both CPU and IO). If done right, an update shouldn't take much longer to
complete than the download step, no?
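
A toy sketch of that pipeline in Python generators, where decompression overlaps with the arriving "download" (the patch stage is a placeholder; bsdiff proper needs random access to the old file, so a real pipeline would need a streaming-friendly delta format):

    import zlib

    def decompress_stream(packets):
        # Streaming DEFLATE: emit plaintext as each packet arrives,
        # so CPU work overlaps with network I/O.
        d = zlib.decompressobj()
        for packet in packets:
            out = d.decompress(packet)
            if out:
                yield out
        tail = d.flush()
        if tail:
            yield tail

    def patch_stream(blocks, apply_delta):
        # Placeholder patch stage; apply_delta is a hypothetical
        # streaming patcher.
        for block in blocks:
            yield apply_delta(block)

    # The stages compose; nothing is ever buffered in full.
    data = zlib.compress(b"example payload " * 256)
    packets = (data[i:i + 512] for i in range(0, len(data), 512))
    for block in patch_stream(decompress_stream(packets), lambda b: b):
        pass  # write to disk / hash / recompress here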

~~~
fulafel
Many (most?) Android phones are chronically short of flash space; even in the
status quo, storage shortage often prevents applying app updates. There is
really no room to make storage use more lax.

------
pja
Instead of faffing about re-compressing all the App data in order to compare
signatures, wouldn't it be simpler to store two signatures alongside each App
in the first place: one for the compressed version & one for the uncompressed
one?

Then computing the signature after using the compressed diffs to upgrade an
existing App in place would just require walking over the upgraded files &
comparing the hash to the previously computed one. No CPU intensive re-
compression required.

(Clearly you’d have to sign the diffs in some way to prevent bad actors
injecting data into the system, but that’s a separate problem to the 'does
this data match what the App developer sent us in the first place' question
which Google is currently solving by re-compressing everything at great CPU
expense.)
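
For what it's worth, a canonical digest over the uncompressed entries is straightforward; a minimal sketch in Python, hashing names and contents in sorted order so the result is independent of deflate settings (a real scheme would sign this digest rather than just store it):

    import hashlib
    import zipfile

    def uncompressed_digest(apk_path):
        # Hash every entry's name and *uncompressed* bytes in a
        # canonical (sorted) order, so two archives with identical
        # contents match even if their deflate settings differ.
        h = hashlib.sha256()
        with zipfile.ZipFile(apk_path) as z:
            for name in sorted(z.namelist()):
                h.update(name.encode("utf-8"))
                h.update(z.read(name))
        return h.hexdigest()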

------
r1ch
I really wish Google would let us download pre-compiled native blobs on fast
WiFi connections. Downloading is by far the fastest part of app installation /
update for me; the time spent recompiling the bytecode can run to minutes for
some apps.

Even worse is the forced-recompilation of every single app for the tiniest
system patches, which can take upwards of an hour.

~~~
EddieRingle
That's no longer the case with Nougat. Apps now run with an interpreter plus a
JIT compiler, and a profile is collected that is used to perform optimized AOT
compilation for subsequent executions while the device is charging and unused.

~~~
r1ch
Does this have to be turned on somewhere? I updated to Nougat last night and
it still recompiled all my apps on the first boot. I didn't notice any
improvements while updating apps from the play store either.

~~~
Mindless2112
The upgrade from Marshmallow to Nougat should be the last time you see
"Optimizing Apps".

------
legulere
So they are recompressing the files after applying the patches and hoping the
result is byte-identical? Wouldn't it make more sense to work on having the
signature work on uncompressed files?

Also, it's kind of shocking that almost everyone is using deflate with
compression level 6 or 9. Why is 6 even used? Why does almost nobody use
zopfli, which achieves 3–8% higher compression rates? Why aren't other
compression formats like lzma, brotli, etc. supported?
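
(Level 6 is zlib's default, which probably explains its prevalence.) A quick way to see the level trade-off with Python's zlib; the sample data is made up:

    import zlib

    # Synthetic but compressible sample data.
    data = b"some repetitive asset data " * 4096

    for level in (1, 6, 9):
        print(level, len(zlib.compress(data, level)))
    # Level 6 is zlib's default speed/ratio trade-off; level 9 costs
    # more CPU for a usually-small gain, and zopfli spends far more
    # CPU again for another few percent.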

~~~
vog
_> Wouldn't it make more sense to work on having the signature work on
uncompressed files?_

No! That would create an additional security risk, in addition to potential
performance issues.

Ideally, unchecked data should _only ever_ hit the signature check and nothing
else. Otherwise, it could also attack the decompression, not just through
security holes, but also through denial of service (by injecting something
that explodes in size on decompression).

This is a general principle! You should neither decompress, nor decrypt, nor
unpack/untar, nor parse (XML/JSON/...), nor anything else before the signature
check. See also:

"Preauthenticated decryption considered harmful"

[http://www.tedunangst.com/flak/post/preauthenticated-
decrypt...](http://www.tedunangst.com/flak/post/preauthenticated-decryption-
considered-harmful)

That said, I agree it makes sense to have an _additional_ check after
decompression. That check usually happens in most decompression tools anyway,
through fast, lightweight checksums. Again, its point is to detect bugs in the
decompressor, not to protect against malicious tampering; the tamper check
should always come before complex stuff like decompression.

~~~
Benjamin_Dobell
_> Otherwise, it could also attack the decompression, not just through
security holes, but also through denial of service (by injecting something
that explodes in size on decompression)._

 _> This is a general principle! You should neither decompress, nor decrypt,
nor unpack/untar, nor parse (XML/JSON/...), nor anything else before the
signature check._

Google have stated that they're going to be decompressing, patching _and_
recompressing data in an attempt to recreate the original data stream - simply
to perform a signature check.

So, whilst minimising the attack surface is a great goal in general, your
advice is not relevant here. They're already vulnerable to having their
decompression, patching and compression software attacked.

~~~
leni536
Your old apk is already signed. I guess it would make sense to sign the patch
too (if they don't do it already).

------
pjc50
Now if someone could figure out how to reduce the size of installed apps. And
preferably explain why most Android apps seem to be about 60MB even for
trivial things.

~~~
irishbro
I have worked on a few mobile games where there was a big developer push to
reduce the app size, but it was shot down by management in favour of adding
more new features instead. A lot of features had performance targets to hit
before they got the go-ahead, and it was hard for app-size reduction to make a
sizeable difference to KPIs until size reached critical levels.

I think there is also a certain amount of developer apathy towards application
size. I was recently talking to a colleague who had worked on a new banking
app, and they saw no problem with developing what was essentially a web front
end with the Unity game engine. When I asked if he was concerned about the
effect bundling a game engine would have on the app size, he said it was what
they were used to, and that was more important.

~~~
iainmerrick
A banking app in Unity??!?

Come to think of it, my banking app takes forever to start up, I wonder
why...! I've never bothered to check how big it is, though, which possibly
proves your colleague's point. :(

~~~
kbenson
Well, if they are a C# shop, that might be a really easy way to get their app
on phones: they get to reuse code that might already exist, while not having
to deal with phone-specific toolkits.

------
rohan1024
I actually did exactly the same thing over a year ago. I needed to download
CyanogenMod nightlies quite frequently, which was not feasible at the time as
I was on limited data. So I would download the nightly on my DigitalOcean
droplet and calculate the patch over there. Patch sizes were just around
12 MB. I would then download the patch file and patch the old nightly on my
local system. Voila! Phone upgraded by downloading just 12 MB! I did wonder
why Google couldn't do the same thing with Android applications. :)

------
adrianN
Slightly offtopic: What algorithm does git use for binary diffs in packfiles?
Is it bsdiff, or the improved algorithm from Colin's thesis, or something
else?

~~~
niklasrde
As is often the case, SO has an answer:
[http://stackoverflow.com/questions/9478023/is-the-git-
binary...](http://stackoverflow.com/questions/9478023/is-the-git-binary-diff-
algorithm-delta-storage-standardized)

------
_delirium
> _Disclaimer: if you see different patch sizes when you press "update"
> manually, that is because we are not currently using File-by-file for
> interactive updates, only those done in the background._

Out of curiosity, does anybody know why that would be the case? I had assumed
automatic app updates used exactly the same code paths / APIs as manually
pressing "update" in the Google Play app did, just triggered automatically.
But it sounds like they actually go via a different route?

~~~
vanderZwan
The article implies they prioritise installation _time_ for manual installs
(which sounds sensible to me, since waiting time is the biggest source of
frustration for users, and decompression is the bottleneck there, especially
on older devices).

------
blaze33
The code is open sourced at: [https://github.com/andrewhayden/archive-
patcher#table-of-con...](https://github.com/andrewhayden/archive-
patcher#table-of-contents)

------
mirekrusin
This kind of problem could be made public.

A small or big company, government or organisation could define the problem,
give test input, expected output and performance criteria (delta size, CPU
usage etc.); submissions open to all.

Similar to programming competitions, but instead of made-up problems there
would be a real-world problem to be solved.

This could be good for everybody: for companies, because they'd get (I hope)
the best solutions (there could be some prize for top solutions, still a
fraction of the in-house R&D cost); for people, because they could be hired,
or at least would have a chance to test their skills on real problems; it
could also be used as part of the hiring process to skip the early nonsense,
etc.

The biggest win, IMHO, would be to have government organisations getting
involved in this kind of "Open R&D" - open for contributions and also open to
new, not yet articulated ideas.

~~~
jerf
It already is public. That's why Google is using bsdiff, an algorithm
developed for FreeBSD, which they are legally able to use because of the way
it is licensed. We don't need some sort of central clearinghouse of problem
instances.

bsdiff predates the entire concept of Android by many years. To the extent
that Google wasn't using it from day 1, it isn't because the solution didn't
exist somewhere. (It sounds like it was due to problems that would have
existed for any clever solution, and they deliberately chose to ship
completely un-diffed updates.)

There's actually a profound lesson about how open source really works here.
And I'm not really saying this as an open source evangelist... it's more an
observation that our instinctive (and probably genetic) bias in favor of
centralization is not always an accurate model of how the world either _does_
or _should_ work. There's not much some sort of centralized contest could do
in the present that wasn't already done in the real world over a decade ago
without it.

------
krzrak
On a side note: I "love" it when I have to download a 100+ MB update of a
mobile app whose only changelog description is "minor fixes".

~~~
paradite
Or worse, opening up the app on the train just to be greeted with "This
version is outdated. Please update to continue using."

~~~
slig
It's great when it happens with Uber and you're outside, with a crappy 3G
connection.

------
amluto
Why recompress at all?

Rather than using .zip, an improved APK format could be more like a tarball.
You'd tar up the files, sign it, and compress the result.

To verify, you decompress (streamily, without saving the result anywhere) in a
tight sandbox and verify the signature on the output. Then you decompress
again to install. Call it APK v3.

This adds a second pass at decompression, but decompression is generally quite
fast. In exchange, you avoid a much slower recompression step.
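
A rough sketch of that two-pass flow (a plain SHA-256 digest stands in for the signature check, and install_from_tar_stream is a hypothetical installer):

    import gzip
    import hashlib

    def verify_then_install(path, expected_digest):
        # Pass 1: stream-decompress and hash the tar bytes without
        # writing them anywhere (the sandboxed verification pass).
        h = hashlib.sha256()
        with gzip.open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 16), b""):
                h.update(block)
        if h.hexdigest() != expected_digest:
            raise ValueError("signature mismatch")
        # Pass 2: decompress again, this time actually unpacking.
        with gzip.open(path, "rb") as f:
            install_from_tar_stream(f)  # hypothetical installer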

~~~
iainmerrick
Yes, it definitely seems like fixing the flaws in the original APK design
would be a win! Just signing the uncompressed data, as many others here have
pointed out, would allow most of your suggestion to be implemented.

I wonder if it's a sign of communication problems between the Android and
Google Play teams? The Play team seems to have spent years bending over
backwards to build this really awkward solution, which could have been done a
lot better with some low-level Android changes (i.e. a new APK format).

~~~
amluto
I bet there's a valid historical reason for the current design: APK looks a
lot like JAR and Android is kind-of-sort-of Java. Java random-accesses its
JARs, so the awkward format makes sense in the context. Android may, too, to a
limited extent, but Android has no concept of running an APK in place, so it
doesn't need this capability.
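
That random access comes from zip's central directory; in Python terms (app.apk is a hypothetical file):

    import zipfile

    # Zip's central directory lets you seek straight to one entry:
    with zipfile.ZipFile("app.apk") as z:   # hypothetical file
        dex = z.read("classes.dex")         # no full-archive scan needed
    # A tar stream, by contrast, must be read sequentially from the
    # start; fine for install-once, bad for running in place.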

~~~
iainmerrick
Well, exactly, both APKs and JARs are just zipfiles with some extra
conventions over format and contents.

Why not update the format? Android has been around long enough to make the
advantages and disadvantages pretty clear.

------
SnaKeZ
Enabled only for automatic updates because it's CPU intensive.

~~~
pmontra
Auto update is the default and I'm sure it's what most people are using. I
disabled it many years ago to prevent apps from abusively adding permissions
and to avoid crashes. It was pretty common for new versions not to work
properly on many devices; I remember checking the reviews before updating.
That hasn't happened for at least a couple of years now, and Android 6 solved
the issue with the permissions, but I didn't lose the habit. I'll download the
full update. Furthermore, I keep the phone in airplane mode at night. I want
to sleep.

~~~
Nullabillity
Before Marshmallow it would refuse to auto-update apps if they added a
permission. It would just show you a notification about the new permission and
ask whether to accept or decline the update.

~~~
pmontra
So I did it because of the crashes, and I must have invented the other memory.
Thanks.

------
esturk
This is great for games like Hearthstone, which currently requires a
re-download of the whole 2 GB app whenever there's an update.

~~~
masklinn
> This is great for games like Hearthstone

Nope.

Hearthstone is a Unity game, and the issue (for mobile packaging at least) is
that Unity creates a giant bundle of all the assets, so it's a single binary
file which tends to change greatly from one release to the next; even with
per-file diffing you're not getting any improvement [0].

Other developers get around that by downloading assets separately from the
main application bundle, à la Supercell. I don't know that Supercell uses
Unity, but they do follow that pattern: an application bundle of code &
trivial assets, with most of the assets downloaded at the loading screen.

Checking on my phone, Hearthstone is ~2GB with 10.5MB of "documents and data"
(= a ~2GB base app bundle), Clash Royale is ~450MB with 336MB of data (= a
~100MB base app bundle), and Jetpack Joyride is similar with ~330MB split
between a ~90MB base bundle and ~250MB of "documents and data".

The default Unity approach makes sense for rarely updated applications
(Monument Valley or République); it's an absolute bear for dynamic, frequently
updated online games like Hearthstone.

[0] IIRC per-file diffing is what Apple does on the App Store; it does not
work at all for HS.

~~~
m12k
Unity has asset bundles precisely for use cases like streaming and downloading
of assets:

[https://docs.unity3d.com/Manual/AssetBundlesIntro.html](https://docs.unity3d.com/Manual/AssetBundlesIntro.html)

Blizzard just doesn't seem to be using them...

------
jfroma
Even before bsdiff and this new file-by-file patching mechanism, I found
Android updates much faster than iOS. I don't know if iOS updates are just
bigger because applications are bigger, or if it's just that Apple has a worse
CDN (I live in Argentina).

------
voidlogic
TL;DR? Compress diffs, don't diff compressed things?

------
hawski
Could the rsync algorithm be used to prepare binary diffs? How would it
compare to bsdiff?

EDIT:

Thanks for the downvote.

There is a tool that uses the rsync algorithm for binary diffs; it's called
rdiff [1]. I also found a Master's thesis, "Binary Differencing for Media
Files" by Vladimir Komsiyski [2]. What I take from it is that rdiff is fast,
but bsdiff produces smaller patches.

Excerpts:

> _bsdiff_ is marginally better than xdelta as far as compression is
> concerned. _rdiff_ and especially edelta generate much larger patches. The
> upload gains from the viewpoints of the server are exactly the same as the
> compression rates and the saved disk space is proportional to it. For
> example, an expected number of versions per file of 3.7 (which is the case
> in Dataset A) and differencing with xdelta result in 70% less required disk
> space.

> As expected, _bsdiff_ ’s slow differencing time results in 3 times slower
> upload from the viewpoint of the client when compared to the server’s
> experienced time. Apart from that, the added time due to the differencing is
> smaller compared to the transfer of the produced patch and lead to up to 21
> times faster transfer than when no differencing is done. Even the worst
> performing tool edelta reaches a speed-up with a factor of 3. xdelta,
> showing the best results out of the four tools, achieves a 21x speedup when
> transferring a file to the server. The second best tool, _bsdiff_ , is only
> 9 times faster. However, its coefficient may dramatically increase with
> increasing the file size (Adobe Photoshop files) because of the non-linear
> runtime of this tool. edelta is severely outperformed by all other tools.
> The fastest tool _rdiff_ loses its lead due to the bigger patches it
> produces.

> For Adobe Documents (and most likely other image editing software files) the
> binary differencing tool xdelta shows the best run-time for the compression
> it achieves. _rdiff_ is faster, but its patches are bigger. _bsdiff_ shows
> better compression, but its non-linear runtime makes it unusable for files
> of this size. The benefits of xdelta are considerable - decrease of the file
> size with more than 96% for Adobe InDesign files, causing up to 30 times
> faster file transfer. Adobe Photoshop documents also show substantial gain
> despite their much larger size.

[1]
[https://en.wikipedia.org/wiki/Rsync#Variations](https://en.wikipedia.org/wiki/Rsync#Variations)
[https://linux.die.net/man/1/rdiff](https://linux.die.net/man/1/rdiff)

[2]
[http://scriptiesonline.uba.uva.nl/document/490827](http://scriptiesonline.uba.uva.nl/document/490827)

------
lultimouomo
I wonder how this deals with renamed files.

If it doesn't, one should avoid renaming big assets when working on updates of
an already released app.

------
samsk
The question is: why did this take so long? 8 years to figure out how to do
differential updates...

~~~
cgvgffyv
Read the thread.

[https://news.ycombinator.com/item?id=13121903](https://news.ycombinator.com/item?id=13121903)

------
franciscop
Is this basically something like git? Isn't it crazy that something like this
hasn't been implemented until today?

~~~
StreamBright
I am not sure what you mean by git. This is binary diffing using bsdiff,
applied at the individual file level instead of to the whole package (a single
file).

------
mkj
Nice work there but who really talks like this?

"in order to provide users with great content, improve security, and enhance
the overall user experience"

~~~
MichaelBurge
The author is listed right above that: Posted by Andrew Hayden, Software
Engineer on Google Play.

------
ekux44
The author of this post now works at Amazon!

[https://www.linkedin.com/in/andrewhayden](https://www.linkedin.com/in/andrewhayden)

~~~
iainmerrick
They deserve serious kudos for managing to finish and ship their product at
Google first. (Another commenter here said the project started 3.5 years ago,
wow)

------
Aaargh20318
If only they spent this much effort on getting Android OS updates to
consumers.

------
aembleton
How does this differ from the delta updates that came out over four years ago?

[http://www.androidpolice.com/2012/08/16/google-flips-the-
swi...](http://www.androidpolice.com/2012/08/16/google-flips-the-switch-on-
smart-app-updates-in-the-play-store-video/)

~~~
danielsamuels
Read the article before commenting?

~~~
aembleton
Which part of the article covered the 2012 announcement? There was the piece
near the top: `earlier this year, we announced that we started using the
bsdiff algorithm (by Colin Percival).`

But "earlier this year" would be 2016, not 2012, so I assumed that this was
something else.

~~~
danielsamuels
And clicking on the link in your quoted text would give you this:

> Google Play has used delta algorithms since 2012, and we recently rolled out
> an additional delta algorithm, bsdiff

