
Much faster incremental apt updates - edward
https://juliank.wordpress.com/2015/12/26/much-faster-incremental-apt-updates/?utm_source=twitterfeed&utm_medium=twitter
======
abrowne
Another (fairly) recent improvement for apt, but on the UI side, is the _apt_
command [1]. It "combines the most commonly used commands from apt-get and
apt-cache. The commands are the same as their apt-get/apt-cache counterparts
but with slightly different configuration options." So no need to remember
that it's _apt-cache search_, and it's a shorter command. (I know you can set
aliases, but this works by default, e.g. on a live USB or another user's
machine.) Bash completion didn't work at first, at least on Ubuntu, but as of
15.10 it's working.

[1]:
[https://mvogt.wordpress.com/2014/04/04/apt-1-0/](https://mvogt.wordpress.com/2014/04/04/apt-1-0/)

~~~
fistfuck
Ugh, can't even count how many times this happens:

apt search somepackage

 _sigh_

apt-cache search somepackage

~~~
malux85
then:

sigh

apt-cache search somepackage | grep '^somepackage'

------
orf
This is crazy, apt has been reading diffs a byte at a time? Think of the
_millions_ of hours that have been wasted due to this.

~~~
grandinj
No, they made the right choices. (1) Make it work. (2) Make it correct. (3) Make
it fast.

It can take a long time to get to step 3 with something as safety-critical as an
update tool.

~~~
rektide
Sometime between the start and the end, there's often a mandatory
burn-it-all-down step required to more fully satisfy either (2) or (3). Watching
hundreds of MB of data getting written to my HD for an apt-get update makes me
90% certain that Debian indeed needs to burn it all down.

To their credit, they started really early. And burning it all down seems to
be an increasingly better option for purely extrinsic reasons; the late-mover
advantage keeps increasing, and better and better options keep emerging. An
LMDB-based refactor, perhaps? This is probably a pipe dream, alas: Debian
developers must have a hundred different programs that all build on and extend
the original "make it work" technology, to keep it survivable. And that's really
my final counter to your cute linear notion of ongoing improvement: the mold is
set, and there are dozens of things bound up in what has happened.

~~~
julian-klode
Where do you want LMDB? apt has a super-fast binary cache; the repositories
and dpkg store their package lists as text files.

RPM systems have the advantage IIRC that they can basically just download a
bunch of sqlite3 databases, which means they do not have to parse any data.
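
For context, Debian's Packages indexes are RFC822-style text stanzas (key: value lines, stanzas separated by blank lines), so every client re-parses them before it can build that binary cache; a pre-built database skips exactly this step. A minimal sketch of the parsing work involved (illustrative only, not apt's actual parser, and it ignores continuation lines):

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Minimal sketch: parse one "Packages" stanza (key: value lines,
    // terminated by a blank line). Not apt's real parser.
    std::map<std::string, std::string> ParseStanza(std::istream &in) {
        std::map<std::string, std::string> fields;
        std::string line;
        while (std::getline(in, line) && !line.empty()) {
            std::string::size_type colon = line.find(':');
            if (colon == std::string::npos)
                continue;  // continuation lines etc. are skipped here
            std::string key = line.substr(0, colon);
            std::string value = line.substr(colon + 1);
            if (!value.empty() && value[0] == ' ')
                value.erase(0, 1);  // drop the space after "Key:"
            fields[key] = value;
        }
        return fields;
    }

    int main() {
        std::istringstream packages("Package: hello\nVersion: 2.10-1\nArchitecture: amd64\n\n");
        auto stanza = ParseStanza(packages);
        std::cout << stanza["Package"] << " " << stanza["Version"] << "\n";  // hello 2.10-1
    }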

~~~
hyc_symas
AFAIK RPM uses BerkeleyDB, though it will probably switch to LMDB
[https://bugzilla.redhat.com/show_bug.cgi?id=1086784](https://bugzilla.redhat.com/show_bug.cgi?id=1086784)

------
saurik
I run an ecosystem of tens of millions of "end users" who are working with a
GUI package manager I developed (Cydia) built on APT (using libapt). We stress
APT to its limit, with users often using very large numbers of repositories
(thirty is common, but I have seen well over a hundred) that are often hosted by
random people of varying skill at IT (and so tend to be slow or have errors in
their metadata; DNS errors and 200 OK HTML error pages abound).

We have so many non- and semi-skilled people using APT that if you Google
many APT errors, the primary people you come across discussing the error
condition are actually Cydia users ;P. Our package churn and update rate is
faster than Debian's, and we have run into all of the arbitrary limits in
various versions of APT (total number of packages known about, total number of
delete versions, total number of bytes in the cache): really, we use APT _a
lot_.

1) Despite APT supporting cumulative diffs (where the client gets a single
diff from the server to bring it up to date, rather than downloading a ton of
tiny incremental updates and applying them in sequence), Debian's core
repositories are not configured to generate these. I can tell you from
experience that providing cumulative diffs is seriously important.

So, while a 20x speed-up applying a diff is cool and all, users of Debian's
servers are doing this 20x more often than they need to, applying diff after
diff after diff to get the final file. This is an example of an optimization
at a low level that may or may not be useful, as the real issue is at a higher
level, in the algorithm design.
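
To make the difference concrete, here is a rough sketch (illustrative lambdas standing in for downloaded diffs, not apt's real code): a client that is several diffs behind either fetches and applies every patch in the chain, or fetches one pre-merged cumulative diff and applies it exactly once.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Treat a diff abstractly as "a function that transforms the index".
    using Diff = std::function<std::string(const std::string &)>;

    // Chained pdiffs: a client that is N diffs behind downloads and applies
    // all N of them, materialising every intermediate version of the index.
    std::string UpdateWithChainedDiffs(std::string index, const std::vector<Diff> &diffs) {
        for (const Diff &apply : diffs)
            index = apply(index);
        return index;
    }

    // Cumulative diff: the server pre-merges the chain, so the client
    // downloads and applies exactly one patch however far behind it is.
    std::string UpdateWithCumulativeDiff(const std::string &index, const Diff &cumulative) {
        return cumulative(index);
    }

    int main() {
        std::vector<Diff> chain = {
            [](const std::string &s) { return s + " +pkgA"; },
            [](const std::string &s) { return s + " +pkgB"; },
            [](const std::string &s) { return s + " +pkgC"; },
        };
        Diff cumulative = [](const std::string &s) { return s + " +pkgA +pkgB +pkgC"; };

        std::cout << UpdateWithChainedDiffs("index", chain) << "\n";        // three round trips
        std::cout << UpdateWithCumulativeDiff("index", cumulative) << "\n"; // one round trip
    }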

What is extra-confusing is that the most popular repository management tool,
reprepro, can build cumulative diffs automatically, and I think it does so by
default. Debian really should switch to using this feature: I keep seeing
Debian users complain on forums and blog posts that APT diff updates are dumb
as you end up downloading 30 files... no: the real issue is that Debian isn't
using its own tool well :(.

2) The #1 performance issue I deal with while using APT even on my server is
the amount of time it takes to build the cache file every time there is a
package update. It sits there showing you a percentage as it does this on the
console. On older iPhones this step was absolutely brutal. This step was
taking some of my users _minutes_, but again: that is the step I most notice
on my server.

I spent a week working on this years ago, and made drastic improvements. I
determined that most of the time was spent in "paper cuts": tiny memory
allocations and copies distributed throughout the entire project which, over
the course of running the code, hemorrhaged time.

The culprit (of course ;P) was std::string. As a 20-year user of C++ who spent
five years in the PC gaming industry, I hate std::string (and most of STL
really: std::map is downright idiotic... it allocates memory even if you never
put anything into the map, and I can tell you from writing my own C++
red-black tree tools that there is no good reason for this).

Sure, maybe APT is using C++11 by now and has a bunch of move constructors all
over the place that mitigate the issue somewhat (I haven't looked recently),
but it still feels "weirdly slow" to do this step on my insanely fast server
(where by all rights it should be instantaneous), and frankly: APT's C++ code,
when I was last seriously looking at the codebase, was abysmal. It was
essentially written against one of the very first available versions of C++ by
someone who didn't really know much about the language (meaning it uses all
the bad parts and none of the good; this happens when Java programmers try to
use C++98, for example, but APT is much, much worse) and has no rhyme or reason
to a lot of the design. It reminds me a little of WebKit's slapped-together
"hell of random classes and pointers that constantly leads to use-after-free
bugs".

Regardless, I rewrote almost every single usage of std::string in the update
path to use a bare pointer and a size, and to pass around fragments of what had
been memory-mapped from the original file whenever possible, without making any
copies. I got to be at least twice if not four times faster (I don't
remember). I made the code entirely unmaintainable while doing this, though,
and so I have never felt my patches were worth even trying to merge back
(though it also took me years to ever find the version control repository where
APT was developed anyway... ;P). To this day I ship some older version of APT
that I forked rather than updating to something newer, due to a combination of
this and the gratuitous-and-unnecessary ABI breakage in APT (they blame using
C++, but that isn't quite right: the primary culprit is their memory-mapped
cache format, and rather than use tricks when possible to maintain the ABI for
it they just break it with abandon; but even so, the C++ is buying me as a user
absolutely nothing: they should give me a thin C API to their C++ core).
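
For a flavor of what that kind of change looks like (a toy example with my own naming, not the actual patch): a field becomes a bare pointer plus a length into the already-mapped file, so scanning the package lists stops allocating, while the std::string version heap-allocates and copies every field it touches. Modern C++ would spell the view type std::string_view, but apt long predates that.

    #include <cstddef>
    #include <cstring>
    #include <iostream>
    #include <string>

    // Toy stand-in for the "bare pointer and a size" approach: a non-owning
    // view into memory that is already mapped from the Packages file.
    struct Fragment {
        const char *data;
        std::size_t size;

        bool Equals(const char *s) const {
            return std::strlen(s) == size && std::memcmp(data, s, size) == 0;
        }
    };

    // Copying version: every field pulled out of the mapped buffer becomes a
    // heap allocation plus a memcpy; these are the "paper cuts" that add up.
    std::string FieldByCopy(const char *begin, const char *end) {
        return std::string(begin, end);
    }

    // Non-copying version: just remember where the field lives in the mapping.
    Fragment FieldByView(const char *begin, const char *end) {
        return Fragment{begin, static_cast<std::size_t>(end - begin)};
    }

    int main() {
        // Pretend this buffer is the mmap'd Packages file.
        const char *mapped = "Package: hello";
        const char *value = mapped + 9;       // points at "hello"
        const char *value_end = mapped + 14;

        std::string copied = FieldByCopy(value, value_end);  // allocates
        Fragment viewed = FieldByView(value, value_end);      // does not

        std::cout << copied << " " << viewed.Equals("hello") << "\n";
    }

The point isn't this particular class; it's just that the hot path never copies data out of the mapping.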

If I were to do this again "for real" I would spend the time to build some
epic string class designed especially for APT, but I just haven't needed to do
this, as my problem is now "sort of solved well enough": I have almost never
cared about the new features that have been added to APT, and I have
backported the few major bug fixes I needed (and frankly have much better
error correction now in my copy, which is so unmaintainably drifted due to this
performance patch as to not be easily mergeable back :/ but we really, really
need APT to never just give up entirely or crash if a repository is corrupt,
and so those fixes are also critical for us in a way they aren't for Debian or
Ubuntu).

If anyone is curious what these miserable patches look like, here you go...
check out "tornado" in particular. (Patches are applied by my build system in
alphabetical order.) (Actually, I have been reading through my tornado patch,
and I did at some point while working on it build a tiny custom string class
to help abstract the fix, but I assuredly didn't do it well or anything. I
really only point any of this maintainability issue out at all, by
the way, as I don't want people to assume that performance fundamentally comes
at the price of unmaintainable implementations.)

[http://svn.telesphoreo.org/trunk/data/_apt7/](http://svn.telesphoreo.org/trunk/data/_apt7/)

~~~
zodiac
Is it possible to set up a mirror of Debian's core repositories that does use
cumulative diffs, so that end users can use that repository instead?

I'd be happy to try and configure/host such a thing if there are no
difficulties.

~~~
saurik
Someone just needs to talk to the Debian people about fixing this. Honestly, I
thought this was fixed two years ago, but a friend of mine (one who works in
the Cydia ecosystem, ironically) was complaining about diff updates a couple of
weeks ago, and it turns out it was not. I mostly outsourced "upstreaming to
distributions" (as opposed to upstreaming to projects) to a friend of mine who
now works for RightScale, and so I never really gained for myself the right
contacts in that bureaucracy, but if no one else does I should probably get
around to sending this suggestion to them...

~~~
sandGorgon
Just asking: have you gotten one of the Debian CTTE guys to file a top-level
proposal (just like the systemd proposal
[https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=727708](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=727708))?

I think you are one of the few people that truly understand the ecosystem and
probably are the right person to influence this.

Incidentally, I had a pretty interesting thread today on the question of
whether we will see a single package manager format on Linux. Wonder what
your thoughts are... and, in general, on the state of package management on
Linux. Would you build Cydia on something else today?

~~~
JoshTriplett
The technical committee isn't necessary unless you've already asked and been
told "no".

The ftpmasters and the apt maintainers would be the appropriate pair of
contacts.

------
IgorPartola
I am really surprised that something as frequently used as apt had these
obvious performance issues. Was there a technical reason for it? I noticed
that it ran painfully slow on devices with slower disks. I suppose that is
going to change now.

~~~
thrownaway2424
I doubt this is by any means the most important performance issue that Debian
package installation faces. It looks like the speedup discussed here is in the
no-update case, which is nice, but not the only case. I stopped using Debian
at home several years ago, and one of the main reasons was the incredible
slowness of groping around in GNOME's moronic XML settings database during
every update. Upgrading schemas and whatnot. I still use Debian at work and I
note that it is still prone to building and rebuilding the initrd during an
update when it should only be built once. On a recent update, the initrd was
built for the kernel that was being _uninstalled_, which was awesome.

~~~
kasabali
Consider using eatmydata. Living on the edge, but it is worth it.

[https://packages.debian.org/jessie/eatmydata](https://packages.debian.org/jessie/eatmydata)

------
legulere
> The reason for this is that our I/O is unbuffered, and we were reading one
> byte at a time in order to read lines. This changed on December 24, by
> adding read buffering for reading lines, vastly improving the performance of
> rred.

Why not use memory-mapped files and let the operating system deal with caching
in a way that is most efficient for that specific case?
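
For what that would look like (a rough POSIX-only sketch, not what rred actually does): map the file once and scan it in memory, letting the kernel's page cache do the buffering, instead of issuing a read() per byte.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <iostream>

    // Count lines by mapping the whole file and scanning it in memory.
    // The kernel pages data in as needed; no per-byte read() syscalls.
    long CountLinesMapped(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        if (st.st_size == 0) {  // mmap of a zero-length file is not allowed
            close(fd);
            return 0;
        }

        void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after the descriptor is closed
        if (map == MAP_FAILED)
            return -1;

        const char *data = static_cast<const char *>(map);
        long lines = 0;
        for (off_t i = 0; i < st.st_size; ++i)
            if (data[i] == '\n')
                ++lines;

        munmap(map, st.st_size);
        return lines;
    }

    int main(int argc, char **argv) {
        if (argc > 1)
            std::cout << CountLinesMapped(argv[1]) << "\n";
    }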

~~~
ars
I believe mmap is not fully portable without extra work, so they avoid it;
Debian runs on a LOT of different systems.

~~~
legulere
It was developed inside the original BSD and is part of POSIX, so it's
supported on all Unixes today. The only place it isn't supported is Windows.

~~~
icefox
It is everywhere, even Windows. There is a little bit of weirdness on some
older versions of Windows CE and some undefined behavior with edge conditions
on some OSes if I recall, but you can find the capability everywhere. I added
the map API to QFile back in the day and went through checking different OSes
for compatibility. From a quick Google search, here is a link to the Windows
method:
[https://github.com/radekp/qt/blob/master/src/corelib/io/qfsf...](https://github.com/radekp/qt/blob/master/src/corelib/io/qfsfileengine_win.cpp#L1948)

Unix version:
[https://github.com/radekp/qt/blob/master/src/corelib/io/qfsf...](https://github.com/radekp/qt/blob/master/src/corelib/io/qfsfileengine_unix.cpp#L1238)

------
aidos
This is great.

Even more so because after gem, or pip, or something (can't remember) had a
similar issue a while ago (I think they had an O(n²) algorithm), a lot of
people jumped on it as being bad computer science. There were all sorts of
calls about how web people were not real computer scientists.

Either way, good, useful products were made and they've been further optimised.
That's great. More of that, more of the time.

~~~
shadeless
You're probably thinking of this, rubygems was optimized using a linked list:
[https://news.ycombinator.com/item?id=9195847](https://news.ycombinator.com/item?id=9195847)

------
k_bx
It's great to see apt improving. I am still waiting for "apt-get update" to
become a git-pull-like thing with much faster updates (especially when no
recent changes were made), though.

