Much faster incremental apt updates (juliank.wordpress.com)
172 points by edward on Dec 27, 2015 | 57 comments



Another (fairly) recent improvement for apt, but on the UI side, is the apt command [1]. It "combines the most commonly used commands from apt-get and apt-cache. The commands are the same as their apt-get/apt-cache counterparts but with slightly different configuration options." So no need to remember that it's apt-cache search, and it's a shorter command. (I know you can set aliases, but this works by default, so on a live USB or another user's machine.) Bash completion didn't work at first, at least on Ubuntu, but as of 15.10 it's working.

[1]: https://mvogt.wordpress.com/2014/04/04/apt-1-0/


This does the same thing, I believe, if you'd like to try something like this right now:

https://pypi.python.org/pypi/apt-wrapper

It also runs sudo for you when needed.

The "wrapper" part is only in the package name, because "apt" was already taken on PyPI.


This is good! I'm mostly homebrew on Mac these days, but every once in a while I find myself on an apt-based Linux system, and I've always been blown away by how obscure and unintuitive apt is.


It's good, but it does not provide tab completion in bash yet.


It does in Ubuntu starting with 15.10 (as I mentioned). I don't know about Debian.


Ugh, can't even count how many times this happens:

    apt search somepackage

sigh

    apt-cache search somepackage


then:

sigh

    apt-cache search somepackage | grep '^somepackage'


    apt-cache policy <somepackage>
does that work better?


wajig(1) is your unifying interface friend in this space.


This is crazy, apt has been reading diffs a byte at a time? Think of the millions of hours that have been wasted due to this.


That's unfair. Pick any big program and you'll find examples of really dumb inefficient code. We all write it, no matter how smart we think we are.

Instead, think of all the millions of satisfied users of the software, none of whom bothered to fix up this problem until now.


I wouldn't even call it dumb, it's just what happens when a big group of people are writing software together. Somebody writes a function, other people see it and start using it in ways the original author didn't intend.

One byte at a time could've been a quick hack that was "good enough" in the original author's use case, because he wasn't reading a lot of data or was only calling it once or something. Somebody else saw "readline", decided it was exactly what he needed, and used it everywhere.


OT, but that's one reason comments are important. Also, in the alternative hypothetical where reading a byte at a time was required: if that's not documented, someone might come along and think that was merely incidental or even buggy behavior and change it. That's why I'm not sympathetic to programmers who think they write "self-documenting" code or something. If you don't document (even briefly) what your code is supposed to do, then bugs become diffuse things that no one is responsible for, and which are difficult to spot. </ramble>


I profoundly disagree with that. Comments always end up not following the actual implementation. In this example, "self documenting" the code would be to name the function "unbufferedRead" or something similar that shows that not having a buffer is part of the contract. You make it buffered, you're breaking the contract. Anything else is merely applying temporary (because they'll rot), hidden (because if someone is reading code that calls myVeryWellDocumentedFunction, they won't see the documentation, especially if the function is just incompletely named) and inaccurate (because the only accurate thing is the code) band aids. There's a reason it's said that naming stuff is one of the hardest things in programming.


I disagree. Even with the name "ReadOneUnbufferedByteAtATime", the code doesn't (and can't) explain why the author is doing it that way instead of reading multiple bytes at once. If that decision is important then there needs to be a comment explaining it.

In my experience, "self documenting" code is brittle and hard to change because nobody knows (or remembers) which implementation details are important and which aren't, so they're afraid to change it. Is it reading a byte at a time for a reason? Or does it not matter and it just happens to be done that way?

There's more to software development than just churning out code, and being disciplined about keeping comments up to date is one of those things that just needs to be done, IMO.


I agree that naming things well is a great way to document them. However I think that in most of the codebases I've worked on it is quite common for a function/method to do many things some of which are incidental and some necessary, and it would be difficult to include the full set of these in the method's name. (comments also help to make it obvious when a function is, perhaps, too complex and arbitrary).

In addition to good naming, strong static types and functional purity help a great deal (but I won't try to evangelize haskell here :))

> ...hidden (because if someone is reading code that calls myVeryWellDocumentedFunction, they won't see the documentation, especially if the function is just incompletely named)

What is your proposed alternative? Ideally a function would do the only possible sensible thing imaginable, but that is frequently not attainable.

> ...inaccurate (because the only accurate thing is the code)

Well when the code doesn't match the spec in the comments, we've found a bug in the code, or a bug in the spec. Either way we can find someone to blame. This is vastly better than finding a bug and not knowing whether the function or the caller should be fixed.


No, they made the right choices. (1) Make it work. (2) Make it correct. (3) Make it fast.

It can take a long time to get to step 3 with something as safety-critical as an update tool.


Sometime between the start and the end, there's often a mandatory burn it all down step required to more fully satisfy either (2) or (3). Watching hundreds of MB of data getting written to my HD for an apt-get update makes me 90% certain that Debian indeed needs to burn it all down.

To their credit, they started really early. And burning it all down seems to be an increasingly better option for purely extrinsic reasons; the late mover advantage keeps increasing, and better and better options keep emerging. A LMDB-based refactor perhaps? This is probably a pipe dream, alas: Debian developers must have a hundred different programs that all build and extend forward the original "make it work" technology, to make it survivable. And that's really my final counter to your cute linear notion of ongoing improvement: the mold is set, and there are dozens of things bound up in what has happened.


Where do you want LMDB? apt has a super fast binary cache, the repositories and dpkg store their package lists as text files.

RPM systems have the advantage IIRC that they can basically just download a bunch of sqlite3 databases, which means they do not have to parse any data.


AFAIK RPM uses BerkeleyDB, though it will probably switch to LMDB https://bugzilla.redhat.com/show_bug.cgi?id=1086784


> implying apt is a young tool and the maintainers didn't have many years to go from 1 to 3


According to the developer, the bug was present only in the 1.1 series, which is just a few months old (it was pre-released at DebConf 2015, and is not in Jessie if I'm not mistaken).


Furthermore, pdiffs were introduced in ~2006: https://www.debian-administration.org/article/439/Avoiding_s...


I run an ecosystem of tens of millions of "end users" who are working with a GUI package manager I developed (Cydia) built on APT (using libapt). We stress APT to its limits, with users often using very large numbers of repositories (thirty is common, but I have seen well over a hundred) that are often hosted by random people of varying skill at IT (and so tend to be slow or have errors in their metadata; DNS errors and 200 OK HTML error pages abound).

We have so many non- and semi-skilled people using APT that if you Google many APT errors, you actually tend to come across people using Cydia as the primary people discussing the error condition ;P. Our package churn and update rate is faster than Debian's, and we have run into all of the arbitrary limits in various versions of APT (total number of packages known about, total number of deleted versions, total number of bytes in the cache): really, we use APT a lot.

1) Despite APT supporting cumulative diffs (where the client gets a single diff from the server to bring them up to date rather than downloading a ton of tiny incremental updates and applying them in sequence), Debian's core repositories are not configured to generate these. I can tell you from experience that providing cumulative diffs is seriously important.

So, while a 20x speed-up applying a diff is cool and all, users of Debian's servers are doing this 20x more often than they need to, applying diff after diff after diff to get the final file. This is an example of a low-level optimization that may or may not be useful, because the real issue is at a higher level, in the algorithm design.

What is extra-confusing is that the most popular repository management tool, reprepro, can build cumulative diffs automatically, and I think it does so by default. Debian really should switch to using this feature: I keep seeing Debian users complain on forums and blog posts that APT diff updates are dumb because you end up downloading 30 files... no: the real issue is that Debian isn't using their own tool well :(.

2) The #1 performance issue I deal with while using APT even on my server is the amount of time it takes to build the cache file every time there is a package update. It sits there showing you a percentage as it does this on the console. On older iPhones this step was absolutely brutal. This step was taking some of my users minutes, but again: that is the step I most notice on my server.

I spent a week working on this years ago, and made drastic improvements. I determined most of the time was spent in "paper cuts": tiny memory allocations and copies distributed through the entire project which, over the course of running the code, hemorrhaged time.

The culprit (of course ;P) was std::string. As a 20-year user of C++ who spent five years in the PC gaming industry, I hate std::string (and most of the STL, really: std::map is downright idiotic... it allocates memory even if you never put anything into the map, and I can tell you from writing my own C++ red-black tree tools that there is no good reason for this).

Sure, maybe APT is using C++11 by now and has a bunch of move constructors all over the place that mitigate the issue somewhat (I haven't looked recently), but it still feels "weirdly slow" to do this step on my insanely fast server (where by all rights it should be instantaneous) and frankly: APT's C++ code when I was last seriously looking at the codebase was abysmal. It was essentially written against one of the very first available versions of C++ by someone who didn't really know much about the language (meaning it uses all the bad parts and none of the good; this happens when Java programmers try to use C++98, for example, but APT is much much worse) and has no rhyme or reason to a lot of the design. It reminds me a little of WebKit's slapped together "hell of random classes and pointers that constantly leads to use-after-free bugs".

Regardless, I rewrote almost every single usage of std::string in the update path to use a bare pointer and a size, and to pass around fragments of what had been memory mapped from the original file whenever possible without making any copies. I got it to be at least twice if not four times faster (I don't remember). I made the code entirely unmaintainable while doing this, though, and so I have never felt my patches were worth even trying to merge back (though it also took me years to ever find the version control repository where APT was developed anyway... ;P). To this day I ship some older version of APT that I forked rather than updating to something newer, due to a combination of this and the gratuitous-and-unnecessary ABI breakage in APT (they blame using C++, but that isn't quite right: the primary culprit is their memory-mapped cache format, and rather than use tricks when possible to maintain the ABI for it they just break it with abandon; but even so, the C++ is buying me as a user absolutely nothing: they should give me a thin C API to their C++ core.)

If I were to do this again "for real" I would spend the time to build some epic string class designed especially for APT, but I just haven't needed to, as my problem is now "sort of solved well enough": I have almost never cared about the new features that have been added to APT, and I have backported the few major bug fixes I needed (and frankly have much better error correction now in my copy, which is so unmaintainably drifted due to this performance patch as to not be easily mergeable back :/ but we really really need APT to never just give up entirely or crash if a repository is corrupt, and so those fixes are critical for us in a way they aren't for Debian or Ubuntu).

If anyone is curious what these miserable patches look like, here you go... check out "tornado" in particular. (Patches are applied by my build system in alphabetical order.) (Actually, I have been reading through my tornado patch and I actually did at some point while working on it build a tiny custom string class to help abstract the fix, but I assuredly didn't do it well or anything. I really only point any of this maintainability issue out at all, by the way, as I don't want people to assume that performance fundamentally comes at the price of unmaintainable implementations.)

http://svn.telesphoreo.org/trunk/data/_apt7/


(1) PDiffs are merged before being applied, and the downloads are pipelined, so it does not matter much anymore.

(2) The cache generation is somewhat slow; it currently takes about 2 seconds on a two-generation-old laptop.

I sped it up a bit now, and replaced a map in that path with an unordered_map in the process.

Profiling it does not show anything obvious that can be optimized, though, most of the time is spent parsing files.

Most importantly, as you can see in https://people.debian.org/~jak/gencache.svg, only about 10% is spent in the string code.

(3) We don't want to recover from errors. Error recovery masks problems.

(4) ABI breaks: We try hard to avoid them. We have added tons of d-pointers everywhere these days, so we can extend things as needed without breaking ABI.

Sometimes you do need to break ABI. For example, offsets were 32-bit sometimes causing large files not to work correctly.


FindGrp() is somewhat slow because we do not store the untruncated hash value in the groups, so we have to compare all the strings (well, we order them on insertion, but it's still 2 to 3 string comparisons on average; an average of 3 integer comparisons would be nicer).

But fixing that requires an ABI break, so it has to wait for a chance. Like in a year or something.
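The optimization being described could look something like the following sketch (hypothetical layout, not APT's real cache structures): keep the full hash in each group record, so that most candidates are rejected with one integer compare before any string comparison happens:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical group record: storing the untruncated hash means a
// lookup can reject non-matches with a single integer compare; strcmp
// is only needed on the rare full-hash collision.
struct Group {
    uint32_t FullHash;   // untruncated hash of the name
    const char *Name;
};

static uint32_t Hash(const char *s) {
    uint32_t h = 5381;   // djb2, purely for illustration
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

const Group *FindGrp(const std::vector<Group> &bucket, const char *name) {
    uint32_t h = Hash(name);
    for (const Group &g : bucket) {
        if (g.FullHash != h)                 // cheap integer rejection
            continue;
        if (std::strcmp(g.Name, name) == 0)  // confirm on hash match
            return &g;
    }
    return nullptr;
}
```

The trade-off mentioned above is that widening the on-disk group record changes the memory-mapped cache layout, hence the ABI break.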


Out of curiosity, have you looked at raptorial? I don't know much about it (aside from knowing the author) but it claims to be a much faster drop-in replacement for the apt family of tools:

https://github.com/dankamongmen/raptorial


No. I have actually never heard of this project (either raptorial or SprezzOS). FWIW, I don't actually care about the performance of APT on my servers (where I mostly am running Ubuntu), and for Cydia I link against libapt and thereby don't use the tools (at some point jailbreaks actually stopped shipping the command line tools by default as they were large and unused by 99% of our users).

The web page for raptorial seems to be offline, the project is two years stale, and I spent a few minutes glancing through the code and came across a "fix me" comment next to the version comparison function (one of the most important parts of anything that even sort of works with APT, and something I have now written many implementations of in various languages) saying it doesn't work correctly for numbers, and so the project has earned a "vote of no confidence" from me (sorry :().


So, we have basically rewritten the tornado patch idea using the C++17 string_view API (subset implemented in APT::StringView). This improved performance by about 10% on my test systems.

Here's the commit:

http://anonscm.debian.org/cgit/apt/apt.git/commit/?id=eff0c2...

also further improvement:

http://anonscm.debian.org/cgit/apt/apt.git/commit/?id=dd5927...

and finally, dropping strings out of StoreString(), so it's now almost entirely gone:

http://anonscm.debian.org/cgit/apt/apt.git/commit/?id=869700...


> If I were to do this again "for real" I would spend the time to build some epic string class designed especially for APT,

... or you could instead re-work it to use what are now called string_view and what used to be called string_ref.

* http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n392...

* http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n344...

* http://www.boost.org/doc/libs/master/libs/utility/doc/html/s...
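Both string_view/string_ref and the "tornado" patch's custom class boil down to the same trick: parse fields as (pointer, length) views into the memory-mapped buffer instead of copying each field into a std::string. A minimal sketch (the "Key: Value" format is deb822, but Field and ParseField are made-up names, not APT's API):

```cpp
#include <string_view>

// Split one "Key: Value" line of a Packages stanza into views that
// alias the underlying (e.g. memory-mapped) buffer: no allocation and
// no copy, in contrast to building two std::strings per field.
struct Field {
    std::string_view key;
    std::string_view value;
};

inline bool ParseField(std::string_view line, Field &out) {
    size_t colon = line.find(':');
    if (colon == std::string_view::npos)
        return false;
    out.key = line.substr(0, colon);
    std::string_view v = line.substr(colon + 1);
    // trim the single leading space that deb822 fields carry
    if (!v.empty() && v.front() == ' ')
        v.remove_prefix(1);
    out.value = v;
    return true;
}
```

The caveat, as with any view type, is lifetime: the views are only valid while the mapped file stays mapped.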


Is it possible to set up a mirror of the Debian core repositories that does use cumulative diffs, so that end users can use that repository instead?

I'd be happy to try to configure/host such a thing if there are no difficulties.


Someone just needs to talk to the Debian people about fixing this. Honestly, I thought this was fixed two years ago, but a friend of mine (one who works in the Cydia ecosystem, ironically) was complaining about diff updates a couple weeks ago, and it turns out it was not. I mostly outsourced "upstreaming to distributions" (as opposed to projects) to a friend of mine who now works for RightScale, and so I never really gained the right contacts in that bureaucracy for myself, but if no one else does, I should probably get around to sending this suggestion to them...


Just asking: have you gotten one of the Debian CTTE guys to file a top-level proposal (like the systemd proposal https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=727708)?

I think you are one of the few people that truly understand the ecosystem and probably are the right person to influence this.

Incidentally, I had a pretty interesting thread today on the question of whether we will see a single package manager format on Linux. Wonder what your thoughts are... and in general on the state of package management on Linux. Would you build Cydia on something else today?


The technical committee isn't necessary unless you've already asked and been told "no".

The ftpmasters and the apt maintainers would be the appropriate pair of contacts.


CTTE is for when you have a dispute over a technical decision. From my reading of the above comments, there is no conflict, just lack of interest.


You don't really need cumulative diffs anymore. Sure, they would be nice to have, but with pipelining and client-side merging of the patches before applying, things are not that much slower.


I have to take that back. For Contents-Source, it's ugly. The rest is fine though.


It's surely too late for your project, but: I was a committed Debian user for ~10 years (I even have the shirt from staffing a booth at a conference!), tried out Arch Linux, and have been delighted; I'm now even a cautious evangelist for it. With Arch you still get binary packages and huge package coverage, as with Debian's packaging system, but the tools are super simple and super fast. I hadn't realized how much time I spent waiting for dpkg until I had a system that didn't use it.


So libapt is a critical piece of your widely used tool, and you're just happy to sit back, bag on it, and talk about the one unmaintainable patchset you made. Typical. Ever wonder why apt isn't better in the first place?


If you spend any time looking into me, you will find that I actually care a lot about getting things upstreamed or at least reported in general. A few days ago I was doing a demo where I was laughing about how almost all node.js modules are one line of non-working code, using an example I had found earlier that day, and yes: I reported the issue. I have offered to do all the work to port some projects to fix things, and handed over patches. I am subscribed to a million project mailing lists and have weighed in on random things in random projects over the years.

The reality here is that I don't have the months required to really fix the one place here you could say I have "bagged on" APT, and it is not clear to me that the real problem (which is that the APT community is using C++ in weird or even simply "too many" ways) is fixable.

That said, I will also ask you a question: have you ever really cared how fast APT is? I notice it being slow on my server, but I know why it is slow and where it is slow... it is certainly much faster than Ruby bundler, for example. Is there any good reason at all for anyone but me to change all this code?

The OP complained about some performance issue applying diffs, but as I explained, the real issue with diff updates is that Debian isn't sending cumulative diffs: it doesn't really matter that that is slow either.

So I don't know what you really think I should be doing here. Do you want me to try to train the APT developers in some way? That won't work and is frankly kind of arrogant in a way that commenting from the sidelines isn't. The truly asshole "Web 2.0 GitHub generation" solution at this point would be to fork APT and "compete" with it, but I really really think people who push public forks of key projects are assholes: I am not going to do that.

But if we are going to talk about performance issues in this tool, I am going to try to provide some access to my background about what is going on and why: you essentially will never see me sit around and complain about APT in a context outside of this (and in fact you will generally find me defending APT against developers who are quick to judge something by quality of detail rather than quality of design: reddit makes it difficult to find old comments, but if you really cared you would find me constantly pointing out that APT can be improved rather than thrown out).


> Do you want me to try to train the APT developers in some way? That won't work

Try it out, we're always listening and around in #debian-apt on OFTC.


I am really surprised that something as frequently used as apt had these obvious performance issues. Was there a technical reason for it? I noticed that it ran painfully slow on devices with slower disks. I suppose that is going to change now.


A 20x speedup as reported here is not that unusual in my experience. If I work on optimizing something which has never had detailed performance analysis, I expect to achieve 3x to 10x speedups in most cases. Even 100x is not really rare.

I once found a similar thing in a commercial setting. A program had been written which could set one bit in a shared memory region for use as a feature flag. It was slowly expanded in use until it was being run thousands of times a day on hundreds of very expensive computers. I wrote a replacement program which read commands from a file and set all the bits in one invocation. This saved well over 100k USD per year if you just added up the cost of computer time without staffing. I imagine this APT fix is worth more than that.
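The batching fix amounts to something like this (all names hypothetical; the point is amortizing one process launch over many operations instead of paying the launch cost per bit):

```cpp
#include <bitset>
#include <istream>
#include <sstream>
#include <string>

// Before: one process launch per bit set (thousands of exec()s per day
// across hundreds of machines). After: read "set <index>" commands
// from a stream and apply them all in a single invocation.
constexpr size_t kFlagBits = 4096;

size_t ApplyCommands(std::istream &in, std::bitset<kFlagBits> &flags) {
    std::string op;
    size_t idx, applied = 0;
    while (in >> op >> idx) {
        if (op == "set" && idx < kFlagBits) {
            flags.set(idx);
            ++applied;   // count accepted commands, skip malformed ones
        }
    }
    return applied;
}
```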


I doubt this is by any means the most important performance issue that Debian package installation faces. It looks like the speedup discussed here is in the no-update case, which is nice, but not the only case. I stopped using Debian at home several years ago, and one of the main reasons was the incredible slowness of groping around in GNOME's moronic XML settings database during every update. Upgrading schemas and whatnot. I still use Debian at work and I note that it is still prone to building and rebuilding initrd during an update when it should only be built once. On a recent update initrd was built for the kernel that was being _uninstalled_ which was awesome.


Consider using eatmydata. Living on the edge, but it's worth it.

https://packages.debian.org/jessie/eatmydata


> I noticed that it ran painfully slow on devices with slower disks. I suppose that is going to change now

No, most of the time is spent fsync'ing, and it is a problem at a lower layer (related to dpkg), not apt. Too many bug reports to list, but see [0] for the latest one.

You can use apt with eatmydata as a workaround.

[0] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=613428


Apt is quite slow on a Pi running OSMC, so I'm hoping this makes its way downstream at some point.


Based on the sort of issues fixed, it looks like no one ever bothered to benchmark it until now.


As a Debian user, I find that updating can be slow, but it looks like it spends much of its time on network traffic. Also, 30 or 40 seconds is not long enough to be _really_ annoying. Nevertheless, I'm looking forward to this!


> The reason for this is that our I/O is unbuffered, and we were reading one byte at a time in order to read lines. This changed on December 24, by adding read buffering for reading lines, vastly improving the performance of rred.

Why not use memory mapped files and let the operating system deal with caching in a way that is most efficient for that specific case?


Not all our files are uncompressed, so directly using mmap does not make sense for us, because we also need to be able to use files compressed with gzip, bzip2, xz (and soonish probably lz4).


I believe mmap is not fully portable without extra work, so they avoid it - debian runs on a LOT of different systems.


It was developed inside the original BSD and is part of POSIX. So it's supported on all unixes today. The only place it isn't supported is Windows.


It is everywhere, even Windows. There is a little bit of weirdness on some older versions of Windows CE, and some undefined behavior with edge conditions on some OSes if I recall, but you can find the capability everywhere. I added the map API to QFile back in the day and went through checking different OSes for compatibility. From a quick Google search, here is a link to the Windows method:

https://github.com/radekp/qt/blob/master/src/corelib/io/qfsf...

unix version: https://github.com/radekp/qt/blob/master/src/corelib/io/qfsf...


This is great.

Even more so because after gem, or pip, or something (can't remember) had a similar issue a while ago (I think they had an O(n²) algorithm), a lot of people jumped on it as being bad computer science. There were all sorts of calls about how web people were not real computer scientists.

Either way, good useful products were made and they've been further optimised. That's great. More of that more of the time.


You're probably thinking of this, rubygems was optimized using a linked list: https://news.ycombinator.com/item?id=9195847


It's great to see apt improving. I am still waiting for "apt-get update" to become a git-pull-like thing with much faster updates (especially when no recent changes were made), though.



