
Downloading a file regularly - how hard can it be? - joeyespo
https://adblockplus.org/blog/downloading-a-file-regularly-how-hard-can-it-be
======
sophacles
A common solution to this problem is to make a two-stage process, where step 1
is a request of "should I download?", with two possible replies: "no, check
again in N time" and "yes, here is a token". Step 2 is then presenting the
token to the API endpoint for download and getting the file.

On the server side, you don't even need specific instance tracking, just a
simple decision based on current resource usage and a list of valid tokens
(optionally, they can expire after some short time to avoid other
thundering-herd-type issues). Say you set a max number of file transfers, or
bandwidth, or whatever metric makes sense to you, and you simply reply based
on that metric. Further, you can smooth out your load with a bit of
intelligence in setting N.

Even better, you get a cool side effect: since the check isn't so
resource-intensive, you can set the time between checks lower and make the
updates less regular.
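
To make this concrete, here is a rough Python sketch of that server-side
decision; the names (MAX_ACTIVE, TOKEN_TTL, RETRY_AFTER) and the in-memory
bookkeeping are purely illustrative, not taken from any existing
implementation:

    import time
    import uuid

    MAX_ACTIVE = 100      # cap on concurrent transfers (or substitute a bandwidth metric)
    TOKEN_TTL = 60        # seconds before an unused token expires
    RETRY_AFTER = 300     # the N in "no, check again in N time"

    valid_tokens = {}     # token -> expiry timestamp
    active_transfers = 0  # updated by the download handler (not shown)

    def check_request():
        """Step 1: reply either ('retry', N) or ('token', <token>)."""
        now = time.time()
        # expire stale tokens so they can't pile up
        for tok, expires in list(valid_tokens.items()):
            if expires < now:
                del valid_tokens[tok]
        if active_transfers + len(valid_tokens) >= MAX_ACTIVE:
            return ("retry", RETRY_AFTER)
        tok = uuid.uuid4().hex
        valid_tokens[tok] = now + TOKEN_TTL
        return ("token", tok)

    def redeem_token(tok):
        """Step 2: the client presents its token to the download endpoint."""
        return valid_tokens.pop(tok, 0) >= time.time()

Smoothing the load then comes down to how you pick RETRY_AFTER, e.g. adding a
bit of per-reply jitter so the next round of checks doesn't arrive all at once.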

Now that I think of it: this seems like it would be a nice nginx plugin, with
a simple client-side library to handle it for reference. Anyone want to
collaborate on this over the weekend? Should be relatively straightforward.

~~~
masklinn
> A common solution to this problem is to make a two-stage process, where step
> 1 is a request of "should I download?", with two possible replies: "no,
> check again in N time" and "yes, here is a token". Step 2 is then presenting
> the token to the API endpoint for download and getting the file.

You don't even need two steps, just one step with previously known data.
That's how HTTP conditional requests (Last-Modified/If-Modified-Since and
ETag/If-None-Match) work: the client states "I want this file, I already have
one from such a moment with such metadata", and the server replies either
"you're good" (304) or "here's your file" (200).

Issue is, that only works when the file changes rarely enough, or you need
additional server logic to reply that the file is still good when it's not.
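
For reference, a conditional GET from the client side looks roughly like this
(Python with the requests library; the URL and the way the validators are
remembered between runs are placeholders):

    import requests

    URL = "https://example.com/filterlist.txt"  # placeholder

    def fetch_if_changed(etag=None, last_modified=None):
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        resp = requests.get(URL, headers=headers)
        if resp.status_code == 304:
            # "you're good": nothing to download, keep what we have
            return None, etag, last_modified
        resp.raise_for_status()
        # "here's your file": keep the new validators for next time
        return (resp.content,
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"))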

> Now that I think of it: this seems like it would be a nice nginx plugin,
> with a simple client-side library to handle it for reference. Anyone want to
> collaborate on this over the weekend?

I'd be _very_ surprised if nginx didn't support conditional requests already.

edit: according to [0] and [1] (which may be outdated), nginx provides built-in
support for Last-Modified on static files; it does not provide ETag support
(the developer believes this is not useful for static files, which is usually
correct [2]), but the author of [1] has apparently written a module to do so
[3]. The module being 4 years old, it might be way out of date.

[0] <http://serverfault.com/questions/211637/what-headers-to-add-for-most-efficient-file-caching>

[1] <https://mikewest.org/2008/11/generating-etags-for-static-content-using-nginx>

[2] There are two situations in which it is not (keep in mind that this is for
_static_ content, dynamic is very different): if somebody willfully touches a
file, it will change its Last-Modified but not its checksum, triggering a new
send without ETag but not with it; and ETags can be coherent across servers
(even in CDNs), whereas the chances of Last-Modified being exactly the same on
all your servers are far smaller.

On the other hand, no ETag is better than a shitty ETag, and both Apache and
IIS generate dreadful ETags by default, which may hinder more than help.

[3] <https://github.com/mikewest/nginx-static-etags/>

~~~
sophacles
Yes, this works for cache updating, and it is fantastic for that purpose. It
does not solve the actual stated problem, which is that periodic checks made
in an attempt to smooth server load away from peaks usually drift towards
extremely bursty behavior. When the file does change, you still get a large
number of clients trying to download the new content all at once. The solution
I was suggesting is similar to what you are talking about, but also has the
feature of smoothing the load curves.

 _Issue is, that only works when the file changes rarely enough, or you need
additional server logic to reply that the file is still good when it's not._

My algorithm is that logic -- albeit implemented with client-side collusion
rather than pure server-side trickery (this allows better control should the
client ignore the etags).

~~~
masklinn
> The solution I was suggesting is similar to what you are talking about, but
> also has the feature of smoothing the load curves.

It does no more to smooth the load curve than using Cache-Control with the
right max-age.

> My algorithm is that logic

It is no more that logic than doing what I outlined with proprietary
behaviors.

> this allows better control should the client ignore the etags

by making the whole client use a custom communication channel? I'd expect
ensuring the client correctly speaks HTTP would be easier than implementing a
custom client from scratch.

~~~
sophacles
You still seem to be missing the point. Cache-Control, as commonly implemented
and as you describe it, will serve the new file to every request as soon as a
new file is available. It takes into account exactly one variable: file age.

The algorithm I describe takes into account variables which affect current
system loading, and returns a "no, try again later", even when the file is
actually different, because the server is trying to conserve some resource
(usually in such cases it is bandwidth). Like I said, this can be done with
etags, but a more explicit form of control is nicer. Which brings us to this:

 _> this allows better control should the client ignore the etags

by making the whole client use a custom communication channel? I'd expect
ensuring the client correctly speaks HTTP would be easier than implementing a
custom client from scratch._

A client speaking proper http would be perfect for this. So point your http
client to:

domain.com/getlatest

if there is a token available, respond with a:

307 domain.com/reallatest?token=foo

If no token is available and no if-modified headers are sent, reply with:

503 + Retry-After N

if there is not a token available, and the requestor supplied appropriate
If-Modified-Since headers, respond with a:

304 + cache control for some scheduled time in the future (which the client
can ignore or not)

Of course that last condition is strictly optional and not really required,
since it would be abusing cache control rather than using the 503 as intended.

(also note, a request to domain.com/reallatest with an invalid token or no
token could result in a 302 to /getlatest or a 403, or some other form of
denial, depending on the specifics of the application).

edit: Strictly speaking, the multiple-URL scheme above isn't even needed; a
smart responder behind the 503 is enough. However, the URL redirect method is
there because there may be a larger application context around the system, in
which getlatest does more than just serve the file, or in which multiple URLs
redirect to reallatest -- both easily imaginable situations.
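
A bare-bones sketch of that responder, using only Python's standard library
(the paths match the scheme above, but MAX_TOKENS, the token bookkeeping and
the file body are placeholders, and a real server would also check
If-Modified-Since against the file before replying 304):

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs
    import time
    import uuid

    TOKENS = {}            # token -> expiry timestamp
    MAX_TOKENS = 50        # crude stand-in for "how much bandwidth is free"
    RETRY_AFTER = "600"    # the N to send back when we're busy

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            url = urlparse(self.path)
            if url.path == "/getlatest":
                if len(TOKENS) < MAX_TOKENS:
                    # token available: redirect to the real download URL
                    tok = uuid.uuid4().hex
                    TOKENS[tok] = time.time() + 60
                    self.send_response(307)
                    self.send_header("Location", "/reallatest?token=" + tok)
                elif "If-Modified-Since" in self.headers:
                    # no token, but the client sent validators
                    self.send_response(304)
                    self.send_header("Cache-Control", "max-age=" + RETRY_AFTER)
                else:
                    # no token, no validators: plain "try again later"
                    self.send_response(503)
                    self.send_header("Retry-After", RETRY_AFTER)
                self.end_headers()
            elif url.path == "/reallatest":
                tok = parse_qs(url.query).get("token", [""])[0]
                if TOKENS.pop(tok, 0) >= time.time():
                    self.send_response(200)
                    self.send_header("Content-Type", "text/plain")
                    self.end_headers()
                    self.wfile.write(b"...the actual list goes here...")
                else:
                    # invalid or missing token: bounce back to /getlatest
                    self.send_response(302)
                    self.send_header("Location", "/getlatest")
                    self.end_headers()
            else:
                self.send_response(404)
                self.end_headers()

    HTTPServer(("", 8080), Handler).serve_forever()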

~~~
masklinn
> If no token is available and no if-modified headers are sent, reply with:

> 503 + Retry-After N

That's cool. There's still no reason for the second url and the 307, and
you're still getting hit with requests so you're not avoiding the request
load, only the download. You're smoothing out bandwidth, but not CPU &
sockets.

~~~
sophacles
This is sort of true. I don't know of a way to simply limit the number of
incoming sockets without getting a lot of ISP-level involvement or just
outright rejecting connections. It does limit the number of long-lived sockets
for file transfer. For static file serving, I am assuming the CPU has plenty
of spare capacity for running the algorithm, so I am not worried about that.
Finally, I am assuming the limiting factor here is bandwidth, so bandwidth
smoothing is the main goal.

------
moe
Assuming changes are usually small, have you considered serving diffs?

I.e. have the clients poll for the md5 of their _current_ list-version.

On the server, store the diff that will upgrade them to the current version
under that filename. If a client requests an unknown md5 (e.g. because he has
no list or his list is corrupted), default him to a patch that contains the
full file.

This requires a little logic on both ends (diff/patch), but would probably
slash your bandwidth requirements to a fraction.

A little napkin math:

25 lists * 150kb * 1mio fetches = ~3.75T

vs

25 lists * 1kb (patch) * 1mio fetches = 25G (0.025T)
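
For illustration, a rough Python sketch of the server-side bookkeeping
(function names and storage are made up; in practice the patches would be
precomputed whenever the list changes and then served as static files):

    import difflib
    import hashlib

    def md5_of(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def build_patch_table(old_versions, current):
        """One patch per known old version, keyed by the md5 the client reports."""
        table = {}
        for old in old_versions:
            patch = "".join(difflib.unified_diff(
                old.splitlines(keepends=True),
                current.splitlines(keepends=True)))
            table[md5_of(old)] = patch
        return table

    def response_for(table, client_md5, current):
        # unknown md5 (no list yet, or a corrupted one): send the full file
        return table.get(client_md5, current)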

~~~
pjscott
This is probably the Right Way, but it would be more work than minor tweaks to
the delay logic.

------
K2h
Call me old-school, but a huge peak demand is the perfect application for a
distributed source, like torrents. I know it is much more complicated to
introduce P2P and way more risky if it gets poisoned, but it seems to me this
underlying problem of huge peak demand was solved 10 years ago.

~~~
nitrox
But there is a problem with BitTorrent: most schools and workplaces block it.
We would need to fall back to HTTP or some other method that works in
restricted places.

~~~
skeletonjelly
I wonder if there's a market for Bittorrent over HTTP? Node.js,
websockets...surely it's possible?

~~~
icebraining
All of those are strictly client-to-server, not P2P. You could in theory proxy
bittorrent over it, but you wouldn't gain anything over just serving the file
from the server.

You can probably write a true P2P client as a Firefox extension, since its API
gives you very low level access (raw sockets, for example), but certainly not
for e.g. Chrome.

~~~
AntiRush
WebRTC[1] seems to be the perfect platform for these sorts of things. It's in
Chrome dev channel / Firefox Alpha right now.

[1] <http://www.webrtc.org/>

------
fleitz
I love random numbers for distribution. I had a similar problem with a set of
distributed clients that needed to download email, but only one client
downloading at a time. The email servers also had an issue where a large
number of emails in the inbox would cause the server to slow down
exponentially. (eg. it didn't matter how many MB of email were in the inbox
but it did matter if there were more than about 1000 emails)

The downloaders would download the list of inboxes to be fetched, randomize
them, and lock an inbox when they started downloading it. The downloader would
then randomly pick a size cutoff for the maximum email size it would download
(10K, 1MB, unlimited), with an inversely proportional maximum email count so
that about 100MB could be downloaded at any time.

We even had an issue with one server behind an old Cisco router that barfed on
window scaling, so a few machines in the pool had window scaling disabled and
that account would naturally migrate to those machines.

It worked wonders for distributing the load and keeping the Inbox counts to a
minimum.

------
fromhet
I know it's overkill for a browser extension, but wouldn't this be easily
solved by having built-in BitTorrent for updates?

The publisher would always be seeding the latest version, and the clients
would connect maybe every other day. It would lower the pressure on the
publisher's servers and make sure everyone could always have the latest
version.

With these fancy magnet links, the publisher would only have to send the
magnet and the actual file a couple of times, and then the peer-to-peer swarm
would do the rest.

------
kogir
I would just sign it, stick it on S3, and forget it. Did I miss why that
wasn't considered?

~~~
nitrox
It is too expensive. 1TB of bandwidth costs about $120. A project like Adblock
Plus will be consuming about 3-4 TB a month, which adds up to around $450 a
month.

Adblock list subscriptions are maintained and hosted by individual people who
do it in their spare time. They mostly pay for the servers out of their own
pockets. As a co-author of a popular adblock list, I wouldn't want to break
the bank to pay for S3 hosting. Our current solution works out, and when we
reach our bandwidth limit we can simply buy additional TB of bandwidth at a
much cheaper price than S3.

Btw, I just made a rough calculation using the AWS simple monthly calculator,
so correct me if I am wrong about S3 pricing.
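
(Rough check against those numbers: ~$120 per TB is about $0.12/GB, so 3.5 TB
is roughly 3,500 GB x $0.12 ≈ $420 a month, which lines up with the ~$450
estimate.)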

~~~
tedunangst
Terabytes per month? That's insane. That's a million users (I can believe)
downloading a megabyte (I can't quite believe). It appears my patterns.ini
file is 600K, or about 150K compressed, so if I download it 30/5 = 6 times a
month, that's... a megabyte. Wow.

~~~
tripzilch
Wow, that suggestion elsewhere in the thread to serve diffs instead seems
rather important now :)

------
antihero
Why not assign people a day and time, and then if they regularly miss that
time, assign them a different one?

------
tantalor
> with the effect that people always download on the same weekday

What's so bad about that?

~~~
rmc
Server load goes really high on that day, and if you get more popular, you'll
need more servers and hence more money.

~~~
rb2k_
Isn't that something that nginx/varnish should easily be able to handle? It is
just a static file download after all...

~~~
ComputerGuru
CPU and bandwidth are entirely different issues. Sure, nginx can handle the
processing. But do you have the piping to match?

A run-of-the-mill dedicated server has a 100 Mbit uplink. Do the math. (Hint:
it's easy to saturate in no time.)
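
(For a rough sense of scale: 100 Mbit/s is about 12.5 MB/s, so at the ~150 KB
per compressed list mentioned elsewhere in the thread that is on the order of
80 downloads per second; a million clients checking in on the same day would
keep that uplink saturated for hours.)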

~~~
prostoalex
Has anybody tried <https://developers.google.com/speed/pagespeed/service> for
this?

~~~
oconnor0
This is just downloading a single static text file so there's nothing to
optimize.

