> 28 MB extra every month shouldn't even be noticeable
The parent comment suggests a phishing site life-cycle of <15 hrs; at 400k new URLs a month, that's ~8,333 every 15 hrs. To give an idea of how frequency-sensitive this is, assume URLs are added uniformly in time: that's a new one roughly every 6.5 seconds. For such time-critical information it makes no sense to attempt to synchronize clients; it would require constant polling or push updates to have _any_ chance of catching a malicious URL.
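For concreteness, a quick back-of-envelope check of that rate (the 400k/month and 15 hr figures come from the thread, they're not verified):

```python
# Back-of-envelope: how often does a new phishing URL appear,
# assuming 400k new URLs per month, added uniformly in time?
URLS_PER_MONTH = 400_000               # figure quoted upthread, not verified
SECONDS_PER_MONTH = 30 * 24 * 3600     # ~one month

interval_s = SECONDS_PER_MONTH / URLS_PER_MONTH
per_15h = URLS_PER_MONTH / (SECONDS_PER_MONTH / (15 * 3600))

print(f"one new URL every ~{interval_s:.1f} s")    # ~6.5 s
print(f"~{per_15h:.0f} new URLs per 15 h window")  # ~8333
```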
At such a frequency, efficiency becomes less about bandwidth and more about the overhead of continuously synchronising so many clients: think of that 28 MiB spread across 400k separate messages over a month, one every ~6.5 seconds. That not only inflates the total size, it also causes constant network usage and processing that is far less efficient than a single 28 MiB download.
Or you could just send a hash of the URL when you visit one... (do you request anywhere near 8k URLs every 15 hrs, one every 6.5 seconds? no). It's so clearly the simpler solution, faster for everyone, and it doesn't let bad URLs slip through while waiting on a lagging sync.
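A minimal sketch of what that per-visit lookup could look like, assuming a hypothetical check endpoint that accepts a URL hash and returns a verdict (the endpoint and response format are invented for illustration, not any real safe-browsing API):

```python
import hashlib
import json
import urllib.request

# Hypothetical lookup service; not a real API.
CHECK_ENDPOINT = "https://example.invalid/check"

def is_flagged(url: str) -> bool:
    """Send only a hash of the visited URL and ask whether it's on the blocklist."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    req = urllib.request.Request(f"{CHECK_ENDPOINT}?h={digest}",
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.load(resp).get("flagged", False)
```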
There's no need to sync every time a new phishing URL is added - only every time a URL is visited by a client.
The delta can be derived just from the version number of the client's URL database, and a whole day's worth of updates should total about 1 MB. So ~1 MB for the first URL visited in a day, and considerably less afterwards. Compared to the average webpage size, that's nothing.
Really, the only thing that changes is that instead of sending a URL hash, you send your URL DB version, and the reply is the list of changes since that version.
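A sketch of that version-based exchange, with an invented wire format (version in, added/removed entries plus new version out); this only illustrates the idea, it is not anyone's actual protocol:

```python
import json
import urllib.request

# Hypothetical delta endpoint; the request/response shape is made up.
DELTA_ENDPOINT = "https://example.invalid/deltas"

def sync_blocklist(local_version: int, local_db: set) -> int:
    """Send our DB version, apply whatever changed since then, return the new version."""
    with urllib.request.urlopen(f"{DELTA_ENDPOINT}?since={local_version}",
                                timeout=5) as resp:
        delta = json.load(resp)  # e.g. {"version": N, "added": [...], "removed": [...]}
    local_db.update(delta["added"])
    local_db.difference_update(delta["removed"])
    return delta["version"]      # becomes the 'since' value next time
```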
> Really, the only thing that changes is that instead of sending a URL hash, you send your URL DB version, and the reply is the list of changes since that version.
Or no changes at all, just a simple confirmation that the list is up to date. Yes, this is a much better idea.
Although it's always going to be less efficient, and I'm not sure how it would scale into the future. Checking URLs server side is optimal: the cost stays roughly constant, proportional to the URL size. With DB deltas, each lookup instead depends on both the URL size and the DB update rate, i.e. as the malicious-URL rate increases over time, individual URL lookups will incur a greater network cost. That's probably not a big deal for the client, but it would make a significant difference for the provider of the deltas... or maybe network caching would dissolve it again? There would be a lot of duplicate deltas flying around every minute; it's basically a content-distribution problem, but with a high-frequency twist.
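Roughly, the scaling concern looks like this (all numbers are illustrative, extrapolated from the 28 MB / 400k-per-month figures above):

```python
# Rough cost model: bytes transferred per URL lookup.
# A hash lookup is constant per visit; a delta sync grows with the update rate.
HASH_LOOKUP_BYTES = 300                    # one small request/response, roughly constant
BYTES_PER_DB_ENTRY = 28_000_000 / 400_000  # ~70 bytes/entry from the 28 MB/month figure

def delta_bytes(new_urls_per_hour: float, hours_since_last_sync: float) -> float:
    """Delta size for the first lookup after a sync gap."""
    return new_urls_per_hour * hours_since_last_sync * BYTES_PER_DB_ENTRY

print(f"hash lookup: ~{HASH_LOOKUP_BYTES} B per visit, regardless of rate")
print(f"delta, 1 h stale, ~556 URLs/h (today's alleged rate): ~{delta_bytes(556, 1)/1e3:.0f} KB")
print(f"delta, 1 h stale, 10x that rate: ~{delta_bytes(5560, 1)/1e3:.0f} KB")
```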
Do you really think Tencent is detecting a new phishing site every 6.5 seconds?
I'd seriously question how many of the total new phishing sites they detect to start with, and then how frequently they do so.
If a user only downloads the deltas periodically they'd risk being out of sync with the master list (which might not be updated even once a month or at all, for all we know), but that's the price they'd need to pay to avoid sending any information about their web browsing to parties they don't trust.
One other thing to consider is the likelihood that the URL you happen to be visiting is both a phishing URL to begin with and one that only appeared since your last delta download, compared to the likelihood that it's one of those already in the entire phishing URL database you've already downloaded. I'd expect those odds to be very low.
> the likelihood that the URL you happen to be visiting is both a phishing URL to begin with and one that only appeared since your last delta download, compared to the likelihood that it's one of those already in the entire phishing URL database you've already downloaded. I'd expect those odds to be very low.
Ignoring the first condition (otherwise why bother with a list at all)... Given that this information is very transient (average life-cycle 15 hrs, i.e. 54,000 seconds), the estimate is pretty simple: deltaT / 54000, where deltaT is the number of seconds since your last sync.
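To put numbers on that, assuming the 15 hr (54,000 s) average life-cycle quoted upthread:

```python
AVG_LIFECYCLE_S = 15 * 3600  # 54,000 s; figure quoted upthread, not verified

def fraction_missed(seconds_since_sync: float) -> float:
    """Rough fraction of currently-live phishing URLs that appeared after your last sync."""
    return min(seconds_since_sync / AVG_LIFECYCLE_S, 1.0)

print(fraction_missed(5 * 60))      # synced 5 min ago  -> ~0.6%
print(fraction_missed(60 * 60))     # synced 1 hour ago -> ~6.7%
print(fraction_missed(15 * 3600))   # synced 15 hrs ago -> 100%
```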
This is still horrible, because your safety is determined by how frequently you can sync with the DB.
> If a user only downloads the deltas periodically they'd risk being out of sync with the master list (which might not be updated even once a month or at all, for all we know).
Being updated monthly doesn't match the statistic of an average 15 hr life-cycle, because entries would be useless after that length of time. And while I don't claim to know the 15 hrs as a fact, it's intuitive that the average will only get shorter as malicious-URL checkers are updated ever faster.
> but that's the price they'd need to pay to avoid sending any information about their web browsing to parties they don't trust.
The full URL need not be sent; a hash of the URL's domain and path would probably suffice... If that's not enough, then it's a dilemma, but that doesn't make continuous syncing a good or fail-safe replacement.
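For illustration, one way to hash only the domain and path so that query strings and fragments never leave the device (the exact canonicalisation here is an assumption, not how any real safe-browsing scheme does it):

```python
import hashlib
from urllib.parse import urlsplit

def domain_path_hash(url: str) -> str:
    """Hash only host + path; query parameters and fragments are dropped."""
    parts = urlsplit(url)
    canonical = f"{parts.hostname}{parts.path or '/'}".lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(domain_path_hash("https://example.com/login?session=secret#token"))
# hashes only "example.com/login"
```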
"Being updated monthly doesn't match the statistic of average 15hr lifecycle, because it would be useless after that length of time."
And maybe it is useless. We don't actually know, but we should at least recognize that there may be a difference between how frequently phishing sites allegedly appear and how frequently they appear in Tencent's malware URL database.
"This is still horrible, because your safety is determined by how frequently you can sync with the DB."
And being identified by the Chinese government as someone who surfs to forbidden websites might be even more horrible, for some.