BeGoneAds – A Python script that blocks ads by installing common hosts files

AdmiralAsshat · on June 10, 2019

Python scripts to modify system files make me a little skittish, even with source code available. I think I would just as soon grab the hosts file from https://www.someonewhocares.org/ and drop it in myself.

digitalni · on June 11, 2019

That hosts file is depressing. Blocking trackers by a1.tracker.name, a2.tracker. name etc. Today it seems easier to have a whitelist rather than blacklist...

Redoubts · on June 10, 2019

I also don't understand systemd integration in the todo.

anned20 · on June 10, 2019

The goal here is to make it automatically update the hosts file once in a while

stuntkite · on June 11, 2019

Doesn’t automation of editing your hosts file potentially open people up to fairly insidious man in the middle attacks?

gumby · on June 10, 2019

Isn’t that what cron is for?

josteink · on June 10, 2019

Systemd has a cron system too.

I’m just guessing, but it might be easier to programmatically install a systemd cron job (and making sure it runs) than doing so for the old/conventional crond?

gumby · on June 10, 2019

Just using cron is back compatible

Spivak · on June 10, 2019

Things you get with systemd that you don't get with cron unless you implement them in the script.

* Not running if the network is down.

* Not running if the download path isn't available.

* Running if the machine was off during the scheduled time.

* Monitoring and retry logic.

* Logging to syslog.

* Resource constraints.

* Random wait.

It ends up being a lot of code factored out of the actual application.

behringer · on June 10, 2019

so instead of code in one place you have it in 2?

JustSomeNobody · on June 11, 2019

I don’t understand this logic. Can you explain why having code in more than one place is a bad thing?

behringer · on June 11, 2019

For a simple script that does one thing, it's overkill.

josteink · on June 11, 2019

Yes. So you keep that simple script simple, and you let systemd do the heavy lifting.

behringer · on June 11, 2019

But systemd doesn't do any of the above listed things automatically... You'd need to write an entire script for sytstemd to take advantage of it. An entire script you could just have run with cron on boot.

naikrovek · on June 10, 2019

He probably wants to learn more about systemd and needs/wants a project to do it.

Yes, obviously cron would work. If systemd can do what this dev needs, what's the harm?

anned20 · on June 10, 2019

You're right, but I like the extensibilty of Systemd. Things like running every week or whenever the next start is. Or only if the network is online.

dvfjsdhgfv · on June 10, 2019

If I considered constructing a botnet of Linux workstations, your software would be an excellent candidate.

farisjarrah · on June 10, 2019

You could say the same thing about Ansible, or Chef or Puppet or any of the other millions of systems automation tools available. There are lots of ways to misuse software, I don't think that that should preclude people from writing it.

dvfjsdhgfv · on June 10, 2019

I'm quite precise about about what these tools can and cannot do on my systems. Downloading random files form the Internet isn't on the list. When you hand over the control of the hosts file to someone else, you're basically transfer control of your DNS queries.

michaelmior · on June 10, 2019

At least for HTTP, you're hopefully using TLS for anything important and failing if the certificate isn't valid. That certainly won't remove all the risks of losing control of your DNS, but it's one good safeguard.

IggleSniggle · on June 11, 2019

Some of my favorite hacks in the last year have been about using valid certs for bad actions. When you can have a cert from Microsoft (Azure), there’s a lot of things people will trust.

Along similar lines, I think I heard that 30% of detected malware was signed with a “trusted” authority last year.

joyjoyjoy · on June 10, 2019

joyjoyjoy · on June 11, 2019

I don't understand the downvote.

1. What is so bad about python in specific?

2. If you worry about root privileges, required for modifying the host file, you can use app armor to put the thing on a leash

LASR · on June 10, 2019

Why not?

Modifying system hosts configuration requires privileged file system access.

The mindset here should be default deny.

sh-run · on June 10, 2019

Yep, finding and modifying a script that runs with root privileges, but is writable by non-root users is the oldest privesc trick in the book.

With the proper permissions something like this should be ok, but I'd tread lightly. Especially with something that dynamically updates your hosts file.

anned20 · on June 10, 2019

It won't be able to run if the user that is running it doesn't have the proper privileges. You could even protect the files by giving them other permissions so only the root user can use them.

sh-run · on June 10, 2019

Above you mentioned setting this up as a scheduled job. In this case the job would need to run as root (or you'd need to assign the appropriate permissions in the sudoers file, but people are lazy). If a non-root user had write privileges to the file, they could modify the script and thereby gain root code execution.

Naturally it's on the user to properly configure the permissions.

I'm not saying this isn't a worthy project, I'm just adding to the discussion on why people should be cautious when running scripts with root permissions.

daverobbins1 · on June 10, 2019

How is this different or better than Steven Black's project?

Repo: https://github.com/StevenBlack/hosts

anned20 · on June 10, 2019

To be completely fair with you, I didn't know this was out there. Then again, I think my solution is more elegant, especially if all the todo's are finished.

tempsolution · on June 10, 2019

Patching a critical system file with python on a regular basis... What could go wrong?

notatoad · on June 10, 2019

I don't understand the aversion to python here. What is it about python that makes it less reliable than any other piece of software?

Yes, downloading hosts files from 3rd-party sites is kind of sketchy. But using python to do it is what you're worried about?

anned20 · on June 10, 2019

I mean, if you don't trust the sources, you can use your own. Or if you don't trust the software, don't use it. It's up to the end user to decide what is in their best interest. I believe the included lists are trustworthy and I use them myself.

DyslexicAtheist · on June 10, 2019

my first question too. Steve Black's updateHostesFile.py is extensible and can be automated and is well trusted and gets tested by a large community. I don't want to sound critical but would like to understand the value add in comparison. Both are in python as well so I don't get it

fuzz4lyfe · on June 10, 2019

I assume you wrote a hello world once or twice, what was the point of that?

daverobbins1 · on June 10, 2019

Who posts their hello world on HN? My question was a valid one given the circumstances.

pvg · on June 10, 2019

People posts simple projects, learning projects, projects that are inadvertently a dupe of some other project, projects that are deliberately patterned after existing projects, etc, etc, etc all the time.

DyslexicAtheist · on June 10, 2019

sure I did but I didn't bother with a "Show HN", also this is well beyond the scope of a Hello World. It just made me wonder if OP hasn't studied what was already out there and if he did why no explanation or credit or mention of Steven's work ...

even if he disagreed with Steven solution and chose to reimplement, it would have been interesting to understand what the motivation was. I'm not saying he shouldn't, but just that it would be nice to know what motivated his design and why he thinks it's better to redo ...

inlined · on June 10, 2019

Since accepting Host files from someone on the internet can be dangerous I dug into the code:

The list of hosts to exclude comes from several sites here: https://github.com/anned20/begoneads/blob/2c90fcee221edf71f8...

The actual application of the hosts file is here: https://github.com/anned20/begoneads/blob/2c90fcee221edf71f8...

I missed something though. Is a simple domain name per line enough to send that content to /dev/null? I haven’t used that form in /etc/hosts.

My primary concern was that this technique could be used to send ad traffic to a site that returns 404 but gathers metrics on the web regardless.

js2 · on June 10, 2019

I think you may be misreading the code. It concatenates the host files at the various URLs and then inserts their contents into /etc/hosts. I only looked at a couple of those files but the ones I did used either 0.0.0.0 or 127.0.0.1 as an address combined with a domain.

To be honest, this code is over-engineered. It could be a single script with a handful of functions. At the same time, it’s missing functionality such as deduplicating entries from the different lists.

anned20 · on June 10, 2019

Those are things that are still on a small personal todo list. I agree that it could be much simpler, but I like the elegance of it.

anned20 · on June 10, 2019

This is actually sent to the IP address 0.0.0.0, it roughly means that all the traffic of the listed hosts is routed back to localhost

unfunco · on June 10, 2019

In this context, it actually means a "non-routable meta-address used to designate an invalid, unknown, or non-applicable target" [1] - 127.0.0.1 is localhost, 0.0.0.0 is its own thing.

[1]: https://www.howtogeek.com/225487/what-is-the-difference-betw...

anned20 · on June 10, 2019

Yes, you're completely right. Mixed those 2 up.

tx-0 · on June 10, 2019

"if a 0.0.0.0 packet falls in a forest, will it make a sound a 0.0.0.0 listener can hear?"

your link was to an informative but still sloppily written article and in this context, your summary of the article isn't clarifying.

to write clearly, people gotta stop throwing around the word localhost because at the level of n.n.n.n there are no names, only numerical addresses and localhost is a name, one defined in a text file: 127.0.0.1 points to, not localhost, but to the local host, always; localhost (the name) points to 127.0.0.1 iff it is defined to (which should be all the time).

what I learned from the article is that a local host server listening on 0.0.0.0 will listen to everthing it can hear. But the question context here is, where will a packet sent to 0.0.0.0 go?

The point that 0.0.0.0 is not routable does not answer the question because 127.0.0.1 is also not routable; however, 127.0.0.1 will arrive someplace, at the local host. The question is whether 0.0.0.0 will also arrive into "the pool o' packets", that place that packets arrive on the local host prior to their disposition being determined (a. routed out of the local tcp/ip pool o' packets, b. listened to within the local tcp/ip pool o' packets, or c. dropped on the floor) because routing isn't only what Routers do, it's what tcp/ip does within a local host. (and btw, the article also describes that 0.0.0.0 means the default routing of last resort address in the context of a route address, also not the same as a packet destination address)

lelf · on June 10, 2019

0.0.0.0 ≠ localhost

inlined · on June 10, 2019

Indeed. Traffic served from 0.0.0.0 can be seen from other machines. Traffic on 127.0.0.1 or localhost cannot. Important to know when you’re doing local development vs a local demo.

sp332 · on June 10, 2019

IP addresses with 0 as the first octet are invalid and hopefully will not be routed. I prefer them for hosts files over localhost because localhost has to wait and time out, but 0.0.0.0 will fail right away.

inlined · on June 12, 2019

Not sure why Ive been downvoted. I’ve used this all the time for demoing projects. It’s like asking for port 0. You don’t get the actual 0.0.0.0 IP address, but you use a physical network instead of the loopback virtual network device. Heck; I’ve had to change this in web hosting projects as a security vs feature trade off.

sherincall · on June 10, 2019

I get that this is just someone's side project, I'm glad it exists and they're free to write it in their favorite language/environment and all; but the effort to actually run this is equivalent to actually copying the hosts files manually, and I already have all the dependencies installed. I could never get my non-techy parents to run this properly.

If the goal of the project is actual adoption, a native executable without external dependencies would have been a much better option.

anned20 · on June 10, 2019

This is a todo, It's already on PyPI and I'm working on getting it packaged for all the main distros of Linux/Windows and MacOS.

sherincall · on June 10, 2019

I saw the todo item, but didn't realize it also included providing an executable for Windows and macOS. Thanks!

barbecue_sauce · on June 10, 2019

Anybody have a sense of the performance overhead of using hosts files versus a detached hardware solution like a pihole?

NikolaNovak · on June 10, 2019

My understanding is that difference is in scope, not performance.

Hosts files will only affect the host (workstation/desktop/laptop etc) they're installed on.

Things like piHole try to make it easy to apply the solution to all members of your network - which even in household cases these days can number in dozens, making it impractical to manage hosts files for all of them (This includes items like phones which are typically unfeasible to mess with hosts file).

ImprovedSilence · on June 10, 2019

Pie hole also has a nice browser interface to debug blocked requests that are breaking the site you don’t want To be broken. Which happens inevitably when you pull together 10 different sources of blocked lists.... or just one persons whose ideal blacklist doesn’t match yours.

ycombonator · on June 10, 2019

It would be nice to see a performance hit based on the number of hosts entries.

dredmorbius · on June 10, 2019

About 3ms for 68k entries: https://news.ycombinator.com/item?id=20148457

dredmorbius · on June 10, 2019

Not much.

62,448 line (63,370 actual '0.0.0.0' entries) /etc/hosts file, 100x resolving 'www.google.com', Debian GNU/Linux, Thinkpad with spinning rust.

The short version has 32 lines, with 14 active entries, mostly defaults and local systems.

Short hosts:

    $ for i in {1..100}; do time host www.google.com; done 2>&1| grep real |  sed 's/^real[       ]*//; s/0m//; s/s$//' | mean
    n: 100, sum: 2.209, min: 0.015, max: 0.052, mean: 0.022090, median: 0.02, sd: 0.007450
    %-ile:  5: 0.016, 10: 0.016, 15: 0.016, 20: 0.016, 
    25: 0.0165, 30: 0.02, 35: 0.02, 40: 0.02, 45: 0.02, 
    55: 0.02, 60: 0.02, 65: 0.02, 70: 0.021, 75: 0.022, 
    80: 0.0245, 85: 0.029, 90: 0.033, 95: 0.0385

Big hosts:

    $ for i in {1..100}; do time host www.google.com; done 2>&1| grep real |  sed 's/^real[       ]*//; s/0m//; s/s$//' | mean
    n: 100, sum: 2.517, min: 0.016, max: 0.063, mean: 0.025170, median: 0.023, sd: 0.009818
    %-ile:  5: 0.016, 10: 0.016, 15: 0.016, 20: 0.016, 
    25: 0.017, 30: 0.0185, 35: 0.02, 40: 0.021, 45: 0.022, 
    55: 0.024, 60: 0.0255, 65: 0.0265, 70: 0.028, 75: 0.029, 
    80: 0.03, 85: 0.0325, 90: 0.0395, 95: 0.042

The delta of means is .003080s -- call it 3ms slower for the large hosts file.

("mean" is an awk script for computing univariate moments.)

As others have mentioned, the main benefit of a centralised LAN service is that all devices on the LAN are protected. The hosts file on this system (a laptop) is effective regardless of where I am. It also pre-dates my configuring OpenWRT's adblock package about a month ago, though I'd had a hand-rolled DNSMasq configuration earlier. The laptop hosts file is almost certainly a few years out of date -- another occupational hazard of such things.

The OpenWRT solution runs on the Knot Resolver (kresd) caching nameserver. I've not noted any lag for it. The blocklist there is currently 231,627 hosts/domains (roughly doubled: specific + wildcard matches), from 0-29.com to zzzpooeaz-france.com.

memco · on June 10, 2019

I used one of the popular hosts files on my local machine for a while: the networking didn’t seem to suffer, but the boot time for my machine slowed noticeably. And manual updates were painful because loading the file in an editor is slow so if you use your hosts file for other reasons it can inhibit your workflow. I would recommend automated process on some dedicated device so you don’t impact your normal usage.

Another experience I had was that certain sites failed to work correctly. I didn’t do extensive testing but when I disabled the hosts nocking the sites worked, when I enabled it they broke. These were companies with whom I was trying to do account related business: so it wasn’t just that something didn’t render correctly it actively prevented me from updating my accounts when I tried to submit requests.

I still like the approach and will continue to use it, but it hasn’t been frictionless.

unethical_ban · on June 10, 2019

When the network goes down, it can take several minutes for it to come back up. I was having DNS and connectivity issues at a LAN party. Wouldn't get connectivity for minutes after a link bounce.

Then I removed the hosts file, and it worked instantly.

Maybe for a static workstation it wouldn't be bad, but for a laptop or something that loses link frequently, it could be an issue.

LyndsySimon · on June 10, 2019

I don't have any evidence (I've not attempted to benchmark it or anything), but my gut says that the stack is checking the hosts file first anyhow, so it shouldn't be much. It might actually be an improvement over a separate appliance.

gregw2 · on June 11, 2019

I have cron jobs on my mac that update my hosts files (to block "addictive" sites in my case (not ads)). It doesn't really work.

Browsers cache and use outside DNS servers despite the hosts files. Chrome and sometimes Safari don't really honor the hosts files 100% of the time. Every once in a while I google around to try and restore my control, try to tweak my browser settings but I have yet to find anything that makes using hosts files bulletproof.

mywittyname · on June 11, 2019

I think firewall rules would be your next line of defense. I'm not sure how configurable most home routers are though.

rafaelvasco · on June 10, 2019

Reading the code one clearly sees why Python is so well suited for these kinds of applications, one-shot script executables: Really nice string ops, regex, file io etc. One of my favorite languages. The other is C# for everything else, that Python is not that suitable for: Huge complex codebases, type safeness, more strict performance requirements etc. Specially the static typing. The dynamism and lack of type annotations of Python really bothered me when I was developing a somewhat complex desktop app in it some years ago. I guess I'm a static typing guy with optional dynamism kinda person.

misterdoubt · on June 11, 2019

If you haven't checked back lately, type annotations in Python are getting better and better. Built-in support via the typing module and a strong community package in mypy.

bigend · on June 11, 2019

If you let someone else manage the hosts your computer resolves, you are trusting that someone as much as your ISP. A man in the middle.

ris · on June 10, 2019

> You ran WHAT script on your machine?!

dataflow · on June 11, 2019

Hosts files slow down the system as well as the browser itself. Get/create a browser extension to actually block the request (at least while your browser supports this) so you get immediate results.

firefoxd · on June 10, 2019

I wish the hosts file could have an include directive. Since I regularly add or remove entries, the file becomes a mess.

zactato · on June 10, 2019

Serious question. We all realize that the economics of the internet is largely fueled by ads, so why are we so keen to block them? It’s ad revenue that have allowed technology to flourish so strongly over the last two decades.

dsswh · on June 10, 2019

Not long ago, the economy was largely fueled by slavery. Yet we got rid of that.

layoutIfNeeded · on June 10, 2019

Citation needed

briandear · on June 10, 2019

Not exactly Godwin’s law, but pretty close. Comparing advertising and marketing to the ownership of human beings? Slavery infringed on the inalienable of human beings, the existence of advertising doesn’t take away my freedom or potentially subject me to beatings.

It’s a ridiculous comparison. I am not a friend to intrusive ad-tech, but making a moral equivalence to slavery is to trivialize slavery. It’s like comparing parking tickets to the death penalty.

dsswh · on June 11, 2019

It's a valid comparison. Long ago it was ok to kill your enemy. Not long ago it was ok to have slaves. Today either is a sure way to end up in prison. Standards are rising. IT is a very new thing and the society and the laws are behind a bit. Adtech uses this to extract profit while it can. But this will end. Soon it will be a crime to store personal data: names, location, anything like that. GDPR is just the beginning. Adtech will fight, but it will lose. This business will disappear entirely, just like slave labor. In far future it will be a crime to be intrusive: any unwanted ads; and mining personal data will be seen like cannibalism today, i.e. even criminals will consider such people as freaks. Right now we are in the era of wild west in IT.

jancsika · on June 10, 2019

> Comparing advertising and marketing to the ownership of human beings?

That's not exactly what happened here.

OP repeated the beginning of a truism: "ads fund the development of the web while at the same time causing a whole host of severe problems for its users (individually and as a whole)."

OP left a hole where the italized part of the truism should be.

OP asked HN to fill that hole.

On a side devoted to tech/software, it's either low effort or bad-faith to ask others to fill a hole in such a well-known truism.

In light of this I offer up a countervailing law, "Loki's Law:"

"If you leave a Hitler-sized hole in your argument, expect it to be filled accordingly."

aagd · on June 10, 2019

I'd say it's not so much about the ads, more about the way they're delivered with all those shady tricks for tracking and fingerprinting. IMHO ad blocking is more about privacy than anything else...

ycombonator · on June 10, 2019

Ads have gotten to a point where they hamper user experience. My problem with most of the ads is the amount of js tracking junk and 3rd party A/B calls that grinds the browser to a halt.

cm2012 · on June 10, 2019

Do you mind FB ads which are pretty smooth experiences?

ycombonator · on June 10, 2019

I rarely use FB. However their Ads are lightweight and pleasant to interact with but I don’t think they have an ad serving platform for publishers on the web like Google and others do. If they do I am not aware of it.

cm2012 · on June 10, 2019

They don't

jormungand · on June 10, 2019

You are partially right. There are ads that are truly useful and those which are malicious, such as popunders which often include scammy pops. I use adblock to prevent the latter. There are popunder ad networks which are trying to fight against adblock by introducing "solutions" like anti-adblock, see here: https://propellerads.com/blog/anti-adblock-3-monetize-99-per...

I side with the adblock solutions in this war.

pinguinFromY · on June 10, 2019

I would not mind contextual ads that don't get in the way of viewing the actual content a website or mobile app is offering. But no, we get huge banners that cover the whole background, pop-ups, every click opens a new tab redirecting to an ad and "hot chicks in your area". If I'm seeing an intersting ad I'll click on it by myself, don't need your help really. Conclusion: it's not the ads themselves but how they get in your way. See reddit ads, google ads. They are part of the actual content.

geggam · on June 10, 2019

Internet was a better place when the economics of the internet was driven by porn.

</2 cents>

mtgx · on June 10, 2019

Whether or not ads on the internet are good is not that clear cut. But I would say 99% of tracking is both evil and mostly useless, probably with the 1% being first-party analytics to track some very simple stuff like page views.

I'm glad Firefox is now blocking third-party trackers by default (not that I needed it for myself, but it's important for others to have this).

joyjoyjoy · on June 10, 2019

I use host flash: http://host-flash.com/

Does anyone know an up-to date list for blocking social networks?

DyslexicAtheist · on June 10, 2019

Steve Black's hosts file ... just specify "-e social" or "--extension social" option, or use a "myhosts" file to name your own domains for a subset (e.g. all of facebook or whatever)

https://github.com/StevenBlack/hosts

_bz2r · on June 12, 2019

how does the list compare to the pi hole hostfile?

jakeogh · on June 10, 2019

Nice to see projects using click!

Here's another one to toss on the pile (works, I use it, supports wildcards, *nix only): https://github.com/jakeogh/dnsgate

hlau · on June 10, 2019

I'll be the first to admit that the existing advertising ecosystem is broken, primarily due to misaligned incentives across the board. But, given a choice, would you rather have a clearly labeled thing that you know is an ad transparently trying to influence you or a sneaky human billboard, err "influencer" coming up to you with an agenda along with tons of product placement in whatever you watch/read/listen to?

harry8 · on June 10, 2019

There's no either/or decision to be made here. You get compromised, paid for content with our without ads as well. Critical thinking I'd a requirement always.

hlau · on June 11, 2019

There definitely is an either/or because blocking of one channel will naturally necessitate money/barter flowing to the other channel. One is at least transparent and regulated, the murky world of influence peddling isn't since it's hard for anyone to tell in the moment whether something is "organic" or not.

harry8 · on June 12, 2019

Not when the other channel is already at capacity. And it is. Blocking ads has no effect on that. You never agreed to being tracked either, so blocking that is the right and proper thing to do. Blocking surveillance capitalism might push businesses toward honesty, it's at least with a shot.

hlau · on June 12, 2019

If you think influencer marketing and product placement are already at capacity you have no idea how much worse it's about to get if ad blocking gets much worse. And the irony is that, by design, you won't know a good chunk of the time and other times it'll just merely be implied without being explicitly stated. Continued use of social networks, including this one, collects way more identifiable data than what the non-Google/FB/Amazon ad market collects. Ad blockers have had near 0 impact on FB's operations. Google and others have paid to ensure that their search ads still make it through most ad blockers.

Blocking ads does not drive businesses to be more "honest". They'll just spend more on PR and influencers. And given how hostile this community is to ads and perhaps even marketing overall, (how YC ever backed a marketing or ad startup is beyond me), companies already realize that getting a fawning TC article purchased thru connections and favors and PR chicanery is going to be more effective than ad campaign even though the ad campaign is more honest, upfront and transparent with its agenda.