Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Hackernews-Button – Bloom Filter-based, private HN browsing extension (github.com/jstrieb)
74 points by jstrieb 46 days ago | hide | past | favorite | 15 comments

Hello! This Firefox extension shows approximately how many points (if any) the current page scored on Hacker News using Bloom filters. Clicking the extension navigates to the best discussion for the current page. It is great for reading interesting comments about things you may not have known were submitted to HN in the past. I have been using it for the past two months while building it, and honestly it has been really fun to read insightful comments I otherwise never would have seen. It's also handy for figuring out whether to submit a page, since it shows whether it's been submitted before or is new territory.

Rather than determining if the current site has been submitted by querying the Firebase/Algolia APIs with every page you visit, the extension contains regularly-updating Bloom filters for all submitted HN stories to preserve user privacy. There is a single C library underlying both the code to generate Bloom filters, and the actual extension code (it is compiled to WebAssembly so the C functions can be called from JavaScript). I have included an architecture overview in the README to help a prospective reader get through the important parts of the code, and have tried to justify design decisions via extensive comments within the source.

I am happy to answer any questions! I'm also always looking for constructive criticism.

> Rather than determining if the current site has been submitted by querying the Firebase/Algolia APIs with every page you visit, the extension contains regularly-updating Bloom filters for all submitted HN stories to preserve user privacy.


I built a pi-hole esque stub dns-resolver that uses Bloom Filters generated from hostfiles (60 MiB, 5M entries --> 2 MiB with 1% false positives) and it worked like a charm. At some point, I also looked into Xor Filters which are apparently even lighter and faster but couldn't find a JavaScript implementation for it [0].

I; however, stopped using Bloom Filters because its immutability meant building it over and over again which was a pain. Inverted Bloom Filters [1] or Spectral Bloom Filters [2] might have been useful since they can be updated in-place. Instead, I went for storing hostnames in a Finite State Automata [3], which while not as compact as Bloom Filters, could be updated in-place, are deterministic, and search faster. Likely not a fit for your use-case however.

PinSketches, otoh, might be a fit for accomplishing efficient set reconciliation [4] of the filters you're distributing.

[0] https://github.com/FastFilter/xorfilter#implementations-of-x...

[1] https://www.youtube.com/watch?v=eIs9nJ-JFvA

[2] https://pncnmnp.github.io/blogs/spectral-bloom-filters.html

[3] http://stevehanov.ca/blog/?id=115

[4] https://github.com/sipa/minisketch

Thanks for the links! I am not sure how well Xor Filters would fit here because they're meant to be immutable, whereas the extension has a content script that adds any submissions from /new or the front page to the Bloom filter so you can navigate directly back to the discussion.

I had not heard of PinSketches before this, and will definitely look into them. As I mentioned in another comment: the way the project is set up, it should be possible to do a drop-in replacement with a better data structure, should one come along. Thanks!

I just installed it. I don't have much to add, but wanted to say: wow, this is awesome. I appreciated the How It Works section. I made a tiny open-source web app in a similar vein, though it just uses the naive approach (querying algolia on each page load).

Thanks! Extensions are powerful and dangerous, so I wanted to do everything possible to make it easy to read the code for this one and understand how it works.

I also really like using <details> tags so that interested users can dive into the documentation without everyone having to be hit with a massive README.

Just add a feature to submit a page when it isn't listed. I wouldn't be using it, but it make sense to have that too.

This is a good idea, thanks! The only feature I haven't added but eventually plan to is a right-click context menu with a bunch of additional actions. For example "search by page title" (as opposed to URL), "see all past submissions," "manually add a site to the Bloom filter," etc.

Your idea to add a "submit page" option would fit very nicely with this.

Nice, so it periodically retrieves a list of HN posts and queries that list locally, so you're not telling algolia any specific site details.

There are millions of posts on HN, how many submissions does it retrieve? Surely if you find an obscure site, it might not be in the local list.

Maybe you could just always query many urls together, with only 1 of them being the real url you want. That would make it hard to track too.

Also I made a similar extension, but it queries sources on every page load.

(extension) https://newsit.benwinding.com/

(source) https://github.com/benwinding/newsit

The extension retrieves a (few) Bloom filter(s). In case you're unfamiliar, Bloom filters are data structures approximately equivalent to condensed representations of all of the post URLs.

The code uses GitHub Actions to generate Bloom filters once every 24 hours, corresponding to the update frequency of the Hacker News BigQuery database that it pulls from. The 16MB Bloom filter for all stories ever has about 4M entries at the moment. As far as I know, this is every single story that has been previously submitted to HN.

Since Bloom filters are just long bit strings, you can OR them together to add stuff. This means you can download a smaller Bloom filter to update your current one, rather than downloading 16MB files all the time. The extension also auto-adds any stories linked on news.ycombinator.com pages you visit, so that stories too new to be in the Bloom filter are still added.

Thanks for clearing that up, yes I'm not that familiar with Bloom filters, seems like an interesting and useful concept. It could probably (pun intended) be applied to many applications to increase privacy.

I like the [1] Workflow file you've made, the comments really help with reading shell code. I'm also amazed you can query 4M entries everyday with BigQuery, I thought that might be fairly expensive to do right? Or is this below a free tier?

[1] https://github.com/jstrieb/hackernews-button/actions/runs/61...

Thank you for reading the code! The comments will be helpful for me when I have to fix something six months from now, but I'm glad to hear they're useful to others, too :)

Querying that many entries processes (according to the BigQuery web interface) less than 300MB each time. Not sure how they arrive at that figure, but whatever. BigQuery has a free tier that includes up to 1TB/month of data processing. Since the Bloom filter is compiled once each day, I'm well under that limit.


Have you compared this with using Merkle Trees? Would be curious how big the size of a merkle branch is in comparison to the bloom filter set.

I haven't investigated this (or considered Merkle Trees), but will follow up if/when I get a chance to test it.

I've tried to design the code so it shouldn't be too hard to do a drop-in replacement of Bloom filters with something else if a better data structure comes along.

Isn't this basically going to depend on the acceptable false positive rate?

great bloom filter use case, thanks a lot for building and sharing this!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact