
Show HN: Browser extension to read HN comments for any url, in ClojureScript - jdormit
https://github.com/jdormit/looped-in
======
jdormit
Well this is awkward. Seems that shortly after I posted this here both Mozilla
and Google rejected the extension from their stores. Mozilla's primary concern
was that the analytics were opt-out, not opt-in, so I'm patching that in and
resubmitting.

As is typical, Google just said that it "did not comply with their policies"
with no additional information, but I'm guessing their concerns are along the
same lines.

Meanwhile, if anyone is interested in trying it out, the README has
instructions on running the extension locally in Firefox or Chrome.

~~~
staunch
It's not awkward to violate people's privacy, it's wrong. Maybe you didn't
know that before, but you still don't seem to get it.

~~~
bluejekyll
Person posts a cool tool they wrote, that they find useful. Shows it to the
community, this community, where many may really appreciate it. Then gets
ripped for not thinking through the complex privacy issues with said tool. Not
everything is malicious.

It would be better to give constructive criticism, like:

This is a really cool tool! I might even choose to use it, but I have a few
concerns: 1) it sends too much data unfiltered to the algolia search servers.
Could you instead make it a button that only then triggers the request or
opens another browser window? 2) the analytics are also a concern. In general,
you should always make these opt-in; better, ask the user through a dialogue
to enable them; even better, request feedback through some other system
(github for example).

Let’s try and be nice here and give people the benefit of the doubt.

~~~
staunch
I agree with the sentiment but not this example.

Sharing people's browsing data with third-parties can threaten their lives in
some circumstances. There's no excuse for a developer creating browser
extensions not to take this threat seriously in 2018. It's negligent.

~~~
bluejekyll
I did read your comment (and I didn't downvote you, b/c I agreed with you). I
wrote this response because your comment read as very aggressive in response
to an honest attempt at explaining an issue the dev was having. What your
comment lacked was any additional information about better approaches.

I get the concern and share it, but your comment only said that what they did
was wrong. It didn’t offer any ideas of how to fix it. Telling someone they’re
on the wrong road but not telling them how to get to the correct one leaves
them just as lost.

~~~
staunch
What the developer didn't seem to understand in this case was the gravity of
the issue. I don't think your explanation would have conveyed it as well.

~~~
jklinger410
Take care of yourself.

~~~
staunch
Thank you!

And let's all try not to get the users of our software imprisoned or killed:
[https://boingboing.net/2018/01/28/30000-accused.html](https://boingboing.net/2018/01/28/30000-accused.html)

------
nkurz
As other comments here point out, there are significant privacy issues with
sending every visited URL to a "trusted" server to check for comments. There
are also load issues for the server that is getting all the unwanted requests.
So if you're doing this, you'd probably want at least the initial lookup to be
local.

So, can this be done with an initial download of some small number of
megabytes, then incremental updates of a few kilobytes as often as desired? I
think so.

Guessing at numbers (see other comments here), there are probably fewer than
1,000,000 URLs that have comments associated with them on HN. For each, you
store a hash of the canonicalized URL and a comment count. Collisions aren't
deadly, so you could probably get down to 8B per entry (7B for the hash, 1B
for the count). Updates can be a list of all new and revised URL-hash-counts
since a given date.

Look up the hash of the URL of each visited page with a binary search on the
7B prefix in the ~8MB ordered list of data. If found, report the number of
comments in something clickable that loads a sidebar. The only data that
leaves the machine is based on the active click to load the comments. Maybe
store a "false positive list" so that the rare collisions are only visible to
the user once. Maybe use a bit for "visited" so you can distinguish pages with
new comments?

The numbers seem surprisingly manageable.
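
Something like the following sketch, assuming the extension ships that ~8MB
table as a sorted ArrayBuffer of 8-byte records (a 7-byte hash prefix of the
canonicalized URL followed by a 1-byte comment count). The record layout and
names are made up for illustration, not anything from Looped In:

    (defn read-prefix
      "Reads the 7-byte hash prefix of record `i` as a pair of plain
      numbers: the high 32 bits and the low 24 bits."
      [view i]
      [(.getUint32 view (* i 8))
       (unsigned-bit-shift-right (.getUint32 view (+ (* i 8) 4)) 8)])

    (defn lookup-count
      "Binary-searches the packed table for `prefix` (a [hi lo] pair built
      the same way from the current page's canonicalized URL) and returns
      the 1-byte comment count, or nil if the URL has no known submission."
      [buffer prefix]
      (let [view (js/DataView. buffer)
            n    (quot (.-byteLength buffer) 8)]
        (loop [lo 0, hi (dec n)]
          (when (<= lo hi)
            (let [mid (quot (+ lo hi) 2)
                  c   (compare (read-prefix view mid) prefix)]
              (cond
                (zero? c) (.getUint8 view (+ (* mid 8) 7))
                (neg? c)  (recur (inc mid) hi)
                :else     (recur lo (dec mid))))))))

Keeping the whole table in one sorted typed array means a single allocation
and cache-friendly lookups even at a million entries.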

~~~
jdormit
Excellent idea, and a nice summary of the discussion of this from further down
the thread.

I've opened an issue to implement this here:
[https://github.com/jdormit/looped-in/issues/4](https://github.com/jdormit/looped-in/issues/4)

As time allows, I will implement this later this week.

------
chmod775
Will this instantly look up any URL I visit, or does it only look up HN posts
for the current URL when I click some icon?

I'd rather not have a browser extension that sends everything I visit to some
US servers - but I do like the idea behind this.

~~~
jdormit
The current implementation will look up any URL you visit so that it can
populate the "number of comments" text that appears over the button. The data
only goes to one server, hn.algolia.com, an HN search API provided by the
company Algolia.
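
For the curious, that lookup is roughly the following call against the Algolia
HN Search API. The exact query parameters Looped In uses may differ;
restricting the search to the url attribute is an assumption here:

    (defn fetch-hn-submissions
      "Resolves to the HN stories whose submitted URL matches `url`."
      [url]
      (-> (js/fetch (str "https://hn.algolia.com/api/v1/search?"
                         "query=" (js/encodeURIComponent url)
                         "&restrictSearchableAttributes=url"))
          (.then #(.json %))
          (.then #(.-hits %))))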

~~~
jacquesm
At least change it so that it sends a hash of the URL to a server that knows
the hashes of all the HN-submitted posts. That way, if there is no match,
you'd have to have hashes of _all_ legal URLs to leak anything.
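
A minimal sketch of that variant on the client side; the hash-lookup endpoint
here is hypothetical (the Algolia API takes URLs, not hashes), so this would
need server-side support:

    (defn sha-256-hex
      "Hashes a string with WebCrypto and resolves to its hex digest."
      [s]
      (-> (js/crypto.subtle.digest "SHA-256" (.encode (js/TextEncoder.) s))
          (.then (fn [buf]
                   (->> (array-seq (js/Uint8Array. buf))
                        (map #(.padStart (.toString % 16) 2 "0"))
                        (apply str))))))

    (defn lookup-by-url-hash
      "Sends only the URL's hash to a (hypothetical) lookup endpoint."
      [url]
      (-> (sha-256-hex url)
          (.then #(js/fetch (str "https://example.com/hn/by-url-hash/" %)))
          (.then #(.json %))))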

~~~
tzs
Doesn't that still leak a fair amount?

Suppose I'm an evil overlord and someone in my organization has been tipping
off the FBI about my upcoming operations. I'm pretty sure it is one of my
work-at-home minions.

If I can obtain the hashes of the URLs that my minions have visited, I can
look for a minion that has in their history the hash of
[https://tips.fbi.gov/](https://tips.fbi.gov/) and now I've got a good
suspect.

~~~
jacquesm
How will you tie the relevant minion to the hash?

You'd have to have a lot more than just that hash; the log would at least
have to include a static IP, or something you can isolate by window of
opportunity. For instance, all the other minions were at a ballgame, and the
timestamp indicates that the one minion who wasn't at the ballgame visited
tips.fbi.gov right then, from an IP not associated with a stadium hotspot.

Regardless, if you're going to leak stuff on the evil overlord's organization,
you'd better make sure you don't do it from anything that can be associated
with you: not your laptop, not your IP, not your browser, and certainly not
with all kinds of weird plugins installed. Use a burner device and a location
and time chosen so it could be anybody in 'the organization' leaking.

------
Gys
This comment refers to a similar extension that was asked by HN to stop:
[https://news.ycombinator.com/item?id=15938700](https://news.ycombinator.com/item?id=15938700)

And there is also
[https://github.com/powerpak/hn-sidebar](https://github.com/powerpak/hn-sidebar),
which was in the Chrome Web Store but isn't anymore...

~~~
jdormit
:fingerscrossed: this one doesn't end up in the same bucket.

Any idea why the earlier extension was shut down by YC?

------
tedchs
Why send each URL to a server to be checked, instead of doing a periodic
download of the (very small) list of HN links and comment counts, to be
checked offline like an ad blocker? It's probably 5kb compressed.

~~~
diggan
What makes you think it's a very small list? HN has been around since 2007,
with a lot of submissions. Any guesses on the size of that? I think it's
bigger than you'd expect.
~~~
tbirrell
Well... including comments, it looks like we are pushing 16.3 million posts.
The id in the URL is sequential. If you are saving the URL, HN id, and comment
count, that's probably no more than a couple of megs, if even that.

~~~
Ajedi32
~391 MB if we store SHA-1 hashes of the URLs (160 bits each) and HN ids and
assume 16.3 million posts[1]. (Probably less, since, as you said, some posts
are just comments.) If we're okay submitting one out of every hundred URLs as
a SHA-1 hashed value to an external server, we can reduce that further to ~18
MB with a bloom filter[2].

[1]:
[https://www.google.com/search?q=(160+bits+%2B+32+bits)+*+16....](https://www.google.com/search?q=\(160+bits+%2B+32+bits\)+*+16.3+million&oq=\(160+bits+%2B+32+bits\)+*+16.3+million)

[2]:
[https://hur.st/bloomfilter?n=16300000&p=0.01](https://hur.st/bloomfilter?n=16300000&p=0.01)
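
Those bloom filter numbers check out against the standard sizing formula
m = -n * ln(p) / (ln 2)^2, e.g.:

    (let [n    16.3e6                      ; URLs stored
          p    0.01                        ; accepted false-positive rate
          bits (/ (* n (- (Math/log p)))
                  (Math/pow (Math/log 2) 2))]
      (/ bits 8 1024 1024))
    ;; => ~18.6 MiB, i.e. the ~18 MB figure above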

~~~
tzs
I don't think we need either a strong hash or to submit hashed URLs to a
server. We can use an ordinary hash, and the only URLs we need to submit to a
server are story requests to HN, if we do it like this:

Include in the extension a hash table constructed as follows:

    
    
      foreach ID of an HN story submission
        URL = the URL of the submitted story
        URL = normalize(URL)
        insert_into_hash_table(URL, ID)
    

insert_into_hash_table(key, val) is a function that inserts val into a hash
table with key key. The hashing function does not need to be cryptographically
secure.

normalize(URL) is a function that takes a URL and normalizes it. What
normalize means in this context is a little fuzzy, but the basic idea is that
if URL_1 and URL_2 are different URLs to the same article, normalize(URL_1) ==
normalize(URL_2).

NOTE: what you include with the extension is the hash table itself.
Conceptually it is probably just a sparse array containing HN IDs, with maybe
a little more depending on how collisions are handled.

In the extension, do this:

    
    
      URL = normalize(URL_of_current_page)
      ID_list = lookup_URL_in_hash(URL)
      foreach ID in ID_list
        story = get_HN_story(ID)
        if (normalize(URL_of_story(story)) == URL)
          show_story_comments(story)
    

The hash is only used for data retrieval from a local hash table, so does not
need to be cryptographically secure.

After the hash lookup we have a list of candidate stories on HN that might
match the browser story. It's a list because, due to collisions, there might
be more than one HN story with a matching hash.

Note that all that is ever fetched from the server during operation of this
are HN stories, so there is minimal information leakage.
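
A rough ClojureScript rendering of the lookup side (a sketch, not Looped In's
code): `normalize` here is a stand-in for the fuzzier normalization described
above, `table` is assumed to be the locally bundled map from the cljs.core
`hash` of the normalized URL to a vector of story ids (built offline with the
same hash function), and only the official HN item endpoint is a real API:

    (defn normalize
      "Placeholder normalization: lower-cased host plus the path without a
      trailing slash. Real normalization rules are fuzzier, as noted above."
      [url]
      (let [u (js/URL. url)]
        (str (.toLowerCase (.-host u))
             (.replace (.-pathname u) #"/+$" ""))))

    (defn get-hn-story
      "Fetches one HN item by id from the official Firebase API."
      [id]
      (-> (js/fetch (str "https://hacker-news.firebaseio.com/v0/item/" id ".json"))
          (.then #(.json %))))

    (defn matching-stories
      "Resolves to the HN stories whose own URL really matches the page,
      weeding out hash collisions by re-checking the normalized URLs."
      [table page-url]
      (let [url (normalize page-url)
            ids (get table (hash url) [])]
        (-> (js/Promise.all (into-array (map get-hn-story ids)))
            (.then (fn [stories]
                     (filterv #(and (.-url %) (= (normalize (.-url %)) url))
                              stories))))))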

~~~
Ajedi32
Don't hash tables typically store the key itself in the table though? Wouldn't
that take up _more_ space per entry than a 160-bit hash? Using a sufficiently
collision-resistant hash function allows you to eliminate the need to store
URLs entirely, which in theory should reduce the size of the hash table
significantly.

~~~
tzs
Whether or not you need to store keys depends on how you handle collisions.

For instance, if we had a hash table whose keys were names, and whose values
were telephone numbers, we'd probably have to store keys with the phone
numbers so that in the case of a collision we could figure out which phone
number matches the search key.

If, on the other hand, we had a hash table that stored record numbers of
employee records from our employee database, keyed by employee name, then we
probably would not need to store keys with the hash table values. If there is
a collision, we can just retrieve all of the colliding records from the
database. Those records will contain the employee name, and we can use that to
figure out which is the right one.

For the HN comment extension we are closer to the second case. The HN story
contains the URL, so in the case of a collision we can fetch all the colliding
HN stories and see which one is the right one.

------
ComodoHacker
Are there any other websites/apps that are _known_ to respect the DNT flag?

~~~
jdormit
I believe that medium.com respects DNT. I'm sure there are others, but I
haven't looked into it too much.

------
bfred_it
Does this have to be an extension with `<all_urls>` permission? Can it be a
bookmarklet that will just open the full-fledged HN in a new tab?

~~~
mistakevin
Here's a bookmarklet I use to launch a quick search for a page I'm looking at.

    
    
        javascript:(function()%7Bwindow.open('https%3A%2F%2Fhn.algolia.com%2F%3Fquery%3D' %2B (window.location.hostname %2B window.location.pathname %2B window.location.hash).split('%2F').join(' ').split('%23').join(' ')%2C '_blank')%7D)()

------
yread
There is also Kiwi

[https://chrome.google.com/webstore/detail/kiwi-conversations/pkifhlefpamigmobjmjjjnjglpebflhp?hl=en](https://chrome.google.com/webstore/detail/kiwi-conversations/pkifhlefpamigmobjmjjjnjglpebflhp?hl=en)

Also "researches" Reddit, Product Hunt and Google News. And only on demand.

------
latte
Thank you for posting this!

What does your development workflow for CLJS browser extensions look like? Do
you have to rebuild and reload the extension on your development machine each
time you change the code? Is it possible to use Figwheel when writing browser
extensions?

~~~
jdormit
I'm still working out the kinks in the workflow. I believe it is possible to
use Figwheel while writing browser extensions [0], but I haven't set that up.
I have `lein cljsbuild auto` running in the background to automatically
recompile the JS when I change the source code, and use Mozilla's web-ext
utility [1] to automatically reload the web extension once the JS has
compiled.
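
Concretely, the build side of that looks roughly like this; the project
version, paths, and output file below are assumptions, not the extension's
actual project.clj:

    (defproject looped-in "0.1.0"
      :plugins [[lein-cljsbuild "1.1.7"]]
      :cljsbuild {:builds [{:id           "dev"
                            :source-paths ["src"]
                            ;; extension CSP disallows eval, so avoid
                            ;; :optimizations :none (see below)
                            :compiler     {:output-to     "addon/js/main.js"
                                           :optimizations :whitespace
                                           :pretty-print  true}}]})

With a setup like that, `lein cljsbuild auto dev` recompiles on save in one
terminal while `web-ext run --source-dir addon` keeps the extension loaded and
reloading in another.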

This system has some disadvantages, though. The main one is the glacial
feedback loop - due to strict CSP restrictions in web extensions, I had
trouble compiling the CLJS with {:optimizations :none}, so each save-compile-
reload cycle takes ~30 seconds from when I save the source file to when I see
the results of the change in the browser. I also did not figure out how to set
up a REPL environment connected to the code running in the web extension.

Lots of room for improvement here, basically.

[0]: [https://github.com/binaryage/chromex-sample#chromex-sample-project-has-following-configuration](https://github.com/binaryage/chromex-sample#chromex-sample-project-has-following-configuration)

[1]: [https://developer.mozilla.org/en-US/Add-ons/WebExtensions/Getting_started_with_web-ext](https://developer.mozilla.org/en-US/Add-ons/WebExtensions/Getting_started_with_web-ext)

------
ungzd
Do you have a REPL for both background and content scripts?

I tried to use ClojureScript for Chrome extensions a few years ago but failed
to configure a fast reload cycle (either REPL or Figwheel).

~~~
jdormit
No. In fact, the development cycle on this was pretty awful - due to browser
extensions' strict CSP, I couldn't even compile the CLJS with {:optimizations
:none}, so every code change took ~20 seconds to recompile.

Do you know any good resources on REPL-driven ClojureScript development?
Preferably with a REPL that lives in the page environment so that browser
variables etc. can be accessed.

~~~
ungzd
Chromex-sample
([https://github.com/binaryage/chromex-sample](https://github.com/binaryage/chromex-sample))
mentions :optimizations :none and Figwheel support (but with the REPL
disabled), but I haven't tried it yet.

~~~
jdormit
Interesting, looks like they got :optimizations :none to work for background
scripts but not content scripts (which makes sense). Figwheel support sounds
great, I'll look into adding that to Looped In.

------
tosh
Reminds me of hoodwink.d by why the lucky stiff. Does anyone remember it?

------
dustingetz
sick nice job Jeremy

