
Show HN: I built a tool to remove news articles from HN - polote
http://hn.luap.info/
======
polote
Hey guys,

As a long-time reader of HN, I'm sometimes frustrated that top HN content
often comes from popular websites or talks too much about the same thing (Zoom
security issues, Facebook leaks, ...). I wanted to understand whether it was
possible to get only the 'original content' posts of HN (from the
not-as-popular blogs and sites), because to me this is the most interesting
part of HN.

So I started analyzing HN new posts, and made a few discoveries:

- Some HN users post a lot of content, and post several links in a row, which
will push away your post if they publish just after you

- You get a 30 min time window at busy hours (and a 1h time window at non-busy
hours) between the time a new link is posted and the time it disappears from
the 1st page of new links

- There is a second-chance pool for good stuff if a moderator detects it

As a result HN is flooded with not really useful content and it is not
always easy for original content to be noticed (even if I think HN is doing a
very good job compared to any other link aggregator that you can find).

So I tried to build a tool to filter out things like news websites, words that
I want to blacklist, and users whose posts haven't been relevant to me.
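
A minimal sketch of what such a three-part filter could look like. This is not the actual code behind the site: the function name `flag_reason` and all three lists are hypothetical placeholders, and matching on whole words in the title is an assumption.

```python
import re
from urllib.parse import urlparse

# Placeholder lists for illustration only; the real lists are on the about page.
BLOCKED_DOMAINS = {"bignews.example", "dailypaper.example"}
BLOCKED_WORDS = {"zoom", "facebook"}
BLOCKED_USERS = {"some_frequent_poster"}


def flag_reason(title, url, author):
    """Return a short reason if the link should be hidden, else None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Match the domain itself and any of its subdomains.
    for domain in sorted(BLOCKED_DOMAINS):
        if host == domain or host.endswith("." + domain):
            return "domain: " + domain
    # Whole-word match on the title, case-insensitive.
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    hits = sorted(words & BLOCKED_WORDS)
    if hits:
        return "word: " + hits[0]
    if author in BLOCKED_USERS:
        return "user: " + author
    return None
```

Returning the reason rather than a plain boolean makes it easy to show why a link was hidden, which is what a "links flagged" page needs.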

That way I'm able to remove almost 80% of content and I can go through the
list of all the links of the day before going to bed

For those who are interested, this is just a cron job querying the HN API
every 3 minutes inserting the new links into a db, and a web server rendering
the last 500 links.
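
Roughly, that pipeline could be sketched like this. It is an illustration, not the real code: it assumes the public HN Firebase API (`newstories` and `item/<id>` endpoints), uses SQLite as the db, and the table layout and function names are made up. A cron entry would call `store_new_links` every 3 minutes.

```python
import json
import sqlite3
from urllib.request import urlopen

API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path):
    """Fetch one JSON document from the public HN API."""
    with urlopen(f"{API}/{path}.json", timeout=10) as resp:
        return json.load(resp)


def init_db(conn):
    # Hypothetical schema: one row per submitted story.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "id INTEGER PRIMARY KEY, title TEXT, url TEXT, author TEXT, time INTEGER)"
    )


def store_new_links(conn, fetch=fetch_json):
    """One polling pass: insert stories not seen before, return how many."""
    added = 0
    for item_id in fetch("newstories"):
        seen = conn.execute(
            "SELECT 1 FROM links WHERE id = ?", (item_id,)
        ).fetchone()
        if seen:
            continue
        item = fetch(f"item/{item_id}")
        if not item or item.get("type") != "story":
            continue
        conn.execute(
            "INSERT INTO links VALUES (?, ?, ?, ?, ?)",
            (item["id"], item.get("title"), item.get("url"),
             item.get("by"), item.get("time")),
        )
        added += 1
    conn.commit()
    return added


def last_links(conn, limit=500):
    """Rows for the web page: the most recent links, newest first."""
    return conn.execute(
        "SELECT id, title, url, author FROM links "
        "ORDER BY time DESC, id DESC LIMIT ?", (limit,)
    ).fetchall()
```

The `fetch` parameter is only there so the polling pass can be exercised without network access.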

You can see more on how the filters work here:
[http://hn.luap.info/about](http://hn.luap.info/about) and you can also
understand which links have been filtered here:
[http://hn.luap.info/links_flagged](http://hn.luap.info/links_flagged)

~~~
optimuspaul
Clearly we have different interests in what we like from HN. Many of the sites
you filter are sites that I like; I prefer their content because they offer
commentary that provides context and perspective that I don't always get from
what might be considered 'original content' by this. Interesting experiment
though. It might provide more value to someone like me if, rather than a black
or white list, it was more of a content assessment model using NLP.

~~~
polote
Hello, clearly the goal is not to replace HN. I also love reading people's
comments on global topics, but I also love reading original content.

~~~
amelius
Perhaps what we really need is a personalized feed.

~~~
KhoomeiK
I'd honestly be surprised if something like this hasn't been built yet by the
HN community. A service where you can input your likes and interests (and
maybe a blacklist) to get a tailored feed of HN posts including/excluding the
right content. Maybe you could even feed in a few links to examples of content
you like and it would cater to you based on that.

~~~
oceanbreeze83
maagnit.com also similar to lobsters tags

------
tacon
I guess Hacker News loses its memory after a few years, but six years ago this
item[0] hit the front page about using machine learning to train a filter for
Hacker News. It is still running to this day[1]. I asked him if he could share
his training set or code, but nothing happened. I think the training set may
be showing its age, as there used to be more green items on the home page. Or
maybe the quality of Hacker News has just gone down. (Perish the thought!)

"Enough Machine Learning to Make Hacker News Readable Again"

[0]
[https://news.ycombinator.com/item?id=7712297](https://news.ycombinator.com/item?id=7712297)

[1] [http://hn.njl.us/](http://hn.njl.us/)

------
kylek
I used to have a greasemonkey script that would filter articles or domains
they come from based off of keywords.

Found it! (this was made by hn user furgooswft13)
[https://gist.github.com/m00g00/e539ec22bf588edca0e6dfe1a05eb...](https://gist.github.com/m00g00/e539ec22bf588edca0e6dfe1a05eb60c)

I think I used it for "737" and "Boeing" for a while.

------
alexozer
Thanks for this! I was quickly able to find some interesting new and niche
types of things to read, in contrast to what my normal strolls through
news.ycombinator and hckrnews give me.

------
pvg
It's an interesting experiment and reflection of your personal interests but
looking at

[http://hn.luap.info/links_flagged](http://hn.luap.info/links_flagged)

I can't tell the difference between most removed things and the things left
alone - either in terms of quality or thematically.

Having a page that tries to summarize the workings of the filter on the inputs
is pretty great though - more people who propose alternative rankings/filters
should think of ways to do that.

~~~
kick
There's a ban on *.org? Yikes.

~~~
pvg
The full list is at

[http://hn.luap.info/about](http://hn.luap.info/about)

It's definitely idiosyncratic, as one would expect. A more interesting
question is 'does it produce interesting results'. To my eyes and tastes, not
really. The filter easily misses piles of the sort of 'news' it is trying to
avoid and the quality of the rest of what passes doesn't appear to be any
better (to put it mildly) than the HN front page.

Mercilessly culling even slightly frequent submitters (this includes people
who, say, mis-posted something and then quickly made another post to correct
the problem) is a pretty fun idea though, I wonder what you'd end up with if
you applied this iteratively over a long period of time.

~~~
polote
Hello, this is very good feedback, I don't try to compete with HN front page

My goal is more to 'compete with' /newest, in the sense that I don't think it
is easily possible to get the best content published directly out of an
algorithm. If you have some ideas I would be interested to test them.

Maybe, as you said, it doesn't produce interesting results, but I have the
motivation to go through the full list every night and I always find some
interesting content, whereas I never had the motivation to go through several
pages of /newest.

~~~
pvg
_My goal is more to 'compete with' /newest_

Oh! That makes an awful lot of sense, thanks. I wonder if you'd have got less
confused feedback if you'd described it like that initially, I think a lot of
the commentators (including me) somewhat misunderstood what you're trying to
do.

------
chadlavi
> New 13“ MacBook Pro 51 min ago

> www.apple.com/macbook-pro-13/hn linktga

I feel like that counts as news/not original content/a popular site

------
Tomte
> flagged because : .org/

That is peculiar.

~~~
polote
Yes it is. At first I didn't filter it, but statistically domains with the org
TLD are much more likely to be the website of an organization, even if I agree
it is not always the case.

The complexity of the task is that if you don't want to miss ANY quality
content you will end up filtering almost nothing. I took the risk of missing
some good content if that reduces the number of links to go through overall.
But this is not an optimum.

~~~
LinuxBender
We all have different views on this, of course. For example, in my proxies, I
filter .biz and .info, as spammers and malware authors were able to acquire
thousands of those domains super cheap. I probably miss out on a decent site
here or there, but it's a small price for me to pay.

------
dang
> Some HN users post a lot of content, and post several links in a row, which
> will push away your post if they publish just after you

> You get a 30 min time window at busy hours (and 1h time window at non busy
> hours) between the time a new link is posted, and the link disappear from
> the 1st page of new links

Should /newest list the last-N-hours of new stories, instead of the 30 newest
stories?

~~~
DanBC
I think /lists/ should have something like OP's idea, especially if you want
people to be reading new to find good articles that deserve a second chance.

A list of the last N hours of stories would be good, but a bit overwhelming.

~~~
eslaught
Or perhaps, instead of having the 30 absolute newest articles, you could have
30 articles, randomly selected from those submitted in the last hour. That way
your chance at success isn't biased so incredibly heavily on how you do in the
first 5 minutes. (And also, might make it harder to get brigades set up, since
the list is randomly generated.)
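
As a rough illustration of that idea (not anything HN actually does), picking 30 stories at random from the last hour might look like the sketch below. The function name, the `time` field, and the defaults are assumptions.

```python
import random


def sample_newest(stories, now, window_seconds=3600, k=30, rng=random):
    """Pick up to k stories at random from those submitted in the window.

    `stories` is a list of dicts with a Unix `time` field (an assumed shape).
    """
    recent = [s for s in stories if 0 <= now - s["time"] <= window_seconds]
    if len(recent) <= k:
        # Fewer candidates than slots: show them all.
        return recent
    return rng.sample(recent, k)
```

Every story from the last hour gets the same chance of being shown on each page load, regardless of whether it was posted 2 or 55 minutes ago.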

~~~
dang
Interesting idea! Will think about that.

~~~
Tomte
It might have the effect that people submit more at once, because the random
selection will probably only return one or two anyway. So you don't need to
space them out manually anymore.

------
kgwxd
When I opened your "about" page, the first thing I see is "If one of these
terms is present in text : [nothing]" and "If it comes from these domains :
[Nothing]" It's because I built an FF add-on for myself that hides elements
based on a regular expression matching on text and/or element attribute
values. Since you have several text and domains I already had in my personal
filter list, I couldn't see those blocks of text :)

------
Seb-C
Interesting tool. I am using hnrss.org to get posts in my aggregator and have
had nothing to complain about so far. That is probably because I filter the posts by
points, so I only get the most popular ones. Sure, I may be missing a few
posts I would have liked, but this way I am avoiding the distracting habit of
checking the homepage too often.

[https://hnrss.org/newest?points=100](https://hnrss.org/newest?points=100)

------
downerending
Back in the day of USENET, readers typically had kill files, and it was quite
easy for each user to arrange for items they didn't care about to be elided
(based on author, keyword, etc.). Not unlike this.

I'd kill for a general form of that that worked uniformly on sites like HN,
reddit, etc., and perhaps random forums, comment sections, and so on. The new
interfaces are nice in many ways, but that was a true killer feature, and it's
pretty much lost.

~~~
tannhaeuser
Strangely enough, I recently noticed there's still life in Usenet's comp.lang
groups for things like formal announcements and language spec casuistry with
content not found elsewhere (after the decline of mailing lists and
degeneration of StackOverflow). Seriously considering posting there once
again.

------
tleb_
I once thought about filtering HN based on the content served by the link:
page size, JavaScript amount, image count, special items (Facebook share
button, GA), etc.

The goal would be to focus on small and light websites which is what I like
the most. I doubt it would work effectively though.
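
A small sketch of how such a page-weight check might be done with only the standard library. The thresholds and the names `PageStats` and `looks_lightweight` are arbitrary assumptions; it only counts `<script>` and `<img>` tags in HTML that has already been downloaded, and does not detect share buttons or GA.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count script and image tags while parsing a page."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        elif tag == "img":
            self.images += 1


def looks_lightweight(html, max_bytes=100_000, max_scripts=3, max_images=10):
    """Return True if the page is small and light on scripts and images."""
    stats = PageStats()
    stats.feed(html)
    size = len(html.encode("utf-8"))
    return (size <= max_bytes
            and stats.scripts <= max_scripts
            and stats.images <= max_images)
```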

~~~
adamsea
I can't resist: "you can't judge a book by its cover" ;)

------
alden_penny
I highly recommend hckrnews.com to bypass the HN frontpage altogether. I
usually sort by top 20% for a quick digest or simply sort by new. Sorting by
new makes it pretty easy to see all the submissions on a single day

------
pizzicato
I consume my news through RSS subscriptions to specific websites, many of
which frequently show up on the front page. I can see this becoming a very
useful supplement to HN to find original content, for me at least.

Thanks for making this!

~~~
guybedo
If you like RSS and want to find more content related to your feeds, maybe you
can give Aktu a try ([https://aktu.io](https://aktu.io)).

It's an online RSS reader that I built, and one of the features might be of
interest to you: It automatically aggregates news articles to items in your
RSS feeds. That means that for most articles in your feeds, you have the
original item from the website you subscribed to, but you also have a list of
articles from different sources talking about the same story.

A nice side effect is that it can help avoid filter bubbles by giving more
context to the stories you read.

------
jgwil2
It would be nice to still show the number of points and comments per article
in your interface, so that one could quickly scan and see what's generated the
most interest.

------
RocketSyntax
Wasn't this shared a few days ago?
[https://eaj.no/a-guide-to-big-o-notation](https://eaj.no/a-guide-to-big-o-notation)

------
terrycody
Somewhat useful if someone doesn't want to see news. However, this reminds
me of Lobsters, so a similar service already exists. You can give it a check.

------
jppope
thank you so much for getting rid of all the garbage that is chronically
clogging HN. great work!

------
somishere
where there's no link you should maybe consider linking to the article here on
hn?

------
DoreenMichele
_A user is flagged if:

He posted more than 2 links in the last 1 hour

He posted more than 5 links in the last 5 days

He has posted more than 5 links in the last 30 days and among the posts he
posted 30% were flagged_
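
Read literally, the three quoted rules could be sketched as below. This is a guess at the logic, not the site's code: the function name and the `(timestamp, was_flagged)` input shape are made up, and "30% were flagged" is interpreted as "at least 30%".

```python
HOUR = 3600
DAY = 24 * HOUR


def user_is_flagged(posts, now):
    """posts: list of (unix_timestamp, was_flagged) tuples for one user."""

    def within(seconds):
        return [p for p in posts if 0 <= now - p[0] <= seconds]

    # Rule 1: more than 2 links in the last hour.
    if len(within(HOUR)) > 2:
        return True
    # Rule 2: more than 5 links in the last 5 days.
    if len(within(5 * DAY)) > 5:
        return True
    # Rule 3: more than 5 links in the last 30 days,
    # with at least 30% of them flagged (assumed reading).
    month = within(30 * DAY)
    if len(month) > 5:
        flagged_share = sum(1 for _, flagged in month if flagged) / len(month)
        return flagged_share >= 0.3
    return False
```

Under this reading, someone who posts three links in one sitting is flagged by the first rule alone.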

Some thoughts:

First, not everyone here is male. I'm a woman and a demographic outlier in
other ways. If you want stuff that's "different," in theory, you are looking
for people like me and your criteria would probably flag me plus your implicit
assumption that everyone here is male de facto reinforces the very thing you
say you want to combat: Homogeneity.

I don't post links daily anymore. I did at one time when I was homeless and
trying to find 2-4 good stories to post daily was my cheap hobby because it
amused me to try to make it to the leader board while I was a homeless woman
and it was a hobby within my budget. I made it to the leader board under my
old handle about a month after I got back into housing and then I changed
handles cuz reasons.

When I do post links, I tend to post a few links within about an hour because
I'm checking the news as part of my daily routine and if I see anything
interesting, the odds are good that's when I will see it. And I do that in
part because I am a demographic outlier and I have a pretty terrible track
record of trying to predict ahead of time what will fly on HN and what won't.

So I try to look for a certain level of quality and that's about it. I really,
really suck at trying to predict what HN wants to read.

I also post a lot of my own stuff, which ironically gets me flak at times.
Some people complain that the only thing I post is my own writing, which isn't
actually true. So that kind of feedback makes me feel like I "should" be
posting a certain amount of stuff not by me in order to be acceptable to the
community. Though, in practice, as my life gets busier, I simply fail to post
as many articles to HN because I simply don't have the time to do that.

But some people are interested in some of the things I write and some of what
I write does well and makes it to the front page. Among other things, I still
write about homelessness and some people here are actually interested in my
perspective on that topic. So I do continue to post my stuff here and let it
sink or swim based on votes because I suck at predicting what will do well.

I'm not asking or even suggesting you change your process in some way. I'm
just telling you what I see from my perspective and I'm doing that because I'm
an indie writer who takes Patreon and tips to support my work. Most of my
sites have no ads on them and I handle things the way I do so I can give a
fresh perspective on topics.

I post my own blog writing because other people almost never post my stuff.
That's extremely rare and my stuff would never see the light of day if I
didn't post it myself.

So to my ear it sounds kind of like you are looking for people like me and
your formula for flagging stuff probably already has me flagged. Which you may
be perfectly happy with. You may know who I am and you may be reading this
going "Good! You are one of the people I'm tired of hearing from!"

You do with that feedback whatever the heck you want. I don't need a reply or
an explanation or a justification. I don't care.

Have a good evening.

