Show HN: I built a tool to remove news articles from HN (luap.info)
131 points by polote 57 days ago | 64 comments



Hey guys,

Being a long-time reader of HN, I'm sometimes frustrated that top HN content often comes from popular websites or talks too much about the same things (Zoom security issues, Facebook leaks, ...). I wanted to understand whether it was possible to get only the 'original content' posts of HN (from the not-as-popular blogs and sites), because to me this is the most interesting part of HN.

So I started analyzing new HN posts, and made a few discoveries:

- Some HN users post a lot of content, and post several links in a row, which will push away your post if they publish just after you

- You get a 30 min time window at busy hours (and a 1h time window at non-busy hours) between the time a new link is posted and the time it disappears from the 1st page of new links

- There is a second-chance pool for good stuff if a moderator detects it

As a result, HN is flooded with not really useful content, and it is not always easy for original content to get noticed (even if I think HN is doing a very good job compared to any other link aggregator you can find).

So I tried to build a tool to filter out things like news websites, words that I want to blacklist, and users whose posts haven't been relevant to me.

That way I'm able to remove almost 80% of the content, and I can go through the list of all the links of the day before going to bed.

For those who are interested, this is just a cron job querying the HN API every 3 minutes and inserting the new links into a db, plus a web server rendering the last 500 links.
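A rough sketch of what such a polling job could look like, for readers who want to try the same thing. This is not the author's code: the SQLite table, its column names, and the function names are assumptions made for illustration. Only the two public HN API endpoints (`newstories` and `item/<id>`) are real.

```python
import json
import sqlite3
from urllib.request import urlopen

API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path):
    """Fetch one resource from the public HN API, e.g. 'newstories'."""
    with urlopen(f"{API}/{path}.json", timeout=10) as resp:
        return json.load(resp)


def init_db(conn):
    # Hypothetical schema; the real tool's schema is not described.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "id INTEGER PRIMARY KEY, title TEXT, url TEXT, "
        "author TEXT, posted_at INTEGER)"
    )


def store_new_links(conn, fetch=fetch_json):
    """Insert stories that are not stored yet; return how many were added."""
    added = 0
    for item_id in fetch("newstories"):
        known = conn.execute(
            "SELECT 1 FROM links WHERE id = ?", (item_id,)
        ).fetchone()
        if known:
            continue
        item = fetch(f"item/{item_id}")
        if not item or item.get("type") != "story":
            continue  # deleted items come back as null
        conn.execute(
            "INSERT INTO links VALUES (?, ?, ?, ?, ?)",
            (item["id"], item.get("title"), item.get("url"),
             item.get("by"), item.get("time")),
        )
        added += 1
    conn.commit()
    return added


def latest_links(conn, limit=500):
    """Return the most recent links, newest first, for rendering."""
    return conn.execute(
        "SELECT id, title, url, author, posted_at FROM links "
        "ORDER BY posted_at DESC, id DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

Calling `store_new_links` from a script scheduled every 3 minutes would reproduce the polling described above; `fetch` is a parameter only so the function can be exercised without network access.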

You can see more on how the filters work here: http://hn.luap.info/about and you can also see which links have been filtered here: http://hn.luap.info/links_flagged
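To make the three kinds of filter concrete (domains, blacklisted words, users), here is a minimal sketch. The list contents are placeholders and the dict shape of a link (`title`, `url`, `by`) is an assumption; the actual lists are the ones on the about page.

```python
import re
from urllib.parse import urlparse

# Placeholder lists for illustration only.
BLOCKED_DOMAINS = {"bignews.example"}
BLOCKED_WORDS = {"zoom", "facebook"}
BLOCKED_USERS = {"frequent_poster"}


def domain_of(url):
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = (urlparse(url or "").hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def flag_reason(link):
    """Return a short reason if the link is filtered, else None."""
    host = domain_of(link.get("url"))
    for blocked in sorted(BLOCKED_DOMAINS):
        # Match the domain itself and any of its subdomains.
        if host == blocked or host.endswith("." + blocked):
            return "domain: " + blocked
    words = set(re.findall(r"[a-z0-9]+", (link.get("title") or "").lower()))
    hits = sorted(words & BLOCKED_WORDS)
    if hits:
        return "word: " + hits[0]
    if link.get("by") in BLOCKED_USERS:
        return "user: " + link["by"]
    return None
```

Returning the reason rather than a bare boolean makes it easy to build a page listing why each link was removed.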


Clearly we have different interests in what we like from HN. Many of the sites you filter are sites whose content I prefer, because they offer commentary that provides context and perspective that I don't always get from what might be considered 'original content' here. Interesting experiment though; it might provide more value to someone like me if, rather than a blacklist or whitelist, it were more of a content assessment model using NLP.


Hello, clearly the goal is not to replace HN. I also love reading people's comments on global topics, but I also love reading original content.


Perhaps what we really need is a personalized feed.


For some of us, the whole reason we're here is to get away from personalized filter bubbles.


I'd honestly be surprised if something like this hasn't been built yet by the HN community. A service where you can input your likes and interests (and maybe a blacklist) to get a tailored feed of HN posts including/excluding the right content. Maybe you could even feed in a few links to examples of content you like and it would cater to you based on that.


maagnit.com also similar to lobsters tags


lobste.rs tags?


I've been constantly using your site, thank you


I wonder what % of readers move on to Page 2 of HN? Or Page 3?


I agree with your mindset, but links from GitHub and gist may still be original. Maybe we can use the username of the poster and the linked resource to distinguish that.


this offers a fresh perspective. thank you! once in a while there is a blogger that offers a different view or perspective on a topic that has been beaten to death here. them being noticed at all is simply magical since they have to compete with people's friends and colleagues around here...

i absolutely love that you have more than 30 articles per page.


30 minutes during busy hours seems optimistic. I've seen it get as low as ten, but very few people check past the first five links on /new anyway.


thank you, it had to be done sooner or later... good job!


I guess Hacker News loses its memory after a few years, but six years ago this item[0] hit the front page about using machine learning to train a filter for Hacker News. It is still running to this day[1]. I asked him if he could share his training set or code, but nothing happened. I think the training set may be showing its age, as there used to be more green items on the home page. Or maybe the quality of Hacker News has just gone down. (Perish the thought!)

"Enough Machine Learning to Make Hacker News Readable Again"

[0] https://news.ycombinator.com/item?id=7712297

[1] http://hn.njl.us/


I used to have a greasemonkey script that would filter articles or domains they come from based off of keywords.

Found it! (this was made by hn user furgooswft13) https://gist.github.com/m00g00/e539ec22bf588edca0e6dfe1a05eb...

I think I used it for "737" and "Boeing" for a while.


Thanks for this! I was quickly able to find some interesting new and niche types of things to read, in contrast to what my normal strolls through news.ycombinator and hckrnews give me.


It's an interesting experiment and reflection of your personal interests but looking at

http://hn.luap.info/links_flagged

I can't tell the difference between most removed things and the things left alone - either in terms of quality or thematically.

Having a page that tries to summarize the workings of the filter on the inputs is pretty great though - more people who propose alternative rankings/filters should think of ways to do that.


There's a ban on *.org? Yikes.


The full list is at

http://hn.luap.info/about

It's definitely idiosyncratic, as one would expect. A more interesting question is 'does it produce interesting results'. To my eyes and tastes, not really. The filter easily misses piles of the sort of 'news' it is trying to avoid and the quality of the rest of what passes doesn't appear to be any better (to put it mildly) than the HN front page.

Mercilessly culling even slightly frequent submitters (this includes people who, say, mis-posted something and then quickly made another post to correct the problem) is a pretty fun idea though, I wonder what you'd end up with if you applied this iteratively over a long period of time.


Hello, this is very good feedback. I'm not trying to compete with the HN front page.

My goal is more to 'compete with' /newest, in the sense that I don't think it is easily possible to get the best content published directly out of an algorithm. If you have some ideas, I would be interested in testing them.

Maybe, as you said, it doesn't produce interesting results, but I have the motivation to go through the full list every night and I always find some interesting content, whereas I never had the motivation to go through several pages of /newest.


> My goal is more to 'compete with' /newest

Oh! That makes an awful lot of sense, thanks. I wonder if you'd have got less confused feedback if you'd described it like that initially, I think a lot of the commentators (including me) somewhat misunderstood what you're trying to do.


I'd be interested in a filter that removed domains that have been posted more than X times.


HN's lack of subdomain- & tilde-checking would make this work strange, I think.


Banning (2005-2019) is a strange decision as well. Wow!


> New 13“ MacBook Pro 51 min ago

> www.apple.com/macbook-pro-13/ hn link tga

I feel like that counts as news/not original content/a popular site


> flagged because : .org/

That is peculiar.


Yes it is. At first I didn't filter it, but statistically, domains with the .org TLD are much more likely to be the website of an organization, even if I agree it is not always the case.

The complexity of the task is that if you don't want to miss ANY quality content, you will end up filtering almost nothing. I took the risk of missing some good content if that reduces the number of links to go through overall. But this is not optimal.


We all have different views on this, of course. For example, in my proxies, I filter .biz and .info, as spammers and malware authors were able to acquire thousands of those domains super cheap. I probably miss out on a decent site here or there, but it's a small price for me to pay.


A lot of frameworks, libraries, programming languages etc. are .org websites and thus you're probably causing more damage by filtering .org


> Some HN users post a lot of content, and post several links in a row, which will push away your post if they publish just after you

> You get a 30 min time window at busy hours (and a 1h time window at non-busy hours) between the time a new link is posted and the time it disappears from the 1st page of new links

Should /newest list the last-N-hours of new stories, instead of the 30 newest stories?


I think one thing that could be done is to offer 3-4 possible /newest lists, for example by having a select at the top of the /newest page. Possible options of the select could be:

- all new links (default)

- links from infrequent domains (fewer than 2 links from this domain in the last 2 days)

- links from infrequent posters (fewer than 2 posts in the last 2 days)

- 1 link max per user
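The non-default options in that list are simple to express over a list of recent stories. A minimal sketch, assuming each story is a dict with `by`, `domain` and `time` (Unix seconds) keys; those names and the function names are made up here, and nothing below reflects how HN itself is implemented.

```python
from collections import Counter

TWO_DAYS = 2 * 24 * 3600


def infrequent(stories, key, now, limit=2, window=TWO_DAYS):
    """Keep stories whose `key` value ('domain' or 'by') appears
    fewer than `limit` times within the last `window` seconds."""
    recent = [s for s in stories if 0 <= now - s["time"] <= window]
    counts = Counter(s[key] for s in recent)
    return [s for s in recent if counts[s[key]] < limit]


def one_per_user(stories):
    """Keep only the newest story of each user, newest first."""
    seen = set()
    kept = []
    for story in sorted(stories, key=lambda s: s["time"], reverse=True):
        if story["by"] not in seen:
            seen.add(story["by"])
            kept.append(story)
    return kept
```

`infrequent(stories, "domain", now)` gives the second option, `infrequent(stories, "by", now)` the third, and `one_per_user(stories)` the fourth.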


I wish /newest had a “hide all stories on this page from /newest” button, so that it always showed me new content and didn’t make me try to remember if I’d seen this page 4 before as page 2 or something. Then the heuristics wouldn’t be necessary since it would just show me anything I hadn’t hidden.


https://hckrnews.com/ has HN stories in reverse-chronological order, and shows where your last visit's newest story was (if you keep the cookie, of course). Sounds like exactly what you're looking for.


Not quite, but it’s certainly one of the heuristics I would expect to see someone implement. It isn’t a good fit for my use case though.


I don't think so, because very few people would scroll down further than before. And for those it only saves one or two clicks on "More".


I think /lists/ should have something like OP's idea, especially if you want people to be reading new to find good articles that deserve a second chance.

A list of the last N hours of stories would be good, but a bit overwhelming.


Or perhaps, instead of having the 30 absolute newest articles, you could have 30 articles, randomly selected from those submitted in the last hour. That way your chance at success isn't biased so incredibly heavily on how you do in the first 5 minutes. (And also, might make it harder to get brigades set up, since the list is randomly generated.)
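As a sketch of that idea: pick 30 stories uniformly at random from those submitted in the last hour, falling back to the whole set when there are fewer than 30. The story shape (a dict with a `time` field in Unix seconds) and the function name are assumptions for illustration.

```python
import random


def random_newest(stories, now, size=30, window=3600, rng=random):
    """Return up to `size` stories sampled uniformly at random
    from those submitted within the last `window` seconds."""
    recent = [s for s in stories if 0 <= now - s["time"] <= window]
    if len(recent) <= size:
        return recent
    return rng.sample(recent, size)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which helps when comparing this against a plain newest-first list.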


Interesting idea! Will think about that.


It might have the effect that people submit more at once, because the random selection will probably only return one or two anyway. So you don't need to space them out manually anymore.


I don't think enough users look at /lists for that to have much influence. I should check the numbers though.


There should be options in the user panel to customize that when logged in.


When I opened your "about" page, the first thing I saw was "If one of these terms is present in text : [nothing]" and "If it comes from these domains : [Nothing]". It's because I built an FF add-on for myself that hides elements based on a regular expression matching on text and/or element attribute values. Since you have several terms and domains I already had in my personal filter list, I couldn't see those blocks of text :)


Interesting tool. I am using hnrss.org to get posts in my aggregator and have had no complaints so far. That is probably because I filter the posts by points, so I only get the most popular ones. Sure, I may be missing a few posts I would have liked, but this way I am avoiding the distracting habit of checking the homepage too often.

https://hnrss.org/newest?points=100


Back in the day of USENET, readers typically had kill files, and it was quite easy for each user to arrange for items they didn't care about to be elided (based on author, keyword, etc.). Not unlike this.

I'd kill for a general form of that that worked uniformly on sites like HN, reddit, etc., and perhaps random forums, comment sections, and so on. The new interfaces are nice in many ways, but that was a true killer feature, and it's pretty much lost.


Strangely enough, I recently noticed there's still life in Usenet's comp.lang groups for things like formal announcements and language spec casuistry, with content not found elsewhere (after the decline of mailing lists and degeneration of StackOverflow). Seriously considering posting there once again.


I once thought about filtering HN based on the content served by the link: page size, JavaScript amount, image count, special items (Facebook share button, GA), etc.

The goal would be to focus on small and light websites which is what I like the most. I doubt it would work effectively though.


I can't resist: "you can't judge a book by its cover" ;)


I highly recommend hckrnews.com to bypass the HN frontpage altogether. I usually sort by top 20% for a quick digest or simply sort by new. Sorting by new makes it pretty easy to see all the submissions on a single day.


I consume my news through RSS subscriptions to specific websites, many of which frequently show up on the front page. I can see this becoming a very useful supplement to HN to find original content, for me at least.

Thanks for making this!


If you like RSS and want to find more content related to your feeds, maybe you can give Aktu a try (https://aktu.io).

It's an online RSS reader that I built, and one of the features might be of interest to you: it automatically attaches news articles to items in your RSS feeds. That means that for most articles in your feeds, you have the original item from the website you subscribed to, but you also have a list of articles from different sources talking about the same story.

A nice side effect is that it can help avoid filter bubbles by giving more context to the stories you read.


It would be nice to still show the number of points and comments per article in your interface, so that one could quickly scan and see what's generated the most interest.


Wasn't this shared a few days ago? https://eaj.no/a-guide-to-big-o-notation


Somewhat useful if someone doesn't want to see news. However, this reminds me of Lobsters, so a similar service already exists. You can give it a check.


thank you so much for getting rid of all the garbage that is chronically clogging HN. great work!


where there's no link you should maybe consider linking to the article here on hn?


A user is flagged if:

He posted more than 2 links in the last 1 hour

He posted more than 5 links in the last 5 days

He has posted more than 5 links in the last 30 days and among the posts he posted 30% were flagged
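Read literally, those three rules could be sketched as below. This is an interpretation, not the site's code: the input shape (a list of `(timestamp, was_flagged)` pairs for one user) is an assumption, and "30% were flagged" is taken to mean "at least 30%".

```python
HOUR = 3600
DAY = 24 * HOUR


def user_is_flagged(posts, now):
    """Apply the three quoted rules to one user's posts.

    `posts` is a list of (unix_timestamp, was_flagged) pairs.
    """
    def within(seconds):
        return [p for p in posts if 0 <= now - p[0] <= seconds]

    if len(within(HOUR)) > 2:       # more than 2 links in the last hour
        return True
    if len(within(5 * DAY)) > 5:    # more than 5 links in the last 5 days
        return True
    month = within(30 * DAY)
    if len(month) > 5:              # more than 5 links in the last 30 days...
        flagged = sum(1 for _, was_flagged in month if was_flagged)
        return flagged / len(month) >= 0.30  # ...and at least 30% flagged
    return False
```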

Some thoughts:

First, not everyone here is male. I'm a woman and a demographic outlier in other ways. If you want stuff that's "different," in theory, you are looking for people like me and your criteria would probably flag me plus your implicit assumption that everyone here is male de facto reinforces the very thing you say you want to combat: Homogeneity.

I don't post links daily anymore. I did at one time when I was homeless and trying to find 2-4 good stories to post daily was my cheap hobby because it amused me to try to make it to the leader board while I was a homeless woman and it was a hobby within my budget. I made it to the leader board under my old handle about a month after I got back into housing and then I changed handles cuz reasons.

When I do post links, I tend to post a few links within about an hour because I'm checking the news as part of my daily routine and if I see anything interesting, the odds are good that's when I will see it. And I do that in part because I am a demographic outlier and I have a pretty terrible track record of trying to predict ahead of time what will fly on HN and what won't.

So I try to look for a certain level of quality and that's about it. I really, really suck at trying to predict what HN wants to read.

I also post a lot of my own stuff, which ironically gets me flak at times. Some people complain that the only thing I post is my own writing, which isn't actually true. So that kind of feedback makes me feel like I "should" be posting a certain amount of stuff not by me in order to be acceptable to the community. Though, in practice, as my life gets busier, I simply fail to post as many articles to HN because I simply don't have the time to do that.

But some people are interested in some of the things I write and some of what I write does well and makes it to the front page. Among other things, I still write about homelessness and some people here are actually interested in my perspective on that topic. So I do continue to post my stuff here and let it sink or swim based on votes because I suck at predicting what will do well.

I'm not asking or even suggesting you change your process in some way. I'm just telling you what I see from my perspective and I'm doing that because I'm an indie writer who takes Patreon and tips to support my work. Most of my sites have no ads on them and I handle things the way I do so I can give a fresh perspective on topics.

I post my own blog writing because other people almost never post my stuff. That's extremely rare and my stuff would never see the light of day if I didn't post it myself.

So to my ear it sounds kind of like you are looking for people like me and your formula for flagging stuff probably already has me flagged. Which you may be perfectly happy with. You may know who I am and you may be reading this going "Good! You are one of the people I'm tired of hearing from!"

You do with that feedback whatever the heck you want. I don't need a reply or an explanation or a justification. I don't care.

Have a good evening.


[flagged]


The usage of "he" was most probably unconscious and due to being a non native english speaker, rather than a deliberate act.

kroltan 56 days ago [flagged]

Oh no the text contained a conventional form of expression, let's purposefully misunderstand it to make some vague point without suggesting any actionable improvement!


Okay the comment was a bit sarcastic but I think the improvement is pretty obvious: use gender-neutral language!


Sure, and I agree with the suggestion. It's just not very effective to be hostile and vague toward people when you're trying to get them to change their habits.

(and my previous comment intentionally committed the same mistake)

Either give objective change requests or try to explain your point. ("your" as in an arbitrary reader, not you jgwil2)

While we are typing English, many places still do not mind using a default gender in their native tongue, and a non-native speaker can end up writing like that when writing in English. Heck, even English (the language) natives haven't fully settled in using "they" or which of the other variations the grand-uncle to this post mentioned.

I have no clue about the original poster's nationality, but I consider it rude to presume people are using this kind of language maliciously, especially on the Internet where you can't know what is acceptable in someone's culture.


You are right. Hostility and sarcasm are not the way to convince people. I think in the most generous reading, the commenter is trying to cushion the criticism by taking a teasing tone, but may have missed the mark a bit.


Usage of "he" when the gender of a singular subject is not known or relevant has a long established history, particularly on the Web.

Alternatives are "they", awkward constructs like "he or she", or explicitly naming the subject every time it is mentioned "the user".


And another is assuming “she”



