Show HN: Hacker News Title Edit Tracker (hackernewstitles.netlify.com)
366 points by petercooper 64 days ago | 93 comments



I built this several months ago as I was interested to see how the titles of posts evolve/change on HN. It turns out lots of titles are edited every day (in both subjectively and objectively good and bad ways!) and I've found it interesting to see how titles have evolved.

The whole thing is automated and is built around a Ruby script that pulls down the titles on the front page at a frequent interval. Any titles for specific story IDs that change get tracked and rendered out to a static HTML page hosted on Netlify. It runs by itself without incident so far.
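The change-tracking step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Ruby script itself; the function name `record_title_changes` and the in-memory `history` dict are assumptions made for the example, and the real tool may store and compare titles differently.

```python
from typing import Dict, List, Tuple


def record_title_changes(
    history: Dict[int, List[str]],
    snapshot: Dict[int, str],
) -> List[Tuple[int, str, str]]:
    """Merge one front-page snapshot into `history` (modified in place).

    `history` maps a story ID to every title seen so far, oldest first.
    `snapshot` maps story ID to the title seen on this poll.
    Returns a list of (story_id, old_title, new_title) for titles that changed.
    """
    changes = []
    for story_id, title in snapshot.items():
        seen = history.setdefault(story_id, [])
        if not seen:
            # First time this story is seen: remember its title, nothing changed yet.
            seen.append(title)
        elif seen[-1] != title:
            # Title differs from the last one recorded for this ID: track the edit.
            changes.append((story_id, seen[-1], title))
            seen.append(title)
    return changes
```

Calling this once per polling interval with the current front page leaves `history` holding the full chain of titles per story, which is the data a static page could then be rendered from.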

This has been on Show HN before but I was kindly invited by dang to repost it.


Cool! Since you have the infrastructure already, it would be neat to track URL changes as well. Probably less frequent, but I bet they happen at least daily (e.g. if a submission is flagged as blogspam.)


Yes, that would be possible, though I think it should be rendered to a totally separate page to avoid confusion. I didn't realize URLs were edited here, so it was off my radar till now :)


Please tell me you plan on going meta and editing the title of this post.


Good idea. I put "test" up there. When will it show in the list?

Edit: there it is. Title reverted now.


I just posted a very similar thing implemented as a browser extension (HNtitles[0]) so that it shows the original text in the page without any effort for the user.

(I had written a first version of this some time ago[1] as a bookmarklet, but a browser extension is simpler to use.)

[0]: https://news.ycombinator.com/item?id=21621411

[1]: https://news.ycombinator.com/item?id=6573188


I think it works well as an extension for people who want to keep an eye on this sort of thing. I only wanted a casual, occasional look at the edits, but HNtitles looks ideal for the power user - good work.


Thanks. The list you are making is very good, it's a destination in itself.

As said elsewhere, the extension helps to avoid clicking on the same story twice when the title changed. (I will add URL monitoring too.)


This would be really good to apply to watching spin on various verticals.

Such as entertainment and financial articles...

Recall that during the banking crisis of 2008 there was an article stating that the EU was about 16 TRILLION in debt crisis, but that was too alarmist and they didn't want that getting out, so they edited the title, but they forgot to change the URL, which had the original title in it.


News outlets do that all the time. NYT in particular. Often they don't change the URLs, and I've always assumed that's for technical reasons: either there's a system that treats URLs as immutable, or it's copied through multiple systems that the people who change the title can't edit, or some such thing.


Perhaps you’ll appreciate this:

https://mobile.twitter.com/nyt_diff


Oh god yes. I hate seeing that an article was « updated ». Well what the heck changed since I read it yesterday?!?!?

And this is a well-regarded national newspaper in Canada.


Nicely done! It would be fantastic to see this embedded between post and comments on every piece here!


Any plans to release the code? Which scraper? Chrome headless?


Since there's no interactivity or logins or anything complex curl would probably do the job.


It's not very good, very basic and naive (but works!)

https://gist.github.com/peterc/b325cdea7561528c506f073f47741...

The actual turning the JSON into HTML takes place on Netlify. If you wish to render the JSON for your own benefit, the current version is always at http://s3.amazonaws.com/jeanfromeastenders/current.json


> Any titles for specific story IDs that change get tracked..

Can you talk more about the data infrastructure?


love simple but useful apps like this! how's the hosting with netlify?


I'm very happy with Netlify for numerous things. In this case, Netlify is just running a build script that converts the JSON I create at http://s3.amazonaws.com/jeanfromeastenders/current.json into the final HTML page.


OOI, where does the Ruby script run?


Just on a random VPS I have via a cronjob. Ideally I'd run it as a Lambda job and then I wouldn't have any infrastructure at all. I just didn't get round to it after prototyping.


That is such a great UI. I feel like the list should be longer. Can you do that without it costing more?

Not all of those edits are by mods, of course. Some are made by submitters (edit: as the site points out!). Also, some are because we switched the URL and thus to the title of the new article (example: https://news.ycombinator.com/item?id=21616157). Those look weird if you assume they're moderation edits.

It looks like the list is sorted by reverse ID, which means articles that were submitted earlier are lower down on the page. But sometimes we re-up those (https://news.ycombinator.com/item?id=11662380), so from a front-page perspective some 'newer' stories are below older ones.


> Can you do that without it costing more?

Yes, I just increase a LIMIT in an SQL query. I'll try that now. I didn't want the page to look too cumbersome.


> I feel like the list should be longer

Isn't there a small window of time to edit a submission before it gets committed permanently to HN? I imagine the list would be bigger if the grace period lasted much longer.


Yes, but moderators sometimes edit titles and do other moderatory things after that limit has expired.


Thanks - really interesting.

Take this one, for example:

nvidia Drops Support for CUDA on macOS

--> CUDA 10.2 is the last release to support macOS

--> CUDA Toolkit Release Notes

The first two titles are quite interesting to me, as a macOS user and a general follower of the tech space (and I'd note neither are sensationalised, or click-bait, from what I can see). The last...? I'm not going to click on that in a million years, as I don't work with CUDA.

I'm not totally clear what the moderators' motivations always are, but might it be true that in some cases, maybe they're prioritising strict accuracy over interest, or discoverability? And as a result, their actions are actually diminishing the value of HN as a discussion forum?


You have the answer here:

https://news.ycombinator.com/item?id=21617317

"[...] The big no-no is extracting some interesting fact from inside the article and using it as the title"


Thanks - makes sense from a ‘they’re following the rules’ perspective, but less so from an ‘enable discovery and subsequent interesting discussion’ perspective.

I wonder whether a subtitle giving an editorialised “so what” would help... but that would of course mean twice as much to moderate, and twice as much for people to complain about...


Interesting, I had forgotten about this rule. Doesn't that encourage blogspam links instead of the source with context?


Ironically, that's exactly what dang often does, including today.


Often? That doesn't seem true to me. Can you give examples?


You make a good point, but this is a thorny area. When an article's title is completely generic in a way that conceals what's interesting about it, we sometimes don't revert the title. Such cases are covered by the site guidelines as 'misleading'. Other times we do revert it: for example, when a title is generic for what seems to be a creative reason, we'll often put the generic title back out of respect for the author. Or if it's generic in a way that is playful rather than bland.

There are hybrid cases too, where we'll leave an editorialized title up because of what you call discoverability until the post is well-established on the front page, then change it at that point. Also, we're responsive to feedback, so when a user makes a good case for us having made the wrong edit, we'll change it. When users disagree with each other, though, as happened here, that gets harder.

Btw I wrote about that CUDA title here: https://news.ycombinator.com/item?id=21618991. We'd normally have left the second-last edit up, but I changed it in response to a user complaint. Had it not been midnight on Saturday I'd have just asked the readers whether the macOS detail was really the single important thing in the article or not, and decided based on that.

Your question about moderator motivation doesn't quite land with me though. We're not excluding any of those values, we're just trying to balance them in a good way. What that means is that you're going to get a different answer, often a surprisingly long answer, about specific cases.


They are prioritizing accuracy and just what's interesting to them, sadly.

Whether a thread will get changed or not seems very inconsistent.


Emphasis on 'seems'. Anything that complicated is hard to do consistently, but the inconsistency is around edge cases. If you're perceiving wild inconsistency, that's probably because the problem is more complicated than it appears.

I think most readers would be surprised by how much thought goes into these title edits and how they are based on principles, i.e. the site guidelines and lots of heuristics that derive from them. That's why we're always happy to answer questions about specific cases.

More here:

https://news.ycombinator.com/item?id=20429573

https://news.ycombinator.com/item?id=21617954


This sequence provided some comedy:

> 21:00 Trends in the San Francisco poop crisis

> 21:15 Trends in the San Francisco (dog) poop crisis

> 22:25 Trends in the San Francisco (mostly dog) poop crisis



Title was updated because of some primary research on the data set. That's kinda cool.


Some of these edits seem really questionable when laid out like this.

Pew Research: 2.2% of Americans produce 97% of political tweets

Small share of U.S. adults produce majority of tweets on national politics

Why remove the exact figures?

Former Apple chip executives found company to take on Intel, AMD

Three of Apple and Google’s former star chip designers launch NUVIA

Isn't "star designers" more subjective than "executives"?

1.2B people exposed in data leak includes personal info, LinkedIn, Facebook

Personal and social information of 1B people discovered in data leak

Why make the headline less informative? Data leaks happen regularly. Data leaks from Facebook and LinkedIn have different implications than a leak from LexisNexis or a random blog.

Cloudflare open-sources Flan Scan, a network vulnerability scanner

Flan Scan: Lightweight Network Vulnerability Scanner

Again, why remove info? The fact that CloudFlare is behind this is more interesting than yet another random tool.

Mozilla: “Dear Facebook: Stop cross site tracking by default”

Dear Facebook: Stop cross site tracking by default

Same complaint. This distinguishes a random person making a random gripe from freakin' Mozilla who has the control to make Facebook's tracking more difficult.

----

Every single one of these headlines is actually less informative or less interesting (in general, of lower quality) than the original submission. They actually served to make HN less informative. WTF?

That gripe aside, most of the edits are useful (typo fixing, adding dates, and such). These just leave me scratching my head.


> Why remove the exact figures?

That submitter broke the site guidelines by changing the article title when it was neither misleading nor clickbait—so we changed it back. Also, we've found that when a title is gratuitously numerical, it makes for worse discussion. Why? I don't know. It just does. Therefore, if anything, we take numbers out rather than add them in. For the same reason, we wrote software to abbreviate "1,000,000" to "1M", "1,000,000,000" to "1B" and so on. Numbers in titles are baity and long numbers baitier.
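As an illustration of the number abbreviation mentioned above, here is a minimal Python sketch. HN's actual software is not shown in this thread, so the function name `abbreviate_numbers`, the regular expression, and the choice to shorten only exact multiples of a million or a billion are all assumptions made for the example.

```python
import re

# Largest unit first, so 1,000,000,000 becomes "1B" rather than "1000M".
_SUFFIXES = ((10**9, "B"), (10**6, "M"))

# A comma-grouped integer such as 1,000,000 (assumed input format).
_NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")


def abbreviate_numbers(title: str) -> str:
    """Replace exact multiples of a million/billion with a short form."""

    def shorten(match: "re.Match[str]") -> str:
        value = int(match.group(0).replace(",", ""))
        for unit, suffix in _SUFFIXES:
            if value >= unit and value % unit == 0:
                return f"{value // unit}{suffix}"
        # Anything else (e.g. 1,234,567 or 10,000) is left untouched.
        return match.group(0)

    return _NUMBER.sub(shorten, title)
```

Under these assumptions, `abbreviate_numbers("1,000,000 rows")` gives `"1M rows"`, while a number like `1,234,567` is left as written.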

> Isn't "star designers" more subjective than "executives"?

That title changed because we switched to a different URL and updated the title to match the new article. See https://news.ycombinator.com/item?id=21616157.

> Why make the headline less informative? [...] Data leaks from Facebook and LinkedIn have different implications than a leak from LexisNexis or a random blog.

That submitter broke the site guideline against editorializing. It's editorializing to cherry-pick the details that you consider important and put them in the title. That amounts to the power to determine the story for everyone else, and on HN, submitters don't get such power. We prioritize authors; submitters have no special rights over a story. If a submitter wants to say what they think is important about an article, they're welcome to do that in the comment thread, on a level playing field with everyone else.

In fact there was a lot of data leaked in that leak, not just LinkedIn's and Facebook's. That's another important point. Putting famous names in a title makes it baitier and evokes lower-quality discussion, because it activates everyone's pre-cached responses about the famous names. If anything, we are inclined to take famous names out of a title, and certainly not to add them in.

> Again, why remove info? The fact that CloudFlare is behind this is more interesting

Because cloudflare.com is right next to the title: https://news.ycombinator.com/item?id=21605719. From the guidelines: If the title includes the name of the site, please take it out, because the site name will be displayed after the link.

> Same complaint. This distinguishes a random person making a random gripe from freakin' Mozilla

Same answer: mozilla.org is right next to the title: https://news.ycombinator.com/item?id=21599496. Avoiding repetition is part of HN being organized around curiosity.


> cloudflare.com is right next to the title

> mozilla.org is right next to the title

Hmm... hey, petercooper, perhaps you should consider including the submission website in the edit tracker somehow? Without that, anyone who reads the edit tracker is missing an important piece of information that actual HN readers see.


> Also, we've found that when a title is gratuitously numerical, it makes for worse discussion. Why? I don't know.

This is interesting and an example of this happened recently in a post that ended up on the front page with 50+ comments. It was titled "100k+ page views a month for $5 with a self-hosted static site".

I chose that title because it kind of sets the stage of what to expect (a small / medium tier site being hosted cheaply) but it did bring in a number of comments where some people dropped in with "but that's only 0.04 posts per second, anything could host that!" which kind of detracts from the content of the submission which had nothing to do with saying those numbers are impressive in any way.

It's definitely a tricky balance and is so context specific. I think that post without the numbers wouldn't have gotten much engagement because "How I build, deploy and host my static site" isn't that interesting at a glance and I wonder if you came to the same conclusion because the title wasn't edited other than capitalization.
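For reference, the 0.04-per-second figure quoted above follows from simple arithmetic. A small Python sketch, assuming a 30-day month (the helper name `views_per_second` is made up for the example):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month: 2,592,000 seconds


def views_per_second(monthly_views: int) -> float:
    """Average request rate implied by a monthly page-view count."""
    return monthly_views / SECONDS_PER_MONTH
```

With 100,000 views a month this works out to roughly 0.04 views per second when rounded to two decimal places.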


Yes, that one is a borderline case. Actually I included it in my GP comment as an example, and said we decided to leave it up except for downcasing it! But then I removed that bit, because the comment is so long.


Most of these edits seem to be to make the title match the title of the article instead of being editorialized. From the guidelines:

"Please don't do things to make titles stand out, like using uppercase or exclamation points, or saying how great an article is. It's implicit in submitting something that you think it's important.

Please submit the original source. If a post reports on something found on another site, submit the latter.

If you submit a link to a video or pdf, please warn us by appending [video] or [pdf] to the title.

If the title includes the name of the site, please take it out, because the site name will be displayed after the link.

If the title begins with a number or number + gratuitous adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."

Otherwise please use the original title, unless it is misleading or linkbait; don't editorialize."


Any definition of "editorializing" I've ever heard involved injection of opinion - none of the above did that (and in a couple of cases, the title reversions went from a statement of objective fact to a subjective statement of opinion, in direct contravention of the guideline).

Isn't that exactly the opposite of how it's supposed to be done? "Editorializing" is a meaningless term if it means adding neutral, factual clarifying statements (example: "small share" -> 2.2%)

Further examples:

PIA (PrivateInternetAccess) VPN bought by company known for distributing malware

PIA: Our Merger with Kape Technologies – Addressing Your Concerns

Useful -> Non-useful. Who is PIA and why should I care that they're merging? Expanding the acronym was useful in this case. Not doing so is a form of clickbait all on its own.

The difference between an expert’s brain and a novice’s

Differences between expert and novice brains in mice: study

Note that the original headline is the one used by the source. Okay, so clarifying statements and general rewording can be added to non-clickbait, non-editorialized headlines in some cases, but not others?

Seriously, I'd like to not have to make work for the mods, but I see such a lack of consistency, combined with unintuitive, seemingly against-rules changes, that I actually do not know what is expected of me when submitting an article, given that a plain reading of the guidelines doesn't match the application in the real world.

When you have a rule that's actively making things worse, and is this confusing even for the people who enforce it (I'm seeing a lot of headlines that have multiple changes, hours apart) it's not a bad idea to consider revising the rule.


> Any definition of "editorializing" I've ever heard involved injection of opinion - none of the above did that

Taking that PIA headline as an example, even if the chosen title is factual ("bought by company known for distributing malware"), the submitter is still making a judgement about what part of the story is most important. That is editorializing.


Cherry-picking details is the quintessential form of editorializing. See https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... for more.

Let's look at your further examples.

1. Re "PIA (PrivateInternetAccess) VPN bought by company known for distributing malware", that one was outrageously editorialized. This is a complicated case though, because we often allow changed titles when the article is a corporate press release. Companies tend to use bland nothing-to-see-here language that can be misleading in its own right. But "bought by company known for distributing malware" is a massive claim and we have no idea whether it's true. We're in no position to adjudicate the truth or falsehood of everything people put in titles. So in this case, both choices were bad: the editorialized title as well as the bland original. If we could have come up with a better (i.e. accurate and neutral) title, we would have used it. But it wasn't obvious how to do that, so we went with the lesser evil of the original title. (We also got an email complaining about the editorialized title, but the complaint didn't cause us to change it, it just caused us to take a closer look.)

As for the acronym, it would exceed the 80 char limit to expand it in the title, so I didn't. Also, "Private Internet Access" is so generic a phrase that I'm not sure it's all that clarifying. Anyone who would know what that phrase means in this context would already know what PIA stands for. For those who don't, it would only take a tiny bit of work to find out, and it's not a bad thing for readers to have to work a little. https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu.... One more thing, too: when an acronym is used in a title unexpanded (e.g. "CS", "AWS", "AMA"), it often contains the subtle additional information that the acronym is a widely-used one in the community, which PIA is: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu.... So in this case I'd say the acronym was a wash. By the way, we really do consider all these details when weighing how to edit a title, and since I was the moderator in this case I can tell you I did that (more, even), but this answer is already too long.

2. We've learned from long experience that science articles lead to shallow, angry arguments when their titles make excessive claims. Instead of talking about the actual finding, the threads fill up with objections to how overstated the title is, and (worse) generic ventings about title inaccuracy and the decline of journalism and western civilization. In other words, such titles are baity.

We've learned that the best way to de-bait them is simply to narrow their scope, i.e. shrink the title down to the size of the real story. In this case we applied two scope-narrowers: "in mice", which is a great de-baiter when the story is about a mouse study, and ": study", which makes it clear that the topic is just a study, not the god's-lips-to-your-ears revelation that headline writers love to imply. Basically, it's their job to sex up the title and our job to knock it down to size. Those two devices work well for preventing science-article threads from going off track in predictable ways.

---

I know all of this can appear inconsistent, but that's because there's so much more going on, so many more concerns and details than one would ever imagine in the title-editing domain. For the first few years I did this job I used to find that irritating—how can an issue as trivial as internet titles be so important? Over time I learned how much more there is to it. I wrote about this here if anyone is interested: https://news.ycombinator.com/item?id=20429573.

The one thing I feel most comfortable defending about all this is how consistent we are—not completely consistent, but for sure relatively. You may not agree with the principles and that's fine, but we're not applying them arbitrarily. A nice aspect of the job is that we don't have to apply them arbitrarily: we have a good set of principles to rely on and they cover most of what comes up. We could talk for the rest of the year about specific cases and why exactly we edited them the way we did. But it all comes down to the site guidelines and the fact that intellectual curiosity is the organizing value of this site.


it's nitpicking... i just saw "A History of APL in 50 Functions" (the original title) changed to "A History of APL in Fifty Functions".

HN's hive-mind at work. social platforms need to rotate mods imo.


I am very thankful for this policy. The HN titles are essentially always better and more honest representations of the article contents.

Keep up the good work!


One thing I dislike about title edits, even when it is for the better, is that it makes it hard to know if you already have read the article.

Popular topics often get multiple submissions and trigger related topics. When the title changes it is hard to know whether it is a new link with new/different information or whether it is the same one that I've already read.

In my opinion once something hits the front page it is too late to edit it (other than minor things such as adding year or [pdf/video]).


My browser (FF) displays the link in a slightly lighter grey if I've visited it.

To clean up my feed, I hide topics that I am "done with." That is, I gave it the amount of attention I'm willing to give it. This keeps my feed updating quicker as new items enter the stack at the bottom to replace the ones I've hidden.

I started doing this because it mimics the behavior of one of reddit's settings, which is to hide all topics I've voted on. I found the feature so convenient that I started using the hiding behavior here in the same way.


Exactly! That was my motivation too when writing a similar tool as a browser extension [0].

So for each title you can see the original text.

[0] hntitles.adgent.com


20:40 Is the San Francisco Poop Crisis Out of Control?

↓ 21:00 Trends in the San Francisco poop crisis

↓ 21:15 Trends in the San Francisco (dog) poop crisis

↓ 22:25 Trends in the San Francisco (mostly dog) poop crisis


It's missing one now though.


This is great! I have a feature request:

One thing that worries me about HN is that I've had a _link_ changed by a moderator in the past. And I don't mean removing query junk or changing between mobile/non-mobile sites -- I mean changing the link to an entirely different site. Even worse, by the time I noticed, I couldn't edit or delete my post (which I wanted to do because I disagreed with the new link). I was essentially forced to post content I didn't want to post! I think this crosses the line from curation to impersonation.


What was the link?

We change URLs every day, for lots of different reasons—for example if one article is mostly just copying from another source, or if users suggest a better URL. When we do that we nearly always post something saying what we did, and including the previous link:

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


It was over a year ago I think. I can do some digging, but not right now.


If you find it, please email hn@ycombinator.com so we can take a look. If we got something wrong, we're definitely interested in learning from that.


One such instance:

https://news.ycombinator.com/item?id=21378471

I remember commenting on a different article but the one that it points to now is a different article: https://news.ycombinator.com/item?id=21384151


I checked the logs and you're right. The URL was originally https://techcrunch.com/2019/10/28/google-reportedly-in-talks.... That article is simply pointing to the Reuters source, so a moderator changed it in accordance with the site guidelines: "Please submit the original source. If a post reports on something found on another site, submit the latter."

Normally when we do that we explain so in the thread, like this: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.... But this one happened overnight when no one who is public as a moderator was awake. That's a flaw in the current moderation system.

https://news.ycombinator.com/newsguidelines.html


As someone who frequently submits title change requests to the HN mod address (hn@ycombinator.com), this is an interesting and insightful tool. No, none of my recent requests (the most recent is a few days ago) are shown. Which suggests I'm not too much of an annoyance, yet.

HN's mods (dang and sctb) are amazingly responsive and tolerant. I try to make their job easier in requests by keeping those short and clear. Others may find this useful or have further suggestions.

Generally:

- Action in the subject: title, clickbait, spam, link disintermediation (pointing to a primary rather than secondary source), self-promotion (I've landed on "one-note flute" to describe this), and behavioural issues. Also occasionally vouches (for flagged posts/comments), or best-of nominations (there's a curated list put out by HN monthly).

- Followed by the title-as-presently standing. Should make identifying the post easier.

- Link to post (or comment) as first line.

- Often a link to the submitted article.

- A suggested change or revert. Often these are clear, sometimes not. My view is that submitting both a "this needs changing" and "here's my suggestion", along with a possible rationale (usually a subhead, lede line, occasionally a good overview line from the article) makes the editors' job easier.

These are often accepted, sometimes with a slight change, sometimes as is. I generally don't follow up with a thanks or further acknowledgement, but usually do note when the request isn't accepted that I'm OK with it.

There are times I've differed with views (usually tech-politics intersections). I really wish HN could discuss such topics better than it does, though that also seems to be ... somewhat ... improving with time. Discussion on HN is almost always superior to other venues I frequent online.

Response times are generally a few minutes to hours, longer during off-hours. Rarely, less-critical issues may take a few days to generate a response. But there's very nearly always one, which I appreciate.


Arguably a bug: some of the titles listed have subsequent edits that don't show up on the page. Examples:

https://news.ycombinator.com/item?id=21619671

https://news.ycombinator.com/item?id=21609572

Does the script stop checking them at some point?


I understood the script only checks the front page. Could the change have happened when the article was not on the front page? Or are all articles that have ever been on the front page supposed to be checked "forever"?

I have no idea how often / how quickly an article could oscillate between rank 30 and 31 for example.


This is really cool! It'd be great to have a tool that tracks all editing/curation/censorship on hacker news.


Or you know . . . we could just throw it onto the blockchain . . . ;)


I'd love to see the domain added to the end of these, like they are on HN itself. For example, one might wonder why NVIDIA CUDA Toolkit is shortened to CUDA Toolkit... unless you see that (nvidia.com) is after it anyways.

It might add some context to some of the edits.


you should also track the comments moving from thread to thread (when an admin moves comments from one thread to another), comment shadow banning, post popularity based on upvote counts (lots of times, posts get to the front page with only a few votes), or maybe even track how fast posts get deranked


HN has an API: https://github.com/HackerNews/API

I'm using it to track common items here and on Lobste.rs (and Proggit):

http://gerikson.com/hnlo/

Here's the endpoint for the latest 500 submissions:

https://hacker-news.firebaseio.com/v0/newstories.json

here's the one for the current top stories:

https://hacker-news.firebaseio.com/v0/topstories.json

It's actually quite nice to work with. I don't know how to keep track of comments moving from thread to thread, because that's not a metric I'm interested in, but it should be possible to track somehow.
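A minimal Python sketch of reading titles through the endpoints above. The `topstories.json` URL is the one quoted in this comment; the per-item URL pattern (`/v0/item/<id>.json`) comes from the API's documentation rather than from this thread, and the helper names `fetch_json` and `top_story_titles` are made up for the example.

```python
import json
from typing import Any, Callable, Dict
from urllib.request import urlopen

API_ROOT = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url: str) -> Any:
    """Download one URL and decode its JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def top_story_titles(
    limit: int = 30,
    fetch: Callable[[str], Any] = fetch_json,
) -> Dict[int, str]:
    """Return {story_id: title} for the first `limit` top stories.

    `fetch` is injectable so the function can be exercised without network access.
    """
    story_ids = fetch(f"{API_ROOT}/topstories.json")[:limit]
    titles = {}
    for story_id in story_ids:
        item = fetch(f"{API_ROOT}/item/{story_id}.json")
        # Some items come back as null or without a title; skip those.
        if item and "title" in item:
            titles[story_id] = item["title"]
    return titles
```

Calling `top_story_titles()` with the defaults makes one request for the ID list plus one per story, so polling loops built on it should keep the interval generous.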


This shows how much work goes into moderation here. Almost all titles seem more accurate and less clickbaity.


What is the rule for titles being edited? Is it just no editorialization in the titles?


See the guidelines. The titles should be the original article title, unless that title is clickbaity, in which case either a less baity subhed can be used or a carefully written neutral headline. The big no-no is extracting some interesting fact from inside the article and using it as the title; submissions are community property, and the person who submits them isn't entitled to decide what the most important angle in the article is.


I remember there being an extension for Firefox which showed editorialized (news) site titles and content. I forgot the name though. Does anyone have any recommendations for this feature?


Have you considered adding a "sort by amount of change" option? It would be very interesting to see why larger changes happen compared to smaller ones.
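One way such a sort could work, sketched in Python: score each edit by how dissimilar the old and new titles are, then order by that score. Using `difflib.SequenceMatcher` as the measure is an assumption made for the example, as are the names `change_amount` and `sort_by_change`; the tracker itself may not expose anything like this.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def change_amount(old: str, new: str) -> float:
    """Return 0.0 for identical titles, up to 1.0 for titles with nothing in common."""
    return 1.0 - SequenceMatcher(None, old, new).ratio()


def sort_by_change(edits: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order (old_title, new_title) pairs from largest change to smallest."""
    return sorted(edits, key=lambda edit: change_amount(edit[0], edit[1]), reverse=True)
```

A capitalization fix would then sink to the bottom of the list, while a complete rewrite of the title would rise to the top.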


This would be interesting for a number of news sites too.


This is one for the New York Times [0].

[0]: https://mobile.twitter.com/nyt_diff


Someone I know runs something like this for news content. See https://www.newssniffer.co.uk/


This is great! Transparency, accountability, predictability etc should be more prevalent on the web.


Could you also match to the source article/post title?


It would be interesting to let submitters provide multiple titles and a/b test them from the outset.


The entire point is to make titles less clickbaity. A/B testing them to get more clicks defeats the whole purpose of mods editing titles.


Yes, but the vast majority of folks on the Internet, including HN, don't get to directly experience the effect that headlines have on engagement.

I never said it was a good idea, just interesting. :)


What frequency of sending GET requests to the servers of HN is an acceptable rate for a bot? I tried to look for an answer on this but didn't find any.

In the past I got the IP of my server banned from accessing HN for sending too many requests in too short of a time span. I found the unban interface that you provide, lowered the request rate of my crawler and tried again but was still sending too many requests in a limited amount of time and got the IP of my server banned again. If I recall correctly, I got it unbanned a third time and lowered the request rate even more but then got banned again and then I think it would not allow automatic unbanning.

Don't remember if I just gave up at that point or if I sent an email about it, or if I just waited some amount of time if there was a statement about how long I would have to wait before being able to use the unban interface for the IP of my server again.

Anyway, an official answer about the acceptable request rate would be nice. Perhaps put it in the FAQ?

Also, if people doing automated GET requests were to create a unique UA string for their scrapers that includes a way for HN staff to get in touch, for example (but with the actual names of the bot and site)

    Examplebot/0.95 (+http://www.example.com/bot.html)

where the page at that URL would list an e-mail address for getting in touch, as well as a statement about how to verify that a given crawler belongs to that service, would that help in not getting the server IP banned automatically?
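
To make the idea concrete, this is roughly how such a UA string could be attached to requests in Python. It is only a sketch: the bot name, version, and URL are the placeholder values from the example above, the helper names are made up, and whether identifying a bot this way actually prevents a ban is exactly the open question.

```python
from urllib.request import Request

# Placeholder values copied from the example above; a real bot would use its own.
BOT_NAME = "Examplebot"
BOT_VERSION = "0.95"
BOT_INFO_URL = "http://www.example.com/bot.html"


def bot_user_agent(name, version, info_url):
    """Format a UA string in the 'Name/version (+url)' style shown above."""
    return f"{name}/{version} (+{info_url})"


def build_request(url):
    """Create a GET request that identifies the bot; nothing is sent here."""
    user_agent = bot_user_agent(BOT_NAME, BOT_VERSION, BOT_INFO_URL)
    return Request(url, headers={"User-Agent": user_agent})
```

The returned `Request` can be passed to `urllib.request.urlopen` when the crawler is ready to send it.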


Once per 30 seconds. That's in our robots.txt: https://news.ycombinator.com/robots.txt. We've been working for a long time on (edit: what we expect to be) some serious performance improvements that might allow us to relax that limit. For now though, HN's process still runs on a single core and we don't have much performance to spare.

If you need more than that, you should use the Firebase-based API (https://github.com/HackerNews/API). The public dataset is also available as a Google BigQuery table: https://bigquery.cloud.google.com/dataset/bigquery-public-da....

Edit: since this subthread is not really on topic I detached it from https://news.ycombinator.com/item?id=21617478.
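
For crawler authors, a minimal sketch of staying under the once-per-30-seconds limit mentioned above. The `Throttle` class and its `clock` and `sleep` parameters are hypothetical names for illustration; the only figure taken from this comment is the 30-second interval.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=30.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would create one `Throttle()` and call `wait()` immediately before each request to the site. The first call returns at once and later calls sleep only for whatever part of the interval is still outstanding.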


>single core

Instead of further optimizing the already light design, surely it would be easier to do a GoFundMe and get a server upgrade? I certainly wouldn't mind chipping in a buck or two if it relaxes limits.


YC is not actually a nonprofit charity, FYI.


I think the problem is more that the software is single-threaded. I’m sure YC can afford more than a single-core machine.


The software is multi-threaded but it runs on a platform (Racket) which implements that as green threads, meaning it runs them all on a single core.


Ah. That makes a lot more sense than HN being short $100. Thanks for modding.


Note that besides threads, Racket also has places, which allow programs to use more than one core.


Single core! Raspberry Pi 4? ;)


Oh. Single core.

That's...sad.


I disagree.

I think it's awesome that there's a site this popular out there that _doesn't_ run on resume-fashionable Google/Facebook scale infrastructure that it does not require.

"Yeah boss, we're going to have to pull out memcached and replace it with a Kubernetes cluster of MongoDB shards. For 'reasons'..."


Why?


BTW, you don't need to hit HN to create something like this. I'm using https://hacker-news.firebaseio.com/v0/topstories.json and https://hacker-news.firebaseio.com/v0/newstories.json to get the latest story IDs, then https://hacker-news.firebaseio.com/v0/item/ID-GOES-HERE.json to get the titles.
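
Once the titles are in hand, spotting edits is just a comparison between two snapshots. A rough Python sketch, assuming each snapshot is a dict mapping story ID to title built from the item endpoint above; the helper names are made up for illustration and this is not necessarily how the tracker itself works.

```python
API_BASE = "https://hacker-news.firebaseio.com/v0"


def item_url(story_id):
    """Build the item URL for one story ID."""
    return f"{API_BASE}/item/{story_id}.json"


def changed_titles(previous, current):
    """Compare two {story_id: title} snapshots.

    Returns {story_id: (old_title, new_title)} for every story that
    appears in both snapshots with a different title. Stories seen
    for the first time are ignored because there is nothing to compare.
    """
    return {
        story_id: (previous[story_id], title)
        for story_id, title in current.items()
        if story_id in previous and previous[story_id] != title
    }
```

Running this after each poll and then replacing the previous snapshot with the current one gives a running log of title edits.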



