
Right, archive.org is very useful if you have a URL, but if you're searching for a question that was answered in a forum, and that forum post no longer exists and no longer shows up in search results, then it's effectively undiscoverable as far as I'm aware.

It amazes me how companies will have unpaid volunteers help people use their (often expensive) paid subscription products, and then delete all the info those volunteers wrote up. Don't they want people to use their products?! Customers are less likely to renew their subscriptions if they struggle with, or are unable to use, the product for their particular use case.

Unaffiliated forums not run by the company are better in that the company can't decide to just delete all old posts one day (and while the owner could, certain types of unaffiliated forums are usually a bit easier to clone and republish). The downside is you don't get assistance from people who work for that company, but you rarely get that in official forums anyway. The usual reason to use official forums is just that they have significantly more users asking and answering questions than unofficial ones.






> that forum post no longer exists and no longer shows up in search results

I dream of someone taking the Internet Archive data, capping it at 2010 or so, then making a search engine out of it. I mean, if AI companies are looking to gobble up all the data they can get, then surely they'd jump at the chance to train on (higher quality) data from the past that simply no longer exists on the web. So it'd seem like a win-win situation: IA gives them a copy of the data on the condition that they maintain a permanent backup and provide some sort of searchable index of it (maybe even via LLM), and in turn the AI companies get access to high-quality data on obscure topics.


Yup, but let's not tie such an important endeavor to AI and AI startups. We need something robust and lasting :-)

> I dream of someone taking the internet archive data, capping it at 2010 or so, then making a search engine out of it

It sounds like you're describing CommonCrawl.org, and yes, it's already popular with AI companies.



