Searching the Creative Internet (crawshaw.io)
119 points by EdiX 4 months ago | 46 comments

The type of site the author is discussing began to disappear around the time "blogs" entered normal people's vocabulary, and the commercialization of Internet content really began. I remember the 2004 US Pres election being all about "the hottest blogs", which had Washington insider information the major media outlets didn't have. Once "blogging" became a business (ad-supported), quality collapsed as the KPIs were now pageviews and clickthroughs.

This caused a shift away from creating content for content's sake. 15 years ago, someone may have posted an in-depth technical article for no other reason than to share knowledge. Now, that article would be posted on Medium.com in a gimped form, as a lead-in to their 10-hour video course on that topic.

The early period of Twitter briefly brought the "weird web" back, but once brands and businesses descended upon it, it became all about retweets and follower counts. Quality nosedived as a result.

The blogs of old are still there, I mean, this article even links to a few. They are just really hard to find, and that’s the problem.

The internet didn’t stop being interesting, HN is a great example of this in action. The blog article about rejected Disney princesses is likely the most interesting piece anyone of us read on the internet today.

Only Google didn’t deliver it, an HN link to an article about the creative net did. I say Google, but really, any of the gatekeepers is equally guilty.

Interestingly, I think all the gatekeepers are at the place where Yahoo, AOL and others were when Google disrupted them.

So maybe, just maybe, we’ll see the dawn of something better soon.

I immediately thought of HN-style sites too. What we need is some sort of gumbo made from spanning all of these specialized, human-curated sites, stripping out the Alexa top 100, 1000, top-whatever sites and then re-ordering the results by some new measure that encapsulates "neatness", "quirkiness" or "weirdness", or whatever we can call it.

This sounds really hard, and for me I don't know where I'd start, but I also didn't create Google or PageRank, which at its dawn also seemed like magic, so I have faith this could be done.

It's the sort of mission to which I could dedicate my life, until I too was corrupted by the success and lure of endless wealth, at which point today-me sincerely hopes some other smart oblivious kids would come along and displace me.

So I too hope to see something better in my lifetime.

For the same reasons I like to browse Mastodon, and spend time on the deep web. It's not that quality content doesn't exist anymore, it's just drowned out by the crap.

I've been meaning to try something out for some time -

Take the results of a big search engine, and programmatically filter out everything that contains something from the Adblock filter list. Not just getting rid of the ads, but ignore the entire page if it contains advertising. Iterate through enough results until you actually get the first 50 or 100 hits not containing any advertising, and return those as the basic search result.

There would be collateral damage of informative pages that attempted to good-faith "monetize", and you'd miss out on the stackoverflow results etc, but I would hope this would surface many of those super informative sites of yore. That is assuming they're even still indexed.
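
A rough sketch of the idea above, in Python: walk the results in rank order, skip any page whose source contains an ad-related pattern, and stop once enough ad-free hits have been collected. The `AD_MARKERS` tuple is only a hypothetical stand-in for an Adblock filter list (real lists use a richer rule syntax than substrings), and `fetch` is an assumed callable that returns a page's HTML for a URL.

```python
from itertools import islice
from typing import Callable, Iterable, List

# Hypothetical stand-in for patterns derived from an ad-block filter list.
AD_MARKERS = ("doubleclick.net", "googlesyndication.com", "adservice.")


def looks_ad_free(html: str, markers: Iterable[str] = AD_MARKERS) -> bool:
    """Return True if none of the marker substrings occur in the page source."""
    lowered = html.lower()
    return not any(marker in lowered for marker in markers)


def first_ad_free(results: Iterable[str],
                  fetch: Callable[[str], str],
                  limit: int = 50) -> List[str]:
    """Keep the first `limit` URLs, in rank order, whose pages look ad-free."""
    # The generator is lazy, so pages are fetched only until `limit` is reached.
    ad_free = (url for url in results if looks_ad_free(fetch(url)))
    return list(islice(ad_free, limit))
```

Because the generator is lazy, this keeps pulling results only until it has `limit` ad-free hits, which matches the "iterate through enough results" step.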

Yahoo (circa 2000) used to let you filter search results by no ads or no monetization, I forget the exact phrasing. It was sometimes useful then but it would be more interesting now. There is also https://millionshort.com/ which was a Show HN product from 2012, stripping out X number of top ranked websites from Google results.

These ideas are half the battle, I think the other half is the curation, which is a little more abstract, but possibly machine-learnable to some degree?

> That is assuming they're even still indexed.

That makes me kinda sad, like a cool drawing getting thrown away that no one saw. These pages would sit out there, only with inbound links, but never a search result.

This would be really easy to do: just fetch several blocks of 100 links programmatically for a search term, then immediately apply a filter that first pulls out everything on a shared community blacklist. You could remove items from the blacklist if you desired.
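
The filtering step could look roughly like the sketch below. It assumes the blacklist is a plain set of domain names and that personal removals are a second set subtracted from it; the function and parameter names are made up for illustration.

```python
from urllib.parse import urlparse


def filter_blacklisted(urls, shared_blacklist, my_exceptions=frozenset()):
    """Drop URLs whose host is on the shared blacklist, minus personal exceptions."""
    blocked = set(shared_blacklist) - set(my_exceptions)
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Check the host and each parent domain (www.example.com -> example.com).
        parts = host.split(".")
        candidates = {".".join(parts[i:]) for i in range(len(parts))}
        if not candidates & blocked:
            kept.append(url)
    return kept
```

Calling `filter_blacklisted(urls, blacklist, {"stackoverflow.com"})` would keep Stack Overflow results even if the community list blocks them.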

I think the author may like "Million Short" [1], a search engine that lets you remove up to the top million most popular websites from your results.

At a more general level, I miss the days in which I could type a search query like `intitle:"index of" mp3 mb` and get actual, unfiltered results. I've toyed with the idea of indexing the web myself and using simple filters, but I think I'll wait until someone here gets funding for it instead.

[1] https://www.millionshort.com/

Every time I see an article lamenting the loss of the early internet I think that it boils down to community quality vs. scale. As communities get too large, content and discussion gravitate to whatever seems to get a response from the biggest group. That can easily leave things feeling bland and make efforts to start conversations feel too competitive. Smaller specialized communities that are somewhat insulated do have the chance to avoid this phenomenon, though oftentimes they fail under their own success or are too insulated to pick up new individuals. It's a hard balance to strike.

I personally miss the prevalence of technical and personal web logs of the late 90s. That's not to say that they aren't still around, there are simply more alternatives to sift through and many of them do focus on marketing themselves for visibility (which seems to take something away from the feeling of the older net IMO). A large focus on centralization certainly seems to have shifted the broader tone online, though there are still plenty of gems floating out there.

Can you give some examples of those type of websites? I've been really interested in the idea of stripped down personal static sites.

I don't think "stripped down" or "static" would be what would really describe the sites. Things were broken, varied, and sometimes simplified in handcrafted ways. I would say that old university pages for professors would be a good example of something basic which reflected the era, though there's been such a big churn in university URLs over the past decade (and presumably other times as well) that it's difficult to find an example which hasn't been mutated by university IT into something more 'presentable' or just hosted at a different URL.

While most of what I could still locate dates only back to the mid-2000s, consider https://web.archive.org/web/20050122063815/http://matt.might... Around that time Blogspot was pretty active and well ranked by search engines, freestanding WordPress instances were fairly popular once individuals moved past writing HTML posts manually and uploading them over FTP to a simple host, and phpBB instances were available for most topics. The key thing is that once you go back before the easy availability of PHP-based web hosts, you get into the territory of hand-edited plain HTML (or Microsoft FrontPage) and very little to no JavaScript.

https://www-user.tu-chemnitz.de/~heha/ (German, some English pages)

Lots of interesting stuff about hardware and software. Minimal fluff, maximal content.

The demise of Geocities (in 2009, much later than I imagined) has caused a mass migration of these types of sites to places like Tumblr, whose search functionality leaves a lot to be desired.

Now that Tumblr is dead there's some hope the alternative will offer better features.

Try Millionshort.com

By trimming the top 1000 results, I was able to find some truly random websites that just happened to have a deep archive of interviews of a band I was looking up. All interviews were from '80s magazines, pretty much all defunct now.

Had I stuck with Google/Bing, every top search result would be either an eCommerce store, Spotify artist page, Allmusic, or a snarky Pitchfork/Vice piece about them from the 2000s.

For this search, Spotify, Pitchfork and Vice were at the top of the search results because they are SEO-optimized. This means that Google/Bing search will show the links from the domains that perform best under their page rank algorithms. Since Pitchfork and Vice are domains with a high number of backlinks and lots of active traffic, those were the ones that ranked the best.

Given that so much of web content is just rehashing what was originally reported/said somewhere else, finding niche content is going to be harder and harder.

"The second link is a NASA press release. (Why does NASA even have those?)"

Oo oo I know this one. They have those so that they can report discoveries about the nature of the universe to the taxpayers who are paying them to make those discoveries.

I think the implied full question was, "Why does NASA even have press releases next to the actual research they publish, and why do those releases occupy higher spots in search results than the research?".

The answer is still the same. Press releases are accessible to the general public and have a wider appeal than research papers, so they appear higher in the search ranking. I'm not sure why the author of the article thinks only people capable of reading and understanding the consequences of a research paper are interested in reports from NASA.

Also, in case anyone didn't realize, press releases are what the press use to write popular articles. Hence the term. Journalists generally don't read scientific papers.

What's stopping them from publishing the executive summary or abstract as a generic non-optimized, non-gamified SEO'd document? That's what universities and research organizations did for the first 30+ years of the internet...

I completely agree that it is hard to find things - and it is getting worse. If you search for a recipe for chicken soup, you now get a 20 page life history, filled with ads (ha ha) and finally a recipe that can be as perfunctory as "cook chicken in water". This is click-bait-world. In my opinion this comes from intrusive advertising as the business model for the internet. (See Bruce Schneier).

I want a search engine that deprioritizes results based on the number of trackers or ads on the page. Like, rank = (relevance * 1.0) + (trackers * -0.5) + (ads * -0.5).
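
Written out as code, that formula is a one-liner. The sketch below assumes relevance is some non-negative score from an existing ranker and that tracker and ad counts have already been measured per page; the example pages and their numbers are invented purely to show the re-ordering.

```python
def rank(relevance: float, trackers: int, ads: int,
         tracker_penalty: float = 0.5, ad_penalty: float = 0.5) -> float:
    """rank = (relevance * 1.0) + (trackers * -0.5) + (ads * -0.5)"""
    return relevance * 1.0 - trackers * tracker_penalty - ads * ad_penalty


# Invented example data: (site, relevance, trackers, ads).
pages = [
    ("big-media.example", 9.0, 8, 6),
    ("blog.example", 7.0, 2, 1),
    ("hobby-site.example", 6.0, 0, 0),
]

# Sort by the penalized score, best first.
ranked = sorted(pages, key=lambda page: rank(*page[1:]), reverse=True)
```

With these made-up numbers the ad-free hobby site ends up above the page that had the highest raw relevance. The 0.5 weights are the ones from the comment, and the right scale depends entirely on how relevance is measured.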

I made an attempt to do something a bit like this by a) seeding from HN links and, b) passing all links through uBlock Origin via Puppeteer. My assumption is that more uBlock hits correlate with lower quality (a crude metric, I know, but it turns out not far off the mark).

My first-run result is here http://kakapo.susa.net:8080/cfs/ - I think the results are promising, even on a tiny index (around 1M pages, I think).

That's an awesome idea, which leads me to think: what would be needed to experiment with alternative page ranks and levels of bubble-encapsulation? What resources are out there with which one could experiment, since crawling the whole web is too large an endeavour? How about a distributed database of the text-internet that anyone could clone and build indexes from? A layered approach? This must surely already exist?

You don't need to crawl the web yourself. http://commoncrawl.org/ will give you the data for free. It's a little out of date by the time you get it but that shouldn't affect a project like this.

You could build another Google, why not?

A brilliant idea! I'm in if anyone wants to talk about how to do this. What's the best way to host a discussion? Slack/github/other?

I think Github would get higher visibility, and there also wouldn't be a limit on comment history.

I set up a github thing if anyone is interested.


I wish DuckDuckGo had some better filtering features like this.

I really like this in concept.

Maybe there should be a search engine for the non-tracking web.

Yes! That is a brilliant idea. It would be straightforward to implement. Just exclude any page that loads resources from any domain on the major ad-blocking lists.
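
A minimal sketch of that exclusion check, using only the Python standard library. It assumes the ad-blocking lists have already been reduced to a plain set of domain names (real lists contain more than bare domains), and it only looks at `src`/`href` on tags that actually load something, so an ordinary `<a>` link to a listed domain does not count.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceHosts(HTMLParser):
    """Collect hostnames of resources loaded by script, img, iframe and link tags."""

    LOADING_TAGS = {"script", "img", "iframe", "link"}

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in self.LOADING_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).hostname  # None for relative URLs
                if host:
                    self.hosts.add(host.lower())


def loads_from_blocked(html, blocked_domains):
    """True if the page loads a resource from a blocked domain or its subdomain."""
    parser = ResourceHosts()
    parser.feed(html)
    return any(
        host == domain or host.endswith("." + domain)
        for host in parser.hosts
        for domain in blocked_domains
    )
```

An indexer could then drop any page for which `loads_from_blocked(html, blocked)` is true. This only sees resources named in the static HTML, so anything injected later by scripts would need a real browser to catch.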

Heck, I'd go even further and create a JavaScript-and-third-party-cookie-free search engine. Block any domains that host tracking pixel images too.

Bring back the old web!

I know that's not what you meant, but isn't that essentially what AMP is? How well that then turned out...

AMP isn't about not having trackers... it just means implementing trackers and ads the AMP/Google way. I'm pretty sure that Google doesn't treat AMP pages any differently based on the number of trackers so long as the pages are considered "valid" AMP.

Totally agree with this. Though I often want something similar but not necessarily “creative”. I’d like to be able to ask “show me blogs or discussions that are substantial about x”. x could be a scientific paper or something.

It is a question of search / discovery mechanisms. Mostly this kind of query is “satisfied” by things like Twitter. But I wish there were a good blog / discussion search engine. Those died a long time ago. As you say the results I am looking for only show up on lower pages in Google. Maybe there is a better search engine for that I don’t know about.

It really saddens me that the best way to get to actual research these days is via Twitter. It always feels like I'm forced to wade through a sea of manure because that's the only place you can find diamonds.

>> the best way to get to actual research these days is via Twitter.

I'm usually disappointed by Twitter search (unless it's for finding smart people, and endlessly browsing through their stream). How do you manage to get so much out of it?

I don't, that's my point. But apparently I'm supposed to, since for some reason everyone chooses Twitter as a platform for research discussion.

Wouldn't be too hard to build a simple MVP - just search, but block/omit any domain owned by an organization worth more than $XX million.

Specific threshold/blacklist TBD/subjective, and you'd get false positives (people posting truly original content on their Facebook page or Blogspot blog). But by and large, a lot of the truly "labor of love" content out there is done by folks both savvy and invested enough to set up their own domain name, and would pop out if you just filter out all the corporate domains.

"a site filled with excellent original stories based on historical figures. Some Disney executive should buy them."

And turn it into the mainstream content the author doesn't like?

I mostly don't get this reminiscing about the past state of a technology. When I see it I always suspect someone is remembering their youth in a candid way.

Everyone's a content creator these days. Even your uncle is posting pictures of his meals and writing about how #blessed he is. And you know, while that's not my cup of tea, there's nothing inherently wrong about that. It's great even, so long as it's authentic. But it usually doesn't strike me that way; it usually feels more like they've become a social media coordinator, sharing—and selling—a fake version of themselves to you. He's become a brand.

If you were to stumble across Uncle Brand's online presence, you'd likely be amazed with what an interesting and humble and cultured and well traveled person Uncle Brand is. And is that so wrong? He's only creating what he's seen other people do after all. Only giving you what you want. Why not put your best self out there?

And maybe Uncle Brand has a little something unique he does, that one emoji he always uses or that obsession with ramen. And maybe he starts attracting an audience. He's reliable! He's relatable! He's authentic! He's safe!

But an audience is something you have to maintain, something you have to grow. The audience didn't come for Uncle Brand the man; they came for Uncle Brand the brand. So he starts refining his brand, churning out more content, gets a better camera for his photos. He's got more resources now and can ape what big brands do.

In some ways, the internet became too real, too tied to the real world. You can even make real money on the ol' www! But when this happened, rather than the internet liberating us from the old, we just recreated the old incentives and shallowness and commercialism. But shittier and more random. Youtube celebrities are mostly just shittier celebrities. Instagram is mostly just shittier magazine and food and travel photography. Internet journalism is mostly just shittier journalism. So much online content is pre-internet content just pushed on a new channel.

And what's so wrong with getting real? Uncle Brand is all in. He's a souper star! The Martha Stewart of ramen. By now, Uncle Brand is using his brand to hawk stuff too. He's using his brand to hawk other people's brands. Promote, promote, promote. Sell, sell, sell.

The internet defies generalization. There are certainly great communities and forums and subcultures and people creating amazing stuff out there today. More than ever even. But the internet as many people experience it today is indeed quite different from what I originally loved. It feels like all the incentives are wrong; platform incentives resold to creators and users as their own.

I don't want to be a brand. I don't want what the old world is selling: the celebrities, the popularity contests, the consumption, the fear of judgement. I just want to create awesome stuff and have fun and try something new. And I want to connect with people who are doing the same!

A fine article, but since the author singles out NPR and Wikipedia, I would just like to say a word in their defense. And then some other words.

Thank God for Wikipedia, it is a miracle. Long may it stay antifragile!

And long may NPR use its supporters' money to produce consistently archive-worthy content!

Donations to the Wikimedia Foundation or your local public radio station make great "solstice" gifts, if you're into that sort of thing.

So yeah, I just turned this comment into an ad, because let's not throw out the baby...

And yes, I'm old enough to remember the "good old days." There is every bit as much signal now. And every bit as much more noise.

And yes, Google's hegemony is a threat to the capital-I Internet's antifragility. (I just got Taleb's book, can you tell?) Guess who powers the analytics for crawshaw.io? It's all of a piece, people. Walk the talk.

And yes, I have a little Shakespeare site that's "better" than the top-ranked ones in many ways, but I accept that if I wanted those top spots---like any other top spots---I would have to sweat and hustle and fight for them. I don't, and no contrarian anti-Google is going to hand them to me.

I'm all for the better thing. Blue sky, every day! Step away from the machine! But c'mon, it's 2018, let's not dis NPR and Wikipedia! We're fighting the good fight!

EDIT Also, to eksemplar's point that "The blog article about rejected Disney princesses is likely the most interesting piece anyone of us read on the internet today." I picked this piece at random and it was so amazing. I emailed it to my wife who's an artist. Man. Thanks for that alone, ye OP!


The weird internet is still there. It just isn't on the web. Join a pubnix, use gopher, find a good bbs, get back on IRC. Those are just a few great spaces. The bit of technical knowhow to get to or use them reduces their viability as capitalist enterprises, which keeps them at least slightly more creative/old school. Generally it is easier to find a community with the right balance of size, content, and participation in these places than in things hosted on the web.
