Basically the idea is that you take sites that perform too well on search metrics and remove them. Of course, that only works as long as a majority uses services like google.
However the top result when I removed top 10k sites was the second result when I did it without removing anything. That was kind of disappointing.
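An experiment like that is easy to sketch: take a ranked result list and drop every result whose domain is in a "top sites" blocklist. All names and URLs below are invented for illustration, and the domain normalization is deliberately crude (it only strips a leading "www."):

```python
# Hypothetical sketch: re-rank search results by dropping any result
# whose domain appears in a "top sites" blocklist (e.g. a top-10k list).
from urllib.parse import urlparse

def strip_top_sites(results, top_domains):
    """Keep only results whose host is not in top_domains."""
    kept = []
    for url in results:
        host = urlparse(url).netloc.lower()
        # Crude normalization: drop a leading "www."
        if host.startswith("www."):
            host = host[4:]
        if host not in top_domains:
            kept.append(url)
    return kept

top_domains = {"wikipedia.org", "about.com"}
results = [
    "http://www.wikipedia.org/wiki/Foo",
    "http://smallblog.example/foo",
    "http://about.com/foo",
]
print(strip_top_sites(results, top_domains))
# -> ['http://smallblog.example/foo']
```

Relative order is preserved, which matches the observation above: after filtering, the old #2 simply becomes the new #1.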
HN's search won't return results with a domain if you don't put the TLD in. Not sure why.
Some of the results remind me of the Web from the late 1990s: Definitely more than just the original group of geeks writing pages, but the mega-popular sites (Wikipedia, about.com, etc.) aren't there, like they weren't there back 15 or so years ago. It's almost homey again.
I fondly remember the golden days of Usenet. Eternal September was nothing compared to the devastation caused by blogs and twitter.
Instead of Twittering, everybody was chatting on IRC. And the internet wasn't corrupted by commercials yet, as the Web is today.
1. http://www.amazon.com/gp/product/1898275351?SubscriptionId=0... ...obscure enough that it's priced like a rare book, ha ha.
That's why RSS must live on. It takes over 200 subscriptions to get a decent daily dose of reading out of blogs like that.
The central problem, I believe, is that we don't have a good model to predict a piece of information's relevance to a given reader. So far in human history the mechanisms that solve this problem have been ad hoc and error-prone - only by luck do you ever 'fall in with the right crowd' and start to get the information that you craved all along. Personally I feel strongly that this is a solvable problem, and when it is solved it will change human society forever.
Because search engines fell asleep at the wheel and now content farms are a sizable chunk of the economy?
And they can always release some features like curated/customized search to weasel around any regulation dangers. Maybe search engines just don't care about the health of the web.
I don't think OP meant content farms. I think he meant half of the sites whose articles repeatedly appear on HN: all those smaller and bigger "news" services that write a lot of poor, shallow, misleading, or just plain wrong articles engineered to be controversial. The problem is with incentives: they do it for ad money, not to inform or educate people.
But hey, we're a clever species, I do believe that someone will figure out how to structure reality so that we get more of the things we really want without using proxy incentives that backfire when overdone.
 - http://www.youtube.com/watch?v=u6XAPnuFjJc
Ok, this really is not my area of expertise. I just feel that what we have now is both unfair to writers and disastrous for society :(
It could work for bloggers, but they're not the problem. The "news" services / aggregators / whatever are, and it's hard to put a cap on what a company should earn (I'm not even sure if it is a Right Thing to do).
> I just feel that what we have now is both unfair to writers and disastrous for society :(
Couldn't say it better myself.
Fell asleep at the wheel? The major search engines run their own ad networks. They're content farms' key customers, and content farms are a core element in their business strategy.
They definitely see the problem. The current ad/SEO setup incentivizes all sites to behave like content farms: duplicated content, quantity over quality, short articles, etc.
Sort of like how radio stations make a bunch of hubbub about doing blocks of X minutes with no advertisements. It's not because they don't like ads, it's because they've also got to deliver enough of what their audience is looking for to keep them coming back.
Google is used for what I would call "archive search" and research these days (how to do X, etc) in my experience, not current news.
You have to finely tune your subreddit subscriptions, otherwise you get a meme whiteout from some of the other subs.
Crafted news content matching what I want, with little overhead, is an area I think you could see some disruption.
I know quite a few people that visit the Verge and similar sites on a daily basis, and they have no intention of reading all their content.
I think it's better to think of those high volume sites as newspapers rather than blogs. Most people do not have the time to read an entire newspaper, but that doesn't make it any less valuable.
What "top slot"? Blogs aren't ranked by a committee. I can only assume OP means traffic. The blogs with the most traffic have it for a reason, and OP still doesn't realize why after many years in the industry.
I'm sure the author thinks the things he wants to have the "top slot" deserve the "top slot." The naivety.
"If I ran a search engine, I would ban these sites from the index."
I wish someone like you was around to curate the Internet for me.
Ironically, OP's blog's purpose is also to sell ads and ebooks. In the early '00s he's talking about, people blogged just because they felt like it and didn't complain about monetization strategies. TechCrunch? What TechCrunch? 2005 is not 2000.
I see a site like Mashable as totally optimized for social and ads. I remember a point in time when the site was interesting--when APIs seemed like new thing and hackers were "mashing" sites together on a scale like had never been done before. It chronicled the new web 2.0 trends (which I have to admit were really powerful--no one can deny that the landscape of the Internet was changing).
But now it's just a bunch of crap. Top-ten lists (which, of course, make you click through each item so you'll refresh ads). And infographics: they used to be cool. Now they're just stupid charts with a graphical background and font. Every time I click one I think to myself: this didn't have to be graphical, and it's not very informative.

Like a movie trailer, the headlines on these traffic-hoarders are catchy just to get you to click, and once you arrive you're disappointed and 10 cookies have been dropped on you. Or you see some modal covering the content, asking you to do something that will help the site proliferate itself. If you can stomach all this deception and read the meat of the article, most of the time you don't feel very satisfied. You click the back button unless you're tricked into clicking another of their links.
I think the only real way to stop this is to significantly demote sites that have more than one or two ad units on the page.
I don't think there was any "mind" that kept quality high and volume low. Volume was just lower, there were fewer interested parties, and they were less motivated by money (because there was less money to be had). There are ridiculous quantities of excellent lower-volume "blogs", though; you can hardly call them dead just because you're using the populist channels to try to find them. Is that how you found them in the '00s?
TripAdvisor's gaming is even worse. When I search for 'restaurants in place_x' I get results from every TripAdvisor site: tripadvisor.com, tripadvisor.es, tripadvisor.in, and more. The results are duplicates!
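That kind of cross-TLD duplication could in principle be collapsed to one result per site. A naive sketch (URLs invented; a real implementation would need the public-suffix list, since splitting on dots breaks for ccTLDs like .co.uk):

```python
# Naive sketch of collapsing cross-TLD duplicates (tripadvisor.com,
# tripadvisor.es, ...) down to one result per site, keeping the
# first-ranked occurrence. Illustration only.
from urllib.parse import urlparse

def dedupe_by_site(results):
    seen = set()
    deduped = []
    for url in results:
        host = urlparse(url).netloc.lower()
        labels = host.split(".")
        # Treat the second-to-last label as the "site" name.
        site = labels[-2] if len(labels) >= 2 else host
        if site not in seen:
            seen.add(site)
            deduped.append(url)
    return deduped

results = [
    "http://tripadvisor.com/restaurants",
    "http://tripadvisor.es/restaurantes",
    "http://tripadvisor.in/restaurants",
    "http://localguide.example/places",
]
print(dedupe_by_site(results))
# -> ['http://tripadvisor.com/restaurants', 'http://localguide.example/places']
```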
The quality of google results in certain niches is very very poor. I haven't found that bing is better though. =/ It's just amazing that they haven't been able to make considerable improvements to this with the amount of money they have. I guess that's the problem with lack of competition.
Zite seems to be on the right track to fix the "the web is a mess" problem...
Hell, the quality of Google is so low that Bing is actually running ads right now featuring blind taste tests where people preferred Bing. Of course this isn't scientific at all, but my point is: no big scandal has erupted about how wrong this is. It's totally plausible for Bing to do this, because everyone realizes that Google has gotten to the point where Microsoft can plausibly compete with them!
PageRank was really cutting edge, but that was 10 years ago, yet it is still their primary mechanism. It's been gamed, but they seem uninterested in moving to more sophisticated mechanisms (they use them, but the influence of the better methods seems to be too low). Meanwhile they've used their bully pulpit to influence the web to conserve page juice, which has backfired in such a way that actual links to authoritative and useful sites are ranked lower than spam links, making it easier to game. (When Wikipedia is using nofollow on relevant outbound links to pages that Wikipedia is quoting or citing, things are fundamentally broken; no site on the web has a more favored ranking position than Wikipedia. Not to mention hand curation of pages. You can't even correct errors there without having them reverted by some know-nothing whose sole accomplishment is rising in the ranks of Wikipedia editors, so it's not like they need this to prevent spam.)
This means that when the site Google unquestionably considers the most authoritative cites a page it considers authoritative, Google gives that link no credibility. But let me create a web of sites that generate text that passes grammar parsers as "good English" but whose purpose is to spam keywords and link to each other, and I can rank for those terms close to Wikipedia. (This is essentially what TechCrunch is doing, only they have humans write the low-quality text instead of a computer.)
It's broken, and google broke it.
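For reference, the "classic" PageRank mechanism being criticized can be sketched as a power iteration over a link graph. The graph and damping factor below are made up for illustration; real rankings combine this with many other signals:

```python
# Minimal PageRank power iteration over a toy link graph.

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum the rank flowing in from every page that links to p,
            # split evenly among each linker's outbound links.
            inbound = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * inbound
        rank = new
    return rank

# Toy graph: the spam pair links to each other *and* to "wiki",
# so "wiki" accumulates the most rank.
links = {
    "wiki": ["blog"],
    "blog": ["wiki"],
    "spam1": ["spam2", "wiki"],
    "spam2": ["spam1", "wiki"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
# prints "wiki"
```

Since every page here has outbound links, the total rank stays at 1.0 across iterations; the damping term is what keeps isolated link rings from monopolizing score.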
Your focus on PageRank is at least somewhat off base, considering that it's only one factor in ranking, as noted by someone below, and it's a signal that pretty much all the search engines use as well ("The Bing ranking algorithm analyzes many factors, including but not limited to: ... the number, relevance, and authoritative quality of websites that link to your webpages").
There is something to be said for no-follow links being a symptom of something broken, but, OTOH, page rank is still a good indicator of what people out on the web find to be useful and relevant content, allowing you to find popular content, cluster it by subject, etc...essentially crowd sourcing (a portion of) relevancy via something people do anyways. Gaming was inevitable, and no-follow is really more of a way to disincentivize spammers...the fact that with no-follow you get the spammers anyways (to get human eyeballs instead of crawlers') demonstrates that the motivation is always there. If a search engine trusts wikipedia's outbound links they don't have to obey no-follow in any case, but you still have the situation that everyone will have their own favorite "impartial" external links to add, not to mention people with a vested interest in the subject.
The possibility you forget in your "microsoft can plausibly compete with [google]" point (leaving aside the fact that most people are just ignoring it) is that Bing has improved, and that has nothing to do with Google breaking anything.
When I search for technical things or things related to the news, I feel like I can actually see the 'tint'. It's like Google adds certain keywords to my searches, or reduces them to a much smaller subset.
When I search for something, sometimes it's because I want to find something I saw a while ago, but sometimes it is to get a new perspective on things. When you search for something and see the same opinion for the first 10 results you can tell how skewed it is.
Now I have to manually add 'criticism' or 'failure' to certain searches, or 'success' even. It's just weird.
Yes, it is not a scientific study, but Bing's marketing isn't entirely scientific about what "nearly 2 to 1" means either. Nearly 2 can be 1.5. Hell, ceil(1.1) is 2.
Google's search results have kept improving over time. For my technical searches (specifically related to programming), no other search engine even comes close to getting me the results I want. I do have Google's Web History turned on, which most likely allows Google to tailor their results to what I actually want.
There is a difference between quality and freshness. I agree some SERPS on DDG look higher quality, but then when you dig down in the results you find out why: They are all safe choices. They could be pages from 2005. They could be pages that were once authoritative, but now lack topicality and news.
> Hell, the quality of Google is so low that Bing is actually running ads right now with blind taste tests where people preferred bing
This is more marketing than research.
> Page rank was really cutting edge, but that was 10 years ago, yet it is still their primary mechanism
It is one of some 200 factors. Also, there is internal PageRank and world-visible PageRank. Besides, Google has been doing a lot with author rank and mentions.
> It's been gamed, but they seem uninterested in moving to more sophisticated mechanisms (they use them but the influence of better methods seems to be too low)
Latent semantic indexing, query deserves diversity, query deserves freshness, detecting spam by following links in spam emails etc. There is no shortage of sophisticated methods.
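Of the methods named, latent semantic indexing is the easiest to sketch: factor a term-document count matrix with SVD and compare documents in the reduced "topic" space. The matrix and dimensions below are toy values invented for illustration:

```python
# Rough LSI sketch: SVD of a term-document matrix, then document
# similarity in a rank-k latent space.
import numpy as np

# Rows = terms, columns = 4 tiny "documents". Docs 0-1 share one
# vocabulary cluster, docs 2-3 share another.
A = np.array([
    [2, 1, 0, 0],   # "search"
    [1, 2, 0, 0],   # "engine"
    [0, 0, 2, 1],   # "spam"
    [0, 0, 1, 2],   # "links"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # keep the top-2 latent "topics"
docs = (np.diag(s[:k]) @ Vt[:k]).T   # document coordinates in latent space

def cos(a, b):
    """Cosine similarity between two latent-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Docs 0 and 1 end up nearly identical; docs 0 and 2 nearly orthogonal.
print(cos(docs[0], docs[1]), cos(docs[0], docs[2]))
```

The point of the reduction is that near-synonymous documents collapse onto the same latent direction even when their raw term vectors differ, which is one way to rank beyond literal keyword matches.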
> meanwhile they've used their bully pulpit to influence the web to conserve page juice which has backfired in such a way that actual links to authoritative and useful sites are lower ranked than spam links, making it easier to game
Pagerank hoarding is an old and crummy idea. Google webmaster guidelines even say it is not a good idea to hoard pagerank, as it reeks of manipulation. There is also a decay factor.
> Even when I joined the company in 2000, Google was doing more sophisticated link computation than you would observe from the classic PageRank papers. If you believe that Google stopped innovating in link analysis, that's a…
> (When wikipedia is using no-follow on relevant outbound links to pages that wikipedia is quoting or citing, things are fundamentally broken-
Spammers still exist. Spammers try to game healthy systems. Search engines are not broken just because Wikipedia tries to combat spammers... And links are not all there is. Googlebot still follows nofollowed Twitter links. Mentions (words without links) still count as a popularity vote for the things or people they mention.
> no site on the web has a more favored ranking position than wikipedia. Not to mention hand curation of pages. You can't even correct errors there without having them reverted by some know-nothing whose sole accomplishment is rising in the ranks of wikipedia editors, so its not like they need this to prevent spam.)
It is about adding spammy external sources. If Wikipedia links were dofollow, many more sources would be added, not because they are good sources but because doing so would be good marketing. Wikipedia is a shining example of a site that gets lots of inbound links, mentions, great content, and top-notch internal linking.
> This means the site that google unquestionably considers the most authoritative, when it cites a page that it considers authoritative, google gives that site no credibility.
If everything Wikipedia cites were worthless, no one would gain an unfair advantage by gaming Wikipedia. But Wikipedia cites are not worthless. If you Google for company A and company B and only company A appears on Wikipedia, what do you think of the quality difference between company A and B? If company A has 10,000 search results and company B has 1,000 search results, what does that say about the reach (social proof) of company A? Also, like the mention algorithm, check out: http://www.seobythesea.com/2012/01/named-entity-detection-in... (entity detection). Finally: not only who links to you counts for your quality/popularity, but also who you link to. Pages get rewarded for linking to quality resources.
> But let me create a web of sites that construct text that passes grammer parsers as "good english" but whose purpose is to spam keywords and link to each other and I can rank for those terms up close to wikipedia. (This is essentially what techcrunch is doing only they are having humans write low quality text instead of a computer.)
Reading level and quality of journalism on TechCrunch aside: the article talked of has-been blogs that cling to their earned reputation while producing spammy content. Firstly: they are playing with fire. Google or their users could decide enough is enough, and they'd lose that reputation or get hit with Panda. Then they are just another spammy, low-quality blog ranking somewhere around #1024. Secondly: blogs like TechCrunch have a big company and money behind them. They organize offline events, get mentioned in newspapers, or are the starting point of an online discussion about a start-up or SF drama. They employ well-known writers. All things being equal, it would be bad for Google to rank TechCrunch under a single-author amateur blog that started last month. Even without high quality, TechCrunch is relevant and popular.
> It's broken, and google broke it.
It is how it is. Use it to your advantage. Keep adding new fresh content and enjoy your pageviews. I know Bing isn't sending them my way...