The Web Is A Mess (mkronline.com)
91 points by gliese1337 1605 days ago | 59 comments



One innovative "solution" is millionshort:

http://millionshort.com/

Basically the idea is that you take sites that perform too well on search metrics and remove them. Of course, that only works as long as a majority uses services like Google.
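
The mechanics are simple enough to sketch. Assuming a downloaded top-sites list (a hypothetical "rank,domain" CSV - not Million Short's actual implementation), the filter is just a set lookup on each result's host:

  # Minimal sketch of the Million Short idea: drop results whose domain
  # appears in a "top N sites" list. The CSV path and format are assumed.
  from urllib.parse import urlparse

  def load_top_domains(path, n=1000000):
      with open(path) as f:
          return {line.split(",")[1].strip() for line in list(f)[:n]}

  def million_short(result_urls, top_domains):
      kept = []
      for url in result_urls:
          host = urlparse(url).netloc.lower()
          if host.startswith("www."):
              host = host[4:]
          if host not in top_domains:   # keep only the long tail
              kept.append(url)
      return kept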


Glad you like it. We'll be releasing some interesting features in the coming days and weeks.


Cool idea. I tried it just now: I searched "how to make rocket", thinking I'd get some cool old-style pages by people launching rockets in their spare time. I did.

However, the top result when I removed the top 10k sites was merely the second result when I removed nothing. That was kind of disappointing.


Yeah, I just tried a sample search and it returned mostly the same results, no matter how many sites I removed. They were just in a different order.


I remember seeing this on HN a while back. It looks much more polished now.


I couldn't find it by searching, but I remember it being demoed on HN by the author.


http://www.hnsearch.com/search#request/submissions&q=mil...

HN's search won't return results with a domain if you don't put the TLD in. Not sure why.


> Basically the idea is that you take sites that perform too well on search metrics and remove them.

Some of the results remind me of the Web from the late 1990s: definitely more than just the original group of geeks writing pages, but the mega-popular sites (Wikipedia, about.com, etc.) aren't there, just as they weren't there 15 or so years ago. It's almost homey again.


"You remember way back in the early ’00s when your favorite blogs posted a few times a day at most, had a handful of great writers, and were a joy to read."

I fondly remember the golden days of Usenet. Eternal September was nothing compared to the devastation caused by blogs and twitter.


Yup, and you got your files via FTP. You searched for them using Archie. There was no web, but there was Gopher, and Veronica was the search engine for Gopherspace.

Instead of Twittering, everybody was chatting on IRC. And the internet wasn't corrupted by commercialism yet, the way the Web is today.


Heh, fun times. I had a book [1] out on how to use all these different services around 1995, and it was obsolete within a year of publication. I thought the web was the hot new thing too, but I didn't expect it to so thoroughly destroy most of the other vectors.

1. http://www.amazon.com/gp/product/1898275351?SubscriptionId=0... ...obscure enough that it's priced like a rare book, ha ha.


"Instead of"? People still chat on IRC. I'm on IRC right now.


mostly those who 'still'.


I don't see the problem he's writing about at all. My favourite blogs don't post daily; some authors post 1-3 times a month[1]. Others are bringing co-bloggers on board, and even then only come up with a few posts a week[2].

That's why RSS must live on. It takes over 200 subscriptions to get a decent daily dose of reading out of blogs like that.

[1] http://thelastpsychiatrist.com/ [2] http://www.overcomingbias.com/
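
For what it's worth, wiring up that many subscriptions is only a few lines with the third-party feedparser package (a rough sketch; the feeds file and its one-URL-per-line format are assumptions for illustration):

  # Rough sketch: pull the newest posts from a long list of RSS feeds.
  # Requires the third-party feedparser package (pip install feedparser).
  import feedparser

  def latest_posts(feeds_file, per_feed=3):
      posts = []
      for url in open(feeds_file).read().split():
          feed = feedparser.parse(url)
          for entry in feed.entries[:per_feed]:
              when = entry.get("published_parsed") or entry.get("updated_parsed")
              if when:
                  posts.append((when, feed.feed.get("title", url), entry.get("title", "")))
      return sorted(posts, reverse=True)   # newest entries first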


We are entering the era where successful blogging will be the domain of already successful, high visibility figures like pg or Fred Wilson. The attempts of the popular blogs to 'scale' (which basically meant delegating content-creation to poorly-paid underlings who fight over ever-shrinking scoops) have failed miserably. (Personally, I don't touch lifehacker or techcrunch with a 10-foot pole these days, to name two I used to read).

The central problem, I believe, is that we don't have a good model to predict a piece of information's relevance to a given reader. So far in human history the mechanisms that solve this problem have been ad hoc and error-prone - only by luck do you ever 'fall in with the right crowd' and start to get the information that you craved all along. Personally I feel strongly that this is a solvable problem, and when it is solved it will change human society forever.


Why would banning content farms draw the eye of regulators when the content farms are gaming the search engines?

Because search engines fell asleep at the wheel and now content farms are a sizable chunk of the economy?

And they can always release some features like curated/customized search to weasel around any regulation dangers. Maybe search engines just don't care about the health of the web.


> Why would banning content farms draw the eye of regulators when the content farms are gaming the search engines?

I don't think OP meant content farms. I think he meant half of the sites whose articles repeatedly appear on HN. All those smaller and bigger "news" services that write a lot of poor, shallow, misleading, or just plain wrong controversy-bait articles. The problem is with incentives: they do it for ad money, not to inform/educate people.


What are those sites repeatedly appearing on HN if not content farms? What do you want to call them? They game news aggregation sites and search engines. They push out so much content that many users think farm content is normal and upvote it.


Is it impossible to give people an incentive to inform/educate others? I think this even worked for a short while, but I'm no expert on the matter.


I'm not sure. From what I can tell, it's a difficult and delicate topic, as intrinsic incentives tend to be overridden by extrinsic ones (e.g. pay someone for doing what he loves and he might soon start doing a worse job at it). See the RSA Animate talk about motivation[0].

But hey, we're a clever species, I do believe that someone will figure out how to structure reality so that we get more of the things we really want without using proxy incentives that backfire when overdone.

[0] - http://www.youtube.com/watch?v=u6XAPnuFjJc


Wow. I get a feeling that I saw it before, but I'm happy I got to watch this again. Now, doesn't the solution seem obvious? Shouldn't there be a cap on how much you can earn from ads on your site? Just enough "to take the issue of money off the table" and not a cent more?

Ok, this really is not my area of expertise. I just feel that what we have now is both unfair to writers and disastrous for society :(


> Shouldn't there be a cap on how much you can earn from ads on your site? Just enough "to take the issue of money off the table" and not a cent more?

It could work for bloggers, but they're not the problem. The "news" services / aggregators / whatever are, and it's hard to put a cap on what a company should earn (I'm not even sure if it is a Right Thing to do).

> I just feel that what we have now is both unfair to writers and disastrous for society :(

Couldn't have said it better myself.


I agree with the "disastrous for society" part. Information is becoming very hard to verify, and this year I have seen with my own eyes news being censored even from "fire hose" aggregators like Google News.


What particularly annoys me are article titles on popular news sites, which are often plain lies (that the articles themselves later correct) intended to lure people into viewing the full articles. The point is, people don't read all the articles, but they do skim the list of them, and they remember the lies from the titles.


> Because search engines fell asleep at the wheel and now content farms are a sizable chunk of the economy?

Fell asleep at the wheel? The major search engines run their own ad networks. They're content farms' key customers, and content farms are a core element in their business strategy.


Well, it's hard to say how close the relationship is. Google changed its algorithm to target content farms a while ago, which caused quite a stir.

They definitely see the problem. The current ad/SEO setup incentivizes all sites to behave like content farms: duplicated content, quantity over quality, short articles, etc.


I think they're just trying to walk a line.

Sort of like how radio stations make a bunch of hubbub about doing blocks of X minutes with no advertisements. It's not because they don't like ads, it's because they've also got to deliver enough of what their audience is looking for to keep them coming back.


Who uses Google to find current news articles, though? I think this is mostly a non-issue, as people use aggregators like HN here, or subreddits, or other news reader apps to filter top stories.

Google is used for what I would call "archive search" and research these days (how to do X, etc) in my experience, not current news.


I get my daily headlines from Google News as a matter of course. I find Reddit unusable for day-to-day news, and only find aggregators such as HN valuable in proportion to their relatively narrow focus. Google News used to be pretty excellent for news discovery, but it's being SEO'ed into the ground. In recent months I've started seeing letters to the editor presented as news stories.


Even Reddit can be difficult for daily news.

You have to finely tune your subreddit subscriptions, otherwise you get a meme whiteout from some of the other subs.

Crafted news content matching what I want, with little overhead, is an area where I think you could see some disruption.


I only use Google to find news I've heard about from in-person conversations.

I know quite a few people who visit The Verge and similar sites on a daily basis, and they have no intention of reading all their content.

I think it's better to think of those high volume sites as newspapers rather than blogs. Most people do not have the time to read an entire newspaper, but that doesn't make it any less valuable.


If I'm just looking for any news in whatever topic, then I'll use the sources you mentioned. However, if I'm looking for more information on a specific subject (e.g. features of Android 4.2), then I'll absolutely Google it.


I went online to escape the mindless reporting and tabloid news that had taken over television. But it looks like they chased me onto the internet.


"Most top blogs don’t deserve the top slot anymore."

What "top slot?" Blogs aren't ranked by a committee. I can only assume OP means traffic. The blogs with the most traffic have the most for a reason, and OP still doesn't realize why after many years in the industry.

I'm sure the author thinks the things he wants to have the "top slot" deserve the "top slot." The naivety.

"If I ran a search engine, I would ban these sites from the index."

I wish someone like you was around to curate the Internet for me.


true blogs are dead (yes, scotsman). I sometimes miss the old internet. now get off my lawn.

ironically, OP's blog's purpose is also to sell ads and ebooks. In the early '00s he's talking about, people blogged just because they felt like it and didn't complain about monetizing strategies. techcrunch? what techcrunch? 2005 is not 2000.


I think the author is concerned exactly because his blog's purpose is to sell ads. He even wrote it at the end of his post. I guess he just would like to sell ads using quality content - his own content - and not by aggregating tons of crap and producing even more of it. Maybe he even tries - I don't know, I didn't read anything besides this post. But the fact is that this doesn't work anymore - and that's just sad.


the whole premise of his post is completely flawed. he talks like he misses 'the old days', but in the old days nobody gave a crap about ads. he's complaining that he's too weak to make a living off blogging, while back then nobody even thought of that. 'the blogosphere' meant linked blogs of friends and foafs, not cross-posting every piece of turd to 10 social networks and upvote sites so you get more likes and clicks.


Did anyone else get that awesome ad at the bottom, just below where he talks about how crap ad money is? One of those classic "you have been selected to win an iPad" ones. It clashed so starkly with the content that I thought for a second the entire post was satire.


I actually think that this change comes more from social media than from SEO. Both are excellent sources of traffic, but optimized a bit differently. The end result of these "optimizations" (including optimizing for ads) is a terrible experience and almost completely worthless content. Yet, sometimes, I cannot help but click...

I see a site like Mashable as totally optimized for social and ads. I remember a point in time when the site was interesting--when APIs seemed like a new thing and hackers were "mashing" sites together on a scale that had never been seen before. It chronicled the new web 2.0 trends (which I have to admit were really powerful--no one can deny that the landscape of the Internet was changing).

But now it's just a bunch of crap. Top ten lists (which, of course, make you click through each item so you'll refresh ads). And infographics: they used to be cool. Now they're just stupid charts with a graphical background and font. Every time I click one I think to myself: this didn't have to be graphical, and it's not very informational.

Like movie trailers, the headlines of these traffic-hoarders are catchy just to get you to click, and once you arrive you're disappointed and 10 cookies have been dropped on you. Or you see some modal covering the content, asking you to do something that will help the site proliferate itself. If you can stomach all this deception and read the meat of the article, most of the time you don't feel very satisfied. You click the back button, unless you're tricked into clicking another of their links.

I think the only real way to stop this is to significantly demote sites that have more than one or two ad units on the page.


Blogs are not the web. I didn't have any favorite blogs in the early '00s, and I don't have any now. I don't read TechCrunch, and I don't read LifeHacker. I don't pay attention to Twitter. The web is fine.


One key difference between now and "the early '00s" is the number of people using the web.


Wait. Blogs posting a few times a day at most? I think their definition of "blog" is now closer to a news website. Overall I'd agree that quality has gone down - it happens anywhere there's substantial money to be made. But I follow more personal blogs than I can keep up with, and most of them are what I'd call extremely high quality.

I don't think there was any "mind" that kept quality high and volume low. Volume was just lower, there were fewer interested parties, and they were less motivated by money (because there was less money to be had). There are ridiculous quantities of excellent lower-volume "blogs", though; you can hardly call them dead just because you're using the populist channels to try to find them. Is that how you found them in the '00s?


Sounds like a plug for Blekko's slashtag system. You could always create your own, and then never have to see TechCrunch or other spammy blogs ever again.


Google is gamed like crazy, especially by sites like Tripadvisor, Skyscanner, etc. Whenever I search for something like 'flights from la paz to new york', all I get is landing pages.

Tripadvisor's gaming is even worse. When I search for 'restaurants in place_x' I get results from every Tripadvisor site - tripadvisor.com, tripadvisor.es, tripadvisor.in, and more - and the results are duplicates!

The quality of Google results in certain niches is very, very poor. I haven't found that Bing is better, though. =/ It's just amazing that they haven't been able to make considerable improvements here with the amount of money they have. I guess that's the problem with a lack of competition.
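
A search engine could collapse those ccTLD mirrors cheaply. A toy sketch, crudely keying on the second-level label plus path (obviously not how any real engine canonicalizes URLs):

  # Toy sketch: collapse duplicate results served from a site's many
  # ccTLD mirrors (tripadvisor.com / .es / .in ...) onto one entry.
  from urllib.parse import urlparse

  def dedupe_mirrors(urls):
      seen, unique = set(), []
      for url in urls:
          parsed = urlparse(url)
          host = parsed.netloc.lower()
          if host.startswith("www."):
              host = host[4:]
          # crude canonical key: second-level label + path
          key = (host.split(".")[0], parsed.path)
          if key not in seen:
              seen.add(key)
              unique.append(url)
      return unique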


Video search is even worse - see this search result, offering endless pages of the exact same video: https://www.google.com/search?q=%22CHASTE+DANCING%22&hl=...


Exactly. Most people think that SEO spam is something only done by shady Eastern European viagra-selling rings. In reality, the worst offenders are multi-billion-dollar companies that capitalize on the strength of their domains to bloat the index with countless pages of keyword-rich fluff.


I'm mostly using Zite to find relevant news. As I understand it, it takes my RSS feed, combines it with tweets from people I follow, and finds the most relevant news articles for me. Then I can upvote or downvote the articles to teach it what I find interesting.

Zite seems to be on the right track to fix the "web is a mess" problem...


At least to me, the author seems like someone who values and wants more content built on "Slow Web" principles: timely, high-quality content and interaction.



Just like life may be. But both are important and beautiful too.


This is the failure of Google. I stopped using Google about 6 months ago and started using DuckDuckGo. One of the reasons I stopped was that the quality was so low.

Hell, the quality of Google is so low that Bing is actually running ads right now with blind taste tests where people preferred Bing. Of course this isn't scientific at all, but my point is: no big scandal has erupted about how wrong this is. It's totally plausible for Bing to do this because everyone realizes that Google has gotten to the point where Microsoft can plausibly compete with them!

PageRank was really cutting edge, but that was 10 years ago, yet it is still their primary mechanism. It's been gamed, but they seem uninterested in moving to more sophisticated mechanisms (they use them, but the influence of the better methods seems to be too low). Meanwhile, they've used their bully pulpit to push the web to conserve page juice, which has backfired in such a way that actual links to authoritative and useful sites are ranked lower than spam links, making it easier to game.

When Wikipedia is using nofollow on relevant outbound links to pages that Wikipedia is quoting or citing, things are fundamentally broken: no site on the web has a more favored ranking position than Wikipedia. Not to mention hand curation of pages; you can't even correct errors there without having them reverted by some know-nothing whose sole accomplishment is rising in the ranks of Wikipedia editors, so it's not like they need this to prevent spam.

This means that when the site Google unquestionably considers the most authoritative cites a page it considers authoritative, Google gives that citation no credibility. But let me create a web of sites that construct text that passes grammar parsers as "good English" but whose purpose is to spam keywords and link to each other, and I can rank for those terms up close to Wikipedia. (This is essentially what TechCrunch is doing, only they have humans write the low-quality text instead of a computer.)

It's broken, and Google broke it.


I disagree with a number of your points. I've noticed a huge reduction in the amount of spammy content I see in Google results (over the last year, maybe), to the point that I actually make "productName reviews" searches again.

Your focus on PageRank is at least somewhat off base, considering that it's only one factor in ranking, as noted by someone below, and it's a signal that pretty much all the search engines use as well ("The Bing ranking algorithm analyzes many factors, including but not limited to: ... the number, relevance, and authoritative quality of websites that link to your webpages").

There is something to be said for no-follow links being a symptom of something broken. OTOH, PageRank is still a good indicator of what people out on the web find to be useful and relevant content, allowing you to find popular content, cluster it by subject, etc. - essentially crowdsourcing (a portion of) relevancy via something people do anyway. Gaming was inevitable, and no-follow is really more of a way to disincentivize spammers; the fact that with no-follow you get the spammers anyway (to get human eyeballs instead of crawlers') demonstrates that the motivation is always there. If a search engine trusts Wikipedia's outbound links, it doesn't have to obey no-follow in any case, but you still have the situation that everyone will have their own favorite "impartial" external links to add, not to mention people with a vested interest in the subject.

The possibility you forget in your "Microsoft can plausibly compete with [Google]" point (leaving aside the fact that most people are just ignoring Bing) is that Bing has improved, and that has nothing to do with Google breaking anything.


DDG works great for general searches, but for anything technical, I can plan on doing the DDG search and following up with "g!". I have DDG set as my default search engine and get tired of the double searches, but I want DDG to work better.


PageRank is no longer Google's primary mechanism. In fact, there is no single primary mechanism. Google has stated that the ranking you see when you search is the product of over 200 contributing factors.
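
For context, classic PageRank from the original papers boils down to this power iteration. A toy sketch for illustration only (per the above, the production system layers hundreds of other signals on top):

  # Toy power-iteration PageRank over a link graph {page: [pages it links to]}.
  # Assumes every link target is also a key in the graph.
  def pagerank(links, damping=0.85, iters=50):
      pages = list(links)
      n = len(pages)
      rank = {p: 1.0 / n for p in pages}
      for _ in range(iters):
          new = {p: (1.0 - damping) / n for p in pages}
          for p, outs in links.items():
              targets = outs or pages        # dangling page: spread rank evenly
              share = damping * rank[p] / len(targets)
              for q in targets:
                  new[q] += share
          rank = new
      return rank

  # e.g. pagerank({"a": ["b"], "b": ["a", "c"], "c": []})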


I think the 'personalization' they have been doing is the primary culprit.

When I search for technical things or news-related topics, I feel like I can actually see the 'tint'. It's like Google adds certain keywords to my searches, or reduces them to a much smaller subset.

When I search for something, sometimes it's because I want to find something I saw a while ago, but sometimes it is to get a new perspective on things. When you search for something and see the same opinion for the first 10 results you can tell how skewed it is.

Now I have to manually add 'criticism' or 'failure' to certain searches, or 'success' even. It's just weird.


Google went to crap when it started favoring news and blog publishers over all other content. Almost all the results are blogs/news these days.


Wikipedia links are nofollow to discourage people from spamming it, not because they are worried about losing link juice.


In the Bing It On Challenge from Bing, I picked Google 5 out of 5. Then I posted that on Facebook/Google+ as well, and more and more friends started posting telling me that they felt the exact same way AND that they too were getting better results from Google than Bing.

Yes, it is not a scientific study, but Bing's marketing isn't entirely scientific about what "nearly 2 to 1" means either. Nearly 2 can be 1.5. Hell, ceil(1.1) is 2.

Google's search results have been improving over time; for my technical searches (specifically related to programming), no other search engine even comes close to getting me the results I want. I do have Google's Web History turned on, which most likely allows Google to tailor their search results closer to what I will most likely want.


> This is the failure of Google. I stopped using Google about 6 months ago and started using DuckDuckGo. One of the reasons I stopped was that the quality was so low.

There is a difference between quality and freshness. I agree some SERPs on DDG look higher quality, but when you dig down into the results you find out why: they are all safe choices. They could be pages from 2005. They could be pages that were once authoritative, but now lack topicality and news.

> Hell, the quality of Google is so low that Bing is actually running ads right now with blind taste tests where people preferred Bing

This is more marketing than research.

> PageRank was really cutting edge, but that was 10 years ago, yet it is still their primary mechanism

It is one of 200 factors. Also, there is internal PageRank and world-visible PageRank. Besides, Google has been doing a lot with author rank and mentions.

> It's been gamed, but they seem uninterested in moving to more sophisticated mechanisms (they use them, but the influence of the better methods seems to be too low)

Latent semantic indexing, query-deserves-diversity, query-deserves-freshness, detecting spam by following links in spam emails, etc. There is no shortage of sophisticated methods.

> Meanwhile, they've used their bully pulpit to push the web to conserve page juice, which has backfired in such a way that actual links to authoritative and useful sites are ranked lower than spam links, making it easier to game

PageRank hoarding is an old and crummy idea. Google's webmaster guidelines even say it is not a good idea to hoard PageRank, as it reeks of manipulation. There is also a decay factor.

http://www.mattcutts.com/blog/pagerank-sculpting/

  Even when I joined the company in 2000, Google was doing 
  more sophisticated link computation than you would observe
  from the classic PageRank papers. If you believe that
  Google stopped innovating in link analysis, that’s a
  flawed assumption.
Spam links are hell-banned by manual and algorithmic review.

> When Wikipedia is using nofollow on relevant outbound links to pages that Wikipedia is quoting or citing, things are fundamentally broken...

Spammers still exist. Spammers try to game healthy systems. Search engines are not broken just because Wikipedia tries to combat spammers... And links are not all there is: Googlebot still follows nofollowed Twitter links, and mentions (words without links) still count as a popularity vote for the things or people they mention.

> ...no site on the web has a more favored ranking position than Wikipedia. Not to mention hand curation of pages; you can't even correct errors there without having them reverted by some know-nothing whose sole accomplishment is rising in the ranks of Wikipedia editors, so it's not like they need this to prevent spam.

It is about adding spammy external sources. If Wikipedia links were dofollow, many more sources would be added, not because they are good sources, but because they would do well with marketing. Wikipedia is a shining example of a site that gets lots of inbound links, mentions, great content, and top-notch internal linking.

> This means that when the site Google unquestionably considers the most authoritative cites a page it considers authoritative, Google gives that citation no credibility.

If all Wikipedia cites were worthless, no one would gain an unfair advantage from gaming Wikipedia. But Wikipedia cites are not worthless. If you Google companies A and B and only company A appears on Wikipedia, what do you think of the quality difference between them? If company A has 10,000 search results and company B has 1,000, what does that say about the reach (social proof) of company A? Also, like the mention algorithm, check out: http://www.seobythesea.com/2012/01/named-entity-detection-in... (entity detection). Finally: not only who links to you counts toward your quality/popularity, but also who you link to. Pages get rewarded for linking to quality resources.

> But let me create a web of sites that construct text that passes grammar parsers as "good English" but whose purpose is to spam keywords and link to each other, and I can rank for those terms up close to Wikipedia. (This is essentially what TechCrunch is doing, only they have humans write the low-quality text instead of a computer.)

Reading level and quality of journalism on TechCrunch aside: the article talked of has-been blogs that cling to their earned reputation while producing spammy content. Firstly, they are playing with fire. Google or their users could say "enough is enough", and they just lost their reputation or got hit with Panda. Then they are just another spammy, low-quality blog ranking somewhere around #1024. Secondly, blogs like TechCrunch have a big company and money behind them. They organize offline events, get mentioned in newspapers, or are the starting point of an online discussion about a start-up or SF drama. They employ well-known writers. All things being equal, it would be bad for Google to rank TechCrunch below a single-author amateur blog that started last month. Even absent high quality, TechCrunch is relevant and popular.

> It's broken, and Google broke it.

It is how it is. Use it to your advantage. Keep adding new fresh content and enjoy your pageviews. I know Bing isn't sending them my way...



