Hacker News
Indexing the hidden web (manton.org)
56 points by J3L2404 on Apr 29, 2012 | 31 comments



I think people are thinking too much about beating Google solely with technology, which users clearly don't give a crap about. The best play against Google lies in branding a search engine.

I'd love to see someone pony up and purchase a first-generation search engine (e.g. Altavista, Lycos, Hotbot) that still has some nostalgic branding left. AOL, Yahoo, MSN have all been around since that generation but they've all gone through failed rebrandings in a struggle to keep up with Google.

Pair it with a solid technology backend (DuckDuckGo) and you might actually have a chance to pick up the biggest US internet user demographic (Gen-Y) that happened to grow up with those sites.


"It's time for a search engine that isn't all about ads."

How does a search engine make money then?


If you search for "HD monitor" [1] & [2], or any other highly competitive commercial product, the scale tips to more of an ad service than a search provider. Yeah, they've got to make money, but they also need to be careful about maintaining balance so as not to alienate users. Without a strong competitor, there's not much impetus to keep that balance.

[1] Search: https://www.google.com/search?btnG=1&pws=0&q=HD+moni...

[2] Hat tip: http://01100111011001010110010101101011.co.uk/2012/04/matt-c...


the ads are targeted, relevant ads. why shouldn't they show you ads for HD monitors when you search for HD monitor? if you search for an overly broad term like that, you're either going to see google's ads or you're going to see an advertisement from some other website masquerading as a top ten list or something.


On my monitor, all the links above the fold are Google ads. I'm not suggesting they stop showing ads. That's how they make money. It's just a question of balance between ads and organic search. The amount of screen real estate dedicated to PPC has been increasing over the last few years, though, and a lot of people are taking notice. Plus, Google comes off a bit hypocritical when Brin criticizes Facebook for being a walled garden.

Bing isn't too much better for this search phrase, but kudos to DDG for the Super User post atop the search results.


Why would ads be the only way they can make money?


You didn't answer the question.


I'm curious: What could another viable business model be for a search engine?

Back in the days of something like the yellow pages, it was also an advertising supported model (or pay-for-placement, something that might damage the credibility of a search engine). What I also found very interesting is the term 'yellow pages' is in the top 5 highest revenue generating search terms. [1]

[1]:http://en.wikipedia.org/wiki/Yellow_Pages


How about "I'll pay you $50 if you stop trying to repeatedly screw me, and drop the +1 buttons while you're at it". It's not like the average consumer can live easily without search, so why not straight up charge for it, like we do every other utility on the planet.

There is some broken cultural obsession with making this stuff free, and that Next Big Search must have automatic "web scale" appeal to the entire planet and cater to all needs and use cases simultaneously.

I'd happily pay for a curated, spam free index if that index satisfied 90% of my needs, retaining privacy raping, up-selling free search for the remaining 10%. About 50% of my search traffic is already in the form of keyword searches pointed at IMDB/Wikipedia/eBay/Amazon site-specific search, so why not just wrap this up for me.

Hugely complex automation, web scale, index freshness, instant search, deep web blah blah, and all the other costly noise I care a good deal less about. Sometimes thinking small doesn't hurt that much.

And if you look at what I'm saying from the right angle, you might even see a thousand untapped niches for industry/lifestyle specific companies collecting money from consumers in their segment and passing it to retailers who charge for access to their (otherwise hidden behind proprietary apps) catalogues. I've no idea why this isn't more common already, it's as if the entire industry has been frightened into believing that using computers to find things we already know exist is an insanely difficult task.


Downvoted for language, FYI.

"About 50% of my search traffic is already in the form of keyword searches pointed at IMDB/Wikipedia/eBay/Amazon site-specific search, so why not just wrap this up for me."

Do you like how DuckDuckGo handles this?


Fair enough point on the language, I've toned it down.

As for DDG, kinda but not really. There is a fundamental disconnect between my everyday needs on a computer and the kind of needs that Big Search apparently inexorably must accomplish (and fashionably heralded by the industry in forums like HN), that DDG attempts to emulate.

Most of the time I want to access my trusted vendors (bank, shop, eBay, ...) and a search engine is supposed to be my entry page to achieve that. But say, as I just searched on DDG now for "The Art Of Computer Programming", I get (from my perspective) trusted results from Wikipedia and Amazon, mixed in with crappy spam from freebookzone.com and so on.

My point is that I don't care about that freebookzone.com link, it's a (difficult) problem my ideal search engine would not be attempting to solve.

I am DMW, I pay you $50/year because you don't try to spell-correct programming language identifiers, and for the privilege of you knowing that 90% of the time when I search for a book I want you to send me to Amazon UK, even if I'm connected from Estonia, and you apply these rules because I told you them somehow (say, during signup). If your results are incomplete, then provide me with access to some separate deep web mining/ranking service, which I'd expect to be paid a cut of my subscription in return for that access.
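
The per-user rules described in this comment could be sketched roughly as follows. Everything here is an assumption for illustration: the `Rule` and `route` names, the explicit keyword prefix (the comment implies the engine would infer intent rather than need a prefix), and the placeholder hosts standing in for a user's trusted vendors and for the separate deep web service.

```python
from dataclasses import dataclass
from urllib.parse import quote_plus


@dataclass
class Rule:
    prefix: str    # keyword the user types first, e.g. "book" (hypothetical)
    template: str  # destination search URL with a {q} placeholder


def route(query, rules, fallback):
    """Send the query to the first matching rule's site, else to the fallback."""
    head, _, rest = query.partition(" ")
    for rule in rules:
        if head.lower() == rule.prefix and rest:
            return rule.template.format(q=quote_plus(rest))
    # No rule matched: pass the query through verbatim, with no spell correction.
    return fallback.format(q=quote_plus(query))


# Placeholder hosts, not real services.
rules = [
    Rule("book", "https://books.example/search?q={q}"),
    Rule("film", "https://films.example/find?q={q}"),
]
fallback = "https://deepweb.example/search?q={q}"

print(route("book The Art Of Computer Programming", rules, fallback))
print(route("strncpy_s", rules, fallback))
```

The rules are stored per user rather than derived from location, which is the point of the Estonia example: the destination depends on what the subscriber asked for at signup, not on where the request comes from.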


I know that domain specific search engines for finance charge per usage.


Lots of things come to mind ...

1) paying for fine-grained index controls, e.g. you publish something new, head over to the search engine and tell it to spider your site, or you tell it to spider it between 2am and 3am, whatever. You could also use this to test updates you're making ... imagine being able to do a dry run on your new version and see whether it is going to cripple your SE traffic. Or get your new article analyzed before you publish it.

2) dress listings up à la eBay ... not sure if they still do it but they used to do cheesy crap that let you make your listing stand out more than the other guys, if there's a tasteful way to do that on a CPM basis it would print money

3) charging for an API like Bing etc. are doing

4) charge for reports on phrases, websites, industries

5) charge for telling you why your competitors are outranking you

6) charge sites a subscription ... AOL probably gets most of their traffic straight from Google, and they probably deserve almost none of it, so make them pay for that traffic. The large, eyeball-driven sites could easily be discriminated against.

7) charge low quality sites to un-penalize them. This is not a pardon, it's just a reset and it'll eat into their margins but whatever, they need your traffic.

This all revolves around two things: tax garbage sites, and provide tools for legitimate sites. These feel like low hanging fruits to me, there'd have to be much more interesting ways to monetize it than these.

The only hard part really is getting people to give a shit that you made / have / are a search engine.


There are two big problems with this:

1. You can get a lot of that data and those tools for free already. Webmaster Tools and Google Analytics.

2. Many of the things you'd charge for are things people don't understand anyway. In our little hacker bubble we know how valuable this stuff is and see a fair amount of companies use it but the vast majority of websites are operated by mom and pop shops and mom and pop can barely figure out how to turn on their computer. Expecting them to have any interest in getting or interpreting those reports is like trying to get them to learn quantum theory. You'll end up with a very limited customer base.

These paid options create an unfair advantage. It's the exact reason why Google was so successful. Google is trusted and popular because it isn't a pay-to-play system. People will quickly figure out that the rankings are biased and quit using the engine. This is a step backwards in search.

Saying that charging to unpenalize a site isn't a pardon but a "reset" is disingenuous. Call it whatever you'd like but in the end it really is a pardon. The idea is to discourage sites from gaming the system and your whole idea is to encourage them to. What we'll end up with in the end is that what you call "garbage sites" are just sites without a lot of money and "legit sites" are those with money.

I'm sorry but your plan just takes us back to the pre-google dark ages of search.


1) You can't get any of those things from Google at all. You can get a few little morsels of vagueness from GWT which is free because it doesn't do anything worth paying for. And GA is a whole other service that has little to do with anything I described. Probably the only decent tool they offer is the AdWords keyword research tool and again ... it's not worth paying for, you have to come up with the keywords yourself... that's not useful. There's a whole industry of SEO tools like http://ginzametrics.com/ and of course http://seomoz.org/ that aren't cheap and compensate for the lack of 1st party tools.

2) There's a whole SEO industry that operates on a hazy interpretation of what Google is supposed to be doing these days ... lots of companies know what SEO is, they know what it does, they know why they need it, and they pay out the arse for it. This brings clarity to that industry and those companies instead of letting them reverse engineer the changes you make and speculate on what matters. If they're willing to pay $100s/hr for SEO they'll surely pay $1000s for a roadmap straight from the source. That's like a printing press for money because that data expires when you act on it.

Money creates an unfair advantage right now. Pay people to spam backlinks to your website and you'll rate higher. Pay people to write summaries of blog posts and eventually you'll rate higher than those blogs you're sourcing your content from just because you can afford to generate more content faster. Pay people to submit and vote on digg, reddit, bla bla bla. Pay people to write about your product and create content. Pay people to market your site by writing content tailored for social media communities and get 1000s of backlinks. Pay people to do viral marketing stuff. Pay people to link to you. Pay Google to feature you above the search results.

The 'pardoning' is a little scammy and would be difficult to implement but the goal isn't to encourage them to take advantage of the system, the goal is to get your share because they're going to take advantage of it regardless. Google does this already via AdSense.


One of us doesn't get it. Maybe I'm not understanding but I don't see how what you describe would be any different than how it is today. If search is all about the most relevant results then the engines would still operate much the same as they do now so money would still create an unfair advantage and reward scammers who would still do everything you described in addition to using the paid features.

Furthermore I don't think there is a way to get data that is any less vague than it is now. Each site is so unique that this solution can't scale and you'd have to settle for analyzing the data yourself. Also, Analytics does have to do with what you described when it comes to seeing what's working as far as SEO goes and yes, AdWords would be more appropriate as an example when talking about competitor and keyword research. For some reason I thought those tools were in GA.

Generally though this really seems like a return to the bad old days except instead of keyword stuffing your meta tags you pay to play. Your whole plan would lead to the end of truly organic results. Yeah, the system already gets gamed now but at least everyone has an equal shot at gaming it. All you need is the knowledge. The current paid techniques of gaming the system would simply shift from third parties to the search engines themselves. I also feel like what you describe is closer to a paid directory with search functionality than a search engine. I mean, even if it worked like search does today plus those paid features it wouldn't be long before the true search functionality became irrelevant and we'd be left with a directory where whoever paid the most came out on top.

To your credit, I agree that it would be nice to get some more data, better data, and data presented in a more human-friendly/layperson-friendly way but you lose me as soon as you get into a lot of these paid features that help you rank higher.


Today everything SEO is a combination of educated guesses and common consensus. Even with incomplete or flat out wrong information people still successfully manipulate rankings to push good or bad content higher.

There is nobody out there who knows exactly what is going on or whether your redesign is going to help or harm or whether your content is the best it can be. But they will charge you lots of money to apply what they've observed to work before or to automate processes and monitoring and performance.

All of this happens today without any specific clarity into how Google works, I don't think it would worsen the situation if the guesswork was taken out of the equation - sites with no SEO still won't matter, sites with SEO still will, and bad people/sites will still be an on-going game of whack-a-mole.


I think the next innovation in search is crowd-sourced search. Users contribute directly to the index through a browser extension or somesuch. That way, you can get the site itself, how popular a site is by how many people visit it, and you also get the referrers.

I experimented with this idea about a month ago. You can grab the source here (https://github.com/SeditiousTech/Avina) and visit the index here (avina.apphb.com). It's not a real search engine per se, but it is/was a pretty cool experiment. One of the problems is that people will forget to turn the extension off when accessing personal information (banking, porn etc).
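
A minimal sketch of the aggregation idea in this comment, assuming each extension install reports a (URL, referrer) pair per page view. The `CrowdIndex` class and its methods are hypothetical and are not taken from the Avina source; the host blocklist is one possible answer to the "forgot to turn it off" problem, not something the comment proposes.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


class CrowdIndex:
    """Hypothetical in-memory tally of page visits reported by users."""

    def __init__(self, private_hosts=()):
        # Hosts the extension should never report (assumed user-configured).
        self.private_hosts = set(private_hosts)
        self.visits = Counter()                # url -> number of reported visits
        self.referrers = defaultdict(Counter)  # url -> referrer -> count

    def report(self, url, referrer=None):
        """Record one visit; return False if the host is on the blocklist."""
        host = urlsplit(url).hostname or ""
        if host in self.private_hosts:
            return False
        self.visits[url] += 1
        if referrer:
            self.referrers[url][referrer] += 1
        return True

    def popular(self, n=10):
        """Return the n most visited URLs with their counts."""
        return self.visits.most_common(n)


index = CrowdIndex(private_hosts={"bank.example"})
index.report("https://example.org/post", referrer="https://news.example/item")
index.report("https://example.org/post")
index.report("https://bank.example/account")  # dropped, never stored
```

Visit counts give the popularity signal and the per-URL referrer tallies give the link graph, which are the two things the comment says a crowd-sourced index would collect beyond the page itself.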


Google already use social signals. This is why they want +1 buttons everywhere. They have many more social signals than just the buttons though.


If it can be indexed then it's not hidden, right? Though I guess in this context hidden doesn't mean 'hidden on purpose', more that it's inaccessible.


Who has a leg up in this? DuckDuckGo maybe? The more I use them the more I love them...


I like DuckDuckGo for some uses, mainly those that let me specify the context of the search term (for example http://duckduckgo.com/?q=firefly ). However it's inferior to Google search for local content (from my country) or understanding strange error strings I get while programming.


Did anyone else read this as having a subtext that basically implies Google needs to be taken down?

Why is it that when we talk about making search better many (seriously, like gobs of people) talk like the only way to improve it is to overthrow Google? Improvements in search can come from anywhere. If Google can deliver on what the author talks about then that's great. If someone else can then that's great too. The point is to improve search not overthrow Google's dominance, right?

I'm all for the ideas in this article but I was totally turned off by the subtext that implied a need to take down Google. We don't really need a next Google. Google can be the next Google. It doesn't matter so long as the hidden web is indexed.

Why is it that as soon as a company is no longer the underdog we immediately throw them under the bus? Microsoft made PCs the norm in US households and now we love to tear them down (rightly so in many cases, admittedly). Facebook used to be the coolest thing ever and now we love to hate them too. Same with Google and Apple. Why do we hate incumbents so badly?

In any case, yes, let's index that hidden web. But let's focus on the indexing itself rather than who does it. If Google succeeds at doing this will it not count and will we still call for someone else to "disrupt" the new hidden web indexing industry?


I think we love hating incumbents because of the history. Modern incumbents arguably aren't that bad at all, but in the old days incumbents were always the ones putting a handbrake on progress.

For instance, a whole city rioting to break new looms because it was putting "honest weavers" out of business.

Or the publishing world rioting against anything that smells of sharing ... since forever.

Google surprisingly doesn't act like an incumbent at all. And that's good. We shouldn't hate on them because they are incumbents since they're doing a damn good job at it.

edit: Also the whole idea that "when a market is dominated by a single player, that market is ripe for disruption."


Google excels at some things but is useless at others and should be replaced for those - you shouldn't have to come crying to HN after Google banned your account with years of email, or thousands in adsense revenue, or whatever, in the hope that a Google employee here might see it and act on it.


They should be replaced? Why? See, that's my whole point. Why can't they simply correct what's wrong? What they should really do is fix some of their problems. It doesn't have to take a competitor for this to happen and Google has historically been pretty good at getting better over time. Why do we place so much focus on who gets the job done when what we should really be focusing on is simply getting the job done no matter who does it?

And your examples of losing email or Adsense accounts aren't so solid. Those are really edge cases and it's a problem endemic to creating applications that need to catch abuse, especially when the user base is so enormous. We know computers aren't people and they can't exactly think, so considering the amount of data Google has to filter through, and knowing you'll never write code that's one-size-fits-all, I think they're doing a good job. I'd presume any competitor would have similar problems once they grow to a certain size.

I expect someone to call me out for saying you can never write one-size-fits-all code -- to them I'd say it's true; as long as humans continue to be fallible then so will the systems we create. There will always be an edge case and it'll take every last one to pop up before we can even conceive of trying to catch them all. But that's off track so I'll end it here.


It's far, far too generous to just forgive them and write off their problems as being an inevitable result of Google's scale - their scale makes support expensive, not impossible. Support is a problem plenty of giant companies have figured out already even if they do it poorly.

As for the 'who' ... doesn't really matter whether it's Google or someone else that fixes whatever problems but historically it's not in their DNA to care about individual users so it feels quite natural to assume they'd be replaced rather than repaired.


I'd like to see the actual hidden web indexed. I'm not talking about data behind apps, I mean a browser with built-in onion support, and onion-google.


yeah, when I saw the title I thought it's about tor indexing (which technically would be a pain)


Something that ignores robots.txt?


What? No, I'm talking about torweb here, what are you even



