
Indexing the hidden web - J3L2404
http://www.manton.org/2012/04/indexing_the_hidden.html
======
aslewofmice
I think people are thinking too much about beating Google solely with
technology, which users clearly don't give a crap about. The best play against
Google lies in branding a search engine.

I'd love to see someone pony up and purchase a first-generation search
engine (ie: Altavista, Lycos, Hotbot) that still has some nostalgic branding
left. AOL, Yahoo, MSN have all been around since that generation but they've
all gone through failed re-branding issues in a struggle to keep up with
Google.

Pair it with a solid technology backend (DuckDuckGo) and you might actually
have a chance to pick up the biggest US internet user demographic (Gen-Y) that
happened to grow up with those sites.

------
ricksta
"It's time for a search engine that isn't all about ads."

How does a search engine make money then?

~~~
benologist
Why would ads be the only way they can make money?

~~~
chalmerj
I'm curious: What could another viable business model be for a search engine?

Back in the days of something like the yellow pages, it was also an
advertising supported model (or pay-for-placement, something that might damage
the credibility of a search engine). What I also found very interesting is the
term 'yellow pages' is in the top 5 highest revenue generating search terms.
[1]

[1]:<http://en.wikipedia.org/wiki/Yellow_Pages>

~~~
benologist
Lots of things come to mind ...

1) paying for fine-grained index controls, eg you publish something new, head
over to search engine and tell it to spider your site, or you tell it to
spider it between 2 am and 3am, whatever. You could also use this to test
updates you're making ... imagine being able to do a dry run on your new
version and see whether this is going to cripple your SE traffic. Or get your new
article analyzed before you publish it.

2) dress listings up ala ebay ... not sure if they still do it but they used
to do cheesy crap that let you make your listing stand out more than the other
guys, if there's a tasteful way to do that on a CPM basis it would print money

3) charging for an api like bing etc are doing

4) charge for reports on phrases, websites, industries

5) charge for telling you why your competitors are outranking you

6) charge sites a subscription ... AOL probably gets most of their traffic
straight from Google, and they probably _deserve_ almost none of it, so make
them pay for that traffic. The large, eyeball-driven sites could easily be
discriminated against.

7) charge low quality sites to un-penalize them. This is not a pardon, it's
just a reset and it'll eat into their margins but whatever, they need your
traffic.

This all revolves around two things: tax garbage sites, and provide tools for
legitimate sites. These feel like low hanging fruits to me, there'd have to be
much more interesting ways to monetize it than these.

The only hard part really is getting people to give a shit that you made /
have / are a search engine.

~~~
billpatrianakos
There are two big problems with this:

1. You can get a lot of that data and those tools for free already. Webmaster
Tools and Google Analytics.

2. Many of the things you'd charge for are things people don't understand
anyway. In our little hacker bubble we know how valuable this stuff is and see
a fair amount of companies use it but the vast majority of websites are
operated by mom and pop shops and mom and pop can barely figure out how to
turn on their computer. Expecting them to have any interest in getting or
interpreting those reports is like trying to get them to learn quantum theory.
You'll end up with a very limited customer base.

These paid options create an unfair advantage. It's the exact reason why
Google was so successful. Google is trusted and popular because it _isn't_ a
pay-to-play system. People will quickly figure out that the rankings are
biased and quit using the engine. This is a step backwards in search.

Saying that charging to unpenalize a site isn't a pardon but a "reset" is
disingenuous. Call it whatever you'd like but in the end it really is a
pardon. The idea is to discourage sites from gaming the system and your whole
idea is to encourage them to. What we'll end up with in the end is that what
you call "garbage sites" are just sites without a lot of money and "legit
sites" are those with money.

I'm sorry but your plan just takes us back to the pre-google dark ages of
search.

~~~
benologist
1) You can't get any of those things from Google at all. You can get a few
little morsels of vagueness from GWT which is free because it doesn't do
anything worth paying for. And GA is a whole other service that has little to
do with anything I described. Probably the only _decent_ tool they offer is
the AdWords keyword research tool and again ... it's not worth paying for, you
have to come up with the keywords yourself... that's not useful. There's a
whole industry of SEO tools like <http://ginzametrics.com/> and of course
<http://seomoz.org/> that aren't cheap and compensate for the lack of 1st
party tools.

2) There's a whole SEO industry that operates on a hazy interpretation of what
Google is supposed to be doing these days ... lots of companies know what SEO
is, they know what it does, they know why they need it, and they pay out the
arse for it. This brings clarity to that industry and those companies instead
of letting them reverse engineer the changes you make and speculate on what
matters. If they're willing to pay $100s/hr for SEO they'll surely pay $1000s
for a roadmap straight from the source. That's like a printing press for money
because that data expires when you act on it.

Money creates an unfair advantage right now. Pay people to spam backlinks to
your website and you'll rate higher. Pay people to write summaries of blog
posts and eventually you'll rate higher than those blogs you're sourcing your
content from just because you can afford to generate more content faster. Pay
people to submit and vote on digg, reddit, bla bla bla. Pay people to write
about your product and create content. Pay people to market your site by
writing content tailored for social media communities and get 1000s of
backlinks. Pay people to do viral marketing stuff. Pay people to link to you.
Pay Google to feature you above the search results.

The 'pardoning' is a little scammy and would be difficult to implement but the
goal isn't to encourage them to take advantage of the system, the goal is to
get your share because they're going to take advantage of it regardless.
Google does this already via AdSense.

~~~
billpatrianakos
One of us doesn't get it. Maybe I'm not understanding but I don't see how what
you describe would be any different than how it is today. If search is all
about the most relevant results then the engines would still operate much the
same as they do now so money would still create an unfair advantage and reward
scammers who would still do everything you described in addition to using the
paid features.

Furthermore I don't think there is a way to get data that is any less vague
than it is now. Each site is so unique that this solution can't scale and
you'd have to settle for analyzing the data yourself. Also, Analytics does
have to do with what you described when it comes to seeing what's working as
far as SEO goes and yes, AdWords would be more appropriate as an example when
talking about competitor and keyword research. For some reason I thought those
tools were in GA.

Generally though this really seems like a return to the bad old days except
instead of keyword stuffing your meta tags you pay to play. Your whole plan
would lead to the end of truly organic results. Yeah, the system already gets
gamed now but at least everyone has an equal shot of gaming it. All you need
is the knowledge. The current paid techniques of gaming the system would
simply shift from third parties to the search engines themselves. I also feel
like what you describe is closer to a paid directory with search functionality
than a search engine. I mean, even if it worked like search does today plus
those paid features it wouldn't be long before the true search functionality
became irrelevant and we'd be left with a directory where whoever paid the
most came out on top.

To your credit, I agree that it would be nice to get some more data, better
data, and data presented in a more human-friendly/layperson-friendly way but
you lose me as soon as you get into a lot of these paid features that help you
rank higher.

~~~
benologist
Today everything SEO is a combination of educated guesses and common
consensus. Even with incomplete or flat out wrong information people still
successfully manipulate rankings to push good or bad content higher.

There is nobody out there who knows _exactly_ what is going on or whether your
redesign is going to help or harm or whether your content is the best it can
be. But they will charge you lots of money to apply what they've observed to
work before or to automate processes and monitoring and performance.

All of this happens today _without_ any specific clarity into how Google
works, I don't think it would worsen the situation if the guesswork was taken
out of the equation - sites with no SEO still won't matter, sites with SEO
still will, and bad people/sites will still be an on-going game of whack-a-
mole.

------
huragok
I think the next innovation in search is crowd-sourced search. Users
contribute directly to the index through a browser extension or somesuch. That
way, you can get the site itself, how popular a site is by how many people
visit it, and you also get the referrers.

I experimented with this idea about a month ago. You can grab the source here
(<https://github.com/SeditiousTech/Avina>) and visit the index here
(avina.apphb.com). It's not a real search engine per se, but it is/was a
pretty cool experiment. One of the problems is that people will forget to turn
the extension off when accessing personal information (banking, porn etc).
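
A minimal sketch of the idea described above, not the Avina code: an in-memory
index that counts visits and referrers as reports arrive from an extension.
The class name, method names and the `private_hosts` opt-out set are all
hypothetical, added only to illustrate the popularity/referrer bookkeeping and
one possible guard against the "forgot to turn it off" problem.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


class CrowdIndex:
    """Toy index fed by visit reports from a hypothetical browser extension."""

    def __init__(self, private_hosts=()):
        # Hostnames the user never wants reported (assumption, not in Avina).
        self.private_hosts = set(private_hosts)
        self.visits = Counter()                # url -> number of reported visits
        self.referrers = defaultdict(Counter)  # url -> referrer -> count

    def report_visit(self, url, referrer=None):
        """Record one visit; return False if the report was dropped."""
        parts = urlsplit(url)
        # Only accept ordinary web pages (drops file:, about:, etc.).
        if parts.scheme not in ("http", "https"):
            return False
        # Drop anything on a host the user marked as private.
        if parts.hostname in self.private_hosts:
            return False
        self.visits[url] += 1
        if referrer:
            self.referrers[url][referrer] += 1
        return True

    def top(self, n=10):
        """Return the n most visited URLs as (url, count) pairs."""
        return self.visits.most_common(n)
```

Popularity here is just a raw visit count; a real index would presumably need
to deduplicate users and resist spam reports.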

~~~
nikatwork
Google already use social signals. This is why they want +1 buttons
everywhere. They have many more social signals than just the buttons though.

------
therobotking
If it can be indexed then it's not hidden, right? Though I guess in this
context hidden doesn't mean 'hidden on purpose', more that it's inaccessible.

------
RawData
Who has a leg up in this? Duck duck go maybe? The more I use them the more I
love them...

~~~
eternauta3k
I like DuckDuckGo for some uses, mainly those that let me specify the context
of the search term (for example <http://duckduckgo.com/?q=firefly> ). However
it's inferior to Google search for local content (from my country) or
understanding strange error strings I get while programming.

------
billpatrianakos
Did anyone else read this as having a subtext that basically implies Google
needs to be taken down?

Why is it that when we talk about making search better many (seriously, like
gobs of people) talk like the only way to improve it is to overthrow Google?
Improvements in search can come from anywhere. If Google can deliver on what
the author talks about then that's great. If someone else can then that's
great too. The point is to improve search not overthrow Google's dominance,
right?

I'm all for the ideas in this article but I was totally turned off by the
subtext that implied a need to take down Google. We don't really need a next
Google. Google can be the next Google. It doesn't matter so long as the hidden
web is indexed.

Why is it that as soon as a company is no longer the underdog we immediately
throw them under the bus? Microsoft made PCs the norm in US households and now
we love to tear them down (rightly so in many cases, admittedly). Facebook
used to be the coolest thing ever and now we love to hate them too. Same with
Google and Apple. Why do we hate incumbents so badly?

In any case, yes, let's index that hidden web. But let's focus on the indexing
itself rather than who does it. If Google succeeds at doing this will it not
count and will we still call for someone else to "disrupt" the new hidden web
indexing industry?

~~~
Swizec
I think we love hating incumbents because of the history. Modern incumbents
arguably aren't that bad at all, but in the old days incumbents were always
the ones putting a handbrake on progress.

For instance, a whole city rioting to break new looms because it was putting
"honest weavers" out of business.

Or the publishing world rioting against anything that smells of sharing ...
since forever.

Google surprisingly doesn't act like an incumbent at all. And that's good. We
shouldn't hate on them because they are incumbents since they're doing a damn
good job at it.

edit: Also the whole idea that "When a market is dominated by a single player.
That market is ripe for disruption."

~~~
benologist
Google excels at some things but they are useless and _should_ be replaced
with others - you shouldn't have to come crying to HN after Google banned your
account with years of email, or thousands in adsense revenue, or whatever, in
the hope that a Google employee _here_ might see it and act on it.

~~~
billpatrianakos
They _should_ be replaced? Why? See, that's my whole point. Why can't they
simply correct what's wrong? What they should _really_ do is fix some of their
problems. It doesn't have to take a competitor for this to happen and Google
has historically been pretty good at getting better over time. Why do we place
so much focus on _who_ gets the job done when what we should really be
focusing on is simply getting the job done no matter who does it?

And your examples of losing email or Adsense accounts aren't so solid. Those
are really edge cases and it's a problem endemic to creating applications that
need to catch abuse especially when the user base is so enormous. We know
computers aren't people and they can't exactly think so considering the amount
of data Google has to filter through and knowing you'll never write code
that's one-size-fits-all I think they're doing a good job. I'd presume any
competitor would have similar problems once they grow to a certain size.

I expect someone to call me out for saying you can never write one-size-fits-
all code -- to them I'd say it's true; as long as humans continue to be
fallible then so will the systems we create. There will always be an edge case
and it'll take every last one to pop up before we can even conceive of trying
to catch them all. But that's off track so I'll end it here.

~~~
benologist
It's far, far too generous to just forgive them and write off their problems
as being an inevitable result of Google's scale - their scale makes support
expensive, not impossible. Support is a problem plenty of giant companies have
figured out already even if they do it poorly.

As for the 'who' ... doesn't really matter whether it's Google or someone else
that fixes whatever problems but historically it's not in their DNA to care
about individual users so it feels quite natural to assume they'd be replaced
rather than repaired.

------
voxx
I'd like to see the actual hidden web indexed. I'm not talking about data
behind apps, I mean a browser with built-in onion support, and onion-google.

~~~
DanBC
Something that ignores robots.txt?

~~~
voxx
What? No, I'm talking about torweb here, what are you even

