

A search engine built by the crowd - brequinn
http://blog.archify.com/2013/07/11/a-search-engine-built-by-the-crowd-that-does-not-suck/

======
jeremybencken
Great idea, but I think Google already uses this data.

Google Toolbar and now Chrome report this data back to Google, and most search
pros believe "serp bounce back" and "time on site" are key signals Google
uses.

PageRank and DwellRank are not either-or choices.

Here's my theory: Google uses PageRank to decide what pages to "try out" for a
query (i.e. display a page in the SERP for a sampling of queries). If the page
gets clicks AND has good "DwellRank" then it gets progressively better and
better rankings. If a new page enters that beats it, it falls.

This approach is very Googly -- they love to test. They love to decide if
product features are good or not by giving them a sampling of traffic. It
would be insane of them not to extend this approach to search.

So the upshot is, use "PageRank" to decide which pages deserve an audition,
and use "DwellRank" to decide the winners.

Since 40% of the clicks go to #1, 10% to #2, 8% to #3, etc., Google can audition
pages using DwellRank without affecting the experience of the majority of
their users.
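
A minimal sketch of the two-stage theory above, not a description of how Google actually ranks. The `Page` fields, `audition_size`, and `min_clicks` are all hypothetical names chosen for illustration: PageRank picks who gets an audition, and dwell time orders the auditioned pages that have collected enough clicks to be judged.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    pagerank: float       # link-based score, assumed precomputed
    dwell_seconds: float  # average time on page (hypothetical "DwellRank" input)
    clicks: int           # clicks received while being auditioned


def rank(pages, audition_size=3, min_clicks=10):
    # Stage 1: PageRank decides which pages deserve an audition.
    auditioned = sorted(pages, key=lambda p: p.pagerank, reverse=True)[:audition_size]

    # Stage 2: pages with enough clicks are ordered by dwell time.
    proven = [p for p in auditioned if p.clicks >= min_clicks]
    proven.sort(key=lambda p: p.dwell_seconds, reverse=True)

    # Pages without enough clicks keep their PageRank order at the end.
    unproven = [p for p in auditioned if p.clicks < min_clicks]
    return proven + unproven


pages = [
    Page("a.example", pagerank=0.9, dwell_seconds=10, clicks=100),
    Page("b.example", pagerank=0.8, dwell_seconds=60, clicks=50),
    Page("c.example", pagerank=0.7, dwell_seconds=300, clicks=2),
    Page("d.example", pagerank=0.1, dwell_seconds=999, clicks=999),
]
ranked_urls = [p.url for p in rank(pages)]
```

Here `d.example` never appears, however long its dwell time, because its PageRank is too low to earn an audition.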

------
gabemart
I was a little surprised that they haven't included anything about spam or
gamification. One core advantage of pagerank is that it's (relatively) hard to
get links from high-authority websites. I can't force whitehouse.gov or cnn.com
to link to me. If you rely on time-spent-on-page from millions of users and
treat everyone equally, how do you stop spammers from faking millions of hours
spent reading their content using spoofing or bots?

~~~
ufo
Another problem is that if you just count time browsed then sites such as
Facebook, Reddit and Kongregate get super high rankings.

~~~
geraldbaeck
We are counting per unique URL, which is currently not a big advantage for
the popular sites, because they host so many URLs. There is no domain-based
factor in it right now.

-Gerald Disclosure: I am the CTO of archify/Blippex
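
A small illustration of the per-URL point, assuming dwell events arrive as `(url, seconds)` pairs. The event format and both function names are hypothetical and not taken from Blippex's code. Counting per URL splits a large site's total across its many pages, while a per-domain sum would let it dominate.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def dwell_by_url(events):
    """Sum dwell seconds per unique URL."""
    totals = defaultdict(float)
    for url, seconds in events:
        totals[url] += seconds
    return dict(totals)


def dwell_by_domain(events):
    """Sum dwell seconds per hostname, for comparison only."""
    totals = defaultdict(float)
    for url, seconds in events:
        totals[urlsplit(url).hostname] += seconds
    return dict(totals)


events = [
    ("https://big.example/a", 60),
    ("https://big.example/b", 60),
    ("https://big.example/c", 60),
    ("https://small.example/post", 90),
]
per_url = dwell_by_url(events)
per_domain = dwell_by_domain(events)
top_url = max(per_url, key=per_url.get)
top_domain = max(per_domain, key=per_domain.get)
```

With these made-up numbers the small site's single article leads the per-URL totals, while the big site leads the per-domain totals.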

------
mcintyre1994
Hmm, this is an interesting algorithm, but I'd challenge its major assumption
for a lot of searches. I don't have metrics, so of course my own assumptions
can be challenged also, feel free to.

I think that a lot of search engine enquiries are essentially questions, with
an answer that can be considered correct. Absolutely not all, but I think
enough that they should certainly be considered. In that case, a site which
immediately and clearly answers the question should be returned; I want my answer
within seconds, not minutes. If you give me the site that answers my question
and that users spend the most time on, that's the exact opposite of what I
want in this case.

Here's an example: I search "Population of America", and your site's top result is
sporcle.com, a quiz site. I bet people spend ages on there guessing the
population of various countries etc, but I'd prefer to just get my answer.

That said, it appears such queries are handled outside the main algorithm by
your competitors. Both Google and DuckDuckGo will give a card, at the top of
the result, answering my query - I don't even have to visit a website.

I guess the tl;dr is that it's awesome that this is ambitious, but I challenge
the assumption your algorithm is desirable for the majority of search results.
Neither is Google's really though, so maybe this is an overly harsh criticism
of something Google probably did very poorly early on too.

~~~
geraldbaeck
I think your guesses are absolutely right. We never intended to compete against
other engines in the field of "Answers". I think you will always get a better
result for "Population of America" if you search for it at DuckDuckGo or
Google. But on the other hand, if you search for example for "NSA" on Blippex
([https://www.blippex.org/?q=NSA](https://www.blippex.org/?q=NSA)), we are
assuming that you will get those articles about the NSA which are currently the
most interesting or the most read.

-Gerald Disclosure: I am the CTO of archify/Blippex.

~~~
mcintyre1994
That's fair enough, it seems like it would be really useful for article
searching, like the NSA one you gave (great example, blew Google out of the
water!). I can see it being great for research too, assuming academics spend
the most time on a good source which seems reasonable. I'll definitely be
giving this a go for some searches, really nice!

Oh, and a quick suggestion: Have you considered a Firefox search engine addon?
I do most searches from the omnibar, and I think more people would switch
search engine up there than manually go via blippex.com

~~~
geraldbaeck
Thanks. Yes we will implement a Firefox search engine addon soon.

------
drKarl
This algorithm might heavily impact discoverability. It gives more visibility
to already popular websites, making them even more popular, while not very well
known websites will never be discovered because very few people spend time on
them.

Also I don't like the idea of having to install a plugin on my browser so that
the urls I visit and how much time I spent on them is tracked, even if
supposedly my identity is never tracked. Once the plugin is installed, how can I
know if a new version of the plugin won't track more parameters?

When I read the title I thought it was referring to a distributed search engine
like YaCy or Seeks.

~~~
karli
Well, if anyone is able for example to implement a TOR client in javascript we
would love to add it to the plugin, and the sourcecode of the plugins and
Android app is on github, so no cheating there.

~~~
drKarl
+1 because the source code for the plugins is on github which I didn't know

------
robrenaud
+1 for being super ambitious.

Full disclosure: I work at Google (though not on web search).

Your search actually sucks, perhaps because your index is woefully inadequate.
How many pages are in it? Maybe you should use common crawl?

~~~
karli
It says so at the bottom of the page: not a lot. More stats are here:
[https://www.blippex.org/status](https://www.blippex.org/status)

------
a3_nm
It's sad to think that, with Google Analytics, Google probably has this data
point already available for a lot of pages without having to ask people to
install stuff.

~~~
marban
Sure, but last time I checked there was an opt-in for sharing the data (for
industry benchmarks I think) so I assume this would require a new legal
foundation.

------
dwg
1. PRISM?
2. Often the best sites are the ones I spend the least amount of time on,
because I got an answer quickly. Would hate to not be able to find those
sites. Seems like the traditional form of ranking should still be an important
part of your solution.

~~~
karli
Maybe, we don't know yet, need more data :) We also weight the number of
visits in the dwell factor, but maybe we have to adapt this in the future.

~~~
dwg
Cool. The converse of this is that perhaps I spend a lot of time on a site
because it's difficult to use (and there isn't a better alternative for me).
How about I instead give you access to my camera so you can measure my mood
(through facial expression). If I look happy or intrigued, must be a good
site!

------
c54
Google is a good name for a search engine, and easily usable as a verb.

Yahoo is pretty good, and was used as a verb for all of the 90s and early
2000s. IMO not as good as "I'm going to google that", "I'm going to yahoo
that" sounds vaguely sexual.

DuckDuckGo and AskJeeves are terribad: "I'm going to duck duck go that"? No.

Blippex is better than ddg or ask jeeves, but still not too great. Coming up
with a good product name is hard but is crucial for usability / spread through
culture. Reminds me of blumpkin.

~~~
marban
The verb bonus of Google is indeed killer but Google sounds primarily better
because Google is Google.

------
josephpmay
I think this is a great idea! However, I am worried about privacy. I also feel
like this algorithm may inflate the importance of certain types of content
over others. For example, just because I spend more time on a news or social
media website does not mean that it has higher quality content, it just means
that the content takes longer to consume. Within content categories, however,
I think this could do a good job of weeding out the quality content from the
spam.

~~~
geraldbaeck
Thanks, that is a very interesting input. We should think about running
additional semantic analysis and relate them to the time spent on sites. We
are very sure that our algorithm needs a lot of fine tuning and this could be
a very important part of it.

-Gerald Disclosure: I am the CTO of archify/Blippex.

------
trickjarrett
I had a similar idea in college which was to take the actual traffic for pages
into account for search ranking (this was before Google bought whatever
Analytics had been called before, I can't remember.) I had thought of it as a
server side app which would benefit the hosts while feeding the search engine
traffic data.

After talking with friends we explored the idea of a user side traffic
tracking app as a way to feed the search engine, but I couldn't get enough
traction and no one wanted to challenge not only Google but also
IE/Firefox/Safari etc. because we felt it would be its own browser.

Alas.

Nowadays I am more concerned about possible privacy issues. I feel for them
launching a search engine that actively asks you to be tracked (even if
anonymously), it's a hard sell during this current resistance to that entire
idea.

~~~
pdog
_> We felt it would be its own browser._

Why not a browser add-on or extension?

~~~
trickjarrett
At the time FF was still behind IE and IE hadn't really adopted extensions
yet, I think. It's a bit fuzzy how we got there, this wasn't like a formal
business plan and analysis, this was some college guys in the dorms chewing on
an idea for a few weeks.

------
Shank
It needs some sort of fallback for search results or it's useless to a
specialized user. My Google search history looks like random bits of
consciousness spread out across months. Half of those search terms bring 0
results on Blippex, and while I understand that they're early, it's hard to
beat something like PageRank when it's already got established experience.

It's a catch 22: the results won't get better unless people use the service,
but people aren't going to use it if the results are bad in the first place.
If I install the extension but use Google, it's a one-way relationship that
only they get data out of. Not very good for me.

------
DavidWanjiru
How do you differentiate between useful dwell and useless dwell? I often need
to spend some time on a page before I realize this is not what I'm looking
for. How will you tell? And now that we're talking about search, I had an
experience on google that I found very odd. I was looking for the Richard Marx
song, "Suspicion" from the album "My Own Best Enemy". I knew the song and the
album, but I couldn't remember the name Richard Marx. Problem was instead of
typing "My Own Best Enemy", I was typing "My Own WORST Enemy". Google had no
clue. Shouldn't a good search engine be able to tell it's just one word wrong?

~~~
geraldbaeck
Differentiating the quality of a dwell would be nice, but that would mean
tracking the search trails of our users, which is too much of a privacy issue. But we
are thinking about semantically evaluating the DwellRank. For example a useful
dwell would be for tutorials. But this is just an assumption, we simply need
more data about that.

Gerald (Disclosure: archify/Blippex CTO)

------
jspaetzel
The problem with this is that it's basically asking to be manipulated.

------
alooPotato
Thanks for building this. We need more stuff like this.

Out of curiosity, how do you prevent the case of some random malicious user
impersonating your Chrome extension and just issuing a bunch of "dwells" to
your server? I.e. can I just curl what this JavaScript file
([https://github.com/blippex/blippex_plugin_chrome/blob/master...](https://github.com/blippex/blippex_plugin_chrome/blob/master/plugin/common/js/api/upload.js))
is requesting to boost my own page's ranking?

~~~
geraldbaeck
We have some rate limits at our API in place, but of course it is not that
difficult to change an IP address. More importantly, we wrote some algorithms
which check submitted URLs and domains for suspicious or accelerating
behaviour. If that happens we simply suspend that domain for some time. We are
also planning to publish those suspensions.

-Gerald, CTO archify/Blippex
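
One way a check for "accelerating behaviour" could look, as a rough sketch only. Blippex's actual algorithms are not described in this thread, so the function name, the hourly-count input, and every threshold below are assumptions. The idea is to compare a domain's recent submission rate with its own earlier baseline and flag it when the recent rate is many times higher.

```python
def is_accelerating(hourly_counts, recent_hours=3, factor=5.0, min_recent=50):
    """Flag a domain whose recent dwell submissions far exceed its baseline.

    hourly_counts: submissions per hour for one domain, oldest first.
    All thresholds are illustrative guesses, not tuned values.
    """
    if len(hourly_counts) <= recent_hours:
        return False  # not enough history to form a baseline

    baseline = hourly_counts[:-recent_hours]
    recent = hourly_counts[-recent_hours:]
    baseline_rate = sum(baseline) / len(baseline)
    recent_rate = sum(recent) / len(recent)

    # Ignore small absolute volumes so quiet sites are not flagged by noise.
    if recent_rate < min_recent:
        return False
    return recent_rate > factor * max(baseline_rate, 1.0)


spike = is_accelerating([10, 12, 9, 11, 10, 200, 400, 800])
steady_small = is_accelerating([10, 12, 9, 11, 10, 12, 11, 13])
steady_large = is_accelerating([100] * 10)
```

A domain flagged this way could then be suspended for a while, as described above. A site that is steadily busy is not flagged, because its recent rate matches its own baseline.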

~~~
alooPotato
cool!

------
vmarsy
Interesting idea, but I tried simple searches :

Facebook

gmail

news ycombinator

countries in europe wiki

Did you gather enough data already ?

None of these searches was successful. There was no Facebook link in the
first search, no Gmail link in the second one, no news.ycombinator in the 3rd
one, and the only Wikipedia link I got in the last search was:

[http://en.wikipedia.org/wiki/National_champions](http://en.wikipedia.org/wiki/National_champions)

~~~
geraldbaeck
I don't think that those "generic" terms are the main advantage of Blippex. But
if you search for example for "NSA" on Blippex
([https://www.blippex.org/?q=NSA](https://www.blippex.org/?q=NSA)), we are
assuming that you will get those articles about the NSA which are currently the
most interesting or the most read. -Gerald (Disclosure: I am the CTO of
archify/Blippex.)

~~~
vmarsy
I see. If there is enough data, it would then make a lot of sense to search
for:

{A language/framework/... you want to learn about} tutorial

As some comments said, the best websites are not necessarily the ones you spend
most of your time on. But tutorials are an exception.

------
lucb1e
I'm sorry for the offtopic, but on a page that's supposed to get people
involved, shouldn't you at least mind the difference between _its_ and
_it's_? In the very first paragraph it goes wrong already. I'm not a native
English speaker, but these mistakes always jump out for some reason.

~~~
karli
Thank you for the find, we will fix it!

~~~
lucb1e
Well that was a quick response, at least that's a positive thing :). I
installed the add-on. Testing the search engine, searches seem to take
forever. It keeps displaying the spinning icon in the orange square next to my
query.

Edit: Ah the niceness of asynchronous javascript. It returns an error (I can
see it in the JS console) but the page never displays that to me. Good ol'
page reloads wouldn't have done that </rant>. In any case, the issue is my
header modifying add-on. It injects "'\ into the x-forwarded-for header,
causing your application to error. You probably have an SQL injection in your
code somewhere.

If you want to track the issue down, my IP is 83.161.210.237 or
2001:980:1f44::/48 if you support IPv6. Timestamp around 17:51 UTC+2.
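
For readers unfamiliar with this failure, here is a generic sketch of storing an untrusted header value safely with a parameterized query. It uses Python's built-in `sqlite3` only for illustration. Blippex's real stack and schema are unknown, so the `requests` table and `log_request` function are hypothetical.

```python
import sqlite3


def log_request(conn, forwarded_for, url):
    # Placeholders keep the header value as data; it is never spliced into
    # the SQL text, so quotes and backslashes cannot change the statement.
    conn.execute(
        "INSERT INTO requests (forwarded_for, url) VALUES (?, ?)",
        (forwarded_for, url),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (forwarded_for TEXT, url TEXT)")

# The three characters the add-on injects: double quote, single quote, backslash.
hostile = "\"'\\"
log_request(conn, hostile, "https://example.org/")

rows = conn.execute("SELECT forwarded_for FROM requests").fetchall()
```

The hostile value is stored and read back unchanged instead of breaking the query.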

~~~
geraldbaeck
Could you please send me a link to the header-modifying add-on, so I can fix
that?

gb@blippex.org

~~~
lucb1e
I sent you an e-mail together with the console log, containing response codes
I'm getting. Hope this helps :)

------
gavinpc
Doesn't work at all without cookies. Meaning, it doesn't work, and doesn't
tell you why. If you're targeting people who are looking for an alternative to
the major search engines, there's a better-than-average chance that they'll
have cookies disabled.

~~~
karli
Blippex doesn't use any cookies, neither the API nor the website. We don't store
any data from the people accessing Blippex.

~~~
gavinpc
I love that! All the more reason why the search should function properly with
cookies disabled.

Just disable cookies in your browser, load Blippex, and search. I searched for
"shakespeare" in Firefox 22 (where I have cookies turned off), and the result
(below the fold, incidentally), was

> Nothing found > We're all like "What the blip, man?" too. ...

Same search in Chrome 28 works fine (and interesting, too).

This is my pet peeve on the web, and so common with HN posts that I don't
usually bother to point it out. But this seems like something you'd want to
know.

------
undef1ned
It's some kind of AI or even some kind of neural network: people are involved
in training the search engine, so the more data users contribute to the search
server, the more relevant results they will get. Good idea.

