A search engine built by the crowd (archify.com)
43 points by brequinn 1590 days ago | 53 comments

Great idea, but I think Google already uses this data.

Google Toolbar and now Chrome report this data back to Google, and most search pros believe "serp bounce back" and "time on site" are key signals Google uses.

PageRank and DwellRank are not either-or choices.

Here's my theory: Google uses PageRank to decide what pages to "try out" for a query (i.e. display a page in the SERP for a sampling of queries). If the page gets clicks AND has good "DwellRank" then it gets progressively better and better rankings. If a new page enters that beats it, it falls.

This approach is very Googly -- they love to test. They love to decide if product features are good or not by giving them a sampling of traffic. It would be insane of them not to extend this approach to search.

So the upshot is, use "PageRank" to decide which pages deserve an audition, and use "DwellRank" to decide the winners.

Since 40% of the clicks go to #1, 10% to #2, 8% to #3, etc., Google can audition pages using DwellRank without affecting the experience of the majority of their users.
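The "audition" theory above can be sketched in a few lines. This is only an illustration of the commenter's guess, not a description of how Google ranks pages: the `Page` fields, the smoothing prior in `dwell_score`, and the `audition_size` cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    pagerank: float              # link-based score, assumed precomputed
    dwell_seconds: float = 0.0   # total dwell time observed during auditions
    impressions: int = 0         # how often the page was shown and clicked


def dwell_score(page, prior_dwell=30.0, prior_weight=5):
    """Smoothed mean dwell time, so pages with little data sit near the prior."""
    total = page.dwell_seconds + prior_dwell * prior_weight
    return total / (page.impressions + prior_weight)


def rank(pages, audition_size=10):
    # Step 1: PageRank decides which pages get an audition at all.
    candidates = sorted(pages, key=lambda p: p.pagerank, reverse=True)[:audition_size]
    # Step 2: dwell time decides the winners among the auditioned pages.
    return sorted(candidates, key=dwell_score, reverse=True)
```

Note that a page with a huge dwell time but a PageRank below the cutoff never appears, which is the anti-spam property the theory relies on.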

I was a little surprised that they haven't included anything about spam or gamification. One core advantage of PageRank is that it's (relatively) hard to get links from high-authority websites. I can't force whitehouse.gov or cnn.com to link to me. If you rely on time-spent-on-page from millions of users and treat everyone equally, how do you stop spammers from faking millions of hours spent reading their content using spoofing or bots?

We have a lot of plans for this, but first we have to test how the search results are developing, and of course we will adapt the ranking if we need to. Everyone is invited to help us in solving this!

Yes, that could be a potential issue, especially if you want to keep the data of the submitters as anonymous as possible. We are monitoring this closely, but honestly we don't have a single hammer solution against spam; as always, it needs many small steps. I will post some thoughts about it on our blog in the coming days.

-Gerald Disclosure: I am the CTO of archify/Blippex.

Another problem is that if you just count time browsed then sites such as Facebook, Reddit and Kongregate get super high rankings.

We are counting per unique URL, which is currently not a big advantage for the popular sites, because they are hosting so many URLs. There is no domain-based factor in it right now.

-Gerald Disclosure: I am the CTO of archify/Blippex
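A small sketch of why counting per unique URL blunts the advantage of big sites. The function names and the sample data are made up for illustration; nothing here is taken from the Blippex code.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def dwell_per_url(visits):
    """Sum dwell seconds for each exact URL."""
    totals = defaultdict(float)
    for url, seconds in visits:
        totals[url] += seconds
    return dict(totals)


def dwell_per_domain(visits):
    """Sum dwell seconds for each host name (the factor Blippex says it does not use)."""
    totals = defaultdict(float)
    for url, seconds in visits:
        totals[urlsplit(url).netloc] += seconds
    return dict(totals)


# Hypothetical sample: three short visits to one big site, one long read elsewhere.
VISITS = [
    ("https://www.facebook.com/a", 60),
    ("https://www.facebook.com/b", 60),
    ("https://www.facebook.com/c", 60),
    ("https://example.org/essay", 120),
]
```

Per domain, the big site wins with 180 seconds; per URL, its time is spread over three pages and the single essay comes out on top.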

Hmm, this is an interesting algorithm, but I'd challenge its major assumption for a lot of searches. I don't have metrics, so of course my own assumptions can be challenged also, feel free to.

I think that a lot of search engine enquiries are essentially questions, with an answer that can be considered correct. Absolutely not all, but I think enough that they should certainly be considered. In that case, a site which immediately and clearly answers the question should be given priority: I want my answer within seconds, not minutes. If you give me the site that answers my question and that users spend the most time on, that's the exact opposite of what I want in this case.

Here's an example, I search "Population of America", your site's top result is sporcle.com, a quiz site. I bet people spend ages on there guessing the population of various countries etc, but I'd prefer to just get my answer.

That said, it appears such queries are handled outside the main algorithm by your competitors. Both Google and DuckDuckGo will give a card, at the top of the result, answering my query - I don't even have to visit a website.

I guess the tl;dr is that it's awesome that this is ambitious, but I challenge the assumption your algorithm is desirable for the majority of search results. Neither is Google's really though, so maybe this is an overly harsh criticism of something Google probably did very poorly early on too.

I think your guesses are absolutely right. We never intended to compete against other engines in the field of "Answers". I think you will always get a better result for "Population of America" if you search for it at DuckDuckGo or Google. But on the other hand, if you search for example for "NSA" on Blippex (https://www.blippex.org/?q=NSA), we are assuming that you will get those articles about the NSA which are currently the most interesting or the most read.

-Gerald Disclosure: I am the CTO of archify/Blippex.

That's fair enough, it seems like it would be really useful for article searching, like the NSA one you gave (great example, blew Google out of the water!). I can see it being great for research too, assuming academics spend the most time on a good source, which seems reasonable. I'll definitely be giving this a go for some searches, really nice!

Oh, and a quick suggestion: Have you considered a Firefox search engine addon? I do most searches from the omnibar, and I think more people would switch search engine up there than manually go via blippex.com

Thanks. Yes we will implement a Firefox search engine addon soon.

This algorithm might highly impact discoverability. It gives more visibility to already popular websites, making them even more popular, while not very well known websites will never be discovered because very few people spend time on them.

Also, I don't like the idea of having to install a plugin on my browser so that the URLs I visit and how much time I spend on them are tracked, even if supposedly my identity is never tracked. Once the plugin is installed, how can I know that a new version of the plugin won't track more parameters?

When I read the title I thought it was referring to a distributed search engine like YaCy or Seeks.

Well, if anyone is able, for example, to implement a Tor client in JavaScript, we would love to add it to the plugin. The source code of the plugins and the Android app is on GitHub, so no cheating there.

+1 because the source code for the plugins is on github which I didn't know

+1 for being super ambitious.

Full disclosure: I work at Google (though not on web search).

Your search actually sucks, perhaps because your index is woefully inadequate. How many pages are in it? Maybe you should use common crawl?

It says it at the bottom of the page: not a lot. More stats are here: https://www.blippex.org/status

It's sad to think that, with Google Analytics, Google probably has this data point already available for a lot of pages without having to ask people to install stuff.

Sure, but last time I checked there was an opt-in for sharing the data (for industry benchmarks I think) so I assume this would require a new legal foundation.

1. PRISM? 2. Often the best sites are the ones I spend the least amount of time on, because I got an answer quickly. Would hate to not be able to find those sites. Seems like the traditional form of ranking should still be an important part of your solution.

Maybe, we don't know yet, we need more data :) We also weight the number of visits in the dwell factor, but maybe we have to adapt this in the future.
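One way to read "weighting the number of visits in the dwell factor" is sketched below. The formula is purely an assumption for illustration (Blippex's actual weighting is not given in this thread): it dampens both mean dwell time and visit count with a logarithm, so a quick-answer page with many visits can still outrank a long read with few.

```python
import math


def dwell_factor(total_dwell_seconds, visits):
    """Hypothetical score: log-damped mean dwell time times log-damped visit count."""
    if visits <= 0:
        return 0.0
    mean_dwell = total_dwell_seconds / visits
    return math.log1p(mean_dwell) * math.log1p(visits)
```

With this shape, a page averaging 10 seconds over 1000 visits scores higher than a page averaging 10 minutes over 5 visits, which addresses the "least time spent, best answer" concern at least partly.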

Cool. The converse of this is that perhaps I spend a lot of time on a site because it's difficult to use (and there isn't a better alternative for me). How about I instead give you access to my camera so you can measure my mood (through facial expression). If I look happy or intrigued, must be a good site!

Google is a good name for a search engine, and easily usable as a verb.

Yahoo is pretty good, and was used as a verb for all of the 90s and early 2000s. IMO not as good as "I'm going to google that", "I'm going to yahoo that" sounds vaguely sexual.

DuckDuckGo and AskJeeves are terribad: "I'm going to duck duck go that"? No.

Blippex is better than ddg or ask jeeves, but still not too great. Coming up with a good product name is hard but is crucial for usability / spread through culture. Reminds me of blumpkin.

The verb bonus of Google is indeed killer but Google sounds primarily better because Google is Google.

I kind of feel like "blip" is way, way better than all of the above. It's a word that's already a word, and its use on Blippex is almost congruous with the actual word. In my mind, that's a win.

Blippex sounds like a dubstep musician.

Despite dubstep being apparently totally uncool, I'm cool with that.

Wait for the bass drop.

I think this is a great idea! However, I am worried about privacy. I also feel like this algorithm may inflate the importance of certain types of content over others. For example, just because I spend more time on a news or social media website does not mean that it has higher quality content, it just means that the content takes longer to consume. Within content categories, however, I think this could do a good job of weeding out the quality content from the spam.

Thanks, that is a very interesting input. We should think about running additional semantic analysis and relate them to the time spent on sites. We are very sure that our algorithm needs a lot of fine tuning and this could be a very important part of it.

-Gerald Disclosure: I am the CTO of archify/Blippex.

I had a similar idea in college which was to take the actual traffic for pages into account for search ranking (this was before Google bought whatever Analytics had been called before, I can't remember.) I had thought of it as a server side app which would benefit the hosts while feeding the search engine traffic data.

After talking with friends we explored the idea of a user side traffic tracking app as a way to feed the search engine, but I couldn't get enough traction and no one wanted to challenge not only Google but also IE/Firefox/Safari etc. because we felt it would be its own browser.


Nowadays I am more concerned about possible privacy issues. I feel for them, launching a search engine that actively asks you to be tracked (even if anonymously). It's a hard sell during this current resistance to that entire idea.

> We felt it would be its own browser.

Why not a browser add-on or extension?

At the time FF was still behind IE and IE hadn't really adopted extensions yet, I think. It's a bit fuzzy how we got there, this wasn't like a formal business plan and analysis, this was some college guys in the dorms chewing on an idea for a few weeks.

How do you differentiate between useful dwell and useless dwell? I often need to spend some time on a page before I realize this is not what I'm looking for. How will you tell? And now that we're talking about search, I had an experience on google that I found very odd. I was looking for the Richard Marx song, "Suspicion" from the album "My Own Best Enemy". I knew the song and the album, but I couldn't remember the name Richard Marx. Problem was instead of typing "My Own Best Enemy", I was typing "My Own WORST Enemy". Google had no clue. Shouldn't a good search engine be able to tell it's just one word wrong?
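The "one word wrong" case can be handled with fuzzy matching at the word level. The sketch below uses Python's standard `difflib.SequenceMatcher` on lists of words; the helper names, the 0.7 threshold, and the tiny title list are assumptions for illustration, and a real engine would do this against an index rather than a list.

```python
from difflib import SequenceMatcher


def word_similarity(a, b):
    """Similarity in [0, 1] computed over whole words, ignoring case."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def closest_title(query, titles, threshold=0.7):
    """Return the known title most similar to the query, or None if nothing is close."""
    best = max(titles, key=lambda t: word_similarity(query, t), default=None)
    if best is not None and word_similarity(query, best) >= threshold:
        return best
    return None
```

For "My Own WORST Enemy" against "My Own Best Enemy", three of the four words line up on each side, so the similarity is 0.75 and the correct album title is suggested.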

Differentiating the quality of a dwell would be nice, but that would mean to track search trails of our users, which is too much of a privacy issue. But we are thinking about semantically evaluating the DwellRank. For example a useful dwell would be for tutorials. But this is just an assumption, we simply need more data about that.

Gerald (Disclosure: archify/Blippex CTO)

It needs some sort of fallback for search results or it's useless to a specialized user. My Google search history looks like random bits of consciousness spread out across months. Half of those search terms bring 0 results on Blippex, and while I understand that they're early, it's hard to beat something like PageRank when it's already got established experience.

It's a catch 22: the results won't get better unless people use the service, but people aren't going to use it if the results are bad in the first place. If I install the extension but use Google, it's a one-way relationship that only they get data out of. Not very good for me.

The problem with this is that it's basically asking to be manipulated.

Thanks for building this. We need more stuff like this.

Out of curiosity, how do you prevent the case of some random malicious user impersonating your Chrome extension and just issuing a bunch of "dwells" to your server? I.e. can I just curl what this JavaScript file (https://github.com/blippex/blippex_plugin_chrome/blob/master...) is requesting to boost my own page's ranking?

We have some rate limits in place at our API, but of course it is not that difficult to change an IP address. Most importantly, we wrote some algorithms which check submitted URLs and domains for suspicious or accelerating behaviour. If that happens, we simply suspend that domain for some time. We are also planning to publish those suspensions.

-Gerald, CTO archify/Blippex
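A check for "accelerating behaviour" could look roughly like the sketch below. This is not Blippex's algorithm, which is not described here; the function name, the hourly bucketing, and both thresholds are hypothetical.

```python
from statistics import mean


def is_accelerating(hourly_counts, factor=5.0, min_count=20):
    """Flag a domain whose latest hour of submissions far exceeds its earlier average.

    hourly_counts: dwell submissions per hour for one domain, oldest first.
    """
    if len(hourly_counts) < 2:
        return False  # not enough history to compare against
    *history, latest = hourly_counts
    baseline = max(mean(history), 1.0)  # avoid flagging tiny absolute numbers
    return latest >= min_count and latest > factor * baseline
```

A domain that goes from a handful of submissions per hour to 200 is flagged, while a consistently busy domain is not, since only the change relative to its own history counts.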


Interesting idea, but I tried simple searches:



news ycombinator

countries in europe wiki

Did you gather enough data already?

None of these searches was successful. There was no Facebook link in the first search, no Gmail link in the second one, no news.ycombinator in the 3rd one, and the only Wikipedia link I got in the last search was:


I don't think that those "generic" terms are the main advantage of Blippex. But if you search for example for "NSA" on Blippex (https://www.blippex.org/?q=NSA), we are assuming that you will get those articles about the NSA which are currently the most interesting or the most read. -Gerald (Disclosure: I am the CTO of archify/Blippex.)

I see. If there is enough data, it would then make a lot of sense to search for:

{A language/framework/... you want to learn about} tutorial

As some comments said, the best websites are not necessarily the ones you spend most of your time on. But tutorials are an exception.

I'm sorry for the offtopic, but on a page that's supposed to get people involved, shouldn't you at least mind the difference between its and it's? In the very first paragraph it goes wrong already. I'm not a native English speaker, but these mistakes always jump out for some reason.

Thank you for the find, we will fix it!

Well that was a quick response, at least that's a positive thing :). I installed the add-on. Testing the search engine, searches seem to take forever. It keeps displaying the spinning icon in the orange square next to my query.

Edit: Ah, the niceness of asynchronous JavaScript. It returns an error (I can see it in the JS console) but the page never displays that to me. Good ol' page reloads wouldn't have done that </rant>. In any case, the issue is my header-modifying add-on. It injects "'\ into the x-forwarded-for header, causing your application to error. You probably have an SQL injection in your code somewhere.

If you want to track the issue down, my IP is or 2001:980:1f44::/48 if you support IPv6. Timestamp around 17:51 UTC+2.
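If the error really is an SQL injection through the x-forwarded-for header, the usual fix is to pass header values as query parameters instead of concatenating them into the SQL string. The sketch below uses Python's standard `sqlite3` module as a stand-in, since the thread does not say what Blippex runs on the server; the `request_log` table and the `log_request` helper are hypothetical.

```python
import sqlite3


def log_request(conn, forwarded_for, query):
    """Store a request; the ? placeholders keep header content as data, never as SQL."""
    conn.execute(
        "INSERT INTO request_log (forwarded_for, query) VALUES (?, ?)",
        (forwarded_for, query),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE request_log (forwarded_for TEXT, query TEXT)")

# The header value from the comment above: a double quote, a single quote, a backslash.
log_request(conn, "\"'\\", "shakespeare")
```

With placeholders, the odd header value is stored verbatim and the statement does not break.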

Could you please send me a link to the header-modifying add-on, so I can fix that?


I sent you an e-mail together with the console log, containing response codes I'm getting. Hope this helps :)

Strange, it should work in under one second, but you need to have JavaScript enabled. The site is made in AngularJS (so no hidden things there either).

PS: I hope the "it's" problem is now solved :)

Actually no it's still there :(. I sent geraldbaeck an e-mail like he asked. Thanks for following up though.

Well, it uses JSONP to get the results, maybe the plugin you are using (can you tell me the name so we can check it?) has a problem with it.

Doesn't work at all without cookies. Meaning, it doesn't work, and doesn't tell you why. If you're targeting people who are looking for an alternative to the major search engines, there's a better-than-average chance that they'll have cookies disabled.

Blippex doesn't use any cookies, neither the API nor the website. We don't store any data from the people accessing Blippex.

I love that! All the more reason why the search should function properly with cookies disabled.

Just disable cookies in your browser, load Blippex, and search. I searched for "shakespeare" in Firefox 22 (where I have cookies turned off), and the result (below the fold, incidentally), was

> Nothing found
> We're all like "What the blip, man?" too. ...

Same search in Chrome 28 works fine (and interesting, too).

This is my pet peeve on the web, and so common with HN posts that I don't usually bother to point it out. But this seems like something you'd want to know.

It's some kind of AI or even some kind of neural network: people are involved in training the search engine, so the more data users contribute to the search server, the more relevant the results they will get. Good idea.
