
Show HN: Advertising-Free Search-Engine - deusu
https://deusu.org/
======
rileyteige
I searched for "History of aviation" (no quotes) and got

1) History in the HeadlinesThe Secrets of Ancient Roman Concret

2) Watch History Topics Videos Online - History.com

3) Ohio — History.com Articles, Video, Pictures and Facts

as my top three results. It's an ambitious problem to tackle, but it has to be
able to give me better relevance for me to consider using it.

EDIT: For a comparison, I ran the same query on Google:

1) History of aviation - Wikipedia, the free encyclopedia

2) History of Flight - NASA

3) history of flight | aviation | Encyclopedia Britannica

~~~
deusu
Please keep in mind that this is running on _one_ server. Google has what?
Several hundred thousand servers?

Yes, it needs to get better. But it's a good start, I think.

~~~
rileyteige
In no way do I intend to discourage your efforts. I was offering an example in
which your search engine seems unable to identify the context that I had in
mind.

What are some of your plans for search engine improvement? Ad-free aside, why
should I fund your project? DuckDuckGo seems to offer itself up as a great
alternative to Google and I have enjoyed its clean UI immensely for several
months now. What is my incentive to help drive your project?

~~~
deusu
It's open-source:
[https://github.com/MichaelSchoebel/OpenAcoon](https://github.com/MichaelSchoebel/OpenAcoon)

This is currently written in Pascal. I'm in the process of rewriting it in
JavaScript/Node.js. The first part of the rewrite is the ranking. That should be
done in about a week or two.

~~~
NhanH
Pascal! Now that's a name I haven't heard in a while.

You should write a blog post about how you write a search engine in Pascal.
That would at least reach top spot on HN!

~~~
deusu
Given that I'm in the process of rewriting it in JavaScript I'm not sure about
that... On the other hand it's probably worth a try. Thanks for the
suggestion.

------
concerto
I think relevance is the thing to work on. I did a search in a space I know a
lot about and the majority of first page results were domains for sale. The
company I have an interest in, which is the first result for certain keywords
on Google and Bing, wasn't to be found anywhere in the first 4 pages on here.
It would be interesting to hear a bit more about your plans for moving this
forward.

I have a few questions it would be great to hear about either here or in a
follow-up blog post:

* Is your reimplementation in JS a translation of your current Pascal implementation or a redevelopment?

* Why have you decided to move from Pascal to JS rather than spending the effort improving the current implementation?

* What lessons have you learned from your current implementation that you are attempting to overcome with your new version in JS?

~~~
deusu
The rewrite in JavaScript will be a mixture of porting and redesign. Some parts
of the Pascal-sources are relatively new and can almost be translated as is.
But the older the Pascal-source is, the more a redesign is a better idea.

The old parts of the sources are in VERY bad shape. This project started more
than 15 years ago on a much, MUCH smaller scale (about 2 million pages
compared to 320 million now). Some design-choices simply aren't valid anymore.
Also the use of JS has the advantage that it will make the project attractive
to a lot more programmers than staying with Pascal could achieve.

The current implementation keeps the search-index as one big thing. I will
split that up into many smaller pieces. That way it can run multi-threaded or
even on multiple servers. Currently queries are running single-threaded.
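
A minimal sketch of the split-index idea, assuming a toy in-memory layout: each shard is queried independently and the partial results are merged afterwards. The shard contents, `search_shard`, and `search` are hypothetical names for illustration, not DeuSu's actual code or data format.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical shard layout: each shard maps a term to {doc_id: score}.
SHARDS = [
    {"aviation": {"doc1": 0.9, "doc2": 0.4}},
    {"aviation": {"doc3": 0.7}, "history": {"doc4": 0.5}},
    {"history": {"doc5": 0.8}},
]

def search_shard(shard, term):
    """Return (score, doc_id) pairs for one term from a single shard."""
    return [(score, doc_id) for doc_id, score in shard.get(term, {}).items()]

def search(term, shards=SHARDS, top_k=3):
    """Query every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        hits = [hit for partial in partials for hit in partial]
    # Keep only the best top_k hits across all shards.
    return [doc_id for score, doc_id in heapq.nlargest(top_k, hits)]

print(search("aviation"))
```

Because each shard only needs its own slice of the index, the same merge step would work if the shards lived on different servers; the thread pool here only illustrates the structure.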

Another shortcoming is that building an index involves several manual steps. I
want to automate that.

------
DanBC
I searched for my name and none of the first ten results was about me, or
about any of the other people called "Dan Beale". But nicely it didn't return
a bunch of filth based on my other surname, "Cocks", so that was nice.

It's a shame that a couple of replies are so harshly negative. Perhaps this
submission would have been better received if it had been a blog post about
how you created your engine; how it works; problems you have with it; and so
on.

Search is not --despite the Google behemoth-- a solved problem so there's
still space for creative thinking.

E.g. the search on a manufacturer's website is often hopeless. You end up with a
list of 8,000 widgets and need to just scroll through them page by page.
Amazon search is bafflingly poor. eBay search has some sub-optimalities.
(People can list "case for mp3 player £1.99" and "mp3 player £35" in the same
listing, so a search and sort by price will sort by the cheaper case price.)

------
frik
It reminds me of a similar open search engine with good search results:
[http://www.gigablast.com](http://www.gigablast.com),
[https://news.ycombinator.com/item?id=6152839](https://news.ycombinator.com/item?id=6152839)

Ten years ago, the EU sponsored Exalead to become a Google competitor.
Nowadays the company is owned by Dassault Systèmes (3D CAD Catia):
[https://www.exalead.com/search/](https://www.exalead.com/search/) ,
[http://www.heise.de/newsticker/meldung/Quaero-Erster-Vorlaeu...](http://www.heise.de/newsticker/meldung/Quaero-Erster-Vorlaeufer-der-europaeischen-Suchmaschine-110725.html)

Others mentioned DuckDuckGo, but one has to differentiate. DDG uses the Yahoo search
API, and Yahoo itself uses the Bing search from Microsoft. DDG parses the
query string and tries to add some snips from Wikipedia/Yelp/etc. There are
only 4 big search engines left: Google, Bing, Yandex, Baidu. There are some
minor ones like Gigablast, Exalead and now Deusu. And there are meta search
engines like DDG and Yahoo.

@Deusu dev: Good luck with the refactoring from Pascal to JavaScript, sounds
like a good idea! Do you use PageRank, or how else do you score?

~~~
deusu
No pagerank. But I use backlink-counts and Alexa.com data. Also position of
keywords in url, title and snippet. Length of url and number of elements in
the url are also a part.
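
A rough sketch of how signals like these could be combined into a single score. Everything here is an assumption for illustration: the weights are made up, `popularity` is a stand-in for external traffic data normalized to 0..1, and the choice to penalize longer URLs and deeper paths is a guess, since the direction of those signals isn't stated above.

```python
from urllib.parse import urlparse

# Made-up illustration weights; none of these values come from DeuSu.
WEIGHTS = {
    "backlinks": 0.01,
    "popularity": 1.0,
    "url": 2.0,
    "title": 3.0,
    "snippet": 1.0,
    "url_length": 0.01,
    "url_parts": 0.1,
}

def position_bonus(text, keyword):
    """1.0 if the keyword starts the text, less the later it appears, 0.0 if absent."""
    index = text.lower().find(keyword.lower())
    return 0.0 if index < 0 else 1.0 / (1 + index)

def score(keyword, url, title, snippet, backlinks, popularity, weights=WEIGHTS):
    """Combine the signals into one number; higher means ranked earlier."""
    # Count the non-empty path segments as the "elements" of the URL.
    path_parts = [part for part in urlparse(url).path.split("/") if part]
    return (
        weights["backlinks"] * backlinks
        + weights["popularity"] * popularity
        + weights["url"] * position_bonus(url, keyword)
        + weights["title"] * position_bonus(title, keyword)
        + weights["snippet"] * position_bonus(snippet, keyword)
        - weights["url_length"] * len(url)      # assumption: shorter URLs rank higher
        - weights["url_parts"] * len(path_parts)  # assumption: shallower paths rank higher
    )

relevant = score("aviation", "http://example.org/aviation",
                 "Aviation history", "All about aviation", 10, 0.5)
unrelated = score("aviation", "http://example.org/a/b/c/d/page.html",
                  "Ohio facts", "Nothing relevant", 10, 0.5)
print(relevant > unrelated)
```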

------
ignoramous
Query: that movie where jim carrey plays dad to three black men

DeuSu:
[https://deusu.org/query?q=that+movie+where+jim+carrey+plays+...](https://deusu.org/query?q=that+movie+where+jim+carrey+plays+dad+to+three+black+men)

Google:
[https://www.google.co.in/search?q=that+movie+where+jim+carre...](https://www.google.co.in/search?q=that+movie+where+jim+carrey+plays+dad+to+three+black+men)

For me, queries like these are why Google wins. NB: DuckDuckGo and Bing seem to
do fine on this query as well. Bing does better than Google[0] when you omit
phrases like 'three' or half-spell phrases like 'carrey' as 'car'. Surprising,
since it has often been the case that Google is better at returning/ranking
results for most of the programming-related queries I try on a daily basis [1].

[0] Probably, because Google takes into account the fact that the user ignored
the type-ahead suggestion (car -> carrey) for a reason and omits all 'jim
carrey' related results from the list.

[1]
[http://www.hanselman.com/blog/AmIReallyADeveloperOrJustAGood...](http://www.hanselman.com/blog/AmIReallyADeveloperOrJustAGoodGoogler.aspx)

------
Animats
A user-supported search engine has promise. A telco, a handset maker, or a
country could do an ad-free search engine. Cuil had about 30 people, and
eventually got a half-decent search engine. It's surprising that Apple doesn't
have their own search engine. Their negative experience with the map business
may have scared them.

The economics of search are strange. Search has negative market value - Google
paid about $100M/year to be the default on Firefox, and Google still pays to
be on iPhones. It's like the ad channels on cable. The Jewelry Channel pays to
be carried; the NFL gets paid to be carried. Google is in the Jewelry Channel
position.

As compute power increases, the cost of doing search goes down. But that
hasn't been reflected in the "selling price" of search, which is expressed as
ad density. Google is a high-margin business because of that. That makes them
vulnerable.

DuckDuckGo and Blekko, although tiny, are quite profitable. Even InfoSeek and
Ask apparently still make money. There's room at the bottom.

~~~
datacog
> Even InfoSeek and Ask apparently still make money

Ask is more of an 'ads search engine', and they make money mainly because of
their arbitrage model.

------
qzcx
I searched for "XPS 13" and I got several pages in German.

------
orbifold
I have the suspicion that Google user tracking is not just for better
advertising. Their original PageRank algorithm models a random walk on the web
graph. Every user similarly performs some walk on this graph, so in principle it
should be possible to use that walk to improve the relevance ranking. This is
just one example; there are many other ways user interaction could be used as
feedback into the search engine.
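
For reference, a minimal sketch of the textbook random-walk formulation of PageRank, computed by power iteration. It is not a description of Google's production system. The three-page graph, the damping factor of 0.85, and the iteration count are assumptions for illustration, and every link target must itself appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Otherwise the surfer follows one outgoing link at random.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page with no links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

Feeding real user walks into this would amount to replacing the uniform choice among outgoing links with observed click frequencies.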

~~~
ukandy
It does reduce your view of what's out there to some extent, giving seemingly
undue bias to certain services or opinions.

~~~
on_and_off
Google does indeed use my data in order to personalize search results. I
remember being impressed to see that for a query that could have several
different meanings, the search engine gave me results centered on my
engineering field, which was what I was looking for. I am not that convinced
that it blindfolds you to some services or opinions though. So far I have only
seen differences in understanding the context of the query, which I don't see
as an 'undue bias' but an added value (especially if you read the anecdote
about the 'history of aviation' query in this thread).

Some other thoughts on the bubble:
[http://www.blindfiveyearold.com/the-preference-bubble](http://www.blindfiveyearold.com/the-preference-bubble).

------
jasonkostempski
I searched for nothing but got suggestions for things I've searched for on
Google this week. Since, for some, tracking is just as big a concern as ads,
can you explain why and how that is happening? Does DeuSu now know those
things or anything else about me even if that data is anonymized? If so, will
that data be sold?

~~~
deusu
That is probably your browser keeping track of what you entered into input-
fields that are named "q" for "query".

And no, this data does NOT get transmitted. So DeuSu does not know about what
you searched-for on Google.

------
bdcravens
I searched for "fedex refunds". On Google you get what you expect on the first
page: some links to Fedex's site, and a list of Fedex shipping auditors. On
Deusu.org, the first non-Fedex page is Cosco, and not a single shipping
auditor. The rest of the links are just as irrelevant.

------
leereeves
Needs some work to improve both relevance and ranking algorithms. For example:

I'm watching the movie JFK, so I tried a search for Jack Ruby.

I found videos by a teacher named Jack Ruby, the Ruby Fortune online casino,
the bar Ruby Tuesday, and a lot of other irrelevant results.

Nothing about the famous Jack Ruby in the first four pages.

------
empressplay
I wonder where the raw crawler data is coming from? Current events-related
stuff is months (years?) out of date. For example it seems to think Tony
Abbott is still the Australian opposition leader.

Did you snarf an old database from another search engine? Just curious.

~~~
deusu
The main crawl is several months old. There is a separate news-crawler which
checks a few dozen news sites every 10 minutes or so.

------
ilaksh
I believe we will probably eventually transition to some kind of distributed
peer-to-peer semantic-ish system rather than having a giant company crawl
plain text pages and control the vast majority of global advertising and quite
a bit of its data.

~~~
barryhunter
Seems to be the premise of
[http://www.majestic12.co.uk/](http://www.majestic12.co.uk/) \- looks like it
might be failing.

------
huhtenberg
I applaud the effort, but the search results quality is really poor. It's
basically an Altavista style laundry list of anything remotely relevant in no
particular order.

That said, do you do your own crawling or do you source results from someone?

~~~
deusu
I do the crawling myself. There are currently about 320 million pages in the
search-index.

~~~
frik
How many GB of data are 320 million (HTML) pages? How long does it take to
refresh the index (with a single crawler on a 1GB/s connection)?

~~~
deusu
The completed index takes up 337 GB.

The incoming data to the crawler... not so sure... something on the order of
10-20 TB. I didn't really measure this. And I don't keep all the data. There is
no "cache" function.

On a 1 Gbit/s connection it takes about a week to crawl and generate the index.
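
A quick back-of-the-envelope check of these numbers, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and that the whole week is available for downloading, which overstates the crawl time since index generation is included in that week.

```python
PAGES = 320_000_000
INDEX_BYTES = 337 * 10**9            # 337 GB, decimal units assumed
LINK_BITS_PER_SECOND = 10**9         # 1 Gbit/s
SECONDS_PER_WEEK = 7 * 24 * 3600

# Average index space per page.
index_bytes_per_page = INDEX_BYTES / PAGES

# Upper bound on what the link could carry in one week.
max_week_tb = LINK_BITS_PER_SECOND / 8 * SECONDS_PER_WEEK / 10**12

# Average fetched size per page for the low and high crawl estimates.
fetched_low = 10 * 10**12 / PAGES
fetched_high = 20 * 10**12 / PAGES

print(index_bytes_per_page)        # 1053.125 bytes of index per page
print(max_week_tb)                 # 75.6 TB at full line rate
print(fetched_low, fetched_high)   # 31250.0 and 62500.0 bytes per page
```

So the index works out to roughly 1 KB per page, and a 10-20 TB crawl would use somewhere around 13% to 26% of what the link could move in a week.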

------
austenallred
Searched for my company, and the meta information is years old. Tried a few
other searches, and the results weren't even related. Sorry, but for me it's
not even close to worth switching for.

------
supercoder
[https://deusu.org/query?q=deusu](https://deusu.org/query?q=deusu)

Does not manage to return deusu as the first result, or in any of the results
for that matter.

~~~
deusu
And why should it? You don't need DeuSu to find itself. You are _already_ on
the site. :)

------
supercoder
Searched 'Google' got:

[http://www.google.com/chromeframe/?redirect=true](http://www.google.com/chromeframe/?redirect=true)

As first result

------
mdturnerphys
Tried searching for "BYU" and "UW" and got everything except the universities'
main websites. Do you just not crawl .edu addresses?

------
diegolo
Why reimplement a search engine from scratch instead of contributing to
Lucene/Solr/Elasticsearch? Did you add something new?

~~~
deusu
I started writing the software a LONG time ago. As far as I know Solr/Elastic
Search didn't exist back then.

Query logs are preserved, but they do NOT contain the IP address of the user.
So all I know is that _someone_ of the billions of Internet users searched for
something embarrassing. :)

------
zenincognito
Advertising free means it would not be able to support itself if it does not
show value instantaneously. No value proposition here.

------
hudell
I searched for my game (Orange Season) and the first result was Vladimir Putin

------
wbsun
I appreciate such work that people are working hard on, but how can this last
long? Google started out ad-free too. The search will need money to run
anyway. Have you figured out a better idea to get money without Ads?

~~~
afshin
Perhaps you would know the answer to that if you had actually visited the link
before commenting about it.

