
Show HN: Open-source search engine with 2bn-page index - deusu
https://deusu.org
======
throwaway13337
Alternative general purpose search engines are an exciting idea.

It feels a lot like the time when Yahoo was dominant and searching was sort of
awful. When you searched, what ranked highest was market-driven sorts of
stuff.

Right now, for topics normal people search for - not techies - all you get are
content farm sites with js-popups asking for your email address. Try searching
for anything health-related, for example. We've regressed.

My own half-solution is to look only for sites that are discussions - reddit,
hn, etc. It could be better. A search engine that favored non-marketing
content could really steal some thunder.

This doesn't look like that, but maybe it's a start?

~~~
dood
I'd love a search engine which only indexes forums. Something I've been
thinking of doing for years, but it'd be a lot of work.

~~~
dvirsky
There were a few attempts at that in the past, one being
[http://omgili.com/](http://omgili.com/) that now seems to return pretty much
garbage.

BTW, about 12 years ago I was building my own search engine, and I was toying
with the idea of building a classifier that classifies web pages by their
"genre" rather than category, so you could limit your search to shopping
websites, forums, blogs, news sites, social media, etc. It was a bitch to
train, and my classifier's algorithm was pretty crappy, but it showed some
potential.

I think modern search engines do that behind the scenes today, and try to
diversify the results to include pages from multiple genres, but they usually
don't let you choose.

~~~
dood
Heh, classifying by "genre" is exactly what I was thinking of doing.

Had some debate with myself about whether I should start by focusing on
training for shopping pages (product pages & product reviews) - because that
might make some money; or start by training for forums - which I'd enjoy a lot
more. Or build a more general system, which would definitely never work and
never get finished.

Google actually let you filter by "discussions" until a few years ago, so they
certainly do this kind of classification. It didn't work perfectly but
sometimes did the trick. Don't know why they removed that feature.

~~~
petra
Google removed it because they aim at the mass market.

Another perspective: people who find answers in forums are less likely to be
interested in ads. And who knows - maybe by making search shitty (in so many
ways, not just forums), ad revenues rise?

------
CM30
Well, I admire the work behind it, and I think the idea is good (especially
how having this open source means multiple sites can build on the same data
set and get it more and more accurate over time).

But I have to be honest and say that it's just not working for me.

I type in Reddit, and it shows links to the NSFW subreddits instead of the
main site or anything else on it.

Typing in Wikipedia gives me the Dutch version of Wikipedia.

Mario Wiki? The page on Mario Wiki about Mario Clash, then the Smash Bros Wiki
and a bunch of SEO spam pages.

Pokemon Go gets me no relevant results at all. Certainly not anything
official, that's for sure.

It's a decent start, and having 2 billion pages indexed is pretty impressive
for a project like this as it is, but it's just not really usable as a search
engine just yet.

------
laurent123456
They need to filter porn out of their search results (even for common queries
like "hat", there's only porn) and perhaps be more resilient to SEO
techniques, since it looks like there's a lot of spam in the top results.
Queries with common words such as "cat" return almost only irrelevant results.

I'd really like to see that kind of project working as a good alternative to
Google, but as it is it's not really usable.

~~~
deusu
I hadn't even thought about that. But it should be pretty easy to do in post-
processing. I just have to take a list of "porn" keywords. If none of them
occurs in the query, but one does in a search result, then that result gets
downranked.
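
A minimal sketch of what I mean (Python rather than the actual Pascal code,
and the keyword list, result format, and penalty factor are just
placeholders):

```python
NSFW_KEYWORDS = {"porn", "xxx", "nsfw"}  # illustrative list only

def nsfw_terms(text):
    # Which of the filter keywords appear in this text?
    return {word for word in text.lower().split() if word in NSFW_KEYWORDS}

def downrank(query, results, penalty=0.1):
    # results: list of (score, page_text) pairs.
    # If the query itself contains a filter keyword, the user is
    # explicitly searching for it, so leave everything alone.
    if nsfw_terms(query):
        return results
    # Otherwise, downrank every result that contains a keyword.
    return [(score * penalty, text) if nsfw_terms(text) else (score, text)
            for score, text in results]
```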

~~~
laurent123456
Yes, I guess filtering them out would at least make the website SFW, and it
would make it easier to show it to people. The issue seems to happen mainly
with common words (whose results also appear to be polluted with heavily SEO-
ed websites).

I've also searched for less generic things like "xperia z5" and the results
looked good.

~~~
deusu
I have the filter implemented now. It's not perfect yet, but it already
filters out a lot of the NSFW stuff. Unless you explicitly search for it.

I'm gonna further improve this over the next few days. Right now it's just a
quick'n'dirty hack. :)

~~~
rasengan
It would be an interesting data point to see how many of the 2bn pages
indexed are adult. lol.

~~~
deusu
I don't know. But I _do_ know that the end-of-year statistics from search
engines about what people searched for are complete BS. I have such a list
for the German DeuSu page:

[https://deusu.de/blog/2015-12-03-alle_jahre_wieder_wonach_de...](https://deusu.de/blog/2015-12-03-alle_jahre_wieder_wonach_deutschland_wirklich_gesucht_hat.html)

Warning! This is _definitely_ NSFW! :)

~~~
wila
LOL, it seems that a big part of your users are searching for adult-related
things.

So if you filter out all the adult stuff you might make those users unhappy.
Perhaps make it configurable?

Eg. add a checkbox for NSFW results or something better.

------
Taek
Is the two billion page index open source?

I've been thinking a lot about data recently. Seems to me like Pandora's box
is open. Google knows where you live, where you eat, what your fetishes are,
all of your sexual partners. Facebook knows most of those things too, via
different methods. And if you run Windows Microsoft probably has access to
most of that as well. Apple will too, because if they don't they won't be able
to compete. Tesla, Uber, Waze also have a huge amount of data on your life.

Everyone is pushing the envelope on how much data they are collecting, and the
companies which collect more data will compete better. As tech gets better we
will increasingly be unable to resist sharing our whole lives with the
companies who are powering modern living.

Even worse, there's a huge monopolization effect to having data. Nobody else
has anywhere near as much data as Google. That means nobody else can compete.
Never mind the engineering; your algorithms can be 2x as good, but you won't
have 0.1% of the data of a company with billions of daily users.

So Google and Facebook are left untouchable. Microsoft, Apple, and maybe
Amazon can get in range. Is there anyone else?

We can fight back by giving up the privacy war and blowing the doors open
instead. Take your data (as much as you dare) and make it public. Let every
startup have access to it. Let every doctor have access to it. Give the small
players a fighting chance.

That does mean a massive cultural shift. It means your neighbors will be able
to look up your salary, your fetishes, your personal affairs. It's a big deal.

I don't see any other way out of this though. Surveillance technology is
getting better faster than privacy technology, because surveillance tech has
the entire tech industry behind it. Smarter phones, smarter TVs, smarter
grocery stores, smarter credit cards, smarter shoes... smarter everything.
Privacy is melting away and we aren't getting it back.

~~~
deusu
The software is already open-source.

A free search API will be fully available probably next week. It's in testing
already. It's just a matter of putting the finishing touches on the
documentation.

And the crawl- and index-data will be available for download in a few weeks.
It's also just a matter of documenting the data-format.

BTW: I disagree with your points about privacy. I see DeuSu as a way of
fighting back.

~~~
siegecraft
How does the search / indexing compare to sphinx or lucene?

~~~
deusu
I don't know sphinx at all, and my knowledge of lucene is very limited. Which
means I don't know how they would compare to DeuSu.

~~~
karussell
You should really have a look into both. I'm curious: why did you create new
software from scratch without knowing the existing options?

If you'd base a project on lucene/solr/elasticsearch or sphinx you'd also
increase the chance of contributing to it.

------
fnord123
It's written in pascal. Neat.

However, it's not very good. If I search for "banana" I get information about
a sex shop rather than about bananas.

~~~
0xmohit
Appears to be a "smart" search engine. Tries to infer a lot (maybe based on
data collected earlier).

------
mstolpm
In addition to the lack of porn filtering and the result ordering not
prioritizing "quality" sources, some of the indexed site data is at least 4-6
months old and has changed heavily since the last crawl. I even got 404
errors. That makes it very hard to find real use for the project beyond
academic interest.

~~~
deusu
A fresh recrawl is currently running. Should take about 2-3 months. Newly
crawled data will gradually replace older data during that time.

~~~
webtechgal
Great work, congrats. :-)

Here is some input based on my experience building a similar project at my
former company. (We did not quite get to 2B pages, but were close to ~300M):

For creating a really viable (alternative) search engine, the freshness of
your index is going to be a fairly important factor. Now, obviously, re-
crawling a massive index frequently/regularly is going to consume huge
amounts of bandwidth + CPU cycles. Here is how we had optimized the resource
utilization:

Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.

Corresponding to each indexed URL, also store a sort-of 'crawl-history' (If
space is a constraint, don't store each version of the URL, store only the
latest one). On each re-crawl, store two data fields: time-stamp and a boolean
if the URL content has changed since last crawl. As more re-crawl cycles run,
you will be able to calculate/predict the 'update frequency' of each URL.
Then, prioritize the re-crawls based on the update frequency score (i.e. re-
crawl those with higher scores more frequently and the others less
frequently).
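
The scheme above could look roughly like this (a Python sketch; the class
layout and the "fraction of crawls that saw a change" score are my
simplification, not our production code):

```python
import time

class CrawlRecord:
    """Per-URL crawl history: a list of (timestamp, changed) pairs."""

    def __init__(self, url):
        self.url = url
        self.history = []

    def note_crawl(self, changed, ts=None):
        # Record one re-crawl: when it happened and whether the
        # content had changed since the previous crawl.
        self.history.append((time.time() if ts is None else ts, changed))

    def update_frequency(self):
        # Fraction of crawls on which the content had changed.
        # A URL we have never crawled gets top priority.
        if not self.history:
            return 1.0
        return sum(1 for _, changed in self.history if changed) / len(self.history)

def recrawl_order(records):
    # Higher update frequency => re-crawl sooner.
    return sorted(records, key=lambda r: r.update_frequency(), reverse=True)
```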

If you need any more help/input, let me know and I'll be happy to do what I
can.

HTH and all the best moving forward.

~~~
webtechgal
We had also (obviously) built a (proprietary) ranking algo that took into
account some 60+ individual factors. If it can be of any help, I'll create a
list and send it to you.

~~~
ddorian43
Why not write that list here ?

~~~
webtechgal
Good idea. However, I'll need to really exercise the gray cells to put
together the list so it might take me a couple of days. Once done, I'll post
it here.

------
jbb555
I think projects like this are really important because they help reduce the
impression that big server projects are only meant to be done by big
companies. The internet is becoming a content consumption medium for many
people.

I'm not sure I'll use this, but I'll try to... it all depends on how good it
is. But I approve of the project so I sent a (very) small bitcoin donation to
hopefully help fund it for a few more minutes :)

~~~
deusu
Thank you!

Depending on who you are (there were 2 bitcoin donations today), you funded
either about 18 or 28 hours of operations. :)

------
ccleve
You get really good performance on not much hardware. Can you share some
technical details?

\- file formats, particularly the postings

\- query evaluation strategy

\- update strategy

I poked around in the source code a bit, but couldn't find these things.

~~~
deusu
File formats will be documented when I publish the data-files in a few weeks.

What do you mean by postings?

The main index is split into 32 shards (there is also an additional news-index
which is updated about every 5-10 minutes). Each shard is updated and queried
separately. The query actually runs 2/3 on a Windows server and 1/3 on a Linux
server, the latter in Docker containers. I want to move everything to Linux
over time.

Query has two phases. First only a rough - but fast - ranking is done. Then
the top results of all shards are combined and completely re-ranked. This is
basically a meta search engine hidden within.

First query phase is in src/searchservernew.dpr, and the second phase is in
src/cgi/PostProcess.pas.
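
If I read that right, the flow is something like this (a Python sketch of the
two phases; the function names and scoring stand-ins are mine, not from the
DeuSu source):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 32
TOP_K = 100  # candidates kept per shard after phase 1

def cheap_rank(shard, query):
    # Phase 1: each shard does a rough but fast ranking and
    # returns only its top results. (Stand-in scoring here.)
    return sorted(shard, key=lambda doc: doc["rough_score"], reverse=True)[:TOP_K]

def full_rank(doc, query):
    # Phase 2: an expensive, precise score. (Stand-in scoring here.)
    return doc["rough_score"] + doc.get("bonus", 0.0)

def search(shards, query):
    # Query all shards in parallel, then merge and completely
    # re-rank the candidates -- effectively a meta search engine
    # over the engine's own shards.
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = pool.map(lambda s: cheap_rank(s, query), shards)
    candidates = [doc for part in partials for doc in part]
    return sorted(candidates, key=lambda d: full_rank(d, query), reverse=True)
```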

~~~
ccleve
Thank you. "Postings" is another word for the format of the doc ids and
related information in the inverted file. A google for "inverted index
postings" will turn up a bunch of references.

------
pmontra
Written in Delphi. I might be wrong, but I don't see many people downloading
and working on it: there's a 30-day free trial and then you have to pay for
the development environment. IMHO that's a non-starter for an open-source
project, but if it's the only language the author is comfortable with, well,
that's OK.

~~~
deusu
Originally it was written in Delphi. But I now use FreePascal for the
development. I'm even compiling both Windows and Linux versions on my Linux
machine.

~~~
pmontra
Great choice! Thanks.

------
NKCSS
Fun, but overall quality seems a bit lacking.

When I search for myself, the top 10 results don't even have my last name
('Kusters') and just show pages that have the word 'Nick'. I suppose you
don't use a form of LSA to score the search results? Maybe it's too specific,
but afaik mainstream search engines seem to give somewhat consistent results
here.

[https://deusu.org/query?q=nick+kusters](https://deusu.org/query?q=nick+kusters)

Looking at the code
([https://github.com/MichaelSchoebel/DeuSu/](https://github.com/MichaelSchoebel/DeuSu/))
I notice that you have ranking modifiers based on the .tld; why not store the
reported content language and score based on that? Isn't that more relevant?

~~~
deusu
In my experience this is usually caused by the fact that even 2bn pages
aren't that many nowadays. The index needs to get bigger to better find (and
rank) long-tail queries like this one.

------
gkst
Pascal is an interesting language choice. I think this is the first time I've
seen an open-source project written in Pascal that is actually used in
production.

------
skykooler
It shows snippets of the web pages under each result; however, generally not
the particular snippets that contain the search term. I would think that would
be useful.

~~~
deusu
Yes, it would be better.

The snippets are currently the first 255 characters of the page's text. For
snippets to be customized to the search term, I would have to store all the
text of the page. And that would require a lot more disk space. Space that I
can't afford at the moment.
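
For what it's worth, if the full text were stored, query-aware snippets
aren't much code - e.g. centre a window on the first query-term hit and fall
back to the start of the page otherwise (a Python sketch; the 255-character
width mirrors the current snippet size):

```python
def snippet(text, query_terms, width=255):
    # Find the first occurrence of any query term and centre a
    # window of `width` characters on it. If no term matches,
    # fall back to the start of the page (the current behaviour).
    lower = text.lower()
    positions = [lower.find(term.lower()) for term in query_terms]
    positions = [p for p in positions if p >= 0]
    if not positions:
        return text[:width]
    start = max(0, min(positions) - width // 2)
    return text[start:start + width]
```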

------
yati
Looking at the source code took me back to days when I used to do stuff in
Delphi :)

Neat project -- Loads of room for improvement, but a great initiative!

------
swiley
The site's interface is just incredibly pleasant compared to Google.com. I
really hope the author sticks with it. Unfortunately I'm not sure it's usable
right now, searching "group theory Wikipedia" never brings up a Wikipedia page
(although maybe I should just be directly searching Wikipedia if that's what I
wanted).

~~~
DanBC
DuckDuckGo's approach of !bang searches, making duckduckgo the place[1] I go
when I want to search another site, is really useful.

[1] It's my default search engine in Chrome, so I use bang searching in the
address bar.

~~~
Cyph0n
Same here. The problem is that I find myself using `!g` way too often... I
guess I'm not used to the DDG results page.

~~~
amirouche
ddg is my primary search engine; it takes time, but you get used to it. If
what you are looking for is mostly on HN, SO or Wikipedia, it works quite
well.

------
rshm
As of Aug 16, Common Crawl has 1.73bn pages. You could use their data dump as
a seed for the complementary set of URLs, if that's of any benefit.

If the metadata (such as last-modified) portion of your index is small enough
to upload to AWS, you can also reduce your re-crawl effort when they have a
fresh release.

~~~
greglindahl
It doesn't have to be small to donate to Common Crawl; they have a free S3
bucket.

------
supersan
Hi, I find the blog more interesting right now, since I hope to find
write-ups about how you were able to manage such a herculean task on your
own.

Crawling 2bn pages could take forever and could generate huge bandwidth
bills, so any lessons you learnt, pitfalls you faced, etc. would be a great
read.

~~~
deusu
Some issues that appeared over the years:

Block outgoing connections to local IP nets in your firewall. Otherwise your
hosting provider might think you are trying to hack them. Apparently there
are a lot of links out there that point to hosts which resolve to private IP
ranges.

Another problem with following links is that you are bound to run across some
that are malware command & control servers. I had several complaints to my
ISP after authorities took control of one and used the C&C server's domain as
a honeypot. My crawler is on a whitelist now.

I had one person who vehemently complained that I was trying to hack him,
because the software downloaded his robots.txt. I'm NOT kidding! :)

Make sure your robots.txt parsing is working correctly. I had an undiscovered
bug in the software at some time which basically caused it to think everything
is allowed. Luckily someone was nice enough to let me know. And he was _really
nice_ about it. And he would have had every right to be angry.
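
For Python-based crawlers the stdlib actually ships a robots.txt parser, and
wrapping it so it can be fed rules directly makes this kind of bug easy to
unit-test (a sketch; the agent name is made up):

```python
from urllib.robotparser import RobotFileParser

def make_checker(robots_txt, agent="ExampleBot"):
    # Parse robots.txt content and return a function answering
    # "may this agent fetch this URL?". Taking the rules as a
    # string (instead of fetching them) keeps the logic testable.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(agent, url)
```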

A major bottleneck is DNS queries. Run your own DNS server and even cache the
hostname/IP pairs yourself. Do not even think about using your ISP's DNS
server. If you bombard them with 100+ DNS requests/s then they WILL be angry.
:)
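
The firewall rule is the safety net; the same check can also live in the
crawler itself by resolving each host before fetching and skipping anything
in a private range (a Python stdlib sketch):

```python
import ipaddress
import socket

def is_safe_to_crawl(hostname):
    # Resolve the hostname and refuse anything that points at a
    # private, loopback, or link-local address -- exactly the
    # traffic a hosting provider would flag as a hack attempt.
    try:
        infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False  # unresolvable: skip it
    for *_, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True
```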

~~~
webtechgal
> Run your own DNS server and even cache the hostname/IP pairs yourself.

This[1] might be a useful resource to get started:

[1] [https://scans.io/](https://scans.io/)

(Register and download the IPv4 Address Space data file to use as an initial
cache and then append/update as you go.)

~~~
deusu
Bookmarked. Thanks!

------
ommunist
DeuSu doesn't seem to index the Cyrillic part of the Internet, and can't give
you insights for Greek either; try
[https://deusu.org/query?q=ελιά](https://deusu.org/query?q=ελιά) . Is it a
Latin/ANSI-only index?

~~~
deusu
Only ASCII and German umlauts (äöüß) at the moment. The parser needs
rewriting. It was originally written in pre-unicode times. :)

------
tychuz
And all javascript-related questions still have w3schools as the first
result, god dammit.

~~~
gkilmain
I think for newbs who want to learn the fundamentals of web dev, w3schools is
a good resource. Even the people over at w3fools admit it. For a deeper dive,
though, MDN is clearly the winner.

------
kowdermeister
Strange, the Wikipedia article is not on the first page - and don't blame me
for searching for something non-German :)

[https://deusu.org/query?q=berlin](https://deusu.org/query?q=berlin)

~~~
semi-extrinsic
It's pretty obvious that Google et al. do a lot of "custom" filtering like
prioritising Wikipedia, removing porn from "obviously non-porn" searches etc.
(That "Berlin" search gives porn as the 8th result.)

~~~
gkst
I doubt that Google prioritizes Wikipedia deliberately. Wikipedia has tons of
backlinks, authority, trust, typically a high text-to-html ratio, and
probably a low bounce rate. Moreover, it is fast, works well on mobile, and
on and on. It's just a very well done site, useful for both users and search
engines.

~~~
kowdermeister
My thoughts as well. They don't need special treatment to be in the top 3.

~~~
allendoerfer
Wikipedia being ranked high is even an indicator for SEOs that an affiliate
niche is not very competitive.

------
0xmohit
Earlier discussion:
[https://news.ycombinator.com/item?id=9122397](https://news.ycombinator.com/item?id=9122397)

------
ommunist
DeuSu does not seem to crawl social pages. No traces of LinkedIn profiles and
no Facebook. From a certain point of view, this is a good thing.

------
billconan
I searched "meta programming c++" and the top returns are all about java.

I'm curious, is it expensive to run a search site like this?

~~~
deusu
Currently €300/month. More details on
[https://deusu.org/donate.html](https://deusu.org/donate.html)

------
vain
Google's secret ingredient to stay relevant and informational is Wikipedia.

Deusu on the other hand seems to weight words in urls highly.

If you search for scientology only on Deusu, you might end up wearing a funky
hat
[https://deusu.org/query?q=scientology](https://deusu.org/query?q=scientology)

------
amirouche
Did you think about using database dumps of popular services like HN, SO or
Wikipedia to speed up crawling and improve relevance?

~~~
deusu
Yes. I have downloaded several data dumps, but haven't gotten around to
importing them yet.

------
outpan
Awesome job!

For the life of me I can't figure out how you manage to crawl over a billion
web pages (even in 2-3 months), index the data and run the server with €300
per month. Especially the crawler part...

------
rbjorklin
What makes this better than [https://duckduckgo.com](https://duckduckgo.com) ?

~~~
diggan
Not saying that it's better, but one of the main selling points of DeuSu
seems to be that it's fully open source with an independent search index.
DuckDuckGo, if I remember correctly, is not 100% open source and gets its
search index from Yahoo (or maybe Bing, not sure).

~~~
kowdermeister
If it's not good, then it doesn't matter if it's OSS or not.

~~~
anewhnaccount
Good for what? Even though this isn't good for use as an every day general
purpose search engine, it could be good for a particular use case perhaps with
some adaptation or for learning from.

~~~
kowdermeister
I don't know why people would use it, to be frank. A lot of better
alternatives exist.

> it could be good for a particular use case

Namely?

> or for learning from.

The author admitted in the github readme that the code quality is rather bad.
I also don't see a link to the search index, the only valuable component of
this project.

~~~
deusu
I will publish the index for download in a few weeks. I'm currently working on
the documentation. Oh, and I will publish the raw crawl-data too. Everything
together is about 2.5tb.

There is also a free API in beta-test right now. Will probably be ready for
official release next week.

~~~
kowdermeister
That's great news, thanks for the info. Sorry for sounding harsh, for being a
side project this is impressive.

Have you also published the ranking mechanism? That way people might
contribute to improving it.

~~~
deusu
It's all open-source. So, yes.

------
vcool07
Any specific reason you've used Pascal? I thought that language went extinct
long ago.

~~~
deusu
It's alive and well. The TIOBE index still lists it ahead of Ruby, Swift,
Objective-C, GoLang...

And I started this software 20 years ago. Granted, a LOT of the software has
changed since then. But I don't see a reason to throw away existing code
unless it is in need of so much change that rewriting from scratch would be
easier. And even then I might stick to what I know best, and what fits best
with other parts of the software.

------
malinens
works really fast!

~~~
deusu
Thx.

But all the traffic from here is currently driving the servers to their limit.
Queries are already slowing down a bit because of imminent overload. Usually
the average query takes about 250ms. Currently the average is at 334ms.

------
scandox
Every time I see new search engine projects I remember this:
[https://en.wikipedia.org/wiki/Cuil](https://en.wikipedia.org/wiki/Cuil)

I note that Dr Anna Patterson is back with Google. She wrote this in 2004:
[http://queue.acm.org/detail.cfm?id=988407](http://queue.acm.org/detail.cfm?id=988407)

~~~
hvo
I am not sure many of the issues Dr. Anna Patterson raised there are
applicable now. The web is way different now compared to then.

------
micwo
Deusu can't find deusu (or deusu.org)

[https://deusu.org/query?q=deusu](https://deusu.org/query?q=deusu)

~~~
deusu
And why should it? You are already at the destination. No need to find it. :)

~~~
micwo
Try to find any other site by url:

[https://deusu.org/query?q=news.ycombinator.com](https://deusu.org/query?q=news.ycombinator.com)

~~~
0xmohit
Try to find `2 + 2 = 4`:

[https://deusu.org/query?q=2+%2B+2+%3D+4](https://deusu.org/query?q=2+%2B+2+%3D+4)

Even
[https://deusu.org/query?q=2+%2B+2+%3D+5](https://deusu.org/query?q=2+%2B+2+%3D+5)
didn't yield any results. I was under the impression that it'd show a message:

    
    
      2 + 2 = 5 for very large values of 2.

------
ashitlerferad
Another open source search engine:

[http://yacy.net/](http://yacy.net/)

~~~
ytjohn
Thanks, I was trying to remember that one. I think that for any new, non-
profit search engine to be viable, it has to be decentralized. deusu.org
takes 2-3 months to crawl 2bn pages. YaCy claims to be at 1.4bn. I don't know
how long it takes for that index to get refreshed, but it has 600 peer
operators. Even if YaCy has a weaker indexing algorithm, I imagine that 600
peers, each crawling and contributing their own set of sites, must be faster
than a single DeuSu node.

Yacy is also quite a bit more resilient.

I will say that I don't buy Yacy's "no censoring" statement. If I was a bad
actor, I could run yacy on a computer with false dns and false certificates,
and yacy could index my fake content with official looking URLs.

