Alternative general purpose search engines are an exciting idea.
It feels a lot like the time when Yahoo was dominant and searching was sort of awful. When you searched, what ranked highest was market-driven stuff.
Right now, for topics normal people search for - not techies - all you get are content-farm sites with JS popups asking for your email address. Try searching for anything health-related, for example. We've regressed.
My own half-solution is to look only for sites that are discussions - reddit, hn, etc. It could be better. A search engine that favored non-marketing content could really steal some thunder.
This doesn't look like that, but maybe it's a start?
Speaking of which, it seems possible for a computer to detect content that is mostly marketing versus content that is not (spam filters already work this way). The search engine could show a "marketing index" score right next to each result. Even better would be to whitelist certain sites (Wikipedia, popular .edu and .org domains) to begin with and prioritize those results.
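A minimal sketch of what such a spam-filter-style "marketing index" could look like, assuming scikit-learn is available and a labelled set of pages exists - the data, labels, and features here are purely illustrative:

```python
# Hypothetical sketch: train a naive Bayes classifier (the same family spam
# filters use) on labelled page text and surface its probability as a
# "marketing index" score next to each search result.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real system would need many thousands of
# labelled pages per class.
pages = [
    "Subscribe now! Limited offer, enter your email to get the free ebook",
    "Buy the best doohickey today, act fast, discount ends soon",
    "The mitochondrion is the site of aerobic respiration in the cell",
    "Thread: has anyone tried compiling FreePascal on ARM? Here are my notes",
]
labels = [1, 1, 0, 0]  # 1 = marketing, 0 = not marketing

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(pages, labels)

def marketing_index(page_text: str) -> float:
    """Probability that a page is mostly marketing, shown next to the result."""
    return float(model.predict_proba([page_text])[0][1])

print(round(marketing_index("Enter your email for exclusive deals"), 2))
```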
It would likely be really niche, but it could become the anti-Google, which would be great when people actually seek an alternative to all the noise.
I think it has to be a non-profit like Wikipedia itself, I cannot imagine a model where it can also make money. The submitted site is a candidate but it has to improve the search quality as other commenters pointed out.
I agree. It's a pretty tough problem, but we can cross that bridge when we come to it. If the search engine stays really niche, perhaps it may not even be worth it for the spammers, while doing enough to cater to a somewhat self-selecting audience. For example, the number of people who want to get to the front page of HN is likely a really minuscule fraction of the people wanting to get to the top of search results.
Also, I wonder if it is possible to detect promotional content by analyzing things like calls to action and such?
Interesting that if this were a parsing problem (i.e. https://news.ycombinator.com/item?id=12478538), folks would immediately suggest accepting known good output, instead of trying to blacklist specific problems. The analogue in search would be something that looks more like a directory than an internet wide search engine.
Of course what killed directories in the early web is that they had no hope of scaling.
If they hide their sell links, shopping baskets, closing pages and such, then they'll kill the sales though.
You can have paid news, but it's not doing anything if the mark can't buy the product afterwards because you had to remove all associations with selling to get the news to rank.
It seems that you could just use Google's algorithms and modify the site trust metric using a front-page spam-score, whilst reducing the effect of link-juice from links with associated marketing keywords ("buy the doohickey on this link", or whatever).
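A rough sketch of the second part - discounting the weight a link passes when its anchor text looks like marketing copy; the keyword list and penalty factor are made up for illustration:

```python
# Hypothetical sketch of damping "link juice" from marketing-flavoured links:
# a link's contribution is reduced when its anchor text contains sales
# keywords, before the weights feed into a PageRank-style computation.
MARKETING_TERMS = {"buy", "order", "discount", "deal", "subscribe", "offer"}

def link_weight(anchor_text: str, base_weight: float = 1.0,
                penalty: float = 0.2) -> float:
    """Return the weight this link contributes to its target's score."""
    words = set(anchor_text.lower().split())
    if words & MARKETING_TERMS:
        return base_weight * penalty   # marketing anchors pass far less juice
    return base_weight

print(link_weight("buy the doohickey on this link"))      # 0.2
print(link_weight("background reading on doohickeys"))    # 1.0
```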
Keeping marketing sites high in your SERPs would make you way more money on referrals though.
Solving algorithmic tasks by just building an ML model of your competitor's algorithm seems like a funny way to start. I imagine this is how "programming" will stop being a thing in a few hundred years.
At the moment it probably would not work for web search, because I imagine that from feature extraction to the final result there are several models involved, creating intermediate results.
Wow, the contrast between what this engine returns for that query and what google returns is amazing. Literally zero relevant links from the former and only relevant links from the latter. Search relevance is a serious high-science research problem, and it's going to be tough to compete with established players that have probably man-centuries' worth of proprietary research IP and some of the world's best scientists.
There were a few attempts at that in the past, one being http://omgili.com/ that now seems to return pretty much garbage.
BTW About 12 years ago I was building this search engine, and I was toying with the idea of building a classifier that classifies web pages based on their "genre" rather than category, so you can limit your search for shopping websites, forums, blogs, news sites, social media, etc. It was a bitch to train, and my classifier's algorithm was pretty crappy, but it showed some potential.
I think modern search engines do that behind the scenes today, and try to diversify the results to include pages from multiple genres, but they usually don't let you choose.
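Purely as an illustration (not the classifier the parent built), a genre detector could start from hand-coded structural cues like these before graduating to a learned model:

```python
# Toy genre classifier: count occurrences of cues typical of shops, forums,
# blogs, and news sites. A real system would learn weights from labelled data
# instead of hand-coding keyword lists.
from collections import Counter

GENRE_CUES = {
    "shop":  ["add to cart", "checkout", "free shipping", "in stock"],
    "forum": ["reply", "thread", "joined:", "posts:", "quote"],
    "blog":  ["posted by", "comments", "archive", "tagged"],
    "news":  ["reuters", "associated press", "breaking", "editor"],
}

def classify_genre(page_text: str) -> str:
    text = page_text.lower()
    scores = Counter({genre: sum(text.count(cue) for cue in cues)
                      for genre, cues in GENRE_CUES.items()})
    genre, hits = scores.most_common(1)[0]
    return genre if hits > 0 else "unknown"

print(classify_genre("Thread: best mower? Reply | Quote | Posts: 120"))  # forum
```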
Heh, classifying by "genre" is exactly what I was thinking of doing.
Had some debate with myself if I should start by focusing on training for shopping pages (product pages & product reviews) - because that might make some money; or start by training for forums - which I'd enjoy a lot more. Or build a more general system which would definitely never work and never get finished.
Google actually let you filter by "discussions" until a few years ago, so they certainly do this kind of classification. It didn't work perfectly but sometimes did the trick. Don't know why they removed that feature.
Google removed it because they aim at the mass market.
Another perspective: people who find answers in forums are less likely to be interested in ads. And who knows, maybe by making search shitty (in so many ways, not just forums), ad revenues rise?
IIRC I found that the easiest way was to train on shops, forums, and porn. But another tricky bit was conceptual - genre and category sometimes overlap (e.g. porn). Anyway, I couldn't get it to yield proper results. But today we have things we didn't back then, like OpenGraph and schema.org tags, which give more semantic info.
Often when searching for something, I would love this as well. I usually add phpbb, forums, or discussion to the search keywords, but it never works 100%.
Oh, is this the search wish-list thread? OK: I just want a search engine that only indexes the sites that would be of interest to people like me.
For most of my searches over the last year, Google has been so broken that it's almost unusable. At this point, to get any relevant results, I have to anticipate how Google will work, and then trial and error 10-15 times until I find what I'm looking for.
But, if I happen to be looking for incorrect information from 2005, Google works like a charm.
I always had this, maybe stupid, idea of a vouching/referral search engine. You pick a few sites you know are good, and they (their 'webmasters') would vouch for new ones, and those could vouch for new ones, etc. The catch is, if one or two (whatever) of your child or grandchild vouched sites screw up, then they're toast, out of the index - but so are you. That way you would pick wisely.
The same idea would probably work for online commenting. Vouch with a chain of responsibility. That's essentially how PageRank did its thing, but with no repercussions, and vouching was automatic based on links from an initial seed of what they thought was good. I'd do it with humans.
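A toy sketch of that vouching chain with repercussions; the site names and data structure are hypothetical:

```python
# Hypothetical vouching index: each site records who vouched for it, and when
# a site misbehaves, the whole chain of vouchers is removed too, so vouchers
# have skin in the game.
vouched_by = {                      # child -> parent that vouched for it
    "good-blog.example": None,      # seed site, trusted manually
    "new-shop.example": "good-blog.example",
    "spam-farm.example": "new-shop.example",
}

def ban_with_vouchers(site, index):
    """Remove a misbehaving site and every ancestor that vouched for it."""
    while site is not None and site in index:
        index.discard(site)
        site = vouched_by.get(site)

index = set(vouched_by)
ban_with_vouchers("spam-farm.example", index)
print(index)  # empty: the whole vouching chain is out of the index
```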
I always had an inkling for a Search Engine that ONLY indexed the root page of every domain. Not sure if I'm right about this, but it sure would sort the chaff from the wheat for general purpose queries.
Seems like that would just give you all those made-for-seo sites that tend to have second-rate content at best. ie, you search for 'best electric lawn mower', and you'll get bestelectriclawnmowers.com, 10bestelectricmowers.com, etc. Those sort of sites exist for every imaginable topic, and in my experience are rarely worth visiting.
I would almost want the opposite. The best content on most topics I've found tends to be a page on a discussion forum of some sort, followed by blogs and more general editorial sites.
What we really want is a system that classifies your query as being one of "forum/blog/shopping" and then makes a scan only over that class of pages.
So on launch day you would have checkboxes. On the one-year anniversary you'd have those checkboxes removed.
Google probably implemented this a couple of decades ago, though, so what we _really_ need is someone to come up with a new business model more attractive than Google's.
The sheer scale required to attempt a new search engine is pretty staggering... it seems like one area where decentralisation might actually be worthwhile; the key obstacle being everyone's interest in gaming search results. I wonder if there's an application of ledgers that'd be useful in there somewhere...
Why not have a search engine with "sub-reddits" that can be subscribed to...
Whereby a site would self-identify as being in a particular genre, say "healthcare". I could launch a tab to the engine, set my subs to "health, health-tech, healthcare, medicine, etc.", and then do my search, and only sites that set their category would show up in that search. But if I don't find what I'm looking for, I can easily slide out to other areas that I may not have thought my search would have identified with. Further, any post by any company/site could individually be given a topic to self-declare as... thus even if the company or site isn't necessarily in that space, their page or object could at least be a part of that result ranking.
It is often said that Google's competitors did not use anything like PageRank, which is not true. But Google's PageRank algorithm was better and cheaper than the competitors', and it effectively killed your idea 20 years ago.
But I find it slightly ironic that people are bitching about PageRank having some issues with respect to the specificity of what they are searching for...
meaning that even though it "killed this idea twenty years ago", we are coming back to the same problem...
Is that perhaps just due to the volume of info that is available on the web, and the much more complex way we have categorized (mentally, not digitally) all the knowledge and information that's out there now?
A few years ago Google had a "discussions" category that you could pick alongside "images", "videos", etc. I wonder why they removed it. It was indexing forums, Google and Yahoo groups, etc.
How would this work? You could boost the query term, for instance, like it is possible to boost a column's score in PostgreSQL, but that is all. Otherwise, allowing users to provide their own ranking function (which is itself an art) would not be practical performance-wise. It should be noted that the search engine interface - the search box - is already a DSL for the underlying algorithm, supporting OR, AND and NOT.
That is a lovely idea. Unfortunately, a scoring scheme has one foot in the indexing process (that thing that the google bot does) and another in the querying part, so switching schemes would often mean you would need to re-index your data to cater for the new metrics you now need for a new type of scoring.
Neither indexing nor querying does the ranking. Ranking is done after indexing and can be tf-idf, PageRank, or a combination of the two.
Once the document's similarity to the query is calculated, for example with a vector space model, the documents are ranked by PageRank.
What OP is saying is that instead of PageRank we could have other ranking methods, which is surely plausible.
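For illustration only, a minimal sketch of that split: a query/document similarity score combined with a swappable document-level prior such as PageRank. The weights and the toy corpus are made up:

```python
# Sketch: final score = alpha * query similarity + (1 - alpha) * prior,
# where the prior could be PageRank or any other authority/spam score.
import math
from collections import Counter

def tf_idf_similarity(query, doc, corpus):
    """Tiny vector-space similarity: sum of tf-idf weights of query terms."""
    n = len(corpus)
    doc_terms = Counter(doc.lower().split())
    score = 0.0
    for term in set(query.lower().split()):
        df = sum(1 for d in corpus if term in d.lower().split())
        if df and term in doc_terms:
            score += doc_terms[term] * math.log(n / df)
    return score

def rank(query, docs, priors, corpus, alpha=0.7):
    """Combine similarity with a swappable per-document prior."""
    scored = [(alpha * tf_idf_similarity(query, d, corpus) +
               (1 - alpha) * priors[d], d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

corpus = ["cats and dogs", "dog training tips", "cat care basics"]
priors = dict(zip(corpus, [0.2, 0.5, 0.9]))   # e.g. PageRank-like scores
print(rank("cat care", corpus, priors, corpus))
```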
Sure, but what I was saying was: what good is a new ranking method when you only have at your disposal the same set of metrics as the method you are trying to replace? A new ranking would quite often mean adding new metrics. For example, when Lucene went from tf-idf to BM25, they added lots of new metrics to be able to cater for the new algorithm.
Well, I admire the work behind it, and I think the idea is good (especially how having this open source means multiple sites can build on the same data set and get it more and more accurate over time).
But I have to be honest and say that it's just not working for me.
I type in Reddit, and it shows links to the NSFW subreddits instead of the main site or anything else on it.
Typing in Wikipedia gives me the Dutch version of Wikipedia.
Mario Wiki? The page on Mario Wiki about Mario Clash, then the Smash Bros Wiki and a bunch of SEO spam pages.
Pokemon Go gets me no relevant results at all. Certainly not anything official, that's for sure.
It's a decent start, and having 2 billion pages indexed is pretty impressive for a project like this as it is, but it's just not really usable as a search engine just yet.
They need to filter porn out of their search results (even for common queries like "hat", there's only porn) and perhaps be more resilient to SEO techniques since it looks like there's lot of spam on top results. Queries with common words such as "cat" return almost only irrelevant results.
I'd really like to see that kind of project working as a good alternative to Google, but as it is it's not really usable.
I hadn't even thought about that. But it should be pretty easy to do in post-processing. I just have to take a list of "porn" keywords. If none of them occurs in the query, but one does occur in a search result, then that result gets downranked.
We mostly filtered out porn by using a two-word phrase method. There were a lot of edge cases, because many potentially dirty concepts are made up of words that are not bad when used alone. For example, a text can have both "girls" and "nude" in it without being vulgar, but if it has the phrase "nude girls", the chance of it being pornographic is much higher.
Yes, I guess filtering them out would at least make the website SFW, and it would make it easier to show it to people. The issue seems to happen mainly with common words (whose results also appear to be polluted with heavily SEO-ed websites).
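A minimal sketch of such a phrase-based filter - the bigram list and the threshold interpretation are illustrative only:

```python
# Single words like "nude" or "girls" are not penalised on their own, but
# known two-word phrases push the page's adult score up.
ADULT_BIGRAMS = {("nude", "girls"), ("hardcore", "porn")}

def adult_score(text: str) -> int:
    """Count adult bigrams; pages above a chosen threshold get downranked."""
    words = text.lower().split()
    return sum(1 for pair in zip(words, words[1:]) if pair in ADULT_BIGRAMS)

print(adult_score("gallery of nude girls"))                          # 1 -> filter
print(adult_score("girls football team poses nude statue question")) # 0 -> keep
```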
I've also searched for less generic things like "xperia z5" and the results looked good.
I don't know. But I do know that the end-of-year statistics from search engines about what people searched for are complete BS. I have such a list for the German DeuSu page:
I've been thinking a lot about data recently. It seems to me like Pandora's box is open. Google knows where you live, where you eat, what your fetishes are, all of your sexual partners. Facebook knows most of those things too, via different methods. And if you run Windows, Microsoft probably has access to most of that as well. Apple will too, because if they don't, they won't be able to compete. Tesla, Uber, and Waze also have a huge amount of data on your life.
Everyone is pushing the envelope on how much data they are collecting, and the companies which collect more data will compete better. As tech gets better we will increasingly be unable to resist sharing our whole lives with the companies who are powering modern living.
Even worse, there's a huge monopolization effect to having data. Nobody else has anywhere near as much data as Google. That means nobody else can compete. Never mind the engineering; your algorithms can be 2x as good, but you won't have 0.1% of the data of a company with billions of daily users.
So Google and Facebook are left untouchable. Microsoft, Apple, and maybe Amazon can get in range. Is there anyone else?
We can fight back by giving up the privacy war and blowing the doors open instead. Take your data (as much as you dare) and make it public. Let every startup have access to it. Let every doctor have access to it. Give the small players a fighting chance.
That does mean a massive cultural shift. It means your neighbors will be able to look up your salary, your fetishes, your personal affairs. It's a big deal.
I don't see any other way out of this though. Surveillance technology is getting better faster than privacy technology, because surveillance tech has the entire tech industry behind it. Smarter phones, smarter TVs, smarter grocery stores, smarter credit cards, smarter shoes... smarter everything. Privacy is melting away and we aren't getting it back.
A free search API will be fully available probably next week. It's in testing already. It's just a matter of putting the finishing touches on the documentation.
And the crawl- and index-data will be available for download in a few weeks. It's also just a matter of documenting the data-format.
BTW: I disagree with your points about privacy. I see DeuSu as a way of fighting back.
Interesting perspective. What about the other extreme, though? This circumstance has only really existed in the past 20 years or so, maybe less. Why not just revert some of your behaviors?
- Delete your Facebook account. If you really need it to keep in touch with people across the country, at least delete the app from your phone and don't leave it open in a browser tab.
- Don't place asinine Amazon orders just because they ship free. Stop by a drug store or hardware store on your way home from work for odds & ends. You can even pay cash and kill the CC data bird with the same stone. Bonus points for instant gratification.
- Don't use GMail. Use an email provider like Protonmail or Tutanota that doesn't index all of your emails for advertising or other purposes.
- Don't sign in to Google (or Waze or whoever else's site) when navigating or searching so all of your actions online aren't automatically tied to a single account.
- Don't buy some internet-of-shit appliance that inevitably phones home with a bunch of telemetry data just so you can get a push notification when your laundry is dry.
Want to go really extreme? Now that you've uninstalled Facebook and aren't getting push notifications from your toaster, ditch your smart phone. Pay $15/mo for calls/texts on a Nokia 3310 and save some cash while you're at it. Want to listen to music on the go? Buy an MP3 player like everyone did 10 years ago. Want to play games on the go? Buy a GameBoy and experience the wonder of physical buttons while gaming. Really need directions on the go? You can probably get by with a Garmin or even gasp a paper map. Want to read HN or Reddit on the toilet? Try a book. Yes, I've written mobile apps and my iPhone is sitting on the desk right in front of me, but I hate the fucking thing and its always-connected mobile data. I've done the rest of the above bullet points and don't plan on buying another smart phone when this one hits planned-obsolescence in another year or two. Aggressive data collection depends on user engagement. The easiest way to fight it is to just quit engaging. Usually it'll save you money, too, which is nice.
I am on your side. Except I need to contact some friends and - more importantly - customers, so I cannot entirely ditch everything. I cannot get rid of Facebook (Pages, API, relatives, even some customers), Skype (customers) and Whatsapp (friends). I am waiting for Whatsapp to become available on either Ubuntu Touch or Firefox OS. I would use these before buying a dumbphone and an MP3 player.
In addition to porn not being removed and the ordering of the results not prioritizing "quality" sources, some of the indexed site data is at least 4-6 months old and has changed heavily since the last crawl. I even got 404 errors. That makes it very hard to find any real use for the project other than academic interest.
Here is some input based on my experience building a similar project at my former company (we did not quite get to 2B pages, but were at around 300M):
For creating a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Now, obviously, re-crawling a massive index frequently/regularly is going to need/consume some huge amounts of bandwidth + CPU cycles. Here is how we had optimized the resource utilization:
Corresponding to each indexed URL, store a 'Last Crawled' timestamp.
Corresponding to each indexed URL, also store a sort of 'crawl history' (if space is a constraint, don't store each version of the URL, only the latest one). On each re-crawl, store two data fields: a timestamp and a boolean indicating whether the URL's content has changed since the last crawl. As more re-crawl cycles run, you will be able to calculate/predict the 'update frequency' of each URL. Then prioritize the re-crawls based on the update-frequency score, i.e. re-crawl those with higher scores more frequently and the others less frequently (see the sketch below).
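A rough sketch of that scheduling idea, under assumed record and field names - the data layout is purely illustrative:

```python
# Prioritize re-crawls by how often a page has actually changed in the past.
import time

def change_rate(history):
    """Fraction of past re-crawls on which the page content had changed."""
    return sum(history) / len(history) if history else 1.0  # unknown pages go first

def recrawl_priority(record, now=None):
    """Older pages that change often get the highest priority."""
    now = now or time.time()
    age = now - record["last_crawled"]              # seconds since the last crawl
    return age * change_rate(record["changed_history"])

url_records = [
    {"url": "https://example.com/news",  "last_crawled": time.time() - 86400,
     "changed_history": [True, True, True]},
    {"url": "https://example.com/about", "last_crawled": time.time() - 86400,
     "changed_history": [False, False, True]},
]
queue = sorted(url_records, key=recrawl_priority, reverse=True)
print([r["url"] for r in queue])   # the frequently-changing page comes first
```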
If you need any more help/input, let me know and I'll be happy to do what I can.
We had also (obviously) built a (proprietary) ranking algo that took into account some 60+ individual factors. If it can be of any help, I'll create a list and send it to you.
Good idea. However, I'll need to really exercise the gray cells to put together the list so it might take me a couple of days. Once done, I'll post it here.
I think projects like this are really important because they help reduce the impression that big server projects are only meant to be done by big companies. The internet is becoming a content consumption medium for many people.
I'm not sure I'll use this, but I'll try to... it all depends on how good it is. But I approve of the project so I sent a (very) small bitcoin donation to hopefully help fund it for a few more minutes :)
File formats will be documented when I publish the data-files in a few weeks.
What do you mean by postings?
The main index is split into 32 shards (there is also an additional news index which is updated about every 5-10 minutes). Each shard is updated and queried separately. The query actually runs 2/3 on a Windows server and 1/3 on a Linux server, the latter in Docker containers. I want to move everything to Linux over time.
Query has two phases. First only a rough - but fast - ranking is done. Then the top results of all shards are combined and completely re-ranked. This is basically a meta search engine hidden within.
First query phase is in src/searchservernew.dpr, and the second phase is in src/cgi/PostProcess.pas.
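The actual code is Pascal (src/searchservernew.dpr and src/cgi/PostProcess.pas); purely as an illustration, the two-phase flow described above could be sketched like this, with placeholder scorers standing in for the real ranking:

```python
# Phase 1: each shard does a cheap ranking and returns only its top candidates.
# Phase 2: the merged candidates from all shards are completely re-ranked,
# like a meta search engine sitting on top of its own shards.
def query_shard(shard, query, k=100):
    hits = [(cheap_score(doc, query), doc) for doc in shard]
    return sorted(hits, reverse=True)[:k]

def search(shards, query, k=100):
    candidates = [doc for shard in shards
                  for _, doc in query_shard(shard, query, k)]
    return sorted(candidates, key=lambda doc: full_score(doc, query), reverse=True)

# Placeholder scorers so the sketch runs; real ones would use index statistics.
def cheap_score(doc, query):
    return sum(doc.lower().count(w) for w in query.lower().split())

def full_score(doc, query):
    return cheap_score(doc, query) / (1 + len(doc))

print(search([["cat pictures", "dog training"], ["cheap cats for sale"]], "cat"))
```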
Thank you. "Postings" is another word for the format of the doc ids and related information in the inverted file. A google for "inverted index postings" will turn up a bunch of references.
Written in Delphi. I might be wrong, but I don't see many people downloading and working on it. 30-day free trial and then you have to pay for the development environment. IMHO it's a non-starter for an open source project, but if it's the only language the author is comfortable with, well, that's OK.
Originally it was written in Delphi. But I now use FreePascal for the development. I'm even compiling both Windows and Linux versions on my Linux machine.
When I search for myself, the top 10 results don't even have my last name ('Kusters') and just show pages that have the word 'Nick'. I suppose you don't use a form of LSA to score the search results? Maybe it's too specific, but AFAIK mainstream search engines seem to give somewhat consistent results here.
Looking at the code (https://github.com/MichaelSchoebel/DeuSu/) I notice that you have ranking modifiers based on the .tld; why not store the reported content language and score based on that? Isn't that more relevant?
In my experience this is usually caused by the fact that even 2bn pages aren't that many nowadays. The index needs to get bigger to better find (and rank) long-tail results for queries like this.
Pascal is an interesting language choice. I think this is the first time I've seen an open source project written in Pascal that is actually used in production.
It shows snippets of the web pages under each result; however, generally not the particular snippets that contain the search term. I would think that would be useful.
The snippets are currently the first 255 characters of the page's text. For snippets to be customized to the search term, I would have to store all the text of the page. And that would require a lot more disk space. Space that I can't afford at the moment.
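Hypothetically, if the full page text were stored, query-aware snippets could be picked by sliding a window over the text and keeping the one with the most query-term hits:

```python
# Pick the ~255-character window of the page containing the most query terms,
# instead of always showing the first 255 characters.
def best_snippet(text: str, query: str, width: int = 255) -> str:
    terms = [t.lower() for t in query.split()]
    lower = text.lower()
    best_start, best_hits = 0, -1
    for start in range(0, max(1, len(text) - width), 50):  # slide in 50-char steps
        window = lower[start:start + width]
        hits = sum(window.count(t) for t in terms)
        if hits > best_hits:
            best_start, best_hits = start, hits
    return text[best_start:best_start + width]

page = "..." * 50 + " FreePascal compiles the DeuSu crawler on Linux " + "..." * 50
print(best_snippet(page, "DeuSu crawler"))
```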
The site's interface is just incredibly pleasant compared to Google.com. I really hope the author sticks with it. Unfortunately I'm not sure it's usable right now, searching "group theory Wikipedia" never brings up a Wikipedia page (although maybe I should just be directly searching Wikipedia if that's what I wanted).
DDG is my primary search engine; it takes time, but you get used to it. If what you are looking for is mostly on HN, SO or Wikipedia, it works quite well.
As of Aug 16, Common Crawl has 1.73bn pages. For the complementary set of URLs, you could use their data dump as a seed if that's of any benefit.
If the metadata (such as last-modified) portion of your index is small enough to upload to AWS, you can also reduce your re-crawl effort when they have a fresh release.
Hi, I find the blog more interesting right now, since I hope to find write-ups about how you were able to manage such a herculean task on your own.
Crawling 2bn pages could take forever and could generate huge bandwidth bills, so any lessons you learned, pitfalls you faced, etc. would be a great read.
Block outgoing connections to local IP nets in your firewall. Otherwise your hosting provider might think you are trying to hack them. Apparently there are a lot of links out there that point to hosts which resolve to private IP ranges.
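Purely as a sketch of a second line of defence inside the crawler itself (in addition to the firewall rule), one could check whether a host resolves to a private address before fetching:

```python
# Skip the fetch if the hostname resolves to a private, loopback, or
# link-local address.
import ipaddress
import socket

def safe_to_crawl(hostname: str) -> bool:
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(hostname))
    except (socket.gaierror, ValueError):
        return False                      # unresolvable hosts are skipped too
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)

print(safe_to_crawl("example.com"))   # True: public address
print(safe_to_crawl("localhost"))     # False: loopback
```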
Another problem with following links is that you are bound to run across some that are malware command & control servers. Had several complaints to my ISP after authorities took over control of one and used the C&C server's domain as a honeypot. My crawler is on a whitelist now.
I had one person who vehemently complained that I was trying to hack him, because the software downloaded his robots.txt. I'm NOT kidding! :)
Make sure your robots.txt parsing is working correctly. I had an undiscovered bug in the software at some time which basically caused it to think everything is allowed. Luckily someone was nice enough to let me know. And he was really nice about it. And he would have had every right to be angry.
A major bottleneck is DNS queries. Run your own DNS server and even cache the hostname/IP pairs yourself. Do not even think about using your ISP's DNS server. If you bombard them with 100+ DNS requests/s then they WILL be angry. :)
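A minimal sketch of caching hostname/IP pairs inside the crawler process; a real setup would also honour TTLs and still run its own resolver:

```python
# Resolve each hostname once, then serve later lookups from an in-process
# cache so repeated requests to the same host generate no DNS traffic.
import socket
from functools import lru_cache

@lru_cache(maxsize=1_000_000)
def resolve(hostname: str) -> str:
    return socket.gethostbyname(hostname)

resolve("example.com")   # network lookup
resolve("example.com")   # served from the cache, no DNS query
```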
DeuSu does not seem to index the Cyrillic part of the Internet, and cannot give you results for Greek either - try https://deusu.org/query?q=ελιά . Is the index Latin/ANSI only?
I think for newbs who want to learn the fundamentals of web dev w3schools is a good resource. Even the people over at w3fools admit it. For a deeper dive though clearly MDN is the winner.
It's pretty obvious that Google et al. do a lot of "custom" filtering like prioritising Wikipedia, removing porn from "obviously non-porn" searches etc. (That "Berlin" search gives porn as the 8th result.)
I doubt that Google prioritizes Wikipedia deliberately. Wikipedia has tons of backlinks, authority, trust, typically a high text-to-HTML ratio, and probably a low bounce rate. Moreover, it is fast, works well on mobile, and on and on. It's just a very well done and useful site for users and search engines.
For the life of me I can't figure out how you manage to crawl over a billion web pages (even in 2-3 months), index the data and run the server with €300 per month. Especially the crawler part...
Not saying that it's better, but one of the main selling points of DeuSu seems to be that it's fully open source with an independent search index. DuckDuckGo, if I remember correctly, is not 100% open source and gets its search index from Yahoo (or maybe Bing, not sure).
Good for what? Even though this isn't good for use as an every day general purpose search engine, it could be good for a particular use case perhaps with some adaptation or for learning from.
To be frank, I don't know why people would use it. Much better alternatives exist.
> it could be good for a particular use case
Namely?
> or for learning from.
The author admitted in the GitHub readme that the code quality is rather bad. I also don't see a link to the search index, the only valuable component of this project.
I will publish the index for download in a few weeks. I'm currently working on the documentation. Oh, and I will publish the raw crawl-data too. Everything together is about 2.5tb.
There is also a free API in beta-test right now. Will probably be ready for official release next week.
I think such publication is very important. I have no idea at all if open source search engines could ever work.
But if they can, I think a big part of it would be separation of the crawl index and the UI / prioritisation etc. Different people can work on those two ends of the problem and apply different philosophies.
Search only forums? Reject porn using XYZ method? Great! But they can all use the same database, or pick from a common community of databases.
It would be great if you could share (at least) some information about the kind of hosting setup you're using, how much bandwidth it takes, and how long it took to crawl and index the 2B pages.
2 are used for crawling, index-building and raw-data storage. Quadcore, 32gb RAM, 4tb HDD and 1gbit/s internet connection on each of these. They are rented and in a big data-center. Crawling uses "only" about 200-250mbit/s of bandwidth.
2 servers for webserver and queries. Quadcore, 32gb RAM. One with 2x512gb SSD, the other with only 1x512gb SSD. These servers are here at home. I have cable internet with 200mbit/s down, 20mbit/s up. Static IPs obviously.
It's alive and well. The TIOBE index still lists it ahead of Ruby, Swift, Objective-C, GoLang...
And I started this software 20 years ago. Granted, a LOT of the software has changed since then. But I don't see a reason to throw away existing code unless it is in need of so much change that rewriting from scratch would be easier. And even then I might stick to what I know best, and what fits best with other parts of the software.
But all the traffic from here is currently driving the servers to their limit. Queries are already slowing down a bit because of imminent overload. Usually the average query takes about 250ms. Currently the average is at 334ms.
Thanks, I was trying to remember that one. I think that for any new, non-profit, search engine to be viable, it has to be decentralized. deusu.com takes 2-3 months to crawl 2bn pages. Yacy claims to be at 1.4bn. I don't know how long it takes for that index to get refreshed, but it has 600 peer operators. Even if Yacy has a weaker indexing algorithm, I imagine that 600 peers, each crawling and contributing their own set of sites must be faster than a single deusu node.
Yacy is also quite a bit more resilient.
I will say that I don't buy Yacy's "no censoring" statement. If I was a bad actor, I could run yacy on a computer with false dns and false certificates, and yacy could index my fake content with official looking URLs.