It seems a lot like we're back at the time when Yahoo was dominant and searching was sort of awful. When you searched, what ranked highest was market-driven stuff.
Right now, for topics normal people search for - not techies - all you get are content-farm sites with JS popups asking for your email address. Try searching for anything health-related, for example. We've regressed.
My own half-solution is to look only for sites that are discussions - reddit, hn, etc. It could be better. A search engine that favored non-marketing content could really steal some thunder.
This doesn't look like that, but maybe it's a start?
It would likely be really niche, but it could become the anti-Google, which would be great when people actually seek an alternative to all the noise.
I think it has to be a non-profit like Wikipedia itself; I can't imagine a model where it can also make money. The submitted site is a candidate, but it has to improve its search quality, as other commenters pointed out.
But judging by the spam arms race, marketers will then tune their content so that it doesn't trip those filters. Paid news and journal articles, etc.
But yes, even here on HN there are problems distinguishing between legitimate articles and paid news.
Also, I wonder if it's possible to detect promotional content by analyzing things like calls to action and such?
Of course what killed directories in the early web is that they had no hope of scaling.
You can have paid news, but it's not doing anything if the mark can't buy the product afterwards because you had to remove all associations with selling to get the news to rank.
Keeping marketing sites high in your SERPs would make you way more money on referrals though.
At the moment it probably would not work for web search, because I imagine that from feature extraction to the final result there are several models involved, creating intermediate results.
A search for "java twitter bot" places more emphasis on "bot" than on "java", and then on "twitter", which is what tf-idf would do.
A good start, like you said, but it's miles away even from Yahoo or Bing.
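For anyone unfamiliar with the term, here's a toy tf-idf sketch in Python (nothing to do with DeuSu's actual Pascal code) showing why "bot" would dominate "java" and "twitter": rarer terms get higher weight. The document-frequency numbers are invented purely for illustration.

    # Toy tf-idf weighting sketch; document frequencies are made up.
    import math

    N = 2_000_000_000  # hypothetical index size in pages
    doc_freq = {"java": 150_000_000, "twitter": 400_000_000, "bot": 40_000_000}

    def idf(term: str) -> float:
        # Inverse document frequency: rarer terms get a higher weight.
        return math.log(N / (1 + doc_freq[term]))

    def tfidf_score(term_counts: dict) -> float:
        # Score a document from its per-term counts for the query terms.
        return sum(count * idf(term) for term, count in term_counts.items())

    # "bot" ends up weighted highest, "twitter" lowest:
    print({t: round(idf(t), 2) for t in doc_freq})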
We are working on this scenario at http://www.shoten.xyz using document clustering and Apache Spark GraphX + Giraph.
Ain't been an easy task so far, but we have made some headway.
BTW, about 12 years ago I was building this search engine, and I was toying with the idea of building a classifier that classifies web pages based on their "genre" rather than category, so you could limit your search to shopping websites, forums, blogs, news sites, social media, etc. It was a bitch to train, and my classifier's algorithm was pretty crappy, but it showed some potential.
I think modern search engines do that behind the scenes today, and try to diversify the results to include pages from multiple genres, but they usually don't let you choose.
Had some debate with myself if I should start by focusing on training for shopping pages (product pages & product reviews) - because that might make some money; or start by training for forums - which I'd enjoy a lot more. Or build a more general system which would definitely never work and never get finished.
Google actually let you filter by "discussions" until a few years ago, so they certainly do this kind of classification. It didn't work perfectly but sometimes did the trick. Don't know why they removed that feature.
Another perspective: people who find answers in forums are less likely to be interested in ads. And who knows, maybe by making search shitty (in so many ways, not just forums), ad revenues rise?
I would love something like this as well...
For most of my searches over the last year, Google has been so broken that it's almost unusable. At this point, to get any relevant results, I have to anticipate how Google will work, and then trial and error 10-15 times until I find what I'm looking for.
But, if I happen to be looking for incorrect information from 2005, Google works like a charm.
The same idea would probably work for online commenting. Vouch with a chain of responsibility. That's essentially how PageRank did its thing, but with no repercussions, and vouching was automatic based on links from an initial seed of what they thought was good. I'd do it with humans.
I would almost want the opposite. The best content on most topics I've found tends to be a page on a discussion forum of some sort, followed by blogs and more general editorial sites.
So on launch-day you would have checkboxes. On the one-year anniversary you'd have those checkboxes removed.
Google probably implemented this a couple of decades ago, though, so what we _really_ need is someone to come up with a new business model more attractive than Google's.
Ad-free is quite refreshing.
Whereby a site would self-identify as being in a particular genre, say "healthcare", and I could launch a tab to the engine, set my sub to "health, health-tech, healthcare, medicine, etc.", and then do my search, and only those sites that set their category would show up in that search. But if I don't find what I'm looking for, I can then easily slide out to other areas that I may not have thought my search would have identified with. Further, any post by any company/site could individually be given a topic to self-declare as... thus even if the company or site isn't necessarily in that space, their page or object could at least be part of that result ranking....
Or has this been tried/found to be stupid?
You are describing the keywords meta tag.
It is often claimed that competitors before Google did not use something like PageRank, which is not true. But Google's PageRank algorithm was better and cheaper than the competitors', and it effectively killed your idea 20 years ago.
But I find it slightly ironic that people are bitching about PageRank having some issues with respect to the specificity of what they are searching for...
meaning that even though it "killed this idea twenty years ago", we are coming back to the same problem...
Is that perhaps just due to the volume of info that is available on the web and the much more complex way we have categorized (mentally, not digitally) all the knowledge and information that's out there now?
What the OP is saying is that instead of PageRank we can have other ranking methods, which is surely plausible.
We use tf-idf too, but augment it with PageRank and clustering. It gets more relevant docs.
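For what it's worth, here's a toy sketch of what "augmenting" tf-idf with a link score could look like; the linear blend and the 0.7 weight are assumptions for illustration, not shoten.xyz's actual formula.

    # Toy blend of a tf-idf text score with a link score such as PageRank.
    def combined_score(tfidf: float, pagerank: float, alpha: float = 0.7) -> float:
        # alpha controls how much the text match dominates the link signal
        return alpha * tfidf + (1 - alpha) * pagerank

    print(combined_score(tfidf=3.2, pagerank=0.8))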
Isn't this a description of any decent search engine?
But I have to be honest and say that it's just not working for me.
I type in Reddit, and it shows links to the NSFW subreddits instead of the main site or anything else on it.
Typing in Wikipedia gives me the Dutch version of Wikipedia.
Mario Wiki? The page on Mario Wiki about Mario Clash, then the Smash Bros Wiki and a bunch of SEO spam pages.
Pokemon Go gets me no relevant results at all. Certainly not anything official, that's for sure.
It's a decent start, and having 2 billion pages indexed is pretty impressive for a project like this as it is, but it's just not really usable as a search engine just yet.
I'd really like to see that kind of project working as a good alternative to Google, but as it is it's not really usable.
We mostly filtered out porn by using a two-word-phrase method. There were a lot of edge cases, because many potentially dirty concepts are made up of words that are not bad when used alone. For example, a text can have both "girls" and "nude" in it without being vulgar, but if it has the phrase "nude girls", the chance of it being pornographic is much higher.
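Not the original code, just a rough Python sketch of that two-word-phrase idea; the flagged bigram list and the threshold are placeholders.

    import re

    FLAGGED_BIGRAMS = {("nude", "girls")}  # example entry only

    def looks_pornographic(text: str, threshold: int = 1) -> bool:
        # Count flagged two-word phrases; single words on their own don't count.
        words = re.findall(r"[a-z]+", text.lower())
        hits = sum(1 for pair in zip(words, words[1:]) if pair in FLAGGED_BIGRAMS)
        return hits >= threshold

    print(looks_pornographic("girls on the beach, nude sculptures nearby"))  # False
    print(looks_pornographic("free nude girls"))                             # True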
I've also searched for less generic things like "xperia z5" and the results looked good.
I'm gonna further improve this over the next few days. Right now it's just a quick'n'dirty hack. :)
Warning! This is definitely NSFW! :)
So if you filter out all the adult stuff you might make those users unhappy. Perhaps make it configurable?
Eg. add a checkbox for NSFW results or something better.
I've been thinking a lot about this recently. Seems to me like Pandora's box is open. Google knows where you live, where you eat, what your fetishes are, all of your sexual partners. Facebook knows most of those things too, via different methods. And if you run Windows, Microsoft probably has access to most of that as well. Apple will too, because if they don't they won't be able to compete. Tesla, Uber and Waze also have a huge amount of data on your life.
Everyone is pushing the envelope on how much data they are collecting, and the companies which collect more data will compete better. As tech gets better we will increasingly be unable to resist sharing our whole lives with the companies who are powering modern living.
Even worse, there's a huge monopolization effect to having data. Nobody else has anywhere near as much data as Google. That means nobody else can compete. Never mind the engineering; your algorithms can be 2x as good, but you won't have 0.1% of the data of a company with billions of daily users.
So Google and Facebook are left untouchable. Microsoft, Apple, and maybe Amazon can get in range. Is there anyone else?
We can fight back by giving up the privacy war and blowing the doors open instead. Take your data (as much as you dare) and make it public. Let every startup have access to it. Let every doctor have access to it. Give the small players a fighting chance.
That does mean a massive cultural shift. It means your neighbors will be able to look up your salary, your fetishes, your personal affairs. It's a big deal.
I don't see any other way out of this though. Surveillance technology is getting better faster than privacy technology, because surveillance tech has the entire tech industry behind it. Smarter phones, smarter TVs, smarter grocery stores, smarter credit cards, smarter shoes... smarter everything. Privacy is melting away and we aren't getting it back.
A free search API will be fully available probably next week. It's in testing already. It's just a matter of putting the finishing touches on the documentation.
And the crawl and index data will be available for download in a few weeks. It's also just a matter of documenting the data format.
BTW: I disagree with your points about privacy. I see DeuSu as a way of fighting back.
If you based a project on Lucene/Solr/Elasticsearch or Sphinx, you'd also increase the chance of people contributing to it.
- Delete your Facebook account. If you really need it to keep in touch with people across the country, at least delete the app from your phone and don't leave it open in a browser tab.
- Don't place asinine Amazon orders just because they ship free. Stop by a drug store or hardware store on your way home from work for odds & ends. You can even pay cash and kill the CC data bird with the same stone. Bonus points for instant gratification.
- Don't use GMail. Use an email provider like Protonmail or Tutanota that doesn't index all of your emails for advertising or other purposes.
- Don't sign in to Google (or Waze or whoever else's site) when navigating or searching so all of your actions online aren't automatically tied to a single account.
- Don't buy some internet-of-shit appliance that inevitably phones home with a bunch of telemetry data just so you can get a push notification when your laundry is dry.
Want to go really extreme? Now that you've uninstalled Facebook and aren't getting push notifications from your toaster, ditch your smart phone. Pay $15/mo for calls/texts on a Nokia 3310 and save some cash while you're at it. Want to listen to music on the go? Buy an MP3 player like everyone did 10 years ago. Want to play games on the go? Buy a GameBoy and experience the wonder of physical buttons while gaming. Really need directions on the go? You can probably get by with a Garmin or even gasp a paper map. Want to read HN or Reddit on the toilet? Try a book. Yes, I've written mobile apps and my iPhone is sitting on the desk right in front of me, but I hate the fucking thing and its always-connected mobile data. I've done the rest of the above bullet points and don't plan on buying another smart phone when this one hits planned-obsolescence in another year or two. Aggressive data collection depends on user engagement. The easiest way to fight it is to just quit engaging. Usually it'll save you money, too, which is nice.
However, it's not very good. If I search for "banana" I get information about a sex shop rather than about bananas.
Here is some input based on my experience building a similar project at my former company. (We did not quite get to 2B pages, but were close to ~300M):
For creating a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Now, obviously, re-crawling a massive index frequently/regularly is going to consume huge amounts of bandwidth and CPU cycles. Here is how we optimized the resource utilization:
Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.
Corresponding to each indexed URL, also store a sort-of 'crawl-history' (If space is a constraint, don't store each version of the URL, store only the latest one). On each re-crawl, store two data fields: time-stamp and a boolean if the URL content has changed since last crawl. As more re-crawl cycles run, you will be able to calculate/predict the 'update frequency' of each URL. Then, prioritize the re-crawls based on the update frequency score (i.e. re-crawl those with higher scores more frequently and the others less frequently).
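To make that concrete, here's a minimal sketch (in Python, not the actual crawler code) of the per-URL bookkeeping and priority scoring just described; the field names and the priority formula are one plausible choice, not the original implementation.

    from dataclasses import dataclass, field
    import time

    @dataclass
    class UrlRecord:
        url: str
        last_crawled: float = 0.0                    # Unix timestamp of last crawl
        history: list = field(default_factory=list)  # one bool per re-crawl: content changed?

        def change_rate(self) -> float:
            # Fraction of past re-crawls where the content had changed.
            return sum(self.history) / len(self.history) if self.history else 1.0

        def priority(self, now: float) -> float:
            # Higher = crawl sooner: frequently changing, long-uncrawled URLs win.
            return self.change_rate() * (now - self.last_crawled)

    def next_batch(records: list, n: int) -> list:
        # Pick the n URLs most in need of a re-crawl.
        now = time.time()
        return sorted(records, key=lambda r: r.priority(now), reverse=True)[:n]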
If you need any more help/input, let me know and I'll be happy to do what I can.
HTH and all the best moving forward.
I'm not sure I'll use this, but I'll try to... it all depends on how good it is. But I approve of the project so I sent a (very) small bitcoin donation to hopefully help fund it for a few more minutes :)
Depending on who you are (there were 2 bitcoin donations today), you funded either about 18 or 28 hours of operations. :)
- file formats, particularly the postings
- query evaluation strategy
- update strategy
I poked around in the source code a bit, but couldn't find these things.
What do you mean by postings?
The main index is split into 32 shards (there is also an additional news index which is updated about every 5-10 minutes). Each shard is updated and queried separately. The query actually runs 2/3 on a Windows server and 1/3 on a Linux server, the latter in Docker containers. I want to move everything to Linux over time.
Query has two phases. First only a rough - but fast - ranking is done. Then the top results of all shards are combined and completely re-ranked. This is basically a meta search engine hidden within.
First query phase is in src/searchservernew.dpr, and the second phase is in src/cgi/PostProcess.pas.
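For readers who want the shape of it without reading the Pascal, here's a hypothetical outline of that two-phase flow in Python. The functions are toy stand-ins, not what's in searchservernew.dpr or PostProcess.pas.

    from concurrent.futures import ThreadPoolExecutor

    NUM_SHARDS = 32

    def rough_rank(shard_id: int, query: str, k: int = 100) -> list:
        # Phase 1: cheap, approximate per-shard ranking (dummy data here).
        return [{"url": f"http://example.com/{shard_id}/{i}", "rough_score": k - i}
                for i in range(k)]

    def full_rerank(candidates: list, query: str) -> list:
        # Phase 2: expensive re-ranking over the merged candidate pool.
        # Here we just reuse the rough score; a real engine would rescore.
        return sorted(candidates, key=lambda d: d["rough_score"], reverse=True)

    def search(query: str, k: int = 10) -> list:
        # Query every shard in parallel, merge, then re-rank - effectively
        # a small meta search engine over the shards.
        with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
            per_shard = pool.map(lambda s: rough_rank(s, query), range(NUM_SHARDS))
        merged = [doc for hits in per_shard for doc in hits]
        return full_rerank(merged, query)[:k]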
When I search for myself, the top 10 results don't even have my last name ('Kusters') and just show pages that have the word 'Nick'. I suppose you don't use a form of LSA to score the search results? Maybe it's too specific, but AFAIK mainstream search engines seem to give somewhat consistent results here.
Looking at the code (https://github.com/MichaelSchoebel/DeuSu/) I notice that you have ranking modifiers based on the .tld; why not store the reported content language and score based on that? Isn't that more relevant?
The snippets are currently the first 255 characters of the page's text. For snippets to be customized to the search term, I would have to store all the text of the page. And that would require a lot more disk space. Space that I can't afford at the moment.
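For illustration only, here is roughly what a query-biased snippet would need once the full page text is stored: a window of the text centred near the first query-term hit. If no term is found it falls back to the start of the text, like the current 255-character behaviour.

    def snippet(full_text: str, query: str, width: int = 255) -> str:
        lowered = full_text.lower()
        positions = [lowered.find(t) for t in query.lower().split()]
        pos = min((p for p in positions if p != -1), default=0)
        start = max(0, pos - width // 3)
        return full_text[start:start + width]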
Neat project -- Loads of room for improvement, but a great initiative!
 It's my default search engine in Chrome, so I use bang searching in the address bar.
If the metadata size of your index (such as last-modified dates) is small enough to upload to AWS, you can also reduce your re-crawl effort when they have a fresh release.
Crawling 2bn pages could take forever and could generate huge bandwidth bills, so any lessons you learnt, pitfalls you faced, etc. would be a great read.
Block outgoing connections to local IP ranges in your firewall. Otherwise your hosting provider might think you are trying to hack them. Apparently there are a lot of links out there that point to hosts which resolve to private IP ranges.
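Besides the firewall rule, you can also check at the application level before connecting. A hedged sketch (not DeuSu's code), using only Python's standard library:

    import ipaddress
    import socket

    def is_safe_target(hostname: str) -> bool:
        # Refuse hosts that resolve to private, loopback, link-local or reserved ranges.
        try:
            infos = socket.getaddrinfo(hostname, None)
        except socket.gaierror:
            return False
        for info in infos:
            ip = ipaddress.ip_address(info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
                return False
        return True

    print(is_safe_target("localhost"))  # False - resolves to a loopback address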
Another problem with following links is that you are bound to run across some that are malware command & control servers. Had several complaints to my ISP after authorities took over control of one and used the C&C server's domain as a honeypot. My crawler is on a whitelist now.
I had one person who vehemently complained that I was trying to hack him, because the software downloaded his robots.txt. I'm NOT kidding! :)
Make sure your robots.txt parsing is working correctly. I had an undiscovered bug in the software at one point which basically caused it to think everything was allowed. Luckily someone was nice enough to let me know. And he was really nice about it. And he would have had every right to be angry.
A major bottleneck is DNS queries. Run your own DNS server and even cache the hostname/IP pairs yourself. Do not even think about using your ISP's DNS server. If you bombard them with 100+ DNS requests/s, then they WILL be angry. :)
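If it helps anyone testing their own crawler: Python's standard-library parser makes that "everything allowed" failure mode easy to write a regression test against. The user-agent string and URLs here are just placeholders.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A path you know is disallowed should actually come back as disallowed,
    # rather than silently defaulting to "allowed".
    print(rp.can_fetch("MyCrawlerBot", "https://example.com/private/page.html"))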
This might be a useful resource to get started:
(Register and download the IPv4 Address Space data file to use as an initial cache and then append/update as you go.)
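A minimal sketch of the in-crawler caching idea (on top of running your own resolver); the TTL is an arbitrary assumption.

    import socket
    import time

    _dns_cache = {}   # hostname -> (ip, time fetched)
    DNS_TTL = 3600    # seconds

    def resolve(hostname: str) -> str:
        # Return a cached IP if it is still fresh, otherwise look it up once.
        now = time.time()
        cached = _dns_cache.get(hostname)
        if cached and now - cached[1] < DNS_TTL:
            return cached[0]
        ip = socket.gethostbyname(hostname)
        _dns_cache[hostname] = (ip, now)
        return ip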
I'm curious, is it expensive to run a search site like this?
DeuSu, on the other hand, seems to weight words in URLs highly.
If you search for scientology only on Deusu, you might end up wearing a funky hat https://deusu.org/query?q=scientology
For the life of me I can't figure out how you manage to crawl over a billion web pages (even in 2-3 months), index the data and run the server with €300 per month. Especially the crawler part...
> it could be good for a particular use case
> or for learning from.
The author admitted in the github readme that the code quality is rather bad. I also don't see a link to the search index, the only valuable component of this project.
There is also a free API in beta-test right now. Will probably be ready for official release next week.
But if they can, I think a big part of it would be the separation of the crawl index from the UI / prioritisation etc. Different people can work on those two ends of the problem and apply different philosophies.
Search only forums? Reject porn using XYZ method? Great! But they can all use the same database, or pick from a common community of databases.
Have you also published the ranking mechanism? That way people might contribute and help you improve it.
2 are used for crawling, index building and raw-data storage. Quad-core, 32GB RAM, 4TB HDD and a 1Gbit/s internet connection on each of these. They are rented and in a big data center. Crawling uses "only" about 200-250Mbit/s of bandwidth.
2 servers for the webserver and queries. Quad-core, 32GB RAM. One with 2x 512GB SSD, the other with only 1x 512GB SSD. These servers are here at home. I have cable internet with 200Mbit/s down, 20Mbit/s up. Static IPs obviously.
A full crawl currently takes about 3 months.
Wired you a small donation as I think it is important to have alternatives.
And I started this software 20 years ago. Granted, a LOT of the software has changed since then. But I don't see a reason to throw away existing code unless it is in need of so much change that rewriting from scratch would be easier. And even then I might stick to what I know best, and what fits best with other parts of the software.
But all the traffic from here is currently driving the servers to their limit. Queries are already slowing down a bit because of imminent overload. Usually the average query takes about 250ms. Currently the average is at 334ms.
I note that Dr Anna Patterson is back with Google. She wrote this in 2004: http://queue.acm.org/detail.cfm?id=988407
Even https://deusu.org/query?q=2+%2B+2+%3D+5 didn't yield any results. I was under the impression that it'd show a message:
2 + 2 = 5 for very large values of 2.
Yacy is also quite a bit more resilient.
I will say that I don't buy Yacy's "no censoring" statement. If I were a bad actor, I could run Yacy on a computer with false DNS and false certificates, and Yacy could index my fake content with official-looking URLs.